summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2025-05-10Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds3-19/+17
Pull mount fixes from Al Viro: "A couple of races around legalize_mnt vs umount (both fairly old and hard to hit) plus two bugs in move_mount(2) - both around 'move detached subtree in place' logics" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix IS_MNT_PROPAGATING uses do_move_mount(): don't leak MNTNS_PROPAGATING on failures do_umount(): add missing barrier before refcount checks in sync case __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock
2025-05-10Merge tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2-8/+4
Pull smb client fixes from Steve French: - Fix dentry leak which can cause umount crash - Add warning for parse contexts error on compounded operation * tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: client: Avoid race in open_cached_dir with lease breaks smb3 client: warn when parse contexts returns error on compounded operation
2025-05-10fix IS_MNT_PROPAGATING usesAl Viro3-10/+12
propagate_mnt() does not attach anything to mounts created during propagate_mnt() itself. What's more, anything on ->mnt_slave_list of such new mount must also be new, so we don't need to even look there. When move_mount() had been introduced, we've got an additional class of mounts to skip - if we are moving from anon namespace, we do not want to propagate to mounts we are moving (i.e. all mounts in that anon namespace). Unfortunately, the part about "everything on their ->mnt_slave_list will also be ignorable" is not true - if we have propagation graph A -> B -> C and do OPEN_TREE_CLONE open_tree() of B, we get A -> [B <-> B'] -> C as propagation graph, where B' is a clone of B in our detached tree. Making B private will result in A -> B' -> C C still gets propagation from A, as it would after making B private if we hadn't done that open_tree(), but now the propagation goes through B'. Trying to move_mount() our detached tree on subdirectory in A should have * moved B' on that subdirectory in A * skipped the corresponding subdirectory in B' itself * copied B' on the corresponding subdirectory in C. As it is, the logics in propagation_next() and friends ends up skipping propagation into C, since it doesn't consider anything downstream of B'. IOW, walking the propagation graph should only skip the ->mnt_slave_list of new mounts; the only places where the check for "in that one anon namespace" are applicable are propagate_one() (where we should treat that as the same kind of thing as "mountpoint we are looking at is not visible in the mount we are looking at") and propagation_would_overmount(). The latter is better dealt with in the caller (can_move_mount_beneath()); on the first call of propagation_would_overmount() the test is always false, on the second it is always true in "move from anon namespace" case and always false in "move within our namespace" one, so it's easier to just use check_mnt() before bothering with the second call and be done with that. Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-10do_move_mount(): don't leak MNTNS_PROPAGATING on failuresAl Viro1-3/+2
as it is, a failed move_mount(2) from anon namespace breaks all further propagation into that namespace, including normal mounts in non-anon namespaces that would otherwise propagate there. Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-10do_umount(): add missing barrier before refcount checks in sync caseAl Viro1-1/+2
do_umount() analogue of the race fixed in 119e1ef80ecf "fix __legitimize_mnt()/mntput() race". Here we want to make sure that if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will notice their refcount increment. Harder to hit than mntput_no_expire() one, fortunately, and consequences are milder (sync umount acting like umount -l on a rare race with RCU pathwalk hitting at just the wrong time instead of use-after-free galore mntput_no_expire() counterpart used to be hit). Still a bug... Fixes: 48a066e72d97 ("RCU'd vfsmounts") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-10__legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lockAl Viro1-5/+1
... or we risk stealing final mntput from sync umount - raising mnt_count after umount(2) has verified that victim is not busy, but before it has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see that it's safe to quietly undo mnt_count increment and leaves dropping the reference to caller, where it'll be a full-blown mntput(). Check under mount_lock is needed; leaving the current one done before taking that makes no sense - it's nowhere near common enough to bother with. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09f2fs: fix freezing filesystem during resizeChristian Brauner1-3/+3
Using FREEZE_HOLDER_USERSPACE has two consequences: (1) If userspace freezes the filesystem after mnt_drop_write_file() but before freeze_super() was called filesystem resizing will fail because the freeze isn't marked as nestable. (2) If the kernel has successfully frozen the filesystem via FREEZE_HOLDER_USERSPACE userspace can simply undo it by using the FITHAW ioctl. Fix both issues by using FREEZE_HOLDER_KERNEL. It will nest with FREEZE_HOLDER_USERSPACE and cannot be undone by userspace. And it is the correct thing to do because the kernel temporarily freezes the filesystem. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09kernfs: add warning about implementing freeze/thawChristian Brauner1-0/+15
Sysfs is built on top of kernfs and sysfs provides the power management infrastructure to support suspend/hibernate by writing to various files in /sys/power/. As filesystems may be automatically frozen during suspend/hibernate implementing freeze/thaw support for kernfs generically will cause deadlocks as the suspending/hibernation initiating task will hold a VFS lock that it will then wait upon to be released. If freeze/thaw for kernfs is needed talk to the VFS. Link: https://lore.kernel.org/r/20250402-work-freeze-v2-4-6719a97b52ac@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09efivarfs: support freeze/thawChristian Brauner2-145/+51
Allow efivarfs to partake to resync variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. This works without any big issues and congrats afaict efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. We really really don't need to care. Link: https://lore.kernel.org/r/20250331-work-freeze-v1-2-6dfbe8253b9f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09libfs: export find_next_child()Christian Brauner2-1/+3
Export find_next_child() so it can be used by efivarfs. Keep it internal for now. There's no reason to advertise this kernel-wide. Link: https://lore.kernel.org/r/20250331-work-freeze-v1-1-6dfbe8253b9f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09super: add filesystem freezing helpers for suspend and hibernateChristian Brauner7-36/+185
Allow the power subsystem to support filesystem freeze for suspend and hibernate. For some kernel subsystems it is paramount that they are guaranteed that they are the owner of the freeze to avoid any risk of deadlocks. This is the case for the power subsystem. Enable it to recognize whether it did actually freeze the filesystem. If userspace has 10 filesystems and suspend/hibernate manges to freeze 5 and then fails on the 6th for whatever odd reason (current or future) then power needs to undo the freeze of the first 5 filesystems. It can't just walk the list again because while it's unlikely that a new filesystem got added in the meantime it still cannot tell which filesystems the power subsystem actually managed to get a freeze reference count on that needs to be dropped during thaw. There's various ways out of this ugliness. For example, record the filesystems the power subsystem managed to freeze on a temporary list in the callbacks and then walk that list backwards during thaw to undo the freezing or make sure that the power subsystem just actually exclusively freezes things it can freeze and marking such filesystems as being owned by power for the duration of the suspend or resume cycle. I opted for the latter as that seemed the clean thing to do even if it means more code changes. If hibernation races with filesystem freezing (e.g. DM reconfiguration), then hibernation need not freeze a filesystem because it's already frozen but userspace may thaw the filesystem before hibernation actually happens. If the race happens the other way around, DM reconfiguration may unexpectedly fail with EBUSY. So allow FREEZE_EXCL to nest with other holders. An exclusive freezer cannot be undone by any of the other concurrent freezers. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-6-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09fs: use writeback_iter directly in mpage_writepagesChristoph Hellwig1-6/+7
Stop using write_cache_pages and use writeback_iter directly. This removes an indirect call per written folio and makes the code easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250507062124.3933305-1-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09iomap: rework iomap_write_begin() to return folio offset and lengthBrian Foster1-14/+19
iomap_write_begin() returns a folio based on current pos and remaining length in the iter, and each caller then trims the pos/length to the given folio. Clean this up a bit and let iomap_write_begin() return the trimmed range along with the folio. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-7-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09iomap: push non-large folio check into get folio pathBrian Foster1-3/+3
The len param to __iomap_get_folio() is primarily a folio allocation hint. iomap_write_begin() already trims its local len variable based on the provided folio, so move the large folio support check closer to folio lookup. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-6-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09iomap: helper to trim pos/bytes to within folioBrian Foster1-15/+20
Several buffered write based iteration callbacks duplicate logic to trim the current pos and length to within the current folio. Factor this into a helper to make it easier to relocate closer to folio lookup. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-5-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09iomap: drop pos param from __iomap_[get|put]_folio()Brian Foster1-8/+9
Both helpers take the iter and pos as parameters. All callers effectively pass iter->pos, so drop the unnecessary pos parameter. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-4-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09iomap: drop unnecessary pos param from iomap_write_[begin|end]Brian Foster1-10/+12
iomap_write_begin() and iomap_write_end() both take the iter and iter->pos as parameters. Drop the unnecessary pos parameter and sample iter->pos within each function. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-3-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09iomap: resample iter->pos after iomap_write_begin() callsBrian Foster1-8/+11
In preparation for removing the pos parameter, push the local pos assignment down after calls to iomap_write_begin(). Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-2-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09fs: Remove redundant errseq_set call in mark_buffer_write_io_error.Jeremy Bongio1-3/+1
mark_buffer_write_io_error sets sb->s_wb_err to -EIO twice. Once in mapping_set_error and once in errseq_set. Only mapping_set_error checks if bh->b_assoc_map->host is NULL. Discovered during null pointer dereference during writeback to a failing device: [<ffffffff9a416dc8>] ? mark_buffer_write_io_error+0x98/0xc0 [<ffffffff9a416dbe>] ? mark_buffer_write_io_error+0x8e/0xc0 [<ffffffff9ad4bda0>] end_buffer_async_write+0x90/0xd0 [<ffffffff9ad4e3eb>] end_bio_bh_io_sync+0x2b/0x40 [<ffffffff9adbafe6>] blk_update_request+0x1b6/0x480 [<ffffffff9adbb3d8>] blk_mq_end_request+0x18/0x30 [<ffffffff9adbc6aa>] blk_mq_dispatch_rq_list+0x4da/0x8e0 [<ffffffff9adc0a68>] __blk_mq_sched_dispatch_requests+0x218/0x6a0 [<ffffffff9adc07fa>] blk_mq_sched_dispatch_requests+0x3a/0x80 [<ffffffff9adbbb98>] blk_mq_run_hw_queue+0x108/0x330 [<ffffffff9adbcf58>] blk_mq_flush_plug_list+0x178/0x5f0 [<ffffffff9adb6741>] __blk_flush_plug+0x41/0x120 [<ffffffff9adb6852>] blk_finish_plug+0x22/0x40 [<ffffffff9ad47cb0>] wb_writeback+0x150/0x280 [<ffffffff9ac5343f>] ? set_worker_desc+0x9f/0xc0 [<ffffffff9ad4676e>] wb_workfn+0x24e/0x4a0 Fixes: 485e9605c0573 ("fs/buffer.c: record blockdev write errors in super_block that it backs") Signed-off-by: Jeremy Bongio <jbongio@google.com> Link: https://lore.kernel.org/20250507123010.1228243-1-jbongio@google.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09configfs: Correct error value returned by API config_item_set_name()Zijun Hu1-1/+1
kvasprintf() failure is often caused by memory allocation which has error code -ENOMEM, but config_item_set_name() returns -EFAULT for the failure. Fix by returning -ENOMEM instead of -EFAULT for the failure. Reviewed-by: Joel Becker <jlbec@evilplan.org> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://lore.kernel.org/r/20250507-fix_configfs-v3-3-fe2d96de8dc4@quicinc.com Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
2025-05-09configfs: Do not override creating attribute file failure in populate_attrs()Zijun Hu1-1/+1
populate_attrs() may override failure for creating attribute files by success for creating subsequent bin attribute files, and have wrong return value. Fix by creating bin attribute files under successfully creating attribute files. Fixes: 03607ace807b ("configfs: implement binary attributes") Cc: stable@vger.kernel.org Reviewed-by: Joel Becker <jlbec@evilplan.org> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://lore.kernel.org/r/20250507-fix_configfs-v3-2-fe2d96de8dc4@quicinc.com Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
2025-05-09configfs: Delete semicolon from macro type_print() definitionZijun Hu1-1/+1
Macro type_print() definition ends with semicolon, so will cause the subsequent macro invocations end with two semicolons. Fix by deleting the semicolon from the macro definition. Reviewed-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://lore.kernel.org/r/20250507-fix_configfs-v3-1-fe2d96de8dc4@quicinc.com Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
2025-05-09Merge tag 'atomic-writes-6.16_2025-05-07' of ↵Carlos Maiolino33-123/+1329
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into atomic_writes large atomic writes for xfs [v12.1] Currently atomic write support for xfs is limited to writing a single block as we have no way to guarantee alignment and that the write covers a single extent. This series introduces a method to issue atomic writes via a software-based method. The software-based method is used as a fallback for when attempting to issue an atomic write over misaligned or multiple extents. For xfs, this support is based on reflink CoW support. The basic idea of this CoW method is to alloc a range in the CoW fork, write the data, and atomically update the mapping. Initial mysql performance testing has shown this method to perform ok. However, there we are only using 16K atomic writes (and 4K block size), so typically - and thankfully - this software fallback method won't be used often. For other FSes which want large atomics writes and don't support CoW, I think that they can follow the example in [0]. Catherine is currently working on further xfstests for this feature, which we hope to share soon. About 17/17, maybe it can be omitted as there is no strong demand to have it included. Based on bfecc4091e07 (xfs/next-rc, xfs/for-next) xfs: allow ro mounts if rtdev or logdev are read-only [0] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/ Differences to v12: - add more review tags Differences to v11: - split "xfs: ignore ..." patch - inline sync_blockdev() in xfs_alloc_buftarg() (Christoph) - fix xfs_calc_rtgroup_awu_max() for 0 block count (Darrick) - Add RB tag from Christoph (thanks!) Differences to v10: - add "xfs: only call xfs_setsize_buftarg once ..." by Darrick - symbol renames in "xfs: ignore HW which cannot..." by Darrick Differences to v9: - rework "ignore HW which cannot .." patch by Darrick - Ensure power-of-2 max always for unit min/max when no HW support With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
2025-05-09ext4: hold s_fc_lock while during fast commitHarshad Shirwadkar1-31/+13
Leaving s_fc_lock in between during commit in ext4_fc_perform_commit() function leaves room for subtle concurrency bugs where ext4_fc_del() may delete an inode from the fast commit list, leaving list in an inconsistent state. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250508175908.1004880-10-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-09ext4: convert s_fc_lock to mutex typeHarshad Shirwadkar3-32/+32
This allows us to hold s_fc_lock during kmem_cache_* functions, which is needed in the following patch. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250508175908.1004880-9-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-09ext4: temporarily elevate commit thread priorityHarshad Shirwadkar3-4/+18
Unlike JBD2 based full commits, there is no dedicated journal thread for fast commits. Thus to reduce scheduling delays between IO submission and completion, temporarily elevate the committer thread's priority to match the configured priority of the JBD2 journal thread. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250508175908.1004880-8-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-09ext4: update code documentationHarshad Shirwadkar1-17/+31
This patch updates code documentation to reflect the commit path changes made in this series. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> code docs Link: https://patch.msgid.link/20250508175908.1004880-7-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-09ext4: drop i_fc_updates from inode fc infoHarshad Shirwadkar2-73/+0
The new logic introduced in this series does not require tracking number of active handles open on an inode. So, drop it. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250508175908.1004880-6-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-09ext4: rework fast commit commit pathHarshad Shirwadkar3-76/+126
This patch reworks fast commit's commit path to remove locking the journal for the entire duration of a fast commit. Instead, we only lock the journal while marking all the eligible inodes as "committing". This allows handles to make progress in parallel with the fast commit. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://patch.msgid.link/20250508175908.1004880-5-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-09ext4: mark inode dirty before grabbing i_data_sem in ext4_setattrHarshad Shirwadkar1-3/+4
Mark inode dirty first and then grab i_data_sem in ext4_setattr(). Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://patch.msgid.link/20250508175908.1004880-4-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-09ext4: for committing inode, make ext4_fc_track_inode waitHarshad Shirwadkar3-0/+41
If the inode that's being requested to track using ext4_fc_track_inode is being committed, then wait until the inode finishes the commit. Also, add calls to ext4_fc_track_inode at the right places. With this patch, now calling ext4_reserve_inode_write() results in inode being tracked for next fast commit. This ensures that by the time ext4_reserve_inode_write() returns, it is ready to be modified and won't be committed until the corresponding handle is open. A subtle lock ordering requirement with i_data_sem (which is documented in the code) requires that ext4_fc_track_inode() be called before grabbing i_data_sem. So, this patch also adds explicit ext4_fc_track_inode() calls in places where i_data_sem grabbed. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://patch.msgid.link/20250508175908.1004880-3-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-09ext4: convert i_fc_lock to spinlockHarshad Shirwadkar3-13/+15
Convert ext4_inode_info->i_fc_lock to spinlock to avoid sleeping in invalid contexts. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://patch.msgid.link/20250508175908.1004880-2-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-09Merge tag 'bcachefs-2025-05-08' of git://evilpiepirate.org/bcachefsLinus Torvalds12-19/+55
Pull bcachefs fixes from Kent Overstreet: - Some fixes to help with filesystem analysis: ensure superblock error count gets written if we go ERO, don't discard the journal aggressively (so it's available for list_journal -a) - Fix lost wakeup on arm causing us to get stuck when reading btree nodes - Fix fsck failing to exit on ctrl-c - An additional fix for filesystems with misaligned bucket sizes: we now ensure that allocations are properly aligned - Setting background target but not promote target will now leave that data cached on the foreground target, as it used to - Revert a change to when we allocate the VFS superblock, this was done for implementing blk_holder_ops but ended up not being needed, and allocating a superblock and not setting SB_BORN while we do recovery caused sync() calls and other things to hang - Assorted fixes for harmless error messages that caused concern to users * tag 'bcachefs-2025-05-08' of git://evilpiepirate.org/bcachefs: bcachefs: Don't aggressively discard the journal bcachefs: Ensure superblock gets written when we go ERO bcachefs: Filter out harmless EROFS error messages bcachefs: journal_shutdown is EROFS, not EIO bcachefs: Call bch2_fs_start before getting vfs superblock bcachefs: fix hung task timeout in journal read bcachefs: Add missing barriers before wake_up_bit() bcachefs: Ensure proper write alignment bcachefs: Improve want_cached_ptr() bcachefs: thread_with_stdio: fix spinning instead of exiting
2025-05-08treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack()Ingo Molnar1-1/+1
Move this API to the canonical timer_*() namespace. Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250507175338.672442-10-mingo@kernel.org
2025-05-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski41-290/+610
Cross-merge networking fixes after downstream PR (net-6.15-rc6). No conflicts. Adjacent changes: net/core/dev.c: 08e9f2d584c4 ("net: Lock netdevices during dev_shutdown") a82dc19db136 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-08f2fs: return bool from __write_node_folioChristoph Hellwig1-16/+13
__write_node_folio can only return 0 or AOP_WRITEPAGE_ACTIVATE. As part of phasing out AOP_WRITEPAGE_ACTIVATE, switch to a bool return instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-08f2fs: simplify return value handling in f2fs_fsync_node_pagesChristoph Hellwig1-12/+11
Always assign ret where the error happens, and jump to out instead of multiple loop exit conditions to prepare for changes in the __write_node_folio calling convention. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-08f2fs: always unlock the page in f2fs_write_single_data_pageChristoph Hellwig2-7/+4
Consolidate the code to unlock the page in f2fs_write_single_data_page instead of leaving it to the callers for the AOP_WRITEPAGE_ACTIVATE case. Replace AOP_WRITEPAGE_ACTIVATE with a positive return of 1 as this case now doesn't match the historic ->writepage special return code that is on it's way out now that ->writepage has been removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-08f2fs: remove wbc->for_reclaim handlingChristoph Hellwig4-40/+5
Since commits 7ff0104a8052 ("f2fs: Remove f2fs_write_node_page()") and 3b47398d9861 ("f2fs: Remove f2fs_write_meta_page()'), f2fs can't be called from reclaim context any more. Remove all code keyed of the wbc->for_reclaim flag, which is now only set for writing out swap or shmem pages inside the swap code, but never passed to file systems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-08Merge tag 'v6.15-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds4-9/+43
Pull smb server fixes from Steve French: - Fix UAF closing file table (e.g. in tree disconnect) - Fix potential out of bounds write - Fix potential memory leak parsing lease state in open - Fix oops in rename with empty target * tag 'v6.15-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: Fix UAF in __close_file_table_ids ksmbd: prevent out-of-bounds stream writes by validating *pos ksmbd: fix memory leak in parse_lease_state() ksmbd: prevent rename with empty string
2025-05-08f2fs: return bool from __f2fs_write_meta_folioChristoph Hellwig1-11/+11
__f2fs_write_meta_folio can only return 0 or AOP_WRITEPAGE_ACTIVATE. As part of phasing out AOP_WRITEPAGE_ACTIVATE, switch to a bool return instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-08f2fs: fix to return correct error number in f2fs_sync_node_pages()Chao Yu1-2/+6
If __write_node_folio() failed, it will return AOP_WRITEPAGE_ACTIVATE, the incorrect return value may be passed to userspace in below path, fix it. - sync_filesystem - sync_fs - f2fs_issue_checkpoint - block_operations - f2fs_sync_node_pages - __write_node_folio : return AOP_WRITEPAGE_ACTIVATE Cc: stable@vger.kernel.org Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-08nilfs2: fix deadlock warnings caused by lock dependency in init_nilfs()Ryusuke Konishi1-3/+0
After commit c0e473a0d226 ("block: fix race between set_blocksize and read paths") was merged, set_blocksize() called by sb_set_blocksize() now locks the inode of the backing device file. As a result of this change, syzbot started reporting deadlock warnings due to a circular dependency involving the semaphore "ns_sem" of the nilfs object, the inode lock of the backing device file, and the locks that this inode lock is transitively dependent on. This is caused by a new lock dependency added by the above change, since init_nilfs() calls sb_set_blocksize() in the lock section of "ns_sem". However, these warnings are false positives because init_nilfs() is called in the early stage of the mount operation and the filesystem has not yet started. The reason why "ns_sem" is locked in init_nilfs() was to avoid a race condition in nilfs_fill_super() caused by sharing a nilfs object among multiple filesystem instances (super block structures) in the early implementation. However, nilfs objects and super block structures have long ago become one-to-one, and there is no longer any need to use the semaphore there. So, fix this issue by removing the use of the semaphore "ns_sem" in init_nilfs(). Link: https://lkml.kernel.org/r/20250503053327.12294-1-konishi.ryusuke@gmail.com Fixes: c0e473a0d226 ("block: fix race between set_blocksize and read paths") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+00f7f5b884b117ee6773@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=00f7f5b884b117ee6773 Tested-by: syzbot+00f7f5b884b117ee6773@syzkaller.appspotmail.com Reported-by: syzbot+f30591e72bfc24d4715b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f30591e72bfc24d4715b Tested-by: syzbot+f30591e72bfc24d4715b@syzkaller.appspotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-08ocfs2: stop quota recovery before disabling quotasJan Kara5-9/+30
Currently quota recovery is synchronized with unmount using sb->s_umount semaphore. That is however prone to deadlocks because flush_workqueue(osb->ocfs2_wq) called from umount code can wait for quota recovery to complete while ocfs2_finish_quota_recovery() waits for sb->s_umount semaphore. Grabbing of sb->s_umount semaphore in ocfs2_finish_quota_recovery() is only needed to protect that function from disabling of quotas from ocfs2_dismount_volume(). Handle this problem by disabling quota recovery early during unmount in ocfs2_dismount_volume() instead so that we can drop acquisition of sb->s_umount from ocfs2_finish_quota_recovery(). Link: https://lkml.kernel.org/r/20250424134515.18933-6-jack@suse.cz Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection") Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Shichangkuo <shi.changkuo@h3c.com> Reported-by: Murad Masimov <m.masimov@mt-integration.ru> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Tested-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-08ocfs2: implement handshaking with ocfs2 recovery threadJan Kara2-17/+39
We will need ocfs2 recovery thread to acknowledge transitions of recovery_state when disabling particular types of recovery. This is similar to what currently happens when disabling recovery completely, just more general. Implement the handshake and use it for exit from recovery. Link: https://lkml.kernel.org/r/20250424134515.18933-5-jack@suse.cz Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Tested-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Murad Masimov <m.masimov@mt-integration.ru> Cc: Shichangkuo <shi.changkuo@h3c.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-08ocfs2: switch osb->disable_recovery to enumJan Kara2-7/+14
Patch series "ocfs2: Fix deadlocks in quota recovery", v3. This implements another approach to fixing quota recovery deadlocks. We avoid grabbing sb->s_umount semaphore from ocfs2_finish_quota_recovery() and instead stop quota recovery early in ocfs2_dismount_volume(). This patch (of 3): We will need more recovery states than just pure enable / disable to fix deadlocks with quota recovery. Switch osb->disable_recovery to enum. Link: https://lkml.kernel.org/r/20250424134301.1392-1-jack@suse.cz Link: https://lkml.kernel.org/r/20250424134515.18933-4-jack@suse.cz Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Tested-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Murad Masimov <m.masimov@mt-integration.ru> Cc: Shichangkuo <shi.changkuo@h3c.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-08mm/userfaultfd: fix uninitialized output field for -EAGAIN racePeter Xu1-6/+22
While discussing some userfaultfd relevant issues recently, Andrea noticed a potential ABI breakage with -EAGAIN on almost all userfaultfd ioctl()s. Quote from Andrea, explaining how -EAGAIN was processed, and how this should fix it (taking example of UFFDIO_COPY ioctl): The "mmap_changing" and "stale pmd" conditions are already reported as -EAGAIN written in the copy field, this does not change it. This change removes the subnormal case that left copy.copy uninitialized and required apps to explicitly set the copy field to get deterministic behavior (which is a requirement contrary to the documentation in both the manpage and source code). In turn there's no alteration to backwards compatibility as result of this change because userland will find the copy field consistently set to -EAGAIN, and not anymore sometime -EAGAIN and sometime uninitialized. Even then the change only can make a difference to non cooperative users of userfaultfd, so when UFFD_FEATURE_EVENT_* is enabled, which is not true for the vast majority of apps using userfaultfd or this unintended uninitialized field may have been noticed sooner. Meanwhile, since this bug existed for years, it also almost affects all ioctl()s that was introduced later. Besides UFFDIO_ZEROPAGE, these also get affected in the same way: - UFFDIO_CONTINUE - UFFDIO_POISON - UFFDIO_MOVE This patch should have fixed all of them. Link: https://lkml.kernel.org/r/20250424215729.194656-2-peterx@redhat.com Fixes: df2cc96e7701 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") Fixes: fc71884a5f59 ("mm: userfaultfd: add new UFFDIO_POISON ioctl") Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-08ocfs2: fix panic in failed foilio allocationMark Tinguely1-0/+1
commit 7e119cff9d0a ("ocfs2: convert w_pages to w_folios") and commit 9a5e08652dc4b ("ocfs2: use an array of folios instead of an array of pages") save -ENOMEM in the folio array upon allocation failure and call the folio array free code. The folio array free code expects either valid folio pointers or NULL. Finding the -ENOMEM will result in a panic. Fix by NULLing the error folio entry. Link: https://lkml.kernel.org/r/c879a52b-835c-4fa0-902b-8b2e9196dcbd@oracle.com Fixes: 7e119cff9d0a ("ocfs2: convert w_pages to w_folios") Fixes: 9a5e08652dc4b ("ocfs2: use an array of folios instead of an array of pages") Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-08ocfs2: fix the issue with discontiguous allocation in the global_bitmapHeming Zhao2-6/+33
commit 4eb7b93e0310 ("ocfs2: improve write IO performance when fragmentation is high") introduced another regression. The following ocfs2-test case can trigger this issue: > discontig_runner.sh => activate_discontig_bg.sh => resv_unwritten: > ${RESV_UNWRITTEN_BIN} -f ${WORK_PLACE}/large_testfile -s 0 -l \ > $((${FILE_MAJOR_SIZE_M}*1024*1024)) In my env, test disk size (by "fdisk -l <dev>"): > 53687091200 bytes, 104857600 sectors. Above command is: > /usr/local/ocfs2-test/bin/resv_unwritten -f \ > /mnt/ocfs2/ocfs2-activate-discontig-bg-dir/large_testfile -s 0 -l \ > 53187969024 Error log: > [*] Reserve 50724M space for a LARGE file, reserve 200M space for future test. > ioctl error 28: "No space left on device" > resv allocation failed Unknown error -1 > reserve unwritten region from 0 to 53187969024. Call flow: __ocfs2_change_file_space //by ioctl OCFS2_IOC_RESVSP64 ocfs2_allocate_unwritten_extents //start:0 len:53187969024 while() + ocfs2_get_clusters //cpos:0, alloc_size:1623168 (cluster number) + ocfs2_extend_allocation + ocfs2_lock_allocators | + choose OCFS2_AC_USE_MAIN & ocfs2_cluster_group_search | + ocfs2_add_inode_data ocfs2_add_clusters_in_btree __ocfs2_claim_clusters ocfs2_claim_suballoc_bits + During the allocation of the final part of the large file (after ~47GB), no chain had the required contiguous bits_wanted. Consequently, the allocation failed. How to fix: When OCFS2 is encountering fragmented allocation, the file system should stop attempting bits_wanted contiguous allocation and instead provide the largest available contiguous free bits from the cluster groups. Link: https://lkml.kernel.org/r/20250414060125.19938-2-heming.zhao@suse.com Fixes: 4eb7b93e0310 ("ocfs2: improve write IO performance when fragmentation is high") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reported-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-08xfs: allow sysadmins to specify a maximum atomic write limit at mount timeDarrick J. Wong6-2/+248
Introduce a mount option to allow sysadmins to specify the maximum size of an atomic write. If the filesystem can work with the supplied value, that becomes the new guaranteed maximum. The value mustn't be too big for the existing filesystem geometry (max write size, max AG/rtgroup size). We dynamically recompute the tr_atomic_write transaction reservation based on the given block size, check that the current log size isn't less than the new minimum log size constraints, and set a new maximum. The actual software atomic write max is still computed based off of tr_atomic_ioend the same way it has for the past few commits. Note also that xfs_calc_atomic_write_log_geometry is non-static because mkfs will need that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: John Garry <john.g.garry@oracle.com>