kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
2025-07-29	Merge tag 'vfs-6.17-rc1.iomap' of ↵	Linus Torvalds	4	-24/+36
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs iomap updates from Christian Brauner: - Refactor the iomap writeback code and split the generic and ioend/bio based writeback code. There are two methods that define the split between the generic writeback code, and the implemementation of it, and all knowledge of ioends and bios now sits below that layer. - Add fuse iomap support for buffered writes and dirty folio writeback. This is needed so that granular uptodate and dirty tracking can be used in fuse when large folios are enabled. This has two big advantages. For writes, instead of the entire folio needing to be read into the page cache, only the relevant portions need to be. For writeback, only the dirty portions need to be written back instead of the entire folio. * tag 'vfs-6.17-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fuse: refactor writeback to use iomap_writepage_ctx inode fuse: hook into iomap for invalidating and checking partial uptodateness fuse: use iomap for folio laundering fuse: use iomap for writeback fuse: use iomap for buffered writes iomap: build the writeback code without CONFIG_BLOCK iomap: add read_folio_range() handler for buffered writes iomap: improve argument passing to iomap_read_folio_sync iomap: replace iomap_folio_ops with iomap_write_ops iomap: export iomap_writeback_folio iomap: move folio_unlock out of iomap_writeback_folio iomap: rename iomap_writepage_map to iomap_writeback_folio iomap: move all ioend handling to ioend.c iomap: add public helpers for uptodate state manipulation iomap: hide ioends from the generic writeback code iomap: refactor the writeback interface iomap: cleanup the pending writeback tracking in iomap_writepage_map_blocks iomap: pass more arguments using the iomap writeback context iomap: header diet
2025-07-29	Merge tag 'vfs-6.17-rc1.fileattr' of ↵	Linus Torvalds	2	-4/+4
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fileattr updates from Christian Brauner: "This introduces the new file_getattr() and file_setattr() system calls after lengthy discussions. Both system calls serve as successors and extensible companions to the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have started to show their age in addition to being named in a way that makes it easy to conflate them with extended attribute related operations. These syscalls allow userspace to set filesystem inode attributes on special files. One of the usage examples is the XFS quota projects. XFS has project quotas which could be attached to a directory. All new inodes in these directories inherit project ID set on parent directory. The project is created from userspace by opening and calling FS_IOC_FSSETXATTR on each inode. This is not possible for special files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left with empty project ID. Those inodes then are not shown in the quota accounting but still exist in the directory. This is not critical but in the case when special files are created in the directory with already existing project quota, these new inodes inherit extended attributes. This creates a mix of special files with and without attributes. Moreover, special files with attributes don't have a possibility to become clear or change the attributes. This, in turn, prevents userspace from re-creating quota project on these existing files. In addition, these new system calls allow the implementation of additional attributes that we couldn't or didn't want to fit into the legacy ioctls anymore" * tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: tighten a sanity check in file_attr_to_fileattr() tree-wide: s/struct fileattr/struct file_kattr/g fs: introduce file_getattr and file_setattr syscalls fs: prepare for extending file_get/setattr() fs: make vfs_fileattr_[get\|set] return -EOPNOTSUPP selinux: implement inode_file_[g\|s]etattr hooks lsm: introduce new hooks for setting/getting inode fsxattr fs: split fileattr related helpers into separate file
2025-07-28	Merge tag 'pull-dcache' of ↵	Linus Torvalds	1	-1/+1
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dentry d_flags updates from Al Viro: "The current exclusion rules for dentry->d_flags stores are rather unpleasant. The basic rules are simple: - stores to dentry->d_flags are OK under dentry->d_lock - stores to dentry->d_flags are OK in the dentry constructor, before becomes potentially visible to other threads Unfortunately, there's a couple of exceptions to that, and that's where the headache comes from. The main PITA comes from d_set_d_op(); that primitive sets ->d_op of dentry and adjusts the flags that correspond to presence of individual methods. It's very easy to misuse; existing uses _are_ safe, but proof of correctness is brittle. Use in __d_alloc() is safe (we are within a constructor), but we might as well precalculate the initial value of 'd_flags' when we set the default ->d_op for given superblock and set 'd_flags' directly instead of messing with that helper. The reasons why other uses are safe are bloody convoluted; I'm not going to reproduce it here. See [1] for gory details, if you care. The critical part is using d_set_d_op() only just prior to d_splice_alias(), which makes a combination of d_splice_alias() with setting ->d_op, etc a natural replacement primitive. Better yet, if we go that way, it's easy to take setting ->d_op and modifying 'd_flags' under ->d_lock, which eliminates the headache as far as 'd_flags' exclusion rules are concerned. Other exceptions are minor and easy to deal with. What this series does: - d_set_d_op() is no longer available; instead a new primitive (d_splice_alias_ops()) is provided, equivalent to combination of d_set_d_op() and d_splice_alias(). - new field of struct super_block - 's_d_flags'. This sets the default value of 'd_flags' to be used when allocating dentries on this filesystem. - new primitive for setting 's_d_op': set_default_d_op(). This replaces stores to 's_d_op' at mount time. All in-tree filesystems converted; out-of-tree ones will get caught by the compiler ('s_d_op' is renamed, so stores to it will be caught). 's_d_flags' is set by the same primitive to match the 's_d_op'. - a lot of filesystems had sb->s_d_op->d_delete equal to always_delete_dentry; that is equivalent to setting DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well set that bit in 's_d_flags' and drop 'd_delete()' from dentry_operations. In quite a few cases that results in empty dentry_operations, which means that we can get rid of those. - kill simple_dentry_operations - not needed anymore - massage d_alloc_parallel() to get rid of the other exception wrt 'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we allocate the new dentry; no need to delay that until we commit to using the sucker. As the result, 'd_flags' stores are all either under ->d_lock or done before the dentry becomes visible in any shared data structures" Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1] * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits) configfs: use DCACHE_DONTCACHE debugfs: use DCACHE_DONTCACHE efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry() 9p: don't bother with always_delete_dentry ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE kill simple_dentry_operations devpts, sunrpc, hostfs: don't bother with ->d_op shmem: no dentry retention past the refcount reaching zero d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier make d_set_d_op() static simple_lookup(): just set DCACHE_DONTCACHE tracefs: Add d_delete to remove negative dentries set_default_d_op(): calculate the matching value for ->d_flags correct the set of flags forbidden at d_set_d_op() time split d_flags calculation out of d_set_d_op() new helper: set_default_d_op() fuse: no need for special dentry_operations for root dentry switch procfs from d_set_d_op() to d_splice_alias_ops() new helper: d_splice_alias_ops() procfs: kill ->proc_dops ...
2025-07-17	gfs2: No more self recovery	Andreas Gruenbacher	1	-20/+11
	When a node withdraws and it turns out that it is the only node that has the filesystem mounted, gfs2 currently tries to replay the local journal to bring the filesystem back into a consistent state. Not only is that a very bad idea, it has also never worked because gfs2_recover_func() will refuse to do anything during a withdraw. However, before even getting to this point, gfs2_recover_func() dereferences sdp->sd_jdesc->jd_inode. This was a use-after-free before commit 04133b607a78 ("gfs2: Prevent double iput for journal on error") and is a NULL pointer dereference since then. Simply get rid of self recovery to fix that. Fixes: 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish") Reported-by: Chunjie Zhu <chunjie.zhu@cloud.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-16	gfs2: Validate i_depth for exhash directories	Andrew Price	2	-4/+8
	A fuzzer test introduced corruption that ends up with a depth of 0 in dir_e_read(), causing an undefined shift by 32 at: index = hash >> (32 - dip->i_depth); As calculated in an open-coded way in dir_make_exhash(), the minimum depth for an exhash directory is ilog2(sdp->sd_hash_ptrs) and 0 is invalid as sdp->sd_hash_ptrs is fixed as sdp->bsize / 16 at mount time. So we can avoid the undefined behaviour by checking for depth values lower than the minimum in gfs2_dinode_in(). Values greater than the maximum are already being checked for there. Also switch the calculation in dir_make_exhash() to use ilog2() to clarify how the depth is calculated. Tested with the syzkaller repro.c and xfstests '-g quick'. Reported-by: syzbot+4708579bb230a0582a57@syzkaller.appspotmail.com Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-15	gfs2: Set .migrate_folio in gfs2_{rgrp,meta}_aops	Andrew Price	1	-0/+2
	Clears up the warning added in 7ee3647243e5 ("migrate: Remove call to ->writepage") that occurs in various xfstests, causing "something found in dmesg" failures. [ 341.136573] gfs2_meta_aops does not implement migrate_folio [ 341.136953] WARNING: CPU: 1 PID: 36 at mm/migrate.c:944 move_to_new_folio+0x2f8/0x300 Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-15	gfs2: a minor finish_xmote cleanup	Andreas Gruenbacher	1	-2/+1
	As a minor clean-up to commit 1fc05c8d8426 ("gfs2: cancel timed-out glock requests"), when a demote request is in progress in finish_xmote(), there is no point in waking up the glock holder at the head of the queue because the reply from dlm cannot be on behalf of that glock holder. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Andrew Price <anprice@redhat.com>
2025-07-15	gfs2: simplify finish_xmote	Andreas Gruenbacher	1	-8/+0
	As a follow-up to commit a431d49243a0 ("gfs2: Fix request cancelation bug"), it turns out that any call to finish_xmote() is always followed by a call to run_queue(), either * directly when glock_work_func() calls finish_xmote() before calling run_queue(), or * indirectly when do_xmote() calls finish_xmote() before calling gfs2_glock_queue_work(), which queues a call to glock_work_func() in work queue context, so remove the code in finish_xmote() that duplicates the functionality of run_queue(). In addition, the code this commit removes is missing a check for the GLF_DEMOTE flag which indicates that no further promotes should be performed, so if that code didn't get removed, that check would have to be added. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Andrew Price <anprice@redhat.com>
2025-07-15	gfs2: sanitize the gdlm_ast -> finish_xmote interface	Andreas Gruenbacher	3	-14/+34
	When gdlm_ast() is called with a non-zero status code, this means that the requested operation did not succeed and the current lock state didn't change. Turn that into a non-zero LM_OUT_* status code (with ret & ~LM_OUT_ST_MASK != 0) instead of pretending that dlm returned the current lock state. That way, we can easily change finish_xmote() to only update gl->gl_state when the state has actually changed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Andrew Price <anprice@redhat.com>
2025-07-14	iomap: replace iomap_folio_ops with iomap_write_ops	Christoph Hellwig	3	-10/+15
	The iomap_folio_ops are only used for buffered writes, including the zero and unshare variants. Rename them to iomap_write_ops to better describe the usage, and pass them through the call chain like the other operation specific methods instead of through the iomap. xfs_iomap_valid grows a IOMAP_HOLE check to keep the existing behavior that never attached the folio_ops to a iomap representing a hole. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-12-hch@lst.de Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14	iomap: hide ioends from the generic writeback code	Christoph Hellwig	1	-0/+1
	Replace the ioend pointer in iomap_writeback_ctx with a void *wb_ctx one to facilitate non-block, non-ioend writeback for use. Rename the submit_ioend method to writeback_submit and make it mandatory so that the generic writeback code stops seeing ioends and bios. Co-developed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-6-hch@lst.de Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14	iomap: refactor the writeback interface	Christoph Hellwig	1	-12/+14
	Replace ->map_blocks with a new ->writeback_range, which differs in the following ways: - it must also queue up the I/O for writeback, that is called into the slightly refactored and extended in scope iomap_add_to_ioend for each region - can handle only a part of the requested region, that is the retry loop for partial mappings moves to the caller - handles cleanup on failures as well, and thus also replaces the discard_folio method only implemented by XFS. This will allow to use the iomap writeback code also for file systems that are not block based like fuse. Co-developed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-5-hch@lst.de Acked-by: Damien Le Moal <dlemoal@kernel.org> # zonefs Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14	iomap: pass more arguments using the iomap writeback context	Christoph Hellwig	1	-2/+6
	Add inode and wpc fields to pass the inode and writeback context that are needed in the entire writeback call chain, and let the callers initialize all fields in the writeback context before calling iomap_writepages to simplify the argument passing. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-3-hch@lst.de Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-09	gfs2: Minor do_xmote cancelation fix	Andreas Gruenbacher	1	-1/+2
	Commit 6cb3b1c2df87 changed how finish_xmote() clears the GLF_LOCK flag, but it failed to adjust the equivalent code in do_xmote(). Fix that. Fixes: 6cb3b1c2df87 ("gfs2: Fix additional unlikely request cancelation race") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-09	gfs2: Remove GIF_ALLOC_FAILED flag	Andreas Gruenbacher	4	-14/+6
	Get rid of the GIF_ALLOC_FAILED flag; we can now be confident that the additional consistency check isn't needed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-09	gfs2: Use SECTOR_SIZE and SECTOR_SHIFT	Andreas Gruenbacher	3	-10/+10
	Use the SECTOR_SIZE and SECTOR_SHIFT constants where appropriate instead of hardcoding their values. Reported-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-04	tree-wide: s/struct fileattr/struct file_kattr/g	Christian Brauner	2	-4/+4
	Now that we expose struct file_attr as our uapi struct rename all the internal struct to struct file_kattr to clearly communicate that it is a kernel internal struct. This is similar to struct mount_{k}attr and others. Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-11	new helper: set_default_d_op()	Al Viro	1	-1/+1
	... to be used instead of manually assigning to ->s_d_op. All in-tree filesystem converted (and field itself is renamed, so any out-of-tree ones in need of conversion will be caught by compiler). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-30	gfs2: Don't clear sb->s_fs_info in gfs2_sys_fs_add	Andrew Price	2	-2/+3
	When gfs2_sys_fs_add() fails, it sets sb->s_fs_info to NULL on its error path (see commit 0d515210b696 ("GFS2: Add kobject release method")). The intention seems to be to prevent dereferencing sb->s_fs_info once the object pointed to has been deallocated, but that would be better achieved by setting the pointer to NULL in free_sbd(). As a consequence, when the call to gfs2_sys_fs_add() fails in gfs2_fill_super(), sdp = GFS2_SB(inode) will evaluate to NULL in iput() -> gfs2_drop_inode(), and accessing sdp->sd_flags will be a NULL pointer dereference. Fix that by only setting sb->s_fs_info to NULL when actually freeing the object pointed to in free_sbd(). Fixes: ae9f3bd8259a ("gfs2: replace sd_aspace with sd_inode") Reported-by: syzbot+b12826218502df019f9d@syzkaller.appspotmail.com Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-26	Merge tag 'gfs2-for-6.16' of ↵	Linus Torvalds	24	-213/+263
	git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Fix the long-standing warnings in inode_to_wb() when CONFIG_LOCKDEP is enabled: gfs2 doesn't support cgroup writeback and so inode->i_wb will never change. This is the counterpart of commit 9e888998ea4d ("writeback: fix false warning in inode_to_wb()") - Fix a hang introduced by commit 8d391972ae2d ("gfs2: Remove __gfs2_writepage()"): prevent gfs2_logd from creating transactions for jdata pages while trying to flush the log - Fix a race between gfs2_create_inode() and gfs2_evict_inode() by deallocating partially created inodes on the gfs2_create_inode() error path - Fix a bug in the journal head lookup code that could cause mount to fail after successful recovery - Various smaller fixes and cleanups from various people * tag 'gfs2-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (23 commits) gfs2: No more gfs2_find_jhead caching gfs2: Get rid of duplicate log head lookup gfs2: Simplify clean_journal gfs2: Simplify gfs2_log_pointers_init gfs2: Move gfs2_log_pointers_init gfs2: Minor comments fix gfs2: Don't start unnecessary transactions during log flush gfs2: Move gfs2_trans_add_databufs gfs2: Rename jdata_dirty_folio to gfs2_jdata_dirty_folio gfs2: avoid inefficient use of crc32_le_shift() gfs2: Do not call iomap_zero_range beyond eof gfs: don't check for AOP_WRITEPAGE_ACTIVATE in gfs2_write_jdata_batch gfs2: Fix usage of bio->bi_status in gfs2_end_log_write gfs2: deallocate inodes in gfs2_create_inode gfs2: Move GIF_ALLOC_FAILED check out of gfs2_ea_dealloc gfs2: Move gfs2_dinode_dealloc gfs2: Don't reread inodes unnecessarily gfs2: gfs2_create_inode error handling fix gfs2: Remove unnecessary NULL check before free_percpu() gfs2: check sb_min_blocksize return value ...
2025-05-26	Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux	Linus Torvalds	1	-15/+9
	Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezeing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...
2025-05-26	Merge tag 'vfs-6.16-rc1.super' of ↵	Linus Torvalds	2	-11/+17
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs freezing updates from Christian Brauner: "This contains various filesystem freezing related work for this cycle: - Allow the power subsystem to support filesystem freeze for suspend and hibernate. Now all the pieces are in place to actually allow the power subsystem to freeze/thaw filesystems during suspend/resume. Filesystems are only frozen and thawed if the power subsystem does actually own the freeze. If the filesystem is already frozen by the time we've frozen all userspace processes we don't care to freeze it again. That's userspace's job once the process resumes. We only actually freeze filesystems if we absolutely have to and we ignore other failures to freeze. We could bubble up errors and fail suspend/resume if the error isn't EBUSY (aka it's already frozen) but I don't think that this is worth it. Filesystem freezing during suspend/resume is best-effort. If the user has 500 ext4 filesystems mounted and 4 fail to freeze for whatever reason then we simply skip them. What we have now is already a big improvement and let's see how we fare with it before making our lives even harder (and uglier) than we have to. - Allow efivars to support freeze and thaw Allow efivarfs to partake to resync variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. If userspace initiated a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user namespace (as that's where efivarfs is mounted) so it can't be triggered by random userspace. IOW, we really really don't care" * tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: f2fs: fix freezing filesystem during resize kernfs: add warning about implementing freeze/thaw efivarfs: support freeze/thaw power: freeze filesystems during suspend/resume libfs: export find_next_child() super: add filesystem freezing helpers for suspend and hibernate gfs2: pass through holder from the VFS for freeze/thaw super: use common iterator (Part 2) super: use a common iterator (Part 1) super: skip dying superblocks early super: simplify user_get_super() super: remove pointless s_root checks fs: allow all writers to be frozen locking/percpu-rwsem: add freezable alternative to down_read
2025-05-22	gfs2: No more gfs2_find_jhead caching	Andreas Gruenbacher	6	-10/+7
	We are no longer calling gfs2_find_jhead() on the same log twice, so there is no more reason for keeping the log contents cached across those calls. In addition, log head lookup and log header writing didn't go through the same address space and so the caching wasn't even fully working, anyway. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22	gfs2: Get rid of duplicate log head lookup	Andreas Gruenbacher	2	-12/+6
	Currently at mount time, the recovery code looks up the current log head and, if necessary, replays the log and writes a recovery header to indicate that the log is clean. It does that for each log that may need recovery. We also know that our own log will always be checked as part of that process. Then, the mount code looks up the log head of our own log again. The double log head lookup can be costly, but more importantly, it is unnecessary because we can trivially compute the position of the log head after recovery; all we need to do for that is bump the position and lh_sequence by one when writing a recovery header. With that in mind, move the call to gfs2_log_pointers_init() into gfs2_recover_func() and get rid of the double lookup in gfs2_make_fs_rw(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22	gfs2: Simplify clean_journal	Andreas Gruenbacher	1	-3/+3
	In function clean_journal(), update @head to point at the log header that indicates successful recovery: this is where logging needs to resume. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22	gfs2: Simplify gfs2_log_pointers_init	Andreas Gruenbacher	4	-16/+12
	Move the initialization of sdp->sd_log_sequence and sdp->sd_log_flush_head inside gfs2_log_pointers_init(). Use gfs2_replay_incr_blk(). Before this change, the log head lookup code in freeze_go_xmote_bh() didn't update sdp->sd_log_flush_head. This is now fixed, but the code in freeze_go_xmote_bh() appears to be pretty useless in the first place: on a frozen filesystem, the log head will not change. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22	gfs2: Move gfs2_log_pointers_init	Andreas Gruenbacher	3	-11/+10
	Move gfs2_log_pointers_init to recovery.c: there is no need for inlining this function. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22	gfs2: Minor comments fix	Andreas Gruenbacher	1	-2/+2
	Commit 40829760096df ("gfs2: Convert gfs2_find_jhead() to use a folio") replaced grab_cache_page() by filemap_grab_folio(), but the comments were still referring to grab_cache_page(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22	gfs2: Don't start unnecessary transactions during log flush	Andreas Gruenbacher	3	-1/+38
	Commit 8d391972ae2d ("gfs2: Remove __gfs2_writepage()") changed the log flush code in gfs2_ail1_start_one() to call aops->writepages() instead of aops->writepage(). For jdata inodes, this means that we will now try to reserve log space and start a transaction before we can determine that the pages in question have already been journaled. When this happens in the context of gfs2_logd(), it can now appear that not enough log space is available for freeing up log space, and we will lock up. Fix that by issuing journal writes directly instead of going through aops->writepages() in the log flush code. Fixes: 8d391972ae2d ("gfs2: Remove __gfs2_writepage()") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22	gfs2: Move gfs2_trans_add_databufs	Andreas Gruenbacher	5	-25/+26
	Move gfs2_trans_add_databufs() to trans.c. Pass in a glock instead of a gfs2_inode. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22	gfs2: Rename jdata_dirty_folio to gfs2_jdata_dirty_folio	Andreas Gruenbacher	1	-2/+2
	Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22	gfs2: avoid inefficient use of crc32_le_shift()	Eric Biggers	1	-1/+2
	__get_log_header() was using crc32_le_shift() to update a CRC with four zero bytes. However, this is about 5x slower than just CRC'ing four zero bytes in the normal way. Just do that instead. (We could instead make crc32_le_shift() faster on short lengths. But all its callers do just fine without it, so I'd like to just remove it.) Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22	gfs2: Do not call iomap_zero_range beyond eof	Andreas Gruenbacher	1	-2/+4
	Since commit eb65540aa9fc ("iomap: warn on zero range of a post-eof folio"), iomap_zero_range() warns when asked to zero a folio beyond eof. The warning triggers on the following code path: gfs2_fallocate(FALLOC_FL_PUNCH_HOLE \| FALLOC_FL_KEEP_SIZE) __gfs2_punch_hole() gfs2_block_zero_range() iomap_zero_range() In __gfs2_punch_hole(), gfs2 zeroes out partial folios at the beginning and at the end of the specified range, whether those folios are beyond eof or not. This may add folios to the page cache which are entirely beyond eof, which isn't of any use. Avoid that by truncating the range to zero out at eof. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-22	gfs: don't check for AOP_WRITEPAGE_ACTIVATE in gfs2_write_jdata_batch	Christoph Hellwig	1	-18/+10
	__gfs2_jdata_write_folio can't return AOP_WRITEPAGE_ACTIVATE, so don't check for it in gfs2_write_jdata_batch. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-09	super: add filesystem freezing helpers for suspend and hibernate	Christian Brauner	2	-11/+15
	Allow the power subsystem to support filesystem freeze for suspend and hibernate. For some kernel subsystems it is paramount that they are guaranteed that they are the owner of the freeze to avoid any risk of deadlocks. This is the case for the power subsystem. Enable it to recognize whether it did actually freeze the filesystem. If userspace has 10 filesystems and suspend/hibernate manges to freeze 5 and then fails on the 6th for whatever odd reason (current or future) then power needs to undo the freeze of the first 5 filesystems. It can't just walk the list again because while it's unlikely that a new filesystem got added in the meantime it still cannot tell which filesystems the power subsystem actually managed to get a freeze reference count on that needs to be dropped during thaw. There's various ways out of this ugliness. For example, record the filesystems the power subsystem managed to freeze on a temporary list in the callbacks and then walk that list backwards during thaw to undo the freezing or make sure that the power subsystem just actually exclusively freezes things it can freeze and marking such filesystems as being owned by power for the duration of the suspend or resume cycle. I opted for the latter as that seemed the clean thing to do even if it means more code changes. If hibernation races with filesystem freezing (e.g. DM reconfiguration), then hibernation need not freeze a filesystem because it's already frozen but userspace may thaw the filesystem before hibernation actually happens. If the race happens the other way around, DM reconfiguration may unexpectedly fail with EBUSY. So allow FREEZE_EXCL to nest with other holders. An exclusive freezer cannot be undone by any of the other concurrent freezers. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-6-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-07	gfs2: use bdev_rw_virt in gfs2_read_super	Christoph Hellwig	1	-15/+9
	Switch gfs2_read_super to allocate the superblock buffer using kmalloc which falls back to the page allocator for PAGE_SIZE allocation but gives us a kernel virtual address and then use bdev_rw_virt to perform the synchronous read into it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-25	gfs2: Fix usage of bio->bi_status in gfs2_end_log_write	Andrew Price	1	-2/+4
	bio->bi_status is an index into the blk_errors array, not an errno. Its __bitwise tag is cast away here, resulting in a sparse warning: fs/gfs2/lops.c:207:22: warning: cast from restricted blk_status_t We could either add __force to the cast and continue logging bi_status in the error message, or we could look up the errno in the array and log that. As sdp->sd_log_error is used as an errno in all other cases, look up the errno here for consistency. Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-25	gfs2: deallocate inodes in gfs2_create_inode	Andreas Gruenbacher	2	-13/+20
	When creating and destroying inodes, we are relying on the inode hash table to make sure that for a given inode number, only a single inode will exist. We then link that inode to its inode and iopen glock and let those glocks point back at the inode. However, when iget_failed() is called, the inode is removed from the inode hash table before gfs_evict_inode() is called, and uniqueness is no longer guaranteed. Commit f1046a472b70 ("gfs2: gl_object races fix") was trying to work around that problem by detaching the inode glock from the inode before calling iget_failed(), but that broke the inode deallocation code in gfs_evict_inode(). To fix that, deallocate partially created inodes in gfs2_create_inode() instead of relying on gfs_evict_inode() for doing that. This means that gfs2_evict_inode() and its helper functions will no longer see partially created inodes, and so some simplifications are possible there. Fixes: 9ffa18884cce ("gfs2: gl_object races fix") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21	gfs2: Move GIF_ALLOC_FAILED check out of gfs2_ea_dealloc	Andreas Gruenbacher	3	-7/+8
	Don't check for the GIF_ALLOC_FAILED flag in gfs2_ea_dealloc() and pass that information explicitly instead. This allows for a cleaner follow-up patch. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21	gfs2: Move gfs2_dinode_dealloc	Andreas Gruenbacher	3	-68/+69
	Move gfs2_dinode_dealloc() and its helper gfs2_final_release_pages() from super.c to inode.c. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21	gfs2: Don't reread inodes unnecessarily	Andreas Gruenbacher	1	-0/+1
	In gfs2_create_inode(), we initialize the inode from scratch and then we write the result to disk. Clear the GLF_INSTANTIATE_NEEDED glock flag to indicate that the inode is up to date. Otherwise, the next time the inode glock is acquired, gfs2_instantiate() would reread the inode from disk, which isn't necessary. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21	gfs2: gfs2_create_inode error handling fix	Andreas Gruenbacher	1	-1/+2
	When gfs2_create_inode() finds a directory, make sure to return -EISDIR. Fixes: 571a4b57975a ("GFS2: bugger off early if O_CREAT open finds a directory") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21	gfs2: Remove unnecessary NULL check before free_percpu()	Chen Ni	1	-2/+1
	free_percpu() checks for NULL pointers internally. Remove unneeded NULL check here. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21	gfs2: check sb_min_blocksize return value	Edward Adam Davis	1	-1/+6
	Check the return value of sb_min_blocksize(): it will be 0 when the requested block size is invalid. In addition, check the return value of sb_set_blocksize() as well. Reported-by: syzbot+b0018b7468b2af33b4d5@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21	gfs2: replace sd_aspace with sd_inode	Andreas Gruenbacher	7	-23/+32
	Currently, sdp->sd_aspace and the per-inode metadata address spaces use sb->s_bdev->bd_mapping->host as their ->host; folios in those address spaces will thus appear to be on bdev rather than on gfs2 filesystems. This is a problem because gfs2 doesn't support cgroup writeback (SB_I_CGROUPWB), but bdev does. Fix that by using a "dummy" gfs2 inode as ->host in those address spaces. When coming from a folio, folio->mapping->host->i_sb will then be a gfs2 super block and the SB_I_CGROUPWB flag will not be set in sb->s_iflags. Based on a previous version from Bob Peterson from several years ago. Thanks to Tetsuo Handa, Jan Kara, and Rafael Aquini for helping figure this out. Fixes: aaa2cacf8184 ("writeback: add lockdep annotation to inode_to_wb()") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21	gfs2: only apply DLM_LKF_VALBLK if sb_lvbptr is not NULL	Alexander Aring	1	-2/+6
	Currently, gfs2 always sets the DLM_LKF_VALBLK flag to enable lvb handling even when sb_lvbptr is NULL. This currently causes no problems because DLM ignores the DLM_LKF_VALBLK flag when sb_lvbptr is NULL, but it does violate the DLM API. Fix that by only setting DLM_LKF_VALBLK when sb_lvbptr is not NULL. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-21	gfs2: move msleep to sleepable context	Alexander Aring	1	-1/+2
	This patch moves the msleep_interruptible() out of the non-sleepable context by moving the ls->ls_recover_spin spinlock around so msleep_interruptible() will be called in a sleepable context. Cc: stable@vger.kernel.org Fixes: 4a7727725dc7 ("GFS2: Fix recovery issues for spectators") Suggested-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-04-07	gfs2: pass through holder from the VFS for freeze/thaw	Christian Brauner	1	-6/+8
	The filesystem's freeze/thaw functions can be called from contexts where the holder isn't userspace but the kernel, e.g., during systemd suspend/hibernate. So pass through the freeze/thaw flags from the VFS instead of hard-coding them. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-04	lib/crc: remove CONFIG_LIBCRC32C	Eric Biggers	1	-1/+0
	Now that LIBCRC32C does nothing besides select CRC32, make every option that selects LIBCRC32C instead select CRC32 directly. Then remove LIBCRC32C. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-03-27	Merge tag 'gfs2-for-6.15' of ↵	Linus Torvalds	8	-131/+136
	git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Fix two bugs related to locking request cancelation (locking request being retried instead of canceled; canceling the wrong locking request) - Prevent a race between inode creation and deferred delete analogous to commit ffd1cf0443a2 from 6.13. This now allows to further simplify gfs2_evict_inode() without introducing mysterious problems - When in inode delete should be verified / retried "later" but that isn't possible, skip the delete instead of carrying it out immediately. This broke in 6.13 - More folio conversions from Matthew Wilcox (plus a fix from Dan Carpenter) - Various minor fixes and cleanups * tag 'gfs2-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (22 commits) gfs2: some comment clarifications gfs2: Fix a NULL vs IS_ERR() bug in gfs2_find_jhead() gfs2: Convert gfs2_meta_read_endio() to use a folio gfs2: Convert gfs2_end_log_write_bh() to work on a folio gfs2: Convert gfs2_find_jhead() to use a folio gfs2: Convert gfs2_jhead_pg_srch() to gfs2_jhead_folio_search() gfs2: Use b_folio in gfs2_check_magic() gfs2: Use b_folio in gfs2_submit_bhs() gfs2: Use b_folio in gfs2_trans_add_meta() gfs2: Use b_folio in gfs2_log_write_bh() gfs2: skip if we cannot defer delete gfs2: remove redundant warnings gfs2: minor evict fix gfs2: Prevent inode creation race (2) gfs2: Fix additional unlikely request cancelation race gfs2: Fix request cancelation bug gfs2: Check for empty queue in run_queue gfs2: Remove more dead code in add_to_queue gfs2: Replace GIF_DEFER_DELETE with GLF_DEFER_DELETE gfs2: glock holder GL_NOPID fix ...