summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)AuthorFilesLines
2026-01-02ext4: align max orphan file size with e2fsprogs limitBaokun Li1-1/+3
commit 7c11c56eb32eae96893eebafdbe3decadefe88ad upstream. Kernel commit 0a6ce20c1564 ("ext4: verify orphan file size is not too big") limits the maximum supported orphan file size to 8 << 20. However, in e2fsprogs, the orphan file size is set to 32–512 filesystem blocks when creating a filesystem. With 64k block size, formatting an ext4 fs >32G gives an orphan file bigger than the kernel allows, so mount prints an error and fails: EXT4-fs (vdb): orphan file too big: 8650752 EXT4-fs (vdb): mount failed To prevent this issue and allow previously created 64KB filesystems to mount, we updates the maximum allowed orphan file size in the kernel to 512 filesystem blocks. Fixes: 0a6ce20c1564 ("ext4: verify orphan file size is not too big") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251120134233.2994147-1-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-02ext4: fix incorrect group number assertion in mb_check_buddyYongjian Sun1-0/+2
commit 3f7a79d05c692c7cfec70bf104b1b3c3d0ce6247 upstream. When the MB_CHECK_ASSERT macro is enabled, an assertion failure can occur in __mb_check_buddy when checking preallocated blocks (pa) in a block group: Assertion failure in mb_free_blocks() : "groupnr == e4b->bd_group" This happens when a pa at the very end of a block group (e.g., pa_pstart=32765, pa_len=3 in a group of 32768 blocks) becomes exhausted - its pa_pstart is advanced by pa_len to 32768, which lies in the next block group. If this exhausted pa (with pa_len == 0) is still in the bb_prealloc_list during the buddy check, the assertion incorrectly flags it as belonging to the wrong group. A possible sequence is as follows: ext4_mb_new_blocks ext4_mb_release_context pa->pa_pstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len) pa->pa_len -= ac->ac_b_ex.fe_len __mb_check_buddy for each pa in group ext4_get_group_no_and_offset MB_CHECK_ASSERT(groupnr == e4b->bd_group) To fix this, we modify the check to skip block group validation for exhausted preallocations (where pa_len == 0). Such entries are in a transitional state and will be removed from the list soon, so they should not trigger an assertion. This change prevents the false positive while maintaining the integrity of the checks for active allocations. Fixes: c9de560ded61f ("ext4: Add multi block allocator for ext4") Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251106060614.631382-2-sunyongjian@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-02ext4: clear i_state_flags when alloc inodeHaibo Chen3-2/+1
commit 4091c8206cfd2e3bb529ef260887296b90d9b6a2 upstream. i_state_flags used on 32-bit archs, need to clear this flag when alloc inode. Find this issue when umount ext4, sometimes track the inode as orphan accidently, cause ext4 mesg dump. Fixes: acf943e9768e ("ext4: fix checks for orphan inodes") Signed-off-by: Haibo Chen <haibo.chen@nxp.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251104-ext4-v1-1-73691a0800f9@nxp.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-02ext4: xattr: fix null pointer deref in ext4_raw_inode()Karina Yankevich1-1/+5
commit b97cb7d6a051aa6ebd57906df0e26e9e36c26d14 upstream. If ext4_get_inode_loc() fails (e.g. if it returns -EFSCORRUPTED), iloc.bh will remain set to NULL. Since ext4_xattr_inode_dec_ref_all() lacks error checking, this will lead to a null pointer dereference in ext4_raw_inode(), called right after ext4_get_inode_loc(). Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: c8e008b60492 ("ext4: ignore xattrs past end") Cc: stable@kernel.org Signed-off-by: Karina Yankevich <k.yankevich@omp.ru> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251022093253.3546296-1-k.yankevich@omp.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-02ext4: check if mount_opts is NUL-terminated in ext4_ioctl_set_tune_sb()Fedor Pchelkin1-0/+4
commit 3db63d2c2d1d1e78615dd742568c5a2d55291ad1 upstream. params.mount_opts may come as potentially non-NUL-term string. Userspace is expected to pass a NUL-term string. Add an extra check to ensure this holds true. Note that further code utilizes strscpy_pad() so this is just for proper informing the user of incorrect data being provided. Found by Linux Verification Center (linuxtesting.org). Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251101160430.222297-2-pchelkin@ispras.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-02ext4: fix string copying in parse_apply_sb_mount_options()Fedor Pchelkin1-2/+3
commit ee5a977b4e771cc181f39d504426dbd31ed701cc upstream. strscpy_pad() can't be used to copy a non-NUL-term string into a NUL-term string of possibly bigger size. Commit 0efc5990bca5 ("string.h: Introduce memtostr() and memtostr_pad()") provides additional information in that regard. So if this happens, the following warning is observed: strnlen: detected buffer overflow: 65 byte read of buffer size 64 WARNING: CPU: 0 PID: 28655 at lib/string_helpers.c:1032 __fortify_report+0x96/0xc0 lib/string_helpers.c:1032 Modules linked in: CPU: 0 UID: 0 PID: 28655 Comm: syz-executor.3 Not tainted 6.12.54-syzkaller-00144-g5f0270f1ba00 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:__fortify_report+0x96/0xc0 lib/string_helpers.c:1032 Call Trace: <TASK> __fortify_panic+0x1f/0x30 lib/string_helpers.c:1039 strnlen include/linux/fortify-string.h:235 [inline] sized_strscpy include/linux/fortify-string.h:309 [inline] parse_apply_sb_mount_options fs/ext4/super.c:2504 [inline] __ext4_fill_super fs/ext4/super.c:5261 [inline] ext4_fill_super+0x3c35/0xad00 fs/ext4/super.c:5706 get_tree_bdev_flags+0x387/0x620 fs/super.c:1636 vfs_get_tree+0x93/0x380 fs/super.c:1814 do_new_mount fs/namespace.c:3553 [inline] path_mount+0x6ae/0x1f70 fs/namespace.c:3880 do_mount fs/namespace.c:3893 [inline] __do_sys_mount fs/namespace.c:4103 [inline] __se_sys_mount fs/namespace.c:4080 [inline] __x64_sys_mount+0x280/0x300 fs/namespace.c:4080 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x64/0x140 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e Since userspace is expected to provide s_mount_opts field to be at most 63 characters long with the ending byte being NUL-term, use a 64-byte buffer which matches the size of s_mount_opts, so that strscpy_pad() does its job properly. Return with error if the user still managed to provide a non-NUL-term string here. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 8ecb790ea8c3 ("ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()") Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251101160430.222297-1-pchelkin@ispras.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-18ext4: improve integrity checking in __mb_check_buddy by enhancing order-0 ↵Yongjian Sun1-17/+32
validation [ Upstream commit d9ee3ff810f1cc0e253c9f2b17b668b973cb0e06 ] When the MB_CHECK_ASSERT macro is enabled, we found that the current validation logic in __mb_check_buddy has a gap in detecting certain invalid buddy states, particularly related to order-0 (bitmap) bits. The original logic consists of three steps: 1. Validates higher-order buddies: if a higher-order bit is set, at most one of the two corresponding lower-order bits may be free; if a higher-order bit is clear, both lower-order bits must be allocated (and their bitmap bits must be 0). 2. For any set bit in order-0, ensures all corresponding higher-order bits are not free. 3. Verifies that all preallocated blocks (pa) in the group have pa_pstart within bounds and their bitmap bits marked as allocated. However, this approach fails to properly validate cases where order-0 bits are incorrectly cleared (0), allowing some invalid configurations to pass: corrupt integral order 3 1 1 order 2 1 1 1 1 order 1 1 1 1 1 1 1 1 1 order 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Here we get two adjacent free blocks at order-0 with inconsistent higher-order state, and the right one shows the correct scenario. The root cause is insufficient validation of order-0 zero bits. To fix this and improve completeness without significant performance cost, we refine the logic: 1. Maintain the top-down higher-order validation, but we no longer check the cases where the higher-order bit is 0, as this case will be covered in step 2. 2. Enhance order-0 checking by examining pairs of bits: - If either bit in a pair is set (1), all corresponding higher-order bits must not be free. - If both bits are clear (0), then exactly one of the corresponding higher-order bits must be free 3. Keep the preallocation (pa) validation unchanged. This change closes the validation gap, ensuring illegal buddy states involving order-0 are correctly detected, while removing redundant checks and maintaining efficiency. Fixes: c9de560ded61f ("ext4: Add multi block allocator for ext4") Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251106060614.631382-3-sunyongjian@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18ext4: correct the checking of quota files before moving extentsZhang Yi1-1/+1
[ Upstream commit a2e5a3cea4b18f6e2575acc444a5e8cce1fc8260 ] The move extent operation should return -EOPNOTSUPP if any of the inodes is a quota inode, rather than requiring both to be quota inodes. Fixes: 02749a4c2082 ("ext4: add ext4_is_quota_file()") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-2-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-12ext4: add i_data_sem protection in ext4_destroy_inline_data_nolock()Alexey Nepomnyashih1-1/+6
commit 0cd8feea8777f8d9b9a862b89c688b049a5c8475 upstream. Fix a race between inline data destruction and block mapping. The function ext4_destroy_inline_data_nolock() changes the inode data layout by clearing EXT4_INODE_INLINE_DATA and setting EXT4_INODE_EXTENTS. At the same time, another thread may execute ext4_map_blocks(), which tests EXT4_INODE_EXTENTS to decide whether to call ext4_ext_map_blocks() or ext4_ind_map_blocks(). Without i_data_sem protection, ext4_ind_map_blocks() may receive inode with EXT4_INODE_EXTENTS flag and triggering assert. kernel BUG at fs/ext4/indirect.c:546! EXT4-fs (loop2): unmounting filesystem. invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:ext4_ind_map_blocks.cold+0x2b/0x5a fs/ext4/indirect.c:546 Call Trace: <TASK> ext4_map_blocks+0xb9b/0x16f0 fs/ext4/inode.c:681 _ext4_get_block+0x242/0x590 fs/ext4/inode.c:822 ext4_block_write_begin+0x48b/0x12c0 fs/ext4/inode.c:1124 ext4_write_begin+0x598/0xef0 fs/ext4/inode.c:1255 ext4_da_write_begin+0x21e/0x9c0 fs/ext4/inode.c:3000 generic_perform_write+0x259/0x5d0 mm/filemap.c:3846 ext4_buffered_write_iter+0x15b/0x470 fs/ext4/file.c:285 ext4_file_write_iter+0x8e0/0x17f0 fs/ext4/file.c:679 call_write_iter include/linux/fs.h:2271 [inline] do_iter_readv_writev+0x212/0x3c0 fs/read_write.c:735 do_iter_write+0x186/0x710 fs/read_write.c:861 vfs_iter_write+0x70/0xa0 fs/read_write.c:902 iter_file_splice_write+0x73b/0xc90 fs/splice.c:685 do_splice_from fs/splice.c:763 [inline] direct_splice_actor+0x10f/0x170 fs/splice.c:950 splice_direct_to_actor+0x33a/0xa10 fs/splice.c:896 do_splice_direct+0x1a9/0x280 fs/splice.c:1002 do_sendfile+0xb13/0x12c0 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1cf/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Fixes: c755e251357a ("ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()") Cc: stable@vger.kernel.org # v4.11+ Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru> Message-ID: <20251104093326.697381-1-sdl@nppct.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-12ext4: refresh inline data size before write operationsDeepanshu Kartikey1-1/+6
commit 892e1cf17555735e9d021ab036c36bc7b58b0e3b upstream. The cached ei->i_inline_size can become stale between the initial size check and when ext4_update_inline_data()/ext4_create_inline_data() use it. Although ext4_get_max_inline_size() reads the correct value at the time of the check, concurrent xattr operations can modify i_inline_size before ext4_write_lock_xattr() is acquired. This causes ext4_update_inline_data() and ext4_create_inline_data() to work with stale capacity values, leading to a BUG_ON() crash in ext4_write_inline_data(): kernel BUG at fs/ext4/inline.c:1331! BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); The race window: 1. ext4_get_max_inline_size() reads i_inline_size = 60 (correct) 2. Size check passes for 50-byte write 3. [Another thread adds xattr, i_inline_size changes to 40] 4. ext4_write_lock_xattr() acquires lock 5. ext4_update_inline_data() uses stale i_inline_size = 60 6. Attempts to write 50 bytes but only 40 bytes actually available 7. BUG_ON() triggers Fix this by recalculating i_inline_size via ext4_find_inline_data_nolock() immediately after acquiring xattr_sem. This ensures ext4_update_inline_data() and ext4_create_inline_data() work with current values that are protected from concurrent modifications. This is similar to commit a54c4613dac1 ("ext4: fix race writing to an inline_data file while its xattrs are changing") which fixed i_inline_off staleness. This patch addresses the related i_inline_size staleness issue. Reported-by: syzbot+f3185be57d7e8dda32b8@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=f3185be57d7e8dda32b8 Cc: stable@kernel.org Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Message-ID: <20251020060936.474314-1-kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-15Merge tag 'ext4_for_linus-6.18-rc2' of ↵Linus Torvalds3-4/+19
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bug fixes from Ted Ts'o: - Fix regression caused by removing CONFIG_EXT3_FS when testing some very old defconfigs - Avoid a BUG_ON when opening a file on a maliciously corrupted file system - Avoid mm warnings when freeing a very large orphan file metadata - Avoid a theoretical races between metadata writeback and checkpoints (it's very hard to hit in practice, since the race requires that the writeback take a very long time) * tag 'ext4_for_linus-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: Use CONFIG_EXT4_FS instead of CONFIG_EXT3_FS in all of the defconfigs ext4: free orphan info with kvfree ext4: detect invalid INLINE_DATA + EXTENTS flag combination ext4, doc: fix and improve directory hash tree description ext4: wait for ongoing I/O to complete before freeing blocks jbd2: ensure that all ongoing I/O complete before freeing blocks
2025-10-10ext4: free orphan info with kvfreeJan Kara1-2/+2
Orphan info is now getting allocated with kvmalloc_array(). Free it with kvfree() instead of kfree() to avoid complaints from mm. Reported-by: Chris Mason <clm@meta.com> Fixes: 0a6ce20c1564 ("ext4: verify orphan file size is not too big") Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Message-ID: <20251007134936.7291-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-10-10ext4: detect invalid INLINE_DATA + EXTENTS flag combinationDeepanshu Kartikey1-0/+8
syzbot reported a BUG_ON in ext4_es_cache_extent() when opening a verity file on a corrupted ext4 filesystem mounted without a journal. The issue is that the filesystem has an inode with both the INLINE_DATA and EXTENTS flags set: EXT4-fs error (device loop0): ext4_cache_extents:545: inode #15: comm syz.0.17: corrupted extent tree: lblk 0 < prev 66 Investigation revealed that the inode has both flags set: DEBUG: inode 15 - flag=1, i_inline_off=164, has_inline=1, extents_flag=1 This is an invalid combination since an inode should have either: - INLINE_DATA: data stored directly in the inode - EXTENTS: data stored in extent-mapped blocks Having both flags causes ext4_has_inline_data() to return true, skipping extent tree validation in __ext4_iget(). The unvalidated out-of-order extents then trigger a BUG_ON in ext4_es_cache_extent() due to integer underflow when calculating hole sizes. Fix this by detecting this invalid flag combination early in ext4_iget() and rejecting the corrupted inode. Cc: stable@kernel.org Reported-and-tested-by: syzbot+038b7bf43423e132b308@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=038b7bf43423e132b308 Suggested-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Message-ID: <20250930112810.315095-1-kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-10-10ext4: wait for ongoing I/O to complete before freeing blocksZhang Yi1-2/+9
When freeing metadata blocks in nojournal mode, ext4_forget() calls bforget() to clear the dirty flag on the buffer_head and remvoe associated mappings. This is acceptable if the metadata has not yet begun to be written back. However, if the write-back has already started but is not yet completed, ext4_forget() will have no effect. Subsequently, ext4_mb_clear_bb() will immediately return the block to the mb allocator. This block can then be reallocated immediately, potentially causing an data corruption issue. Fix this by clearing the buffer's dirty flag and waiting for the ongoing I/O to complete, ensuring that no further writes to stale data will occur. Fixes: 16e08b14a455 ("ext4: cleanup clean_bdev_aliases() calls") Cc: stable@kernel.org Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com> Closes: https://lore.kernel.org/linux-ext4/a9417096-9549-4441-9878-b1955b899b4e@huaweicloud.com/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20250916093337.3161016-3-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-10-03Merge tag 'ext4_for_linus-6.18-rc1' of ↵Linus Torvalds14-117/+413
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "New ext4 features: - Add support so tune2fs can modify/update the superblock using an ioctl, without needing write access to the block device - Add support for 32-bit reserved uid's and gid's Bug fixes: - Fix potential warnings and other failures caused by corrupted / fuzzed file systems - Fail unaligned direct I/O write with EINVAL instead of silently falling back to buffered I/O - Correectly handle fsmap queries for metadata mappings - Avoid journal stalls caused by writeback throttling - Add some missing GFP_NOFAIL flags to avoid potential deadlocks under extremem memory pressure Cleanups: - Remove obsolete EXT3 Kconfigs" * tag 'ext4_for_linus-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix checks for orphan inodes ext4: validate ea_ino and size in check_xattrs ext4: guard against EA inode refcount underflow in xattr update ext4: implemet new ioctls to set and get superblock parameters ext4: add support for 32-bit default reserved uid and gid values ext4: avoid potential buffer over-read in parse_apply_sb_mount_options() ext4: fix an off-by-one issue during moving extents ext4: increase i_disksize to offset + len in ext4_update_disksize_before_punch() ext4: verify orphan file size is not too big ext4: fail unaligned direct IO write with EINVAL ext4: correctly handle queries for metadata mappings ext4: increase IO priority of fastcommit ext4: remove obsolete EXT3 config options jbd2: increase IO priority of checkpoint ext4: fix potential null deref in ext4_mb_init() ext4: add ext4_sb_bread_nofail() helper function for ext4_free_branches() ext4: replace min/max nesting with clamp() fs: ext4: change GFP_KERNEL to GFP_NOFS to avoid deadlock
2025-09-29Merge tag 'vfs-6.18-rc1.workqueue' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq
2025-09-29Merge tag 'vfs-6.18-rc1.inode' of ↵Linus Torvalds5-2/+20
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "This contains a series I originally wrote and that Eric brought over the finish line. It moves out the i_crypt_info and i_verity_info pointers out of 'struct inode' and into the fs-specific part of the inode. So now the few filesytems that actually make use of this pay the price in their own private inode storage instead of forcing it upon every user of struct inode. The pointer for the crypt and verity info is simply found by storing an offset to its address in struct fsverity_operations and struct fscrypt_operations. This shrinks struct inode by 16 bytes. I hope to move a lot more out of it in the future so that struct inode becomes really just about very core stuff that we need, much like struct dentry and struct file, instead of the dumping ground it has become over the years. On top of this are a various changes associated with the ongoing inode lifetime handling rework that multiple people are pushing forward: - Stop accessing inode->i_count directly in f2fs and gfs2. They simply should use the __iget() and iput() helpers - Make the i_state flags an enum - Rework the iput() logic Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. This is to make sure we delay the time update for as long as possible We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above - Add an icount_read() helper and convert everyone that accesses inode->i_count directly for this purpose to use the helper - Expand dump_inode() to dump more information about an inode helping in debugging - Add some might_sleep() annotations to iput() and associated helpers" * tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add might_sleep() annotation to iput() and more fs: expand dump_inode() inode: fix whitespace issues fs: add an icount_read helper fs: rework iput logic fs: make the i_state flags an enum fs: stop accessing ->i_count directly in f2fs and gfs2 fsverity: check IS_VERITY() in fsverity_cleanup_inode() fs: remove inode::i_verity_info btrfs: move verity info pointer to fs-specific part of inode f2fs: move verity info pointer to fs-specific part of inode ext4: move verity info pointer to fs-specific part of inode fsverity: add support for info in fs-specific part of inode fs: remove inode::i_crypt_info ceph: move crypt info pointer to fs-specific part of inode ubifs: move crypt info pointer to fs-specific part of inode f2fs: move crypt info pointer to fs-specific part of inode ext4: move crypt info pointer to fs-specific part of inode fscrypt: add support for info in fs-specific part of inode fscrypt: replace raw loads of info pointer with helper function
2025-09-29Merge tag 'vfs-6.18-rc1.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...
2025-09-26ext4: fix checks for orphan inodesJan Kara5-9/+15
When orphan file feature is enabled, inode can be tracked as orphan either in the standard orphan list or in the orphan file. The first can be tested by checking ei->i_orphan list head, the second is recorded by EXT4_STATE_ORPHAN_FILE inode state flag. There are several places where we want to check whether inode is tracked as orphan and only some of them properly check for both possibilities. Luckily the consequences are mostly minor, the worst that can happen is that we track an inode as orphan although we don't need to and e2fsck then complains (resulting in occasional ext4/307 xfstest failures). Fix the problem by introducing a helper for checking whether an inode is tracked as orphan and use it in appropriate places. Fixes: 4a79a98c7b19 ("ext4: Improve scalability of ext4 orphan file handling") Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Message-ID: <20250925123038.20264-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: validate ea_ino and size in check_xattrsDeepanshu Kartikey1-0/+4
During xattr block validation, check_xattrs() processes xattr entries without validating that entries claiming to use EA inodes have non-zero sizes. Corrupted filesystems may contain xattr entries where e_value_size is zero but e_value_inum is non-zero, indicating invalid xattr data. Add validation in check_xattrs() to detect this corruption pattern early and return -EFSCORRUPTED, preventing invalid xattr entries from causing issues throughout the ext4 codebase. Cc: stable@kernel.org Suggested-by: Theodore Ts'o <tytso@mit.edu> Reported-by: syzbot+4c9d23743a2409b80293@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=4c9d23743a2409b80293 Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Message-ID: <20250923133245.1091761-1-kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: guard against EA inode refcount underflow in xattr updateAhmet Eray Karadag1-7/+8
syzkaller found a path where ext4_xattr_inode_update_ref() reads an EA inode refcount that is already <= 0 and then applies ref_change (often -1). That lets the refcount underflow and we proceed with a bogus value, triggering errors like: EXT4-fs error: EA inode <n> ref underflow: ref_count=-1 ref_change=-1 EXT4-fs warning: ea_inode dec ref err=-117 Make the invariant explicit: if the current refcount is non-positive, treat this as on-disk corruption, emit ext4_error_inode(), and fail the operation with -EFSCORRUPTED instead of updating the refcount. Delete the WARN_ONCE() as negative refcounts are now impossible; keep error reporting in ext4_error_inode(). This prevents the underflow and the follow-on orphan/cleanup churn. Reported-by: syzbot+0be4f339a8218d2a5bb1@syzkaller.appspotmail.com Fixes: https://syzbot.org/bug?extid=0be4f339a8218d2a5bb1 Cc: stable@kernel.org Co-developed-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Ahmet Eray Karadag <eraykrdg1@gmail.com> Message-ID: <20250920021342.45575-1-eraykrdg1@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: implemet new ioctls to set and get superblock parametersTheodore Ts'o1-7/+305
Implement the EXT4_IOC_GET_TUNE_SB_PARAM and EXT4_IOC_SET_TUNE_SB_PARAM ioctls, which allow certains superblock parameters to be set while the file system is mounted, without needing write access to the block device. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Message-ID: <20250916-tune2fs-v2-3-d594dc7486f0@mit.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: add support for 32-bit default reserved uid and gid valuesTheodore Ts'o2-5/+19
Support for specifying the default user id and group id that is allowed to use the reserved block space was added way back when Linux only supported 16-bit uid's and gid's. (Yeah, that long ago.) It's not a commonly used feature, but let's add support for 32-bit user and group id's. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Message-ID: <20250916-tune2fs-v2-2-d594dc7486f0@mit.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()Theodore Ts'o1-12/+5
Unlike other strings in the ext4 superblock, we rely on tune2fs to make sure s_mount_opts is NUL terminated. Harden parse_apply_sb_mount_options() by treating s_mount_opts as a potential __nonstring. Cc: stable@vger.kernel.org Fixes: 8b67f04ab9de ("ext4: Add mount options in superblock") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Message-ID: <20250916-tune2fs-v2-1-d594dc7486f0@mit.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: fix an off-by-one issue during moving extentsZhang Yi1-1/+1
During the movement of a written extent, mext_page_mkuptodate() is called to read data in the range [from, to) into the page cache and to update the corresponding buffers. Therefore, we should not wait on any buffer whose start offset is >= 'to'. Otherwise, it will return -EIO and fail the extents movement. $ for i in `seq 3 -1 0`; \ do xfs_io -fs -c "pwrite -b 1024 $((i * 1024)) 1024" /mnt/foo; \ done $ umount /mnt && mount /dev/pmem1s /mnt # drop cache $ e4defrag /mnt/foo e4defrag 1.47.0 (5-Feb-2023) ext4 defragmentation for /mnt/foo [1/1]/mnt/foo: 0% [ NG ] Success: [0/1] Cc: stable@kernel.org Fixes: a40759fb16ae ("ext4: remove array of buffer_heads from mext_page_mkuptodate()") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20250912105841.1886799-1-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: increase i_disksize to offset + len in ext4_update_disksize_before_punch()Yongjian Sun1-2/+8
After running a stress test combined with fault injection, we performed fsck -a followed by fsck -fn on the filesystem image. During the second pass, fsck -fn reported: Inode 131512, end of extent exceeds allowed value (logical block 405, physical block 1180540, len 2) This inode was not in the orphan list. Analysis revealed the following call chain that leads to the inconsistency: ext4_da_write_end() //does not update i_disksize ext4_punch_hole() //truncate folio, keep size ext4_page_mkwrite() ext4_block_page_mkwrite() ext4_block_write_begin() ext4_get_block() //insert written extent without update i_disksize journal commit echo 1 > /sys/block/xxx/device/delete da-write path updates i_size but does not update i_disksize. Then ext4_punch_hole truncates the da-folio yet still leaves i_disksize unchanged(in the ext4_update_disksize_before_punch function, the condition offset + len < size is met). Then ext4_page_mkwrite sees ext4_nonda_switch return 1 and takes the nodioread_nolock path, the folio about to be written has just been punched out, and it’s offset sits beyond the current i_disksize. This may result in a written extent being inserted, but again does not update i_disksize. If the journal gets committed and then the block device is yanked, we might run into this. It should be noted that replacing ext4_punch_hole with ext4_zero_range in the call sequence may also trigger this issue, as neither will update i_disksize under these circumstances. To fix this, we can modify ext4_update_disksize_before_punch to increase i_disksize to min(i_size, offset + len) when both i_size and (offset + len) are greater than i_disksize. Cc: stable@kernel.org Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20250911133024.1841027-1-sunyongjian@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: verify orphan file size is not too bigJan Kara1-1/+12
In principle orphan file can be arbitrarily large. However orphan replay needs to traverse it all and we also pin all its buffers in memory. Thus filesystems with absurdly large orphan files can lead to big amounts of memory consumed. Limit orphan file size to a sane value and also use kvmalloc() for allocating array of block descriptor structures to avoid large order allocations for sane but large orphan files. Reported-by: syzbot+0b92850d68d9b12934f5@syzkaller.appspotmail.com Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling") Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Message-ID: <20250909112206.10459-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: fail unaligned direct IO write with EINVALJan Kara1-35/+0
Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances") changed the error handling logic in iomap_iter(). Previously any error from iomap_dio_bio_iter() got propagated to userspace, after this commit if ->iomap_end returns error, it gets propagated to userspace instead of an error from iomap_dio_bio_iter(). This results in unaligned writes to ext4 to silently fallback to buffered IO instead of erroring out. Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems unnecessary these days. It is enough to return ENOTBLK from ext4_iomap_begin() when we don't support DIO write for that particular file offset (due to hole). Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances") Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Message-ID: <20250901112739.32484-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-26ext4: correctly handle queries for metadata mappingsOjaswin Mujoo1-5/+9
Currently, our handling of metadata is _ambiguous_ in some scenarios, that is, we end up returning unknown if the range only covers the mapping partially. For example, in the following case: $ xfs_io -c fsmap -d 0: 254:16 [0..7]: static fs metadata 8 1: 254:16 [8..15]: special 102:1 8 2: 254:16 [16..5127]: special 102:2 5112 3: 254:16 [5128..5255]: special 102:3 128 4: 254:16 [5256..5383]: special 102:4 128 5: 254:16 [5384..70919]: inodes 65536 6: 254:16 [70920..70967]: unknown 48 ... $ xfs_io -c fsmap -d 24 33 0: 254:16 [24..39]: unknown 16 <--- incomplete reporting $ xfs_io -c fsmap -d 24 33 (With patch) 0: 254:16 [16..5127]: special 102:2 5112 This is because earlier in ext4_getfsmap_meta_helper, we end up ignoring any extent that starts before our queried range, but overlaps it. While the man page [1] is a bit ambiguous on this, this fix makes the output make more sense since we are anyways returning an "unknown" extent. This is also consistent to how XFS does it: $ xfs_io -c fsmap -d ... 6: 254:16 [104..127]: free space 24 7: 254:16 [128..191]: inodes 64 ... $ xfs_io -c fsmap -d 137 150 0: 254:16 [128..191]: inodes 64 <-- full extent returned [1] https://man7.org/linux/man-pages/man2/ioctl_getfsmap.2.html Reported-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: stable@kernel.org Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <023f37e35ee280cd9baac0296cbadcbe10995cab.1757058211.git.ojaswin@linux.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25ext4: increase IO priority of fastcommitJulian Sun1-1/+1
The following code paths may result in high latency or even task hangs: 1. fastcommit io is throttled by wbt. 2. jbd2_fc_wait_bufs() might wait for a long time while JBD2_FAST_COMMIT_ONGOING is set in journal->flags, and then jbd2_journal_commit_transaction() waits for the JBD2_FAST_COMMIT_ONGOING bit for a long time while holding the write lock of j_state_lock. 3. start_this_handle() waits for read lock of j_state_lock which results in high latency or task hang. Given the fact that ext4_fc_commit() already modifies the current process' IO priority to match that of the jbd2 thread, it should be reasonable to match jbd2's IO submission flags as well. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Julian Sun <sunjunchao@bytedance.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20250827121812.1477634-1-sunjunchao@bytedance.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25ext4: remove obsolete EXT3 config optionsLukas Bulwahn1-27/+0
In June 2015, commit c290ea01abb7 ("fs: Remove ext3 filesystem driver") removed the historic ext3 filesystem support as ext3 partitions are fully supported with the ext4 filesystem support. To simplify updating the kernel build configuration, which had only EXT3 support but not EXT4 support enabled, the three config options EXT3_{FS,FS_POSIX_ACL,FS_SECURITY} were kept, instead of immediately removing them. The three options just enable the corresponding EXT4 counterparts when configs from older kernel versions are used to build on later kernel versions. This ensures that the kernels from those kernel build configurations would then continue to have EXT4 enabled for supporting booting from ext3 and ext4 file systems, to avoid potential unexpected surprises. Given that the kernel build configuration has no backwards-compatibility guarantee and this transition phase for such build configurations has been in place for a decade, we can reasonably expect all such users to have transitioned to use the EXT4 config options in their config files at this point in time. With that in mind, the three EXT3 config options are obsolete by now. Remove the obsolete EXT3 config options. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25ext4: fix potential null deref in ext4_mb_init()Baokun Li1-0/+10
In ext4_mb_init(), ext4_mb_avg_fragment_size_destroy() may be called when sbi->s_mb_avg_fragment_size remains uninitialized (e.g., if groupinfo slab cache allocation fails). Since ext4_mb_avg_fragment_size_destroy() lacks null pointer checking, this leads to a null pointer dereference. ================================================================== EXT4-fs: no memory for groupinfo slab cache BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: Oops: 0002 [#1] SMP PTI CPU:2 UID: 0 PID: 87 Comm:mount Not tainted 6.17.0-rc2 #1134 PREEMPT(none) RIP: 0010:_raw_spin_lock_irqsave+0x1b/0x40 Call Trace: <TASK> xa_destroy+0x61/0x130 ext4_mb_init+0x483/0x540 __ext4_fill_super+0x116d/0x17b0 ext4_fill_super+0xd3/0x280 get_tree_bdev_flags+0x132/0x1d0 vfs_get_tree+0x29/0xd0 do_new_mount+0x197/0x300 __x64_sys_mount+0x116/0x150 do_syscall_64+0x50/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ================================================================== Therefore, add necessary null check to ext4_mb_avg_fragment_size_destroy() to prevent this issue. The same fix is also applied to ext4_mb_largest_free_orders_destroy(). Reported-by: syzbot+1713b1aa266195b916c2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1713b1aa266195b916c2 Cc: stable@kernel.org Fixes: f7eaacbb4e54 ("ext4: convert free groups order lists to xarrays") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25ext4: add ext4_sb_bread_nofail() helper function for ext4_free_branches()Baokun Li3-1/+12
The implicit __GFP_NOFAIL flag in ext4_sb_bread() was removed in commit 8a83ac54940d ("ext4: call bdev_getblk() from sb_getblk_gfp()"), meaning the function can now fail under memory pressure. Most callers of ext4_sb_bread() propagate the error to userspace and do not remount the filesystem read-only. However, ext4_free_branches() handles ext4_sb_bread() failure by remounting the filesystem read-only. This implies that an ext3 filesystem (mounted via the ext4 driver) could be forcibly remounted read-only due to a transient page allocation failure, which is unacceptable. To mitigate this, introduce a new helper function, ext4_sb_bread_nofail(), which explicitly uses __GFP_NOFAIL, and use it in ext4_free_branches(). Fixes: 8a83ac54940d ("ext4: call bdev_getblk() from sb_getblk_gfp()") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25ext4: replace min/max nesting with clamp()Xichao Zhao1-3/+3
The clamp() macro explicitly expresses the intent of constraining a value within bounds.Therefore, replacing max(min(a,b),c) with clamp(val, lo, hi) can improve code readability. Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-25fs: ext4: change GFP_KERNEL to GFP_NOFS to avoid deadlockchuguangqing1-1/+1
The parent function ext4_xattr_inode_lookup_create already uses GFP_NOFS for memory alloction, so the function ext4_xattr_inode_cache_find should use same gfp_flag. Signed-off-by: chuguangqing <chuguangqing@inspur.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-09-19fs: replace use of system_unbound_wq with system_dfl_wqMarco Crivellari1-1/+1
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/20250916082906.77439-2-marco.crivellari@suse.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15fs: rename generic_delete_inode() and generic_drop_inode()Mateusz Guzik1-1/+1
generic_delete_inode() is rather misleading for what the routine is doing. inode_just_drop() should be much clearer. The new naming is inconsistent with generic_drop_inode(), so rename that one as well with inode_ as the suffix. No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01fs: add an icount_read helperJosef Bacik1-2/+2
Instead of doing direct access to ->i_count, add a helper to handle this. This will make it easier to convert i_count to a refcount later. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/9bc62a84c6b9d6337781203f60837bd98fbc4a96.1756222464.git.josef@toxicpanda.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-21ext4: move verity info pointer to fs-specific part of inodeEric Biggers3-0/+9
Move the fsverity_info pointer into the filesystem-specific part of the inode by adding the field ext4_inode_info::i_verity_info and configuring fsverity_operations::inode_info_offs accordingly. This is a prerequisite for a later commit that removes inode::i_verity_info, saving memory and improving cache efficiency on filesystems that don't support fsverity. Co-developed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://lore.kernel.org/20250810075706.172910-10-ebiggers@kernel.org Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-21ext4: move crypt info pointer to fs-specific part of inodeEric Biggers3-0/+9
Move the fscrypt_inode_info pointer into the filesystem-specific part of the inode by adding the field ext4_inode_info::i_crypt_info and configuring fscrypt_operations::inode_info_offs accordingly. This is a prerequisite for a later commit that removes inode::i_crypt_info, saving memory and improving cache efficiency with filesystems that don't support fscrypt. Co-developed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://lore.kernel.org/20250810075706.172910-4-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-18Merge tag 'ext4_for_linus-6.17-rc3' of ↵Linus Torvalds7-18/+36
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: - Fix fast commit checks for file systems with ea_inode enabled - Don't drop the i_version mount option on a remount - Fix FIEMAP reporting when there are holes in a bigalloc file system - Don't fail when mounting read-only when there are inodes in the orphan file - Fix hole length overflow for indirect mapped files on file systems with an 8k or 16k block file system * tag 'ext4_for_linus-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: prevent softlockup in jbd2_log_do_checkpoint() ext4: fix incorrect function name in comment ext4: use kmalloc_array() for array space allocation ext4: fix hole length calculation overflow in non-extent inodes ext4: don't try to clear the orphan_present feature block device is r/o ext4: fix reserved gdt blocks handling in fsmap ext4: fix fsmap end of range reporting with bigalloc ext4: remove redundant __GFP_NOWARN ext4: fix unused variable warning in ext4_init_new_dir ext4: remove useless if check ext4: check fast symlink for ea_inode correctly ext4: preserve SB_I_VERSION on remount ext4: show the default enabled i_version option
2025-08-13ext4: fix incorrect function name in commentBaolin Liu1-1/+1
Since commit 6b730a405037 “ext4: hoist ext4_block_write_begin and replace the __block_write_begin”, the comment should be updated accordingly from "__block_write_begin" to "ext4_block_write_begin". Fixes: 6b730a405037 (“ext4: hoist ext4_block_write_begin and replace...") Signed-off-by: Baolin Liu <liubaolin@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/20250812021709.1120716-1-liubaolin12138@163.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-13ext4: use kmalloc_array() for array space allocationLiao Yuanhong1-2/+3
Replace kmalloc(size * sizeof) with kmalloc_array() for safer memory allocation and overflow prevention. Cc: stable@kernel.org Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Link: https://patch.msgid.link/20250811125816.570142-1-liaoyuanhong@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-13ext4: fix hole length calculation overflow in non-extent inodesZhang Yi1-2/+2
In a filesystem with a block size larger than 4KB, the hole length calculation for a non-extent inode in ext4_ind_map_blocks() can easily exceed INT_MAX. Then it could return a zero length hole and trigger the following waring and infinite in the iomap infrastructure. ------------[ cut here ]------------ WARNING: CPU: 3 PID: 434101 at fs/iomap/iter.c:34 iomap_iter_done+0x148/0x190 CPU: 3 UID: 0 PID: 434101 Comm: fsstress Not tainted 6.16.0-rc7+ #128 PREEMPT(voluntary) Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : iomap_iter_done+0x148/0x190 lr : iomap_iter+0x174/0x230 sp : ffff8000880af740 x29: ffff8000880af740 x28: ffff0000db8e6840 x27: 0000000000000000 x26: 0000000000000000 x25: ffff8000880af830 x24: 0000004000000000 x23: 0000000000000002 x22: 000001bfdbfa8000 x21: ffffa6a41c002e48 x20: 0000000000000001 x19: ffff8000880af808 x18: 0000000000000000 x17: 0000000000000000 x16: ffffa6a495ee6cd0 x15: 0000000000000000 x14: 00000000000003d4 x13: 00000000fa83b2da x12: 0000b236fc95f18c x11: ffffa6a4978b9c08 x10: 0000000000001da0 x9 : ffffa6a41c1a2a44 x8 : ffff8000880af5c8 x7 : 0000000001000000 x6 : 0000000000000000 x5 : 0000000000000004 x4 : 000001bfdbfa8000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : 0000004004030000 x0 : 0000000000000000 Call trace: iomap_iter_done+0x148/0x190 (P) iomap_iter+0x174/0x230 iomap_fiemap+0x154/0x1d8 ext4_fiemap+0x110/0x140 [ext4] do_vfs_ioctl+0x4b8/0xbc0 __arm64_sys_ioctl+0x8c/0x120 invoke_syscall+0x6c/0x100 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x38/0x120 el0t_64_sync_handler+0x10c/0x138 el0t_64_sync+0x198/0x1a0 ---[ end trace 0000000000000000 ]--- Cc: stable@kernel.org Fixes: facab4d9711e ("ext4: return hole from ext4_map_blocks()") Reported-by: Qu Wenruo <wqu@suse.com> Closes: https://lore.kernel.org/linux-ext4/9b650a52-9672-4604-a765-bb6be55d1e4a@gmx.com/ Tested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250811064532.1788289-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-13ext4: don't try to clear the orphan_present feature block device is r/oTheodore Ts'o1-0/+2
When the file system is frozen in preparation for taking an LVM snapshot, the journal is checkpointed and if the orphan_file feature is enabled, and the orphan file is empty, we clear the orphan_present feature flag. But if there are pending inodes that need to be removed the orphan_present feature flag can't be cleared. The problem comes if the block device is read-only. In that case, we can't process the orphan inode list, so it is skipped in ext4_orphan_cleanup(). But then in ext4_mark_recovery_complete(), this results in the ext4 error "Orphan file not empty on read-only fs" firing and the file system mount is aborted. Fix this by clearing the needs_recovery flag in the block device is read-only. We do this after the call to ext4_load_and_init-journal() since there are some error checks need to be done in case the journal needs to be replayed and the block device is read-only, or if the block device containing the externa journal is read-only, etc. Cc: stable@kernel.org Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1108271 Cc: stable@vger.kernel.org Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-13ext4: fix reserved gdt blocks handling in fsmapOjaswin Mujoo1-0/+8
In some cases like small FSes with no meta_bg and where the resize doesn't need extra gdt blocks as it can fit in the current one, s_reserved_gdt_blocks is set as 0, which causes fsmap to emit a 0 length entry, which is incorrect. $ mkfs.ext4 -b 65536 -O bigalloc /dev/sda 5G $ mount /dev/sda /mnt/scratch $ xfs_io -c "fsmap -d" /mnt/scartch 0: 253:48 [0..127]: static fs metadata 128 1: 253:48 [128..255]: special 102:1 128 2: 253:48 [256..255]: special 102:2 0 <---- 0 len entry 3: 253:48 [256..383]: special 102:3 128 Fix this by adding a check for this case. Cc: stable@kernel.org Fixes: 0c9ec4beecac ("ext4: support GETFSMAP ioctls") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/08781b796453a5770112aa96ad14c864fbf31935.1754377641.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-13ext4: fix fsmap end of range reporting with bigallocOjaswin Mujoo1-3/+12
With bigalloc enabled, the logic to report last extent has a bug since we try to use cluster units instead of block units. This can cause an issue where extra incorrect entries might be returned back to the user. This was flagged by generic/365 with 64k bs and -O bigalloc. ** Details of issue ** The issue was noticed on 5G 64k blocksize FS with -O bigalloc which has only 1 bg. $ xfs_io -c "fsmap -d" /mnt/scratch 0: 253:48 [0..127]: static fs metadata 128 /* sb */ 1: 253:48 [128..255]: special 102:1 128 /* gdt */ 3: 253:48 [256..383]: special 102:3 128 /* block bitmap */ 4: 253:48 [384..2303]: unknown 1920 /* flex bg empty space */ 5: 253:48 [2304..2431]: special 102:4 128 /* inode bitmap */ 6: 253:48 [2432..4351]: unknown 1920 /* flex bg empty space */ 7: 253:48 [4352..6911]: inodes 2560 8: 253:48 [6912..538623]: unknown 531712 9: 253:48 [538624..10485759]: free space 9947136 The issue can be seen with: $ xfs_io -c "fsmap -d 0 3" /mnt/scratch 0: 253:48 [0..127]: static fs metadata 128 1: 253:48 [384..2047]: unknown 1664 Only the first entry was expected to be returned but we get 2. This is because: ext4_getfsmap_datadev() first_cluster, last_cluster = 0 ... info->gfi_last = true; ext4_getfsmap_datadev_helper(sb, end_ag, last_cluster + 1, 0, info); fsb = C2B(1) = 16 fslen = 0 ... /* Merge in any relevant extents from the meta_list */ list_for_each_entry_safe(p, tmp, &info->gfi_meta_list, fmr_list) { ... // since fsb = 16, considers all metadata which starts before 16 blockno iter 1: error = ext4_getfsmap_helper(sb, info, p); // p = sb (0,1), nop info->gfi_next_fsblk = 1 iter 2: error = ext4_getfsmap_helper(sb, info, p); // p = gdt (1,2), nop info->gfi_next_fsblk = 2 iter 3: error = ext4_getfsmap_helper(sb, info, p); // p = blk bitmap (2,3), nop info->gfi_next_fsblk = 3 iter 4: error = ext4_getfsmap_helper(sb, info, p); // p = ino bitmap (18,19) if (rec_blk > info->gfi_next_fsblk) { // (18 > 3) // emits an extra entry ** BUG ** } } Fix this by directly calling ext4_getfsmap_datadev() with a dummy record that has fmr_physical set to (end_fsb + 1) instead of last_cluster + 1. By using the block instead of cluster we get the correct behavior. Replacing ext4_getfsmap_datadev_helper() with ext4_getfsmap_helper() is okay since the gfi_lastfree and metadata checks in ext4_getfsmap_datadev_helper() are anyways redundant when we only want to emit the last allocated block of the range, as we have already taken care of emitting metadata and any last free blocks. Cc: stable@kernel.org Reported-by: Disha Goel <disgoel@linux.ibm.com> Fixes: 4a622e4d477b ("ext4: fix FS_IOC_GETFSMAP handling") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/e7472c8535c9c5ec10f425f495366864ea12c9da.1754377641.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-13ext4: remove redundant __GFP_NOWARNQianfeng Rong2-2/+2
GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant __GFP_NOWARN. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Link: https://patch.msgid.link/20250803102243.623705-4-rongqianfeng@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-13ext4: fix unused variable warning in ext4_init_new_dirTheodore Ts'o1-2/+0
Cc: stable@kernel.org Fixes: 90f097b1403f ("ext4: refactor the inline directory conversion and...") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-08-13ext4: remove useless if checkAntonio Quartulli1-2/+0
This if branch is only jumping to 'out' which is defined just after the branch itself. Hence this is if-check is a no-op and can be removed. Address-Coverity-ID: 1647981 ("Incorrect expression (IDENTICAL_BRANCHES)") Signed-off-by: Antonio Quartulli <antonio@mandelbit.com> Link: https://patch.msgid.link/20250721200902.1071-1-antonio@mandelbit.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>