summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)AuthorFilesLines
2026-04-02ext4: always drain queued discard work in ext4_mb_release()Theodore Ts'o1-7/+5
commit 9ee29d20aab228adfb02ca93f87fb53c56c2f3af upstream. While reviewing recent ext4 patch[1], Sashiko raised the following concern[2]: > If the filesystem is initially mounted with the discard option, > deleting files will populate sbi->s_discard_list and queue > s_discard_work. If it is then remounted with nodiscard, the > EXT4_MOUNT_DISCARD flag is cleared, but the pending s_discard_work is > neither cancelled nor flushed. [1] https://lore.kernel.org/r/20260319094545.19291-1-qiang.zhang@linux.dev/ [2] https://sashiko.dev/#/patchset/20260319094545.19291-1-qiang.zhang%40linux.dev The concern was valid, but it had nothing to do with the patch[1]. One of the problems with Sashiko in its current (early) form is that it will detect pre-existing issues and report it as a problem with the patch that it is reviewing. In practice, it would be hard to hit deliberately (unless you are a malicious syzkaller fuzzer), since it would involve mounting the file system with -o discard, and then deleting a large number of files, remounting the file system with -o nodiscard, and then immediately unmounting the file system before the queued discard work has a change to drain on its own. Fix it because it's a real bug, and to avoid Sashiko from raising this concern when analyzing future patches to mballoc.c. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Fixes: 55cdd0af2bc5 ("ext4: get discard out of jbd2 commit kthread contex") Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: fix iloc.bh leak in ext4_fc_replay_inode() error pathsBaokun Li1-5/+8
commit ec0a7500d8eace5b4f305fa0c594dd148f0e8d29 upstream. During code review, Joseph found that ext4_fc_replay_inode() calls ext4_get_fc_inode_loc() to get the inode location, which holds a reference to iloc.bh that must be released via brelse(). However, several error paths jump to the 'out' label without releasing iloc.bh: - ext4_handle_dirty_metadata() failure - sync_dirty_buffer() failure - ext4_mark_inode_used() failure - ext4_iget() failure Fix this by introducing an 'out_brelse' label placed just before the existing 'out' label to ensure iloc.bh is always released. Additionally, make ext4_fc_replay_inode() propagate errors properly instead of always returning 0. Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260323060836.3452660-1-libaokun@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: handle wraparound when searching for blocks for indirect mapped blocksTheodore Ts'o1-0/+2
commit bb81702370fad22c06ca12b6e1648754dbc37e0f upstream. Commit 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") restricts what blocks will be allocated for indirect block based files to block numbers that fit within 32-bit block numbers. However, when using a review bot running on the latest Gemini LLM to check this commit when backporting into an LTS based kernel, it raised this concern: If ac->ac_g_ex.fe_group is >= ngroups (for instance, if the goal group was populated via stream allocation from s_mb_last_groups), then start will be >= ngroups. Does this allow allocating blocks beyond the 32-bit limit for indirect block mapped files? The commit message mentions that ext4_mb_scan_groups_linear() takes care to not select unsupported groups. However, its loop uses group = *start, and the very first iteration will call ext4_mb_scan_group() with this unsupported group because next_linear_group() is only called at the end of the iteration. After reviewing the code paths involved and considering the LLM review, I determined that this can happen when there is a file system where some files/directories are extent-mapped and others are indirect-block mapped. To address this, add a safety clamp in ext4_mb_scan_groups(). Fixes: 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") Cc: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20260326045834.1175822-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: fix the might_sleep() warnings in kvfree()Zqiang2-13/+5
commit 496bb99b7e66f48b178126626f47e9ba79e2d0fa upstream. Use the kvfree() in the RCU read critical section can trigger the following warnings: EXT4-fs (vdb): unmounting filesystem cd983e5b-3c83-4f5a-a136-17b00eb9d018. WARNING: suspicious RCU usage ./include/linux/rcupdate.h:409 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 Call Trace: <TASK> dump_stack_lvl+0xbb/0xd0 dump_stack+0x14/0x20 lockdep_rcu_suspicious+0x15a/0x1b0 __might_resched+0x375/0x4d0 ? put_object.part.0+0x2c/0x50 __might_sleep+0x108/0x160 vfree+0x58/0x910 ? ext4_group_desc_free+0x27/0x270 kvfree+0x23/0x40 ext4_group_desc_free+0x111/0x270 ext4_put_super+0x3c8/0xd40 generic_shutdown_super+0x14c/0x4a0 ? __pfx_shrinker_free+0x10/0x10 kill_block_super+0x40/0x90 ext4_kill_sb+0x6d/0xb0 deactivate_locked_super+0xb4/0x180 deactivate_super+0x7e/0xa0 cleanup_mnt+0x296/0x3e0 __cleanup_mnt+0x16/0x20 task_work_run+0x157/0x250 ? __pfx_task_work_run+0x10/0x10 ? exit_to_user_mode_loop+0x6a/0x550 exit_to_user_mode_loop+0x102/0x550 do_syscall_64+0x44a/0x500 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> BUG: sleeping function called from invalid context at mm/vmalloc.c:3441 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 556, name: umount preempt_count: 1, expected: 0 CPU: 3 UID: 0 PID: 556 Comm: umount Call Trace: <TASK> dump_stack_lvl+0xbb/0xd0 dump_stack+0x14/0x20 __might_resched+0x275/0x4d0 ? put_object.part.0+0x2c/0x50 __might_sleep+0x108/0x160 vfree+0x58/0x910 ? ext4_group_desc_free+0x27/0x270 kvfree+0x23/0x40 ext4_group_desc_free+0x111/0x270 ext4_put_super+0x3c8/0xd40 generic_shutdown_super+0x14c/0x4a0 ? __pfx_shrinker_free+0x10/0x10 kill_block_super+0x40/0x90 ext4_kill_sb+0x6d/0xb0 deactivate_locked_super+0xb4/0x180 deactivate_super+0x7e/0xa0 cleanup_mnt+0x296/0x3e0 __cleanup_mnt+0x16/0x20 task_work_run+0x157/0x250 ? __pfx_task_work_run+0x10/0x10 ? exit_to_user_mode_loop+0x6a/0x550 exit_to_user_mode_loop+0x102/0x550 do_syscall_64+0x44a/0x500 entry_SYSCALL_64_after_hwframe+0x77/0x7f The above scenarios occur in initialization failures and teardown paths, there are no parallel operations on the resources released by kvfree(), this commit therefore remove rcu_read_lock/unlock() and use rcu_access_pointer() instead of rcu_dereference() operations. Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access") Fixes: df3da4ea5a0f ("ext4: fix potential race between s_group_info online resizing and access") Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Link: https://patch.msgid.link/20260319094545.19291-1-qiang.zhang@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: fix use-after-free in update_super_work when racing with umountJiayuan Chen3-1/+11
commit d15e4b0a418537aafa56b2cb80d44add83e83697 upstream. Commit b98535d09179 ("ext4: fix bug_on in start_this_handle during umount filesystem") moved ext4_unregister_sysfs() before flushing s_sb_upd_work to prevent new error work from being queued via /proc/fs/ext4/xx/mb_groups reads during unmount. However, this introduced a use-after-free because update_super_work calls ext4_notify_error_sysfs() -> sysfs_notify() which accesses the kobject's kernfs_node after it has been freed by kobject_del() in ext4_unregister_sysfs(): update_super_work ext4_put_super ----------------- -------------- ext4_unregister_sysfs(sb) kobject_del(&sbi->s_kobj) __kobject_del() sysfs_remove_dir() kobj->sd = NULL sysfs_put(sd) kernfs_put() // RCU free ext4_notify_error_sysfs(sbi) sysfs_notify(&sbi->s_kobj) kn = kobj->sd // stale pointer kernfs_get(kn) // UAF on freed kernfs_node ext4_journal_destroy() flush_work(&sbi->s_sb_upd_work) Instead of reordering the teardown sequence, fix this by making ext4_notify_error_sysfs() detect that sysfs has already been torn down by checking s_kobj.state_in_sysfs, and skipping the sysfs_notify() call in that case. A dedicated mutex (s_error_notify_mutex) serializes ext4_notify_error_sysfs() against kobject_del() in ext4_unregister_sysfs() to prevent TOCTOU races where the kobject could be deleted between the state_in_sysfs check and the sysfs_notify() call. Fixes: b98535d09179 ("ext4: fix bug_on in start_this_handle during umount filesystem") Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260319120336.157873-1-jiayuan.chen@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: reject mount if bigalloc with s_first_data_block != 0Helen Koike1-0/+7
commit 3822743dc20386d9897e999dbb990befa3a5b3f8 upstream. bigalloc with s_first_data_block != 0 is not supported, reject mounting it. Signed-off-by: Helen Koike <koike@igalia.com> Suggested-by: Theodore Ts'o <tytso@mit.edu> Reported-by: syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b73703b873a33d8eb8f6 Link: https://patch.msgid.link/20260317142325.135074-1-koike@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: avoid allocate block from corrupted group in ext4_mb_find_by_goal()Ye Bin1-1/+5
commit 46066e3a06647c5b186cc6334409722622d05c44 upstream. There's issue as follows: ... EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2243 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2239 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): error count since last fsck: 1 EXT4-fs (mmcblk0p1): initial error at time 1765597433: ext4_mb_generate_buddy:760 EXT4-fs (mmcblk0p1): last error at time 1765597433: ext4_mb_generate_buddy:760 ... According to the log analysis, blocks are always requested from the corrupted block group. This may happen as follows: ext4_mb_find_by_goal ext4_mb_load_buddy ext4_mb_load_buddy_gfp ext4_mb_init_cache ext4_read_block_bitmap_nowait ext4_wait_block_bitmap ext4_validate_block_bitmap if (!grp || EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) return -EFSCORRUPTED; // There's no logs. if (err) return err; // Will return error ext4_lock_group(ac->ac_sb, group); if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) // Unreachable goto out; After commit 9008a58e5dce ("ext4: make the bitmap read routines return real error codes") merged, Commit 163a203ddb36 ("ext4: mark block group as corrupt on block bitmap error") is no real solution for allocating blocks from corrupted block groups. This is because if 'EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)' is true, then 'ext4_mb_load_buddy()' may return an error. This means that the block allocation will fail. Therefore, check block group if corrupted when ext4_mb_load_buddy() returns error. Fixes: 163a203ddb36 ("ext4: mark block group as corrupt on block bitmap error") Fixes: 9008a58e5dce ("ext4: make the bitmap read routines return real error codes") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260302134619.3145520-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: avoid infinite loops caused by residual dataEdward Adam Davis1-2/+6
commit 5422fe71d26d42af6c454ca9527faaad4e677d6c upstream. On the mkdir/mknod path, when mapping logical blocks to physical blocks, if inserting a new extent into the extent tree fails (in this example, because the file system disabled the huge file feature when marking the inode as dirty), ext4_ext_map_blocks() only calls ext4_free_blocks() to reclaim the physical block without deleting the corresponding data in the extent tree. This causes subsequent mkdir operations to reference the previously reclaimed physical block number again, even though this physical block is already being used by the xattr block. Therefore, a situation arises where both the directory and xattr are using the same buffer head block in memory simultaneously. The above causes ext4_xattr_block_set() to enter an infinite loop about "inserted" and cannot release the inode lock, ultimately leading to the 143s blocking problem mentioned in [1]. If the metadata is corrupted, then trying to remove some extent space can do even more harm. Also in case EXT4_GET_BLOCKS_DELALLOC_RESERVE was passed, remove space wrongly update quota information. Jan Kara suggests distinguishing between two cases: 1) The error is ENOSPC or EDQUOT - in this case the filesystem is fully consistent and we must maintain its consistency including all the accounting. However these errors can happen only early before we've inserted the extent into the extent tree. So current code works correctly for this case. 2) Some other error - this means metadata is corrupted. We should strive to do as few modifications as possible to limit damage. So I'd just skip freeing of allocated blocks. [1] INFO: task syz.0.17:5995 blocked for more than 143 seconds. Call Trace: inode_lock_nested include/linux/fs.h:1073 [inline] __start_dirop fs/namei.c:2923 [inline] start_dirop fs/namei.c:2934 [inline] Reported-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1659aaaaa8d9d11265d7 Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Reported-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=512459401510e2a9a39f Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com Link: https://patch.msgid.link/tencent_43696283A68450B761D76866C6F360E36705@qq.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: validate p_idx bounds in ext4_ext_correct_indexesTejas Bharambe1-0/+15
commit 2acb5c12ebd860f30e4faf67e6cc8c44ddfe5fe8 upstream. ext4_ext_correct_indexes() walks up the extent tree correcting index entries when the first extent in a leaf is modified. Before accessing path[k].p_idx->ei_block, there is no validation that p_idx falls within the valid range of index entries for that level. If the on-disk extent header contains a corrupted or crafted eh_entries value, p_idx can point past the end of the allocated buffer, causing a slab-out-of-bounds read. Fix this by validating path[k].p_idx against EXT_LAST_INDEX() at both access sites: before the while loop and inside it. Return -EFSCORRUPTED if the index pointer is out of range, consistent with how other bounds violations are handled in the ext4 extent tree code. Reported-by: syzbot+04c4e65cab786a2e5b7e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=04c4e65cab786a2e5b7e Signed-off-by: Tejas Bharambe <tejas.bharambe@outlook.com> Link: https://patch.msgid.link/JH0PR06MB66326016F9B6AD24097D232B897CA@JH0PR06MB6632.apcprd06.prod.outlook.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: replace BUG_ON with proper error handling in ext4_read_inline_folioYuto Ohnuki1-1/+9
commit 356227096eb66e41b23caf7045e6304877322edf upstream. Replace BUG_ON() with proper error handling when inline data size exceeds PAGE_SIZE. This prevents kernel panic and allows the system to continue running while properly reporting the filesystem corruption. The error is logged via ext4_error_inode(), the buffer head is released to prevent memory leak, and -EFSCORRUPTED is returned to indicate filesystem corruption. Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Link: https://patch.msgid.link/20260223123345.14838-2-ytohnuki@amazon.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: make recently_deleted() properly work with lazy itable initializationJan Kara1-0/+6
commit bd060afa7cc3e0ad30afa9ecc544a78638498555 upstream. recently_deleted() checks whether inode has been used in the near past. However this can give false positive result when inode table is not initialized yet and we are in fact comparing to random garbage (or stale itable block of a filesystem before mkfs). Ultimately this results in uninitialized inodes being skipped during inode allocation and possibly they are never initialized and thus e2fsck complains. Verify if the inode has been initialized before checking for dtime. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260216164848.3074-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: fix fsync(2) for nojournal modeJan Kara1-2/+14
commit 1308255bbf8452762f89f44f7447ce137ecdbcff upstream. When inode metadata is changed, we sometimes just call ext4_mark_inode_dirty() to track modified metadata. This copies inode metadata into block buffer which is enough when we are journalling metadata. However when we are running in nojournal mode we currently fail to write the dirtied inode buffer during fsync(2) because the inode is not marked as dirty. Use explicit ext4_write_inode() call to make sure the inode table buffer is written to the disk. This is a band aid solution but proper solution requires a much larger rewrite including changes in metadata bh tracking infrastructure. Reported-by: Free Ekanayaka <free.ekanayaka@gmail.com> Link: https://lore.kernel.org/all/87il8nhxdm.fsf@x1.mail-host-address-is-not-set/ CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260216164848.3074-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: fix stale xarray tags after writebackJan Kara1-2/+8
commit f4a2b42e78914ff15630e71289adc589c3a8eb45 upstream. There are cases where ext4_bio_write_page() gets called for a page which has no buffers to submit. This happens e.g. when the part of the file is actually a hole, when we cannot allocate blocks due to being called from jbd2, or in data=journal mode when checkpointing writes the buffers earlier. In these cases we just return from ext4_bio_write_page() however if the page didn't need redirtying, we will leave stale DIRTY and/or TOWRITE tags in xarray because those get cleared only in __folio_start_writeback(). As a result we can leave these tags set in mappings even after a final sync on filesystem that's getting remounted read-only or that's being frozen. Various assertions can then get upset when writeback is started on such filesystems (Gerald reported assertion in ext4_journal_check_start() firing). Fix the problem by cycling the page through writeback state even if we decide nothing needs to be written for it so that xarray tags get properly updated. This is slightly silly (we could update the xarray tags directly) but I don't think a special helper messing with xarray tags is really worth it in this relatively rare corner case. Reported-by: Gerald Yang <gerald.yang@canonical.com> Link: https://lore.kernel.org/all/20260128074515.2028982-1-gerald.yang@canonical.com Fixes: dff4ac75eeee ("ext4: move keep_towrite handling to ext4_bio_write_page()") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260205092223.21287-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: convert inline data to extents when truncate exceeds inline sizeDeepanshu Kartikey1-0/+12
commit ed9356a30e59c7cc3198e7fc46cfedf3767b9b17 upstream. Add a check in ext4_setattr() to convert files from inline data storage to extent-based storage when truncate() grows the file size beyond the inline capacity. This prevents the filesystem from entering an inconsistent state where the inline data flag is set but the file size exceeds what can be stored inline. Without this fix, the following sequence causes a kernel BUG_ON(): 1. Mount filesystem with inode that has inline flag set and small size 2. truncate(file, 50MB) - grows size but inline flag remains set 3. sendfile() attempts to write data 4. ext4_write_inline_data() hits BUG_ON(write_size > inline_capacity) The crash occurs because ext4_write_inline_data() expects inline storage to accommodate the write, but the actual inline capacity (~60 bytes for i_block + ~96 bytes for xattrs) is far smaller than the file size and write request. The fix checks if the new size from setattr exceeds the inode's actual inline capacity (EXT4_I(inode)->i_inline_size) and converts the file to extent-based storage before proceeding with the size change. This addresses the root cause by ensuring the inline data flag and file size remain consistent during truncate operations. Reported-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=7de5fe447862fc37576f Tested-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> Link: https://patch.msgid.link/20260207043607.1175976-1-kartikey406@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02ext4: fix journal credit check when setting fscrypt contextSimon Weber1-1/+8
commit b1d682f1990c19fb1d5b97d13266210457092bcd upstream. Fix an issue arising when ext4 features has_journal, ea_inode, and encrypt are activated simultaneously, leading to ENOSPC when creating an encrypted file. Fix by passing XATTR_CREATE flag to xattr_set_handle function if a handle is specified, i.e., when the function is called in the control flow of creating a new inode. This aligns the number of jbd2 credits set_handle checks for with the number allocated for creating a new inode. ext4_set_context must not be called with a non-null handle (fs_data) if fscrypt context xattr is not guaranteed to not exist yet. The only other usage of this function currently is when handling the ioctl FS_IOC_SET_ENCRYPTION_POLICY, which calls it with fs_data=NULL. Fixes: c1a5d5f6ab21eb7e ("ext4: improve journal credit handling in set xattr paths") Co-developed-by: Anthony Durrer <anthonydev@fastmail.com> Signed-off-by: Anthony Durrer <anthonydev@fastmail.com> Signed-off-by: Simon Weber <simon.weber.39@gmail.com> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20260207100148.724275-4-simon.weber.39@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-13ext4: fix potential null deref in ext4_mb_init()Baokun Li1-0/+10
commit 3c3fac6bc0a9c00dbe65d8dc0d3a282afe4d3188 upstream. In ext4_mb_init(), ext4_mb_avg_fragment_size_destroy() may be called when sbi->s_mb_avg_fragment_size remains uninitialized (e.g., if groupinfo slab cache allocation fails). Since ext4_mb_avg_fragment_size_destroy() lacks null pointer checking, this leads to a null pointer dereference. ================================================================== EXT4-fs: no memory for groupinfo slab cache BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: Oops: 0002 [#1] SMP PTI CPU:2 UID: 0 PID: 87 Comm:mount Not tainted 6.17.0-rc2 #1134 PREEMPT(none) RIP: 0010:_raw_spin_lock_irqsave+0x1b/0x40 Call Trace: <TASK> xa_destroy+0x61/0x130 ext4_mb_init+0x483/0x540 __ext4_fill_super+0x116d/0x17b0 ext4_fill_super+0xd3/0x280 get_tree_bdev_flags+0x132/0x1d0 vfs_get_tree+0x29/0xd0 do_new_mount+0x197/0x300 __x64_sys_mount+0x116/0x150 do_syscall_64+0x50/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ================================================================== Therefore, add necessary null check to ext4_mb_avg_fragment_size_destroy() to prevent this issue. The same fix is also applied to ext4_mb_largest_free_orders_destroy(). Reported-by: syzbot+1713b1aa266195b916c2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1713b1aa266195b916c2 Cc: stable@kernel.org Fixes: f7eaacbb4e54 ("ext4: convert free groups order lists to xarrays") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-13ext4: always allocate blocks only from groups inode can useJan Kara1-9/+20
[ Upstream commit 4865c768b563deff1b6a6384e74a62f143427b42 ] For filesystems with more than 2^32 blocks inodes using indirect block based format cannot use blocks beyond the 32-bit limit. ext4_mb_scan_groups_linear() takes care to not select these unsupported groups for such inodes however other functions selecting groups for allocation don't. So far this is harmless because the other selection functions are used only with mb_optimize_scan and this is currently disabled for inodes with indirect blocks however in the following patch we want to enable mb_optimize_scan regardless of inode format. Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Pedro Falcato <pfalcato@suse.de> Cc: stable@kernel.org Link: https://patch.msgid.link/20260114182836.14120-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13ext4: implement linear-like traversal across order xarraysBaokun Li1-21/+47
[ Upstream commit a3ce570a5d6a70df616ae9a78635a188e6b5fd2f ] Although we now perform ordered traversal within an xarray, this is currently limited to a single xarray. However, we have multiple such xarrays, which prevents us from guaranteeing a linear-like traversal where all groups on the right are visited before all groups on the left. For example, suppose we have 128 block groups, with a target group of 64, a target length corresponding to an order of 1, and available free groups of 16 (order 1) and group 65 (order 8): For linear traversal, when no suitable free block is found in group 64, it will search in the next block group until group 127, then start searching from 0 up to block group 63. It ensures continuous forward traversal, which is consistent with the unidirectional rotation behavior of HDD platters. Additionally, the block group lock contention during freeing block is unavoidable. The goal increasing from 0 to 64 indicates that previously scanned groups (which had no suitable free space and are likely to free blocks later) and skipped groups (which are currently in use) have newly freed some used blocks. If we allocate blocks in these groups, the probability of competing with other processes increases. For non-linear traversal, we first traverse all groups in order_1. If only group 16 has free space in this list, we first traverse [63, 128), then traverse [0, 64) to find the available group 16, and then allocate blocks in group 16. Therefore, it cannot guarantee continuous traversal in one direction, thus increasing the probability of contention. So refactor ext4_mb_scan_groups_xarray() to ext4_mb_scan_groups_xa_range() to only traverse a fixed range of groups, and move the logic for handling wrap around to the caller. The caller first iterates through all xarrays in the range [start, ngroups) and then through the range [0, start). This approach simulates a linear scan, which reduces contention between freeing blocks and allocating blocks. Assume we have the following groups, where "|" denotes the xarray traversal start position: order_1_groups: AB | CD order_2_groups: EF | GH Traversal order: Before: C > D > A > B > G > H > E > F After: C > D > G > H > A > B > E > F Performance test data follows: |CPU: Kunpeng 920 | P80 | P1 | |Memory: 512GB |------------------------|-------------------------| |960GB SSD (0.5GB/s)| base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 19555 | 20049 (+2.5%) | 315636 | 316724 (-0.3%) | |mb_optimize_scan=1 | 15496 | 19342 (+24.8%) | 323569 | 328324 (+1.4%) | |CPU: AMD 9654 * 2 | P96 | P1 | |Memory: 1536GB |------------------------|-------------------------| |960GB SSD (1GB/s) | base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 53192 | 52125 (-2.0%) | 212678 | 215136 (+1.1%) | |mb_optimize_scan=1 | 37636 | 50331 (+33.7%) | 214189 | 209431 (-2.2%) | Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-18-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13ext4: refactor choose group to scan groupBaokun Li3-174/+131
[ Upstream commit 6347558764911f88acac06ab996e162f0c8a212d ] This commit converts the `choose group` logic to `scan group` using previously prepared helper functions. This allows us to leverage xarrays for ordered non-linear traversal, thereby mitigating the "bouncing" issue inherent in the `choose group` mechanism. This also decouples linear and non-linear traversals, leading to cleaner and more readable code. Key changes: * ext4_mb_choose_next_group() is refactored to ext4_mb_scan_groups(). * Replaced ext4_mb_good_group() with ext4_mb_scan_group() in non-linear traversals, and related functions now return error codes instead of group info. * Added ext4_mb_scan_groups_linear() for performing linear scans starting from a specific group for a set number of times. * Linear scans now execute up to sbi->s_mb_max_linear_groups times, so ac_groups_linear_remaining is removed as it's no longer used. * ac->ac_criteria is now used directly instead of passing cr around. Also, ac->ac_criteria is incremented directly after groups scan fails for the corresponding criteria. * Since we're now directly scanning groups instead of finding a good group then scanning, the following variables and flags are no longer needed, s_bal_cX_groups_considered is sufficient. s_bal_p2_aligned_bad_suggestions s_bal_goal_fast_bad_suggestions s_bal_best_avail_bad_suggestions EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-17-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13ext4: convert free groups order lists to xarraysBaokun Li3-126/+140
[ Upstream commit f7eaacbb4e54f8a6c6674c16eff54f703ea63d5e ] While traversing the list, holding a spin_lock prevents load_buddy, making direct use of ext4_try_lock_group impossible. This can lead to a bouncing scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group() fails, forcing the list traversal to repeatedly restart from grp_A. In contrast, linear traversal directly uses ext4_try_lock_group(), avoiding this bouncing. Therefore, we need a lockless, ordered traversal to achieve linear-like efficiency. Therefore, this commit converts both average fragment size lists and largest free order lists into ordered xarrays. In an xarray, the index represents the block group number and the value holds the block group information; a non-empty value indicates the block group's presence. While insertion and deletion complexity remain O(1), lookup complexity changes from O(1) to O(nlogn), which may slightly reduce single-threaded performance. Additionally, xarray insertions might fail, potentially due to memory allocation issues. However, since we have linear traversal as a fallback, this isn't a major problem. Therefore, we've only added a warning message for insertion failures here. A helper function ext4_mb_find_good_group_xarray() is added to find good groups in the specified xarray starting at the specified position start, and when it reaches ngroups-1, it wraps around to 0 and then to start-1. This ensures an ordered traversal within the xarray. Performance test results are as follows: Single-process operations on an empty disk show negligible impact, while multi-process workloads demonstrate a noticeable performance gain. |CPU: Kunpeng 920 | P80 | P1 | |Memory: 512GB |------------------------|-------------------------| |960GB SSD (0.5GB/s)| base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 20097 | 19555 (-2.6%) | 316141 | 315636 (-0.2%) | |mb_optimize_scan=1 | 13318 | 15496 (+16.3%) | 325273 | 323569 (-0.5%) | |CPU: AMD 9654 * 2 | P96 | P1 | |Memory: 1536GB |------------------------|-------------------------| |960GB SSD (1GB/s) | base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 53603 | 53192 (-0.7%) | 214243 | 212678 (-0.7%) | |mb_optimize_scan=1 | 20887 | 37636 (+80.1%) | 213632 | 214189 (+0.2%) | [ Applied spelling fixes per discussion on the ext4-list see thread referened in the Link tag. --tytso] Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-16-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13ext4: factor out ext4_mb_scan_group()Baokun Li2-44/+51
[ Upstream commit 9c08e42db9056d423dcef5e7998c73182180ff83 ] Extract ext4_mb_scan_group() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-15-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13ext4: factor out ext4_mb_might_prefetch()Baokun Li2-24/+42
[ Upstream commit 5abd85f667a19ef7d880ed00c201fc22de6fa707 ] Extract ext4_mb_might_prefetch() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-14-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13ext4: factor out __ext4_mb_scan_group()Baokun Li2-18/+29
[ Upstream commit 45704f92e55853fe287760e019feb45eeb9c988e ] Extract __ext4_mb_scan_group() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-13-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13ext4: add ext4_try_lock_group() to skip busy groupsBaokun Li2-12/+30
[ Upstream commit e9eec6f33971fbfcdd32fd1c7dd515ff4d2954c0 ] When ext4 allocates blocks, we used to just go through the block groups one by one to find a good one. But when there are tons of block groups (like hundreds of thousands or even millions) and not many have free space (meaning they're mostly full), it takes a really long time to check them all, and performance gets bad. So, we added the "mb_optimize_scan" mount option (which is on by default now). It keeps track of some group lists, so when we need a free block, we can just grab a likely group from the right list. This saves time and makes block allocation much faster. But when multiple processes or containers are doing similar things, like constantly allocating 8k blocks, they all try to use the same block group in the same list. Even just two processes doing this can cut the IOPS in half. For example, one container might do 300,000 IOPS, but if you run two at the same time, the total is only 150,000. Since we can already look at block groups in a non-linear way, the first and last groups in the same list are basically the same for finding a block right now. Therefore, add an ext4_try_lock_group() helper function to skip the current group when it is locked by another process, thereby avoiding contention with other processes. This helps ext4 make better use of having multiple block groups. Also, to make sure we don't skip all the groups that have free space when allocating blocks, we won't try to skip busy groups anymore when ac_criteria is CR_ANY_FREE. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. |CPU: Kunpeng 920 | P80 | |Memory: 512GB |-------------------------| |960GB SSD (0.5GB/s)| base | patched | |-------------------|-------|-----------------| |mb_optimize_scan=0 | 2667 | 4821 (+80.7%) | |mb_optimize_scan=1 | 2643 | 4784 (+81.0%) | |CPU: AMD 9654 * 2 | P96 | |Memory: 1536GB |-------------------------| |960GB SSD (1GB/s) | base | patched | |-------------------|-------|-----------------| |mb_optimize_scan=0 | 3450 | 15371 (+345%) | |mb_optimize_scan=1 | 3209 | 6101 (+90.0%) | Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before submitting I/OZhang Yi1-4/+8
[ Upstream commit feaf2a80e78f89ee8a3464126077ba8683b62791 ] When allocating blocks during within-EOF DIO and writeback with dioread_nolock enabled, EXT4_GET_BLOCKS_PRE_IO was set to split an existing large unwritten extent. However, EXT4_GET_BLOCKS_CONVERT was set when calling ext4_split_convert_extents(), which may potentially result in stale data issues. Assume we have an unwritten extent, and then DIO writes the second half. [UUUUUUUUUUUUUUUU] on-disk extent U: unwritten extent [UUUUUUUUUUUUUUUU] extent status tree |<- ->| ----> dio write this range First, ext4_iomap_alloc() call ext4_map_blocks() with EXT4_GET_BLOCKS_PRE_IO, EXT4_GET_BLOCKS_UNWRIT_EXT and EXT4_GET_BLOCKS_CREATE flags set. ext4_map_blocks() find this extent and call ext4_split_convert_extents() with EXT4_GET_BLOCKS_CONVERT and the above flags set. Then, ext4_split_convert_extents() calls ext4_split_extent() with EXT4_EXT_MAY_ZEROOUT, EXT4_EXT_MARK_UNWRIT2 and EXT4_EXT_DATA_VALID2 flags set, and it calls ext4_split_extent_at() to split the second half with EXT4_EXT_DATA_VALID2, EXT4_EXT_MARK_UNWRIT1, EXT4_EXT_MAY_ZEROOUT and EXT4_EXT_MARK_UNWRIT2 flags set. However, ext4_split_extent_at() failed to insert extent since a temporary lack -ENOSPC. It zeroes out the first half but convert the entire on-disk extent to written since the EXT4_EXT_DATA_VALID2 flag set, but left the second half as unwritten in the extent status tree. [0000000000SSSSSS] data S: stale data, 0: zeroed [WWWWWWWWWWWWWWWW] on-disk extent W: written extent [WWWWWWWWWWUUUUUU] extent status tree Finally, if the DIO failed to write data to the disk, the stale data in the second half will be exposed once the cached extent entry is gone. Fix this issue by not passing EXT4_GET_BLOCKS_CONVERT when splitting an unwritten extent before submitting I/O, and make ext4_split_convert_extents() to zero out the entire extent range to zero for this case, and also mark the extent in the extent status tree for consistency. Fixes: b8a8684502a0 ("ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-4-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13ext4: correct the comments place for EXT4_EXT_MAY_ZEROOUTYang Erkun1-4/+4
[ Upstream commit cc742fd1d184bb2a11bacf50587d2c85290622e4 ] Move the comments just before we set EXT4_EXT_MAY_ZEROOUT in ext4_split_convert_extents. Signed-off-by: Yang Erkun <yangerkun@huawei.com> Message-ID: <20251112084538.1658232-4-yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: feaf2a80e78f ("ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before submitting I/O") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ext4: propagate flags to convert_initialized_extent()Ojaswin Mujoo1-3/+4
[ Upstream commit 3fffa44b6ebf65be92a562a5063303979385a1c9 ] Currently, ext4_zero_range passes EXT4_EX_NOCACHE flag to avoid caching extents however this is not respected by convert_initialized_extent(). Hence, modify it to accept flags from the caller and to pass the flags on to other extent manipulation functions it calls. This makes sure the NOCACHE flag is respected throughout the code path. Also, we no longer explicitly pass CONVERT_UNWRITTEN as the caller takes care of this. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/07008fbb14db727fddcaf4c30e2346c49f6c8fe0.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ext4: use reserved metadata blocks when splitting extent on endioZhang Yi1-1/+3
[ Upstream commit 01942af95ab6c9d98e64ae01fdc243a03e4b973f ] When performing buffered writes, we may need to split and convert an unwritten extent into a written one during the end I/O process. However, we do not reserve space specifically for these metadata changes, we only reserve 2% of space or 4096 blocks. To address this, we use EXT4_GET_BLOCKS_PRE_IO to potentially split extents in advance and EXT4_GET_BLOCKS_METADATA_NOFAIL to utilize reserved space if necessary. These two approaches can reduce the likelihood of running out of space and losing data. However, these methods are merely best efforts, we could still run out of space, and there is not much difference between converting an extent during the writeback process and the end I/O process, it won't increase the risk of losing data if we postpone the conversion. Therefore, also use EXT4_GET_BLOCKS_METADATA_NOFAIL in ext4_convert_unwritten_extents_endio() to prepare for the buffered I/O iomap conversion, which may perform extent conversion during the end I/O process. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20260105014522.1937690-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ext4: mark group extend fast-commit ineligibleLi Chen1-0/+2
[ Upstream commit 1f8dd813a1c771b13c303f73d876164bc9b327cc ] Fast commits only log operations that have dedicated replay support. EXT4_IOC_GROUP_EXTEND grows the filesystem to the end of the last block group and updates the same on-disk metadata without going through the fast commit tracking paths. In practice these operations are rare and usually followed by further updates, but mixing them into a fast commit makes the overall semantics harder to reason about and risks replay gaps if new call sites appear. Teach ext4 to mark the filesystem fast-commit ineligible when EXT4_IOC_GROUP_EXTEND grows the filesystem. This forces those transactions to fall back to a full commit, ensuring that the group extension changes are captured by the normal journal rather than partially encoded in fast commit TLVs. This change should not affect common workloads but makes online resize via GROUP_EXTEND safer and easier to reason about under fast commit. Testing: 1. prepare: dd if=/dev/zero of=/root/fc_resize.img bs=1M count=0 seek=256 mkfs.ext4 -O fast_commit -F /root/fc_resize.img mkdir -p /mnt/fc_resize && mount -t ext4 -o loop /root/fc_resize.img /mnt/fc_resize 2. Extended the filesystem to the end of the last block group using a helper that calls EXT4_IOC_GROUP_EXTEND on the mounted filesystem and checked fc_info: ./group_extend_helper /mnt/fc_resize cat /proc/fs/ext4/loop0/fc_info shows the "Resize" ineligible reason increased. 3. Fsynced a file on the resized filesystem and confirmed that the fast commit ineligible counter incremented for the resize transaction: touch /mnt/fc_resize/file /root/fsync_file /mnt/fc_resize/file sync cat /proc/fs/ext4/loop0/fc_info Signed-off-by: Li Chen <me@linux.beauty> Link: https://patch.msgid.link/20251211115146.897420-6-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ext4: move ext4_percpu_param_init() before ext4_mb_init()Baokun Li1-5/+5
[ Upstream commit 270564513489d98b721a1e4a10017978d5213bff ] When running `kvm-xfstests -c ext4/1k -C 1 generic/383` with the `DOUBLE_CHECK` macro defined, the following panic is triggered: ================================================================== EXT4-fs error (device vdc): ext4_validate_block_bitmap:423: comm mount: bg 0: bad block bitmap checksum BUG: unable to handle page fault for address: ff110000fa2cc000 PGD 3e01067 P4D 3e02067 PUD 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 0 UID: 0 PID: 2386 Comm: mount Tainted: G W 6.18.0-gba65a4e7120a-dirty #1152 PREEMPT(none) RIP: 0010:percpu_counter_add_batch+0x13/0xa0 Call Trace: <TASK> ext4_mark_group_bitmap_corrupted+0xcb/0xe0 ext4_validate_block_bitmap+0x2a1/0x2f0 ext4_read_block_bitmap+0x33/0x50 mb_group_bb_bitmap_alloc+0x33/0x80 ext4_mb_add_groupinfo+0x190/0x250 ext4_mb_init_backend+0x87/0x290 ext4_mb_init+0x456/0x640 __ext4_fill_super+0x1072/0x1680 ext4_fill_super+0xd3/0x280 get_tree_bdev_flags+0x132/0x1d0 vfs_get_tree+0x29/0xd0 vfs_cmd_create+0x59/0xe0 __do_sys_fsconfig+0x4f6/0x6b0 do_syscall_64+0x50/0x1f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ================================================================== This issue can be reproduced using the following commands: mkfs.ext4 -F -q -b 1024 /dev/sda 5G tune2fs -O quota,project /dev/sda mount /dev/sda /tmp/test With DOUBLE_CHECK defined, mb_group_bb_bitmap_alloc() reads and validates the block bitmap. When the validation fails, ext4_mark_group_bitmap_corrupted() attempts to update sbi->s_freeclusters_counter. However, this percpu_counter has not been initialized yet at this point, which leads to the panic described above. Fix this by moving the execution of ext4_percpu_param_init() to occur before ext4_mb_init(), ensuring the per-CPU counters are initialized before they are used. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20251209133116.731350-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ext4: mark group add fast-commit ineligibleLi Chen1-0/+1
[ Upstream commit 89b4336fd5ec78f51f9d3a1d100f3ffa3228e604 ] Fast commits only log operations that have dedicated replay support. Online resize via EXT4_IOC_GROUP_ADD updates the superblock and group descriptor metadata without going through the fast commit tracking paths. In practice these operations are rare and usually followed by further updates, but mixing them into a fast commit makes the overall semantics harder to reason about and risks replay gaps if new call sites appear. Teach ext4 to mark the filesystem fast-commit ineligible when ext4_ioctl_group_add() adds new block groups. This forces those transactions to fall back to a full commit, ensuring that the filesystem geometry updates are captured by the normal journal rather than partially encoded in fast commit TLVs. This change should not affect common workloads but makes online resize via GROUP_ADD safer and easier to reason about under fast commit. Testing: 1. prepare: dd if=/dev/zero of=/root/fc_resize.img bs=1M count=0 seek=256 mkfs.ext4 -O fast_commit -F /root/fc_resize.img mkdir -p /mnt/fc_resize && mount -t ext4 -o loop /root/fc_resize.img /mnt/fc_resize 2. Ran a helper that issues EXT4_IOC_GROUP_ADD on the mounted filesystem and checked the resize ineligible reason: ./group_add_helper /mnt/fc_resize cat /proc/fs/ext4/loop0/fc_info shows "Resize": > 0. 3. Fsynced a file on the resized filesystem and verified that the fast commit stats report at least one ineligible commit: touch /mnt/fc_resize/file /root/fsync_file /mnt/fc_resize/file sync cat /proc/fs/ext4/loop0/fc_info shows fc stats ineligible > 0. Signed-off-by: Li Chen <me@linux.beauty> Link: https://patch.msgid.link/20251211115146.897420-5-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ext4: use optimized mballoc scanning regardless of inode formatJan Kara1-2/+0
commit 3574c322b1d0eb32dbd76b469cb08f9a67641599 upstream. Currently we don't used mballoc optimized scanning (using max free extent order and avg free extent order group lists) for inodes with indirect block based format. This is confusing for users and I don't see a good reason for that. Even with indirect block based inode format we can spend big amount of time searching for free blocks for large filesystems with fragmented free space. To add to the confusion before commit 077d0c2c78df ("ext4: make mb_optimize_scan performance mount option work with extents") optimized scanning was applied *only* to indirect block based inodes so that commit appears as a performance regression to some users. Just use optimized scanning whenever it is enabled by mount options. Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Link: https://patch.msgid.link/20260114182836.14120-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04ext4: fix dirtyclusters double decrement on fs shutdownBrian Foster2-17/+6
commit 94a8cea54cd935c54fa2fba70354757c0fc245e3 upstream. fstests test generic/388 occasionally reproduces a warning in ext4_put_super() associated with the dirty clusters count: WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4] Tracing the failure shows that the warning fires due to an s_dirtyclusters_counter value of -1. IOW, this appears to be a spurious decrement as opposed to some sort of leak. Further tracing of the dirty cluster count deltas and an LLM scan of the resulting output identified the cause as a double decrement in the error path between ext4_mb_mark_diskspace_used() and the caller ext4_mb_new_blocks(). First, note that generic/388 is a shutdown vs. fsstress test and so produces a random set of operations and shutdown injections. In the problematic case, the shutdown triggers an error return from the ext4_handle_dirty_metadata() call(s) made from ext4_mb_mark_context(). The changed value is non-zero at this point, so ext4_mb_mark_diskspace_used() does not exit after the error bubbles up from ext4_mb_mark_context(). Instead, the former decrements both cluster counters and returns the error up to ext4_mb_new_blocks(). The latter falls into the !ar->len out path which decrements the dirty clusters counter a second time, creating the inconsistency. To avoid this problem and simplify ownership of the cluster reservation in this codepath, lift the counter reduction to a single place in the caller. This makes it more clear that ext4_mb_new_blocks() is responsible for acquiring cluster reservation (via ext4_claim_free_clusters()) in the !delalloc case as well as releasing it, regardless of whether it ends up consumed or returned due to failure. Fixes: 0087d9fb3f29 ("ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20260113171905.118284-1-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04ext4: fix e4b bitmap inconsistency reportsYongjian Sun1-10/+11
commit bdc56a9c46b2a99c12313122b9352b619a2e719e upstream. A bitmap inconsistency issue was observed during stress tests under mixed huge-page workloads. Ext4 reported multiple e4b bitmap check failures like: ext4_mb_complex_scan_group:2508: group 350, 8179 free clusters as per group info. But got 8192 blocks Analysis and experimentation confirmed that the issue is caused by a race condition between page migration and bitmap modification. Although this timing window is extremely narrow, it is still hit in practice: folio_lock ext4_mb_load_buddy __migrate_folio check ref count folio_mc_copy __filemap_get_folio folio_try_get(folio) ...... mb_mark_used ext4_mb_unload_buddy __folio_migrate_mapping folio_ref_freeze folio_unlock The root cause of this issue is that the fast path of load_buddy only increments the folio's reference count, which is insufficient to prevent concurrent folio migration. We observed that the folio migration process acquires the folio lock. Therefore, we can determine whether to take the fast path in load_buddy by checking the lock status. If the folio is locked, we opt for the slow path (which acquires the lock) to close this concurrency window. Additionally, this change addresses the following issues: When the DOUBLE_CHECK macro is enabled to inspect bitmap-related issues, the following error may be triggered: corruption in group 324 at byte 784(6272): f in copy != ff on disk/prealloc Analysis reveals that this is a false positive. There is a specific race window where the bitmap and the group descriptor become momentarily inconsistent, leading to this error report: ext4_mb_load_buddy ext4_mb_load_buddy __filemap_get_folio(create|lock) folio_lock ext4_mb_init_cache folio_mark_uptodate __filemap_get_folio(no lock) ...... mb_mark_used mb_mark_used_double mb_cmp_bitmaps mb_set_bits(e4b->bd_bitmap) folio_unlock The original logic assumed that since mb_cmp_bitmaps is called when the bitmap is newly loaded from disk, the folio lock would be sufficient to prevent concurrent access. However, this overlooks a specific race condition: if another process attempts to load buddy and finds the folio is already in an uptodate state, it will immediately begin using it without holding folio lock. Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260106090820.836242-1-sunyongjian@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04ext4: fix memory leak in ext4_ext_shift_extents()Zilin Guan1-1/+2
commit ca81109d4a8f192dc1cbad4a1ee25246363c2833 upstream. In ext4_ext_shift_extents(), if the extent is NULL in the while loop, the function returns immediately without releasing the path obtained via ext4_find_extent(), leading to a memory leak. Fix this by jumping to the out label to ensure the path is properly released. Fixes: a18ed359bdddc ("ext4: always check ext4_ext_find_extent result") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20251225084800.905701-1-zilin@seu.edu.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04ext4: drop extent cache when splitting extent failsZhang Yi1-2/+6
commit 79b592e8f1b435796cbc2722190368e3e8ffd7a1 upstream. When the split extent fails, we might leave some extents still being processed and return an error directly, which will result in stale extent entries remaining in the extent status tree. So drop all of the remaining potentially stale extents if the splitting fails. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251129103247.686136-8-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04ext4: drop extent cache after doing PARTIAL_VALID1 zerooutZhang Yi1-1/+9
commit 6d882ea3b0931b43530d44149b79fcd4ffc13030 upstream. When splitting an unwritten extent in the middle and converting it to initialized in ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT and EXT4_EXT_DATA_VALID2 flags set, it could leave a stale unwritten extent. Assume we have an unwritten file and buffered write in the middle of it without dioread_nolock enabled, it will allocate blocks as written extent. 0 A B N [UUUUUUUUUUUU] on-disk extent U: unwritten extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDD--] D: valid data |<- ->| ----> this range needs to be initialized ext4_split_extent() first try to split this extent at B with EXT4_EXT_DATA_PARTIAL_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but ext4_split_extent_at() failed to split this extent due to temporary lack of space. It zeroout B to N and leave the entire extent as unwritten. 0 A B N [UUUUUUUUUUUU] on-disk extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDDZZ] Z: zeroed data ext4_split_extent() then try to split this extent at A with EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and leave an written extent from A to N. 0 A B N [UUWWWWWWWWWW] on-disk extent W: written extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDDZZ] Finally ext4_map_create_blocks() only insert extent A to B to the extent status tree, and leave an stale unwritten extent in the status tree. 0 A B N [UUWWWWWWWWWW] on-disk extent W: written extent [UUWWWWWWWWUU] extent status tree [--DDDDDDDDZZ] Fix this issue by always cached extent status entry after zeroing out the second part. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251129103247.686136-7-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04ext4: don't cache extent during splitting extentZhang Yi1-0/+6
commit 8b4b19a2f96348d70bfa306ef7d4a13b0bcbea79 upstream. Caching extents during the splitting process is risky, as it may result in stale extents remaining in the status tree. Moreover, in most cases, the corresponding extent block entries are likely already cached before the split happens, making caching here not particularly useful. Assume we have an unwritten extent, and then DIO writes the first half. [UUUUUUUUUUUUUUUU] on-disk extent U: unwritten extent [UUUUUUUUUUUUUUUU] extent status tree |<- ->| ----> dio write this range First, when ext4_split_extent_at() splits this extent, it truncates the existing extent and then inserts a new one. During this process, this extent status entry may be shrunk, and calls to ext4_find_extent() and ext4_cache_extents() may occur, which could potentially insert the truncated range as a hole into the extent status tree. After the split is completed, this hole is not replaced with the correct status. [UUUUUUU|UUUUUUUU] on-disk extent U: unwritten extent [UUUUUUU|HHHHHHHH] extent status tree H: hole Then, the outer calling functions will not correct this remaining hole extent either. Finally, if we perform a delayed buffer write on this latter part, it will re-insert the delayed extent and cause an error in space accounting. In adition, if the unwritten extent cache is not shrunk during the splitting, ext4_cache_extents() also conflicts with existing extents when caching extents. In the future, we will add checks when caching extents, which will trigger a warning. Therefore, Do not cache extents that are being split. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-6-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1Zhang Yi1-1/+12
commit 1bf6974822d1dba86cf11b5f05498581cf3488a2 upstream. When allocating initialized blocks from a large unwritten extent, or when splitting an unwritten extent during end I/O and converting it to initialized, there is currently a potential issue of stale data if the extent needs to be split in the middle. 0 A B N [UUUUUUUUUUUU] U: unwritten extent [--DDDDDDDD--] D: valid data |<- ->| ----> this range needs to be initialized ext4_split_extent() first try to split this extent at B with EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but ext4_split_extent_at() failed to split this extent due to temporary lack of space. It zeroout B to N and mark the entire extent from 0 to N as written. 0 A B N [WWWWWWWWWWWW] W: written extent [SSDDDDDDDDZZ] Z: zeroed, S: stale data ext4_split_extent() then try to split this extent at A with EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left a stale written extent from 0 to A. 0 A B N [WW|WWWWWWWWWW] [SS|DDDDDDDDZZ] Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at() when splitting at B, don't convert the entire extent to written and left it as unwritten after zeroing out B to N. The remaining work is just like the standard two-part split. ext4_split_extent() will pass the EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the second time, allowing it to properly handle the split. If the split is successful, it will keep extent from 0 to A as unwritten. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-3-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04ext4: subdivide EXT4_EXT_DATA_VALID1Zhang Yi1-6/+12
commit 22784ca541c0f01c5ebad14e8228298dc0a390ed upstream. When splitting an extent, if the EXT4_GET_BLOCKS_CONVERT flag is set and it is necessary to split the target extent in the middle, ext4_split_extent() first handles splitting the latter half of the extent and passes the EXT4_EXT_DATA_VALID1 flag. This flag implies that all blocks before the split point contain valid data; however, this assumption is incorrect. Therefore, subdivid EXT4_EXT_DATA_VALID1 into EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_DATA_PARTIAL_VALID1, which indicate that the first half of the extent is either entirely valid or only partially valid, respectively. These two flags cannot be set simultaneously. This patch does not use EXT4_EXT_DATA_PARTIAL_VALID1, it only replaces EXT4_EXT_DATA_VALID1 with EXT4_EXT_DATA_ENTIRE_VALID1 at the location where it is set, no logical changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-2-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23ext4: fix iloc.bh leak in ext4_xattr_inode_update_refYang Erkun1-0/+1
commit d250bdf531d9cd4096fedbb9f172bb2ca660c868 upstream. The error branch for ext4_xattr_inode_update_ref forget to release the refcount for iloc.bh. Find this when review code. Fixes: 57295e835408 ("ext4: guard against EA inode refcount underflow in xattr update") Signed-off-by: Yang Erkun <yangerkun@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20251213055706.3417529-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08ext4: align max orphan file size with e2fsprogs limitBaokun Li1-1/+3
commit 7c11c56eb32eae96893eebafdbe3decadefe88ad upstream. Kernel commit 0a6ce20c1564 ("ext4: verify orphan file size is not too big") limits the maximum supported orphan file size to 8 << 20. However, in e2fsprogs, the orphan file size is set to 32–512 filesystem blocks when creating a filesystem. With 64k block size, formatting an ext4 fs >32G gives an orphan file bigger than the kernel allows, so mount prints an error and fails: EXT4-fs (vdb): orphan file too big: 8650752 EXT4-fs (vdb): mount failed To prevent this issue and allow previously created 64KB filesystems to mount, we updates the maximum allowed orphan file size in the kernel to 512 filesystem blocks. Fixes: 0a6ce20c1564 ("ext4: verify orphan file size is not too big") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251120134233.2994147-1-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08ext4: fix incorrect group number assertion in mb_check_buddyYongjian Sun1-0/+2
commit 3f7a79d05c692c7cfec70bf104b1b3c3d0ce6247 upstream. When the MB_CHECK_ASSERT macro is enabled, an assertion failure can occur in __mb_check_buddy when checking preallocated blocks (pa) in a block group: Assertion failure in mb_free_blocks() : "groupnr == e4b->bd_group" This happens when a pa at the very end of a block group (e.g., pa_pstart=32765, pa_len=3 in a group of 32768 blocks) becomes exhausted - its pa_pstart is advanced by pa_len to 32768, which lies in the next block group. If this exhausted pa (with pa_len == 0) is still in the bb_prealloc_list during the buddy check, the assertion incorrectly flags it as belonging to the wrong group. A possible sequence is as follows: ext4_mb_new_blocks ext4_mb_release_context pa->pa_pstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len) pa->pa_len -= ac->ac_b_ex.fe_len __mb_check_buddy for each pa in group ext4_get_group_no_and_offset MB_CHECK_ASSERT(groupnr == e4b->bd_group) To fix this, we modify the check to skip block group validation for exhausted preallocations (where pa_len == 0). Such entries are in a transitional state and will be removed from the list soon, so they should not trigger an assertion. This change prevents the false positive while maintaining the integrity of the checks for active allocations. Fixes: c9de560ded61f ("ext4: Add multi block allocator for ext4") Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251106060614.631382-2-sunyongjian@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08ext4: clear i_state_flags when alloc inodeHaibo Chen3-2/+1
commit 4091c8206cfd2e3bb529ef260887296b90d9b6a2 upstream. i_state_flags used on 32-bit archs, need to clear this flag when alloc inode. Find this issue when umount ext4, sometimes track the inode as orphan accidently, cause ext4 mesg dump. Fixes: acf943e9768e ("ext4: fix checks for orphan inodes") Signed-off-by: Haibo Chen <haibo.chen@nxp.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251104-ext4-v1-1-73691a0800f9@nxp.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08ext4: xattr: fix null pointer deref in ext4_raw_inode()Karina Yankevich1-1/+5
commit b97cb7d6a051aa6ebd57906df0e26e9e36c26d14 upstream. If ext4_get_inode_loc() fails (e.g. if it returns -EFSCORRUPTED), iloc.bh will remain set to NULL. Since ext4_xattr_inode_dec_ref_all() lacks error checking, this will lead to a null pointer dereference in ext4_raw_inode(), called right after ext4_get_inode_loc(). Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: c8e008b60492 ("ext4: ignore xattrs past end") Cc: stable@kernel.org Signed-off-by: Karina Yankevich <k.yankevich@omp.ru> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251022093253.3546296-1-k.yankevich@omp.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08ext4: fix string copying in parse_apply_sb_mount_options()Fedor Pchelkin1-2/+3
commit ee5a977b4e771cc181f39d504426dbd31ed701cc upstream. strscpy_pad() can't be used to copy a non-NUL-term string into a NUL-term string of possibly bigger size. Commit 0efc5990bca5 ("string.h: Introduce memtostr() and memtostr_pad()") provides additional information in that regard. So if this happens, the following warning is observed: strnlen: detected buffer overflow: 65 byte read of buffer size 64 WARNING: CPU: 0 PID: 28655 at lib/string_helpers.c:1032 __fortify_report+0x96/0xc0 lib/string_helpers.c:1032 Modules linked in: CPU: 0 UID: 0 PID: 28655 Comm: syz-executor.3 Not tainted 6.12.54-syzkaller-00144-g5f0270f1ba00 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:__fortify_report+0x96/0xc0 lib/string_helpers.c:1032 Call Trace: <TASK> __fortify_panic+0x1f/0x30 lib/string_helpers.c:1039 strnlen include/linux/fortify-string.h:235 [inline] sized_strscpy include/linux/fortify-string.h:309 [inline] parse_apply_sb_mount_options fs/ext4/super.c:2504 [inline] __ext4_fill_super fs/ext4/super.c:5261 [inline] ext4_fill_super+0x3c35/0xad00 fs/ext4/super.c:5706 get_tree_bdev_flags+0x387/0x620 fs/super.c:1636 vfs_get_tree+0x93/0x380 fs/super.c:1814 do_new_mount fs/namespace.c:3553 [inline] path_mount+0x6ae/0x1f70 fs/namespace.c:3880 do_mount fs/namespace.c:3893 [inline] __do_sys_mount fs/namespace.c:4103 [inline] __se_sys_mount fs/namespace.c:4080 [inline] __x64_sys_mount+0x280/0x300 fs/namespace.c:4080 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x64/0x140 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e Since userspace is expected to provide s_mount_opts field to be at most 63 characters long with the ending byte being NUL-term, use a 64-byte buffer which matches the size of s_mount_opts, so that strscpy_pad() does its job properly. Return with error if the user still managed to provide a non-NUL-term string here. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 8ecb790ea8c3 ("ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()") Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251101160430.222297-1-pchelkin@ispras.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-18ext4: improve integrity checking in __mb_check_buddy by enhancing order-0 ↵Yongjian Sun1-17/+32
validation [ Upstream commit d9ee3ff810f1cc0e253c9f2b17b668b973cb0e06 ] When the MB_CHECK_ASSERT macro is enabled, we found that the current validation logic in __mb_check_buddy has a gap in detecting certain invalid buddy states, particularly related to order-0 (bitmap) bits. The original logic consists of three steps: 1. Validates higher-order buddies: if a higher-order bit is set, at most one of the two corresponding lower-order bits may be free; if a higher-order bit is clear, both lower-order bits must be allocated (and their bitmap bits must be 0). 2. For any set bit in order-0, ensures all corresponding higher-order bits are not free. 3. Verifies that all preallocated blocks (pa) in the group have pa_pstart within bounds and their bitmap bits marked as allocated. However, this approach fails to properly validate cases where order-0 bits are incorrectly cleared (0), allowing some invalid configurations to pass: corrupt integral order 3 1 1 order 2 1 1 1 1 order 1 1 1 1 1 1 1 1 1 order 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Here we get two adjacent free blocks at order-0 with inconsistent higher-order state, and the right one shows the correct scenario. The root cause is insufficient validation of order-0 zero bits. To fix this and improve completeness without significant performance cost, we refine the logic: 1. Maintain the top-down higher-order validation, but we no longer check the cases where the higher-order bit is 0, as this case will be covered in step 2. 2. Enhance order-0 checking by examining pairs of bits: - If either bit in a pair is set (1), all corresponding higher-order bits must not be free. - If both bits are clear (0), then exactly one of the corresponding higher-order bits must be free 3. Keep the preallocation (pa) validation unchanged. This change closes the validation gap, ensuring illegal buddy states involving order-0 are correctly detected, while removing redundant checks and maintaining efficiency. Fixes: c9de560ded61f ("ext4: Add multi block allocator for ext4") Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251106060614.631382-3-sunyongjian@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18ext4: correct the checking of quota files before moving extentsZhang Yi1-1/+1
[ Upstream commit a2e5a3cea4b18f6e2575acc444a5e8cce1fc8260 ] The move extent operation should return -EOPNOTSUPP if any of the inodes is a quota inode, rather than requiring both to be quota inodes. Fixes: 02749a4c2082 ("ext4: add ext4_is_quota_file()") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-2-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-12ext4: add i_data_sem protection in ext4_destroy_inline_data_nolock()Alexey Nepomnyashih1-1/+6
commit 0cd8feea8777f8d9b9a862b89c688b049a5c8475 upstream. Fix a race between inline data destruction and block mapping. The function ext4_destroy_inline_data_nolock() changes the inode data layout by clearing EXT4_INODE_INLINE_DATA and setting EXT4_INODE_EXTENTS. At the same time, another thread may execute ext4_map_blocks(), which tests EXT4_INODE_EXTENTS to decide whether to call ext4_ext_map_blocks() or ext4_ind_map_blocks(). Without i_data_sem protection, ext4_ind_map_blocks() may receive inode with EXT4_INODE_EXTENTS flag and triggering assert. kernel BUG at fs/ext4/indirect.c:546! EXT4-fs (loop2): unmounting filesystem. invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:ext4_ind_map_blocks.cold+0x2b/0x5a fs/ext4/indirect.c:546 Call Trace: <TASK> ext4_map_blocks+0xb9b/0x16f0 fs/ext4/inode.c:681 _ext4_get_block+0x242/0x590 fs/ext4/inode.c:822 ext4_block_write_begin+0x48b/0x12c0 fs/ext4/inode.c:1124 ext4_write_begin+0x598/0xef0 fs/ext4/inode.c:1255 ext4_da_write_begin+0x21e/0x9c0 fs/ext4/inode.c:3000 generic_perform_write+0x259/0x5d0 mm/filemap.c:3846 ext4_buffered_write_iter+0x15b/0x470 fs/ext4/file.c:285 ext4_file_write_iter+0x8e0/0x17f0 fs/ext4/file.c:679 call_write_iter include/linux/fs.h:2271 [inline] do_iter_readv_writev+0x212/0x3c0 fs/read_write.c:735 do_iter_write+0x186/0x710 fs/read_write.c:861 vfs_iter_write+0x70/0xa0 fs/read_write.c:902 iter_file_splice_write+0x73b/0xc90 fs/splice.c:685 do_splice_from fs/splice.c:763 [inline] direct_splice_actor+0x10f/0x170 fs/splice.c:950 splice_direct_to_actor+0x33a/0xa10 fs/splice.c:896 do_splice_direct+0x1a9/0x280 fs/splice.c:1002 do_sendfile+0xb13/0x12c0 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1cf/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Fixes: c755e251357a ("ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()") Cc: stable@vger.kernel.org # v4.11+ Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru> Message-ID: <20251104093326.697381-1-sdl@nppct.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-12ext4: refresh inline data size before write operationsDeepanshu Kartikey1-1/+6
commit 892e1cf17555735e9d021ab036c36bc7b58b0e3b upstream. The cached ei->i_inline_size can become stale between the initial size check and when ext4_update_inline_data()/ext4_create_inline_data() use it. Although ext4_get_max_inline_size() reads the correct value at the time of the check, concurrent xattr operations can modify i_inline_size before ext4_write_lock_xattr() is acquired. This causes ext4_update_inline_data() and ext4_create_inline_data() to work with stale capacity values, leading to a BUG_ON() crash in ext4_write_inline_data(): kernel BUG at fs/ext4/inline.c:1331! BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); The race window: 1. ext4_get_max_inline_size() reads i_inline_size = 60 (correct) 2. Size check passes for 50-byte write 3. [Another thread adds xattr, i_inline_size changes to 40] 4. ext4_write_lock_xattr() acquires lock 5. ext4_update_inline_data() uses stale i_inline_size = 60 6. Attempts to write 50 bytes but only 40 bytes actually available 7. BUG_ON() triggers Fix this by recalculating i_inline_size via ext4_find_inline_data_nolock() immediately after acquiring xattr_sem. This ensures ext4_update_inline_data() and ext4_create_inline_data() work with current values that are protected from concurrent modifications. This is similar to commit a54c4613dac1 ("ext4: fix race writing to an inline_data file while its xattrs are changing") which fixed i_inline_off staleness. This patch addresses the related i_inline_size staleness issue. Reported-by: syzbot+f3185be57d7e8dda32b8@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=f3185be57d7e8dda32b8 Cc: stable@kernel.org Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Message-ID: <20251020060936.474314-1-kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>