kernel/linux.git/fs/ext4/fast_commit.c, branch v7.0-rc7

ext4: fix iloc.bh leak in ext4_fc_replay_inode() error paths

2026-03-28T03:38:01+00:00

During code review, Joseph found that ext4_fc_replay_inode() calls ext4_get_fc_inode_loc() to get the inode location, which holds a reference to iloc.bh that must be released via brelse(). However, several error paths jump to the 'out' label without releasing iloc.bh: - ext4_handle_dirty_metadata() failure - sync_dirty_buffer() failure - ext4_mark_inode_used() failure - ext4_iget() failure Fix this by introducing an 'out_brelse' label placed just before the existing 'out' label to ensure iloc.bh is always released. Additionally, make ext4_fc_replay_inode() propagate errors properly instead of always returning 0. Reported-by: Joseph Qi Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Baokun Li Reviewed-by: Zhang Yi Reviewed-by: Jan Kara Link: https://patch.msgid.link/20260323060836.3452660-1-libaokun@linux.alibaba.com Signed-off-by: Theodore Ts'o Cc: stable@kernel.org

ext4: publish jinode after initialization

2026-03-28T03:32:07+00:00

ext4_inode_attach_jinode() publishes ei->jinode to concurrent users. It used to set ei->jinode before jbd2_journal_init_jbd_inode(), allowing a reader to observe a non-NULL jinode with i_vfs_inode still unset. The fast commit flush path can then pass this jinode to jbd2_wait_inode_data(), which dereferences i_vfs_inode->i_mapping and may crash. Below is the crash I observe: ``` BUG: unable to handle page fault for address: 000000010beb47f4 PGD 110e51067 P4D 110e51067 PUD 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 1 UID: 0 PID: 4850 Comm: fc_fsync_bench_ Not tainted 6.18.0-00764-g795a690c06a5 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014 RIP: 0010:xas_find_marked+0x3d/0x2e0 Code: e0 03 48 83 f8 02 0f 84 f0 01 00 00 48 8b 47 08 48 89 c3 48 39 c6 0f 82 fd 01 00 00 48 85 c9 74 3d 48 83 f9 03 77 63 4c 8b 0f <49> 8b 71 08 48 c7 47 18 00 00 00 00 48 89 f1 83 e1 03 48 83 f9 02 RSP: 0018:ffffbbee806e7bf0 EFLAGS: 00010246 RAX: 000000000010beb4 RBX: 000000000010beb4 RCX: 0000000000000003 RDX: 0000000000000001 RSI: 0000002000300000 RDI: ffffbbee806e7c10 RBP: 0000000000000001 R08: 0000002000300000 R09: 000000010beb47ec R10: ffff9ea494590090 R11: 0000000000000000 R12: 0000002000300000 R13: ffffbbee806e7c90 R14: ffff9ea494513788 R15: ffffbbee806e7c88 FS: 00007fc2f9e3e6c0(0000) GS:ffff9ea6b1444000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000010beb47f4 CR3: 0000000119ac5000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: filemap_get_folios_tag+0x87/0x2a0 __filemap_fdatawait_range+0x5f/0xd0 ? srso_alias_return_thunk+0x5/0xfbef5 ? __schedule+0x3e7/0x10c0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ? srso_alias_return_thunk+0x5/0xfbef5 ? cap_safe_nice+0x37/0x70 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ? srso_alias_return_thunk+0x5/0xfbef5 filemap_fdatawait_range_keep_errors+0x12/0x40 ext4_fc_commit+0x697/0x8b0 ? ext4_file_write_iter+0x64b/0x950 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ? srso_alias_return_thunk+0x5/0xfbef5 ? vfs_write+0x356/0x480 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ext4_sync_file+0xf7/0x370 do_fsync+0x3b/0x80 ? syscall_trace_enter+0x108/0x1d0 __x64_sys_fdatasync+0x16/0x20 do_syscall_64+0x62/0x2c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ... ``` Fix this by initializing the jbd2_inode first. Use smp_wmb() and WRITE_ONCE() to publish ei->jinode after initialization. Readers use READ_ONCE() to fetch the pointer. Fixes: a361293f5fede ("jbd2: Fix oops in jbd2_journal_file_inode()") Cc: stable@vger.kernel.org Signed-off-by: Li Chen Reviewed-by: Jan Kara Link: https://patch.msgid.link/20260225082617.147957-1-me@linux.beauty Signed-off-by: Theodore Ts'o Cc: stable@kernel.org

ext4: fast commit: make s_fc_lock reclaim-safe

2026-01-20T03:46:05+00:00

s_fc_lock can be acquired from inode eviction and thus is reclaim unsafe. Since the fast commit path holds s_fc_lock while writing the commit log, allocations under the lock can enter reclaim and invert the lock order with fs_reclaim. Add ext4_fc_lock()/ext4_fc_unlock() helpers which acquire s_fc_lock under memalloc_nofs_save()/restore() context and use them everywhere so allocations under the lock cannot recurse into filesystem reclaim. Fixes: 6593714d67ba ("ext4: hold s_fc_lock while during fast commit") Signed-off-by: Li Chen Reviewed-by: Baokun Li Reviewed-by: Zhang Yi Reviewed-by: Jan Kara Link: https://patch.msgid.link/20260106120621.440126-1-me@linux.beauty Signed-off-by: Theodore Ts'o

ext4: mark move extents fast-commit ineligible

2026-01-20T00:26:35+00:00

Fast commits only log operations that have dedicated replay support. EXT4_IOC_MOVE_EXT swaps extents between regular files and may copy data, rewriting the affected inodes' block mapping layout without going through the fast commit tracking paths. In practice these operations are rare and usually followed by further updates, but mixing them into a fast commit makes the overall semantics harder to reason about and risks replay gaps if new call sites appear. Teach ext4 to mark the filesystem fast-commit ineligible for the journal transactions used by move_extent_per_page() when EXT4_IOC_MOVE_EXT runs. This forces those transactions to fall back to a full commit, ensuring that these multi-inode extent swaps are captured by the normal journal rather than partially encoded in fast commit TLVs. This change should not affect common workloads but makes online defragmentation safer and easier to reason about under fast commit. Testing: 1. prepare: dd if=/dev/zero of=/root/fc_move.img bs=1M count=0 seek=256 mkfs.ext4 -O fast_commit -F /root/fc_move.img mkdir -p /mnt/fc_move && mount -t ext4 -o loop \ /root/fc_move.img /mnt/fc_move 2. Created two files, ran EXT4_IOC_MOVE_EXT via e4defrag, and checked the ineligible reason statistics: fallocate -l 64M /mnt/fc_move/file1 cp /mnt/fc_move/file1 /mnt/fc_move/file2 e4defrag /mnt/fc_move/file1 cat /proc/fs/ext4/loop0/fc_info shows "Move extents": > 0 and fc stats ineligible > 0. Signed-off-by: Li Chen Link: https://patch.msgid.link/20251211115146.897420-4-me@linux.beauty Signed-off-by: Theodore Ts'o

ext4: mark fs-verity enable fast-commit ineligible

2026-01-20T00:26:35+00:00

Fast commits only log operations that have dedicated replay support. Enabling fs-verity builds a Merkle tree and updates inode and orphan state in ways that are not described by the fast commit replay tags. In practice these operations are rare and usually followed by further updates, but mixing them into a fast commit makes the overall semantics harder to reason about and risks replay gaps if new call sites appear. Teach ext4 to mark the filesystem fast-commit ineligible when ext4_end_enable_verity() starts its journal transaction. This forces that transaction to fall back to a full commit, ensuring that the fs-verity enable changes are captured by the normal journal rather than partially encoded in fast commit TLVs. This change should not affect common workloads but makes fs-verity enable safer and easier to reason about under fast commit. Testing: 1. prepare: dd if=/dev/zero of=/root/fc_verity.img bs=1M count=0 seek=128 mkfs.ext4 -O fast_commit,verity -F /root/fc_verity.img mkdir -p /mnt/fc_verity && mount -t ext4 -o loop /root/fc_verity.img /mnt/fc_verity 2. Enabled fs-verity on a file and verified reason accounting: echo "data" > /mnt/fc_verity/verityfile /root/enable_verity /mnt/fc_verity/verityfile sync tail -n 1 /proc/fs/ext4/loop0/fc_info "fs-verity enable": 1 3. Enabled fs-verity on a second file, fsynced it, and checked that the ineligible commit counter is updated too: echo "data2" > /mnt/fc_verity/verityfile2 /root/enable_verity /mnt/fc_verity/verityfile2 /root/fsync_file /mnt/fc_verity/verityfile2 sync /proc/fs/ext4/loop0/fc_info shows "fs-verity enable" incremented and fc stats ineligible increased accordingly. Signed-off-by: Li Chen Link: https://patch.msgid.link/20251211115146.897420-3-me@linux.beauty Signed-off-by: Theodore Ts'o

ext4: mark inode format migration fast-commit ineligible

2026-01-20T00:26:35+00:00

Fast commits only log operations that have dedicated replay support. Inode format migration (indirect<->extent layout changes via EXT4_IOC_MIGRATE or toggling EXT4_EXTENTS_FL) rewrites the block mapping representation without going through the fast commit tracking paths. In practice these migrations are rare and usually followed by further updates, but mixing them into a fast commit makes the overall semantics harder to reason about and risks replay gaps if new call sites appear. Teach ext4 to mark the filesystem fast-commit ineligible when ext4_ext_migrate() or ext4_ind_migrate() start their journal transactions. This forces those transactions to fall back to a full commit, ensuring that the entire inode layout change is captured by the normal journal rather than partially encoded in fast commit TLVs. This change should not affect common workloads but makes format migrations safer and easier to reason about under fast commit. Testing: 1. prepare: dd if=/dev/zero of=/root/fc.img bs=1M count=0 seek=128 mkfs.ext4 -O fast_commit -F /root/fc.img mkdir -p /mnt/fc && mount -t ext4 -o loop /root/fc.img /mnt/fc 2. Created a test file and toggled the extents flag to exercise both ext4_ind_migrate() and ext4_ext_migrate(): touch /mnt/fc/migtest chattr -e /mnt/fc/migtest chattr +e /mnt/fc/migtest 3. Verified fast-commit ineligible statistics: tail -n 1 /proc/fs/ext4/loop0/fc_info "Inode format migration": 2 Signed-off-by: Li Chen Link: https://patch.msgid.link/20251211115146.897420-2-me@linux.beauty Signed-off-by: Theodore Ts'o

ext4: increase IO priority of fastcommit

2025-09-25T18:56:31+00:00

The following code paths may result in high latency or even task hangs: 1. fastcommit io is throttled by wbt. 2. jbd2_fc_wait_bufs() might wait for a long time while JBD2_FAST_COMMIT_ONGOING is set in journal->flags, and then jbd2_journal_commit_transaction() waits for the JBD2_FAST_COMMIT_ONGOING bit for a long time while holding the write lock of j_state_lock. 3. start_this_handle() waits for read lock of j_state_lock which results in high latency or task hang. Given the fact that ext4_fc_commit() already modifies the current process' IO priority to match that of the jbd2 thread, it should be reasonable to match jbd2's IO submission flags as well. Suggested-by: Ritesh Harjani (IBM) Signed-off-by: Julian Sun Reviewed-by: Zhang Yi Reviewed-by: Jan Kara Message-ID: <20250827121812.1477634-1-sunjunchao@bytedance.com> Signed-off-by: Theodore Ts'o

ext4: remove sbi argument from ext4_chksum()

2025-05-20T14:31:12+00:00

Since ext4_chksum() no longer uses its sbi argument, remove it. Signed-off-by: Eric Biggers Reviewed-by: Baokun Li Link: https://patch.msgid.link/20250513053809.699974-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o

ext4: prevent stale extent cache entries caused by concurrent I/O writeback

2025-05-14T14:42:12+00:00

Currently, in the I/O writeback path, ext4_map_blocks() may attempt to cache additional unrelated extents in the extent status tree without holding the inode's i_rwsem and the mapping's invalidate_lock. This can lead to stale extent status entries remaining in certain scenarios, potentially causing data corruption. For example, when performing a collapse range in ext4_collapse_range(), it clears the extent cache and dirty pages before removing blocks and shifting extents. It also holds the i_data_sem during these two operations. However, both ext4_ext_remove_space() and ext4_ext_shift_extents() may briefly release the i_data_sem if journal credits are insufficient (ext4_datasem_ensure_credits()). If another writeback process writes dirty pages from other regions during this interval, it may cache extents that are about to be modified. Unless ext4_collapse_range() explicitly clears the extent cache again, these cached entries can become stale and inconsistent with the actual extents. 0 a n b c m | | | | | | [www][wwwwww][wwwwwwww]...[wwwww][wwww]... | | N M Assume that block a is dirty. The collapse range operation is removing data from n to m and drops i_data_sem immediately after removing the extent from b to c. At the same time, a concurrent writeback begins to write back block a; it will reloads the extent from [n, b) into the extent status tree since it does not hold the i_rwsem or the invalidate_lock. After the collapse range operation, it left the stale extent [n, b), which points logical block n to N, but the actual physical block of n should be M. Similarly, both ext4_insert_range() and ext4_truncate() have the same problem. ext4_punch_hole() survived since it re-add a hole extent entry after removing space since commit 9f1118223aa0 ("ext4: add a hole extent entry in cache after punch"). In most cases, during dirty page writeback, the block mapping information is likely to be found in the extent cache, making it less necessary to search for physical extents. Consequently, loading unrelated extent caches during writeback appears to be ineffective. Therefore, fix this by adds EXT4_EX_NOCACHE in the writeback path to prevent caching of unrelated extents, eliminating this potential source of corruption. Signed-off-by: Zhang Yi Link: https://patch.msgid.link/20250423085257.122685-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o

ext4: generalize EXT4_GET_BLOCKS_IO_SUBMIT flag usage

2025-05-14T14:42:12+00:00

Currently, the EXT4_GET_BLOCKS_IO_SUBMIT flag is only used during data writeback to indicate that in ordered mode, the journal commit thread should skip re-submitting data and simply wait for I/O completion. To prepare for later patches that need to detect I/O submission context in ext4_map_blocks(), generalizes the meaning of EXT4_GET_BLOCKS_IO_SUBMIT. This flag will be set during: 1) data I/O writeback, 2) I/O completion extents conversion, 3) journal performing commit in fast_commit. This change doesn't affect current usage of this flag and provides a clear way to identify I/O submission context. Signed-off-by: Zhang Yi Link: https://patch.msgid.link/20250423085257.122685-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o