path: root/fs/ext4
Age  Commit message  Author  Files  Lines
2025-08-28  ext4: fix hole length calculation overflow in non-extent inodes  (Zhang Yi; 1 file, -2/+2)
commit 02c7f7219ac0e2277b3379a3a0e9841ef464b6d4 upstream. In a filesystem with a block size larger than 4KB, the hole length calculation for a non-extent inode in ext4_ind_map_blocks() can easily exceed INT_MAX. It can then return a zero-length hole and trigger the following warning and an infinite loop in the iomap infrastructure.
------------[ cut here ]------------
WARNING: CPU: 3 PID: 434101 at fs/iomap/iter.c:34 iomap_iter_done+0x148/0x190
CPU: 3 UID: 0 PID: 434101 Comm: fsstress Not tainted 6.16.0-rc7+ #128 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : iomap_iter_done+0x148/0x190
lr : iomap_iter+0x174/0x230
sp : ffff8000880af740
x29: ffff8000880af740 x28: ffff0000db8e6840 x27: 0000000000000000
x26: 0000000000000000 x25: ffff8000880af830 x24: 0000004000000000
x23: 0000000000000002 x22: 000001bfdbfa8000 x21: ffffa6a41c002e48
x20: 0000000000000001 x19: ffff8000880af808 x18: 0000000000000000
x17: 0000000000000000 x16: ffffa6a495ee6cd0 x15: 0000000000000000
x14: 00000000000003d4 x13: 00000000fa83b2da x12: 0000b236fc95f18c
x11: ffffa6a4978b9c08 x10: 0000000000001da0 x9 : ffffa6a41c1a2a44
x8 : ffff8000880af5c8 x7 : 0000000001000000 x6 : 0000000000000000
x5 : 0000000000000004 x4 : 000001bfdbfa8000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000004004030000 x0 : 0000000000000000
Call trace:
 iomap_iter_done+0x148/0x190 (P)
 iomap_iter+0x174/0x230
 iomap_fiemap+0x154/0x1d8
 ext4_fiemap+0x110/0x140 [ext4]
 do_vfs_ioctl+0x4b8/0xbc0
 __arm64_sys_ioctl+0x8c/0x120
 invoke_syscall+0x6c/0x100
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x38/0x120
 el0t_64_sync_handler+0x10c/0x138
 el0t_64_sync+0x198/0x1a0
---[ end trace 0000000000000000 ]---
Cc: stable@kernel.org Fixes: facab4d9711e ("ext4: return hole from ext4_map_blocks()") Reported-by: Qu Wenruo <wqu@suse.com> Closes: https://lore.kernel.org/linux-ext4/9b650a52-9672-4604-a765-bb6be55d1e4a@gmx.com/ Tested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250811064532.1788289-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
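The overflow described above boils down to narrowing a 64-bit block count into a 32-bit value. A minimal standalone C sketch of that arithmetic (illustrative only, not the ext4 code):

    #include <stdint.h>
    #include <stdio.h>

    /* Standalone illustration (not the ext4 code): narrowing a 64-bit hole
     * length in blocks into a 32-bit field wraps, and for a length that is
     * an exact multiple of 2^32 it becomes 0, which is the zero-length hole
     * that makes the iomap iterator spin. */
    int main(void)
    {
            uint64_t hole_blocks = 1ULL << 32;          /* plausible on a huge sparse file */
            uint32_t narrowed = (uint32_t)hole_blocks;  /* modular truncation -> 0 */

            printf("hole length: %llu blocks, after narrowing to 32 bits: %u\n",
                   (unsigned long long)hole_blocks, narrowed);
            return 0;
    }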
2025-08-28  ext4: use kmalloc_array() for array space allocation  (Liao Yuanhong; 1 file, -2/+3)
commit 76dba1fe277f6befd6ef650e1946f626c547387a upstream. Replace kmalloc(size * sizeof) with kmalloc_array() for safer memory allocation and overflow prevention. Cc: stable@kernel.org Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Link: https://patch.msgid.link/20250811125816.570142-1-liaoyuanhong@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
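The pattern being fixed is the classic unchecked count * size multiplication; kmalloc_array() performs the overflow check before allocating. A userspace analogue of that check, as a hedged sketch rather than the kernel implementation:

    #include <stdint.h>
    #include <stdlib.h>

    /* Userspace analogue of what kmalloc_array() guards against: refuse the
     * allocation when count * size would overflow, instead of silently
     * allocating a short buffer. Illustrative sketch only. */
    static void *alloc_array(size_t count, size_t size)
    {
            if (size != 0 && count > SIZE_MAX / size)
                    return NULL;            /* multiplication would overflow */
            return malloc(count * size);
    }

    int main(void)
    {
            int *p = alloc_array(16, sizeof(*p));    /* fine */
            void *q = alloc_array((size_t)-1, 8);    /* overflow -> NULL */
            free(p);
            return q == NULL ? 0 : 1;
    }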
2025-08-28  ext4: don't try to clear the orphan_present feature when the block device is r/o  (Theodore Ts'o; 1 file, -0/+2)
commit c5e104a91e7b6fa12c1dc2d8bf84abb7ef9b89ad upstream. When the file system is frozen in preparation for taking an LVM snapshot, the journal is checkpointed, and if the orphan_file feature is enabled and the orphan file is empty, we clear the orphan_present feature flag. But if there are pending inodes that need to be removed, the orphan_present feature flag can't be cleared. The problem comes if the block device is read-only. In that case, we can't process the orphan inode list, so it is skipped in ext4_orphan_cleanup(). But then in ext4_mark_recovery_complete(), this results in the ext4 error "Orphan file not empty on read-only fs" firing and the file system mount is aborted. Fix this by clearing the needs_recovery flag if the block device is read-only. We do this after the call to ext4_load_and_init_journal() since there are some error checks that need to be done in case the journal needs to be replayed and the block device is read-only, or if the block device containing the external journal is read-only, etc. Cc: stable@kernel.org Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1108271 Cc: stable@vger.kernel.org Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28  ext4: fix reserved gdt blocks handling in fsmap  (Ojaswin Mujoo; 1 file, -0/+8)
commit 3ffbdd1f1165f1b2d6a94d1b1aabef57120deaf7 upstream. In some cases, like small FSes with no meta_bg and where the resize doesn't need extra gdt blocks as it can fit in the current one, s_reserved_gdt_blocks is set to 0, which causes fsmap to emit a 0-length entry, which is incorrect.
$ mkfs.ext4 -b 65536 -O bigalloc /dev/sda 5G
$ mount /dev/sda /mnt/scratch
$ xfs_io -c "fsmap -d" /mnt/scratch
 0: 253:48 [0..127]: static fs metadata 128
 1: 253:48 [128..255]: special 102:1 128
 2: 253:48 [256..255]: special 102:2 0 <---- 0 len entry
 3: 253:48 [256..383]: special 102:3 128
Fix this by adding a check for this case. Cc: stable@kernel.org Fixes: 0c9ec4beecac ("ext4: support GETFSMAP ioctls") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/08781b796453a5770112aa96ad14c864fbf31935.1754377641.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
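The shape of the guard is simple: skip emitting a record whose length would be zero. A standalone sketch of that idea (function and variable names here are illustrative, not the ext4 code):

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only: emit an fsmap-style record for the reserved GDT area
     * only when its length is nonzero; otherwise a [256..255]-style
     * zero-length entry would be reported, as in the output above. */
    static void emit_reserved_gdt(uint64_t start, uint64_t reserved_gdt_blocks)
    {
            if (reserved_gdt_blocks == 0)
                    return;                 /* nothing to report */
            printf("special: [%llu..%llu]\n",
                   (unsigned long long)start,
                   (unsigned long long)(start + reserved_gdt_blocks - 1));
    }

    int main(void)
    {
            emit_reserved_gdt(256, 0);      /* skipped */
            emit_reserved_gdt(256, 128);    /* [256..383] */
            return 0;
    }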
2025-08-28  ext4: fix fsmap end of range reporting with bigalloc  (Ojaswin Mujoo; 1 file, -3/+12)
commit bae76c035bf0852844151e68098c9b7cd63ef238 upstream. With bigalloc enabled, the logic to report last extent has a bug since we try to use cluster units instead of block units. This can cause an issue where extra incorrect entries might be returned back to the user. This was flagged by generic/365 with 64k bs and -O bigalloc. ** Details of issue ** The issue was noticed on 5G 64k blocksize FS with -O bigalloc which has only 1 bg. $ xfs_io -c "fsmap -d" /mnt/scratch 0: 253:48 [0..127]: static fs metadata 128 /* sb */ 1: 253:48 [128..255]: special 102:1 128 /* gdt */ 3: 253:48 [256..383]: special 102:3 128 /* block bitmap */ 4: 253:48 [384..2303]: unknown 1920 /* flex bg empty space */ 5: 253:48 [2304..2431]: special 102:4 128 /* inode bitmap */ 6: 253:48 [2432..4351]: unknown 1920 /* flex bg empty space */ 7: 253:48 [4352..6911]: inodes 2560 8: 253:48 [6912..538623]: unknown 531712 9: 253:48 [538624..10485759]: free space 9947136 The issue can be seen with: $ xfs_io -c "fsmap -d 0 3" /mnt/scratch 0: 253:48 [0..127]: static fs metadata 128 1: 253:48 [384..2047]: unknown 1664 Only the first entry was expected to be returned but we get 2. This is because: ext4_getfsmap_datadev() first_cluster, last_cluster = 0 ... info->gfi_last = true; ext4_getfsmap_datadev_helper(sb, end_ag, last_cluster + 1, 0, info); fsb = C2B(1) = 16 fslen = 0 ... /* Merge in any relevant extents from the meta_list */ list_for_each_entry_safe(p, tmp, &info->gfi_meta_list, fmr_list) { ... // since fsb = 16, considers all metadata which starts before 16 blockno iter 1: error = ext4_getfsmap_helper(sb, info, p); // p = sb (0,1), nop info->gfi_next_fsblk = 1 iter 2: error = ext4_getfsmap_helper(sb, info, p); // p = gdt (1,2), nop info->gfi_next_fsblk = 2 iter 3: error = ext4_getfsmap_helper(sb, info, p); // p = blk bitmap (2,3), nop info->gfi_next_fsblk = 3 iter 4: error = ext4_getfsmap_helper(sb, info, p); // p = ino bitmap (18,19) if (rec_blk > info->gfi_next_fsblk) { // (18 > 3) // emits an extra entry ** BUG ** } } Fix this by directly calling ext4_getfsmap_datadev() with a dummy record that has fmr_physical set to (end_fsb + 1) instead of last_cluster + 1. By using the block instead of cluster we get the correct behavior. Replacing ext4_getfsmap_datadev_helper() with ext4_getfsmap_helper() is okay since the gfi_lastfree and metadata checks in ext4_getfsmap_datadev_helper() are anyways redundant when we only want to emit the last allocated block of the range, as we have already taken care of emitting metadata and any last free blocks. Cc: stable@kernel.org Reported-by: Disha Goel <disgoel@linux.ibm.com> Fixes: 4a622e4d477b ("ext4: fix FS_IOC_GETFSMAP handling") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/e7472c8535c9c5ec10f425f495366864ea12c9da.1754377641.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
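The root cause is a unit mix-up: a cluster index was handed to code expecting a block number, which moves the end-of-range marker. A standalone sketch of that arithmetic using the 16-blocks-per-cluster geometry from the example above (values illustrative, not the ext4 code):

    #include <stdint.h>
    #include <stdio.h>

    /* With bigalloc, cluster-to-block conversion is a left shift by the
     * cluster bits. Passing "last_cluster + 1" where a block number is
     * expected therefore puts the end marker at block 16 instead of at
     * end_block + 1, so metadata lying beyond the queried range gets
     * re-emitted. Standalone illustration only. */
    int main(void)
    {
            unsigned int cluster_bits = 4;              /* 16 blocks per cluster */
            uint64_t last_cluster = 0, end_block = 3;   /* query covers blocks 0..3 */

            uint64_t wrong_end = (last_cluster + 1) << cluster_bits;  /* 16 */
            uint64_t right_end = end_block + 1;                       /* 4  */

            printf("end marker: wrong=%llu, right=%llu (in blocks)\n",
                   (unsigned long long)wrong_end, (unsigned long long)right_end);
            return 0;
    }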
2025-08-28  ext4: check fast symlink for ea_inode correctly  (Andreas Dilger; 1 file, -1/+1)
commit b4cc4a4077268522e3d0d34de4b2dc144e2330fa upstream. The check for a fast symlink in the presence of only an external xattr inode is incorrect. If a fast symlink does not have an xattr block (i_file_acl == 0), but does have an external xattr inode that increases inode i_blocks, then the check for a fast symlink will incorrectly fail and __ext4_iget()->ext4_ind_check_inode() will report the inode is corrupt when it "validates" i_data[] on the next read: # ln -s foo /mnt/tmp/bar # setfattr -h -n trusted.test \ -v "$(yes | head -n 4000)" /mnt/tmp/bar # umount /mnt/tmp # mount /mnt/tmp # ls -l /mnt/tmp ls: cannot access '/mnt/tmp/bar': Structure needs cleaning total 4 ? l?????????? ? ? ? ? ? bar # dmesg | tail -1 EXT4-fs error (device dm-8): __ext4_iget:5098: inode #24578: block 7303014: comm ls: invalid block (note that "block 7303014" = 0x6f6f66 = "foo" in LE order). ext4_inode_is_fast_symlink() should check the superblock EXT4_FEATURE_INCOMPAT_EA_INODE feature flag, not the inode EXT4_EA_INODE_FL, since the latter is only set on the xattr inode itself, and not on the inode that uses this xattr. Cc: stable@vger.kernel.org Fixes: fc82228a5e38 ("ext4: support fast symlinks from ext3 file systems") Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Li Dongyang <dongyangli@ddn.com> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-on: https://review.whamcloud.com/59879 Lustre-bug-id: https://jira.whamcloud.com/browse/LU-19121 Link: https://patch.msgid.link/20250717063709.757077-1-adilger@dilger.ca Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28  ext4: preserve SB_I_VERSION on remount  (Baokun Li; 1 file, -3/+3)
commit f2326fd14a224e4cccbab89e14c52279ff79b7ec upstream. IMA testing revealed that after an ext4 remount, file accesses triggered full measurements even without modifications, instead of skipping as expected when i_version is unchanged. Debugging showed `SB_I_VERSION` was cleared in reconfigure_super() during remount due to commit 1ff20307393e ("ext4: unconditionally enable the i_version counter") removing the fix from commit 960e0ab63b2e ("ext4: fix i_version handling on remount"). To rectify this, `SB_I_VERSION` is always set for `fc->sb_flags` in ext4_init_fs_context(), instead of `sb->s_flags` in __ext4_fill_super(), ensuring it persists across all mounts. Cc: stable@kernel.org Fixes: 1ff20307393e ("ext4: unconditionally enable the i_version counter") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250703073903.6952-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-23  ext4: replace ext4_writepage_trans_blocks()  (Zhang Yi; 6 files, -27/+25)
commit 57661f28756c59510e31543520b5b8f5e591f384 upstream. After ext4 supports large folios, the semantics of reserving credits in pages is no longer applicable. In most scenarios, reserving credits in extents is sufficient. Therefore, introduce ext4_chunk_trans_extent() to replace ext4_writepage_trans_blocks(). move_extent_per_page() is the only remaining location where we are still processing extents in pages. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-23  ext4: reserved credits for one extent during the folio writeback  (Zhang Yi; 1 file, -17/+8)
commit bbbf150f3f85619569ac19dc6458cca7c492e715 upstream. After ext4 supports large folios, reserving journal credits for one maximum-ordered folio based on the worst-case scenario during the writeback process can easily exceed the maximum transaction credits. Additionally, reserving journal credits for one page is also no longer appropriate. Currently, the folio writeback process can either extend the journal credits or initiate a new transaction if the currently reserved journal credits are insufficient. Therefore, it can be modified to reserve credits for only one extent at the outset. In most cases involving continuous mapping, these credits are generally adequate, and we may only need to perform some basic credit expansion. However, in extreme cases where the block size and folio size differ significantly, or when the folios are sufficiently discontinuous, it may be necessary to restart a new transaction and resubmit the folios. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-23  ext4: correct the reserved credits for extent conversion  (Zhang Yi; 1 file, -3/+3)
commit 95ad8ee45cdbc321c135a2db895d48b374ef0f87 upstream. Now, we reserve journal credits for converting extents in only one page to written state when the I/O operation is complete. This is insufficient when large folios are enabled. Fix this by reserving credits for converting up to one extent per block in the largest 2MB folio; this calculation should only involve extent index and leaf blocks, so it should not estimate too many credits. Fixes: 7ac67301e82f ("ext4: enable large folio for regular file") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
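The worst case the reservation has to cover is one extent conversion per filesystem block inside the largest folio. A quick worked example of that bound (numbers illustrative, not the ext4 credit formula):

    #include <stdio.h>

    /* In a 2MB folio every filesystem block could end up as its own
     * unwritten extent needing conversion at I/O completion, so the credit
     * estimate has to scale with folio_size / block_size rather than assume
     * a single page. Illustration only. */
    int main(void)
    {
            unsigned int folio_size = 2u << 20;   /* 2MB, the largest folio here */
            unsigned int block_size = 1024;       /* 1KB blocks, the worst case  */

            printf("extents to convert in the worst case: %u\n",
                   folio_size / block_size);      /* 2048 */
            return 0;
    }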
2025-08-23  ext4: enhance tracepoints during the folios writeback  (Zhang Yi; 1 file, -1/+4)
commit 6b132759b0fe78e518abafb62190c294100db6d6 upstream. After mpage_map_and_submit_extent() gained support for restarting the handle if credits are insufficient while allocating blocks, it is more likely to exit the current mapping iteration and continue processing the partially mapped folio that is currently being handled. The existing tracepoints are not sufficient to track this situation, so enhance the tracepoints to track the writeback position and the return value before and after submitting the folios. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-23  ext4: restart handle if credits are insufficient during allocating blocks  (Zhang Yi; 1 file, -5/+36)
commit e2c4c49dee64ca2f42ad2958cbe1805de96b6732 upstream. After large folios are supported on ext4, writing back a sufficiently large and discontinuous folio may consume a significant number of journal credits, placing considerable strain on the journal. For example, in a 20GB filesystem with 1K block size and 1MB journal size, writing back a 2MB folio could require thousands of credits in the worst-case scenario (when each block is discontinuous and distributed across different block groups), potentially exceeding the journal size. This issue can also occur in ext4_write_begin() and ext4_page_mkwrite() when delalloc is not enabled. Fix this by ensuring that there are sufficient journal credits before allocating an extent in mpage_map_one_extent() and ext4_block_write_begin(). If there are not enough credits, return -EAGAIN, exit the current mapping loop, restart a new handle and a new transaction, and allocate blocks for this folio again in the next iteration. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-23  ext4: refactor the block allocation process of ext4_page_mkwrite()  (Zhang Yi; 1 file, -45/+50)
commit 2bddafea3d0d85ee9ac3cf5ba9a4b2f2d2f50257 upstream. The block allocation process and error handling in ext4_page_mkwrite() is complex now. Refactor it by introducing a new helper function, ext4_block_page_mkwrite(). It will call ext4_block_write_begin() to allocate blocks instead of directly calling block_page_mkwrite(). Preparing to implement retry logic in a subsequent patch to address situations where the reserved journal credits are insufficient. Additionally, this modification will help prevent potential deadlocks that may occur when waiting for folio writeback while holding the transaction handle. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-23  ext4: fix stale data if it bails out of the extents mapping loop  (Zhang Yi; 1 file, -1/+50)
commit ded2d726a3041fce8afd88005cbfe15cd4737702 upstream. During the process of writing back folios, if mpage_map_and_submit_extent() exits the extent mapping loop due to an ENOSPC or ENOMEM error, it may result in stale data or filesystem inconsistency in environments where the block size is smaller than the folio size. When mapping a discontinuous folio in mpage_map_and_submit_extent(), some buffers may have already been mapped. If we exit the mapping loop prematurely, the folio data within the mapped range will not be written back, and the file's disk size will not be updated. Once the transaction that includes this range of extents is committed, this can lead to stale data or filesystem inconsistency. Fix this by submitting the partially mapped folio that is currently being processed. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-23  ext4: move the calculation of wbc->nr_to_write to mpage_folio_done()  (Zhang Yi; 1 file, -2/+1)
commit f922c8c2461b022a2efd9914484901fb358a5b2a upstream. mpage_folio_done() should be a more appropriate place than mpage_submit_folio() for updating the wbc->nr_to_write after we have submitted a fully mapped folio. This prepares for making mpage_submit_folio() able to submit a partially mapped folio that is still under processing. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-23  ext4: process folios writeback in bytes  (Zhang Yi; 1 file, -34/+36)
commit 1bfe6354e0975fe89c3d25e81b6546d205556a4b upstream. Since ext4 supports large folios, processing writeback in pages is no longer appropriate; it can be modified to process writeback in bytes. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20  ext4: initialize superblock fields in the kballoc-test.c kunit tests  (Zhang Yi; 1 file, -0/+9)
commit 82e6381e23f1ea7a14f418215068aaa2ca046c84 upstream. Various changes in the "ext4: better scalability for ext4 block allocation" patch series have resulted in kunit test failures, most notably in the test_new_blocks_simple and the test_mb_mark_used tests. The root cause of these failures is that various in-memory ext4 data structures were not getting initialized, and while previous versions of the functions exercised by the unit tests didn't use these structure members, this was arguably a test bug. Since one of the patches in the block allocation scalability series is a fix which has a cc:stable tag, this commit also has a cc:stable tag. CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250714130327.1830534-1-libaokun1@huawei.com Link: https://patch.msgid.link/20250725021550.3177573-1-yi.zhang@huaweicloud.com Link: https://patch.msgid.link/20250725021654.3188798-1-yi.zhang@huaweicloud.com Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/linux-ext4/b0635ad0-7ebf-4152-a69b-58e7e87d5085@roeck-us.net/ Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20  ext4: fix largest free orders lists corruption on mb_optimize_scan switch  (Baokun Li; 1 file, -19/+14)
commit 7d345aa1fac4c2ec9584fbd6f389f2c2368671d5 upstream. The grp->bb_largest_free_order is updated regardless of whether mb_optimize_scan is enabled. This can lead to inconsistencies between grp->bb_largest_free_order and the actual s_mb_largest_free_orders list index when mb_optimize_scan is repeatedly enabled and disabled via remount. For example, if mb_optimize_scan is initially enabled, largest free order is 3, and the group is in s_mb_largest_free_orders[3]. Then, mb_optimize_scan is disabled via remount, block allocations occur, updating largest free order to 2. Finally, mb_optimize_scan is re-enabled via remount, more block allocations update largest free order to 1. At this point, the group would be removed from s_mb_largest_free_orders[3] under the protection of s_mb_largest_free_orders_locks[2]. This lock mismatch can lead to list corruption. To fix this, whenever grp->bb_largest_free_order changes, we now always attempt to remove the group from its old order list. However, we only insert the group into the new order list if `mb_optimize_scan` is enabled. This approach helps prevent lock inconsistencies and ensures the data in the order lists remains reliable. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20  ext4: fix zombie groups in average fragment size lists  (Baokun Li; 1 file, -18/+18)
commit 1c320d8e92925bb7615f83a7b6e3f402a5c2ca63 upstream. Groups with no free blocks shouldn't be in any average fragment size list. However, when all blocks in a group are allocated(i.e., bb_fragments or bb_free is 0), we currently skip updating the average fragment size, which means the group isn't removed from its previous s_mb_avg_fragment_size[old] list. This created "zombie" groups that were always skipped during traversal as they couldn't satisfy any block allocation requests, negatively impacting traversal efficiency. Therefore, when a group becomes completely full, bb_avg_fragment_size_order is now set to -1. If the old order was not -1, a removal operation is performed; if the new order is not -1, an insertion is performed. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20  ext4: limit the maximum folio order  (Zhang Yi; 3 files, -6/+21)
[ Upstream commit b12f423d598fd874df9ecfb2436789d582fda8e6 ] In environments with a page size of 64KB, the maximum size of a folio can reach up to 128MB. Consequently, during the write-back of folios, the 'rsv_blocks' will be overestimated to 1,577, which can put pressure on the journal space when the journal is small. This can easily exceed the limit of a single transaction. Besides, an excessively large folio is meaningless and will instead increase the overhead of traversing the bhs within the folio. Therefore, limit the maximum order of a folio to 2048 filesystem blocks. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Joseph Qi <jiangqi903@gmail.com> Closes: https://lore.kernel.org/linux-ext4/CA+G9fYsyYQ3ZL4xaSg1-Tt5Evto7Zd+hgNWZEa9cQLbahA1+xg@mail.gmail.com/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-12-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
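The cap works out to a much smaller worst-case folio than the page cache alone would allow. A quick worked example of the limit for 4KB blocks (figures illustrative):

    #include <stdio.h>

    /* On a 64KB page system the page cache would allow folios of up to
     * 128MB, but limiting a folio to 2048 filesystem blocks keeps it at
     * 8MB when the block size is 4KB, which bounds the journal credits
     * needed to write it back. Illustration of the arithmetic only. */
    int main(void)
    {
            unsigned long block_size = 4096;
            unsigned long max_blocks = 2048;

            printf("max folio size: %lu MB\n",
                   (max_blocks * block_size) >> 20);   /* 8 */
            return 0;
    }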
2025-08-20  ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr  (Theodore Ts'o; 1 file, -3/+16)
[ Upstream commit 099b847ccc6c1ad2f805d13cfbcc83f5b6d4bc42 ] A syzbot fuzzed image triggered a BUG_ON in ext4_update_inline_data() when an inode had the INLINE_DATA_FL flag set but was missing the system.data extended attribute. Since this can happen due to a maliciously fuzzed file system, we shouldn't BUG, but rather report it as a corrupted file system. Add similar replacements of BUG_ON with EXT4_ERROR_INODE() in ext4_create_inline_data() and ext4_inline_data_truncate(). Reported-by: syzbot+544248a761451c0df72f@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
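The hardening pattern is the interesting part: for state that a crafted image controls, return a corruption error to the caller rather than asserting. A standalone sketch of that pattern (names and the error value are illustrative stand-ins, not the ext4 code):

    #include <stdio.h>

    #define ERR_CORRUPT (-117)   /* stand-in for a filesystem-corruption errno */

    /* Sketch only: instead of BUG()/assert() on a missing system.data
     * xattr, report corruption so the mount or operation can fail cleanly. */
    static int update_inline_data(int has_system_data_xattr)
    {
            if (!has_system_data_xattr)
                    return ERR_CORRUPT;     /* treat as on-disk corruption */
            return 0;
    }

    int main(void)
    {
            printf("%d\n", update_inline_data(0));
            return 0;
    }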
2025-08-15  ext4: Make sure BH_New bit is cleared in ->write_end handler  (Jan Kara; 2 files, -1/+4)
[ Upstream commit 91b8ca8b26729b729dda8a4eddb9aceaea706f37 ] Currently we clear BH_New bit in case of error and also in the standard ext4_write_end() handler (in block_commit_write()). However ext4_journalled_write_end() misses this clearing and thus we are leaving stale BH_New bits behind. Generally ext4_block_write_begin() clears these bits before any harm can be done but in case blocksize < pagesize and we hit some error when processing a page with these stale bits, we'll try to zero buffers with these stale BH_New bits and jbd2 will complain (as buffers were not prepared for writing in this transaction). Fix the problem by clearing BH_New bits in ext4_journalled_write_end() and WARN if ext4_block_write_begin() sees stale BH_New bits. Reported-by: Baolin Liu <liubaolin12138@163.com> Reported-by: Zhi Long <longzhi@sangfor.com.cn> Fixes: 3910b513fcdf ("ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250709084831.23876-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15  ext4: fix inode use after free in ext4_end_io_rsv_work()  (Baokun Li; 1 file, -8/+8)
[ Upstream commit c678bdc998754589cea2e6afab9401d7d8312ac4 ] In ext4_io_end_defer_completion(), check if io_end->list_vec is empty to avoid adding an io_end that requires no conversion to the i_rsv_conversion_list, which in turn prevents starting an unnecessary worker. An ext4_emergency_state() check is also added to avoid attempting to abort the journal in an emergency state. Additionally, ext4_put_io_end_defer() is refactored to call ext4_io_end_defer_completion() directly instead of being open-coded. This also prevents starting an unnecessary worker when EXT4_IO_END_FAILED is set but data_err=abort is not enabled. This ensures that the check in ext4_put_io_end_defer() is consistent with the check in ext4_end_bio(). Otherwise, we might add an io_end to the i_rsv_conversion_list and then call ext4_finish_bio(), after which the inode could be freed before ext4_end_io_rsv_work() is called, triggering a use-after-free issue. Fixes: ce51afb8cc5e ("ext4: abort journal on data writeback failure if in data_err=abort mode") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250708111504.3208660-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15  ext4: fix insufficient credits calculation in ext4_meta_trans_blocks()  (Zhang Yi; 1 file, -2/+2)
[ Upstream commit 5137d6c8906b55b3c7b5d1aa5a549753ec8520f5 ] The calculation of journal credits in ext4_meta_trans_blocks() should include pextents, as each extent separately may be allocated from a different group and thus need to update different bitmap and group descriptor block. Fixes: 0e32d8617012 ("ext4: correct the journal credits calculations of allocating blocks") Reported-by: Jan Kara <jack@suse.cz> Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-11-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-08  treewide, timers: Rename from_timer() to timer_container_of()  (Ingo Molnar; 1 file, -1/+1)
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-05-28  Merge tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4  (Linus Torvalds; 20 files, -449/+1040)
Pull ext4 updates from Ted Ts'o:
"New ext4 features and performance improvements:
 - Fast commit performance improvements
 - Multi-fsblock atomic write support for bigalloc file systems
 - Large folio support for regular files
This last can result in really stupendous performance for the right workloads. For example, see [1] where the Kernel Test Robot reported over 37% improvement on a large sequential I/O workload. There are also the usual bug fixes and cleanups. Of note are cleanups of the extent status tree to fix potential races that could result in the extent status tree getting corrupted under heavy simultaneous allocation and deallocation to a single file"
Link: https://lore.kernel.org/all/202505161418.ec0d753f-lkp@intel.com/ [1]
* tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
  ext4: Add a WARN_ON_ONCE for querying LAST_IN_LEAF instead
  ext4: Simplify flags in ext4_map_query_blocks()
  ext4: Rename and document EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER
  ext4: Simplify last in leaf check in ext4_map_query_blocks
  ext4: Unwritten to written conversion requires EXT4_EX_NOCACHE
  ext4: only dirty folios when data journaling regular files
  ext4: Add atomic block write documentation
  ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc
  ext4: Add multi-fsblock atomic write support with bigalloc
  ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
  ext4: Make ext4_meta_trans_blocks() non-static for later use
  ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
  ext4: Document an edge case for overwrites
  jbd2: remove journal_t argument from jbd2_superblock_csum()
  jbd2: remove journal_t argument from jbd2_chksum()
  ext4: remove sb argument from ext4_superblock_csum()
  ext4: remove sbi argument from ext4_chksum()
  ext4: enable large folio for regular file
  ext4: make online defragmentation support large folios
  ext4: make the writeback path support large folios
  ...
2025-05-26  Merge tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux  (Linus Torvalds; 1 file, -1/+1)
Pull xfs updates from Carlos Maiolino:
 - Atomic writes for XFS
 - Remove experimental warnings for pNFS, scrub and parent pointers
* tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
  xfs: add inode to zone caching for data placement
  xfs: free the item in xfs_mru_cache_insert on failure
  xfs: remove the EXPERIMENTAL warning for pNFS
  xfs: remove some EXPERIMENTAL warnings
  xfs: Remove deprecated xfs_bufd sysctl parameters
  xfs: stop using set_blocksize
  xfs: allow sysadmins to specify a maximum atomic write limit at mount time
  xfs: update atomic write limits
  xfs: add xfs_calc_atomic_write_unit_max()
  xfs: add xfs_file_dio_write_atomic()
  xfs: commit CoW-based atomic writes atomically
  xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
  xfs: add xfs_atomic_write_cow_iomap_begin()
  xfs: refine atomic write size check in xfs_file_write_iter()
  xfs: refactor xfs_reflink_end_cow_extent()
  xfs: allow block allocator to take an alignment hint
  xfs: ignore HW which cannot atomic write a single block
  xfs: add helpers to compute transaction reservation for finishing intent items
  xfs: add helpers to compute log item overhead
  xfs: separate out setting buftarg atomic writes limits
  ...
2025-05-20  ext4: Add a WARN_ON_ONCE for querying LAST_IN_LEAF instead  (Ritesh Harjani (IBM); 1 file, -1/+3)
We added documentation in ext4_map_blocks() for usage of the EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF flag. But it's better to also add a WARN_ON_ONCE in case anyone tries using this flag with CREATE, to avoid a random issue later, since the depth can change with CREATE and needs to be re-calculated before being used there. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/ee6e82a224c50b432df9ce1ce3333c50182d8473.1747677758.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: Simplify flags in ext4_map_query_blocks()  (Ritesh Harjani (IBM); 2 files, -4/+3)
Now that we have EXT4_EX_QUERY_FILTER mask, let's use that to simplify the filtering of flags for passing to ext4_ext_map_blocks() in ext4_map_query_blocks() function. This allows us to kill the query_flags local variable which is not needed anymore. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/4ae735e83e6f43341e53e2d289e59156a8360134.1747677758.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: Rename and document EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER  (Ritesh Harjani (IBM); 2 files, -2/+7)
Rename EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER to better describe its purpose as a filter mask used specifically in ext4_map_query_blocks(). Add a comment explaining that this macro is used to filter flags needed when querying the on-disk extent tree. We will later use EXT4_EX_QUERY_FILTER mask to add another EXT4_GET_BLOCKS_QUERY needed to lookup in on-disk extent tree. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/51f05d0ba286372eb8693af95bd4b10194b53141.1747677758.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: Simplify last in leaf check in ext4_map_query_blocks  (Ritesh Harjani (IBM); 1 file, -2/+1)
This simplifies the check for last in leaf in ext4_map_query_blocks() and fixes this cocci warning. cocci warnings: (new ones prefixed by >>) >> fs/ext4/inode.c:573:49-51: WARNING !A || A && B is equivalent to !A || B Fixes: 5bb12b1837c0 ("ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505191524.auftmOwK-lkp@intel.com/ Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/5fd5c806218c83f603c578c95997cf7f6da29d74.1747677758.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: Unwritten to written conversion requires EXT4_EX_NOCACHE  (Ritesh Harjani (IBM); 1 file, -1/+1)
This fixes the atomic write patch series after it was rebased on top of the extent status cache cleanup series, i.e. 'commit 402e38e6b71f57 ("ext4: prevent stale extent cache entries caused by concurrent I/O writeback")'. After the above series, the EXT4_GET_BLOCKS_IO_CONVERT_EXT flag, which has the EXT4_GET_BLOCKS_IO_SUBMIT flag set, requires that any kind of io submit context pass EXT4_EX_NOCACHE to avoid caching unnecessary extents in the extent status cache. This patch fixes that by adding the EXT4_EX_NOCACHE flag in ext4_convert_unwritten_extents_atomic() for unwritten to written conversion calls to ext4_map_blocks(). Fixes: b86629c2b299 ("ext4: Add multi-fsblock atomic write support with bigalloc") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/ea0ad9378ff6d31d73f4e53f87548e3a20817689.1747677758.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: only dirty folios when data journaling regular files  (Brian Foster; 1 file, -1/+6)
fstest generic/388 occasionally reproduces a crash that looks as follows: BUG: kernel NULL pointer dereference, address: 0000000000000000 ... Call Trace: <TASK> ext4_block_zero_page_range+0x30c/0x380 [ext4] ext4_truncate+0x436/0x440 [ext4] ext4_process_orphan+0x5d/0x110 [ext4] ext4_orphan_cleanup+0x124/0x4f0 [ext4] ext4_fill_super+0x262d/0x3110 [ext4] get_tree_bdev_flags+0x132/0x1d0 vfs_get_tree+0x26/0xd0 vfs_cmd_create+0x59/0xe0 __do_sys_fsconfig+0x4ed/0x6b0 do_syscall_64+0x82/0x170 ... This occurs when processing a symlink inode from the orphan list. The partial block zeroing code in the truncate path calls ext4_dirty_journalled_data() -> folio_mark_dirty(). The latter calls mapping->a_ops->dirty_folio(), but symlink inodes are not assigned an a_ops vector in ext4, hence the crash. To avoid this problem, update the ext4_dirty_journalled_data() helper to only mark the folio dirty on regular files (for which a_ops is assigned). This also matches the journaling logic in the ext4_symlink() creation path, where ext4_handle_dirty_metadata() is called directly. Fixes: d84c9ebdac1e ("ext4: Mark pages with journalled data dirty") Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20250516173800.175577-1-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
2025-05-20  ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc  (Ritesh Harjani (IBM); 1 file, -2/+5)
The last couple of patches added the needed support for multi-fsblock atomic writes using bigalloc. This patch ensures that the filesystem advertises the needed atomic write unit min and max values for enabling multi-fsblock atomic write support with bigalloc. Acked-by: Darrick J. Wong <djwong@kernel.org> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/5e45d7ed24499024b9079436ba6698dae5298e29.1747337952.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: Add multi-fsblock atomic write support with bigalloc  (Ritesh Harjani (IBM); 4 files, -5/+299)
EXT4 supports the bigalloc feature, which allows the FS to work in units of clusters (groups of blocks) rather than individual blocks. This patch adds atomic write support for bigalloc so that systems with bs = ps can also create an FS using:
 mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
With bigalloc, ext4 can support multi-fsblock atomic writes. We will have to adjust ext4's atomic write unit max value to the cluster size. This can then support atomic writes of any size between [blocksize, clustersize]. This patch adds the required changes to enable multi-fsblock atomic write support using bigalloc in the next patch. In this patch, for block allocation we first query the underlying region of the requested range by calling ext4_map_blocks(). Here are the various cases which we then handle depending upon the underlying mapping type:
1. If the underlying region for the entire requested range is a mapped extent, then we don't call ext4_map_blocks() to allocate anything. We don't need to even start the jbd2 txn in this case.
2. For an append write case, we create a mapped extent.
3. If the underlying region is entirely a hole, then we create an unwritten extent for the requested range.
4. If the underlying region is a large unwritten extent, then we split the extent into 2 unwritten extents of the required size.
5. If the underlying region has any type of mixed mapping, then we call ext4_map_blocks() in a loop to zero out the unwritten and the hole regions within the requested range. This then provides a single mapped extent type mapping for the requested range.
Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO flag only when the underlying extent mapping of the requested range is not entirely a hole, an unwritten extent, or a fully mapped extent. That is, if the underlying region contains a mix of hole(s), unwritten extent(s), and mapped extent(s), we use this loop to ensure that all the short mappings are zeroed out. This guarantees that the entire requested range becomes a single, uniformly mapped extent. It is ok to do so because we know this is being done on a bigalloc enabled filesystem where the block bitmap represents the entire cluster unit. Note that having a single contiguous underlying region of type mapped, unwritten or hole is not a problem. But the reason to avoid writing on top of a mixed mapping region is that atomic writes require all or nothing to get written for the userspace pwritev2 request. So if at any point during the write a crash or a sudden poweroff occurs, the region undergoing the atomic write should read either complete old data or complete new data, but it should never have a mix of both old and new data. So, we first convert any mixed mapping region to a single contiguous mapped extent before any data gets written to it. This is because normally the FS will only convert unwritten extents to written at the end of the write in the ->end_io() call. And if we allowed writes over a mixed mapping and a sudden power off happened in between, we would end up reading a mix of new data (over mapped extents) and old data (over unwritten extents), because the unwritten to written conversion never went through. So to avoid this, and to avoid writes getting torn due to the mixed mapping, we first allocate a single contiguous block mapping and then do the write. Acked-by: Darrick J. Wong <djwong@kernel.org> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/c4965ac3407cbc773f0bc954d0966d9696f5038a.1747337952.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS  (Ritesh Harjani (IBM); 3 files, -8/+112)
There can be a case where there are contiguous extents on adjacent leaf nodes of the on-disk extent tree. So when someone tries to write to this contiguous range, the ext4_map_blocks() call will split the range by returning 1 extent at a time if it is not already cached in the extent_status tree cache (where these extents, when cached, can get merged since they are contiguous). This is fine for a normal write; however, in the case of atomic writes, we can't afford to break the write into two. Now this is also something that will only happen in the slow write case where we call ext4_map_blocks() for each of these extents spread across different leaf nodes. However, there is no guarantee that these extent status cache entries cannot be reclaimed before the last call to ext4_map_blocks() in ext4_map_blocks_atomic_write_slow(). Hence this patch adds support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS. This flag checks if the requested range can be fully found in the extent status cache and returns. If not, it looks it up in the on-disk extent tree via ext4_map_query_blocks(). If the found extent is the last entry in the leaf node, then it goes and queries the next lblk to see if there is an adjacent contiguous extent in the adjacent leaf node of the on-disk extent tree. Even though there can be a case where there are multiple adjacent extent entries spread across multiple leaf nodes, we only read one adjacent leaf block, i.e. a total of 2 extent entries spread across 2 leaf nodes. The reason for this is that we are mostly only going to support atomic writes of up to 64KB, or maybe at most up to 1MB. Acked-by: Darrick J. Wong <djwong@kernel.org> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/6bb563e661f5fbd80e266a9e6ce6e29178f555f6.1747337952.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: Make ext4_meta_trans_blocks() non-static for later use  (Ritesh Harjani (IBM); 2 files, -5/+3)
Let's make ext4_meta_trans_blocks() non-static for use in later functions during ->end_io conversion for atomic writes. We will need this function to estimate journal credits for a special case. Instead of adding another wrapper around it, let's make this non-static. Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/23ce80d4286f792831ce99d13558182ee228fedb.1747337952.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: Check if inode uses extents in ext4_inode_can_atomic_write()  (Ritesh Harjani (IBM); 1 file, -1/+3)
EXT4 only supports doing atomic write on inodes which uses extents, so add a check in ext4_inode_can_atomic_write() which gets called during open. Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/86bb502c979398a736ab371d8f35f6866a477f6c.1747337952.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: Document an edge case for overwrites  (Ritesh Harjani (IBM); 1 file, -0/+4)
ext4_iomap_overwrite_begin() clears the flag for IOMAP_WRITE before calling ext4_iomap_begin(). Document this above ext4_map_blocks() call as it is easy to miss it when focusing on write paths alone. Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/fd50ba05440042dff77d555e463a620a79f8d0e9.1747337952.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: remove sb argument from ext4_superblock_csum()  (Eric Biggers; 4 files, -10/+8)
Since ext4_superblock_csum() no longer uses its sb argument, remove it. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250513053809.699974-3-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: remove sbi argument from ext4_chksum()  (Eric Biggers; 12 files, -55/+45)
Since ext4_chksum() no longer uses its sbi argument, remove it. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250513053809.699974-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: enable large folio for regular file  (Zhang Yi; 4 files, -1/+26)
Besides fsverity, fscrypt, and the data=journal mode, ext4 now supports large folios for regular files. Enable this feature by default. However, since we cannot change the folio order limitation of mappings on active inodes, setting the journal=data mode via ioctl on an active inode will not take immediate effect in non-delalloc mode. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250512063319.3539411-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: make online defragmentation support large folios  (Zhang Yi; 1 file, -7/+4)
move_extent_per_page() currently assumes that each folio is the size of PAGE_SIZE and only copies data for one page. ext4_move_extents() should call move_extent_per_page() for each page. To support larger folios, simply modify the calculations for the block start and end offsets within the folio based on the provided range of 'data_offset_in_page' and 'block_len_in_page'. This function will continue to handle PAGE_SIZE of data at a time and will not convert this function to manage an entire folio. Additionally, we use the source folio to copy data, so it doesn't matter if the source and dest folios are different in size. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250512063319.3539411-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: make the writeback path support large folios  (Zhang Yi; 1 file, -3/+3)
In mpage_map_and_submit_buffers(), the 'lblk' is now aligned to PAGE_SIZE. Convert it to be aligned to folio size. Additionally, modify the wbc->nr_to_write update to reduce the number of pages in a single folio, ensuring that the entire writeback path can support large folios. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250512063319.3539411-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: correct the journal credits calculations of allocating blocks  (Zhang Yi; 2 files, -8/+7)
The journal credits calculation in ext4_ext_index_trans_blocks() is currently inadequate. It only multiplies the depth of the extents tree and doesn't account for the blocks that may be required for adding the leaf extents themselves. After enabling large folios, we can easily run out of handle credits, triggering a warning in jbd2_journal_dirty_metadata() on filesystems with a 1KB block size. This occurs because we may need more extents when iterating through each large folio in ext4_do_writepages()->mpage_map_and_submit_extent(). Therefore, we should modify ext4_ext_index_trans_blocks() to include a count of the leaf extents in the worst case as well. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250512063319.3539411-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4/jbd2: convert jbd2_journal_blocks_per_page() to support large folio  (Zhang Yi; 2 files, -5/+5)
jbd2_journal_blocks_per_page() returns the number of blocks in a single page. Rename it to jbd2_journal_blocks_per_folio() and make it return the number of blocks in the largest folio, preparing for the calculation of journal credit blocks when allocating blocks within a large folio in the writeback path. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250512063319.3539411-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: make __ext4_block_zero_page_range() support large folio  (Zhang Yi; 1 file, -4/+3)
The partial block zero range helper __ext4_block_zero_page_range() currently only supports folios of PAGE_SIZE in size. The calculations for the start block and the offset within a folio for the given range are incorrect. Modify the implementation to use offset_in_folio() instead of directly masking PAGE_SIZE - 1, which will be able to support for large folios. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250512063319.3539411-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
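The bug class here is masking a file position with PAGE_SIZE - 1 when the folio spans several pages. A standalone sketch of why that miscomputes the offset (not the ext4 code):

    #include <stdint.h>
    #include <stdio.h>

    /* For a folio that starts at byte 0 and spans 16KB, a position in its
     * second page must keep its full offset within the folio; masking with
     * PAGE_SIZE - 1 wraps it back into the first page instead. Standalone
     * arithmetic demo only. */
    int main(void)
    {
            uint64_t page_size = 4096, folio_start = 0;
            uint64_t pos = folio_start + 5000;         /* inside the folio's 2nd page */

            uint64_t masked = pos & (page_size - 1);   /* 904, wrong page       */
            uint64_t in_folio = pos - folio_start;     /* 5000, correct offset  */

            printf("page mask: %llu, offset in folio: %llu\n",
                   (unsigned long long)masked, (unsigned long long)in_folio);
            return 0;
    }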
2025-05-20  ext4: make regular file's buffered write path support large folios  (Zhang Yi; 1 file, -10/+21)
The current buffered write path in ext4 can only allocate and handle folios of PAGE_SIZE size. To support larger folios, modify ext4_da_write_begin() and ext4_write_begin() to allocate higher-order folios, and trim the write length if it exceeds the folio size. Additionally, in ext4_da_do_write_end(), use offset_in_folio() instead of PAGE_SIZE. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250512063319.3539411-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
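Trimming the write length means a single iteration never copies past the end of the folio it has locked. A minimal sketch of that calculation (standalone, names illustrative, not the ext4 code):

    #include <stdint.h>
    #include <stdio.h>

    /* A write starting inside a folio may only copy up to the folio's end;
     * the remainder is handled by the next iteration with a new folio. */
    static uint64_t trim_to_folio(uint64_t offset_in_folio, uint64_t len,
                                  uint64_t folio_size)
    {
            uint64_t room = folio_size - offset_in_folio;
            return len < room ? len : room;
    }

    int main(void)
    {
            /* 100000 bytes requested, but only 5536 fit in this 64KB folio */
            printf("%llu\n",
                   (unsigned long long)trim_to_folio(60000, 100000, 65536));
            return 0;
    }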
2025-05-20  ext4: make ext4_mpage_readpages() support large folios  (Zhang Yi; 1 file, -11/+17)
ext4_mpage_readpages() currently assumes that each folio is the size of PAGE_SIZE. Modify it to atomically calculate the number of blocks per folio and iterate through the blocks in each folio, which would allow for support of larger folios. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250512063319.3539411-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20  ext4: ensure i_size is smaller than maxbytes  (Zhang Yi; 1 file, -1/+2)
The inode i_size cannot be larger than maxbytes; check this while loading the inode from disk. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250506012009.3896990-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org