summaryrefslogtreecommitdiff
path: root/fs/f2fs
AgeCommit message (Collapse)AuthorFilesLines
7 daysf2fs: protect extension_list reading with sb_lock in f2fs_sbi_show()Yongpeng Yang1-2/+5
[ Upstream commit 5909bedbed38c558bee7cb6758ceedf9bc3a9194 ] In f2fs_sbi_show(), the extension_list, extension_count and hot_ext_count are read without holding sbi->sb_lock. If a concurrent sysfs store modifies the extension list via f2fs_update_extension_list(), the show path may read inconsistent count and array contents, potentially leading to out-of-bounds access or displaying stale data. Fix this by holding sb_lock around the entire extension list read and format operation. Fixes: b6a06cbbb5f7 ("f2fs: support hot file extension") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
7 daysf2fs: allow empty mount string for Opt_usr|grp|projjquotaJaegeuk Kim1-12/+15
[ Upstream commit 2a3db1e02ce08c14af04da70bb99e8a0a31eb9e8 ] The fsparam_string_empty() gives an error when mounting without string, since its type is set to fsparam_flag in VFS. So, let's allow the flag as well. This addresses xfstests/f2fs/015 and f2fs/021. Fixes: d18535132523 ("f2fs: separate the options parsing and options checking") Reviewed-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
7 daysf2fs: fix to preserve previous reserve_{blocks,node} value when remountZhiguo Niu1-0/+2
[ Upstream commit 01968164d94762db2f703647c5acfa28613844f1 ] The following steps will change previous value of reserve_{blocks,node}, this dones not match the original intention. 1.mount -t f2fs -o reserve_root=8192 imgfile test_mount/ F2FS-fs (loop56): Mounted with checkpoint version = 1b69f8c7 mount info: /dev/block/loop56 on /data/test_mount type f2fs (xxx,reserve_root=8192,reserve_node=0,resuid=0,resgid=0,xxx) 2.mount -t f2fs -o remount,reserve_root=4096 /data/test_mount F2FS-fs (loop56): Preserve previous reserve_root=8192 check mount info: reserve_root change to 4096 /dev/block/loop56 on /data/test_mount type f2fs (xxx,reserve_root=4096,reserve_node=0,resuid=0,resgid=0,xxx) Prior to commit d18535132523 ("f2fs: separate the options parsing and options checking"), the value of reserve_{blocks,node} was only set during the first mount, along with the corresponding mount option F2FS_MOUNT_RESERVE_{ROOT,NODE} . If the mount option F2FS_MOUNT_RESERVE_{ROOT,NODE} was found to have been set during the mount/remount, the previously value of reserve_{blocks,node} would also be preserved, as shown in the code below. if (test_opt(sbi, RESERVE_ROOT)) { f2fs_info(sbi, "Preserve previous reserve_root=%u", F2FS_OPTION(sbi).root_reserved_blocks); } else { F2FS_OPTION(sbi).root_reserved_blocks = arg; set_opt(sbi, RESERVE_ROOT); } But commit d18535132523 ("f2fs: separate the options parsing and options checking") only preserved the previous mount option; it did not preserve the previous value of reserve_{blocks,node}. Since value of reserve_{blocks,node} value is assigned or not depends on ctx->spec_mask, ctx->spec_mask should be alos handled in f2fs_check_opt_consistency. This patch will clear the corresponding ctx->spec_mask bits in f2fs_check_opt_consistency to preserve the previously values of reserve_{blocks,node} if it already have a value. Fixes: d18535132523 ("f2fs: separate the options parsing and options checking") Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
7 daysf2fs: fix data loss caused by incorrect use of nat_entry flagYongpeng Yang1-0/+3
[ Upstream commit 238e14eb7226f883b72caccd2d37bf5707df066b ] Data loss can occur when fsync is performed on a newly created file (before any checkpoint has been written) concurrently with a checkpoint operation. The scenario is as follows: create & write & fsync 'file A' write checkpoint - f2fs_do_sync_file // inline inode - f2fs_write_inode // inode folio is dirty - f2fs_write_checkpoint - f2fs_flush_merged_writes - f2fs_sync_node_pages - f2fs_flush_nat_entries - f2fs_fsync_node_pages // no dirty node - f2fs_need_inode_block_update // return false SPO and lost 'file A' f2fs_flush_nat_entries() sets the IS_CHECKPOINTED and HAS_LAST_FSYNC flags for the nat_entry, but this does not mean that the checkpoint has actually completed successfully. However, f2fs_need_inode_block_update() checks these flags and incorrectly assumes that the checkpoint has finished. The root cause is that the semantics of IS_CHECKPOINTED and HAS_LAST_FSYNC are only guaranteed after the checkpoint write fully completes. This patch modifies f2fs_need_inode_block_update() to acquire the sbi->node_write lock before reading the nat_entry flags, ensuring that once IS_CHECKPOINTED and HAS_LAST_FSYNC are observed to be set, the checkpoint operation has already completed. Fixes: e05df3b115e7 ("f2fs: add node operations") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
7 daysf2fs: avoid reading already updated pages during GCJianan Huang1-1/+4
[ Upstream commit 570e2ccc7cb35fe720106964e65060602d3d2ac4 ] We found the following issue during fuzz testing: page: refcount:3 mapcount:0 mapping:00000000b6e89c65 index:0x18b2dc pfn:0x161ba9 memcg:f8ffff800e269c00 aops:f2fs_meta_aops ino:2 flags: 0x52880000000080a9(locked|waiters|uptodate|lru|private|zone=1|kasantag=0x4a) raw: 52880000000080a9 fffffffec6e17588 fffffffec0ccc088 a7ffff8067063618 raw: 000000000018b2dc 0000000000000009 00000003ffffffff f8ffff800e269c00 page dumped because: VM_BUG_ON_FOLIO(folio_test_uptodate(folio)) page_owner tracks the page as allocated post_alloc_hook+0x58c/0x5ec prep_new_page+0x34/0x284 get_page_from_freelist+0x2dcc/0x2e8c __alloc_pages_noprof+0x280/0x76c __folio_alloc_noprof+0x18/0xac __filemap_get_folio+0x6bc/0xdc4 pagecache_get_page+0x3c/0x104 do_garbage_collect+0x5c78/0x77a4 f2fs_gc+0xd74/0x25f0 gc_thread_func+0xb28/0x2930 kthread+0x464/0x5d8 ret_from_fork+0x10/0x20 ------------[ cut here ]------------ kernel BUG at mm/filemap.c:1563! folio_end_read+0x140/0x168 f2fs_finish_read_bio+0x5c4/0xb80 f2fs_read_end_io+0x64c/0x708 bio_endio+0x85c/0x8c0 blk_update_request+0x690/0x127c scsi_end_request+0x9c/0xb8c scsi_io_completion+0xf0/0x250 scsi_finish_command+0x430/0x45c scsi_complete+0x178/0x6d4 blk_mq_complete_request+0xcc/0x104 scsi_done_internal+0x214/0x454 scsi_done+0x24/0x34 which is similar to the problem reported by syzbot: https://syzkaller.appspot.com/bug?extid=3686758660f980b402dc This case is consistent with the description in commit 9bf1a3f ("f2fs: avoid GC causing encrypted file corrupted"): Page 1 is moved from blkaddr A to blkaddr B by move_data_block, and after being written it is marked as uptodate. Then, Page 1 is moved from blkaddr B to blkaddr C, VM_BUG_ON_FOLIO was triggered in the endio initiated by ra_data_block. There is no need to read Page 1 again from blkaddr B, since it has already been updated. Therefore, avoid initiating I/O in this case. Fixes: 6aa58d8ad20a ("f2fs: readahead encrypted block during GC") Signed-off-by: Jianan Huang <huangjianan@xiaomi.com> Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-14f2fs: fix fsck inconsistency caused by FGGC of node blockYongpeng Yang1-14/+13
commit c3e238bd1f56993f205ef83889d406dfeaf717a8 upstream. During FGGC node block migration, fsck may incorrectly treat the migrated node block as fsync-written data. The reproduction scenario: root@vm:/mnt/f2fs# seq 1 2048 | xargs -n 1 ./test_sync // write inline inode and sync root@vm:/mnt/f2fs# rm -f 1 root@vm:/mnt/f2fs# sync root@vm:/mnt/f2fs# f2fs_io gc_range // move data block in sync mode and not write CP SPO, "fsck --dry-run" find inode has already checkpointed but still with DENT_BIT_SHIFT set The root cause is that GC does not clear the dentry mark and fsync mark during node block migration, leading fsck to misinterpret them as user-issued fsync writes. In BGGC mode, node block migration is handled by f2fs_sync_node_pages(), which guarantees the dentry and fsync marks are cleared before writing. This patch move the set/clear of the fsync|dentry marks into __write_node_folio to make the logic clearer, and ensures the fsync|dentry mark is cleared in FGGC. Cc: stable@kernel.org Fixes: da011cc0da8c ("f2fs: move node pages only in victim section during GC") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14f2fs: fix inline data not being written to disk in writeback pathYongpeng Yang3-1/+12
commit fe9b8b30b97102859a9102be7bd2a09803bd90bd upstream. When f2fs_fiemap() is called with `fileinfo->fi_flags` containing the FIEMAP_FLAG_SYNC flag, it attempts to write data to disk before retrieving file mappings via filemap_write_and_wait(). However, there is an issue where the file does not get mapped as expected. The following scenario can occur: root@vm:/mnt/f2fs# dd if=/dev/zero of=data.3k bs=3k count=1 root@vm:/mnt/f2fs# xfs_io data.3k -c "fiemap -v 0 4096" data.3k: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..5]: 0..5 6 0x307 The root cause of this issue is that f2fs_write_single_data_page() only calls f2fs_write_inline_data() to copy data from the data folio to the inode folio, and it clears the dirty flag on the data folio. However, it does not mark the data folio as writeback. When __filemap_fdatawait_range() checks for folios with the writeback flag, it returns early, causing f2fs_fiemap() to report that the file has no mapping. To fix this issue, the solution is to call f2fs_write_single_node_folio() in f2fs_inline_data_fiemap() when getting fiemap with FIEMAP_FLAG_SYNC flags. This patch ensures that the inode folio is written back and the writeback process completes before proceeding. Cc: stable@kernel.org Fixes: 9ffe0fb5f3bb ("f2fs: handle inline data operations") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14f2fs: refactor f2fs_move_node_folio functionYongpeng Yang1-22/+32
commit 92c20989366e023b74fa0c1028af9436c1917dbf upstream. This patch refactor the f2fs_move_node_folio() function. No logical changes. Cc: stable@kernel.org Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14f2fs: fix uninitialized kobject put in f2fs_init_sysfs()Guangshuo Li1-4/+6
commit b635f2ecdb5ad34f9c967cabb704d6bed9382fd0 upstream. In f2fs_init_sysfs(), all failure paths after kset_register() jump to put_kobject, which unconditionally releases both f2fs_tune and f2fs_feat. If kobject_init_and_add(&f2fs_feat, ...) fails, f2fs_tune has not been initialized yet, so calling kobject_put(&f2fs_tune) is invalid. Fix this by splitting the unwind path so each error path only releases objects that were successfully initialized. Fixes: a907f3a68ee26ba4 ("f2fs: add a sysfs entry to reclaim POSIX_FADV_NOREUSE pages") Cc: stable@vger.kernel.org Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14f2fs: fix node_cnt race between extent node destroy and writebackYongpeng Yang1-7/+10
commit ed78aeebef05212ef7dca93bd931e4eff67c113f upstream. f2fs_destroy_extent_node() does not set FI_NO_EXTENT before clearing extent nodes. When called from f2fs_drop_inode() with I_SYNC set, concurrent kworker writeback can insert new extent nodes into the same extent tree, racing with the destroy and triggering f2fs_bug_on() in __destroy_extent_node(). The scenario is as follows: drop inode writeback - iput - f2fs_drop_inode // I_SYNC set - f2fs_destroy_extent_node - __destroy_extent_node - while (node_cnt) { write_lock(&et->lock) __free_extent_tree write_unlock(&et->lock) - __writeback_single_inode - f2fs_outplace_write_data - f2fs_update_read_extent_cache - __update_extent_tree_range // FI_NO_EXTENT not set, // insert new extent node } // node_cnt == 0, exit while - f2fs_bug_on(node_cnt) // node_cnt > 0 Additionally, __update_extent_tree_range() only checks FI_NO_EXTENT for EX_READ type, leaving EX_BLOCK_AGE updates completely unprotected. This patch set FI_NO_EXTENT under et->lock in __destroy_extent_node(), consistent with other callers (__update_extent_tree_range and __drop_extent_tree) and check FI_NO_EXTENT for both EX_READ and EX_BLOCK_AGE tree. Fixes: 3fc5d5a182f6 ("f2fs: fix to shrink read extent node in batches") Cc: stable@vger.kernel.org Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14f2fs: fix incorrect multidevice info in trace_f2fs_map_blocks()Yongpeng Yang1-1/+2
commit eb2ca3ca983551a80e16a4a25df5a4ce59df8484 upstream. When f2fs_map_blocks()->f2fs_map_blocks_cached() hits the read extent cache, map->m_multidev_dio is not updated, which leads to incorrect multidevice information being reported by trace_f2fs_map_blocks(). This patch updates map->m_multidev_dio in f2fs_map_blocks_cached() when the read extent cache is hit. Cc: stable@kernel.org Fixes: 0094e98bd147 ("f2fs: factor a f2fs_map_blocks_cached helper") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14f2fs: fix incorrect file address mapping when inline inode is unwrittenYongpeng Yang1-4/+9
commit 68a0178981a0f493295afa29f8880246e561494c upstream. When `fileinfo->fi_flags` does not have the `FIEMAP_FLAG_SYNC` bit set and inline data has not been persisted yet, the physical address of the extent is calculated incorrectly for unwritten inline inodes. root@vm:/mnt/f2fs# dd if=/dev/zero of=data.3k bs=3k count=1 root@vm:/mnt/f2fs# f2fs_io fiemap 0 100 data.3k Fiemap: offset = 0 len = 100 logical addr. physical addr. length flags 0 0000000000000000 00000ffffffff16c 0000000000000c00 00000301 This patch fixes the issue by checking if the inode's address is valid. If the inline inode is unwritten, set the physical address to 0 and mark the extent with `FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_DELALLOC` flags. Cc: stable@kernel.org Fixes: 67f8cf3cee6f ("f2fs: support fiemap for inline_data") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14f2fs: fix fsck inconsistency caused by incorrect nat_entry flag usageYongpeng Yang1-9/+5
commit 019f9dda7f66e55eb94cd32e1d3fff5835f73fbc upstream. f2fs_need_dentry_mark() reads nat_entry flags without mutual exclusion with the checkpoint path, which can result in an incorrect inode block marking state. The scenario is as follows: create & write & fsync 'file A' write checkpoint - f2fs_do_sync_file // inline inode - f2fs_write_inode // inode folio is dirty - f2fs_write_checkpoint - f2fs_flush_merged_writes - f2fs_sync_node_pages - f2fs_fsync_node_pages // no dirty node - f2fs_need_inode_block_update // return true - f2fs_fsync_node_pages // inode dirtied - f2fs_need_dentry_mark //return true - f2fs_flush_nat_entries - f2fs_write_checkpoint end - __write_node_folio // inode with DENT_BIT_SHIFT set SPO, "fsck --dry-run" find inode has already checkpointed but still with DENT_BIT_SHIFT set The state observed by f2fs_need_dentry_mark() can differ from the state observed in __write_node_folio() after acquiring sbi->node_write. The root cause is that the semantics of IS_CHECKPOINTED and HAS_FSYNCED_INODE are only guaranteed after the checkpoint write has fully completed. This patch moves set_dentry_mark() into __write_node_folio() and protects it with the sbi->node_write lock. Cc: stable@kernel.org Fixes: 88bd02c9472a ("f2fs: fix conditions to remain recovery information in f2fs_sync_file") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14f2fs: fix fiemap boundary handling when read extent cache is incompleteYongpeng Yang1-3/+22
commit 95e159ad3e52f7478cfd22e44ec37c9f334f8993 upstream. f2fs_fiemap() calls f2fs_map_blocks() to obtain the block mapping a file, and then merges contiguous mappings into extents. If the mapping is found in the read extent cache, node blocks do not need to be read. However, in the following scenario, a contiguous extent can be split into two extents: $ dd if=/dev/zero of=data.128M bs=1M count=128 $ losetup -f data.128M $ mkfs.f2fs /dev/loop0 -f $ mount -o mode=lfs /dev/loop0 /mnt/f2fs/ $ cd /mnt/f2fs/ $ dd if=/dev/zero of=data.72M bs=1M count=72 && sync $ dd if=/dev/zero of=data.4M bs=1M count=4 && sync $ dd if=/dev/zero of=data.4M bs=1M count=2 seek=2 conv=notrunc && sync $ echo 3 > /proc/sys/vm/drop_caches $ dd if=/dev/zero of=data.4M bs=1M count=2 seek=0 conv=notrunc && sync $ dd if=/dev/zero of=data.4M bs=1M count=2 seek=0 conv=notrunc && sync $ f2fs_io fiemap 0 1024 data.4M Fiemap: offset = 0 len = 1024 logical addr. physical addr. length flags 0 0000000000000000 0000000006400000 0000000000200000 00001000 1 0000000000200000 0000000006600000 0000000000200000 00001001 Although the physical addresses of the ranges 0~2MB and 2M~4MB are contiguous, the mapping for the 2M~4MB range is not present in memory. When the physical addresses for the 0~2MB range are updated, no merge happens because the adjacent mapping is missing from the in-memory cache. As a result, fiemap reports two separate extents instead of a single contiguous one. The root cause is that the read extent cache does not guarantee that all blocks of an extent are present in memory. Therefore, when the extent length returned by f2fs_map_blocks_cached() is smaller than maxblocks, the remaining mappings are retrieved via f2fs_get_dnode_of_data() to ensure correct fiemap extent boundary handling. Cc: stable@kernel.org Fixes: cd8fc5226bef ("f2fs: remove the create argument to f2fs_map_blocks") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14f2fs: fix false alarm of lockdep on cp_global_sem lockChao Yu2-0/+14
commit 6a5e3de9c2bb0b691d16789a5d19e9276a09b308 upstream. lockdep reported a potential deadlock: a) TCMU device removal context: - call del_gendisk() to get q->q_usage_counter - call start_flush_work() to get work_completion of wb->dwork b) f2fs writeback context: - in wb_workfn(), which holds work_completion of wb->dwork - call f2fs_balance_fs() to get sbi->gc_lock c) f2fs vfs_write context: - call f2fs_gc() to get sbi->gc_lock - call f2fs_write_checkpoint() to get sbi->cp_global_sem d) f2fs mount context: - call recover_fsync_data() to get sbi->cp_global_sem - call f2fs_check_and_fix_write_pointer() to call blkdev_report_zones() that goes down to blk_mq_alloc_request and get q->q_usage_counter Original callstack is in Closes tag. However, I think this is a false alarm due to before mount returns successfully (context d), we can not access file therein via vfs_write (context c). Let's introduce per-sb cp_global_sem_key, and assign the key for cp_global_sem, so that lockdep can recognize cp_global_sem from different super block correctly. A lot of work are done by Shin'ichiro Kawasaki, thanks a lot for the work. Fixes: c426d99127b1 ("f2fs: Check write pointer consistency of open zones") Cc: stable@kernel.org Reported-and-tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/linux-f2fs-devel/20260218125237.3340441-1-shinichiro.kawasaki@wdc.com Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14f2fs: add READ_ONCE() for i_blocks in f2fs_update_inode()Cen Zhang1-1/+1
commit 5471834a96fb697874be2ca0b052e74bcf3c23d1 upstream. f2fs_update_inode() reads inode->i_blocks without holding i_lock to serialize it to the on-disk inode, while concurrent truncate or allocation paths may modify i_blocks under i_lock. Since blkcnt_t is u64, this risks torn reads on 32-bit architectures. Following the approach in ext4_inode_blocks_set(), add READ_ONCE() to prevent potential compiler-induced tearing. Fixes: 19f99cee206c ("f2fs: add core inode operations") Cc: stable@vger.kernel.org Signed-off-by: Cen Zhang <zzzccc427@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-27f2fs: fix use-after-free of sbi in f2fs_compress_write_end_io()George Saad1-3/+11
commit 39d4ee19c1e7d753dd655aebee632271b171f43a upstream. In f2fs_compress_write_end_io(), dec_page_count(sbi, type) can bring the F2FS_WB_CP_DATA counter to zero, unblocking f2fs_wait_on_all_pages() in f2fs_put_super() on a concurrent unmount CPU. The unmount path then proceeds to call f2fs_destroy_page_array_cache(sbi), which destroys sbi->page_array_slab via kmem_cache_destroy(), and eventually kfree(sbi). Meanwhile, the bio completion callback is still executing: when it reaches page_array_free(sbi, ...), it dereferences sbi->page_array_slab — a destroyed slab cache — to call kmem_cache_free(), causing a use-after-free. This is the same class of bug as CVE-2026-23234 (which fixed the equivalent race in f2fs_write_end_io() in data.c), but in the compressed writeback completion path that was not covered by that fix. Fix this by moving dec_page_count() to after page_array_free(), so that all sbi accesses complete before the counter decrement that can unblock unmount. For non-last folios (where atomic_dec_return on cic->pending_pages is nonzero), dec_page_count is called immediately before returning — page_array_free is not reached on this path, so there is no post-decrement sbi access. For the last folio, page_array_free runs while the F2FS_WB_CP_DATA counter is still nonzero (this folio has not yet decremented it), keeping sbi alive, and dec_page_count runs as the final operation. Fixes: 4c8ff7095bef ("f2fs: support data compression") Cc: stable@vger.kernel.org Signed-off-by: George Saad <geoo115@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-27f2fs: fix to avoid uninit-value access in f2fs_sanity_check_node_footerChao Yu1-1/+2
commit 7b9161a605e91d0987e2596a245dc1f21621b23f upstream. syzbot reported a f2fs bug as below: BUG: KMSAN: uninit-value in f2fs_sanity_check_node_footer+0x374/0xa20 fs/f2fs/node.c:1520 f2fs_sanity_check_node_footer+0x374/0xa20 fs/f2fs/node.c:1520 f2fs_finish_read_bio+0xe1e/0x1d60 fs/f2fs/data.c:177 f2fs_read_end_io+0x6ab/0x2220 fs/f2fs/data.c:-1 bio_endio+0x1006/0x1160 block/bio.c:1792 submit_bio_noacct+0x533/0x2960 block/blk-core.c:891 submit_bio+0x57a/0x620 block/blk-core.c:926 blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline] f2fs_submit_read_bio+0x12c/0x360 fs/f2fs/data.c:557 f2fs_submit_page_bio+0xee2/0x1450 fs/f2fs/data.c:775 read_node_folio+0x384/0x4b0 fs/f2fs/node.c:1481 __get_node_folio+0x5db/0x15d0 fs/f2fs/node.c:1576 f2fs_get_inode_folio+0x40/0x50 fs/f2fs/node.c:1623 do_read_inode fs/f2fs/inode.c:425 [inline] f2fs_iget+0x1209/0x9380 fs/f2fs/inode.c:596 f2fs_fill_super+0x8f5a/0xb2e0 fs/f2fs/super.c:5184 get_tree_bdev_flags+0x6e6/0x920 fs/super.c:1694 get_tree_bdev+0x38/0x50 fs/super.c:1717 f2fs_get_tree+0x35/0x40 fs/f2fs/super.c:5436 vfs_get_tree+0xb3/0x5d0 fs/super.c:1754 fc_mount fs/namespace.c:1193 [inline] do_new_mount_fc fs/namespace.c:3763 [inline] do_new_mount+0x885/0x1dd0 fs/namespace.c:3839 path_mount+0x7a2/0x20b0 fs/namespace.c:4159 do_mount fs/namespace.c:4172 [inline] __do_sys_mount fs/namespace.c:4361 [inline] __se_sys_mount+0x704/0x7f0 fs/namespace.c:4338 __x64_sys_mount+0xe4/0x150 fs/namespace.c:4338 x64_sys_call+0x39f0/0x3ea0 arch/x86/include/generated/asm/syscalls_64.h:166 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x134/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f The root cause is: in f2fs_finish_read_bio(), we may access uninit data in folio if we failed to read the data from device into folio, let's add a check condition to avoid such issue. Cc: stable@kernel.org Fixes: 50ac3ecd8e05 ("f2fs: fix to do sanity check on node footer in {read,write}_end_io") Reported-by: syzbot+9aac813cdc456cdd49f8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/69a9ca26.a70a0220.305d9a.0000.GAE@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-27f2fs: fix to avoid memory leak in f2fs_rename()Chao Yu1-0/+1
commit 3cf11e6f36c170050c12171dd6fd3142711478fc upstream. syzbot reported a f2fs bug as below: BUG: memory leak unreferenced object 0xffff888127f70830 (size 16): comm "syz.0.23", pid 6144, jiffies 4294943712 hex dump (first 16 bytes): 3c af 57 72 5b e6 8f ad 6e 8e fd 33 42 39 03 ff <.Wr[...n..3B9.. backtrace (crc 925f8a80): kmemleak_alloc_recursive include/linux/kmemleak.h:44 [inline] slab_post_alloc_hook mm/slub.c:4520 [inline] slab_alloc_node mm/slub.c:4844 [inline] __do_kmalloc_node mm/slub.c:5237 [inline] __kmalloc_noprof+0x3bd/0x560 mm/slub.c:5250 kmalloc_noprof include/linux/slab.h:954 [inline] fscrypt_setup_filename+0x15e/0x3b0 fs/crypto/fname.c:364 f2fs_setup_filename+0x52/0xb0 fs/f2fs/dir.c:143 f2fs_rename+0x159/0xca0 fs/f2fs/namei.c:961 f2fs_rename2+0xd5/0xf20 fs/f2fs/namei.c:1308 vfs_rename+0x7ff/0x1250 fs/namei.c:6026 filename_renameat2+0x4f4/0x660 fs/namei.c:6144 __do_sys_renameat2 fs/namei.c:6173 [inline] __se_sys_renameat2 fs/namei.c:6168 [inline] __x64_sys_renameat2+0x59/0x80 fs/namei.c:6168 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f The root cause is in commit 40b2d55e0452 ("f2fs: fix to create selinux label during whiteout initialization"), we added a call to f2fs_setup_filename() without a matching call to f2fs_free_filename(), fix it. Fixes: 40b2d55e0452 ("f2fs: fix to create selinux label during whiteout initialization") Cc: stable@kernel.org Reported-by: syzbot+cf7946ab25b21abc4b66@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/69a75fe1.a70a0220.b118c.0014.GAE@google.com Suggested-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-27f2fs: fix UAF caused by decrementing sbi->nr_pages[] in f2fs_write_end_io()Yongpeng Yang1-2/+2
commit 2d9c4a4ed4eef1f82c5b16b037aee8bad819fd53 upstream. The xfstests case "generic/107" and syzbot have both reported a NULL pointer dereference. The concurrent scenario that triggers the panic is as follows: F2FS_WB_CP_DATA write callback umount - f2fs_write_checkpoint - f2fs_wait_on_all_pages(sbi, F2FS_WB_CP_DATA) - blk_mq_end_request - bio_endio - f2fs_write_end_io : dec_page_count(sbi, F2FS_WB_CP_DATA) : wake_up(&sbi->cp_wait) - kill_f2fs_super - kill_block_super - f2fs_put_super : iput(sbi->node_inode) : sbi->node_inode = NULL : f2fs_in_warm_node_list - is_node_folio // sbi->node_inode is NULL and panic The root cause is that f2fs_put_super() calls iput(sbi->node_inode) and sets sbi->node_inode to NULL after sbi->nr_pages[F2FS_WB_CP_DATA] is decremented to zero. As a result, f2fs_in_warm_node_list() may dereference a NULL node_inode when checking whether a folio belongs to the node inode, leading to a panic. This patch fixes the issue by calling f2fs_in_warm_node_list() before decrementing sbi->nr_pages[F2FS_WB_CP_DATA], thus preventing the use-after-free condition. Cc: stable@kernel.org Fixes: 50fa53eccf9f ("f2fs: fix to avoid broken of dnode block list") Reported-by: syzbot+6e4cb1cac5efc96ea0ca@syzkaller.appspotmail.com Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-27f2fs: fix to do sanity check on dcc->discard_cmd_cnt conditionallyChao Yu3-7/+12
commit 6af249c996f7d73a3435f9e577956fa259347d18 upstream. Syzbot reported a f2fs bug as below: ------------[ cut here ]------------ kernel BUG at fs/f2fs/segment.c:1900! Oops: invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 1 UID: 0 PID: 6527 Comm: syz.5.110 Not tainted syzkaller #0 PREEMPT_{RT,(full)} Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026 RIP: 0010:f2fs_issue_discard_timeout+0x59b/0x5a0 fs/f2fs/segment.c:1900 Code: d9 80 e1 07 80 c1 03 38 c1 0f 8c d6 fe ff ff 48 89 df e8 a8 5e fa fd e9 c9 fe ff ff e8 4e 46 94 fd 90 0f 0b e8 46 46 94 fd 90 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 RSP: 0018:ffffc9000494f940 EFLAGS: 00010283 RAX: ffffffff843009ca RBX: 0000000000000001 RCX: 0000000000080000 RDX: ffffc9001ca78000 RSI: 00000000000029f3 RDI: 00000000000029f4 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: dffffc0000000000 R11: ffffed100893a431 R12: 1ffff1100893a430 R13: 1ffff1100c2b702c R14: dffffc0000000000 R15: ffff8880449d2160 FS: 00007ffa35fed6c0(0000) GS:ffff88812643d000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f2b68634000 CR3: 0000000039f62000 CR4: 00000000003526f0 Call Trace: <TASK> __f2fs_remount fs/f2fs/super.c:2960 [inline] f2fs_reconfigure+0x108a/0x1710 fs/f2fs/super.c:5443 reconfigure_super+0x227/0x8a0 fs/super.c:1080 do_remount fs/namespace.c:3391 [inline] path_mount+0xdc5/0x10e0 fs/namespace.c:4151 do_mount fs/namespace.c:4172 [inline] __do_sys_mount fs/namespace.c:4361 [inline] __se_sys_mount+0x31d/0x420 fs/namespace.c:4338 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ffa37dbda0a The root cause is there will be race condition in between f2fs_ioc_fitrim() and f2fs_remount(): - f2fs_remount - f2fs_ioc_fitrim - f2fs_issue_discard_timeout - __issue_discard_cmd - __drop_discard_cmd - __wait_all_discard_cmd - f2fs_trim_fs - f2fs_write_checkpoint - f2fs_clear_prefree_segments - f2fs_issue_discard - __issue_discard_async - __queue_discard_cmd - __update_discard_tree_range - __insert_discard_cmd - __create_discard_cmd : atomic_inc(&dcc->discard_cmd_cnt); - sanity check on dcc->discard_cmd_cnt (expect discard_cmd_cnt to be zero) This will only happen when fitrim races w/ remount rw, if we remount to readonly filesystem, remount will wait until mnt_pcp.mnt_writers to zero, that means fitrim is not in process at that time. Cc: stable@kernel.org Fixes: 2482c4325dfe ("f2fs: detect bug_on in f2fs_wait_discard_bios") Reported-by: syzbot+62538b67389ee582837a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/69b07d7c.050a0220.8df7.09a1.GAE@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-23Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds1-6/+5
Pull fsverity fixes from Eric Biggers: - Fix a build error on parisc - Remove the non-large-folio-aware function fsverity_verify_page() * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: fix build error by adding fsverity_readahead() stub fsverity: remove fsverity_verify_page() f2fs: make f2fs_verify_cluster() partially large-folio-aware f2fs: remove unnecessary ClearPageUptodate in f2fs_verify_cluster()
2026-02-22Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds1-4/+4
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook1-4/+4
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-18f2fs: make f2fs_verify_cluster() partially large-folio-awareEric Biggers1-4/+5
f2fs_verify_cluster() is the only remaining caller of the non-large-folio-aware function fsverity_verify_page(). To unblock the removal of that function, change f2fs_verify_cluster() to verify the entire folio of each page and mark it up-to-date. Note that this doesn't actually make f2fs_verify_cluster() large-folio-aware, as it is still passed an array of pages. Currently, it's never called with large folios. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20260218010630.7407-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-18f2fs: remove unnecessary ClearPageUptodate in f2fs_verify_cluster()Eric Biggers1-2/+0
Remove the unnecessary clearing of PG_uptodate. It's guaranteed to already be clear. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20260218010630.7407-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-17Merge tag 'vfs-7.0-rc1.misc.2' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull more misc vfs updates from Christian Brauner: "Features: - Optimize close_range() from O(range size) to O(active FDs) by using find_next_bit() on the open_fds bitmap instead of linearly scanning the entire requested range. This is a significant improvement for large-range close operations on sparse file descriptor tables. - Add FS_XFLAG_VERITY file attribute for fs-verity files, retrievable via FS_IOC_FSGETXATTR and file_getattr(). The flag is read-only. Add tracepoints for fs-verity enable and verify operations, replacing the previously removed debug printk's. - Prevent nfsd from exporting special kernel filesystems like pidfs and nsfs. These filesystems have custom ->open() and ->permission() export methods that are designed for open_by_handle_at(2) only and are incompatible with nfsd. Update the exportfs documentation accordingly. Fixes: - Fix KMSAN uninit-value in ovl_fill_real() where strcmp() was used on a non-null-terminated decrypted directory entry name from fscrypt. This triggered on encrypted lower layers when the decrypted name buffer contained uninitialized tail data. The fix also adds VFS-level name_is_dot(), name_is_dotdot(), and name_is_dot_dotdot() helpers, replacing various open-coded "." and ".." checks across the tree. - Fix read-only fsflags not being reset together with xflags in vfs_fileattr_set(). Currently harmless since no read-only xflags overlap with flags, but this would cause inconsistencies for any future shared read-only flag - Return -EREMOTE instead of -ESRCH from PIDFD_GET_INFO when the target process is in a different pid namespace. This lets userspace distinguish "process exited" from "process in another namespace", matching glibc's pidfd_getpid() behavior Cleanups: - Use C-string literals in the Rust seq_file bindings, replacing the kernel::c_str!() macro (available since Rust 1.77) - Fix typo in d_walk_ret enum comment, add porting notes for the readlink_copy() calling convention change" * tag 'vfs-7.0-rc1.misc.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add porting notes about readlink_copy() pidfs: return -EREMOTE when PIDFD_GET_INFO is called on another ns nfsd: do not allow exporting of special kernel filesystems exportfs: clarify the documentation of open()/permission() expotrfs ops fsverity: add tracepoints fs: add FS_XFLAG_VERITY for fs-verity files rust: seq_file: replace `kernel::c_str!` with C-Strings fs: dcache: fix typo in enum d_walk_ret comment ovl: use name_is_dot* helpers in readdir code fs: add helpers name_is_dot{,dot,_dotdot} ovl: Fix uninit-value in ovl_fill_real fs: reset read-only fsflags together with xflags fs/file: optimize close_range() complexity from O(N) to O(Sparse)
2026-02-14Merge tag 'f2fs-for-7.0-rc1' of ↵Linus Torvalds18-513/+1380
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this development cycle, we focused on several key performance optimizations: - introducing large folio support to enhance read speeds for immutable files - reducing checkpoint=enable latency by flushing only committed dirty pages - implementing tracepoints to diagnose and resolve lock priority inversion. Additionally, we introduced the packed_ssa feature to optimize the SSA footprint when utilizing large block sizes. Detail summary: Enhancements: - support large folio for immutable non-compressed case - support non-4KB block size without packed_ssa feature - optimize f2fs_enable_checkpoint() to avoid long delay - optimize f2fs_overwrite_io() for f2fs_iomap_begin - optimize NAT block loading during checkpoint write - add write latency stats for NAT and SIT blocks in f2fs_write_checkpoint - pin files do not require sbi->writepages lock for ordering - avoid f2fs_map_blocks() for consecutive holes in readpages - flush plug periodically during GC to maximize readahead effect - add tracepoints to catch lock overheads - add several sysfs entries to tune internal lock priorities Fixes: - fix lock priority inversion issue - fix incomplete block usage in compact SSA summaries - fix to show simulate_lock_timeout correctly - fix to avoid mapping wrong physical block for swapfile - fix IS_CHECKPOINTED flag inconsistency issue caused by concurrent atomic commit and checkpoint writes - fix to avoid UAF in f2fs_write_end_io()" * tag 'f2fs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (61 commits) f2fs: sysfs: introduce critical_task_priority f2fs: introduce trace_f2fs_priority_update f2fs: fix lock priority inversion issue f2fs: optimize f2fs_overwrite_io() for f2fs_iomap_begin f2fs: fix incomplete block usage in compact SSA summaries f2fs: decrease maximum flush retry count in f2fs_enable_checkpoint() f2fs: optimize NAT block loading during checkpoint write f2fs: change size parameter of __has_cursum_space() to unsigned int f2fs: add write latency stats for NAT and SIT blocks in f2fs_write_checkpoint f2fs: pin files do not require sbi->writepages lock for ordering f2fs: fix to show simulate_lock_timeout correctly f2fs: introduce FAULT_SKIP_WRITE f2fs: check skipped write in f2fs_enable_checkpoint() Revert "f2fs: add timeout in f2fs_enable_checkpoint()" f2fs: fix to unlock folio in f2fs_read_data_large_folio() f2fs: fix error path handling in f2fs_read_data_large_folio() f2fs: use folio_end_read f2fs: fix to avoid mapping wrong physical block for swapfile f2fs: avoid f2fs_map_blocks() for consecutive holes in readpages f2fs: advance index and offset after zeroing in large folio read ...
2026-02-12Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds7-80/+83
Pull fsverity updates from Eric Biggers: "fsverity cleanups, speedup, and memory usage optimization from Christoph Hellwig: - Move some logic into common code - Fix btrfs to reject truncates of fsverity files - Improve the readahead implementation - Store each inode's fsverity_info in a hash table instead of using a pointer in the filesystem-specific part of the inode. This optimizes for memory usage in the usual case where most files don't have fsverity enabled. - Look up the fsverity_info fewer times during verification, to amortize the hash table overhead" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: remove inode from fsverity_verification_ctx fsverity: use a hashtable to find the fsverity_info btrfs: consolidate fsverity_info lookup f2fs: consolidate fsverity_info lookup ext4: consolidate fsverity_info lookup fs: consolidate fsverity_info lookup in buffer.c fsverity: push out fsverity_info lookup fsverity: deconstify the inode pointer in struct fsverity_info fsverity: kick off hash readahead at data I/O submission time ext4: move ->read_folio and ->readahead to readpage.c readahead: push invalidate_lock out of page_cache_ra_unbounded fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio fsverity: start consolidating pagecache code fsverity: pass struct file to ->write_merkle_tree_block f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY fs,fsverity: clear out fsverity_info from common code fs,fsverity: reject size changes on fsverity files in setattr_prepare
2026-02-10f2fs: sysfs: introduce critical_task_priorityChao Yu5-0/+26
This patch introduces /sys/fs/f2fs/<disk>/critical_task_priority, w/ this new sysfs interface, we can tune priority of f2fs_ckpt thread and f2fs_gc thread. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-02-10Merge tag 'for-7.0/block-20260206' of ↵Linus Torvalds2-3/+4
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Support for batch request processing for ublk, improving the efficiency of the kernel/ublk server communication. This can yield nice 7-12% performance improvements - Support for integrity data for ublk - Various other ublk improvements and additions, including a ton of selftests additions and updated - Move the handling of blk-crypto software fallback from below the block layer to above it. This reduces the complexity of dealing with bio splitting - Series fixing a number of potential deadlocks in blk-mq related to the queue usage counter and writeback throttling and rq-qos debugfs handling - Add an async_depth queue attribute, to resolve a performance regression that's been around for a qhilw related to the scheduler depth handling - Only use task_work for IOPOLL completions on NVMe, if it is necessary to do so. An earlier fix for an issue resulted in all these completions being punted to task_work, to guarantee that completions were only run for a given io_uring ring when it was local to that ring. With the new changes, we can detect if it's necessary to use task_work or not, and avoid it if possible. - rnbd fixes: - Fix refcount underflow in device unmap path - Handle PREFLUSH and NOUNMAP flags properly in protocol - Fix server-side bi_size for special IOs - Zero response buffer before use - Fix trace format for flags - Add .release to rnbd_dev_ktype - MD pull requests via Yu Kuai - Fix raid5_run() to return error when log_init() fails - Fix IO hang with degraded array with llbitmap - Fix percpu_ref not resurrected on suspend timeout in llbitmap - Fix GPF in write_page caused by resize race - Fix NULL pointer dereference in process_metadata_update - Fix hang when stopping arrays with metadata through dm-raid - Fix any_working flag handling in raid10_sync_request - Refactor sync/recovery code path, improve error handling for badblocks, and remove unused recovery_disabled field - Consolidate mddev boolean fields into mddev_flags - Use mempool to allocate stripe_request_ctx and make sure max_sectors is not less than io_opt in raid5 - Fix return value of mddev_trylock - Fix memory leak in raid1_run() - Add Li Nan as mdraid reviewer - Move phys_vec definitions to the kernel types, mostly in preparation for some VFIO and RDMA changes - Improve the speed for secure erase for some devices - Various little rust updates - Various other minor fixes, improvements, and cleanups * tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) blk-mq: ABI/sysfs-block: fix docs build warnings selftests: ublk: organize test directories by test ID block: decouple secure erase size limit from discard size limit block: remove redundant kill_bdev() call in set_blocksize() blk-mq: add documentation for new queue attribute async_dpeth block, bfq: convert to use request_queue->async_depth mq-deadline: covert to use request_queue->async_depth kyber: covert to use request_queue->async_depth blk-mq: add a new queue sysfs attribute async_depth blk-mq: factor out a helper blk_mq_limit_depth() blk-mq-sched: unify elevators checking for async requests block: convert nr_requests to unsigned int block: don't use strcpy to copy blockdev name blk-mq-debugfs: warn about possible deadlock blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs() blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static blk-rq-qos: fix possible debugfs_mutex deadlock blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter ...
2026-02-09Merge tag 'vfs-7.0-rc1.fserror' of ↵Linus Torvalds1-3/+0
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs error reporting updates from Christian Brauner: "This contains the changes to support generic I/O error reporting. Filesystems currently have no standard mechanism for reporting metadata corruption and file I/O errors to userspace via fsnotify. Each filesystem (xfs, ext4, erofs, f2fs, etc.) privately defines EFSCORRUPTED, and error reporting to fanotify is inconsistent or absent entirely. This introduces a generic fserror infrastructure built around struct super_block that gives filesystems a standard way to queue metadata and file I/O error reports for delivery to fsnotify. Errors are queued via mempools and queue_work to avoid holding filesystem locks in the notification path; unmount waits for pending events to drain. A new super_operations::report_error callback lets filesystem drivers respond to file I/O errors themselves (to be used by an upcoming XFS self-healing patchset). On the uapi side, EFSCORRUPTED and EUCLEAN are promoted from private per-filesystem definitions to canonical errno.h values across all architectures" * tag 'vfs-7.0-rc1.fserror' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ext4: convert to new fserror helpers xfs: translate fsdax media errors into file "data lost" errors when convenient xfs: report fs metadata errors via fsnotify iomap: report file I/O errors to the VFS fs: report filesystem and file I/O errors to fsnotify uapi: promote EFSCORRUPTED and EUCLEAN to errno.h
2026-02-04fsverity: use a hashtable to find the fsverity_infoChristoph Hellwig3-8/+0
Use the kernel's resizable hash table (rhashtable) to find the fsverity_info. This way file systems that want to support fsverity don't have to bloat every inode in the system with an extra pointer. The trade-off is that looking up the fsverity_info is a bit more expensive now, but the main operations are still dominated by I/O and hashing overhead. The rhashtable implementations requires no external synchronization, and the _fast versions of the APIs provide the RCU critical sections required by the implementation. Because struct fsverity_info is only removed on inode eviction and does not contain a reference count, there is no need for an extended critical section to grab a reference or validate the object state. The file open path uses rhashtable_lookup_get_insert_fast, which can either find an existing object for the hash key or insert a new one in a single atomic operation, so that concurrent opens never instantiate duplicate fsverity_info structure. FS_IOC_ENABLE_VERITY must already be synchronized by a combination of i_rwsem and file system flags and uses rhashtable_lookup_insert_fast, which errors out on an existing object for the hash key as an additional safety check. Because insertion into the hash table now happens before S_VERITY is set, fsverity just becomes a barrier and a flag check and doesn't have to look up the fsverity_info at all, so there is only a single lookup per ->read_folio or ->readahead invocation. For btrfs there is an additional one for each bio completion, while for ext4 and f2fs the fsverity_info is stored in the per-I/O context and reused for the completion workqueue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Link: https://lore.kernel.org/r/20260202060754.270269-12-hch@lst.de [EB: folded in fix for missing fsverity_free_info()] Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04f2fs: consolidate fsverity_info lookupChristoph Hellwig3-56/+62
Look up the fsverity_info once in f2fs_mpage_readpages, and then use it for the readahead, local verification of holes and pass it along to the I/O completion workqueue in struct bio_post_read_ctx. Do the same thing in f2fs_get_read_data_folio for reads that come from garbage collection and other background activities. This amortizes the lookup better once it becomes less efficient. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20260202060754.270269-10-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-03fsverity: push out fsverity_info lookupChristoph Hellwig2-7/+16
Pass a struct fsverity_info to the verification and readahead helpers, and push the lookup into the callers. Right now this is a very dumb almost mechanic move that open codes a lot of fsverity_info_addr() calls in the file systems. The subsequent patches will clean this up. This prepares for reducing the number of fsverity_info lookups, which will allow to amortize them better when using a more expensive lookup method. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: David Sterba <dsterba@suse.com> # btrfs Link: https://lore.kernel.org/r/20260202060754.270269-7-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-03fsverity: kick off hash readahead at data I/O submission timeChristoph Hellwig2-8/+22
Currently all reads of the fsverity hashes are kicked off from the data I/O completion handler, leading to needlessly dependent I/O. This is worked around a bit by performing readahead on the level 0 nodes, but still fairly ineffective. Switch to a model where the ->read_folio and ->readahead methods instead kick off explicit readahead of the fsverity hashed so they are usually available at I/O completion time. For 64k sequential reads on my test VM this improves read performance from 2.4GB/s - 2.6GB/s to 3.5GB/s - 3.9GB/s. The improvements for random reads are likely to be even bigger. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> # btrfs Link: https://lore.kernel.org/r/20260202060754.270269-5-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-02readahead: push invalidate_lock out of page_cache_ra_unboundedChristoph Hellwig1-0/+2
Require the invalidate_lock to be held over calls to page_cache_ra_unbounded instead of acquiring it in this function. This prepares for calling page_cache_ra_unbounded from ->readahead for fsverity read-ahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20260202060754.270269-3-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-31f2fs: introduce trace_f2fs_priority_updateChao Yu1-5/+12
This patch introduces two new tracepoints for debug purpose. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-31f2fs: fix lock priority inversion issueChao Yu4-2/+96
If userspace thread has held f2fs rw semaphore, due to its low priority, it could be runnable or preempted state for long time, during the time, it will block high priority thread which is trying to grab the same rw semaphore, e.g. cp_rwsem, io_rwsem... To fix such issue, let's detect thread's priority when it tries to grab f2fs_rwsem lock, if the priority is lower than a priority threshold, let's uplift the priority before it enters into critical region of lock, and restore the priority after it leaves from critical region. Meanwhile, introducing two new sysfs nodes: - /sys/fs/f2fs/<disk>/adjust_lock_priority, it is used to control whether the functionality is enable or not. ========== ================== Flag_Value Flag_Description ========== ================== 0x00000000 Disabled (default) 0x00000001 cp_rwsem 0x00000002 node_change 0x00000004 node_write 0x00000008 gc_lock 0x00000010 cp_global 0x00000020 io_rwsem ========== ================== - /sys/fs/f2fs/<disk>/lock_duration_priority, it is used to control priority threshold. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-31f2fs: optimize f2fs_overwrite_io() for f2fs_iomap_beginYeongjin Gil1-2/+10
When overwriting already allocated blocks, f2fs_iomap_begin() calls f2fs_overwrite_io() to check block mappings. However, f2fs_overwrite_io() iterates through all mapped blocks in the range, which can be inefficient for fragmented files with large I/O requests. This patch optimizes f2fs_overwrite_io() by adding a 'check_first' parameter and introducing __f2fs_overwrite_io() helper. When called from f2fs_iomap_begin(), we only check the first mapping to determine if the range is already allocated, which is sufficient for setting map.m_may_create. This optimization significantly reduces the number of f2fs_map_blocks() calls in f2fs_overwrite_io() when called from f2fs_iomap_begin(), especially for fragmented files with large I/O requests. Cc: stable@kernel.org Fixes: 351bc761338d ("f2fs: optimize f2fs DIO overwrites") Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Reviewed-by: Sunmin Jeong <s_min.jeong@samsung.com> Signed-off-by: Yeongjin Gil <youngjin.gil@samsung.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-30f2fs: fix incomplete block usage in compact SSA summariesDaeho Jeong1-4/+4
In a previous commit, a bug was introduced where compact SSA summaries failed to utilize the entire block space in non-4KB block size configurations, leading to inefficient space management. This patch fixes the calculation logic to ensure that compact SSA summaries can fully occupy the block regardless of the block size. Reported-by: Chris Mason <clm@meta.com> Fixes: e48e16f3e37f ("f2fs: support non-4KB block size without packed_ssa feature") Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-29fsverity: start consolidating pagecache codeChristoph Hellwig1-16/+1
ext4 and f2fs are largely using the same code to read a page full of Merkle tree blocks from the page cache, and the upcoming xfs fsverity support would add another copy. Move the ext4 code to fs/verity/ and use it in f2fs as well. For f2fs this removes the previous f2fs-specific error injection, but otherwise the behavior remains unchanged. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Link: https://lore.kernel.org/r/20260128152630.627409-7-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29fsverity: pass struct file to ->write_merkle_tree_blockChristoph Hellwig1-3/+3
This will make an iomap implementation of the method easier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: David Sterba <dsterba@suse.com> # btrfs Link: https://lore.kernel.org/r/20260128152630.627409-6-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITYChristoph Hellwig1-1/+1
Use IS_ENABLED to disable this code, leading to a slight size reduction: text data bss dec hex filename 25709 2412 24 28145 6df1 fs/f2fs/compress.o.old 25198 2252 24 27474 6b52 fs/f2fs/compress.o Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20260128152630.627409-5-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29fs,fsverity: clear out fsverity_info from common codeChristoph Hellwig1-1/+0
Free the fsverity_info directly in clear_inode instead of requiring file systems to handle it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: David Sterba <dsterba@suse.com> # btrfs Link: https://lore.kernel.org/r/20260128152630.627409-3-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29fs,fsverity: reject size changes on fsverity files in setattr_prepareChristoph Hellwig1-4/+0
Add the check to reject truncates of fsverity files directly to setattr_prepare instead of requiring the file system to handle it. Besides removing boilerplate code, this also fixes the complete lack of such check in btrfs. Fixes: 146054090b08 ("btrfs: initial fsverity support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Link: https://lore.kernel.org/r/20260128152630.627409-2-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29fs: add helpers name_is_dot{,dot,_dotdot}Amir Goldstein2-2/+2
Rename the helper is_dot_dotdot() into the name_ namespace and add complementary helpers to check for dot and dotdot names individually. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://patch.msgid.link/20260128132406.23768-3-amir73il@gmail.com Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-27f2fs: decrease maximum flush retry count in f2fs_enable_checkpoint()Chao Yu2-1/+3
It's rare case that sync_inodes_sb() always skips to flush some drity datas, so it's enough to give extra three more chances to flush data. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27f2fs: optimize NAT block loading during checkpoint writeYongpeng Yang1-1/+13
Under stress tests with frequent metadata operations, checkpoint write time can become excessively long. Analysis shows that the slowdown is caused by synchronous, one-by-one reads of NAT blocks during checkpoint processing. The issue can be reproduced with the following workload: 1. seq 1 650000 | xargs -P 16 -n 1 touch 2. sync # avoid checkpoint write during deleting 3. delete 1 file every 455 files 4. echo 3 > /proc/sys/vm/drop_caches 5. sync # trigger checkpoint write This patch submits read I/O for all NAT blocks required in the __flush_nat_entry_set() phase in advance, reducing the overhead of synchronous waiting for individual NAT block reads. The NAT block flush latency before and after the change is as below: | |NAT blocks accessed|NAT blocks read|Flush time (ms)| |-------------|-------------------|---------------|---------------| |Before change|1205 |1191 |158 | |After change |1264 |1242 |11 | With a similar number of NAT blocks accessed and read from disk, adding NAT block readahead reduces the total NAT block flush time by more than 90%. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27f2fs: change size parameter of __has_cursum_space() to unsigned intYongpeng Yang1-1/+1
All callers of __has_cursum_space() pass an unsigned int value as the size parameter. Change the parameter type to unsigned int accordingly. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>