summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2026-02-27ext4: use optimized mballoc scanning regardless of inode formatJan Kara1-2/+0
commit 3574c322b1d0eb32dbd76b469cb08f9a67641599 upstream. Currently we don't used mballoc optimized scanning (using max free extent order and avg free extent order group lists) for inodes with indirect block based format. This is confusing for users and I don't see a good reason for that. Even with indirect block based inode format we can spend big amount of time searching for free blocks for large filesystems with fragmented free space. To add to the confusion before commit 077d0c2c78df ("ext4: make mb_optimize_scan performance mount option work with extents") optimized scanning was applied *only* to indirect block based inodes so that commit appears as a performance regression to some users. Just use optimized scanning whenever it is enabled by mount options. Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Link: https://patch.msgid.link/20260114182836.14120-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27ext4: always allocate blocks only from groups inode can useJan Kara1-9/+20
commit 4865c768b563deff1b6a6384e74a62f143427b42 upstream. For filesystems with more than 2^32 blocks inodes using indirect block based format cannot use blocks beyond the 32-bit limit. ext4_mb_scan_groups_linear() takes care to not select these unsupported groups for such inodes however other functions selecting groups for allocation don't. So far this is harmless because the other selection functions are used only with mb_optimize_scan and this is currently disabled for inodes with indirect blocks however in the following patch we want to enable mb_optimize_scan regardless of inode format. Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Pedro Falcato <pfalcato@suse.de> Cc: stable@kernel.org Link: https://patch.msgid.link/20260114182836.14120-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27ext4: fix dirtyclusters double decrement on fs shutdownBrian Foster2-17/+6
commit 94a8cea54cd935c54fa2fba70354757c0fc245e3 upstream. fstests test generic/388 occasionally reproduces a warning in ext4_put_super() associated with the dirty clusters count: WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4] Tracing the failure shows that the warning fires due to an s_dirtyclusters_counter value of -1. IOW, this appears to be a spurious decrement as opposed to some sort of leak. Further tracing of the dirty cluster count deltas and an LLM scan of the resulting output identified the cause as a double decrement in the error path between ext4_mb_mark_diskspace_used() and the caller ext4_mb_new_blocks(). First, note that generic/388 is a shutdown vs. fsstress test and so produces a random set of operations and shutdown injections. In the problematic case, the shutdown triggers an error return from the ext4_handle_dirty_metadata() call(s) made from ext4_mb_mark_context(). The changed value is non-zero at this point, so ext4_mb_mark_diskspace_used() does not exit after the error bubbles up from ext4_mb_mark_context(). Instead, the former decrements both cluster counters and returns the error up to ext4_mb_new_blocks(). The latter falls into the !ar->len out path which decrements the dirty clusters counter a second time, creating the inconsistency. To avoid this problem and simplify ownership of the cluster reservation in this codepath, lift the counter reduction to a single place in the caller. This makes it more clear that ext4_mb_new_blocks() is responsible for acquiring cluster reservation (via ext4_claim_free_clusters()) in the !delalloc case as well as releasing it, regardless of whether it ends up consumed or returned due to failure. Fixes: 0087d9fb3f29 ("ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20260113171905.118284-1-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27ext4: fix e4b bitmap inconsistency reportsYongjian Sun1-10/+11
commit bdc56a9c46b2a99c12313122b9352b619a2e719e upstream. A bitmap inconsistency issue was observed during stress tests under mixed huge-page workloads. Ext4 reported multiple e4b bitmap check failures like: ext4_mb_complex_scan_group:2508: group 350, 8179 free clusters as per group info. But got 8192 blocks Analysis and experimentation confirmed that the issue is caused by a race condition between page migration and bitmap modification. Although this timing window is extremely narrow, it is still hit in practice: folio_lock ext4_mb_load_buddy __migrate_folio check ref count folio_mc_copy __filemap_get_folio folio_try_get(folio) ...... mb_mark_used ext4_mb_unload_buddy __folio_migrate_mapping folio_ref_freeze folio_unlock The root cause of this issue is that the fast path of load_buddy only increments the folio's reference count, which is insufficient to prevent concurrent folio migration. We observed that the folio migration process acquires the folio lock. Therefore, we can determine whether to take the fast path in load_buddy by checking the lock status. If the folio is locked, we opt for the slow path (which acquires the lock) to close this concurrency window. Additionally, this change addresses the following issues: When the DOUBLE_CHECK macro is enabled to inspect bitmap-related issues, the following error may be triggered: corruption in group 324 at byte 784(6272): f in copy != ff on disk/prealloc Analysis reveals that this is a false positive. There is a specific race window where the bitmap and the group descriptor become momentarily inconsistent, leading to this error report: ext4_mb_load_buddy ext4_mb_load_buddy __filemap_get_folio(create|lock) folio_lock ext4_mb_init_cache folio_mark_uptodate __filemap_get_folio(no lock) ...... mb_mark_used mb_mark_used_double mb_cmp_bitmaps mb_set_bits(e4b->bd_bitmap) folio_unlock The original logic assumed that since mb_cmp_bitmaps is called when the bitmap is newly loaded from disk, the folio lock would be sufficient to prevent concurrent access. However, this overlooks a specific race condition: if another process attempts to load buddy and finds the folio is already in an uptodate state, it will immediately begin using it without holding folio lock. Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260106090820.836242-1-sunyongjian@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27ext4: fix memory leak in ext4_ext_shift_extents()Zilin Guan1-1/+2
commit ca81109d4a8f192dc1cbad4a1ee25246363c2833 upstream. In ext4_ext_shift_extents(), if the extent is NULL in the while loop, the function returns immediately without releasing the path obtained via ext4_find_extent(), leading to a memory leak. Fix this by jumping to the out label to ensure the path is properly released. Fixes: a18ed359bdddc ("ext4: always check ext4_ext_find_extent result") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20251225084800.905701-1-zilin@seu.edu.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27ext4: drop extent cache when splitting extent failsZhang Yi1-2/+6
commit 79b592e8f1b435796cbc2722190368e3e8ffd7a1 upstream. When the split extent fails, we might leave some extents still being processed and return an error directly, which will result in stale extent entries remaining in the extent status tree. So drop all of the remaining potentially stale extents if the splitting fails. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251129103247.686136-8-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27ext4: drop extent cache after doing PARTIAL_VALID1 zerooutZhang Yi1-1/+9
commit 6d882ea3b0931b43530d44149b79fcd4ffc13030 upstream. When splitting an unwritten extent in the middle and converting it to initialized in ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT and EXT4_EXT_DATA_VALID2 flags set, it could leave a stale unwritten extent. Assume we have an unwritten file and buffered write in the middle of it without dioread_nolock enabled, it will allocate blocks as written extent. 0 A B N [UUUUUUUUUUUU] on-disk extent U: unwritten extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDD--] D: valid data |<- ->| ----> this range needs to be initialized ext4_split_extent() first try to split this extent at B with EXT4_EXT_DATA_PARTIAL_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but ext4_split_extent_at() failed to split this extent due to temporary lack of space. It zeroout B to N and leave the entire extent as unwritten. 0 A B N [UUUUUUUUUUUU] on-disk extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDDZZ] Z: zeroed data ext4_split_extent() then try to split this extent at A with EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and leave an written extent from A to N. 0 A B N [UUWWWWWWWWWW] on-disk extent W: written extent [UUUUUUUUUUUU] extent status tree [--DDDDDDDDZZ] Finally ext4_map_create_blocks() only insert extent A to B to the extent status tree, and leave an stale unwritten extent in the status tree. 0 A B N [UUWWWWWWWWWW] on-disk extent W: written extent [UUWWWWWWWWUU] extent status tree [--DDDDDDDDZZ] Fix this issue by always cached extent status entry after zeroing out the second part. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251129103247.686136-7-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27ext4: don't cache extent during splitting extentZhang Yi1-0/+6
commit 8b4b19a2f96348d70bfa306ef7d4a13b0bcbea79 upstream. Caching extents during the splitting process is risky, as it may result in stale extents remaining in the status tree. Moreover, in most cases, the corresponding extent block entries are likely already cached before the split happens, making caching here not particularly useful. Assume we have an unwritten extent, and then DIO writes the first half. [UUUUUUUUUUUUUUUU] on-disk extent U: unwritten extent [UUUUUUUUUUUUUUUU] extent status tree |<- ->| ----> dio write this range First, when ext4_split_extent_at() splits this extent, it truncates the existing extent and then inserts a new one. During this process, this extent status entry may be shrunk, and calls to ext4_find_extent() and ext4_cache_extents() may occur, which could potentially insert the truncated range as a hole into the extent status tree. After the split is completed, this hole is not replaced with the correct status. [UUUUUUU|UUUUUUUU] on-disk extent U: unwritten extent [UUUUUUU|HHHHHHHH] extent status tree H: hole Then, the outer calling functions will not correct this remaining hole extent either. Finally, if we perform a delayed buffer write on this latter part, it will re-insert the delayed extent and cause an error in space accounting. In adition, if the unwritten extent cache is not shrunk during the splitting, ext4_cache_extents() also conflicts with existing extents when caching extents. In the future, we will add checks when caching extents, which will trigger a warning. Therefore, Do not cache extents that are being split. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-6-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1Zhang Yi1-1/+12
commit 1bf6974822d1dba86cf11b5f05498581cf3488a2 upstream. When allocating initialized blocks from a large unwritten extent, or when splitting an unwritten extent during end I/O and converting it to initialized, there is currently a potential issue of stale data if the extent needs to be split in the middle. 0 A B N [UUUUUUUUUUUU] U: unwritten extent [--DDDDDDDD--] D: valid data |<- ->| ----> this range needs to be initialized ext4_split_extent() first try to split this extent at B with EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but ext4_split_extent_at() failed to split this extent due to temporary lack of space. It zeroout B to N and mark the entire extent from 0 to N as written. 0 A B N [WWWWWWWWWWWW] W: written extent [SSDDDDDDDDZZ] Z: zeroed, S: stale data ext4_split_extent() then try to split this extent at A with EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left a stale written extent from 0 to A. 0 A B N [WW|WWWWWWWWWW] [SS|DDDDDDDDZZ] Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at() when splitting at B, don't convert the entire extent to written and left it as unwritten after zeroing out B to N. The remaining work is just like the standard two-part split. ext4_split_extent() will pass the EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the second time, allowing it to properly handle the split. If the split is successful, it will keep extent from 0 to A as unwritten. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-3-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27ext4: subdivide EXT4_EXT_DATA_VALID1Zhang Yi1-6/+12
commit 22784ca541c0f01c5ebad14e8228298dc0a390ed upstream. When splitting an extent, if the EXT4_GET_BLOCKS_CONVERT flag is set and it is necessary to split the target extent in the middle, ext4_split_extent() first handles splitting the latter half of the extent and passes the EXT4_EXT_DATA_VALID1 flag. This flag implies that all blocks before the split point contain valid data; however, this assumption is incorrect. Therefore, subdivid EXT4_EXT_DATA_VALID1 into EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_DATA_PARTIAL_VALID1, which indicate that the first half of the extent is either entirely valid or only partially valid, respectively. These two flags cannot be set simultaneously. This patch does not use EXT4_EXT_DATA_PARTIAL_VALID1, it only replaces EXT4_EXT_DATA_VALID1 with EXT4_EXT_DATA_ENTIRE_VALID1 at the location where it is set, no logical changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-2-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27btrfs: fix invalid leaf access in btrfs_quota_enable() if ref key not foundFilipe Manana1-4/+7
[ Upstream commit ecb7c2484cfc83a93658907580035a8adf1e0a92 ] If btrfs_search_slot_for_read() returns 1, it means we did not find any key greater than or equals to the key we asked for, meaning we have reached the end of the tree and therefore the path is not valid. If this happens we need to break out of the loop and stop, instead of continuing and accessing an invalid path. Fixes: 5223cc60b40a ("btrfs: drop the path before adding qgroup items when enabling qgroups") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: use the correct type to initialize block reserve for delayed refsFilipe Manana2-4/+5
[ Upstream commit 2155d0c0a761a56ce7ede83a26eb23ea0f935260 ] When initializing the delayed refs block reserve for a transaction handle we are passing a type of BTRFS_BLOCK_RSV_DELOPS, which is meant for delayed items and not for delayed refs. The correct type for delayed refs is BTRFS_BLOCK_RSV_DELREFS. On release of any excess space reserved in a local delayed refs reserve, we also should transfer that excess space to the global block reserve (it it's full, we return to the space info for general availability). By initializing a transaction's local delayed refs block reserve with a type of BTRFS_BLOCK_RSV_DELOPS, we were also causing any excess space released from the delayed block reserve (fs_info->delayed_block_rsv, used for delayed inodes and items) to be transferred to the global block reserve instead of the global delayed refs block reserve. This was an unintentional change in commit 28270e25c69a ("btrfs: always reserve space for delayed refs when starting transaction"), but it's not particularly serious as things tend to cancel out each other most of the time and it's relatively rare to be anywhere near exhaustion of the global reserve. Fix this by initializing a transaction's local delayed refs reserve with a type of BTRFS_BLOCK_RSV_DELREFS and making btrfs_block_rsv_release() attempt to transfer unused space from such a reserve into the global block reserve, just as we did before that commit for when the block reserve is a delayed refs rsv. Reported-by: Alex Lyakas <alex.lyakas@zadara.com> Link: https://lore.kernel.org/linux-btrfs/CAOcd+r0FHG5LWzTSu=LknwSoqxfw+C00gFAW7fuX71+Z5AfEew@mail.gmail.com/ Fixes: 28270e25c69a ("btrfs: always reserve space for delayed refs when starting transaction") Reviewed-by: Alex Lyakas <alex.lyakas@zadara.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: reset block group size class when it becomes emptyJiasheng Jiang1-0/+10
[ Upstream commit 5870ec7c8fe57a8b2c65005e5da5efc054faa3e6 ] Block group size classes are managed consistently everywhere. Currently, btrfs_use_block_group_size_class() sets a block group's size class to specialize it for a specific allocation size. However, this size class remains "stale" even if the block group becomes completely empty (both used and reserved bytes reach zero). This happens in two scenarios: 1. When space reservations are freed (e.g., due to errors or transaction aborts) via btrfs_free_reserved_bytes(). 2. When the last extent in a block group is freed via btrfs_update_block_group(). While size classes are advisory, a stale size class can cause find_free_extent to unnecessarily skip candidate block groups during initial search loops. This undermines the purpose of size classes to reduce fragmentation by keeping block groups restricted to a specific size class when they could be reused for any size. Fix this by resetting the size class to BTRFS_BG_SZ_NONE whenever a block group's used and reserved counts both reach zero. This ensures that empty block groups are fully available for any allocation size in the next cycle. Fixes: 52bb7a2166af ("btrfs: introduce size class to block group allocator") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: reduce block group critical section in btrfs_free_reserved_bytes()Filipe Manana1-6/+9
[ Upstream commit 8b6fa164ab59f9e3f24e627fe09a0234783e7a8b ] There's no need to update the space_info fields (bytes_reserved, max_extent_size, bytes_readonly, bytes_zone_unusable) while holding the block group's spinlock. So move those updates to happen after we unlock the block group (and while holding the space_info locked of course), so that all we do under the block group's critical section is to update the block group itself. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 5870ec7c8fe5 ("btrfs: reset block group size class when it becomes empty") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: remove fs_info argument from btrfs_try_granting_tickets()Filipe Manana4-13/+12
[ Upstream commit e3df6408b13a75cf73e543e53453f28261874c6f ] We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 5870ec7c8fe5 ("btrfs: reset block group size class when it becomes empty") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27fs/ntfs3: Fix slab-out-of-bounds read in DeleteIndexEntryRootJiasheng Jiang1-0/+3
[ Upstream commit b2bc7c44ed1779fc9eaab9a186db0f0d01439622 ] In the 'DeleteIndexEntryRoot' case of the 'do_action' function, the entry size ('esize') is retrieved from the log record without adequate bounds checking. Specifically, the code calculates the end of the entry ('e2') using: e2 = Add2Ptr(e1, esize); It then calculates the size for memmove using 'PtrOffset(e2, ...)', which subtracts the end pointer from the buffer limit. If 'esize' is maliciously large, 'e2' exceeds the used buffer size. This results in a negative offset which, when cast to size_t for memmove, interprets as a massive unsigned integer, leading to a heap buffer overflow. This commit adds a check to ensure that the entry size ('esize') strictly fits within the remaining used space of the index header before performing memory operations. Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal") Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27fs/ntfs3: prevent infinite loops caused by the next valid being the sameEdward Adam Davis1-2/+6
[ Upstream commit 27b75ca4e51e3e4554dc85dbf1a0246c66106fd3 ] When processing valid within the range [valid : pos), if valid cannot be retrieved correctly, for example, if the retrieved valid value is always the same, this can trigger a potential infinite loop, similar to the hung problem reported by syzbot [1]. Adding a check for the valid value within the loop body, and terminating the loop and returning -EINVAL if the value is the same as the current value, can prevent this. [1] INFO: task syz.4.21:6056 blocked for more than 143 seconds. Call Trace: rwbase_write_lock+0x14f/0x750 kernel/locking/rwbase_rt.c:244 inode_lock include/linux/fs.h:1027 [inline] ntfs_file_write_iter+0xe6/0x870 fs/ntfs3/file.c:1284 Fixes: 4342306f0f0d ("fs/ntfs3: Add file operations and implementation") Reported-by: syzbot+bcf9e1868c1a0c7e04f1@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=bcf9e1868c1a0c7e04f1 Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27fs/ntfs3: Initialize new folios before useBartlomiej Kubik1-1/+1
[ Upstream commit f223ebffa185cc8da934333c5a31ff2d4f992dc9 ] KMSAN reports an uninitialized value in longest_match_std(), invoked from ntfs_compress_write(). When new folios are allocated without being marked uptodate and ni_read_frame() is skipped because the caller expects the frame to be completely overwritten, some reserved folios may remain only partially filled, leaving the rest memory uninitialized. Fixes: 584f60ba22f7 ("ntfs3: Convert ntfs_get_frame_pages() to use a folio") Tested-by: syzbot+08d8956768c96a2c52cf@syzkaller.appspotmail.com Reported-by: syzbot+08d8956768c96a2c52cf@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=08d8956768c96a2c52cf Signed-off-by: Bartlomiej Kubik <kubik.bartlomiej@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27pidfs: return -EREMOTE when PIDFD_GET_INFO is called on another nsLuca Boccassi1-1/+1
[ Upstream commit ab89060fbc92edd6e852bf0f533f29140afabe0e ] Currently it is not possible to distinguish between the case where a process has already exited and the case where a process is in a different namespace, as both return -ESRCH. glibc's pidfd_getpid() procfs-based implementation returns -EREMOTE in the latter, so that distinguishing the two is possible, as the fdinfo in procfs will list '0' as the PID in that case: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/pidfd_getpid.c;h=860829cf07da2267484299ccb02861822c0d07b4;hb=HEAD#l121 Change the error code so that the kernel also returns -EREMOTE in that case. Fixes: 7477d7dce48a ("pidfs: allow to retrieve exit information") Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> Link: https://patch.msgid.link/20260127225209.2293342-1-luca.boccassi@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27ovl: Fix uninit-value in ovl_fill_realQing Wang1-1/+1
[ Upstream commit 1992330d90dd766fcf1730fd7bf2d6af65370ac4 ] Syzbot reported a KMSAN uninit-value issue in ovl_fill_real. This iusse's call chain is: __do_sys_getdents64() -> iterate_dir() ... -> ext4_readdir() -> fscrypt_fname_alloc_buffer() // alloc -> fscrypt_fname_disk_to_usr // write without tail '\0' -> dir_emit() -> ovl_fill_real() // read by strcmp() The string is used to store the decrypted directory entry name for an encrypted inode. As shown in the call chain, fscrypt_fname_disk_to_usr() write it without null-terminate. However, ovl_fill_real() uses strcmp() to compare the name against "..", which assumes a null-terminated string and may trigger a KMSAN uninit-value warning when the buffer tail contains uninit data. Reported-by: syzbot+d130f98b2c265fae5297@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d130f98b2c265fae5297 Fixes: 4edb83bb1041 ("ovl: constant d_ino for non-merge dirs") Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://patch.msgid.link/20260128132406.23768-2-amir73il@gmail.com Acked-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27fs/nfs: Fix readdir slow-start regressionSagi Grimberg1-2/+2
[ Upstream commit 42e7c876b182da65723700f6bc507a8aecb10d3b ] Commit 580f236737d1 ("NFS: Adjust the amount of readahead performed by NFS readdir") reduces the amount of readahead names caching done by the client. The downside of this approach is READDIR now may suffer from a slow-start issue, where initially it will fetch names that fit in a single page, then in 2, 4, 8 until the maximum supported transfer size (usually 1M). This patch tries to take a balanced approach between mitigating the slow-start issue still maintaining some efficiency gains. Fixes: 580f236737d1 ("NFS: Adjust the amount of readahead performed by NFS readdir") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27pNFS: fix a missing wake up while waiting on NFS_LAYOUT_DRAINOlga Kornievskaia1-1/+2
[ Upstream commit 5248d8474e594d156bee1ed10339cc16e207a28b ] It is possible to have a task get stuck on waiting on the NFS_LAYOUT_DRAIN in the following scenario 1. cpu a: waiter test NFS_LAYOUT_DRAIN (1) and plh_outstanding (1) 2. cpu b: atomic_dec_and_test() -> clear bit -> wake up 3. cpu c: sets NFS_LAYOUT_DRAIN again 4. cpu a: calls wait_on_bit() sleeps forever. To expand on this we have say 2 outstanding pnfs write IO that get ESTALE which causes both to call pnfs_destroy_layout() and set the NFS_LAYOUT_DRAIN bit but the 1st one doesn't call the pnfs_put_layout_hdr() yet (as that would prevent the 2nd ESTALE write from trying to call pnfs_destroy_layout()). If the 1st ESTALE write is the one that initially sets the NFS_LAYOUT_DRAIN so that new IO on this file initiates new LAYOUTGET. Another new write would find NFS_LAYOUT_DRAIN set and phl_outstanding>0 (step 1) and would wait_on_bit(). LAYOUTGET completes doing step 2. Now, the 2nd of ESTALE writes is calling pnfs_destory_layout() and set the NFS_LAYOUT_DRAIN bit (step 3). Finally, the waiting write wakes up to check the bit and goes back to sleep. The problem revolves around the fact that if NFS_LAYOUT_INVALID_STID was already set, it should not do the work of pnfs_mark_layout_stateid_invalid(), thus NFS_LAYOUT_DRAIN will not be set more than once for an invalid layout. Suggested-by: Trond Myklebust <trond.myklebust@hammerspace.com> Fixes: 880265c77ac4 ("pNFS: Avoid a live lock condition in pnfs_update_layout()") Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27NFS/localio: remove -EAGAIN handling in nfs_local_doio()Mike Snitzer1-2/+0
[ Upstream commit e72a73957613653f50375db1f3a3fbb907a9c40b ] Handling -EAGAIN in nfs_local_doio() was introduced with commit 0978e5b85fc08 (nfs_do_local_{read,write} were made to have negative checks for correspoding iter method) but commit e43e9a3a3d66 since eliminated the possibility for this -EAGAIN early return. So remove nfs_local_doio()'s -EAGAIN handling that calls nfs_localio_disable_client() -- while it should never happen from nfs_do_local_{read,write} this particular -EAGAIN handling is now "dead" and so it has become a liability. Fixes: e43e9a3a3d66 ("nfs/localio: refactor iocb initialization") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27NFS/localio: use GFP_NOIO and non-memreclaim workqueue in nfs_local_commitMike Snitzer1-3/+8
[ Upstream commit 9bb0060f7860aa4561c5b21163dd45ceb66946a9 ] nfslocaliod_workqueue is a non-memreclaim workqueue (it isn't initialized with WQ_MEM_RECLAIM), see commit b9f5dd57f4a5 ("nfs/localio: use dedicated workqueues for filesystem read and write"). Use nfslocaliod_workqueue for LOCALIO's SYNC work. Also, set PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO in nfs_local_fsync_work. Fixes: b9f5dd57f4a5 ("nfs/localio: use dedicated workqueues for filesystem read and write") Signed-off-by: Mike Snitzer <snitzer@hammerspace.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27NFS/localio: prevent direct reclaim recursion into NFS via nfs_writepagesMike Snitzer1-0/+15
[ Upstream commit 67435d2d8a33a75f9647724952cb1b18279d2e95 ] LOCALIO is an NFS loopback mount optimization that avoids using the network for READ, WRITE and COMMIT if the NFS client and server are determined to be on the same system. But because LOCALIO is still fundamentally "just NFS loopback mount" it is susceptible to recursion deadlock via direct reclaim, e.g.: NFS LOCALIO down to XFS and then back into NFS via nfs_writepages. Fix LOCALIO's potential for direct reclaim deadlock by ensuring that all its page cache allocations are done from GFP_NOFS context. Thanks to Ben Coddington for pointing out commit ad22c7a043c2 ("xfs: prevent stack overflows from page cache allocation"). Reported-by: John Cagle <john.cagle@hammerspace.com> Tested-by: Allen Lu <allen.lu@hammerspace.com> Suggested-by: Benjamin Coddington <bcodding@hammerspace.com> Fixes: 70ba381e1a43 ("nfs: add LOCALIO support") Signed-off-by: Mike Snitzer <snitzer@hammerspace.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27NFS/localio: Handle short writes by retryingTrond Myklebust1-17/+47
[ Upstream commit 615762059d284b863f9163b53679d95b3dcdd495 ] The current code for handling short writes in localio just truncates the I/O and then sets an error. While that is close to how the ordinary NFS code behaves, it does mean there is a chance the data that got written is lost because it isn't persisted. To fix this, change localio so that the upper layers can direct the behaviour to persist any unstable data by rewriting it, and then continuing writing until an ENOSPC is hit. Fixes: 70ba381e1a43 ("nfs: add LOCALIO support") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27smb: client: correct value for smbd_max_fragmented_recv_sizeStefan Metzmacher1-2/+17
[ Upstream commit 4a93d1ee2d0206970b6eb13fbffe07938cd95948 ] When we download a file without rdma offload or get a large directly enumeration from the server, the server might want to send up to smbd_max_fragmented_recv_size bytes, but if it is too large all our recv buffers might already be moved to the recv_io.reassembly.list and we're no longer able to grant recv credits. The maximum fragmented upper-layer payload receive size supported Assume max_payload_per_credit is smbd_max_receive_size - 24 = 1340 The maximum number would be smbd_receive_credit_max * max_payload_per_credit 1340 * 255 = 341700 (0x536C4) The minimum value from the spec is 131072 (0x20000) For now we use the logic we used in ksmbd before: (1364 * 255) / 2 = 173910 (0x2A756) Fixes: 03bee01d6215 ("CIFS: SMBD: Add SMB Direct protocol initial values and constants") Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27procfs: fix missing RCU protection when reading real_parent in do_task_stat()Jinliang Zheng1-1/+1
[ Upstream commit 76149d53502cf17ef3ae454ff384551236fba867 ] When reading /proc/[pid]/stat, do_task_stat() accesses task->real_parent without proper RCU protection, which leads to: cpu 0 cpu 1 ----- ----- do_task_stat var = task->real_parent release_task call_rcu(delayed_put_task_struct) task_tgid_nr_ns(var) rcu_read_lock <--- Too late to protect task->real_parent! task_pid_ptr <--- UAF! rcu_read_unlock This patch uses task_ppid_nr_ns() instead of task_tgid_nr_ns() to add proper RCU protection for accessing task->real_parent. Link: https://lkml.kernel.org/r/20260128083007.3173016-1-alexjlzheng@tencent.com Fixes: 06fffb1267c9 ("do_task_stat: don't take rcu_read_lock()") Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: ruippan <ruippan@tencent.com> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27jfs: avoid -Wtautological-constant-out-of-range-compare warningArnd Bergmann1-2/+2
[ Upstream commit 7833570dae833028337bb53b7f389825b910c100 ] A recent change for the range check started triggering a clang warning: fs/jfs/jfs_dtree.c:2906:31: error: result of comparison of constant 128 with expression of type 's8' (aka 'signed char') is always false [-Werror,-Wtautological-constant-out-of-range-compare] 2906 | if (stbl[i] < 0 || stbl[i] >= DTPAGEMAXSLOT) { | ~~~~~~~ ^ ~~~~~~~~~~~~~ fs/jfs/jfs_dtree.c:3111:30: error: result of comparison of constant 128 with expression of type 's8' (aka 'signed char') is always false [-Werror,-Wtautological-constant-out-of-range-compare] 3111 | if (stbl[0] < 0 || stbl[0] >= DTPAGEMAXSLOT) { | ~~~~~~~ ^ ~~~~~~~~~~~~~ Both the old and the new check were useless, but the previous version apparently did not lead to the warning. Remove the extraneous range check for simplicity. Fixes: cafc6679824a ("jfs: replace hardcoded magic number with DTPAGEMAXSLOT constant") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27fat: avoid parent link count underflow in rmdirZhiyu Zhang2-2/+12
[ Upstream commit 8cafcb881364af5ef3a8b9fed4db254054033d8a ] Corrupted FAT images can leave a directory inode with an incorrect i_nlink (e.g. 2 even though subdirectories exist). rmdir then unconditionally calls drop_nlink(dir) and can drive i_nlink to 0, triggering the WARN_ON in drop_nlink(). Add a sanity check in vfat_rmdir() and msdos_rmdir(): only drop the parent link count when it is at least 3, otherwise report a filesystem error. Link: https://lkml.kernel.org/r/20260101111148.1437-1-zhiyuzhang999@gmail.com Fixes: 9a53c3a783c2 ("[PATCH] r/o bind mounts: unlink: monitor i_nlink") Signed-off-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Closes: https://lore.kernel.org/linux-fsdevel/aVN06OKsKxZe6-Kv@casper.infradead.org/T/#t Tested-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27nfsd: never defer requests during idmap lookupAnthony Iliopoulos3-8/+58
[ Upstream commit f9c206cdc4266caad6a9a7f46341420a10f03ccb ] During v4 request compound arg decoding, some ops (e.g. SETATTR) can trigger idmap lookup upcalls. When those upcall responses get delayed beyond the allowed time limit, cache_check() will mark the request for deferral and cause it to be dropped. This prevents nfs4svc_encode_compoundres from being executed, and thus the session slot flag NFSD4_SLOT_INUSE never gets cleared. Subsequent client requests will fail with NFSERR_JUKEBOX, given that the slot will be marked as in-use, making the SEQUENCE op fail. Fix this by making sure that the RQ_USEDEFERRAL flag is always clear during nfs4svc_decode_compoundargs(), since no v4 request should ever be deferred. Fixes: 2f425878b6a7 ("nfsd: don't use the deferral service, return NFS4ERR_DELAY") Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27NFS: NFSERR_INVAL is not defined by NFSv2Chuck Lever2-2/+2
[ Upstream commit 0ac903d1bfdce8ff40657c2b7d996947b72b6645 ] A documenting comment in include/uapi/linux/nfs.h claims incorrectly that NFSv2 defines NFSERR_INVAL. There is no such definition in either RFC 1094 or https://pubs.opengroup.org/onlinepubs/9629799/chap7.htm NFS3ERR_INVAL is introduced in RFC 1813. NFSD returns NFSERR_INVAL for PROC_GETACL, which has no specification (yet). However, nfsd_map_status() maps nfserr_symlink and nfserr_wrong_type to nfserr_inval, which does not align with RFC 1094. This logic was introduced only recently by commit 438f81e0e92a ("nfsd: move error choice for incorrect object types to version-specific code."). Given that we have no INVAL or SERVERFAULT status in NFSv2, probably the only choice is NFSERR_IO. Fixes: 438f81e0e92a ("nfsd: move error choice for incorrect object types to version-specific code.") Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27ext4: fast commit: make s_fc_lock reclaim-safeLi Chen2-23/+44
[ Upstream commit 491f2927ae097e2d405afe0b3fe841931ab8aad2 ] s_fc_lock can be acquired from inode eviction and thus is reclaim unsafe. Since the fast commit path holds s_fc_lock while writing the commit log, allocations under the lock can enter reclaim and invert the lock order with fs_reclaim. Add ext4_fc_lock()/ext4_fc_unlock() helpers which acquire s_fc_lock under memalloc_nofs_save()/restore() context and use them everywhere so allocations under the lock cannot recurse into filesystem reclaim. Fixes: 6593714d67ba ("ext4: hold s_fc_lock while during fast commit") Signed-off-by: Li Chen <me@linux.beauty> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260106120621.440126-1-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27quota: fix livelock between quotactl and freeze_superAbhishek Bapat1-0/+1
[ Upstream commit 77449e453dfc006ad738dec55374c4cbc056fd39 ] When a filesystem is frozen, quotactl_block() enters a retry loop waiting for the filesystem to thaw. It acquires s_umount, checks the freeze state, drops s_umount and uses sb_start_write() - sb_end_write() pair to wait for the unfreeze. However, this retry loop can trigger a livelock issue, specifically on kernels with preemption disabled. The mechanism is as follows: 1. freeze_super() sets SB_FREEZE_WRITE and calls sb_wait_write(). 2. sb_wait_write() calls percpu_down_write(), which initiates synchronize_rcu(). 3. Simultaneously, quotactl_block() spins in its retry loop, immediately executing the sb_start_write() - sb_end_write() pair. 4. Because the kernel is non-preemptible and the loop contains no scheduling points, quotactl_block() never yields the CPU. This prevents that CPU from reaching an RCU quiescent state. 5. synchronize_rcu() in the freezer thread waits indefinitely for the quotactl_block() CPU to report a quiescent state. 6. quotactl_block() spins indefinitely waiting for the freezer to advance, which it cannot do as it is blocked on the RCU sync. This results in a hang of the freezer process and 100% CPU usage by the quota process. While this can occur intermittently on multi-core systems, it is reliably reproducing on a node with the following script, running both the freezer and the quota toggle on the same CPU: # mkfs.ext4 -O quota /dev/sda 2g && mkdir a_mount # mount /dev/sda -o quota,usrquota,grpquota a_mount # taskset -c 3 bash -c "while true; do xfs_freeze -f a_mount; \ xfs_freeze -u a_mount; done" & # taskset -c 3 bash -c "while true; do quotaon a_mount; \ quotaoff a_mount; done" & Adding cond_resched() to the retry loop fixes the issue. It acts as an RCU quiescent state, allowing synchronize_rcu() in percpu_down_write() to complete. Fixes: 576215cffdef ("fs: Drop wait_unfrozen wait queue") Signed-off-by: Abhishek Bapat <abhishekbapat@google.com> Link: https://patch.msgid.link/20260115213103.1089129-1-abhishekbapat@google.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27pstore/ram: fix buffer overflow in persistent_ram_save_old()Sai Ritvik Tanksalkar1-0/+11
[ Upstream commit 5669645c052f235726a85f443769b6fc02f66762 ] persistent_ram_save_old() can be called multiple times for the same persistent_ram_zone (e.g., via ramoops_pstore_read -> ramoops_get_next_prz for PSTORE_TYPE_DMESG records). Currently, the function only allocates prz->old_log when it is NULL, but it unconditionally updates prz->old_log_size to the current buffer size and then performs memcpy_fromio() using this new size. If the buffer size has grown since the first allocation (which can happen across different kernel boot cycles), this leads to: 1. A heap buffer overflow (OOB write) in the memcpy_fromio() calls 2. A subsequent OOB read when ramoops_pstore_read() accesses the buffer using the incorrect (larger) old_log_size The KASAN splat would look similar to: BUG: KASAN: slab-out-of-bounds in ramoops_pstore_read+0x... Read of size N at addr ... by task ... The conditions are likely extremely hard to hit: 0. Crash with a ramoops write of less-than-record-max-size bytes. 1. Reboot: ramoops registers, pstore_get_records(0) reads old crash, allocates old_log with size X 2. Crash handler registered, timer started (if pstore_update_ms >= 0) 3. Oops happens (non-fatal, system continues) 4. pstore_dump() writes oops via ramoops_pstore_write() size Y (>X) 5. pstore_new_entry = 1, pstore_timer_kick() called 6. System continues running (not a panic oops) 7. Timer fires after pstore_update_ms milliseconds 8. pstore_timefunc() → schedule_work() → pstore_dowork() → pstore_get_records(1) 9. ramoops_get_next_prz() → persistent_ram_save_old() 10. buffer_size() returns Y, but old_log is X bytes 11. Y > X: memcpy_fromio() overflows heap Requirements: - a prior crash record exists that did not fill the record size (almost impossible since the crash handler writes as much as it can possibly fit into the record, capped by max record size and the kmsg buffer almost always exceeds the max record size) - pstore_update_ms >= 0 (disabled by default) - Non-fatal oops (system survives) Free and reallocate the buffer when the new size differs from the previously allocated size. This ensures old_log always has sufficient space for the data being copied. Fixes: 201e4aca5aa1 ("pstore/ram: Should update old dmesg buffer before reading") Signed-off-by: Sai Ritvik Tanksalkar <stanksal@purdue.edu> Link: https://patch.msgid.link/20260201132240.2948732-1-stanksal@purdue.edu Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27fs/tests: exec: drop duplicate bprm_stack_limits test vectorsTitouan Ameline de Cadeville1-6/+0
[ Upstream commit 46a03ea50b5f380bdb99178b8f90b39c6ba1f528 ] Remove duplicate entries from the bprm_stack_limits KUnit test vector table. The duplicates do not add coverage and only increase test size. Signed-off-by: Titouan Ameline de Cadeville <titouan.ameline@gmail.com> Fixes: 60371f43e56b ("exec: Add KUnit test for bprm_stack_limits()") Link: https://patch.msgid.link/20260203175950.43710-1-titouan.ameline@gmail.com Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27iomap: fix submission side handling of completion side errorsChristoph Hellwig1-3/+7
[ Upstream commit 4ad357e39b2ecd5da7bcc7e840ee24d179593cd5 ] The "if (dio->error)" in iomap_dio_bio_iter exists to stop submitting more bios when a completion already return an error. Commit cfe057f7db1f ("iomap_dio_actor(): fix iov_iter bugs") made it revert the iov by "copied", which is very wrong given that we've already consumed that range and submitted a bio for it. Fixes: cfe057f7db1f ("iomap_dio_actor(): fix iov_iter bugs") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27netfs: avoid double increment of retry_count in subreqShyam Prasad N1-1/+0
[ Upstream commit a5ca32d031bbba5160e1f555aabb75a3f40f918d ] This change fixes the instance of double incrementing of retry_count. The increment of this count already happens when netfs_reissue_write gets called. Incrementing this value before is not necessary. Fixes: 4acb665cf4f3 ("netfs: Work around recursion by abandoning retry if nothing read") Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27smb: client: fix potential UAF and double free in smb2_open_file()Paulo Alcantara1-0/+2
[ Upstream commit ebbbc4bfad4cb355d17c671223d0814ee3ef4eda ] Zero out @err_iov and @err_buftype before retrying SMB2_open() to prevent an UAF bug if @data != NULL, otherwise a double free. Fixes: e3a43633023e ("smb/client: fix memory leak in smb2_open_file()") Reported-by: David Howells <dhowells@redhat.com> Closes: https://lore.kernel.org/r/2892312.1770306653@warthog.procyon.org.uk Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27erofs: fix inline data read failure for ztailpacking pclustersGao Xiang1-14/+16
[ Upstream commit c134a40f86efb8d6b5a949ef70e06d5752209be5 ] Compressed folios for ztailpacking pclusters must be valid before adding these pclusters to I/O chains. Otherwise, z_erofs_decompress_pcluster() may assume they are already valid and then trigger a NULL pointer dereference. It is somewhat hard to reproduce because the inline data is in the same block as the tail of the compressed indexes, which are usually read just before. However, it may still happen if a fatal signal arrives while read_mapping_folio() is running, as shown below: erofs: (device dm-1): z_erofs_pcluster_begin: failed to get inline data -4 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 ... pc : z_erofs_decompress_queue+0x4c8/0xa14 lr : z_erofs_decompress_queue+0x160/0xa14 sp : ffffffc08b3eb3a0 x29: ffffffc08b3eb570 x28: ffffffc08b3eb418 x27: 0000000000001000 x26: ffffff8086ebdbb8 x25: ffffff8086ebdbb8 x24: 0000000000000001 x23: 0000000000000008 x22: 00000000fffffffb x21: dead000000000700 x20: 00000000000015e7 x19: ffffff808babb400 x18: ffffffc089edc098 x17: 00000000c006287d x16: 00000000c006287d x15: 0000000000000004 x14: ffffff80ba8f8000 x13: 0000000000000004 x12: 00000006589a77c9 x11: 0000000000000015 x10: 0000000000000000 x9 : 0000000000000000 x8 : 0000000000000000 x7 : 0000000000000000 x6 : 000000000000003f x5 : 0000000000000040 x4 : ffffffffffffffe0 x3 : 0000000000000020 x2 : 0000000000000008 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: z_erofs_decompress_queue+0x4c8/0xa14 z_erofs_runqueue+0x908/0x97c z_erofs_read_folio+0x128/0x228 filemap_read_folio+0x68/0x128 filemap_get_pages+0x44c/0x8b4 filemap_read+0x12c/0x5b8 generic_file_read_iter+0x4c/0x15c do_iter_readv_writev+0x188/0x1e0 vfs_iter_read+0xac/0x1a4 backing_file_read_iter+0x170/0x34c ovl_read_iter+0xf0/0x140 vfs_read+0x28c/0x344 ksys_read+0x80/0xf0 __arm64_sys_read+0x24/0x34 invoke_syscall+0x60/0x114 el0_svc_common+0x88/0xe4 do_el0_svc+0x24/0x30 el0_svc+0x40/0xa8 el0t_64_sync_handler+0x70/0xbc el0t_64_sync+0x1bc/0x1c0 Fix this by reading the inline data before allocating and adding the pclusters to the I/O chains. Fixes: cecf864d3d76 ("erofs: support inline data decompression") Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-and-tested-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocationBoris Burkov1-60/+183
[ Upstream commit b14c5e04bd0f722ed631845599d52d03fcae1bc1 ] I have been observing a number of systems aborting at insert_dev_extents() in btrfs_create_pending_block_groups(). The following is a sample stack trace of such an abort coming from forced chunk allocation (typically behind CONFIG_BTRFS_EXPERIMENTAL) but this can theoretically happen to any DUP chunk allocation. [81.801] ------------[ cut here ]------------ [81.801] BTRFS: Transaction aborted (error -17) [81.801] WARNING: fs/btrfs/block-group.c:2876 at btrfs_create_pending_block_groups+0x721/0x770 [btrfs], CPU#1: bash/319 [81.802] Modules linked in: virtio_net btrfs xor zstd_compress raid6_pq null_blk [81.803] CPU: 1 UID: 0 PID: 319 Comm: bash Kdump: loaded Not tainted 6.19.0-rc6+ #319 NONE [81.803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014 [81.804] RIP: 0010:btrfs_create_pending_block_groups+0x723/0x770 [btrfs] [81.806] RSP: 0018:ffffa36241a6bce8 EFLAGS: 00010282 [81.806] RAX: 000000000000000d RBX: ffff8e699921e400 RCX: 0000000000000000 [81.807] RDX: 0000000002040001 RSI: 00000000ffffffef RDI: ffffffffc0608bf0 [81.807] RBP: 00000000ffffffef R08: ffff8e69830f6000 R09: 0000000000000007 [81.808] R10: ffff8e699921e5e8 R11: 0000000000000000 R12: ffff8e6999228000 [81.808] R13: ffff8e6984d82000 R14: ffff8e69966a69c0 R15: ffff8e69aa47b000 [81.809] FS: 00007fec6bdd9740(0000) GS:ffff8e6b1b379000(0000) knlGS:0000000000000000 [81.809] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [81.810] CR2: 00005604833670f0 CR3: 0000000116679000 CR4: 00000000000006f0 [81.810] Call Trace: [81.810] <TASK> [81.810] __btrfs_end_transaction+0x3e/0x2b0 [btrfs] [81.811] btrfs_force_chunk_alloc_store+0xcd/0x140 [btrfs] [81.811] kernfs_fop_write_iter+0x15f/0x240 [81.812] vfs_write+0x264/0x500 [81.812] ksys_write+0x6c/0xe0 [81.812] do_syscall_64+0x66/0x770 [81.812] entry_SYSCALL_64_after_hwframe+0x76/0x7e [81.813] RIP: 0033:0x7fec6be66197 [81.814] RSP: 002b:00007fffb159dd30 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [81.815] RAX: ffffffffffffffda RBX: 00007fec6bdd9740 RCX: 00007fec6be66197 [81.815] RDX: 0000000000000002 RSI: 0000560483374f80 RDI: 0000000000000001 [81.816] RBP: 0000560483374f80 R08: 0000000000000000 R09: 0000000000000000 [81.816] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002 [81.817] R13: 00007fec6bfb85c0 R14: 00007fec6bfb5ee0 R15: 00005604833729c0 [81.817] </TASK> [81.817] irq event stamp: 20039 [81.818] hardirqs last enabled at (20047): [<ffffffff99a68302>] __up_console_sem+0x52/0x60 [81.818] hardirqs last disabled at (20056): [<ffffffff99a682e7>] __up_console_sem+0x37/0x60 [81.819] softirqs last enabled at (19470): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0 [81.819] softirqs last disabled at (19463): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0 [81.820] ---[ end trace 0000000000000000 ]--- [81.820] BTRFS: error (device dm-7 state A) in btrfs_create_pending_block_groups:2876: errno=-17 Object already exists Inspecting these aborts with drgn, I observed a pattern of overlapping chunk_maps. Note how stripe 1 of the first chunk overlaps in physical address with stripe 0 of the second chunk. Physical Start Physical End Length Logical Type Stripe ---------------------------------------------------------------------------------------------------- 0x0000000102500000 0x0000000142500000 1.0G 0x0000000641d00000 META|DUP 0/2 0x0000000142500000 0x0000000182500000 1.0G 0x0000000641d00000 META|DUP 1/2 0x0000000142500000 0x0000000182500000 1.0G 0x0000000601d00000 META|DUP 0/2 0x0000000182500000 0x00000001c2500000 1.0G 0x0000000601d00000 META|DUP 1/2 Now how could this possibly happen? All chunk allocation is protected by the chunk_mutex so racing allocations should see a consistent view of the CHUNK_ALLOCATED bit in the chunk allocation extent-io-tree (device->alloc_state as set by chunk_map_device_set_bits()) The tree itself is protected by a spin lock, and clearing/setting the bits is always protected by fs_info->mapping_tree_lock, so no race is apparent. It turns out that there is a subtle bug in the logic regarding chunk allocations that have happened in the current transaction, known as "pending extents". The chunk allocation as defined in find_free_dev_extent() is a loop which searches the commit root of the dev_root and looks for gaps between DEV_EXTENT items. For those gaps, it then checks alloc_state bitmap for any pending extents and adjusts the hole that it finds accordingly. However, the logic in that adjustment assumes that the first pending extent is the only one in that range. e.g., given a layout with two non-consecutive pending extents in a hole passed to dev_extent_hole_check() via *hole_start and *hole_size: |----pending A----| real hole |----pending B----| | candidate hole | *hole_start *hole_start + *hole_size the code incorrectly returns a "hole" from the end of pending extent A until the passed in hole end, failing to account for pending B. However, it is not entirely obvious that it is actually possible to produce such a layout. I was able to reproduce it, but with some contortions: I continued to use the force chunk allocation sysfs file and I introduced a long delay (10 seconds) into the start of the cleaner thread. I also prevented the unused bgs cleaning logic from ever deleting metadata bgs. These help make it easier to deterministically produce the condition but shouldn't really matter if you imagine the conditions happening by race/luck. Allocations/frees can happen concurrently with the cleaner thread preparing to process an unused extent and both create some used chunks with an unused chunk interleaved, all during one transaction. Then btrfs_delete_unused_bgs() sees the unused one and clears it, leaving a range with several pending chunk allocations and a gap in the middle. The basic idea is that the unused_bgs cleanup work happens on a worker so if we allocate 3 block groups in one transaction, then the cleaner work kicked off by the previous transaction comes through and deletes the middle one of the 3, then the commit root shows no dev extents and we have the bad pattern in the extent-io-tree. One final consideration is that the code happens to loop to the next hole if there are no more extents at all, so we need one more dev extent way past the area we are working in. Something like the following demonstrates the technique: # push the BG frontier out to 20G fallocate -l 20G $mnt/foo # allocate one more that will prevent the "no more dev extents" luck fallocate -l 1G $mnt/sticky # sync sync # clear out the allocation area rm $mnt/foo sync _cleaner # let everything quiesce sleep 20 sync # dev tree should have one bg 20G out and the rest at the beginning.. # sort of like an empty FS but with a random sticky chunk. # kick off the cleaner in the background, remember it will sleep 10s # before doing interesting work _cleaner & sleep 3 # create 3 trivial block groups, all empty, all immediately marked as unused. echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" echo 1 > "$(_btrfs_sysfs_space_info $dev data)/force_chunk_alloc" echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" # let the cleaner thread definitely finish, it will remove the data bg sleep 10 # this allocation sees the non-consecutive pending metadata chunks with # data chunk gap of 1G and allocates a 2G extent in that hole. ENOSPC! echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" As for the fix, it is not that obvious. I could not see a trivial way to do it even by adding backup loops into find_free_dev_extent(), so I opted to change the semantics of dev_extent_hole_check() to not stop looping until it finds a sufficiently big hole. For clarity, this also required changing the helper function contains_pending_extent() into two new helpers which find the first pending extent and the first suitable hole in a range. I attempted to clean up the documentation and range calculations to be as consistent and clear as possible for the future. I also looked at the zoned case and concluded that the loop there is different and not to be unified with this one. As far as I can tell, the zoned check will only further constrain the hole so looping back to find more holes is acceptable. Though given that zoned really only appends, I find it highly unlikely that it is susceptible to this bug. Fixes: 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole") Reported-by: Dimitrios Apostolou <jimis@gmx.net> Closes: https://lore.kernel.org/linux-btrfs/q7760374-q1p4-029o-5149-26p28421s468@tzk.arg/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: fix block_group_tree dirty_list corruptionBoris Burkov1-7/+0
[ Upstream commit 3a1f4264daed4b419c325a7fe35e756cada3cf82 ] When the incompat flag EXTENT_TREE_V2 is set, we unconditionally add the block group tree to the switch_commits list before calling switch_commit_roots, as we do for the tree root and the chunk root. However, the block group tree uses normal root dirty tracking and in any transaction that does an allocation and dirties a block group, the block group root will already be linked to a list by the dirty_list field and this use of list_add_tail() is invalid and corrupts the prev/next members of block_group_root->dirty_list. This is apparent on a subsequent list_del on the prev if we enable CONFIG_DEBUG_LIST: [32.1571] ------------[ cut here ]------------ [32.1572] list_del corruption. next->prev should beffff958890202538, but was ffff9588992bd538. (next=ffff958890201538) [32.1575] WARNING: lib/list_debug.c:65 at 0x0, CPU#3: sync/607 [32.1583] CPU: 3 UID: 0 PID: 607 Comm: sync Not tainted 6.18.0 #24PREEMPT(none) [32.1585] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS1.17.0-4.fc41 04/01/2014 [32.1587] RIP: 0010:__list_del_entry_valid_or_report+0x108/0x120 [32.1593] RSP: 0018:ffffaa288287fdd0 EFLAGS: 00010202 [32.1594] RAX: 0000000000000001 RBX: ffff95889326e800 RCX:ffff958890201538 [32.1596] RDX: ffff9588992bd538 RSI: ffff958890202538 RDI:ffffffff82a41e00 [32.1597] RBP: ffff958890202538 R08: ffffffff828fc1e8 R09:00000000ffffefff [32.1599] R10: ffffffff8288c200 R11: ffffffff828e4200 R12:ffff958890201538 [32.1601] R13: ffff95889326e958 R14: ffff958895c24000 R15:ffff958890202538 [32.1603] FS: 00007f0c28eb5740(0000) GS:ffff958af2bd2000(0000)knlGS:0000000000000000 [32.1605] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [32.1607] CR2: 00007f0c28e8a3cc CR3: 0000000109942005 CR4:0000000000370ef0 [32.1609] Call Trace: [32.1610] <TASK> [32.1611] switch_commit_roots+0x82/0x1d0 [btrfs] [32.1615] btrfs_commit_transaction+0x968/0x1550 [btrfs] [32.1618] ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs] [32.1621] __iterate_supers+0xe8/0x190 [32.1622] ? __pfx_sync_fs_one_sb+0x10/0x10 [32.1623] ksys_sync+0x63/0xb0 [32.1624] __do_sys_sync+0xe/0x20 [32.1625] do_syscall_64+0x73/0x450 [32.1626] entry_SYSCALL_64_after_hwframe+0x76/0x7e [32.1627] RIP: 0033:0x7f0c28d05d2b [32.1632] RSP: 002b:00007ffc9d988048 EFLAGS: 00000246 ORIG_RAX:00000000000000a2 [32.1634] RAX: ffffffffffffffda RBX: 00007ffc9d988228 RCX:00007f0c28d05d2b [32.1636] RDX: 00007f0c28e02301 RSI: 00007ffc9d989b21 RDI:00007f0c28dba90d [32.1637] RBP: 0000000000000001 R08: 0000000000000001 R09:0000000000000000 [32.1639] R10: 0000000000000000 R11: 0000000000000246 R12:000055b96572cb80 [32.1641] R13: 000055b96572b19f R14: 00007f0c28dfa434 R15:000055b96572b034 [32.1643] </TASK> [32.1644] irq event stamp: 0 [32.1644] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [32.1646] hardirqs last disabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260 [32.1648] softirqs last enabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260 [32.1650] softirqs last disabled at (0): [<0000000000000000>] 0x0 [32.1652] ---[ end trace 0000000000000000 ]--- Furthermore, this list corruption eventually (when we happen to add a new block group) results in getting the switch_commits and dirty_cowonly_roots lists mixed up and attempting to call update_root on the tree root which can't be found in the tree root, resulting in a transaction abort: [87.8269] BTRFS critical (device nvme1n1): unable to find root key (1 0 0) in tree 1 [87.8272] ------------[ cut here ]------------ [87.8274] BTRFS: Transaction aborted (error -117) [87.8275] WARNING: fs/btrfs/root-tree.c:153 at 0x0, CPU#4: sync/703 [87.8285] CPU: 4 UID: 0 PID: 703 Comm: sync Not tainted 6.18.0 #25 PREEMPT(none) [87.8287] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-4.fc41 04/01/2014 [87.8289] RIP: 0010:btrfs_update_root+0x296/0x790 [btrfs] [87.8295] RSP: 0018:ffffa58d035dfd60 EFLAGS: 00010282 [87.8297] RAX: ffff9a59126ddb68 RBX: ffff9a59126dc000 RCX: 0000000000000000 [87.8299] RDX: 0000000000000000 RSI: 00000000ffffff8b RDI: ffffffffc0b28270 [87.8301] RBP: ffff9a5904aec000 R08: 0000000000000000 R09: 00000000ffffefff [87.8303] R10: ffffffff9ac8c200 R11: ffffffff9ace4200 R12: 0000000000000001 [87.8305] R13: ffff9a59041740e8 R14: ffff9a5904aec1f7 R15: ffff9a590fdefaf0 [87.8307] FS: 00007f54cde6b740(0000) GS:ffff9a5b5a81c000(0000) knlGS:0000000000000000 [87.8309] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [87.8310] CR2: 00007f54cde403cc CR3: 0000000112902004 CR4: 0000000000370ef0 [87.8312] Call Trace: [87.8313] <TASK> [87.8314] ? _raw_spin_unlock+0x23/0x40 [87.8315] commit_cowonly_roots+0x1ad/0x250 [btrfs] [87.8317] ? btrfs_commit_transaction+0x79b/0x1560 [btrfs] [87.8320] btrfs_commit_transaction+0x8aa/0x1560 [btrfs] [87.8322] ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs] [87.8325] __iterate_supers+0xf1/0x170 [87.8326] ? __pfx_sync_fs_one_sb+0x10/0x10 [87.8327] ksys_sync+0x63/0xb0 [87.8328] __do_sys_sync+0xe/0x20 [87.8329] do_syscall_64+0x73/0x450 [87.8330] entry_SYSCALL_64_after_hwframe+0x76/0x7e [87.8331] RIP: 0033:0x7f54cdd05d2b [87.8336] RSP: 002b:00007fff1b58ff78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 [87.8338] RAX: ffffffffffffffda RBX: 00007fff1b590158 RCX: 00007f54cdd05d2b [87.8340] RDX: 00007f54cde02301 RSI: 00007fff1b592b66 RDI: 00007f54cddba90d [87.8342] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 [87.8344] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e07ca96b80 [87.8346] R13: 000055e07ca9519f R14: 00007f54cddfa434 R15: 000055e07ca95034 [87.8348] </TASK> [87.8348] irq event stamp: 0 [87.8349] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [87.8351] hardirqs last disabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0 [87.8353] softirqs last enabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0 [87.8355] softirqs last disabled at (0): [<0000000000000000>] 0x0 [87.8357] ---[ end trace 0000000000000000 ]--- [87.8358] BTRFS: error (device nvme1n1 state A) in btrfs_update_root:153: errno=-117 Filesystem corrupted [87.8360] BTRFS info (device nvme1n1 state EA): forced readonly [87.8362] BTRFS warning (device nvme1n1 state EA): Skipping commit of aborted transaction. [87.8364] BTRFS: error (device nvme1n1 state EA) in cleanup_transaction:2037: errno=-117 Filesystem corrupted Since the block group tree was pulled out of the extent tree and uses normal root dirty tracking, remove the offending extra list_add. This fixes the list corruption and the resulting fs corruption. Fixes: 14033b08a029 ("btrfs: don't save block group root into super block") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: qgroup: return correct error when deleting qgroup relation itemFilipe Manana1-1/+3
[ Upstream commit 51b1fcf71c88c3c89e7dcf07869c5de837b1f428 ] If we fail to delete the second qgroup relation item, we end up returning success or -ENOENT in case the first item does not exist, instead of returning the error from the second item deletion. Fixes: 73798c465b66 ("btrfs: qgroup: Try our best to delete qgroup relations") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: zoned: don't zone append to conventional zoneJohannes Thumshirn2-10/+12
[ Upstream commit b39b26e017c7889181cb84032e22bef72e81cf29 ] In case of a zoned RAID, it can happen that a data write is targeting a sequential write required zone and a conventional zone. In this case the bio will be marked as REQ_OP_ZONE_APPEND but for the conventional zone, this needs to be REQ_OP_WRITE. The setting of REQ_OP_ZONE_APPEND is deferred to the last possible time in btrfs_submit_dev_bio(), but the decision if we can use zone append is cached in btrfs_bio. CC: Naohiro Aota <naohiro.aota@wdc.com> Fixes: e9b9b911e03c ("btrfs: add raid stripe tree to features enabled with debug config") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: introduce btrfs_bio::async_csumQu Wenruo4-23/+67
[ Upstream commit dd57c78aec398717a2fa6488d87b1a6cd43c7d0d ] [ENHANCEMENT] Btrfs currently calculates data checksums then submits the bio. But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write if the inode requires checksum"), any writes with data checksum will fallback to buffered IO, meaning the content will not change during writeback. This means we're safe to calculate the data checksum and submit the bio in parallel, and only need the following new behavior: - Wait the csum generation to finish before calling btrfs_bio::end_io() Or this can lead to use-after-free for the csum generation worker. - Save the current bi_iter for csum_one_bio() As the submission part can advance btrfs_bio::bio.bi_iter, if not saved csum_one_bio() may got an empty bi_iter and do not generate any checksum. Unfortunately this means we have to increase the size of btrfs_bio for 16 bytes, but this is still acceptable. As usual, such new feature is hidden behind the experimental flag. [THEORETIC ANALYZE] Consider the following theoretic hardware performance, which should be more or less close to modern mainstream hardware: Memory bandwidth: 50GiB/s CRC32C bandwidth: 45GiB/s SSD bandwidth: 8GiB/s Then write bandwidth with data checksum before the patch is: 1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s After the patch, the bandwidth is: 1 / ( 1 / 50 + max( 1 / 45 + 1 / 8)) = 6.90 GiB/s The difference is 15.32% improvement. [REAL WORLD BENCHMARK] I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the storage is backed by a PCIe gen3 x4 NVMe. The test is a direct IO write, with 1MiB block size, write 7GiB data into a btrfs mount with data checksum. Thus the direct write will fallback to buffered one: Vanilla Datasum: 1619.97 GiB/s Patched Datasum: 1792.26 GiB/s Diff +10.6 % In my case, the bottleneck is the storage, thus the improvement is not reaching the theoretic one, but still some observable improvement. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b39b26e017c7 ("btrfs: zoned: don't zone append to conventional zone") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: make sure all btrfs_bio::end_io are called in task contextQu Wenruo1-18/+46
[ Upstream commit 4591c3ef751d861d7dd95ff4d2aadb1b5e95854e ] [BACKGROUND] Btrfs has a lot of different bi_end_io functions, to handle different raid profiles. But they introduced a lot of different contexts for btrfs_bio::end_io() calls: - Simple read bios Run in task context, backed by either endio_meta_workers or endio_workers. - Simple write bios Run in IRQ context. - RAID56 write or rebuild bios Run in task context, backed by rmw_workers. - Mirrored write bios Run in irq context. This is inconsistent, and contributes to the number of workqueues used in btrfs. [ENHANCEMENT] Make all the above bios call their btrfs_bio::end_io() in task context, backed by either endio_meta_workers for metadata, or endio_workers for data. For simple write bios, merge the handling into simple_end_io_work(). For mirrored write bios, it will be a little more complex, since both the original or the cloned bios can run the final btrfs_bio::end_io(). Here we make sure the cloned bios are using btrfs_bioset, to reuse the end_io_work, and run both original and cloned work inside the workqueue. Add extra ASSERT()s to make sure btrfs_bio_end_io() is running in task context. This not only unifies the context for btrfs_bio::end_io() functions, but also opens a new door for further btrfs_bio::end_io() related cleanups. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b39b26e017c7 ("btrfs: zoned: don't zone append to conventional zone") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inodeQu Wenruo9-81/+87
[ Upstream commit 81cea6cd7041ebd42281e0517f856d88527d3326 ] Currently there is only one caller which doesn't populate btrfs_bio::inode, and that's scrub. The idea is scrub doesn't want any automatic csum verification nor read-repair, as everything will be handled by scrub itself. However that behavior is really no different than metadata inode, thus we can reuse btree_inode as btrfs_bio::inode for scrub. The only exception is in btrfs_submit_chunk() where if a bbio is from scrub or data reloc inode, we set rst_search_commit_root to true. This means we still need a way to distinguish scrub from metadata, but that can be done by a new flag inside btrfs_bio. Now btrfs_bio::inode is a mandatory parameter, we can extract fs_info from that inode thus can remove btrfs_bio::fs_info to save 8 bytes from btrfs_bio structure. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b39b26e017c7 ("btrfs: zoned: don't zone append to conventional zone") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: headers cleanup to remove unnecessary local includesQu Wenruo21-22/+21
[ Upstream commit c5667f9c8eb90293dfa4e52c65eb89fe39f5652d ] [BUG] When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to grab the fs_info, the header "btrfs_inode.h" is needed to access the full btrfs_inode structure. Then btrfs will fail to compile. [CAUSE] There is a recursive including chain: "bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" -> "bio.h" That recursive including is causing problems for btrfs. [ENHANCEMENT] To reduce the risk of recursive including: - Remove unnecessary local includes from btrfs headers Either the included header is pulled in by other headers, or is completely unnecessary. - Remove btrfs local includes if the header only requires a pointer In that case let the implementing C file to pull the required header. This is especially important for headers like "btrfs_inode.h" which pulls in a lot of other btrfs headers, thus it's a mine field of recursive including. - Remove unnecessary temporary structure definition Either if we have included the header defining the structure, or completely unused. Now including "btrfs_inode.h" inside "bio.h" is completely fine, although "btrfs_inode.h" still includes "extent_map.h", but that header only includes "fs.h", no more chain back to "bio.h". Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b39b26e017c7 ("btrfs: zoned: don't zone append to conventional zone") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27erofs: handle end of filesystem properly for file-backed mountsGao Xiang1-12/+8
[ Upstream commit bc804a8d7e865ef47fb7edcaf5e77d18bf444ebc ] I/O requests beyond the end of the filesystem should be zeroed out, similar to loopback devices and that is what we expect. Fixes: ce63cb62d794 ("erofs: support unencoded inodes for fileio") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27erofs: get rid of raw bi_end_io() usageGao Xiang2-3/+3
[ Upstream commit 80d0c27a0a4af8e0678d7412781482e6f73c22c7 ] These BIOs are actually harmless in practice, as they are all pseudo BIOs and do not use advanced features like chaining. Using the BIO interface is a more friendly and unified approach for both bdev and and file-backed I/Os (compared to awkward bvec interfaces). Let's use bio_endio() instead. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Stable-dep-of: bc804a8d7e86 ("erofs: handle end of filesystem properly for file-backed mounts") Signed-off-by: Sasha Levin <sashal@kernel.org>