kernel/linux.git/fs/xfs, branch v5.10.257

xfs: stop reclaim before pushing AIL during unmount

2026-04-18T08:31:16+00:00

[ Upstream commit 4f24a767e3d64a5f58c595b5c29b6063a201f1e3 ] The unmount sequence in xfs_unmount_flush_inodes() pushed the AIL while background reclaim and inodegc are still running. This is broken independently of any use-after-free issues - background reclaim and inodegc should not be running while the AIL is being pushed during unmount, as inodegc can dirty and insert inodes into the AIL during the flush, and background reclaim can race to abort and free dirty inodes. Reorder xfs_unmount_flush_inodes() to stop inodegc and cancel background reclaim before pushing the AIL. Stop inodegc before cancelling m_reclaim_work because the inodegc worker can re-queue m_reclaim_work via xfs_inodegc_set_reclaimable. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Signed-off-by: Yuto Ohnuki Reviewed-by: Darrick J. Wong Signed-off-by: Carlos Maiolino [ dropped xfs_inodegc_stop() call ] Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

xfs: save ailp before dropping the AIL lock in push callbacks

2026-04-18T08:31:16+00:00

[ Upstream commit 394d70b86fae9fe865e7e6d9540b7696f73aa9b6 ] In xfs_inode_item_push() and xfs_qm_dquot_logitem_push(), the AIL lock is dropped to perform buffer IO. Once the cluster buffer no longer protects the log item from reclaim, the log item may be freed by background reclaim or the dquot shrinker. The subsequent spin_lock() call dereferences lip->li_ailp, which is a use-after-free. Fix this by saving the ailp pointer in a local variable while the AIL lock is held and the log item is guaranteed to be valid. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Reviewed-by: Darrick J. Wong Reviewed-by: Dave Chinner Signed-off-by: Yuto Ohnuki Signed-off-by: Carlos Maiolino Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

xfs: fix integer overflow in bmap intent sort comparator

2026-04-18T08:31:00+00:00

[ Upstream commit 362c490980867930a098b99f421268fbd7ca05fd ] xfs_bmap_update_diff_items() sorts bmap intents by inode number using a subtraction of two xfs_ino_t (uint64_t) values, with the result truncated to int. This is incorrect when two inode numbers differ by more than INT_MAX (2^31 - 1), which is entirely possible on large XFS filesystems. Fix this by replacing the subtraction with cmp_int(). Cc: # v4.9 Fixes: 9f3afb57d5f1 ("xfs: implement deferred bmbt map/unmap operations") Signed-off-by: Long Li Reviewed-by: Darrick J. Wong Signed-off-by: Carlos Maiolino [ replaced `bi_entry()` macro with `container_of()` and inlined `cmp_int()` as a manual three-way comparison expression ] Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

xfs: ensure dquot item is deleted from AIL only after log shutdown

2026-04-18T08:31:00+00:00

[ Upstream commit 186ac39b8a7d3ec7ce9c5dd45e5c2730177f375c ] In xfs_qm_dqflush(), when a dquot flush fails due to corruption (the out_abort error path), the original code removed the dquot log item from the AIL before calling xfs_force_shutdown(). This ordering introduces a subtle race condition that can lead to data loss after a crash. The AIL tracks the oldest dirty metadata in the journal. The position of the tail item in the AIL determines the log tail LSN, which is the oldest LSN that must be preserved for crash recovery. When an item is removed from the AIL, the log tail can advance past the LSN of that item. The race window is as follows: if the dquot item happens to be at the tail of the log, removing it from the AIL allows the log tail to advance. If a concurrent log write is sampling the tail LSN at the same time and subsequently writes a complete checkpoint (i.e., one containing a commit record) to disk before the shutdown takes effect, the journal will no longer protect the dquot's last modification. On the next mount, log recovery will not replay the dquot changes, even though they were never written back to disk, resulting in silent data loss. Fix this by calling xfs_force_shutdown() before xfs_trans_ail_delete() in the out_abort path. Once the log is shut down, no new log writes can complete with an updated tail LSN, making it safe to remove the dquot item from the AIL. Cc: stable@vger.kernel.org Fixes: b707fffda6a3 ("xfs: abort consistently on dquot flush failure") Signed-off-by: Long Li Reviewed-by: Carlos Maiolino Reviewed-by: Christoph Hellwig Signed-off-by: Carlos Maiolino [ adapted error path to preserve existing out_unlock label between xfs_trans_ail_delete and xfs_dqfunlock ] Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

xfs: fix remote xattr valuelblk check

2026-03-04T12:20:19+00:00

[ Upstream commit bd3138e8912c9db182eac5fed1337645a98b7a4f ] In debugging other problems with generic/753, it turns out that it's possible for the system go to down in the middle of a remote xattr set operation such that the leaf block entry is marked incomplete and valueblk is set to zero. Make this no longer a failure. Cc: # v4.15 Fixes: 13791d3b833428 ("xfs: scrub extended attribute leaf space") Signed-off-by: "Darrick J. Wong" Reviewed-by: Christoph Hellwig Signed-off-by: Sasha Levin

xfs: fix freemap adjustments when adding xattrs to leaf blocks

2026-03-04T12:20:18+00:00

[ Upstream commit 3eefc0c2b78444b64feeb3783c017d6adc3cd3ce ] xfs/592 and xfs/794 both trip this assertion in the leaf block freemap adjustment code after ~20 minutes of running on my test VMs: ASSERT(ichdr->firstused >= ichdr->count * sizeof(xfs_attr_leaf_entry_t) + xfs_attr3_leaf_hdr_size(leaf)); Upon enabling quite a lot more debugging code, I narrowed this down to fsstress trying to set a local extended attribute with namelen=3 and valuelen=71. This results in an entry size of 80 bytes. At the start of xfs_attr3_leaf_add_work, the freemap looks like this: i 0 base 448 size 0 rhs 448 count 46 i 1 base 388 size 132 rhs 448 count 46 i 2 base 2120 size 4 rhs 448 count 46 firstused = 520 where "rhs" is the first byte past the end of the leaf entry array. This is inconsistent -- the entries array ends at byte 448, but freemap[1] says there's free space starting at byte 388! By the end of the function, the freemap is in worse shape: i 0 base 456 size 0 rhs 456 count 47 i 1 base 388 size 52 rhs 456 count 47 i 2 base 2120 size 4 rhs 456 count 47 firstused = 440 Important note: 388 is not aligned with the entries array element size of 8 bytes. Based on the incorrect freemap, the name area starts at byte 440, which is below the end of the entries array! That's why the assertion triggers and the filesystem shuts down. How did we end up here? First, recall from the previous patch that the freemap array in an xattr leaf block is not intended to be a comprehensive map of all free space in the leaf block. In other words, it's perfectly legal to have a leaf block with: * 376 bytes in use by the entries array * freemap[0] has [base = 376, size = 8] * freemap[1] has [base = 388, size = 1500] * the space between 376 and 388 is free, but the freemap stopped tracking that some time ago If we add one xattr, the entries array grows to 384 bytes, and freemap[0] becomes [base = 384, size = 0]. So far, so good. But if we add a second xattr, the entries array grows to 392 bytes, and freemap[0] gets pushed up to [base = 392, size = 0]. This is bad, because freemap[1] hasn't been updated, and now the entries array and the free space claim the same space. The fix here is to adjust all freemap entries so that none of them collide with the entries array. Note that this fix relies on commit 2a2b5932db6758 ("xfs: fix attr leaf header freemap.size underflow") and the previous patch that resets zero length freemap entries to have base = 0. Cc: # v2.6.12 Fixes: 1da177e4c3f415 ("Linux-2.6.12-rc2") Signed-off-by: "Darrick J. Wong" Reviewed-by: Christoph Hellwig Signed-off-by: Sasha Levin

xfs: delete attr leaf freemap entries when empty

2026-03-04T12:20:18+00:00

[ Upstream commit 6f13c1d2a6271c2e73226864a0e83de2770b6f34 ] Back in commit 2a2b5932db6758 ("xfs: fix attr leaf header freemap.size underflow"), Brian Foster observed that it's possible for a small freemap at the end of the end of the xattr entries array to experience a size underflow when subtracting the space consumed by an expansion of the entries array. There are only three freemap entries, which means that it is not a complete index of all free space in the leaf block. This code can leave behind a zero-length freemap entry with a nonzero base. Subsequent setxattr operations can increase the base up to the point that it overlaps with another freemap entry. This isn't in and of itself a problem because the code in _leaf_add that finds free space ignores any freemap entry with zero size. However, there's another bug in the freemap update code in _leaf_add, which is that it fails to update a freemap entry that begins midway through the xattr entry that was just appended to the array. That can result in the freemap containing two entries with the same base but different sizes (0 for the "pushed-up" entry, nonzero for the entry that's actually tracking free space). A subsequent _leaf_add can then allocate xattr namevalue entries on top of the entries array, leading to data loss. But fixing that is for later. For now, eliminate the possibility of confusion by zeroing out the base of any freemap entry that has zero size. Because the freemap is not intended to be a complete index of free space, a subsequent failure to find any free space for a new xattr will trigger block compaction, which regenerates the freemap. It looks like this bug has been in the codebase for quite a long time. Cc: # v2.6.12 Fixes: 1da177e4c3f415 ("Linux-2.6.12-rc2") Signed-off-by: "Darrick J. Wong" Reviewed-by: Christoph Hellwig Signed-off-by: Sasha Levin

xfs: mark data structures corrupt on EIO and ENODATA

2026-03-04T12:20:18+00:00

[ Upstream commit f39854a3fb2f06dc69b81ada002b641ba5b4696b ] I learned a few things this year: first, blk_status_to_errno can return ENODATA for critical media errors; and second, the scrub code doesn't mark data structures as corrupt on ENODATA or EIO. Currently, scrub failing to capture these errors isn't all that impactful -- the checking code will exit to userspace with EIO/ENODATA, and xfs_scrub will log a complaint and exit with nonzero status. Most people treat fsck tools failing as a sign that the fs is corrupt, but online fsck should mark the metadata bad and keep moving. Cc: stable@vger.kernel.org # v4.15 Fixes: 4700d22980d459 ("xfs: create helpers to record and deal with scrub problems") Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig Signed-off-by: Carlos Maiolino Signed-off-by: Sasha Levin

xfs: set max_agbno to allow sparse alloc of last full inode chunk

2026-02-06T15:40:12+00:00

[ Upstream commit c360004c0160dbe345870f59f24595519008926f ] Sparse inode cluster allocation sets min/max agbno values to avoid allocating an inode cluster that might map to an invalid inode chunk. For example, we can't have an inode record mapped to agbno 0 or that extends past the end of a runt AG of misaligned size. The initial calculation of max_agbno is unnecessarily conservative, however. This has triggered a corner case allocation failure where a small runt AG (i.e. 2063 blocks) is mostly full save for an extent to the EOFS boundary: [2050,13]. max_agbno is set to 2048 in this case, which happens to be the offset of the last possible valid inode chunk in the AG. In practice, we should be able to allocate the 4-block cluster at agbno 2052 to map to the parent inode record at agbno 2048, but the max_agbno value precludes it. Note that this can result in filesystem shutdown via dirty trans cancel on stable kernels prior to commit 9eb775968b68 ("xfs: walk all AGs if TRYLOCK passed to xfs_alloc_vextent_iterate_ags") because the tail AG selection by the allocator sets t_highest_agno on the transaction. If the inode allocator spins around and finds an inode chunk with free inodes in an earlier AG, the subsequent dir name creation path may still fail to allocate due to the AG restriction and cancel. To avoid this problem, update the max_agbno calculation to the agbno prior to the last chunk aligned agbno in the AG. This is not necessarily the last valid allocation target for a sparse chunk, but since inode chunks (i.e. records) are chunk aligned and sparse allocs are cluster sized/aligned, this allows the sb_spino_align alignment restriction to take over and round down the max effective agbno to within the last valid inode chunk in the AG. Note that even though the allocator improvements in the aforementioned commit seem to avoid this particular dirty trans cancel situation, the max_agbno logic improvement still applies as we should be able to allocate from an AG that has been appropriately selected. The more important target for this patch however are older/stable kernels prior to this allocator rework/improvement. Cc: stable@vger.kernel.org # v4.2 Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure") Signed-off-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Carlos Maiolino [ xfs_ag_block_count(args.mp, pag_agno(pag)) => args.mp->m_sb.sb_agblocks ] Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

xfs: fix a memory leak in xfs_buf_item_init()

2026-01-19T12:12:02+00:00

[ Upstream commit fc40459de82543b565ebc839dca8f7987f16f62e ] xfs_buf_item_get_format() may allocate memory for bip->bli_formats, free the memory in the error path. Fixes: c3d5f0c2fb85 ("xfs: complain if anyone tries to create a too-large buffer log item") Cc: stable@vger.kernel.org Signed-off-by: Haoxiang Li Reviewed-by: Christoph Hellwig Reviewed-by: Carlos Maiolino Signed-off-by: Carlos Maiolino [ Adjust context ] Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman