kernel/linux.git/fs/btrfs, branch v7.1-rc5

Merge tag 'for-7.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

2026-05-23T23:54:48+00:00

Pull btrfs fixes from David Sterba: "A batch of fixes to simple quotas: - add conditional rescheduling point not dependent on the lock during inode iterations to avoid delays with PREEMPT_NONE enabled - fix subvolume deletion so it does not break the squota invariants - properly handle enabling squota, tracking extents in the initial transaction - catch and warn about underflows, clamp to zero to avoid further problems And one fix to inode size handling: - fix handling of preallocated extents beyond i_size when not using the no-holes feature" * tag 'for-7.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: swallow btrfs_record_squota_delta() ENOENT btrfs: clamp to avoid squota underflow btrfs: fix squota accounting during enable generation btrfs: check for subvolume before deleting squota qgroup btrfs: always drop root->inodes lock before cond_resched() btrfs: mark file extent range dirty after converting prealloc extents

btrfs: swallow btrfs_record_squota_delta() ENOENT

2026-05-16T01:08:40+00:00

I thought that it was likely I could harden squota deletion to the point that it was impossible to end up with an extent accounted to a qgroup outliving its qgroup. Several recent bugs have made me re-consider that position. Ultimately, this is a tradeoff between short term stability and long term strictness, but I think given that there could be another layer of bugs behind the 2-3 I just fixed, I would feel much more confident in people using squotas if the risk was "your values can get a bit out of whack which you can fix by deleting stuff or disabling/re-enabling/repairing" vs "it will abort your filesystem". As the final nail in the coffin, the Meta production kernel was lacking earlier fixes from me and Qu regarding subvol qgroup lifetime, so this is what we have been testing at scale, so I think at least for now upstream should have the same extra layer of protection. Reviewed-by: Qu Wenruo Signed-off-by: Boris Burkov Reviewed-by: David Sterba Signed-off-by: David Sterba

btrfs: clamp to avoid squota underflow

2026-05-16T01:07:20+00:00

Simple quota accounting can undercount metadata tree block allocations in certain scenarios. When an undercounted subvolume is deleted and its tree blocks freed, the free deltas decrement rfer/excl past zero, wrapping the u64 to a value near U64_MAX. Once wrapped, can_delete_squota_qgroup() sees non-zero rfer and refuses to delete the qgroup. The qgroup becomes permanently orphaned in the quota tree, since there is no subvolume left to generate frees that would bring the counter back to zero. While we ultimately want to fix any mis-accounting at the source, it is also helpful and worthwhile to mitigate the damage by clamping rfer and excl to zero on underflow rather than allowing the u64 to wrap. This at least allows us to clean up the messed up qgroups on subvol deletion. Reviewed-by: Qu Wenruo Signed-off-by: Boris Burkov Reviewed-by: David Sterba Signed-off-by: David Sterba

btrfs: fix squota accounting during enable generation

2026-05-16T01:07:19+00:00

The first transaction that enables squotas is special and a bit tricky. We have to set BTRFS_FS_QUOTA_ENABLED after the transaction to avoid a deadlock, so any delayed refs that run before we set the bit are not squota accounted. For data this is fine, we don't get an owner_ref, so there is no real harm, it's as if the extent predated squotas. However for metadata, the tree block will have gen == enable_gen so when we free it later, we will decrement the squota accounting, which can result in an underflow. Before it is freed, btrfs check shows errors, as we have mismatched usage between the node generations/owners and the squota values. There are two angles to this fix: 1. For extents that come in delayed_refs that run during the enable_gen transaction, we must actually set enable_gen to the *next* transaction. That is the first transaction that we can really properly account in any way. 2. For extents that come in between the end of our transaction handle and the time we set the BTRFS_FS_QUOTA_ENABLED bit, we need an additional bit, BTRFS_FS_SQUOTA_ENABLING which only affects recording squota deltas, so we do pick up those extents. Otherwise, we would miss them, even for enable_gen + 1. Fixes: bd7c1ea3a302 ("btrfs: qgroup: check generation when recording simple quota delta") Reviewed-by: Qu Wenruo Signed-off-by: Boris Burkov Reviewed-by: David Sterba Signed-off-by: David Sterba

btrfs: check for subvolume before deleting squota qgroup

2026-05-16T01:07:17+00:00

The invariant that we want to maintain with subvolume qgroups is that the qgroup can only be deleted if there is no root. With squotas, we thought that it was sufficient to just check the usage, because we assumed that deleting a subvolume will drive it's qgroups usage to 0, and thus 0 usage implies no subvolume. However, this is false, for two reasons: - A subvol whose extents are all from before squotas was enabled. - A subvol that was created in this transaction and for which we have not yet run any delayed refs. In both cases, deleting the qgroup breaks the desired invariant and we are left with a subvolume with no qgroup but squotas are enabled. Fix this by unifying the deletion check logic between full qgroups and squotas. Squotas do all the same checks *and* the additional usage == 0 check, which is the one extra rule peculiar to squotas. Link: https://lore.kernel.org/linux-btrfs/adnBhWfJQ1n3hZC8@merlins.org/ Fixes: a8df35619948 ("btrfs: forbid deleting live subvol qgroup") Reviewed-by: Qu Wenruo Signed-off-by: Boris Burkov Reviewed-by: David Sterba Signed-off-by: David Sterba

btrfs: always drop root->inodes lock before cond_resched()

2026-05-16T01:06:56+00:00

find_first_inode() and find_first_inode_to_shrink() lock root->inodes, then loop over them, occasionally skipping some inodes. When they skip an inode, they attempt to share the cpu/lock with cond_resched_lock(). However, that has a subtle problem associated with it. cond_resched_lock() only drops the lock if it needs to actually call schedule(). With CONFIG_PREEMPT_NONE, this means the full timeslice as detected at ticks. With 8+ cpus and default tunables, this is 2.8ms. So regardless of HZ, we will run for at least 2.8ms in this loop without dropping the lock, assuming it finds no suitable inodes. If HZ is small enough, it might be even worse as the tick granularity becomes bigger than the timeslice. The knock-on effect of this is that callers to btrfs_del_inode_from_root() like kswapd trying to shrink the inode slab or userspace threads calling evict() will spin on xa_lock(&root->inodes) for 2.8ms, so the extent map shrinker dominates the lock even though ostensibly it is intending to share it. This produces memory pressure as there is only one kswapd and it runs sequentially so it can get stuck in the inode slab shrinking. To fix it, simply replace cond_resched_lock() with an open coded variant which unconditionally does unlock/lock around cond_resched. Sharing the lock is decoupled from sharing the CPU, and all the users of the lock now share it fairly. I was able to reproduce this on test systems by producing a lot of empty files (to make a big root->inodes xarray), then producing memory pressure by reading large files larger than ram, triggering kswapd and the extent_map shrinker. The lock contention is visible with perf or lockstat. This patch also relieved a user-apparent bottleneck on a production system from the original report. Tested-by: Rik van Riel Reviewed-by: Filipe Manana Signed-off-by: Boris Burkov Signed-off-by: David Sterba

btrfs: mark file extent range dirty after converting prealloc extents

2026-05-16T01:06:37+00:00

When writing into a preallocated extent, ordered extent completion calls btrfs_mark_extent_written() to convert the file extent item from the BTRFS_FILE_EXTENT_PREALLOC type to the BTRFS_FILE_EXTENT_REG type. If the preallocated extent was created beyond i_size with fallocate keep-size, and the inode is evicted and loaded again before the write, the inode's file_extent_tree is initialized only up to i_size. The beyond i_size prealloc extent is therefore not tracked there. After a write into that extent extends i_size, btrfs_mark_extent_written() updates the file extent item, but the corresponding range is not marked dirty in the inode's file_extent_tree. This can leave disk_i_size stale when the filesystem does not use the no-holes feature, so after remount the file size can go back to the old value. The following reproducer triggers the problem: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f -O ^no-holes $DEV mount $DEV $MNT touch $MNT/file fallocate -n -l 2M $MNT/file umount $MNT mount $DEV $MNT dd if=/dev/zero of=$MNT/file bs=1M count=1 conv=notrunc ls -lh $MNT/file umount $MNT mount $DEV $MNT ls -lh $MNT/file umount $MNT Running the reproducer gives the following result: $ ./test.sh (...) 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.000596024 s, 1.8 GB/s -rw-rw-r-- 1 root root 1.0M May 8 16:34 /mnt/sdi/file -rw-rw-r-- 1 root root 0 May 8 16:34 /mnt/sdi/file Fix this by marking the written range dirty in the inode's file_extent_tree after successfully converting the prealloc extent to a regular extent. Fixes: 9ddc959e802b ("btrfs: use the file extent tree infrastructure") Reviewed-by: Filipe Manana Signed-off-by: Robbie Ko [ Minor change log updates ] Signed-off-by: Filipe Manana Signed-off-by: David Sterba

Merge tag 'for-7.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

2026-05-15T20:22:07+00:00

Pull btrfs fixes from David Sterba: - fixup warning when allocating memory for readahead, __GFP_NOWARN was accidentally dropped when setting mapping constraints - in tracepoint of file sync, fix sleeping in atomic context when handling dentries - harden initial loading of block group on crafted/fuzzed images, iterate all chunk mapping entries unconditionally - fix freeing pages of submitted io after checking for errors - fix incorrect inode size after remount when using fallocate KEEP_SIZE mode (also requires disabled 'no-holes' feature) * tag 'for-7.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix incorrect i_size after remount caused by KEEP_SIZE prealloc gap btrfs: only release the dirty pages io tree after successful writes btrfs: tracepoints: fix sleep while in atomic context in btrfs_sync_file() btrfs: always pass __GFP_NOWARN from add_ra_bio_pages() btrfs: fix check_chunk_block_group_mappings() to iterate all chunk maps

btrfs: fix incorrect i_size after remount caused by KEEP_SIZE prealloc gap

2026-05-07T22:32:08+00:00

When fallocate() with FALLOC_FL_KEEP_SIZE preallocates an extent past the current i_size, the file_extent_tree of the inode is updated to cover that range. However, on the next mount, btrfs_read_locked_inode() only re-populates file_extent_tree with [0, round_up(i_size, sectorsize)), losing the marks that belonged to the KEEP_SIZE prealloc extent beyond i_size. Later, when a non-KEEP_SIZE fallocate() extends i_size into / past that old prealloc extent, the reservation loop in btrfs_fallocate() skips already-prealloc segments and does not call into the path that marks the file_extent_tree, so a gap remains inside the file_extent_tree across [old_aligned_i_size, start_of_new_alloc). Then __btrfs_prealloc_file_range() calls btrfs_inode_safe_disk_i_size_write(), which uses find_contiguous_extent_bit() starting at offset 0 to derive disk_i_size. The walk stops at the gap, so disk_i_size ends up smaller than i_size and gets persisted. After the next mount, the file shows the wrong (smaller) size. The following reproducer triggers the problem: $ cat test.sh MNT=/mnt/sdi DEV=/dev/sdi mkdir -p $MNT mkfs.btrfs -f -O ^no-holes $DEV mount $DEV $MNT touch $MNT/file1 # KEEP_SIZE prealloc beyond i_size (i_size stays 0) fallocate -n -o 4M -l 4M $MNT/file1 umount $MNT mount $DEV $MNT # non-KEEP_SIZE fallocate that overlaps the previous prealloc tail # and extends past it fallocate -o 7M -l 2M $MNT/file1 ls -lh $MNT/file1 umount $MNT mount $DEV $MNT ls -lh $MNT/file1 umount $MNT Running the reproducer gives the following result: $ ./test.sh (...) -rw-rw-r-- 1 root root 9.0M May 4 16:35 /mnt/sdi/file1 -rw-rw-r-- 1 root root 7.0M May 4 16:35 /mnt/sdi/file1 The size before the second mount is correct (9M), but after the remount it drops to 7M, i.e. the start of the gap inside file_extent_tree. Fix this in __btrfs_prealloc_file_range() by marking the entire range [round_down(old_i_size, sectorsize), round_up(new_i_size, sectorsize)) in file_extent_tree before updating i_size and calling btrfs_inode_safe_disk_i_size_write(). This ensures the contiguous bit search starting from 0 is not truncated by a stale gap left behind by a previous KEEP_SIZE prealloc that was not restored on inode load. The fix has no effect when the NO_HOLES feature is enabled because btrfs_inode_safe_disk_i_size_write() and btrfs_inode_set_file_extent_range() both take the fast path that directly tracks disk_i_size without consulting file_extent_tree. Fixes: 9ddc959e802b ("btrfs: use the file extent tree infrastructure") Reviewed-by: Filipe Manana Signed-off-by: Robbie Ko [ Minor updates to the change log ] Signed-off-by: Filipe Manana Signed-off-by: David Sterba

btrfs: only release the dirty pages io tree after successful writes

2026-05-07T22:31:47+00:00

[WARNING] With extra warning on dirty extent buffers at umount (aka, the next patch in the series), test case generic/388 can trigger the following warning about dirty extent buffers at unmount time: BTRFS critical (device dm-2 state E): emergency shutdown BTRFS error (device dm-2 state E): error while writing out transaction: -30 BTRFS warning (device dm-2 state E): Skipping commit of aborted transaction. BTRFS error (device dm-2 state EA): Transaction 9 aborted (error -30) BTRFS: error (device dm-2 state EA) in cleanup_transaction:2068: errno=-30 Readonly filesystem BTRFS info (device dm-2 state EA): forced readonly BTRFS info (device dm-2 state EA): last unmount of filesystem 4fbf2e15-f941-49a0-bc7c-716315d2777c ------------[ cut here ]------------ WARNING: disk-io.c:3311 at invalidate_and_check_btree_folios+0xfd/0x1ca [btrfs], CPU#8: umount/914368 CPU: 8 UID: 0 PID: 914368 Comm: umount Tainted: G OE 7.1.0-rc1-custom+ #372 PREEMPT(full) 2de38db8d1deae71fde295430a0ff3ab98ccf596 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:invalidate_and_check_btree_folios+0xfd/0x1ca [btrfs] Call Trace: close_ctree+0x52e/0x574 [btrfs d2f0b1cd330d1287e7a9919d112eadfc0e914efd] generic_shutdown_super+0x89/0x1a0 kill_anon_super+0x16/0x40 btrfs_kill_super+0x16/0x20 [btrfs d2f0b1cd330d1287e7a9919d112eadfc0e914efd] deactivate_locked_super+0x2d/0xb0 cleanup_mnt+0xdc/0x140 task_work_run+0x5a/0xa0 exit_to_user_mode_loop+0x123/0x4b0 do_syscall_64+0x243/0x7c0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 ---[ end trace 0000000000000000 ]--- BTRFS warning (device dm-2 state EA): unable to release extent buffer 30539776 owner 9 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30621696 owner 257 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30638080 owner 258 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30654464 owner 7 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30703616 owner 2 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30720000 owner 10 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30736384 owner 4 gen 9 refs 2 flags 0x7 BTRFS warning (device dm-2 state EA): unable to release extent buffer 30752768 owner 11 gen 9 refs 2 flags 0x7 I'm using a stripped down version, which seems to trigger the warning more reliably: _fsstress_pid="" workload() { dmesg -C mkfs.btrfs -f -K $dev > /dev/null echo 1 > /sys/kernel/debug/clear_warn_once mount $dev $mnt $fsstress -w -n 1024 -p 4 -d $mnt & _fsstress_pid=$! sleep 0 $godown $mnt pkill --echo -PIPE fsstress > /dev/null wait $_fsstress_pid unset _fsstress_pid umount $mnt if dmesg | grep -q "WARNING"; then fail fi } for (( i = 0; i < $runtime; i++ )); do echo "=== $i/$runtime ===" workload done [CAUSE] Inside btrfs_write_and_wait_transaction(), we first try to write all dirty ebs, then wait for them to finish. After that we call btrfs_extent_io_tree_release() to free all extent states from dirty_pages io tree. However if we hit an error from btrfs_write_marked_extent(), then we still call btrfs_extent_io_tree_release() to clear that dirty_pages io tree, which may contain dirty records that we haven't yet submitted. Furthermore, the later transaction cleanup path will utilize that dirty_pages io tree to properly cleanup those dirty ebs, but since it's already empty, no dirty ebs are properly cleaned up, thus will later trigger the warnings inside invalidate_btree_folios(). [FIX] Normally such dirty ebs won't cause problems, as when the iput() is called on the btree inode, the dirty ebs will be forcibly written back, and since the fs is already in an error status, such writeback will not reach disk and finish immediately. But it's still better to get rid of such dirty ebs, if we ended up with dirty ebs but the fs is not in an error status, then such writeback at iput() time will be too late, as all workers are already stopped but writeback will utilize workers, which will lead to NULL pointer dereferences. Instead of unconditionally calling btrfs_extent_io_tree_release(), only call it if btrfs_write_and_wait_transaction() finished successfully, so that @dirty_pages extent io tree is kept untouched for transaction cleanup. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana Signed-off-by: Qu Wenruo Signed-off-by: David Sterba