summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2026-04-07btrfs: remove folio parameter from ordered io related functionsQu Wenruo6-45/+30
Both functions btrfs_finish_ordered_extent() and btrfs_mark_ordered_io_finished() are accepting an optional folio parameter. That @folio is passed into can_finish_ordered_extent(), which later will test and clear the ordered flag for the involved range. However I do not think there is any other call site that can clear ordered flags of an page cache folio and can affect can_finish_ordered_extent(). There are limited *_clear_ordered() callers out of can_finish_ordered_extent() function: - btrfs_migrate_folio() This is completely unrelated, it's just migrating the ordered flag to the new folio. - btrfs_cleanup_ordered_extents() We manually clean the ordered flags of all involved folios, then call btrfs_mark_ordered_io_finished() without a @folio parameter. So it doesn't need and didn't pass a @folio parameter in the first place. - btrfs_writepage_fixup_worker() This function is going to be removed soon, and we should not hit that function anymore. - btrfs_invalidate_folio() This is the real call site we need to bother with. If we already have a bio running, btrfs_finish_ordered_extent() in end_bbio_data_write() will be executed first, as btrfs_invalidate_folio() will wait for the writeback to finish. Thus if there is a running bio, it will not see the range has ordered flags, and just skip to the next range. If there is no bio running, meaning the ordered extent is created but the folio is not yet submitted. In that case btrfs_invalidate_folio() will manually clear the folio ordered range, but then manually finish the ordered extent with btrfs_dec_test_ordered_pending() without bothering the folio ordered flags. Meaning if the OE range with folio ordered flags will be finished manually without the need to call can_finish_ordered_extent(). This means all can_finish_ordered_extent() call sites should get a range that has folio ordered flag set, thus the old "return false" branch should never be triggered. Now we can: - Remove the @folio parameter from involved functions * btrfs_mark_ordered_io_finished() * btrfs_finish_ordered_extent() For call sites passing a @folio into those functions, let them manually clear the ordered flag of involved folios. - Move btrfs_finish_ordered_extent() out of the loop in end_bbio_data_write() We only need to call btrfs_finish_ordered_extent() once per bbio, not per folio. - Add an ASSERT() to make sure all folio ranges have ordered flags It's only for end_bbio_data_write(). And we already have enough safe nets to catch over-accounting of ordered extents. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove the btrfs_inode parameter from btrfs_remove_ordered_extent()Qu Wenruo3-6/+5
We already have btrfs_ordered_extent::inode, thus there is no need to pass a btrfs_inode parameter to btrfs_remove_ordered_extent(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove out-of-date comments in btree_writepages()Qu Wenruo1-30/+4
There is a lengthy comment introduced in commit b3ff8f1d380e ("btrfs: Don't submit any btree write bio if the fs has errors") and commit c9583ada8cc4 ("btrfs: avoid double clean up when submit_one_bio() failed"), explaining two things: - Why we don't want to submit metadata write if the fs has errors - Why we re-set @ret to 0 if it's positive However it's no longer uptodate by the following reasons: - We have better checks nowadays Commit 2618849f31e7 ("btrfs: ensure no dirty metadata is written back for an fs with errors") has introduced better checks, that if the fs is in an error state, metadata writes will not result in any bio but instead complete immediately. That covers all metadata writes better. - Mentioned incorrect function name The commit c9583ada8cc4 ("btrfs: avoid double clean up when submit_one_bio() failed") introduced this ret > 0 handling, but at that time the function name submit_extent_page() was already incorrect. It was submit_eb_page() that could return >0 at that time, and submit_extent_page() could only return 0 or <0 for errors, never >0. Later commit b35397d1d325 ("btrfs: convert submit_extent_page() to use a folio") changed "submit_extent_page()" to "submit_extent_folio()" in the comment, but it doesn't make any difference since the function name is wrong from day 1. Finally commit 5e121ae687b8 ("btrfs: use buffer xarray for extent buffer writeback operations") completely reworked how metadata writeback works, and removed submit_eb_page(), leaving only the wrong function name in the comment. Furthermore the function submit_extent_folio() still exists in the latest code base, but is never utilized for metadata writeback, causing more confusion. Just remove the lengthy comment, and replace the "if (ret > 0)" check with an ASSERT(), since only btrfs_check_meta_write_pointer() can modify @ret and it returns 0 or <0 for errors. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove bogus root search condition in load_extent_tree_free()Filipe Manana1-2/+1
There's no need to pass the maximum between the block group's start offset and BTRFS_SUPER_INFO_OFFSET (64K) since we can't have any block groups allocated in the first megabyte, as that's reserved space. Furthermore, even if we could, the correct thing to do was to pass the block group's start offset anyway - and that's precisely what we do for block groups hat happen to contain a superblock mirror (the range for the super block is never marked as free and it's marked as dirty in the fs_info->excluded_extents io tree). So simplify this and get rid of that maximum expression. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove duplicate include of delayed-inode.h in disk-io.cChen Ni1-1/+0
Remove duplicate inclusion of delayed-inode.h in disk-io.c to clean up redundant code. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: pass literal booleans to functions that take boolean argumentsFilipe Manana10-28/+26
We have several functions with parameters defined as booleans but then we have callers passing integers, 0 or 1, instead of false and true. While this isn't a bug since 0 and 1 are converted to false and true, it is odd and less readable. Change the callers to pass true and false literals instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove pointless out label in qgroup_account_snapshot()Filipe Manana1-9/+8
The 'out' label is pointless as there are no cleanups to perform there, we can replace every goto with a direct return. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: rename btrfs_csum_file_blocks() to btrfs_insert_data_csums()Qu Wenruo4-10/+10
The function btrfs_csum_file_blocks() is a little confusing, unlike btrfs_csum_one_bio(), it is not calculating the checksum of some file blocks. Instead it's just inserting the already calculated checksums into a given root (can be a csum root or a log tree). So rename it to btrfs_insert_data_csums() to reflect its behavior better. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: make add_pending_csums() to take an ordered extent as parameterQu Wenruo1-5/+7
The structure btrfs_ordered_extent has a lot of list heads for different purposes, passing a random list_head pointer is never a good idea as if the wrong list is passed in, the type casting along with the fs will be screwed up. Instead pass the btrfs_ordered_extent pointer, and grab the csum_list inside add_pending_csums() to make it a little safer. Since we're here, also update the comments to follow the current style. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: rename btrfs_ordered_extent::list to csum_listQu Wenruo5-14/+13
That list head records all pending checksums for that ordered extent. And unlike other lists, we just use the name "list", which can be very confusing for readers. Rename it to "csum_list" which follows the remaining lists, showing the purpose of the list. And since we're here, remove a comment inside btrfs_finish_ordered_zoned() where we have "ASSERT(!list_empty(&ordered->csum_list))" to make sure the OE has pending csums. That comment is only here to make sure we do not call list_first_entry() before checking BTRFS_ORDERED_PREALLOC. But since we already have that bit checked and even have a dedicated ASSERT(), there is no need for that comment anymore. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: change return type of cache_save_setup to voidJohannes Thumshirn1-10/+7
None of the callers of `cache_save_setup` care about the return type as the space cache is purely and optimization. Also the free space cache is a deprecated feature that is being phased out. Change the return type of `cache_save_setup` to void to reflect this. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: avoid starting new transaction and commit in relocate_block_group()Filipe Manana1-6/+1
We join a transaction with the goal of catching the current transaction and then commit it to get rid of pinned extents and reclaim free space, but a join can create a new transaction if there isn't any running, and if right before we did the join the current transaction happened to be committed by someone else (like the transaction kthread for example), we end up starting and committing a new transaction, causing rotation of the super block backup roots besides extra and useless IO. So instead of doing a transaction join followed by a commit, use the helper btrfs_commit_current_transaction() which ensures no transaction is created if there isn't any running. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove redundant extent_buffer_uptodate() checks after read_tree_block()Filipe Manana7-48/+6
We have several places that call extent_buffer_uptodate() after reading a tree block with read_tree_block(), but that is redundant since we already call extent_buffer_uptodate() in the call chain of read_tree_block(): read_tree_block() btrfs_read_extent_buffer() read_extent_buffer_pages() returns -EIO if extent_buffer_uptodate() returns false So remove those redundant checks. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: use the helper extent_buffer_uptodate() everywhereFilipe Manana2-4/+4
Instead of open coding testing the uptodate bit on the extent buffer's flags, use the existing helper extent_buffer_uptodate() (which is even shorter to type). Also change the helper's return value from int to bool, since we always use it in a boolean context. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: zoned: add zone reclaim flush state for DATA space_infoJohannes Thumshirn2-0/+23
On zoned block devices, DATA block groups can accumulate large amounts of zone_unusable space (space between the write pointer and zone end). When zone_unusable reaches high levels (e.g., 98% of total space), new allocations fail with ENOSPC even though space could be reclaimed by relocating data and resetting zones. The existing flush states don't handle this scenario effectively - they either try to free cached space (which doesn't exist for zone_unusable) or reset empty zones (which doesn't help when zones contain valid data mixed with zone_unusable space). Add a new RECLAIM_ZONES flush state that triggers the block group reclaim machinery. This state: - Calls btrfs_reclaim_sweep() to identify reclaimable block groups - Calls btrfs_reclaim_bgs() to queue reclaim work - Waits for reclaim_bgs_work to complete via flush_work() - Commits the transaction to finalize changes The reclaim work (btrfs_reclaim_bgs_work) safely relocates valid data from fragmented block groups to other locations before resetting zones, converting zone_unusable space back into usable space. Insert RECLAIM_ZONES before RESET_ZONES in data_flush_states so that we attempt to reclaim partially-used block groups before falling back to resetting completely empty ones. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: zoned: move partially zone_unusable block groups to reclaim listJohannes Thumshirn1-0/+18
On zoned block devices, block groups accumulate zone_unusable space (space between the write pointer and zone end that cannot be allocated until the zone is reset). When a block group becomes mostly zone_unusable but still contains some valid data and it gets added to the unused_bgs list it can never be deleted because it's not actually empty. The deletion code (btrfs_delete_unused_bgs) skips these block groups due to the btrfs_is_block_group_used() check, leaving them on the unused_bgs list indefinitely. This causes two problems: 1. The block groups are never reclaimed, permanently wasting space 2. Eventually leads to ENOSPC even though reclaimable space exists Fix by detecting block groups where zone_unusable exceeds 50% of the block group size. Move these to the reclaim_bgs list instead of skipping them. This triggers btrfs_reclaim_bgs_work() which: 1. Marks the block group read-only 2. Relocates the remaining valid data via btrfs_relocate_chunk() 3. Removes the emptied block group 4. Resets the zones, converting zone_unusable back to usable space The 50% threshold ensures we only reclaim block groups where most space is unusable, making relocation worthwhile. Block groups with less zone_unusable are left on unused_bgs to potentially become fully empty through normal deletion. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: zoned: cap delayed refs metadata reservation to avoid overcommitJohannes Thumshirn2-0/+36
On zoned filesystems metadata space accounting can become overly optimistic due to delayed refs reservations growing without a hard upper bound. The delayed_refs_rsv block reservation is allowed to speculatively grow and is only backed by actual metadata space when refilled. On zoned devices this can result in delayed_refs_rsv reserving a large portion of metadata space that is already effectively unusable due to zone write pointer constraints. As a result, space_info->may_use can grow far beyond the usable metadata capacity, causing the allocator to believe space is available when it is not. This leads to premature ENOSPC failures and "cannot satisfy tickets" reports even though commits would be able to make progress by flushing delayed refs. Analysis of "-o enospc_debug" dumps using a Python debug script confirmed that delayed_refs_rsv was responsible for the majority of metadata overcommit on zoned devices. By correlating space_info counters (total, used, may_use, zone_unusable) across transactions, the analysis showed that may_use continued to grow even after usable metadata space was exhausted, with delayed refs refills accounting for the excess reservations. Here's the output of the analysis: ====================================================================== Space Type: METADATA ====================================================================== Raw Values: Total: 256.00 MB (268435456 bytes) Used: 128.00 KB (131072 bytes) Pinned: 16.00 KB (16384 bytes) Reserved: 144.00 KB (147456 bytes) May Use: 255.48 MB (267894784 bytes) Zone Unusable: 192.00 KB (196608 bytes) Calculated Metrics: Actually Usable: 255.81 MB (total - zone_unusable) Committed: 255.77 MB (used + pinned + reserved + may_use) Consumed: 320.00 KB (used + zone_unusable) Percentages: Zone Unusable: 0.07% of total May Use: 99.80% of total Fix this by adding a zoned-specific cap in btrfs_delayed_refs_rsv_refill(): Before reserving additional metadata bytes, limit the delayed refs reservation based on the usable metadata space (total bytes minus zone_unusable). If the reservation would exceed this cap, return -EAGAIN to trigger the existing flush/commit logic instead of overcommitting metadata space. This preserves the existing reservation and flushing semantics while preventing metadata overcommit on zoned devices. The change is limited to metadata space and does not affect non-zoned filesystems. This patch addresses premature metadata ENOSPC conditions on zoned devices and ensures delayed refs are throttled before exhausting usable metadata. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove duplicated eb uptodate check in btrfs_buffer_uptodate()Filipe Manana1-2/+1
We are calling extent_buffer_uptodate() twice, and the result will not change before the second call. So remove the second call. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: fix the inline compressed extent check in inode_need_compress()Qu Wenruo1-1/+2
[BUG] Since commit 59615e2c1f63 ("btrfs: reject single block sized compression early"), the following script will result the inode to have NOCOMPRESS flag, meanwhile old kernels don't: # mkfs.btrfs -f $dev # mount $dev $mnt -o max_inline=2k,compress=zstd # truncate -s 8k $mnt/foobar # xfs_io -f -c "pwrite 0 2k" $mnt/foobar # sync Before that commit, the inode will not have NOCOMPRESS flag: item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160 generation 9 transid 9 size 8192 nbytes 4096 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 sequence 3 flags 0x0(none) But after that commit, the inode will have NOCOMPRESS flag: item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160 generation 9 transid 10 size 8192 nbytes 4096 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 sequence 3 flags 0x8(NOCOMPRESS) This will make a lot of files no longer to be compressed. [CAUSE] The old compressed inline check looks like this: if (total_compressed <= blocksize && (start > 0 || end + 1 < inode->disk_i_size)) goto cleanup_and_bail_uncompressed; That inline part check is equal to "!(start == 0 && end + 1 >= inode->disk_i_size)", but the new check no longer has that disk_i_size check. Thus it means any single block sized write at file offset 0 will pass the inline check, which is wrong. Furthermore, since we have merged the old check into inode_need_compress(), there is no disk_i_size based inline check anymore, we will always try compressing that single block at file offset 0, then later find out it's not a net win and go to the mark_incompressible tag. This results the inode to have NOCOMPRESS flag. [FIX] Add back the missing disk_i_size based check into inode_need_compress(). Now the same script will no longer cause NOCOMPRESS flag. Fixes: 59615e2c1f63 ("btrfs: reject single block sized compression early") Reported-by: Chris Mason <clm@meta.com> Link: https://lore.kernel.org/linux-btrfs/20260208183840.975975-1-clm@meta.com/ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: set written super flag once in write_all_supers()Filipe Manana1-4/+2
In case we have multiple devices, there is no point in setting the written flag in the super block on every iteration over the device list. Just do it once before the loop. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove max_mirrors argument from write_all_supers()Filipe Manana4-18/+12
There's no need to pass max_mirrors to write_all_supers() since from the given transaction handle we can infer if we are in a transaction commit or fsync context, so we can determine how many mirrors we need to use. So remove the max_mirror argument from write_all_supers() and stop adjusting it in the callees write_dev_supers() and wait_dev_supers(), simplifying them besides the parameter list for write_all_supers(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: tag error branches as unlikely during super block writesFilipe Manana1-14/+13
Mark all the unexpected error checks as unlikely, to make it more clear they are unexpected and to allow the compiler to potentially generate better code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: abort transaction on error in write_all_supers()Filipe Manana1-11/+10
We are in a transaction context and have an handle, so instead of using the less preferred btrfs_handle_fs_error(), abort the transaction and log an error to give some context information. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: pass transaction handle to write_all_supers()Filipe Manana4-4/+5
We are holding a transaction In every context we call write_all_supers(), so pass the transaction handle instead of fs_info to it. This will allow us to abort the transaction in write_all_supers() instead of calling btrfs_handle_fs_error() in a later patch. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: mark all error and warning checks as unlikely in btrfs_validate_super()Filipe Manana1-39/+39
When validating a super block, either when mounting or every time we write a super block to disk, we do many checks for error and warnings and we don't expect to hit any. So mark each one as unlikely to reflect that and allow the compiler to potentially generate better code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: update comment for BTRFS_RESERVE_NO_FLUSHFilipe Manana1-1/+18
The comment is incomplete as BTRFS_RESERVE_NO_FLUSH is used for more reasons than currently holding a transaction handle open. Update the comment with all the other reasons and give some details. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: don't allow log trees to consume global reserve or overcommit metadataFilipe Manana1-0/+25
For a fsync we never reserve space in advance, we just start a transaction without reserving space and we use an empty block reserve for a log tree. We reserve space as we need while updating a log tree, we end up in btrfs_use_block_rsv() when reserving space for the allocation of a log tree extent buffer and we attempt first to reserve without flushing, and if that fails we attempt to consume from the global reserve or overcommit metadata. This makes us consume space that may be the last resort for a transaction commit to succeed, therefore increasing the chances for a transaction abort with -ENOSPC. So make btrfs_use_block_rsv() fail if we can't reserve metadata space for a log tree extent buffer allocation without flushing, making the fsync fallback to a transaction commit and avoid using critical space that could be the only resort for a transaction commit to succeed when we are in a critical space situation. Reviewed-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: be less aggressive with metadata overcommit when we can do full flushingFilipe Manana1-3/+3
Over the years we often get reports of some -ENOSPC failure while updating metadata that leads to a transaction abort. I have seen this happen for filesystems of all sizes and with workloads that are very user/customer specific and unable to reproduce, but Aleksandar recently reported a simple way to reproduce this with a 1G filesystem and using the bonnie++ benchmark tool. The following test script reproduces the failure: $ cat test.sh #!/bin/bash # Create and use a 1G null block device, memory backed, otherwise # the test takes a very long time. modprobe null_blk nr_devices="0" null_dev="/sys/kernel/config/nullb/nullb0" mkdir "$null_dev" size=$((1 * 1024)) # in MB echo 2 > "$null_dev/submit_queues" echo "$size" > "$null_dev/size" echo 1 > "$null_dev/memory_backed" echo 1 > "$null_dev/discard" echo 1 > "$null_dev/power" DEV=/dev/nullb0 MNT=/mnt/nullb0 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/test/ bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b umount $MNT echo 0 > "$null_dev/power" rmdir "$null_dev" When running this bonnie++ fails in the phase where it deletes test directories and files: $ ./test.sh (...) Using uid:0, gid:0. Writing a byte at a time...done Writing intelligently...done Rewriting...done Reading a byte at a time...done Reading intelligently...done start 'em...done...done...done...done...done... Create files in sequential order...done. Stat files in sequential order...done. Delete files in sequential order...done. Create files in random order...done. Stat files in random order...done. Delete files in random order...Can't sync directory, turning off dir-sync. Can't delete file 9Bq7sr0000000338 Cleaning up test directory after error. Bonnie: drastic I/O error (rmdir): Read-only file system And in the syslog/dmesg we can see the following transaction abort trace: [161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction. [161915.502983] ------------[ cut here ]------------ [161915.503832] BTRFS: Transaction aborted (error -28) [161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975 [161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...) [161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G W 6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full) [161915.520857] Tainted: [W]=WARN [161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs] [161915.524630] Code: 48 8b 7c 24 (...) [161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292 [161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000 [161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780 [161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90 [161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4 [161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000 [161915.533229] FS: 00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000 [161915.534611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0 [161915.536758] Call Trace: [161915.537185] <TASK> [161915.537575] btrfs_sync_file+0x431/0x530 [btrfs] [161915.538473] do_fsync+0x39/0x80 [161915.539042] __x64_sys_fsync+0xf/0x20 [161915.539750] do_syscall_64+0x50/0xf20 [161915.540396] entry_SYSCALL_64_after_hwframe+0x76/0x7e [161915.541301] RIP: 0033:0x7ff930ca49ee [161915.541904] Code: 08 0f 85 f5 (...) [161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee [161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 [161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000 [161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0 [161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340 [161915.552161] </TASK> [161915.552457] ---[ end trace 0000000000000000 ]--- [161915.553232] BTRFS info (device nullb0 state A): dumping space info: [161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full [161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 [161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full [161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0 [161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full [161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0 [161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168 [161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0 [161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0 [161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0 [161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0 [161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 reserved 0 [161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left [161915.554463] BTRFS info (device nullb0 state EA): forced readonly The problem is that we allow for a very aggressive metadata overcommit, about 1/8th of the currently available space, even when the task attempting the reservation allows for full flushing. Over time this allows more and more tasks to overcommit without getting a transaction commit to release pinned extents, joining the same transaction and eventually lead to the transaction abort when attempting some tree update, as the extent allocator is not able to find any available metadata extent and it's not able to allocate a new metadata block group either (not enough unallocated space for that). Fix this by allowing the overcommit to be up to 1/64th of the available (unallocated) space instead and for that limit to apply to both types of full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL. This way we get more frequent transaction commits to release pinned extents in case our caller is in a context where full flushing is allowed. Note that the space infos dump in the dmesg/syslog right after the transaction abort give the wrong idea that we have plenty of unallocated space when the abort happened. During the bonnie++ workload we had a metadata chunk allocation attempt and it failed with -ENOSPC because at that time we had a bunch of data block groups allocated, which then became empty and got deleted by the cleaner kthread after the metadata chunk allocation failed with -ENOSPC and before the transaction abort happened and dumped the space infos. The custom tracing (some trace_printk() calls spread in strategic places) used to check that: mount-1793735 [011] ...1. 28877.261096: btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608 flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0 bytes_may_use 0 mount-1793735 [011] ...1. 28877.261098: btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608 flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384 bytes_may_use 0 mount-1793735 [011] ...1. 28877.261100: btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984 flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072 bytes_may_use 0 These are from loading the block groups created by mkfs during mount. Then when bonnie++ starts doing its thing: kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 927596544 kworker/u48:5-1792004 [011] ..... 28886.122055: btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.122064: btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512 flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0 bytes_may_use 5251072 First allocation of a data block group of 112M. kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 810156032 kworker/u48:5-1792004 [011] ..... 28886.192415: btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.192425: btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512 flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0 bytes_may_use 122691584 Another 112M data block group allocated. kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 692715520 kworker/u48:5-1792004 [011] ..... 28886.260943: btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.260954: btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512 flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0 bytes_may_use 240132096 Yet another one. bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 575275008 bonnie++-1793755 [010] ..... 28886.280414: btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1 bonnie++-1793755 [010] ...1. 28886.280419: btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512 flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0 bytes_may_use 268435456 One more. kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 457834496 kworker/u48:5-1792004 [011] ..... 28886.566241: btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1 kworker/u48:5-1792004 [011] ...1. 28886.566250: btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512 flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used 268435456 bytes_may_use 209723392 Another one. bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 340393984 bonnie++-1793755 [009] ..... 28886.613453: btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1 bonnie++-1793755 [009] ...1. 28886.613458: btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512 flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used 268435456 bytes_may_use 2 68435456 Another one. bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 222953472 bonnie++-1793755 [009] ..... 28886.674959: btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1 bonnie++-1793755 [009] ...1. 28886.674963: btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512 flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used 268435456 bytes_may_use 1 34217728 Another one. bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 105512960 bonnie++-1793755 [009] ..... 28886.674983: btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1 bonnie++-1793755 [009] ...1. 28886.674984: btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960 flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used 268435456 bytes_may_use 67108864 Another one, but a bit smaller (~100.6M) since we now have less space. bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 12582912 bonnie++-1793758 [009] ..... 28891.962105: btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1 bonnie++-1793758 [009] ...1. 28891.962114: btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912 flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used 268435456 bytes_may_use 8192 Another one, this one even smaller (12M). kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc: enter first metadata chunk alloc attempt kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want 536870912 kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want 536870912 max_avail 0 536870912 is 512M, the standard 256M metadata chunk size times 2 because of the DUP profile for metadata. 'max_avail' is what find_free_dev_extent() returns to us in gather_device_info(). As a result, gather_device_info() sets ctl->ndevs to 0, making decide_stripe_size() fail with -ENOSPC, and therefore metadata chunk allocation fails while we are attempting to run delayed items during the transaction commit. kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk: decide_stripe_size fail -ENOSPC In the syslog/dmesg pasted above, which happened after the transaction was aborted, the space info dumps did not account for all these data block groups that were allocated during bonnie++'s workload. And that is because after the metadata chunk allocation failed with -ENOSPC and before the transaction abort happened, most of the data block groups had become empty and got deleted by by the cleaner kthread - when the abort happened, we had bonnie++ in the middle of deleting the files it created. But dumping the space infos right after the metadata chunk allocation fails by adding a call to btrfs_dump_space_info_for_trans_abort() in decide_stripe_size() when it returns -ENOSPC, we get: [29972.409295] BTRFS info (device nullb0): dumping space info: [29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group id 0) has 673341440 free, is not full [29972.409303] BTRFS info (device nullb0): space_info total=948568064, used=0, pinned=275226624, reserved=0, may_use=0, readonly=0 zone_unusable=0 [29972.409305] BTRFS info (device nullb0): space_info METADATA (sub-group id 0) has 3915776 free, is not full [29972.409306] BTRFS info (device nullb0): space_info total=53673984, used=163840, pinned=42827776, reserved=147456, may_use=6553600, readonly=65536 zone_unusable=0 [29972.409308] BTRFS info (device nullb0): space_info SYSTEM (sub-group id 0) has 7979008 free, is not full [29972.409310] BTRFS info (device nullb0): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=393216, readonly=0 zone_unusable=0 [29972.409311] BTRFS info (device nullb0): global_block_rsv: size 5767168 reserved 5767168 [29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0 [29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size 393216 reserved 393216 [29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0 [29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0 So here we see there's ~904.6M of data space, ~51.2M of metadata space and 8M of system space, making a total of 963.8M. Reported-by: Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com> Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H61vZ3_+eqJ1A9po2WcgNJJjUu9MJQoYB2oDSAAecHaug@mail.gmail.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: use per-profile available space in calc_available_free_space()Qu Wenruo1-12/+15
For the following disk layout, can_overcommit() can cause false confidence in available space: devid 1 unallocated: 1GiB devid 2 unallocated: 50GiB metadata type: RAID1 As can_overcommit() simply uses unallocated space with factor to calculate the allocatable metadata chunk size, resulting 25.5GiB available space. But in reality we can only allocate one 1GiB RAID1 chunk, the remaining 49GiB on devid 2 will never be utilized to fulfill a RAID1 chunk. This leads to various ENOSPC related transaction abort and flips the fs read-only. Now use per-profile available space in calc_available_free_space(), and only when that failed we fall back to the old factor based estimation. And for zoned devices or for the very low chance of temporary memory allocation failure, we will still fallback to factor based estimation. But I hope in reality it's very rare. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: update per-profile available estimationQu Wenruo1-1/+19
This involves the following timing: - Chunk allocation - Chunk removal - After Mount - New device - Device removal - Device shrink - Device enlarge And since the function btrfs_update_per_profile_avail() will not return an error, this won't cause new error handling path. Although when btrfs_update_per_profile_avail() failed (only ENOSPC possible) it will mark the per-profile available estimation as unreliable, so later btrfs_get_per_profile_avail() will return false and require the caller to have a fallback solution. The function btrfs_update_per_profile_avail() will be executed with chunk_mutex hold, thus it will slightly slow down those involved functions, but not a lot. As all the core workload is just various u64 calculations inside a loop, without any tree search, the overhead should be acceptable even for all supported 9 profiles. For 4 disks (which exercises all 9 profiles), the execution time of that function will still be less than 10 us. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: introduce the device layout aware per-profile available spaceQu Wenruo2-0/+198
[BUG] There is a long known bug that if metadata is using RAID1 on two disks with unbalanced sizes, there is a very high chance to hit ENOSPC related transaction abort. [CAUSE] The root cause is in the available space estimation code: - Factor based calculation Just use all unallocated space, divide by the profile factor One obvious user is can_overcommit(). This can not handle the following example: devid 1 unallocated: 1GiB devid 2 unallocated: 50GiB metadata type: RAID1 If using factor based estimation, we can use (1GiB + 50GiB) / 2 = 25.5GiB free space for metadata. Thus we can continue allocating metadata (over-commit) way beyond the 1GiB limit. But this estimation is completely wrong, in reality we can only allocate one single 1GiB RAID1 block group, thus if we continue over-commit, at one time we will hit ENOSPC at some critical path and flips the fs read-only. [SOLUTION] This patch will introduce per-profile available space estimation, which can provide chunk-allocator like behavior to give a (mostly) accurate result, with under-estimate corner cases. There are some differences between the estimation and real chunk allocator: - No consideration on hole size It's fine for most cases, as all data/metadata strips are in 1GiB size thus there should not be any hole wasting much space. And chunk allocator is able to use smaller stripes when there is really no other choice. Although in theory this means it can lead to some over-estimation, it should not cause too much hassle in the real world. The other benefit of such behavior is, we avoid dev-extent tree search completely, thus the overhead is very small. - No true balance for certain cases If we have 3 disks RAID1, and each device has 2GiB unallocated space, we can load balance the chunk allocation so that we can allocate 3GiB RAID1 chunks, and that's what chunk allocator will do. But this current estimation code is using the largest available space to do a single allocation. Meaning the estimation will be 2GiB, thus under estimate. Such under estimation is fine and after the first chunk allocation, the estimation will be updated and still give a correct 2GiB estimation. So this only means the estimation will be a little conservative, which is safer for call sites like metadata over-commit check. With that facility, for above 1GiB + 50GiB case, it will give a RAID1 estimation of 1GiB, instead of the incorrect 25.5GiB. Or for a more complex example: devid 1 unallocated: 1T devid 2 unallocated: 1T devid 3 unallocated: 10T We will get an array of: RAID10: 2T RAID1: 2T RAID1C3: 1T RAID1C4: 0 (not enough devices) DUP: 6T RAID0: 3T SINGLE: 12T RAID5: 2T RAID6: 1T [IMPLEMENTATION] And for the each profile , we go chunk allocator level calculation: The pseudo code looks like: clear_virtual_used_space_of_all_rw_devices(); do { /* * The same as chunk allocator, despite used space, * we also take virtual used space into consideration. */ sort_device_with_virtual_free_space(); /* * Unlike chunk allocator, we don't need to bother hole/stripe * size, so we use the smallest device to make sure we can * allocated as many stripes as regular chunk allocator */ stripe_size = device_with_smallest_free->avail_space; stripe_size = min(stripe_size, to_alloc / ndevs); /* * Allocate a virtual chunk, allocated virtual chunk will * increase virtual used space, allow next iteration to * properly emulate chunk allocator behavior. */ ret = alloc_virtual_chunk(stripe_size, &allocated_size); if (ret == 0) avail += allocated_size; } while (ret == 0) This minimal available space based calculation is not perfect, but the important part is, the estimation is never exceeding the real available space. This patch just introduces the infrastructure, no hooks are executed yet. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: zoned: remove redundant space_info lock and variable in ↵Jiasheng Jiang1-6/+2
do_allocation_zoned() In do_allocation_zoned(), the code acquires space_info->lock before block_group->lock. However, the critical section does not access or modify any members of the space_info structure. Thus, the lock is redundant as it provides no necessary synchronization here. This change simplifies the locking logic and aligns the function with other zoned paths, such as __btrfs_add_free_space_zoned(), which only rely on block_group->lock. Since the 'space_info' local variable is no longer used after removing the lock calls, it is also removed. Removing this unnecessary lock reduces contention on the global space_info lock, improving concurrency in the zoned allocation path. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: move min sys chunk array size check to validate_sys_chunk_array()Filipe Manana1-13/+9
We check the minimum size of the sys chunk array in btrfs_validate_super() but we have a better place for that, the helper validate_sys_chunk_array() which we use for every other sys chunk array check. So move it there, also converting the return error from -EINVAL to -EUCLEAN, which is a better fit and also consistent with the other checks. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove duplicate system chunk array max size overflow checkFilipe Manana1-6/+0
We check it twice, once in validate_sys_chunk_array() and then again in its caller, btrfs_validate_super(), right after it calls validate_sys_chunk_array(). So remove the duplicated check from btrfs_validate_super(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: pass boolean literals as the last argument to inc_block_group_ro()Filipe Manana1-8/+8
The last argument of inc_block_group_ro() is defined as a boolean, but every caller is passing an integer literal, 0 or 1 for false and true respectively. While this is not incorrect, as 0 and 1 are converted to false and true, it's less readable and somewhat awkward since the argument is defined as boolean. Replace 0 and 1 with false and true. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: tests: zoned: add tests cases for zoned codeNaohiro Aota5-0/+695
Add a test function for the zoned code, for now it tests btrfs_load_block_group_by_raid_type() with various test cases. The load_zone_info_tests[] array defines the test cases. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07landlock: Document fallocate(2) as another truncation corner caseGünther Noack1-2/+6
Reinforce the already stated policy that LANDLOCK_ACCESS_FS_TRUNCATE should always go hand in hand with LANDLOCK_ACCESS_FS_WRITE_FILE, as their meanings and enforcement overlap in counterintuitive ways. On many common file systems, fallocate(2) offers a way to shorten files as long as the file is opened for writing, side-stepping the LANDLOCK_ACCESS_FS_TRUNCATE right. Assisted-by: Gemini-CLI:gemini-3.1 Signed-off-by: Günther Noack <gnoack@google.com> Link: https://lore.kernel.org/r/20260401150911.1038072-1-gnoack@google.com Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07landlock: Document FS access right for pathname UNIX socketsGünther Noack1-1/+13
Add LANDLOCK_ACCESS_FS_RESOLVE_UNIX to the example code, and explain it in the section about previous limitations. The bulk of the interesting flag documentation lives in the kernel header and is included in the Sphinx rendering. Cc: Justin Suess <utilityemal77@gmail.com> Cc: Mickaël Salaün <mic@digikod.net> Signed-off-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20260327164838.38231-13-gnoack3000@gmail.com Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07selftests/landlock: Simplify ruleset creation and enforcement in fs_testGünther Noack1-611/+210
* Add enforce_fs() for defining and enforcing a ruleset in one step * In some places, dropped "ASSERT_LE(0, fd)" checks after create_ruleset() call -- create_ruleset() already checks that. * In some places, rename "file_fd" to "fd" if it is not needed to disambiguate any more. Signed-off-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20260327164838.38231-12-gnoack3000@gmail.com [mic: Tweak subjet] Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07selftests/landlock: Check that coredump sockets stay unrestrictedGünther Noack1-0/+143
Even when a process is restricted with the new LANDLOCK_ACCESS_FS_RESOLVE_UNIX right, the kernel can continue writing its coredump to the configured coredump socket. In the test, we create a local server and rewire the system to write coredumps into it. We then create a child process within a Landlock domain where LANDLOCK_ACCESS_FS_RESOLVE_UNIX is restricted and make the process crash. The test uses SO_PEERCRED to check that the connecting client process is the expected one. Includes a fix by Mickaël Salaün for setting the EUID to 0 (see [1]). Link[1]: https://lore.kernel.org/all/20260218.ohth8theu8Yi@digikod.net/ Suggested-by: Mickaël Salaün <mic@digikod.net> Signed-off-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20260327164838.38231-11-gnoack3000@gmail.com Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07selftests/landlock: Audit test for LANDLOCK_ACCESS_FS_RESOLVE_UNIXGünther Noack1-0/+40
Add an audit test to check that Landlock denials from LANDLOCK_ACCESS_FS_RESOLVE_UNIX result in audit logs in the expected format. (There is one audit test for each filesystem access right, so we should add one for LANDLOCK_ACCESS_FS_RESOLVE_UNIX as well.) Signed-off-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20260327164838.38231-10-gnoack3000@gmail.com Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07selftests/landlock: Test LANDLOCK_ACCESS_FS_RESOLVE_UNIXGünther Noack1-16/+374
* Extract common helpers from an existing IOCTL test that also uses pathname unix(7) sockets. * These tests use the common scoped domains fixture which is also used in other Landlock scoping tests and which was used in Tingmao Wang's earlier patch set in [1]. These tests exercise the cross product of the following scenarios: * Stream connect(), Datagram connect(), Datagram sendmsg() and Seqpacket connect(). * Child-to-parent and parent-to-child communication * The Landlock policy configuration as listed in the scoped_domains fixture. * In the default variant, Landlock domains are only placed where prescribed in the fixture. * In the "ALL_DOMAINS" variant, Landlock domains are also placed in the places where the fixture says to omit them, but with a LANDLOCK_RULE_PATH_BENEATH that allows connection. Cc: Justin Suess <utilityemal77@gmail.com> Cc: Tingmao Wang <m@maowtm.org> Cc: Mickaël Salaün <mic@digikod.net> Link[1]: https://lore.kernel.org/all/53b9883648225d5a08e82d2636ab0b4fda003bc9.1767115163.git.m@maowtm.org/ Signed-off-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20260327164838.38231-9-gnoack3000@gmail.com Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07selftests/landlock: Replace access_fs_16 with ACCESS_ALL in fs_testGünther Noack1-37/+17
The access_fs_16 variable was originally intended to stay frozen at 16 access rights so that audit tests would not need updating when new access rights are added. Now that we have 17 access rights, the name is confusing. Replace all uses of access_fs_16 with ACCESS_ALL and delete the variable. Suggested-by: Mickaël Salaün <mic@digikod.net> Signed-off-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20260327164838.38231-8-gnoack3000@gmail.com Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07samples/landlock: Add support for named UNIX domain socket restrictionsGünther Noack1-3/+9
The access right for UNIX domain socket lookups is grouped with the read-write rights in the sample tool. Rationale: In the general case, any operations are possible through a UNIX domain socket, including data-mutating operations. Cc: Justin Suess <utilityemal77@gmail.com> Cc: Mickaël Salaün <mic@digikod.net> Signed-off-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20260327164838.38231-7-gnoack3000@gmail.com Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07landlock: Clarify BUILD_BUG_ON check in scoping logicGünther Noack2-6/+12
The BUILD_BUG_ON check in domain_is_scoped() and unmask_scoped_access() should check that the loop that counts down client_layer finishes. We therefore check that the numbers LANDLOCK_MAX_NUM_LAYERS-1 and -1 are both representable by that integer. If they are representable, the numbers in between are representable too, and the loop finishes. Signed-off-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20260327164838.38231-6-gnoack3000@gmail.com Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07landlock: Control pathname UNIX domain socket resolution by pathGünther Noack9-9/+200
* Add a new access right LANDLOCK_ACCESS_FS_RESOLVE_UNIX, which controls the lookup operations for named UNIX domain sockets. The resolution happens during connect() and sendmsg() (depending on socket type). * Change access_mask_t from u16 to u32 (see below) * Hook into the path lookup in unix_find_bsd() in af_unix.c, using a LSM hook. Make policy decisions based on the new access rights * Increment the Landlock ABI version. * Minor test adaptations to keep the tests working. * Document the design rationale for scoped access rights, and cross-reference it from the header documentation. With this access right, access is granted if either of the following conditions is met: * The target socket's filesystem path was allow-listed using a LANDLOCK_RULE_PATH_BENEATH rule, *or*: * The target socket was created in the same Landlock domain in which LANDLOCK_ACCESS_FS_RESOLVE_UNIX was restricted. In case of a denial, connect() and sendmsg() return EACCES, which is the same error as it is returned if the user does not have the write bit in the traditional UNIX file system permissions of that file. The access_mask_t type grows from u16 to u32 to make space for the new access right. This also doubles the size of struct layer_access_masks from 32 byte to 64 byte. To avoid memory layout inconsistencies between architectures (especially m68k), pack and align struct access_masks [2]. Document the (possible future) interaction between scoped flags and other access rights in struct landlock_ruleset_attr, and summarize the rationale, as discussed in code review leading up to [3]. This feature was created with substantial discussion and input from Justin Suess, Tingmao Wang and Mickaël Salaün. Cc: Tingmao Wang <m@maowtm.org> Cc: Justin Suess <utilityemal77@gmail.com> Cc: Kuniyuki Iwashima <kuniyu@google.com> Suggested-by: Jann Horn <jannh@google.com> Link[1]: https://github.com/landlock-lsm/linux/issues/36 Link[2]: https://lore.kernel.org/all/20260401.Re1Eesu1Yaij@digikod.net/ Link[3]: https://lore.kernel.org/all/20260205.8531e4005118@gnoack.org/ Signed-off-by: Günther Noack <gnoack3000@gmail.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20260327164838.38231-5-gnoack3000@gmail.com [mic: Fix kernel-doc formatting, pack and align access_masks] Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07landlock: Use mem_is_zero() in is_layer_masks_allowed()Günther Noack1-1/+1
This is equivalent, but expresses the intent a bit clearer. Signed-off-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20260327164838.38231-3-gnoack3000@gmail.com Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07lsm: Add LSM hook security_unix_findJustin Suess4-3/+43
Add an LSM hook security_unix_find. This hook is called to check the path of a named UNIX socket before a connection is initiated. The peer socket may be inspected as well. Why existing hooks are unsuitable: Existing socket hooks, security_unix_stream_connect(), security_unix_may_send(), and security_socket_connect() don't provide TOCTOU-free / namespace independent access to the paths of sockets. (1) We cannot resolve the path from the struct sockaddr in existing hooks. This requires another path lookup. A change in the path between the two lookups will cause a TOCTOU bug. (2) We cannot use the struct path from the listening socket, because it may be bound to a path in a different namespace than the caller, resulting in a path that cannot be referenced at policy creation time. Consumers of the hook wishing to reference @other are responsible for acquiring the unix_state_lock and checking for the SOCK_DEAD flag therein, ensuring the socket hasn't died since lookup. Cc: Günther Noack <gnoack3000@gmail.com> Cc: Tingmao Wang <m@maowtm.org> Cc: Mickaël Salaün <mic@digikod.net> Cc: Paul Moore <paul@paul-moore.com> Signed-off-by: Justin Suess <utilityemal77@gmail.com> Signed-off-by: Günther Noack <gnoack3000@gmail.com> Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20260327164838.38231-2-gnoack3000@gmail.com Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07landlock: Fix kernel-doc warning for pointer-to-array parametersMickaël Salaün1-2/+2
The insert_rule() and create_rule() functions take a pointer-to-flexible-array parameter declared as: const struct landlock_layer (*const layers)[] The kernel-doc parser cannot handle a qualifier between * and the parameter name in this syntax, producing spurious "Invalid param" and "not described" warnings. Remove the const qualifier of the "layers" argument to avoid this parsing issue. Cc: Günther Noack <gnoack@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Reviewed-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20260310172004.1839864-1-mic@digikod.net Signed-off-by: Mickaël Salaün <mic@digikod.net>
2026-04-07landlock: Fix formatting in tsync.cMickaël Salaün1-49/+58
Fix comment formatting in tsync.c to fit in 80 columns. Cc: Günther Noack <gnoack@google.com> Reviewed-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20260304193134.250495-4-mic@digikod.net Signed-off-by: Mickaël Salaün <mic@digikod.net>