kernel/linux.git/fs/btrfs/relocation.c, branch v6.18.21

btrfs: fix clearing of BTRFS_FS_RELOC_RUNNING if relocation already running

2025-10-13T20:29:03+00:00

When starting relocation, at reloc_chunk_start(), if we happen to find the flag BTRFS_FS_RELOC_RUNNING is already set we return an error (-EINPROGRESS) to the callers. However, the callers then call reloc_chunk_end(), which clears the flag BTRFS_FS_RELOC_RUNNING. This is wrong, since relocation was started by another task and is still running.

Finding the BTRFS_FS_RELOC_RUNNING flag already set is an unexpected scenario, but still our current behaviour is not correct.

Fix this by never calling reloc_chunk_end() if reloc_chunk_start() has returned an error, which is what logically makes sense, since the general widespread pattern is to have end functions called only if the counterpart start functions succeeded. This requires changing reloc_chunk_start() to clear BTRFS_FS_RELOC_RUNNING if there's a pending cancel request.

Fixes: 907d2710d727 ("btrfs: add cancellable chunk relocation support")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov
Reviewed-by: Johannes Thumshirn
Reviewed-by: Qu Wenruo
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
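The start/end pairing described in this commit can be modelled in userspace C. The following is a minimal sketch, not the kernel code: `struct fs_state`, `chunk_start()`, `chunk_end()` and `relocate()` are hypothetical names, and C11 atomics stand in for the kernel's bit operations.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the filesystem state. */
struct fs_state {
	atomic_bool reloc_running;	/* models BTRFS_FS_RELOC_RUNNING */
	atomic_int cancel_req;		/* models a pending cancel request */
};

/* Returns 0 on success. On error the caller must NOT call chunk_end(). */
static int chunk_start(struct fs_state *fs)
{
	/* Another task owns the flag: report it and leave the flag alone. */
	if (atomic_exchange(&fs->reloc_running, true))
		return -EINPROGRESS;

	/* We set the flag ourselves, so undo it before failing. */
	if (atomic_load(&fs->cancel_req) > 0) {
		atomic_store(&fs->cancel_req, 0);
		atomic_store(&fs->reloc_running, false);
		return -ECANCELED;
	}
	return 0;
}

/* Only valid after a successful chunk_start(). */
static void chunk_end(struct fs_state *fs)
{
	atomic_store(&fs->reloc_running, false);
}

static int relocate(struct fs_state *fs)
{
	int ret = chunk_start(fs);

	if (ret < 0)
		return ret;	/* no chunk_end() on this path */

	/* ... relocation work would happen here ... */

	chunk_end(fs);
	return 0;
}
```

With this shape, a caller that loses the race sees -EINPROGRESS and the flag stays set for the task that owns it, while the cancel path cleans up after itself inside the start function.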

btrfs: add unlikely annotations to branches leading to transaction abort

2025-09-23T06:49:26+00:00

The unlikely() annotation is a static prediction hint that the compiler may use to move code out of the hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen.

Transaction abort is one such error. btrfs_abort_transaction() inlines code to check the state and print a warning, and this ought to be out of the hot path. The most common pattern is when transaction abort is called after checking a return value and the control flow leads to a quick return. In other cases it may not be necessary to add unlikely(), e.g. when the function returns anyway or the control flow is not changed noticeably.

Reviewed-by: Filipe Manana
Signed-off-by: David Sterba
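The common pattern named in this commit (check a return value, abort, return quickly) can be illustrated with a small userspace sketch. It assumes GCC or Clang for `__builtin_expect`; `do_step()`, `abort_transaction()` and `commit_step()` are hypothetical helpers, not btrfs functions.

```c
#include <assert.h>
#include <errno.h>

/* Branch hint for GCC/Clang; the kernel provides a macro of the same name. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical counter standing in for the side effects of an abort. */
static int abort_count;

static void abort_transaction(int err)
{
	(void)err;
	abort_count++;
}

/* Hypothetical operation that fails only for negative input. */
static int do_step(int input)
{
	return input < 0 ? -EIO : 0;
}

static int commit_step(int input)
{
	int ret = do_step(input);

	/* Error path is marked cold; the success path falls straight through. */
	if (unlikely(ret)) {
		abort_transaction(ret);
		return ret;
	}
	return 0;
}
```

The hint changes only code layout, never behaviour: both branches return the same values with or without the annotation.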

btrfs: add unlikely annotations to branches leading to EIO

2025-09-23T06:49:26+00:00

btrfs: add unlikely annotations to branches leading to EUCLEAN

2025-09-23T06:49:26+00:00

btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions

2025-09-23T06:49:26+00:00

Trivial pattern for the auto freeing with goto -> return conversions if possible.

The following cases are considered trivial in this patch:

1. Cases where there are no operations between btrfs_free_path() and the function return.
2. Cases where only simple cleanup operations (such as kfree(), kvfree(), clear_bit(), and fs_path_free()) are present between btrfs_free_path() and the function return.

Signed-off-by: Sun YangKai
Reviewed-by: David Sterba
Signed-off-by: David Sterba
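The goto -> return conversion can be sketched in userspace with the GCC/Clang `cleanup` variable attribute. This is an illustration under assumptions: `struct path`, `path_alloc()`, `path_free()` and the `PATH_AUTO_FREE()` macro below are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tree path object. */
struct path {
	int slot;
};

static int live_paths;	/* counts outstanding allocations, for checking only */

static struct path *path_alloc(void)
{
	struct path *p = calloc(1, sizeof(*p));

	if (p)
		live_paths++;
	return p;
}

static void path_free(struct path *p)
{
	if (p) {
		live_paths--;
		free(p);
	}
}

/* The cleanup attribute passes a pointer to the annotated variable. */
static void path_auto_free(struct path **pp)
{
	path_free(*pp);
}

/* Hypothetical macro: declare a path that is freed when it leaves scope. */
#define PATH_AUTO_FREE(name) \
	struct path *name __attribute__((cleanup(path_auto_free))) = NULL

static int lookup(int key)
{
	PATH_AUTO_FREE(path);

	path = path_alloc();
	if (!path)
		return -ENOMEM;
	if (key < 0)
		return -EINVAL;	/* plain return, no "goto out" label needed */

	path->slot = key;
	return path->slot;	/* value is read before the cleanup runs */
}
```

Every return in `lookup()` releases the path automatically, which is what makes the "no operations between free and return" case a mechanical conversion.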

btrfs: convert several int parameters to bool

2025-09-22T08:54:32+00:00

We're almost done cleaning misused int/bool parameters. Convert a bunch of them, found by manual grepping. Note that btrfs_sync_fs() needs an int as it's mandated by the struct super_operations prototype.

Reviewed-by: Boris Burkov
Signed-off-by: David Sterba

btrfs: do not allow relocation of partially dropped subvolumes

2025-08-07T15:07:15+00:00

[BUG]
There is an internal report that balance triggered a transaction abort, with the following call trace:

  item 85 key (594509824 169 0) itemoff 12599 itemsize 33
          extent refs 1 gen 197740 flags 2
          ref#0: tree block backref root 7
  item 86 key (594558976 169 0) itemoff 12566 itemsize 33
          extent refs 1 gen 197522 flags 2
          ref#0: tree block backref root 7
  ...
  BTRFS error (device loop0): extent item not found for insert, bytenr 594526208 num_bytes 16384 parent 449921024 root_objectid 934 owner 1 offset 0
  BTRFS error (device loop0): failed to run delayed ref for logical 594526208 num_bytes 16384 type 182 action 1 ref_mod 1: -117
  ------------[ cut here ]------------
  BTRFS: Transaction aborted (error -117)
  WARNING: CPU: 1 PID: 6963 at ../fs/btrfs/extent-tree.c:2168 btrfs_run_delayed_refs+0xfa/0x110 [btrfs]

And btrfs check doesn't report anything wrong related to the extent tree.

[CAUSE]
The cause is a little complex. Firstly, the extent tree indeed doesn't have the backref for 594526208.

The extent tree only has the following two backrefs around that bytenr on-disk:

  item 65 key (594509824 METADATA_ITEM 0) itemoff 13880 itemsize 33
          refs 1 gen 197740 flags TREE_BLOCK
          tree block skinny level 0
          (176 0x7) tree block backref root CSUM_TREE
  item 66 key (594558976 METADATA_ITEM 0) itemoff 13847 itemsize 33
          refs 1 gen 197522 flags TREE_BLOCK
          tree block skinny level 0
          (176 0x7) tree block backref root CSUM_TREE

But such a missing backref item is not a corruption on disk, as the offending delayed ref belongs to subvolume 934, and that subvolume is being dropped:

  item 0 key (934 ROOT_ITEM 198229) itemoff 15844 itemsize 439
          generation 198229 root_dirid 256 bytenr 10741039104 byte_limit 0 bytes_used 345571328
          last_snapshot 198229 flags 0x1000000000001(RDONLY) refs 0
          drop_progress key (206324 EXTENT_DATA 2711650304) drop_level 2
          level 2 generation_v2 198229

And that offending tree block 594526208 is inside the dropped range of that subvolume.
That explains why there is no backref item for that bytenr and why btrfs check is not reporting anything wrong.

But this also shows another problem, as btrfs will do all the orphan subvolume cleanup at a read-write mount. So a half-dropped subvolume should not exist after an RW mount, and balance and subvolume cleanup are mutually exclusive, meaning we shouldn't hit a half-dropped subvolume during relocation.

The root cause is that there is no orphan item for this subvolume. In fact there are 5 subvolumes from around 2021 that have the same problem.

It looks like the reported filesystem was at some point run with older kernels, which caused those zombie subvolumes. Thankfully upstream commit 8d488a8c7ba2 ("btrfs: fix subvolume/snapshot deletion not triggered on mount") has long fixed the bug.

[ENHANCEMENT]
For repairing such an old fs, btrfs-progs will be enhanced.

The problem only shows up late (at run delayed ref time), and by then we already have to abort the transaction, which is too late.

Instead, here we reject any half-dropped subvolume for reloc tree at the earliest time, preventing confusion and extra time wasted on debugging similar bugs.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana
Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
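The early rejection can be sketched as a simple predicate on the root item. This is a userspace illustration only: the structs are simplified, `check_relocatable()` is a hypothetical name, and treating a non-zero drop_progress objectid as "drop in progress" is an assumption about how the check could be made, not a statement of the actual patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* -117 in the log above is EUCLEAN on Linux; define it where it is missing. */
#ifndef EUCLEAN
#define EUCLEAN 117
#endif

/* Simplified, hypothetical versions of the on-disk key and root item. */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct root_item {
	uint32_t refs;
	struct key drop_progress;
};

/*
 * Assumption: a non-zero drop_progress objectid means a subvolume drop
 * was started and not finished, so part of the tree is already gone.
 */
static int check_relocatable(const struct root_item *item)
{
	if (item->drop_progress.objectid != 0)
		return -EUCLEAN;
	return 0;
}
```

Failing at this point, before any reloc tree is created, gives an error that names the real problem instead of a transaction abort much later.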

btrfs: enable large data folios for data reloc inode

2025-07-21T23:13:03+00:00

Data reloc inodes are a special type of inode that is not exposed to user space and is only utilized during data block group relocation.

They do not go through regular read-write operations. Instead, their file extents are manually created to have the same layout as a block group, then the content is read from the original block group and written back to the new location, which is in a new block group.

Previously all the handling was done in page units, and commit c2832898126f ("btrfs: make relocate_one_page() handle subpage case") changed the handling to subpage blocks.

On the other hand, data reloc inodes are a perfect match for large data folios, as each relocation cluster represents one or more data extents that are contiguous in their logical addresses.

This patch enables large folios for data reloc inodes by:

- Remove the special handling of data reloc inodes when setting folio order

- Change relocate_one_folio() to return the file offset of the next folio
  Originally it was designed to handle fixed page sized blocks, but with large folios we can handle a whole large folio at once, thus we have to return the end of the current folio.

- Remove the warning on folio_order()

- Use folio_size() to replace fixed PAGE_SIZE usage

- Use file_offset as iterator inside relocate_file_extent_cluster

Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
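The change from a fixed page step to "return the file offset of the next folio" can be sketched as a small userspace loop. This is not the kernel code: `folio_size_at()`, `relocate_one()` and `relocate_cluster()` are hypothetical, and the rule that a folio is the largest naturally aligned power-of-two size up to a maximum is an assumption made only so the example has variable folio sizes.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 4096u

/*
 * Hypothetical folio model: the folio covering a block-aligned offset is the
 * largest power-of-two size (up to max_size) that the offset is aligned to.
 */
static uint64_t folio_size_at(uint64_t offset, uint64_t max_size)
{
	uint64_t size = max_size;

	while (size > BLOCK_SIZE && (offset % size) != 0)
		size /= 2;
	return size;
}

/* Handle one folio and return the file offset where the next folio starts. */
static uint64_t relocate_one(uint64_t offset, uint64_t max_size,
			     unsigned int *folios)
{
	uint64_t size = folio_size_at(offset, max_size);

	(*folios)++;
	return offset + size;	/* end of the current folio */
}

/* Walk a cluster [start, end] (end inclusive) using the file offset as iterator. */
static unsigned int relocate_cluster(uint64_t start, uint64_t end,
				     uint64_t max_size)
{
	unsigned int folios = 0;
	uint64_t cur = start;

	while (cur <= end)
		cur = relocate_one(cur, max_size, &folios);
	return folios;
}
```

With a 4096-byte maximum the loop degenerates to the old one-page-per-step behaviour; with larger folios the same loop takes fewer, uneven steps, and the last folio may extend past the end of the cluster.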

btrfs: reloc: unconditionally invalidate the page cache for each cluster

2025-07-21T23:13:03+00:00

Commit 9d9ea1e68a05 ("btrfs: subpage: fix relocation potentially overwriting last page data") fixed a bug when relocating data block groups for subpage cases.

However, for the incoming large folios for the data reloc inode, we can hit the same situation where the block size is the same as the page size, but the folio we got is still larger than a block. In that case, the old subpage specific check is no longer reliable.

Here we have to enhance the handling by:

- Unconditionally invalidate the page cache for the current cluster
  We set @flush to true so that any dirty folios are properly written back first.

  And this time, instead of dropping the whole page cache, just drop the range covered by the current cluster.

  This will bring a minor performance drop, as for a large folio the heading half will be read twice (read by the previous cluster, then invalidated, then read again by the current cluster).

  However that is required to support large folios, and this gets rid of the kinda tricky manual uptodate flag clearing for each block.

- Remove the special handling of writing back the whole page cache
  filemap_invalidate_inode() handles the write back already, and since we're invalidating all pages in the range, we no longer need to manually clear the uptodate flags for involved blocks.

  Thus there is no need to manually write back the whole page cache.

Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba

btrfs: remove btrfs_clear_extent_bits()

2025-07-21T22:09:22+00:00

It's just a simple wrapper around btrfs_clear_extent_bit() that passes a NULL for its last argument (a cached extent state record), plus there is no counterpart - we have a btrfs_set_extent_bit() but we do not have a btrfs_set_extent_bits() (plural version). So just remove it and make all callers use btrfs_clear_extent_bit() directly.

Reviewed-by: Qu Wenruo
Reviewed-by: Johannes Thumshirn
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba