summaryrefslogtreecommitdiff
path: root/fs/btrfs/inode.c
AgeCommit message (Collapse)AuthorFilesLines
11 daysbtrfs: fix a race between renames and directory loggingFilipe Manana1-17/+64
We have a race between a rename and directory inode logging that if it happens and we crash/power fail before the rename completes, the next time the filesystem is mounted, the log replay code will end up deleting the file that was being renamed. This is best explained following a step by step analysis of an interleaving of steps that lead into this situation. Consider the initial conditions: 1) We are at transaction N; 2) We have directories A and B created in a past transaction (< N); 3) We have inode X corresponding to a file that has 2 hardlinks, one in directory A and the other in directory B, so we'll name them as "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a past transaction (< N); 4) We have inode Y corresponding to a file that as a single hard link and is located in directory A, we'll name it as "A/bar". This file was also persisted in a past transaction (< N). The steps leading to a file loss are the following and for all of them we are under transaction N: 1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir(); 2) Task A starts a rename for inode Y, with the goal of renaming from "A/bar" to "A/baz", so we enter btrfs_rename(); 3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling btrfs_insert_inode_ref(); 4) Because the rename happens in the same directory, we don't set the last_unlink_trans field of directoty A's inode to the current transaction id, that is, we don't cal btrfs_record_unlink_dir(); 5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode() (actually the dir index item is added as a delayed item, but the effect is the same); 6) Now before task A adds the new entry "A/baz" to directory A by calling btrfs_add_link(), another task, task B is logging inode X; 7) Task B starts a fsync of inode X and after logging inode X, at btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since inode X has a last_unlink_trans value of N, set at in step 1; 8) At btrfs_log_all_parents() we search for all parent directories of inode X using the commit root, so we find directories A and B and log them. Bu when logging direct A, we don't have a dir index item for inode Y anymore, neither the old name "A/bar" nor for the new name "A/baz" since the rename has deleted the old name but has not yet inserted the new name - task A hasn't called yet btrfs_add_link() to do that. Note that logging directory A doesn't fallback to a transaction commit because its last_unlink_trans has a lower value than the current transaction's id (see step 4); 9) Task B finishes logging directories A and B and gets back to btrfs_sync_file() where it calls btrfs_sync_log() to persist the log tree; 10) Task B successfully persisted the log tree, btrfs_sync_log() completed with success, and a power failure happened. We have a log tree without any directory entry for inode Y, so the log replay code deletes the entry for inode Y, name "A/bar", from the subvolume tree since it doesn't exist in the log tree and the log tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY item that covers the index range for the dentry that corresponds to "A/bar"). Since there's no other hard link for inode Y and the log replay code deletes the name "A/bar", the file is lost. The issue wouldn't happen if task B synced the log only after task A called btrfs_log_new_name(), which would update the log with the new name for inode Y ("A/bar"). Fix this by pinning the log root during renames before removing the old directory entry, and unpinning after btrfs_log_new_name() is called. Fixes: 259c4b96d78d ("btrfs: stop doing unnecessary log updates during a rename") CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
11 daysbtrfs: include root in error message when unlinking inodeFilipe Manana1-3/+3
To help debugging include the root number in the error message, and since this is a critical error that implies a metadata inconsistency and results in a transaction abort change the log message level from "info" to "critical", which is a much better fit. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: return real error from __filemap_get_folio() callsFilipe Manana1-1/+1
We have a few places that always assume a -ENOMEM error happened in case a call to __filemap_get_folio() returns an error, which is just too much of an assumption and even if it would be the case at some point in time, it's not future proof and there's nothing in the documentation that guarantees that only ERR_PTR(-ENOMEM) can be returned with the flags we are passing to it. So use the exact error returned by __filemap_get_folio() instead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: fix invalid data space release when truncating block in NOCOW modeFilipe Manana1-2/+5
If when truncating a block we fail to reserve data space and then we proceed anyway because we can do a NOCOW write, if we later get an error when trying to get the folio from the inode's mapping, we end up releasing data space that we haven't reserved, screwing up the bytes_may_use counter from the data space_info, eventually resulting in an underflow when all other reservations done by other tasks are released, if any, or right away if there are no other reservations at the moment. This is because when we get an error when trying to grab the block's folio we call btrfs_delalloc_release_space(), which releases metadata (which we have reserved) and data (which we haven't reserved). Fix this by calling btrfs_delalloc_release_space() only if we did reserve data space, that is, if we aren't falling back to NOCOW, meaning the local variable @only_release_metadata has a false value, otherwise release only metadata by calling btrfs_delalloc_release_metadata(). Fixes: 6d4572a9d71d ("btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent()Filipe Manana1-5/+5
We are using an integer for the 'delalloc' argument but all we need is a boolean, so switch the type to 'bool' and rename the parameter to 'is_delalloc' to better match the fact that it's a boolean. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: handle aligned EOF truncation correctly for subpage casesQu Wenruo1-1/+54
[BUG] For the following fsx -e 1 run, the btrfs still fails the run on 64K page size with 4K fs block size: READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk OFFSET GOOD BAD RANGE 0x26b3a 0x0000 0x15b4 0x0 operation# (mod 256) for the bad data may be 21 [...] LOG DUMP (28 total operations): 1( 1 mod 256): SKIPPED (no operation) 2( 2 mod 256): SKIPPED (no operation) 3( 3 mod 256): SKIPPED (no operation) 4( 4 mod 256): SKIPPED (no operation) 5( 5 mod 256): WRITE 0x1ea90 thru 0x285e0 (0x9b51 bytes) HOLE 6( 6 mod 256): ZERO 0x1b1a8 thru 0x20bd4 (0x5a2d bytes) 7( 7 mod 256): FALLOC 0x22b1a thru 0x272fa (0x47e0 bytes) INTERIOR 8( 8 mod 256): WRITE 0x741d thru 0x13522 (0xc106 bytes) 9( 9 mod 256): MAPWRITE 0x73ee thru 0xdeeb (0x6afe bytes) 10( 10 mod 256): FALLOC 0xb719 thru 0xb994 (0x27b bytes) INTERIOR 11( 11 mod 256): COPY 0x15ed8 thru 0x18be1 (0x2d0a bytes) to 0x25f6e thru 0x28c77 12( 12 mod 256): ZERO 0x1615e thru 0x1770e (0x15b1 bytes) 13( 13 mod 256): SKIPPED (no operation) 14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff (0x8000 bytes) to 0x1000 thru 0x8fff 15( 15 mod 256): SKIPPED (no operation) 16( 16 mod 256): CLONE 0xa000 thru 0xffff (0x6000 bytes) to 0x36000 thru 0x3bfff 17( 17 mod 256): ZERO 0x14adc thru 0x1b78a (0x6caf bytes) 18( 18 mod 256): TRUNCATE DOWN from 0x3c000 to 0x1e2e3 ******WWWW 19( 19 mod 256): CLONE 0x4000 thru 0x11fff (0xe000 bytes) to 0x16000 thru 0x23fff 20( 20 mod 256): FALLOC 0x311e1 thru 0x3681b (0x563a bytes) PAST_EOF 21( 21 mod 256): FALLOC 0x351c5 thru 0x40000 (0xae3b bytes) EXTENDING 22( 22 mod 256): WRITE 0x920 thru 0x7e51 (0x7532 bytes) 23( 23 mod 256): COPY 0x2b58 thru 0xc508 (0x99b1 bytes) to 0x117b1 thru 0x1b161 24( 24 mod 256): TRUNCATE DOWN from 0x40000 to 0x3c9a5 25( 25 mod 256): SKIPPED (no operation) 26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06 (0x1ae7 bytes) 27( 27 mod 256): SKIPPED (no operation) 28( 28 mod 256): READ 0x26b3a thru 0x36633 (0xfafa bytes) ***RRRR*** [CAUSE] The involved operations are: fallocating to largest ever: 0x40000 21 pollute_eof 0x24000 thru 0x2ffff (0xc000 bytes) 21 falloc from 0x351c5 to 0x40000 (0xae3b bytes) 28 read 0x26b3a thru 0x36633 (0xfafa bytes) At operation #21 a pollute_eof is done, by memory mapped write into range [0x24000, 0x2ffff). At this stage, the inode size is 0x24000, which is block aligned. Then fallocate happens, and since it's expanding the inode, it will call btrfs_truncate_block() to truncate any unaligned range. But since the inode size is already block aligned, btrfs_truncate_block() does nothing and exits. However remember the folio at 0x20000 has some range polluted already, although it will not be written back to disk, it still affects the page cache, resulting the later operation #28 to read out the polluted value. [FIX] Instead of early exit from btrfs_truncate_block() if the range is already block aligned, do extra filio zeroing if the fs block size is smaller than the page size and we're truncating beyond EOF. This is to address exactly the above case where memory mapped write can still leave some garbage beyond EOF. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: handle unaligned EOF truncation correctly for subpage casesQu Wenruo1-27/+80
[BUG] The following fsx sequence will fail on btrfs with 64K page size and 4K fs block size: #fsx -d -e 1 -N 4 $mnt/junk -S 36386 READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk OFFSET GOOD BAD RANGE 0xe9ba 0x0000 0x03ac 0x0 operation# (mod 256) for the bad data may be 3 ... LOG DUMP (4 total operations): 1( 1 mod 256): WRITE 0x6c62 thru 0x1147d (0xa81c bytes) HOLE ***WWWW 2( 2 mod 256): TRUNCATE DOWN from 0x1147e to 0x5448 ******WWWW 3( 3 mod 256): ZERO 0x1c7aa thru 0x28fe2 (0xc839 bytes) 4( 4 mod 256): MAPREAD 0xe9ba thru 0x1578e (0x6dd5 bytes) ***RRRR*** [CAUSE] Only 2 operations are really involved in this case: 3 pollute_eof 0x5448 thru 0xffff (0xabb8 bytes) 3 zero from 0x1c7aa to 0x28fe3, (0xc839 bytes) 4 mapread 0xe9ba thru 0x1578e (0x6dd5 bytes) At operation 3, fsx pollutes beyond EOF, that is done by mmap() and write into that mmap() range beyond EOF. Such write will fill the range beyond EOF, but it will never reach disk as ranges beyond EOF will not be marked dirty nor uptodate. Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond our isize (which was 0x5448), we should zero out any range beyond EOF (0x5448). During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the unaligned head block. But that function only really zeroes out the block at [0x5000, 0x5fff], it doesn't bother any range other that that block, since those ranges will not be marked dirty nor written back. So the range [0x6000, 0xffff] is still polluted, and later mapread() will return the poisoned value. [FIX] Enhance btrfs_truncate_block() by: - Pass a @start/@end pair to indicate the full truncation range This is to handle the following truncation case: Page size is 64K, fs block size is 4K, truncate range is [6K, 60K] 0 32K 64K | |///////////////////////////////////| | 6K 60K The range is not aligned for its head block, so we need to call btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0. But with that information we only know to zero the range [6K, 8K), if we zero out the range [6K, 64K), the last block will also be zeroed, causing data loss. So here we need the full range we're truncating, so that we can avoid over-truncation. - Rename @from to @offset As now the parameter is only utilized to locate a block, it's not really carrying the old @from meaning well. - Remove @front parameter With the full truncate range passed in, we can determine if the @offset is at the head or tail block. - Skip truncation if @offset is not in the head nor tail blocks The call site in hole punch unconditionally call btrfs_truncate_block() without even checking the range is aligned or not. If the @offset is neither in the head nor in tail block, it means we can safely ignore it. - Skip truncate if the range inside the target block is already aligned - Make btrfs_truncate_block() zero all blocks beyond EOF Since we have the original range, we know exactly if we're doing truncation beyond EOF (the @end will be (u64)-1). If we're doing truncation beyond EOF, then enlarge the truncation range to the folio end, to address the possibly polluted ranges. Otherwise still keep the zero range inside the block, as we can have large data folios soon, always truncating every blocks inside the same folio can be costly for large folios. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: pass struct btrfs_inode to btrfs_free_reserved_data_space_noquota()Naohiro Aota1-2/+2
As well as the last patch, pass struct btrfs_inode to the function and let it distinguish which data space it is working on in a later patch. There is no functional change with this commit. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: reformat comments in acls_after_inode_item()David Sterba1-14/+23
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: trivial conversion to return bool instead of intDavid Sterba1-7/+7
Old code has a lot of int for bool return values, bool is recommended and done in new code. Convert the trivial cases that do simple 0/false and 1/true. Functions comment are updated if needed. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use unsigned types for constants defined as bit shiftsDavid Sterba1-6/+6
The unsigned type is a recommended practice (CWE-190, CWE-194) for bit shifts to avoid problems with potential unwanted sign extensions. Although there are no such cases in btrfs codebase, follow the recommendation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use list_first_entry() everywhereDavid Sterba1-3/+3
Using the helper makes it a bit more clear that we're accessing the first list entry. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: convert WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)) to DEBUG_WARNDavid Sterba1-4/+2
Use the conditional warning instead of typing the whole condition. Optional message is printed where it seems clear what could be the problem. Conversion is left out in btree_csum_one_bio() because of the additional condition. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: move kmapping out of btrfs_check_sector_csum()Christoph Hellwig1-10/+10
Move kmapping the page out of btrfs_check_sector_csum(). This allows using bvec_kmap_local() where suitable and reduces the number of kmap*() calls in the raid56 code. This also means btrfs_check_sector_csum() will only accept a properly kmapped address. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename remaining exported extent map functionsFilipe Manana1-6/+6
Rename all the exported functions from extent_map.h that don't have a 'btrfs_' prefix in their names, so that they are consistent with all the other functions, to make it clear they are btrfs specific functions and to avoid potential name collisions in the future with functions defined elsewhere in the kernel. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename functions to allocate and free extent mapsFilipe Manana1-24/+24
These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename extent map functions to get block start, end and check if in treeFilipe Manana1-7/+7
These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename exported extent map compression functionsFilipe Manana1-3/+3
These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename the functions to count, test and get bit ranges in io treesFilipe Manana1-6/+7
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a 'btrfs_' prefix to their names to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename the functions to init and release an extent io treeFilipe Manana1-2/+3
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename set_extent_bit() to include a btrfs prefixFilipe Manana1-7/+7
This is an exported function so it should have a 'btrfs_' prefix by convention, to make it clear it's btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So rename it to btrfs_set_extent_bit(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename the functions to clear bits for an extent rangeFilipe Manana1-21/+22
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. One of them has a double underscore prefix which is also discouraged. So remove double underscore prefix where applicable and add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename __lock_extent() and __try_lock_extent()Filipe Manana1-3/+3
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. Their double underscore prefix is also discouraged. So remove their double underscore prefix, add a 'btrfs_' prefix to their name to make it clear they are from btrfs and a '_bits' suffix to avoid collision with btrfs_lock_extent() and btrfs_try_lock_extent(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: add btrfs prefix to main lock, try lock and unlock extent functionsFilipe Manana1-42/+42
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a prefix to their name. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_encoded_read_inline()David Sterba1-18/+13
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use BTRFS_PATH_AUTO_FREE in can_nocow_extent()David Sterba1-22/+15
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_set_inode_index_count()David Sterba1-9/+7
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use BTRFS_PATH_AUTO_FREE in may_destroy_subvol()David Sterba1-7/+5
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: do more trivial BTRFS_PATH_AUTO_FREE conversionsDavid Sterba1-18/+9
The most trivial pattern for the auto freeing when the variable is declared with the macro and the final btrfs_free_path() is removed. There are almost none goto -> return conversions and there's no other function cleanup. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use clear_extent_bits() instead of clear_extent_bit() where possibleFilipe Manana1-2/+1
Several places are using clear_extent_bit() and passing a NULL value for the 'cached' argument, which is pointless as they can use instead clear_extent_bits(). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: allow folios to be released while ordered extent is finishingFilipe Manana1-2/+4
When the release_folio callback (from struct address_space_operations) is invoked we don't allow the folio to be released if its range is currently locked in the inode's io_tree, as it may indicate the folio may be needed by the task that locked the range. However if the range is locked because an ordered extent is finishing, then we can safely allow the folio to be released because ordered extent completion doesn't need to use the folio at all. When we are under memory pressure, the kernel starts writeback of dirty pages (folios) with the goal of releasing the pages from the page cache after writeback completes, however this often is not possible on btrfs because: * Once the writeback completes we queue the ordered extent completion; * Once the ordered extent completion starts, we lock the range in the inode's io_tree (at btrfs_finish_one_ordered()); * If the release_folio callback is called while the folio's range is locked in the inode's io_tree, we don't allow the folio to be released, so the kernel has to try to release memory elsewhere, which may result in triggering more writeback or releasing other pages from the page cache which may be more useful to have around for applications. In contrast, when the release_folio callback is invoked after writeback finishes and before ordered extent completion starts or locks the range, we allow the folio to be released, as well as when the release_folio callback is invoked after ordered extent completion unlocks the range. Improve on this by detecting if the range is locked for ordered extent completion and if it is, allow the folio to be released. This detection is achieved by adding a new extent flag in the io_tree that is set when the range is locked during ordered extent completion. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: remove leftover EXTENT_UPTODATE clear from an inode's io_treeFilipe Manana1-12/+10
After commit 52b029f42751 ("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path") we never set EXTENT_UPTODATE in an inode's io_tree anymore, but we still have some code attempting to clear that bit from an inode's io_tree. Remove that code as it doesn't do anything anymore. The sole use of the EXTENT_UPTODATE bit is for the excluded extents io_tree (fs_info->excluded_extents), which is used to track the locations of super blocks, so that their ranges are never marked as free, making them unavailable for extent allocation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: move block perfect compression out of experimental featuresQu Wenruo1-15/+0
Commit 1d2fbb7f1f9e ("btrfs: allow compression even if the range is not page aligned") introduced the block perfect compression for block size < page size cases. Before that commit, if the fs block size is smaller than page size (aka subpage cases), compressed write is only enabled if the dirty range is fully page aligned. This block perfect compression support was introduced in v6.13, and has been tested for two kernel releases. I believe it's time to move it out of experimental features so that we can get more tests in the real world. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15Merge tag 'for-6.15-rc6-tag' of ↵Linus Torvalds1-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix potential endless loop when discarding a block group when disabling discard - reinstate message when setting a large value of mount option 'commit' - fix a folio leak when async extent submission fails * tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add back warning for mount option commit values exceeding 300 btrfs: fix folio leak in submit_one_async_extent() btrfs: fix discard worker infinite loop after disabling discard
2025-05-12btrfs: fix folio leak in submit_one_async_extent()Boris Burkov1-0/+7
If btrfs_reserve_extent() fails while submitting an async_extent for a compressed write, then we fail to call free_async_extent_pages() on the async_extent and leak its folios. A likely cause for such a failure would be btrfs_reserve_extent() failing to find a large enough contiguous free extent for the compressed extent. I was able to reproduce this by: 1. mount with compress-force=zstd:3 2. fallocating most of a filesystem to a big file 3. fragmenting the remaining free space 4. trying to copy in a file which zstd would generate large compressed extents for (vmlinux worked well for this) Step 4. hits the memory leak and can be repeated ad nauseam to eventually exhaust the system memory. Fix this by detecting the case where we fallback to uncompressed submission for a compressed async_extent and ensuring that we call free_async_extent_pages(). Fixes: 131a821a243f ("btrfs: fallback if compressed IO fails for ENOSPC") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Co-developed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-30Merge tag 'for-6.15-rc4-tag' of ↵Linus Torvalds1-5/+8
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix potential inode leak in iget() after memory allocation failure - in subpage mode, fix extent buffer bitmap iteration when writing out dirty sectors - fix range calculation when falling back to COW for a NOCOW file * tag 'for-6.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: adjust subpage bit start based on sectorsize btrfs: fix the inode leak in btrfs_iget() btrfs: fix COW handling in run_delalloc_nocow()
2025-04-23btrfs: fix the inode leak in btrfs_iget()Penglei Jiang1-1/+3
[BUG] There is a bug report that a syzbot reproducer can lead to the following busy inode at unmount time: BTRFS info (device loop1): last unmount of filesystem 1680000e-3c1e-4c46-84b6-56bd3909af50 VFS: Busy inodes after unmount of loop1 (btrfs) ------------[ cut here ]------------ kernel BUG at fs/super.c:650! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 0 UID: 0 PID: 48168 Comm: syz-executor Not tainted 6.15.0-rc2-00471-g119009db2674 #2 PREEMPT(full) Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:generic_shutdown_super+0x2e9/0x390 fs/super.c:650 Call Trace: <TASK> kill_anon_super+0x3a/0x60 fs/super.c:1237 btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2099 deactivate_locked_super+0xbe/0x1a0 fs/super.c:473 deactivate_super fs/super.c:506 [inline] deactivate_super+0xe2/0x100 fs/super.c:502 cleanup_mnt+0x21f/0x440 fs/namespace.c:1435 task_work_run+0x14d/0x240 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop kernel/entry/common.c:114 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x269/0x290 kernel/entry/common.c:218 do_syscall_64+0xd4/0x250 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> [CAUSE] When btrfs_alloc_path() failed, btrfs_iget() directly returned without releasing the inode already allocated by btrfs_iget_locked(). This results the above busy inode and trigger the kernel BUG. [FIX] Fix it by calling iget_failed() if btrfs_alloc_path() failed. If we hit error inside btrfs_read_locked_inode(), it will properly call iget_failed(), so nothing to worry about. Although the iget_failed() cleanup inside btrfs_read_locked_inode() is a break of the normal error handling scheme, let's fix the obvious bug and backport first, then rework the error handling later. Reported-by: Penglei Jiang <superman.xpt@gmail.com> Link: https://lore.kernel.org/linux-btrfs/20250421102425.44431-1-superman.xpt@gmail.com/ Fixes: 7c855e16ab72 ("btrfs: remove conditional path allocation in btrfs_read_locked_inode()") CC: stable@vger.kernel.org # 6.13+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-23btrfs: fix COW handling in run_delalloc_nocow()Dave Chen1-4/+5
In run_delalloc_nocow(), when the found btrfs_key's offset > cur_offset, it indicates a gap between the current processing region and the next file extent. The original code would directly jump to the "must_cow" label, which increments the slot and forces a fallback to COW. This behavior might skip an extent item and result in an overestimated COW fallback range. This patch modifies the logic so that when a gap is detected: - If no COW range is already being recorded (cow_start is unset), cow_start is set to cur_offset. - cur_offset is then advanced to the beginning of the next extent. - Instead of jumping to "must_cow", control flows directly to "next_slot" so that the same extent item can be reexamined properly. The change ensures that we accurately account for the extent gap and avoid accidentally extending the range that needs to fallback to COW. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-27Merge tag 'for-6.15-tag' of ↵Linus Torvalds1-273/+316
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "User visible changes: - fall back to buffered write if direct io is done on a file that requires checksums - this avoids a problem with checksum mismatch errors, observed e.g. on virtual images when writes to pages under writeback cause the checksum mismatch reports - this may lead to some performance degradation but currently the recommended setup for VM images is to use the NOCOW file attribute that also disables checksums - fast/realtime zstd levels -15 to -1 - supported by mount options (compress=zstd:-5) and defrag ioctl - improved speed, reduced compression ratio, check the commit for sample measurements - defrag ioctl extended to accept negative compression levels - subpage mode - remove warning when subpage mode is used, the feature is now reasonably complete and tested - in debug mode allow to create 2K b-tree nodes to allow testing subpage on x86_64 with 4K pages too Performance improvements: - in send, better file path caching improves runtime (on sample load by -30%) - on s390x with hardware zlib support prepare the input buffer in a better way to get the best results from the acceleration - minor speed improvement in encoded read, avoid memory allocation in synchronous mode Core: - enable stable writes on inodes, replacing manually waiting for writeback and allowing to skip that on inodes without checksums - add last checks and warnings for out-of-band dirty writes to pages, requiring a fixup ("fixup worker"), this should not be necessary since 5.8 where get_user_page() and pin_user_pages*() prevent this - long history behind that, we'll be happy to remove the whole infrastructure in the near future - more folio API conversions and preparations for large folio support - subpage cleanups and refactoring, split handling of data and metadata to allow future support for large folios - readpage works as block-by-block, no change for normal mode, this is preparation for future subpage updates - block group refcount fixes and hardening - delayed iput fixes - in zoned mode, fix zone activation on filesystem with missing devices Cleanups: - inode parameter cleanups - path auto-freeing updates - code flow simplifications in send - redundant parameter cleanups" * tag 'for-6.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (164 commits) btrfs: zoned: fix zone finishing with missing devices btrfs: zoned: fix zone activation with missing devices btrfs: remove end_no_trans label from btrfs_log_inode_parent() btrfs: simplify condition for logging new dentries at btrfs_log_inode_parent() btrfs: remove redundant else statement from btrfs_log_inode_parent() btrfs: use memcmp_extent_buffer() at replay_one_extent() btrfs: update outdated comment for overwrite_item() btrfs: use variables to store extent buffer and slot at overwrite_item() btrfs: avoid unnecessary memory allocation and copy at overwrite_item() btrfs: don't clobber ret in btrfs_validate_super() btrfs: prepare btrfs_page_mkwrite() for large folios btrfs: prepare extent_io.c for future large folio support btrfs: prepare btrfs_launcher_folio() for large folios support btrfs: replace PAGE_SIZE with folio_size for subpage.[ch] btrfs: add a size parameter to btrfs_alloc_subpage() btrfs: subpage: make btrfs_is_subpage() check against a folio btrfs: add extra warning if delayed iput is added when it's not allowed btrfs: avoid redundant path slot assignment in btrfs_search_forward() btrfs: remove unnecessary btrfs_key local variable in btrfs_search_forward() btrfs: simplify the return value handling in search_ioctl() ...
2025-03-24Merge tag 'vfs-6.15-rc1.async.dir' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs async dir updates from Christian Brauner: "This contains cleanups that fell out of the work from async directory handling: - Change kern_path_locked() and user_path_locked_at() to never return a negative dentry. This simplifies the usability of these helpers in various places - Drop d_exact_alias() from the remaining place in NFS where it is still used. This also allows us to drop the d_exact_alias() helper completely - Drop an unnecessary call to fh_update() from nfsd_create_locked() - Change i_op->mkdir() to return a struct dentry Change vfs_mkdir() to return a dentry provided by the filesystems which is hashed and positive. This allows us to reduce the number of cases where the resulting dentry is not positive to very few cases. The code in these places becomes simpler and easier to understand. - Repack DENTRY_* and LOOKUP_* flags" * tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: fix inline emphasis warning VFS: Change vfs_mkdir() to return the dentry. nfs: change mkdir inode_operation to return alternate dentry if needed. fuse: return correct dentry for ->mkdir ceph: return the correct dentry on mkdir hostfs: store inode in dentry after mkdir if possible. Change inode_operations.mkdir to return struct dentry * nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked() nfs/vfs: discard d_exact_alias() VFS: add common error checks to lookup_one_qstr_excl() VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry VFS: repack LOOKUP_ bit flags. VFS: repack DENTRY_ flags.
2025-03-18btrfs: prepare btrfs_launcher_folio() for large folios supportQu Wenruo1-1/+1
That function is only calling btrfs_qgroup_free_data(), which doesn't care about the size of the folio. Just replace the fixed PAGE_SIZE with folio_size(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: subpage: make btrfs_is_subpage() check against a folioQu Wenruo1-1/+1
To support large data folios, we can no longer assume every filemap folio is page sized. So btrfs_is_subpage() check must be done against a folio. Thankfully for metadata folios, we have the full control and ensure a large folio will not be large than nodesize, so btrfs_meta_is_subpage() doesn't need this change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: add extra warning if delayed iput is added when it's not allowedQu Wenruo1-0/+1
Since I have triggered the ASSERT() on the delayed iput too many times, now is the time to add some extra debug warnings for delayed iput. All delayed iputs should be queued after all ordered extents finish their IO and all involved workqueues are flushed. Thus after the btrfs_run_delayed_iputs() inside close_ctree(), there should be no more delayed puts added. So introduce a new BTRFS_FS_STATE_NO_DELAYED_IPUT, set after the above mentioned timing. And all btrfs_add_delayed_iput() will check that flag and give a WARN_ON_ONCE(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: make btrfs_iget_path() return a btrfs inode insteadFilipe Manana1-4/+4
It's an internal function and btrfs_iget() is now returning a btrfs inode, so change btrfs_iget_path() to also return a btrfs inode instead of a VFS inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: make btrfs_iget() return a btrfs inode insteadFilipe Manana1-28/+28
It's an internal function and most of the time the callers are doing a lot of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so change the return type to struct btrfs_inode instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: defrag: extend ioctl to accept compression levelsDaniel Vacek1-3/+6
The zstd and zlib compression types support setting compression levels. Enhance the defrag interface to specify the levels as well. For zstd the negative (realtime) levels are also accepted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: reject out-of-band dirty folios during writebackQu Wenruo1-0/+15
[OUT-OF-BAND DIRTY FOLIOS] An out-of-band folio means the folio is marked dirty but without notifying the filesystem. This can lead to various problems, not limited to: - No folio::private to track per block status - No proper space reserved for such a dirty folio [HISTORY IN BTRFS] This used to be a problem related to get_user_page(), but with the introduction of pin_user_pages*(), we should no longer hit such case anymore. In btrfs, we have a long history of catching such out-of-band dirty folios by: - Mark the folio ordered during delayed allocation - Check the folio ordered flag during writeback If the folio has no ordered flag, it means it doesn't go through delayed allocation, thus it's definitely an out-of-band one. If we got one, we go through COW fixup, which will re-dirty the folio with proper handling in another workqueue. [PROBLEMS OF COW-FIXUP] Such workaround is a blockage for us to migrate to iomap (it requires extra flags to trace if a folio is dirtied by the fs or not) and I'd argue it's not data checksum safe, since if a folio can be marked dirty without informing the fs, the content can also change at any time. But with the introduction of pin_user_pages*() during v5.8 merge window, such out-of-band dirty folio such be treated as a bug. Ext4 has treated such case by warning and erroring out even before pin_user_pages*(). Furthermore, there are already proofs that such folio ordered flag tracking can be screwed up by incorrect error handling, check the commit messages of the following commits: 06f364284794 ("btrfs: do proper folio cleanup when cow_file_range() failed") c2b47df81c8e ("btrfs: do proper folio cleanup when run_delalloc_nocow() failed") [FIXES] Unlike btrfs, ext4 and xfs (iomap) never bother handling such out-of-band dirty folios. - Ext4 just warns and errors out - Iomap always follows the folio/block dirty flags And there is nothing really COW specific, xfs also supports COW too. Here we take one step towards ext4 by doing warning and erroring out. But since the cow fixup thing is introduced from the beginning, we keep the old behavior for non-experimental builds, and only do the new warning for experimental builds before we're 100% sure and remove cow fixup. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: unify inode variable namingDavid Sterba1-15/+13
Rename binode to inode in local variables or parameters so it's more unified with the rest of the code. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags()David Sterba1-2/+2
Pass a struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags() as it's an internal interface. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: properly limit inline data extent according to block sizeQu Wenruo1-1/+10
Btrfs utilizes inline data extent for the following cases: - Regular small files - Symlinks And "btrfs check" detects any file extents that are too large as an error. It's not a problem for 4K block size, but for the incoming smaller block sizes (2K), it can cause problems due to bad limits: - Non-compressed inline data extents We do not allow a non-compressed inline data extent to be as large as block size. - Symlinks Currently the only real limit on symlinks are 4K, which can be larger than 2K block size. These will result btrfs-check to report too large file extents. Fix it by adding proper size checks for the above cases. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>