summaryrefslogtreecommitdiff
path: root/fs/btrfs/volumes.c
AgeCommit message (Collapse)AuthorFilesLines
12 daysbtrfs: update superblock's device bytes_used when dropping chunkMark Harmstone1-0/+6
Each superblock contains a copy of the device item for that device. In a transaction which drops a chunk but doesn't create any new ones, we were correctly updating the device item in the chunk tree but not copying over the new bytes_used value to the superblock. This can be seen by doing the following: # dd if=/dev/zero of=test bs=4096 count=2621440 # mkfs.btrfs test # mount test /root/temp # cd /root/temp # for i in {00..10}; do dd if=/dev/zero of=$i bs=4096 count=32768; done # sync # rm * # sync # btrfs balance start -dusage=0 . # sync # cd # umount /root/temp # btrfs check test For btrfs-check to detect this, you will also need my patch at https://github.com/kdave/btrfs-progs/pull/991. Change btrfs_remove_dev_extents() so that it adds the devices to the fs_info->post_commit_list if they're not there already. This causes btrfs_commit_device_sizes() to be called, which updates the bytes_used value in the superblock. Fixes: bbbf7243d62d ("btrfs: combine device update operations during transaction commit") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: add space_info parameter for block group creationNaohiro Aota1-5/+30
Add struct btrfs_space_info parameter to btrfs_make_block_group(), its related functions and related struct. Passed space_info will have a new block group. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: get rid of btrfs_read_dev_super()Qu Wenruo1-1/+1
The function is introduced by commit a512bbf855ff ("Btrfs: superblock duplication") at the beginning of btrfs. It leaved a comment saying we'd need a special mount option to read all super blocks, but it's never been implemented and there was not need/request for it. The check/rescue tools are able to start from a specific copy and use it as primary eventually. This means btrfs_read_dev_super() is always reading the first super block, making all the code finding the latest super block unnecessary. Just remove that function and replace all call sites with btrfs_read_disk_super(bdev, 0, false). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: merge btrfs_read_dev_one_super() into btrfs_read_disk_super()Qu Wenruo1-44/+39
We have two functions to read a super block from a block device: - btrfs_read_dev_one_super() Exported from disk-io.c - btrfs_read_disk_super() Local to volumes.c And they have some minor differences: - btrfs_read_dev_one_super() uses @copy_num Meanwhile btrfs_read_disk_super() relies on the physical and expected bytenr passed from the caller. The parameter list of btrfs_read_dev_one_super() is more user friendly. - btrfs_read_disk_super() makes sure the label is NUL terminated We do not need two different functions doing the same job, so merge the behavior into btrfs_read_disk_super() by: - Remove btrfs_read_dev_one_super() - Export btrfs_read_disk_super() The name pairs with btrfs_release_disk_super() perfectly. - Change the parameter list of btrfs_read_disk_super() to mimic btrfs_read_dev_one_super() All existing callers are calculating the physical address and expect bytenr before calling btrfs_read_disk_super() already. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: on unknown chunk allocation policy fallback to regularDavid Sterba1-8/+12
We have only two chunk allocation policies right now and the switch/cases don't handle an unknown one properly. The error is in the impossible category (the policy is stored only in memory), we don't have to BUG(), falling back to regular policy should be safe. Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: switch int dev_replace_is_ongoing variables/parameters to boolDavid Sterba1-2/+2
Both the variable and the parameter are used as logical indicators so convert them to bool. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: trivial conversion to return bool instead of intDavid Sterba1-53/+47
Old code has a lot of int for bool return values, bool is recommended and done in new code. Convert the trivial cases that do simple 0/false and 1/true. Functions comment are updated if needed. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use list_first_entry() everywhereDavid Sterba1-4/+4
Using the helper makes it a bit more clear that we're accessing the first list entry. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: convert ASSERT(0) with handled errors to DEBUG_WARN()David Sterba1-3/+4
The use of ASSERT(0) is maybe useful for some cases but more like a notice for developers. Assertions can be compiled in independently so convert it to a debugging helper. The difference is that it's just a warning and will not end up in BUG(). The converted cases are in connection with proper error handling. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use verbose ASSERT() in volumes.cDavid Sterba1-20/+35
The file volumes.c has about 40 assertions and half of them are suitable for ASSERT() with additional data. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename the functions to init and release an extent io treeFilipe Manana1-3/+3
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename the functions to search for bits in extent rangesFilipe Manana1-3/+3
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename set_extent_bit() to include a btrfs prefixFilipe Manana1-3/+3
This is an exported function so it should have a 'btrfs_' prefix by convention, to make it clear it's btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So rename it to btrfs_set_extent_bit(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename the functions to clear bits for an extent rangeFilipe Manana1-5/+5
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. One of them has a double underscore prefix which is also discouraged. So remove double underscore prefix where applicable and add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use clear_extent_bits() at chunk_map_device_clear_bits()Filipe Manana1-4/+3
Instead of using __clear_extent_bit() we can use clear_extent_bits() since we pass a NULL value for the cached and changeset arguments. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02Revert "btrfs: canonicalize the device path before adding it"Qu Wenruo1-90/+1
This reverts commit 7e06de7c83a746e58d4701e013182af133395188. Commit 7e06de7c83a7 ("btrfs: canonicalize the device path before adding it") tries to make btrfs to use "/dev/mapper/*" name first, then any filename inside "/dev/" as the device path. This is mostly fine when there is only the root namespace involved, but when multiple namespace are involved, things can easily go wrong for the d_path() usage. As d_path() returns a file path that is namespace dependent, the resulted string may not make any sense in another namespace. Furthermore, the "/dev/" prefix checks itself is not reliable, one can still make a valid initramfs without devtmpfs, and fill all needed device nodes manually. Overall the userspace has all its might to pass whatever device path for mount, and we are not going to win the war trying to cover every corner case. So just revert that commit, and do no extra d_path() based file path sanity check. CC: stable@vger.kernel.org # 6.12+ Link: https://lore.kernel.org/linux-fsdevel/20250115185608.GA2223535@zen.localdomain/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: unify ordering of btrfs_key initializationsDavid Sterba1-8/+8
The btrfs_key is defined as objectid/type/offset and the keys are also printed like that. For better readability, update all key initializations to match this order. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-08Merge tag 'for-6.14-rc5-tag' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix leaked extent map after error when reading chunks - replace use of deprecated strncpy - in zoned mode, fixed range when ulocking extent range, causing a hang * tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix a leaked chunk map issue in read_one_chunk() btrfs: replace deprecated strncpy() with strscpy() btrfs: zoned: fix extent range end unlock in cow_file_range()
2025-03-06btrfs: fix a leaked chunk map issue in read_one_chunk()Haoxiang Li1-0/+1
Add btrfs_free_chunk_map() to free the memory allocated by btrfs_alloc_chunk_map() if btrfs_add_chunk_map() fails. Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps") CC: stable@vger.kernel.org Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-25Merge tag 'for-6.14-rc4-tag' of ↵Linus Torvalds1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - extent map shrinker fixes: - fix potential use after free accessing an inode to reach fs_info, the shrinker could do iput() in the meantime - skip unnecessary scanning of inodes without extent maps - do direct iput(), no need for indirection via workqueue - in block < page mode, fix race when extending i_size in buffered mode - fix minor memory leak in selftests - print descriptive error message when seeding device is not found * tag 'for-6.14-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix data overwriting bug during buffered write when block size < page size btrfs: output an error message if btrfs failed to find the seed fsid btrfs: do regular iput instead of delayed iput during extent map shrinking btrfs: skip inodes without loaded extent maps when shrinking extent maps btrfs: fix use-after-free on inode when scanning root during em shrinking btrfs: selftests: fix btrfs_test_delayed_refs() leak of transaction
2025-02-21btrfs: output an error message if btrfs failed to find the seed fsidQu Wenruo1-1/+5
[BUG] If btrfs failed to locate the seed device for whatever reason, mounting the sprouted device will fail without any meaning error message: # mkfs.btrfs -f /dev/test/scratch1 # btrfstune -S1 /dev/test/scratch1 # mount /dev/test/scratch1 /mnt/btrfs # btrfs dev add -f /dev/test/scratch2 /mnt/btrfs # umount /mnt/btrfs # btrfs dev scan -u # btrfs mount /dev/test/scratch2 /mnt/btrfs mount: /mnt/btrfs: fsconfig system call failed: No such file or directory. dmesg(1) may have more information after failed mount system call. # dmesg -t | tail -n6 BTRFS info (device dm-5): first mount of filesystem 64252ded-5953-4868-b962-cea48f7ac4ea BTRFS info (device dm-5): using crc32c (crc32c-generic) checksum algorithm BTRFS info (device dm-5): using free-space-tree BTRFS error (device dm-5): failed to read chunk tree: -2 BTRFS error (device dm-5): open_ctree failed: -2 [CAUSE] The failure to mount is pretty straight forward, just unable to find the seed device and its fsid, caused by `btrfs dev scan -u`. But the lack of any useful info is a problem. [FIX] Just add an extra error message in open_seed_devices() to indicate the error. Now the error message would look like this: BTRFS info (device dm-4): first mount of filesystem 7769223d-4db1-4e4c-ac29-0a96f53576ab BTRFS info (device dm-4): using crc32c (crc32c-generic) checksum algorithm BTRFS info (device dm-4): using free-space-tree BTRFS error (device dm-4): failed to find fsid e87c12e6-584b-4e98-8b88-962c33a619ff when attempting to open seed devices BTRFS error (device dm-4): failed to read chunk tree: -2 BTRFS error (device dm-4): open_ctree failed: -2 Link: https://github.com/kdave/btrfs-progs/issues/959 Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-21Merge tag 'for-6.14-tag' of ↵Linus Torvalds1-102/+140
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "User visible changes, features: - rebuilding of the free space tree at mount time is done in more transactions, fix potential hangs when the transaction thread is blocked due to large amount of block groups - more read IO balancing strategies (experimental config), add two new ways how to select a device for read if the profiles allow that (all RAID1*), the current default selects the device by pid which is good on average but less performant for single reader workloads - select preferred device for all reads (namely for testing) - round-robin, balance reads across devices relevant for the requested IO range - add encoded write ioctl support to io_uring (read was added in 6.12), basis for writing send stream using that instead of syscalls, non-blocking mode is not yet implemented - support FS_IOC_READ_VERITY_METADATA, applications can use the metadata to do their own verification - pass inode's i_write_hint to bios, for parity with other filesystems, ioctls F_GET_RW_HINT/F_SET_RW_HINT Core: - in zoned mode: allow to directly reclaim a block group by simply resetting it, then it can be reused and another block group does not need to be allocated - super block validation now also does more comprehensive sys array validation, adding it to the points where superblock is validated (post-read, pre-write) - subpage mode fixes: - fix double accounting of blocks due to some races - improved or fixed error handling in a few cases (compression, delalloc) - raid stripe tree: - fix various cases with extent range splitting or deleting - implement hole punching to extent range - reduce number of stripe tree lookups during bio submission - more self-tests - updated self-tests (delayed refs) - error handling improvements - cleanups, refactoring - remove rest of backref caching infrastructure from relocation, not needed anymore - error message updates - remove unnecessary calls when extent buffer was marked dirty - unused parameter removal - code moved to new files Other code changes: add rb_find_add_cached() to the rb-tree API" * tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (127 commits) btrfs: selftests: add a selftest for deleting two out of three extents btrfs: selftests: add test for punching a hole into 3 RAID stripe-extents btrfs: selftests: add selftest for punching holes into the RAID stripe extents btrfs: selftests: test RAID stripe-tree deletion spanning two items btrfs: selftests: don't split RAID extents in half btrfs: selftests: check for correct return value of failed lookup btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extents btrfs: implement hole punching for RAID stripe extents btrfs: fix deletion of a range spanning parts two RAID stripe extents btrfs: fix tail delete of RAID stripe-extents btrfs: fix front delete range calculation for RAID stripe extents btrfs: assert RAID stripe-extent length is always greater than 0 btrfs: don't try to delete RAID stripe-extents if we don't need to btrfs: selftests: correct RAID stripe-tree feature flag setting btrfs: add io_uring interface for encoded writes btrfs: remove the unused locked_folio parameter from btrfs_cleanup_ordered_extents() btrfs: add extra error messages for delalloc range related errors btrfs: subpage: dump the involved bitmap when ASSERT() failed btrfs: subpage: fix the bitmap dump of the locked flags btrfs: do proper folio cleanup when run_delalloc_nocow() failed ...
2025-01-13btrfs: add the missing error handling inside get_canonical_dev_pathQu Wenruo1-0/+4
Inside function get_canonical_dev_path(), we call d_path() to get the final device path. But d_path() can return error, and in that case the next strscpy() call will trigger an invalid memory access. Add back the missing error handling for d_path(). Reported-by: Boris Burkov <boris@bur.io> Fixes: 7e06de7c83a7 ("btrfs: canonicalize the device path before adding it") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: configure read policy via module parameterAnand Jain1-1/+14
For testing purposes allow to configure the read policy via module parameter from the beginning. Available only with CONFIG_BTRFS_EXPERIMENTAL Examples: - Set the RAID1 balancing method to round-robin with a custom min_contig_read of 4k: $ modprobe btrfs read_policy=round-robin:4096 - Set the round-robin balancing method with the default min_contiguous_read: $ modprobe btrfs read_policy=round-robin - Set the "devid" balancing method, defaulting to the latest device: $ modprobe btrfs read_policy=devid Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add read policy to set a preferred deviceAnand Jain1-0/+17
Add read policy that will force all reads to go to the given device (specified by devid) on the RAID1 profiles. This will be used for testing, e.g. to read from stale device. Users may find other use cases. Can be set in sysfs, the value format is "devid:<devid>" to the file /sys/fs/btrfs/FSID/read_policy Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: introduce RAID1 round-robin read balancingAnand Jain1-0/+65
Add round-robin read policy that balances reads over available devices (all RAID1 block group profiles). Switch to the next devices is done after a number of blocks is read, which is 256K by default and is configurable in sysfs. The format is "round-robin:<min-contig-read>" and can be set in file /sys/fs/btrfs/FSID/read_policy Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: initialize fs_devices->fs_info earlier in btrfs_init_devices_late()Anand Jain1-2/+0
Currently, fs_devices->fs_info is initialized in btrfs_init_devices_late(), but this occurs too late for find_live_mirror(), which is invoked by load_super_root() much earlier than btrfs_init_devices_late(). Fix this by moving the initialization to open_ctree(), before load_super_root(). Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: volumes: remove unnecessary calls to btrfs_mark_buffer_dirty()Filipe Manana1-11/+1
We have several places explicitly calling btrfs_mark_buffer_dirty() but that is not necessarily since the target leaf came from a path that was obtained for a btree search function that modifies the btree, something like btrfs_insert_empty_item() or anything else that ends up calling btrfs_search_slot() with a value of 1 for its 'cow' argument. These just make the code more verbose, confusing and add a little extra overhead and well as increase the module's text size, so remove them. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: validate system chunk array at btrfs_validate_super()Qu Wenruo1-60/+13
Currently btrfs_validate_super() only does a very basic check on the array chunk size (not too large than the available space, but not too small to contain no chunk). The more comprehensive checks (the regular chunk checks and size check inside the system chunk array) are all done inside btrfs_read_sys_array(). It's not a big deal, but it also means we do not do any validation on the system chunk array at super block writeback time either. Do the following modification to centralize the system chunk array checks into btrfs_validate_super(): - Make chunk_err() helper accept stack chunk pointer If @leaf parameter is NULL, then the @chunk pointer will be a pointer to the chunk item, other than the offset inside the leaf. And since @leaf can be NULL, add a new @fs_info parameter for that case. - Make btrfs_check_chunk_valid() handle stack chunk pointer The same as chunk_err(), a new @fs_info parameter, and if @leaf is NULL, then @chunk will be a pointer to a stack chunk. If @chunk is NULL, then all needed btrfs_chunk members will be read using the stack helper instead of the leaf helper. This means we need to read out all the needed member at the beginning of the function. Furthermore, at super block read time, fs_info->sectorsize is not yet initialized, we need one extra @sectorsize parameter to grab the correct sectorsize. - Introduce a helper validate_sys_chunk_array() * Validate the disk key. * Validate the size before we access the full chunk items. * Do the full chunk item validation. - Call validate_sys_chunk_array() at btrfs_validate_super() - Simplify the checks inside btrfs_read_sys_array() Now the checks will be converted to an ASSERT(). - Simplify the checks inside read_one_chunk() Now that all chunk items inside system chunk array and chunk tree are verified, there is no need to verify them again inside read_one_chunk(). This change has the following advantages: - More comprehensive checks at write time And unlike the sys_chunk_array read routine, this time we do not need to allocate a dummy extent buffer to do the check. All the checks done here require no new memory allocation. - Slightly improved readability when iterating the system chunk array Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: update btrfs_add_chunk_map() to use rb helpersRoger L. Beckermeyer III1-21/+22
Update btrfs_add_chunk_map() to use rb_find_add_cached(). Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move extent-tree function declarations out of ctree.hFilipe Manana1-1/+1
We have 3 functions that have their prototypes declared in ctree.h but they are defined at extent-tree.c and they are unrelated to the btree data structure. Move the prototypes out of ctree.h and into extent-tree.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: pass btrfs_io_geometry to is_single_device_ioJohannes Thumshirn1-5/+4
Now that we have the stripe tree decision saved in struct btrfs_io_geometry we can pass it into is_single_device_io() and get rid of another call to btrfs_need_raid_stripe_tree_update(). Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: cache RAID stripe tree decision in btrfs_io_contextJohannes Thumshirn1-0/+1
Cache the decision if a particular I/O needs to update RAID stripe tree entries in struct btrfs_io_context. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: cache stripe tree usage in struct btrfs_io_geometryJohannes Thumshirn1-2/+3
Cache the return of btrfs_need_stripe_tree_update() in struct btrfs_io_geometry starting from btrfs_map_block(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: do not clear read-only when adding sprout deviceBoris Burkov1-4/+0
If you follow the seed/sprout wiki, it suggests the following workflow: btrfstune -S 1 seed_dev mount seed_dev mnt btrfs device add sprout_dev mount -o remount,rw mnt The first mount mounts the FS readonly, which results in not setting BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add somewhat surprisingly clears the readonly bit on the sb (though the mount is still practically readonly, from the users perspective...). Finally, the remount checks the readonly bit on the sb against the flag and sees no change, so it does not run the code intended to run on ro->rw transitions, leaving BTRFS_FS_OPEN unset. As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and does no work. This results in leaking deleted snapshots until we run out of space. I propose fixing it at the first departure from what feels reasonable: when we clear the readonly bit on the sb during device add. A new fstest I have written reproduces the bug and confirms the fix. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: remove unused btrfs_is_parity_mirror()Dr. David Alan Gilbert1-18/+0
btrfs_is_parity_mirror() has been unused since commit 4886ff7b50f6 ("btrfs: introduce a new helper to submit write bio for repair"). Remove it as the code was refactored and we don't need the helper anymore. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: tests: add selftests for raid-stripe-treeJohannes Thumshirn1-3/+3
Add first stash of very basic self tests for the RAID stripe-tree. More test cases will follow exercising the tree. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: correct typos in multiple comments across various filesShen Lichuan1-1/+1
Fix some confusing spelling errors that were currently identified, the details are as follows: block-group.c: 2800: uncompressible ==> incompressible extent-tree.c: 3131: EXTEMT ==> EXTENT extent_io.c: 3124: utlizing ==> utilizing extent_map.c: 1323: ealier ==> earlier extent_map.c: 1325: possiblity ==> possibility fiemap.c: 189: emmitted ==> emitted fiemap.c: 197: emmitted ==> emitted fiemap.c: 203: emmitted ==> emitted transaction.h: 36: trasaction ==> transaction volumes.c: 5312: filesysmte ==> filesystem zoned.c: 1977: trasnsaction ==> transaction Signed-off-by: Shen Lichuan <shenlichuan@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: canonicalize the device path before adding itQu Wenruo1-1/+86
[PROBLEM] Currently btrfs accepts any file path for its device, resulting some weird situation: # ./mount_by_fd /dev/test/scratch1 /mnt/btrfs/ The program has the following source code: #include <fcntl.h> #include <stdio.h> #include <sys/mount.h> int main(int argc, char *argv[]) { int fd = open(argv[1], O_RDWR); char path[256]; snprintf(path, sizeof(path), "/proc/self/fd/%d", fd); return mount(path, argv[2], "btrfs", 0, NULL); } Then we can have the following weird device path: BTRFS: device fsid 2378be81-fe12-46d2-a9e8-68cf08dd98d5 devid 1 transid 7 /proc/self/fd/3 (253:2) scanned by mount_by_fd (18440) Normally it's not a big deal, and later udev can trigger a device path rename. But if udev didn't trigger, the device path "/proc/self/fd/3" will show up in mtab. [CAUSE] For filename "/proc/self/fd/3", it means the opened file descriptor 3. In above case, it's exactly the device we want to open, aka points to "/dev/test/scratch1" which is another symlink pointing to "/dev/dm-2". Inside kernel we solve the mount source using LOOKUP_FOLLOW, which follows the symbolic link and grab the proper block device. But inside btrfs we also save the filename into btrfs_device::name, and utilize that member to report our mount source, which leads to the above situation. [FIX] Instead of unconditionally trust the path, check if the original file (not following the symbolic link) is inside "/dev/", if not, then manually lookup the path to its final destination, and use that as our device path. This allows us to still use symbolic links, like "/dev/mapper/test-scratch" from LVM2, which is required for fstests runs with LVM2 setup. And for really weird names, like the above case, we solve it to "/dev/dm-2" instead. Reviewed-by: Filipe Manana <fdmanana@suse.com> Link: https://bugzilla.suse.com/show_bug.cgi?id=1230641 Reported-by: Fabian Vogt <fvogt@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: avoid unnecessary device path update for the same deviceQu Wenruo1-1/+37
[PROBLEM] It is very common for udev to trigger device scan, and every time a mounted btrfs device got re-scan from different soft links, we will get some of unnecessary device path updates, this is especially common for LVM based storage: # lvs scratch1 test -wi-ao---- 10.00g scratch2 test -wi-a----- 10.00g scratch3 test -wi-a----- 10.00g scratch4 test -wi-a----- 10.00g scratch5 test -wi-a----- 10.00g test test -wi-a----- 10.00g # mkfs.btrfs -f /dev/test/scratch1 # mount /dev/test/scratch1 /mnt/btrfs # dmesg -c [ 205.705234] BTRFS: device fsid 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9 devid 1 transid 6 /dev/mapper/test-scratch1 (253:4) scanned by mount (1154) [ 205.710864] BTRFS info (device dm-4): first mount of filesystem 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9 [ 205.711923] BTRFS info (device dm-4): using crc32c (crc32c-intel) checksum algorithm [ 205.713856] BTRFS info (device dm-4): using free-space-tree [ 205.722324] BTRFS info (device dm-4): checking UUID tree So far so good, but even if we just touched any soft link of "dm-4", we will get quite some unnecessary device path updates. # touch /dev/mapper/test-scratch1 # dmesg -c [ 469.295796] BTRFS info: devid 1 device path /dev/mapper/test-scratch1 changed to /dev/dm-4 scanned by (udev-worker) (1221) [ 469.300494] BTRFS info: devid 1 device path /dev/dm-4 changed to /dev/mapper/test-scratch1 scanned by (udev-worker) (1221) Such device path rename is unnecessary and can lead to random path change due to the udev race. [CAUSE] Inside device_list_add(), we are using a very primitive way checking if the device has changed, strcmp(). Which can never handle links well, no matter if it's hard or soft links. So every different link of the same device will be treated as a different device, causing the unnecessary device path update. [FIX] Introduce a helper, is_same_device(), and use path_equal() to properly detect the same block device. So that the different soft links won't trigger the rename race. Reviewed-by: Filipe Manana <fdmanana@suse.com> Link: https://bugzilla.suse.com/show_bug.cgi?id=1230641 Reported-by: Fabian Vogt <fvogt@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: don't take dev_replace rwsem on task already holding itJohannes Thumshirn1-3/+5
Running fstests btrfs/011 with MKFS_OPTIONS="-O rst" to force the usage of the RAID stripe-tree, we get the following splat from lockdep: BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started ============================================ WARNING: possible recursive locking detected 6.11.0-rc3-btrfs-for-next #599 Not tainted -------------------------------------------- btrfs/2326 is trying to acquire lock: ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250 but task is already holding lock: ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&fs_info->dev_replace.rwsem); lock(&fs_info->dev_replace.rwsem); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by btrfs/2326: #0: ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250 stack backtrace: CPU: 1 UID: 0 PID: 2326 Comm: btrfs Not tainted 6.11.0-rc3-btrfs-for-next #599 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0x5b/0x80 __lock_acquire+0x2798/0x69d0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx___lock_acquire+0x10/0x10 lock_acquire+0x19d/0x4a0 ? btrfs_map_block+0x39f/0x2250 ? __pfx_lock_acquire+0x10/0x10 ? find_held_lock+0x2d/0x110 ? lock_is_held_type+0x8f/0x100 down_read+0x8e/0x440 ? btrfs_map_block+0x39f/0x2250 ? __pfx_down_read+0x10/0x10 ? do_raw_read_unlock+0x44/0x70 ? _raw_read_unlock+0x23/0x40 btrfs_map_block+0x39f/0x2250 ? btrfs_dev_replace_by_ioctl+0xd69/0x1d00 ? btrfs_bio_counter_inc_blocked+0xd9/0x2e0 ? __kasan_slab_alloc+0x6e/0x70 ? __pfx_btrfs_map_block+0x10/0x10 ? __pfx_btrfs_bio_counter_inc_blocked+0x10/0x10 ? kmem_cache_alloc_noprof+0x1f2/0x300 ? mempool_alloc_noprof+0xed/0x2b0 btrfs_submit_chunk+0x28d/0x17e0 ? __pfx_btrfs_submit_chunk+0x10/0x10 ? bvec_alloc+0xd7/0x1b0 ? bio_add_folio+0x171/0x270 ? __pfx_bio_add_folio+0x10/0x10 ? __kasan_check_read+0x20/0x20 btrfs_submit_bio+0x37/0x80 read_extent_buffer_pages+0x3df/0x6c0 btrfs_read_extent_buffer+0x13e/0x5f0 read_tree_block+0x81/0xe0 read_block_for_search+0x4bd/0x7a0 ? __pfx_read_block_for_search+0x10/0x10 btrfs_search_slot+0x78d/0x2720 ? __pfx_btrfs_search_slot+0x10/0x10 ? lock_is_held_type+0x8f/0x100 ? kasan_save_track+0x14/0x30 ? __kasan_slab_alloc+0x6e/0x70 ? kmem_cache_alloc_noprof+0x1f2/0x300 btrfs_get_raid_extent_offset+0x181/0x820 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_btrfs_get_raid_extent_offset+0x10/0x10 ? down_read+0x194/0x440 ? __pfx_down_read+0x10/0x10 ? do_raw_read_unlock+0x44/0x70 ? _raw_read_unlock+0x23/0x40 btrfs_map_block+0x5b5/0x2250 ? __pfx_btrfs_map_block+0x10/0x10 scrub_submit_initial_read+0x8fe/0x11b0 ? __pfx_scrub_submit_initial_read+0x10/0x10 submit_initial_group_read+0x161/0x3a0 ? lock_release+0x20e/0x710 ? __pfx_submit_initial_group_read+0x10/0x10 ? __pfx_lock_release+0x10/0x10 scrub_simple_mirror.isra.0+0x3eb/0x580 scrub_stripe+0xe4d/0x1440 ? lock_release+0x20e/0x710 ? __pfx_scrub_stripe+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? do_raw_read_unlock+0x44/0x70 ? _raw_read_unlock+0x23/0x40 scrub_chunk+0x257/0x4a0 scrub_enumerate_chunks+0x64c/0xf70 ? __mutex_unlock_slowpath+0x147/0x5f0 ? __pfx_scrub_enumerate_chunks+0x10/0x10 ? bit_wait_timeout+0xb0/0x170 ? __up_read+0x189/0x700 ? scrub_workers_get+0x231/0x300 ? up_write+0x490/0x4f0 btrfs_scrub_dev+0x52e/0xcd0 ? create_pending_snapshots+0x230/0x250 ? __pfx_btrfs_scrub_dev+0x10/0x10 btrfs_dev_replace_by_ioctl+0xd69/0x1d00 ? lock_acquire+0x19d/0x4a0 ? __pfx_btrfs_dev_replace_by_ioctl+0x10/0x10 ? lock_release+0x20e/0x710 ? btrfs_ioctl+0xa09/0x74f0 ? __pfx_lock_release+0x10/0x10 ? do_raw_spin_lock+0x11e/0x240 ? __pfx_do_raw_spin_lock+0x10/0x10 btrfs_ioctl+0xa14/0x74f0 ? lock_acquire+0x19d/0x4a0 ? find_held_lock+0x2d/0x110 ? __pfx_btrfs_ioctl+0x10/0x10 ? lock_release+0x20e/0x710 ? do_sigaction+0x3f0/0x860 ? __pfx_do_vfs_ioctl+0x10/0x10 ? do_raw_spin_lock+0x11e/0x240 ? lockdep_hardirqs_on_prepare+0x270/0x3e0 ? _raw_spin_unlock_irq+0x28/0x50 ? do_sigaction+0x3f0/0x860 ? __pfx_do_sigaction+0x10/0x10 ? __x64_sys_rt_sigaction+0x18e/0x1e0 ? __pfx___x64_sys_rt_sigaction+0x10/0x10 ? __x64_sys_close+0x7c/0xd0 __x64_sys_ioctl+0x137/0x190 do_syscall_64+0x71/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f0bd1114f9b Code: Unable to access opcode bytes at 0x7f0bd1114f71. RSP: 002b:00007ffc8a8c3130 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f0bd1114f9b RDX: 00007ffc8a8c35e0 RSI: 00000000ca289435 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007 R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffc8a8c6c85 R13: 00000000398e72a0 R14: 0000000000004361 R15: 0000000000000004 </TASK> This happens because on RAID stripe-tree filesystems we recurse back into btrfs_map_block() on scrub to perform the logical to device physical mapping. But as the device replace task is already holding the dev_replace::rwsem we deadlock. So don't take the dev_replace::rwsem in case our task is the task performing the device replace. Suggested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-29btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids()Zhihao Cheng1-0/+1
Mounting btrfs from two images (which have the same one fsid and two different dev_uuids) in certain executing order may trigger an UAF for variable 'device->bdev_file' in __btrfs_free_extra_devids(). And following are the details: 1. Attach image_1 to loop0, attach image_2 to loop1, and scan btrfs devices by ioctl(BTRFS_IOC_SCAN_DEV): / btrfs_device_1 → loop0 fs_device \ btrfs_device_2 → loop1 2. mount /dev/loop0 /mnt btrfs_open_devices btrfs_device_1->bdev_file = btrfs_get_bdev_and_sb(loop0) btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1) btrfs_fill_super open_ctree fail: btrfs_close_devices // -ENOMEM btrfs_close_bdev(btrfs_device_1) fput(btrfs_device_1->bdev_file) // btrfs_device_1->bdev_file is freed btrfs_close_bdev(btrfs_device_2) fput(btrfs_device_2->bdev_file) 3. mount /dev/loop1 /mnt btrfs_open_devices btrfs_get_bdev_and_sb(&bdev_file) // EIO, btrfs_device_1->bdev_file is not assigned, // which points to a freed memory area btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1) btrfs_fill_super open_ctree btrfs_free_extra_devids if (btrfs_device_1->bdev_file) fput(btrfs_device_1->bdev_file) // UAF ! Fix it by setting 'device->bdev_file' as 'NULL' after closing the btrfs_device in btrfs_close_one_device(). Fixes: 142388194191 ("btrfs: do not background blkdev_put()") CC: stable@vger.kernel.org # 4.19+ Link: https://bugzilla.kernel.org/show_bug.cgi?id=219408 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: reduce chunk_map lookups in btrfs_map_block()Johannes Thumshirn1-22/+27
Currently we're calling btrfs_num_copies() before btrfs_get_chunk_map() in btrfs_map_block(). But btrfs_num_copies() itself does a chunk map lookup to be able to calculate the number of copies. So split out the code getting the number of copies from btrfs_num_copies() into a helper called btrfs_chunk_map_num_copies() and directly call it from btrfs_map_block() and btrfs_num_copies(). This saves us one rbtree lookup per btrfs_map_block() invocation. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: print message on device opening error during mountLi Zhang1-0/+2
[ENHANCEMENT] When mounting a btrfs filesystem, the filesystem opens the block device, and if this fails, there is no message about it. Print a message about it to help debugging. [TEST] I have a btrfs filesystem on three block devices, one of which is write-protected, so regular mounts fail, but there is no message in dmesg. /dev/vdb normal /dev/vdc write protected /dev/vdd normal Before patch: $ sudo mount /dev/vdb /mnt/ mount: mount(2) failed: no such file or directory $ sudo dmesg # Show only messages about missing block devices .... [ 352.947196] BTRFS error (device vdb): devid 2 uuid 4ee2c625-a3b2-4fe0-b411-756b23e08533 missing .... After patch: $ sudo mount /dev/vdb /mnt/ mount: mount(2) failed: no such file or directory $ sudo dmesg # Show bdev_file_open_by_path failed. .... [ 352.944328] BTRFS error: failed to open device for path /dev/vdc with flags 0x3: -13 [ 352.947196] BTRFS error (device vdb): missing devid 2 uuid 4ee2c625-a3b2-4fe0-b411-756b23e08533 .... Signed-off-by: Li Zhang <zhanglikernel@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: move uuid tree related code to uuid-tree.[ch]Qu Wenruo1-177/+0
Functions btrfs_uuid_scan_kthread() and btrfs_create_uuid_tree() are for UUID tree rescan and creation, it's not suitable for volumes.[ch]. Move them to uuid-tree.[ch] instead. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: abort transaction on errors in btrfs_free_chunk()David Sterba1-6/+9
The errors during removing a chunk item are fatal, we expect to have a matching item in the chunk map from which the chunk_offset is taken. Handle that by transaction abort. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: constify pointer parameters where applicableDavid Sterba1-1/+1
We can add const to many parameters, this is for clarity and minor addition to safety. There are some minor effects, in the assembly code and .ko measured on release config. This patch does not cover all possible conversions. Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use for-local variables that shadow function variablesDavid Sterba1-6/+3
We've started to use for-loop local variables and in a few places this shadows a function variable. Convert a few cases reported by 'make W=2'. If applicable also change the style to post-increment, that's the preferred one. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: drop bytenr_orig and fix comment in btrfs_scan_one_device()Anand Jain1-10/+8
Drop the single-use variable bytenr_orig and instead use btrfs_sb_offset() in the function argument passing. Fix a stale comment about not automatically fixing a bad primary superblock from the backup mirror copies. Also, move the comment closer to where the primary superblock read occurs. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: pass struct btrfs_io_geometry into handle_ops_on_dev_replace()Johannes Thumshirn1-10/+8
Passing in a 'struct btrfs_io_geometry into handle_ops_on_dev_replace can reduce the number of arguments by two. No functional changes otherwise. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>