path: root/fs/btrfs/inode.c
Age | Commit message | Author | Files | Lines
2024-05-15 | Merge tag 'for-6.10-tag' of ↵ | Linus Torvalds | 1 | -511/+412
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
"This update brings a few minor performance improvements, otherwise there's a lot of refactoring, cleanups and other sort of not user visible changes.

Performance improvements:
- inline b-tree locking functions, improvement in metadata-heavy changes
- relax locking on a range that's being reflinked, allows read operations to run in parallel
- speed up NOCOW write checks (throughput +9% on a sample test)
- extent locking ranges have been reduced in several places, namely around delayed ref processing

Core:
- more page to folio conversions:
  - relocation
  - send
  - compression
  - inline extent handling
  - super block write and wait
- extent_map structure optimizations:
  - reduced structure size
  - code simplifications
  - add shrinker for allocated objects, the numbers can go high and could exhaust memory on smaller systems (reported) as they may not get an opportunity to be freed fast enough
- extent locking optimizations:
  - reduce locking ranges where it does not seem to be necessary and are safe due to other means of synchronization
  - potential improvements due to lower contention, allocation/freeing and state management operations of extent state tracking structures
- delayed ref cleanups and simplifications
- updated trace points
- improved error handling, warnings and assertions
- cleanups and refactoring, unification of error handling paths"

* tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (122 commits)
  btrfs: qgroup: fix initialization of auto inherit array
  btrfs: count super block write errors in device instead of tracking folio error state
  btrfs: use the folio iterator in btrfs_end_super_write()
  btrfs: convert super block writes to folio in write_dev_supers()
  btrfs: convert super block writes to folio in wait_dev_supers()
  bio: Export bio_add_folio_nofail to modules
  btrfs: remove duplicate included header from fs.h
  btrfs: add a cached state to extent_clear_unlock_delalloc
  btrfs: push extent lock down in submit_one_async_extent
  btrfs: push lock_extent down in cow_file_range()
  btrfs: move can_cow_file_range_inline() outside of the extent lock
  btrfs: push lock_extent into cow_file_range_inline
  btrfs: push extent lock into cow_file_range
  btrfs: push extent lock into run_delalloc_cow
  btrfs: remove unlock_extent from run_delalloc_compressed
  btrfs: push extent lock down in run_delalloc_nocow
  btrfs: adjust while loop condition in run_delalloc_nocow
  btrfs: push extent lock into run_delalloc_nocow
  btrfs: push the extent lock into btrfs_run_delalloc_range
  btrfs: lock extent when doing inline extent in compression
  ...
2024-05-13 | Merge tag 'vfs-6.10.misc' of ↵ | Linus Torvalds | 1 | -0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses.

Features:
- Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That means we now have six free FMODE_* bits in total (but bit #6 already got used for FMODE_WRITE_RESTRICTED)
- Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)
- Add fd_raw cleanup class so we can make use of automatic cleanup provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well
- Optimize seq_puts()
- Simplify __seq_puts()
- Add new anon_inode_getfile_fmode() api to allow specifying f_mode instead of open-coding it in multiple places
- Annotate struct file_handle with __counted_by() and use struct_size()
- Warn in get_file() whether f_count resurrection from zero is attempted (epoll/drm discussion)
- Folio-sophize aio
- Export the subvolume id in statx() for both btrfs and bcachefs
- Relax linkat(AT_EMPTY_PATH) requirements
- Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors for dup*() equality replacing kcmp()

Cleanups:
- Compile out swapfile inode checks when swap isn't enabled
- Use (1 << n) notation for FMODE_* bitshifts for clarity
- Remove redundant variable assignment in fs/direct-io
- Cleanup uses of strncpy in orangefs
- Speed up and cleanup writeback
- Move fsparam_string_empty() helper into header since it's currently open-coded in multiple places
- Add kernel-doc comments to proc_create_net_data_write()
- Don't needlessly read dentry->d_flags twice

Fixes:
- Fix out-of-range warning in nilfs2
- Fix ecryptfs overflow due to wrong encryption packet size calculation
- Fix overly long line in xfs file_operations (follow-up to FMODE_* cleanup)
- Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up to FMODE_* cleanup)
- Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_* cleanup)
- Fix stable offset api to prevent endless loops
- Fix afs file server rotations
- Prevent xattr node from overflowing the eraseblock in jffs2
- Move fdinfo PTRACE_MODE_READ procfs check into the .permission() operation instead of .open() operation since this caused userspace regressions"

* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
  afs: Fix fileserver rotation getting stuck
  selftests: add F_DUPDFD_QUERY selftests
  fcntl: add F_DUPFD_QUERY fcntl()
  file: add fd_raw cleanup class
  fs: WARN when f_count resurrection is attempted
  seq_file: Simplify __seq_puts()
  seq_file: Optimize seq_puts()
  proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
  fs: Create anon_inode_getfile_fmode()
  xfs: don't call xfs_file_open from xfs_dir_open
  xfs: drop fop_flags for directories
  xfs: fix overly long line in the file_operations
  shmem: Fix shmem_rename2()
  libfs: Add simple_offset_rename() API
  libfs: Fix simple_offset_rename_exchange()
  jffs2: prevent xattr node from overflowing the eraseblock
  vfs, swap: compile out IS_SWAPFILE() on swapless configs
  vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
  fs/direct-io: remove redundant assignment to variable retval
  fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
  ...
2024-05-07 | btrfs: add a cached state to extent_clear_unlock_delalloc | Josef Bacik | 1 | -18/+24
Now that we have the lock_extent tightly coupled with extent_clear_unlock_delalloc we can add a cached state to extent_clear_unlock_delalloc and benefit from skipping the extra lookup when we're doing cow. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: push extent lock down in submit_one_async_extent | Josef Bacik | 1 | -1/+2
We don't need to include the time we spend in the allocator under our extent lock protection, move it after the allocator and make sure we lock the extent in the error case to ensure we're not clearing these bits without the extent lock held. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: push lock_extent down in cow_file_range() | Josef Bacik | 1 | -2/+14
Now that we've got the extent lock pushed into cow_file_range() we can push it further down into the allocation loop. This allows us to only hold the extent lock during the dropping of the extent map range and inserting the ordered extent. This makes the error case a little trickier as we'll now have to lock the range before clearing any of the other extent bits for the range, but this is the error path so is less performance critical. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: move can_cow_file_range_inline() outside of the extent lock | Josef Bacik | 1 | -4/+8
These checks aren't reliant on the extent lock. Move this up into cow_file_range_inline(), and then update encoded writes to call this check before calling __cow_file_range_inline(). This will allow us to skip the extent lock if we're not able to inline the given extent. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: push lock_extent into cow_file_range_inline | Josef Bacik | 1 | -5/+8
Now that we've pushed the lock_extent() into cow_file_range() we can push the extent locking into cow_file_range_inline() and move the lock_extent in cow_file_range() to after we call cow_file_range_inline(). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: push extent lock into cow_file_range | Josef Bacik | 1 | -8/+7
Now that cow_file_range is the only function that is called with the range locked, push this call into cow_file_range so we can further narrow the scope. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: push extent lock into run_delalloc_cow | Josef Bacik | 1 | -8/+7
This is used by zoned but also as the fallback for uncompressed extents when we fail to compress the ranges. Push the extent lock into run_delalloc_cow(), and adjust the compression case to take the extent lock after calling run_delalloc_cow(). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: remove unlock_extent from run_delalloc_compressed | Josef Bacik | 1 | -6/+5
Since we immediately unlock the extent range when we enter run_delalloc_compressed() simply move the lock_extent() down to cover cow_file_range() and then remove the unlock_extent() from run_delalloc_compressed. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: push extent lock down in run_delalloc_nocow | Josef Bacik | 1 | -3/+18
run_delalloc_nocow is a little special because we use the file extents to see if we can nocow a range. We don't actually need the protection of the extent lock to look at the file extents at this point however. We are currently holding the page lock for this range, so we are protected from anybody who would simultaneously be modifying the file extent items for this range.

* mmap() - we're holding the page lock.
* buffered writes - we're holding the page lock.
* direct writes - we're holding the page lock and direct IO has to flush page cache before it's able to continue.
* fallocate() - all callers flush the range and wait on ordered extents while holding the inode lock and the mmap lock, so we are again saved by the page lock.

We want to use the extent lock to protect

1) The mapping tree for the given range.
2) The ordered extents for the given range.
3) The io_tree for the given range.

Push the extent lock down to cover these operations. In the fallback_to_cow() case we simply lock before doing anything and rely on the cow_file_range() helper to handle its range properly.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: adjust while loop condition in run_delalloc_nocow | Josef Bacik | 1 | -3/+1
We have the following pattern

    while (1) {
        if (cur_offset > end)
            break;
    }

Which is just

    while (cur_offset <= end) {
        ...
    }

so adjust the code to be more clear.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
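The equivalence claimed in the commit above can be checked with a small stand-alone sketch. The function names and the `step` parameter are hypothetical and only serve the illustration; they are not taken from run_delalloc_nocow().

```c
#include <assert.h>
#include <stdint.h>

/* Old shape: endless loop with the exit test as its first statement. */
unsigned int sketch_steps_old(uint64_t cur_offset, uint64_t end, uint64_t step)
{
    unsigned int n = 0;

    assert(step > 0); /* a zero step would never terminate */
    while (1) {
        if (cur_offset > end)
            break;
        n++;
        cur_offset += step;
    }
    return n;
}

/* New shape: the same exit test expressed as the loop condition. */
unsigned int sketch_steps_new(uint64_t cur_offset, uint64_t end, uint64_t step)
{
    unsigned int n = 0;

    assert(step > 0);
    while (cur_offset <= end) {
        n++;
        cur_offset += step;
    }
    return n;
}
```

Both functions visit the same offsets because `end` is treated as inclusive in either form.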
2024-05-07 | btrfs: push extent lock into run_delalloc_nocow | Josef Bacik | 1 | -5/+7
run_delalloc_nocow is a bit special as it walks through the file extents for the inode and determines what it can nocow and what it can't. This is the more complicated area for extent locking, so start with this function. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: push the extent lock into btrfs_run_delalloc_range | Josef Bacik | 1 | -0/+5
We want to limit the scope of the extent lock to be around operations that can change in flight. Currently we hold the extent lock through the entire writepage operation, which isn't really necessary.

We want to protect to make sure nobody has updated DELALLOC. In find_lock_delalloc_range we must lock the range in order to validate the contents of our io_tree. However once we've done that we're safe to unlock the range and continue, as we have the page lock already held for the range.

We are protected from all operations at this point.

* mmap() - we're holding the page lock, thus are protected.
* buffered writes - again, we're protected because we take the page lock for the first and last page in our range for buffered writes so we won't create new delalloc ranges in this area.
* direct IO - we invalidate pagecache before attempting to write a new area, which requires the page lock, so again are protected once we're holding the page lock on this range.

Additionally this behavior actually already exists for compressed, we unlock the range as soon as we start to process the async extents, and re-lock it during compression. So this is completely safe, and makes the locking more consistent.

Make this simple by just pushing the extent lock into btrfs_run_delalloc_range. From there followup patches will push the lock further down into its users.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: lock extent when doing inline extent in compression | Josef Bacik | 1 | -10/+7
We currently don't lock the extent when we're doing a cow_file_range_inline() for a compressed extent. This isn't a problem necessarily, but it's inconsistent with the rest of our usage of cow_file_range_inline(). This also leads to some extra weird logic around whether the extent is locked or not. Fix this to lock the extent before calling cow_file_range_inline() in compression to make it consistent with the rest of the inline users. In future patches this will be pushed down into the cow_file_range_inline() helper, so we're fine with the quick and dirty locking here. This patch exists to make the behavior change obvious. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: move extent bit and page cleanup into cow_file_range_inline | Josef Bacik | 1 | -53/+51
We duplicate the extent cleanup for cow_file_range_inline() in the cow and compressed case. The encoded case doesn't need to do cleanup the same way, so rename cow_file_range_inline to __cow_file_range_inline and then make cow_file_range_inline handle the extent cleanup appropriately, and update the callers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: unlock all the pages with successful inline extent creation | Josef Bacik | 1 | -14/+1
Since 4750af3bbe5d ("btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig()") we have been unlocking the locked page manually instead of via extent_clear_unlock_delalloc() because of subpage blocksize support. However we actually disable inline extent creation for subpage blocksize support, so this behavior isn't necessary. Remove this code and comment, if at some point the subpage blocksize code grows support for inline extents this can be re-evaluated. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: push all inline logic into cow_file_range | Josef Bacik | 1 | -62/+81
Currently we have a lot of duplicated checks of

    if (start == 0 && fs_info->sectorsize == PAGE_SIZE)
        cow_file_range_inline();

Instead of duplicating this check everywhere, consolidate all of the inline extent logic into a helper which documents all of the checks and then use that helper inside of cow_file_range_inline(). With this we can clean up all of the calls to either unconditionally call cow_file_range_inline(), or at least reduce the checks we're doing before we call cow_file_range_inline().

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
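A minimal sketch of the consolidation described above, written as stand-alone C rather than kernel code. The `sketch_` names, the fixed page size and the return convention (0 for "inlined", 1 for "fall back to the regular path") are assumptions made for illustration; only the two conditions themselves come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for this sketch */

struct sketch_fs_info {
    uint32_t sectorsize;
};

/* Hypothetical helper: the one place that documents the preconditions. */
bool sketch_can_inline(const struct sketch_fs_info *fs_info, uint64_t start)
{
    /* An inline extent can only sit at the start of the file. */
    if (start != 0)
        return false;
    /* The duplicated check also required sectorsize == PAGE_SIZE. */
    if (fs_info->sectorsize != SKETCH_PAGE_SIZE)
        return false;
    return true;
}

/* Callers may now call this unconditionally; the checks live inside. */
int sketch_cow_file_range_inline(const struct sketch_fs_info *fs_info,
                                 uint64_t start)
{
    assert(fs_info != NULL);
    if (!sketch_can_inline(fs_info, start))
        return 1; /* caller falls back to the regular COW path */
    return 0;     /* inline extent would be created here */
}
```

Keeping the predicate separate from the action is what lets a later change run the checks before taking any lock.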
2024-05-07 | btrfs: handle errors in btrfs_reloc_clone_csums properly | Josef Bacik | 1 | -3/+2
In the cow path we will clone the reloc csums for relocated data extents, and if there's an error we already have an ordered extent and rely on the ordered extent finishing to clean everything up. There's a problem however, we don't mark the ordered extent with an error, we pretend like everything was just fine. If we were at the end of our range we won't actually bubble up this error anywhere, and we could end up inserting an extent that doesn't have csums where it should have them. Fix this by adding a helper to mark the ordered extent with an error, and then use this when we fail to lookup the csums in btrfs_reloc_clone_csums. Use this helper in the other place where we use the same pattern while we're here. This will prevent us from erroneously inserting the extent that doesn't have the required checksums. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: add extra sanity checks for create_io_em() | Qu Wenruo | 1 | -1/+39
The function create_io_em() is called before we submit an IO, to update the in-memory extent map for the involved range.

This patch changes the following aspects:

- Does not allow BTRFS_ORDERED_NOCOW type
  For real NOCOW (excluding NOCOW writes into preallocated ranges) writes, we never call create_io_em(), as we do not need to update the extent map at all.
  So remove the sanity check allowing BTRFS_ORDERED_NOCOW type.

- Add extra sanity checks
  * PREALLOC
    - @block_len == len
      For uncompressed writes.
  * REGULAR
    - @block_len == @orig_block_len == @ram_bytes == @len
      We're creating a new uncompressed extent, and referring all of it.
    - @orig_start == @start
      We have no offset inside the extent.
  * COMPRESSED
    - valid @compress_type
    - @len <= @ram_bytes
      This is to co-operate with encoded writes, which can cause a new file extent referring only part of an uncompressed extent.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
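The per-type checks listed above can be written down as a small predicate. This is a sketch under stated assumptions, not the kernel's code: the struct, the enum and the function name are hypothetical, and treating compression types 1 to 3 as the valid ones is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_em_type { SKETCH_PREALLOC, SKETCH_REGULAR, SKETCH_COMPRESSED };

struct sketch_em_args {
    uint64_t start, len, orig_start;
    uint64_t block_len, orig_block_len, ram_bytes;
    int compress_type; /* 0 = none; 1..3 assumed valid in this sketch */
};

bool sketch_io_em_args_valid(enum sketch_em_type type,
                             const struct sketch_em_args *a)
{
    assert(a != NULL);
    switch (type) {
    case SKETCH_PREALLOC:
        /* Uncompressed write into a preallocated range. */
        return a->block_len == a->len;
    case SKETCH_REGULAR:
        /* New uncompressed extent, fully referenced, no inner offset. */
        return a->block_len == a->orig_block_len &&
               a->orig_block_len == a->ram_bytes &&
               a->ram_bytes == a->len &&
               a->orig_start == a->start;
    case SKETCH_COMPRESSED:
        /* Encoded writes may reference only part of the data. */
        return a->compress_type >= 1 && a->compress_type <= 3 &&
               a->len <= a->ram_bytes;
    }
    return false;
}
```

In the kernel these conditions are assertions that fire on a violation; returning a boolean here just makes the rules easy to exercise from a test.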
2024-05-07 | btrfs: make try_release_extent_mapping() return a bool | Filipe Manana | 1 | -4/+3
Currently try_release_extent_mapping() has an int return type, but we use it as a boolean. Its only caller, the release folio callback, also returns a boolean which corresponds to try_release_extent_mapping()'s return value. So change its return value type to bool as well as its helper try_release_extent_state(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: change root->root_key.objectid to btrfs_root_id() | Josef Bacik | 1 | -32/+29
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code.

The changes were made with the following Coccinelle semantic patch:

    // <smpl>
    @@
    expression E,E1;
    @@
    (
     E->root_key.objectid = E1
    |
    - E->root_key.objectid
    + btrfs_root_id(E)
    )
    // </smpl>

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
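The effect of that semantic patch can be pictured with a tiny stand-alone example: reads go through an accessor while assignments keep using the field directly, which is why the first branch of the rule leaves `E->root_key.objectid = E1` alone. The `sketch_` types below are hypothetical stand-ins and do not reproduce the kernel's struct layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_key {
    uint64_t objectid;
};

struct sketch_root {
    struct sketch_key root_key;
};

/* Read accessor, playing the role btrfs_root_id() has in the commit. */
uint64_t sketch_root_id(const struct sketch_root *root)
{
    assert(root != NULL);
    return root->root_key.objectid;
}

/* Writes are not rewritten by the rule and still touch the field. */
void sketch_root_set_id(struct sketch_root *root, uint64_t id)
{
    assert(root != NULL);
    root->root_key.objectid = id;
}
```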
2024-05-07 | btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries() | Filipe Manana | 1 | -52/+14
Currently btrfs_prune_dentries() has open code to find the first inode in a root with a minimum inode number. Remove that code and make it use the helper btrfs_find_first_inode() for that task. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: export find_next_inode() as btrfs_find_first_inode() | Filipe Manana | 1 | -0/+59
Export the relocation private helper find_next_inode() to inode.c, as this same logic is also used at btrfs_prune_dentries() and will be used by an upcoming change that adds an extent map shrinker. The next patch will change btrfs_prune_dentries() to use this helper. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: pass an inode to btrfs_add_extent_mapping() | Filipe Manana | 1 | -1/+1
Instead of passing fs_info and extent map tree arguments to btrfs_add_extent_mapping(), we can pass an inode instead, as extent maps are always inserted in the extent map tree of an inode, and the fs_info can be extracted from the inode (inode->root->fs_info). The only exception is in the self tests where we allocate an extent map tree and then use it to insert/update/remove extent maps. However the tests can be changed to use a test inode and then use the inode's extent map tree. So change btrfs_add_extent_mapping() to have an inode as an argument instead of a fs_info and an extent map tree. This reduces the number of parameters and will also be needed for an upcoming change. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: open code csum_exist_in_range() | Filipe Manana | 1 | -12/+7
The csum_exist_in_range() function is now too trivial and is only used in one place, so open code it in its single caller. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: make NOCOW checks for existence of checksums in a range more efficient | Filipe Manana | 1 | -16/+2
Before deciding if we can do a NOCOW write into a range, one of the things we have to do is check if there are checksum items for that range. We do that through the btrfs_lookup_csums_list() function, which searches for checksums and adds them to a list supplied by the caller.

But all we need is to check if there is any checksum, we don't need to look for all of them and collect them into a list, which requires more search time in the checksums tree, allocating memory for checksums items to add to the list, copy checksums from a leaf into those list items, then free that memory, etc. This is all unnecessary overhead, wasting mostly CPU time, and perhaps some occasional IO if we need to read from disk any extent buffers.

So change btrfs_lookup_csums_list() to allow to return immediately in case it finds any checksum, without the need to add it to a list and read it from a leaf. This is accomplished by allowing a NULL list parameter and making the function return 1 if it found any checksum, 0 if it didn't find any, and a negative value in case of an error.

The following test with fio was used to measure performance:

    $ cat test.sh
    #!/bin/bash

    DEV=/dev/nullb0
    MNT=/mnt/nullb0

    cat <<EOF > /tmp/fio-job.ini
    [global]
    name=fio-rand-write
    filename=$MNT/fio-rand-write
    rw=randwrite
    bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
    direct=1
    numjobs=16
    fallocate=posix
    time_based
    runtime=300

    [file1]
    size=8G
    ioengine=io_uring
    iodepth=16
    EOF

    umount $MNT &> /dev/null
    mkfs.btrfs -f $DEV
    mount -o ssd $DEV $MNT

    fio /tmp/fio-job.ini
    umount $MNT

The test was run on a release kernel (Debian's default kernel config).

The results before this patch:

    WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec

The results after this patch:

    WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
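The calling convention described in this commit (a NULL output means "just tell me whether anything exists", with a 1 / 0 / negative return) can be sketched over a plain array. Everything below is hypothetical illustration: the real function walks the checksum tree and fills a kernel list, and `SKETCH_ERR_NOSPACE` merely stands in for whatever error the real code would return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ERR_NOSPACE (-1) /* hypothetical error code */

struct sketch_csum {
    uint64_t offset;
    uint32_t value;
};

/*
 * Look for checksum items whose offset falls in [start, end].
 * With out == NULL the search stops at the first hit, so nothing is
 * copied. Returns 1 if any item was found, 0 if none, negative on error.
 */
int sketch_lookup_csums(const struct sketch_csum *items, size_t nr,
                        uint64_t start, uint64_t end,
                        struct sketch_csum *out, size_t out_cap)
{
    size_t found = 0;

    assert(items != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        if (items[i].offset < start || items[i].offset > end)
            continue;
        if (out == NULL)
            return 1; /* existence check: skip all collection work */
        if (found == out_cap)
            return SKETCH_ERR_NOSPACE;
        out[found++] = items[i];
    }
    return found > 0 ? 1 : 0;
}
```

The early return is where the saving comes from: the existence-only caller pays for neither the copies nor the rest of the scan.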
2024-05-07 | btrfs: remove search_commit parameter from btrfs_lookup_csums_list() | Filipe Manana | 1 | -1/+1
All the callers of btrfs_lookup_csums_list() pass a value of 0 as the "search_commit" parameter. So remove it and make the function behave as to always search from the regular root. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: move btrfs_page_mkwrite() from inode.c into file.c | Filipe Manana | 1 | -167/+0
btrfs_page_mkwrite() is a struct vm_operations_struct callback and we define that structure in file.c. Currently the function is in inode.c and has to be exported to be used in file.c, which makes no sense because it's not used anywhere else. So move btrfs_page_mkwrite() from inode.c and into file.c. While at it do a few minor style changes: 1) Capitalize the first word of every comment and end each sentence with punctuation; 2) Avoid splitting some statements into two lines when everything fits in 85 characters or less. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: remove pointless return value assignment at btrfs_finish_one_ordered() | Filipe Manana | 1 | -1/+0
At btrfs_finish_one_ordered() it's pointless to assign 0 to the 'ret' variable because if it has a non-zero value (error), we have already jumped to the 'out' label. So remove that redundant assignment. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: remove not needed mod_start and mod_len from struct extent_map | Filipe Manana | 1 | -3/+1
The mod_start and mod_len fields of struct extent_map were introduced by commit 4e2f84e63dc1 ("Btrfs: improve fsync by filtering extents that we want") in order to avoid poor performance when fsyncing a file that keeps getting its extent maps merged, because it resulted in each fsync logging again csum ranges that were already merged before.

We don't need this anymore as extent maps in the list of modified extents are never merged with other extent maps and once we log an extent map we remove it from the list of modified extent maps, so it's never logged twice.

So remove the mod_start and mod_len fields from struct extent_map and use instead the start and len fields when logging checksums in the fast fsync path. This also makes EXTENT_FLAG_FILLING unused so remove it as well.

Running the reproducer from the commit mentioned before, with a larger number of extents and against a null block device, so that IO is fast and we can better see any impact from searching checksums items and logging them, gave the following results from dd:

Before this change:

    409600000 bytes (410 MB, 391 MiB) copied, 22.948 s, 17.8 MB/s

After this change:

    409600000 bytes (410 MB, 391 MiB) copied, 22.9997 s, 17.8 MB/s

So no changes in throughput. The test was done in a release kernel (non-debug, Debian's default kernel config) and its steps are the following:

    $ mkfs.btrfs -f /dev/nullb0
    $ mount /dev/sdb /mnt
    $ dd if=/dev/zero of=/mnt/foobar bs=4k count=100000 oflag=sync
    $ umount /mnt

This also reduces the size of struct extent_map from 128 bytes down to 112 bytes, so now we can have 36 extent maps per 4K page instead of 32.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
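The objects-per-page figures quoted at the end of this commit follow from integer division, as the short sketch below shows. The helper name is hypothetical, and the calculation ignores any per-slab overhead the kernel allocator may add.

```c
#include <assert.h>
#include <stddef.h>

/* How many whole objects of obj_size bytes fit into one page. */
size_t sketch_objs_per_page(size_t page_size, size_t obj_size)
{
    assert(obj_size > 0);
    return page_size / obj_size; /* integer division drops the remainder */
}
```

With a 4096-byte page, 128-byte objects give 32 per page and 112-byte objects give 36, since 36 * 112 = 4032 fits and 37 * 112 = 4144 does not.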
2024-05-07 | btrfs: compression: migrate compression/decompression paths to folios | Qu Wenruo | 1 | -56/+54
For both compression and decompression paths, we always require a "struct page **pages" and "unsigned long nr_pages", this involves a large part of the btrfs compression paths:

- All the compression entry points
- compressed_bio structure
  This affects both compression and decompression.
- async_extent structure

Unfortunately with all those involved parts, there is no good way to split the conversion into smaller patches while still compiling. So do this in one big conversion in one go.

Please note this is direct page->folio conversion, no change on the page sized folio requirement yet.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: migrate insert_inline_extent() to folio interfaces | Qu Wenruo | 1 | -8/+10
Since insert_inline_extent() now only accepts a single page, it's much easier to convert it to use folio interfaces. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: make insert_inline_extent() accept one page directly | Qu Wenruo | 1 | -22/+25
Since our inline extent cannot accept anything larger than a sector, there is really no need to pass all the compressed pages to insert_inline_extent(). And just in case, expand the ASSERT()s to make sure we only try inline with compressed size no larger than sectorsize. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: compression: convert page allocation to folio interfaces | Qu Wenruo | 1 | -2/+2
Currently we have two wrappers to allocate and free a page for compression usage:

- btrfs_alloc_compr_page()
- btrfs_free_compr_page()

The allocator would try to grab a page from the pool, and only allocate a new page if the pool is empty. The reclaimer would check if the pool is full, and if not full it would put the page into the pool.

This patch converts both helpers to use folio interfaces, allowing further conversion of the compression path to folios.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
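The pool behaviour described above (reuse a cached buffer if one exists, otherwise allocate; on free, cache the buffer unless the pool is full) looks roughly like the following user-space sketch. It is an assumption-laden illustration: the names, the fixed capacity and the use of malloc()/free() are all hypothetical, and the kernel helpers additionally deal with locking and folios.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_POOL_MAX 4 /* hypothetical pool capacity */

struct sketch_pool {
    void *slots[SKETCH_POOL_MAX];
    size_t count;
};

/* Take a cached buffer if there is one, otherwise allocate a new one. */
void *sketch_pool_alloc(struct sketch_pool *pool, size_t size)
{
    assert(pool != NULL);
    if (pool->count > 0)
        return pool->slots[--pool->count];
    return malloc(size);
}

/* Cache the buffer for reuse unless the pool is already full. */
void sketch_pool_free(struct sketch_pool *pool, void *buf)
{
    assert(pool != NULL);
    if (buf == NULL)
        return;
    if (pool->count < SKETCH_POOL_MAX) {
        pool->slots[pool->count++] = buf;
        return;
    }
    free(buf);
}
```

The sketch assumes every buffer in one pool has the same size, as pages do, so a cached buffer can satisfy any later request.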
2024-05-07 | btrfs: rename err to ret in btrfs_cont_expand() | Anand Jain | 1 | -13/+13
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: rename err to ret in btrfs_rmdir() | Anand Jain | 1 | -11/+11
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: remove pointless writepages callback wrapper | Filipe Manana | 1 | -6/+0
There's no point in having a static writepages callback in inode.c that does nothing besides calling extent_writepages from extent_io.c. So just remove the callback at inode.c and rename extent_writepages() to btrfs_writepages(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 | btrfs: remove pointless readahead callback wrapper | Filipe Manana | 1 | -5/+0
There's no point in having a static readahead callback in inode.c that does nothing besides calling extent_readahead() from extent_io.c. So just remove the callback at inode.c and rename extent_readahead() to btrfs_readahead(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-18 | btrfs: fallback if compressed IO fails for ENOSPC | Sweet Tea Dorminy | 1 | -7/+6
In commit b4ccace878f4 ("btrfs: refactor submit_compressed_extents()"), if an async extent compressed but failed to find enough space, we changed from falling back to an uncompressed write to just failing the write altogether. The principle was that if there's not enough space to write the compressed version of the data, there can't possibly be enough space to write the larger, uncompressed version of the data. However, this isn't necessarily true: due to fragmentation, there could be enough discontiguous free blocks to write the uncompressed version, but not enough contiguous free blocks to write the smaller but unsplittable compressed version. This has occurred to an internal workload which relied on write()'s return value indicating there was space. While rare, it has happened a few times. Thus, in order to prevent early ENOSPC, re-add a fallback to uncompressed writing. Fixes: b4ccace878f4 ("btrfs: refactor submit_compressed_extents()") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Co-developed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
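The fragmentation argument can be checked with a toy model, given here as a sketch under simplifying assumptions. Free space is a list of contiguous free runs measured in blocks, a compressed extent must fit in a single run, and an uncompressed write may be split across runs. The helper names and `enum write_mode` are hypothetical and do not correspond to the btrfs allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum write_mode { WRITE_COMPRESSED, WRITE_UNCOMPRESSED, WRITE_ENOSPC };

/* True if one free run alone can hold `need` blocks (unsplittable write). */
static bool fits_contiguous(const size_t *free_runs, size_t n, size_t need)
{
	for (size_t i = 0; i < n; i++)
		if (free_runs[i] >= need)
			return true;
	return false;
}

/* True if all free runs together can hold `need` blocks (splittable write). */
static bool fits_split(const size_t *free_runs, size_t n, size_t need)
{
	size_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += free_runs[i];
	return total >= need;
}

/* Try the compressed write first and fall back to the uncompressed one. */
static enum write_mode choose_write(const size_t *free_runs, size_t n,
				    size_t compressed_blocks, size_t raw_blocks)
{
	assert(free_runs != NULL || n == 0);
	if (fits_contiguous(free_runs, n, compressed_blocks))
		return WRITE_COMPRESSED;
	if (fits_split(free_runs, n, raw_blocks))
		return WRITE_UNCOMPRESSED;
	return WRITE_ENOSPC;
}
```

With three free runs of 2 blocks each, a 3-block compressed extent has no run to fit into, while the 5-block uncompressed data fits into the 6 free blocks once split. In this model that is the case where failing without a fallback would report ENOSPC too early.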
2024-04-02 | btrfs: make btrfs_clear_delalloc_extent() free delalloc reserve | Boris Burkov | 1 | -1/+1
Currently, this call site in btrfs_clear_delalloc_extent() only converts the reservation. We are marking it not delalloc, so I don't think it makes sense to keep the rsv around. This is a path where we are not sure to join a transaction, so it leads to incorrect freeing during umount. Helps with the pass rate of generic/269 and generic/475. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 | btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations | Boris Burkov | 1 | -1/+12
Create subvolume, create snapshot and delete subvolume all use btrfs_subvolume_reserve_metadata() to reserve metadata for the changes done to the parent subvolume's fs tree, which cannot be mediated in the normal way via start_transaction. When quota groups (squota or qgroups) are enabled, this reserves qgroup metadata of type PREALLOC. Once the operation is associated to a transaction, we convert PREALLOC to PERTRANS, which gets cleared in bulk at the end of the transaction.

However, the error paths of these three operations were not implementing this lifecycle correctly. They unconditionally converted the PREALLOC to PERTRANS in a generic cleanup step regardless of errors or whether the operation was fully associated to a transaction or not. This resulted in error paths occasionally converting this rsv to PERTRANS without calling record_root_in_trans successfully, which meant that unless that root got recorded in the transaction by some other thread, the end of the transaction would not free that root's PERTRANS, leaking it. Ultimately, this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount for the leaked reservation.

The fix is to ensure that every qgroup PREALLOC reservation observes the following properties:

1. any failure before record_root_in_trans is called successfully results in freeing the PREALLOC reservation.
2. after record_root_in_trans, we convert to PERTRANS, and now the transaction owns freeing the reservation.

This patch enforces those properties on the three operations. Without it, generic/269 with squotas enabled at mkfs time would fail in ~5-10 runs on my system. With this patch, it ran successfully 1000 times in a row.

Fixes: e85fde5162bf ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
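The two properties can be modelled as a small state machine. This is a sketch of the lifecycle only. `toy_rsv`, `toy_subvol_op()` and `toy_end_transaction()` are hypothetical names, and the flags simulate a failure before or after the point where the real code would have called record_root_in_trans successfully.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rsv_state { RSV_NONE, RSV_PREALLOC, RSV_PERTRANS, RSV_FREED };

struct toy_rsv {
	enum rsv_state state;
	bool root_recorded;  /* models a successful record_root_in_trans */
};

/* Model of one subvolume operation; returns 0 on success, -1 on failure. */
static int toy_subvol_op(struct toy_rsv *rsv, bool fail_before_record,
			 bool fail_after_record)
{
	assert(rsv != NULL);
	rsv->state = RSV_PREALLOC;
	rsv->root_recorded = false;

	if (fail_before_record) {
		/* Property 1: free PREALLOC, never convert it. */
		rsv->state = RSV_FREED;
		return -1;
	}

	/* Property 2: the root is recorded, so the transaction owns the rsv. */
	rsv->root_recorded = true;
	rsv->state = RSV_PERTRANS;

	return fail_after_record ? -1 : 0;
}

/* End of transaction: PERTRANS is freed only for recorded roots. */
static void toy_end_transaction(struct toy_rsv *rsv)
{
	if (rsv->state == RSV_PERTRANS && rsv->root_recorded)
		rsv->state = RSV_FREED;
}
```

In this model the leak described above corresponds to reaching `RSV_PERTRANS` while `root_recorded` is still false. `toy_subvol_op()` never produces that combination, so every path ends in `RSV_FREED`.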
2024-03-26 | statx: stx_subvol | Kent Overstreet | 1 | -0/+3
Add a new statx field for (sub)volume identifiers, as implemented by btrfs and bcachefs. This includes bcachefs support; we'll definitely want btrfs support as well. Link: https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240308022914.196982-1-kent.overstreet@linux.dev Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-05 | btrfs: remove SLAB_MEM_SPREAD flag use | Chengming Zhou | 1 | -1/+1
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change. [1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 | btrfs: merge btrfs_del_delalloc_inode() helpers | David Sterba | 1 | -9/+5
The helpers btrfs_del_delalloc_inode() and __btrfs_del_delalloc_inode() don't follow the pattern when the "__" helper does a special case and are in fact reversed regarding the naming. We can merge them into one as there's only one place that needs to be open coded. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 | btrfs: delete BUG_ON in btrfs_init_locked_inode() | David Sterba | 1 | -1/+0
The purpose of the BUG_ON is not clear. The helper btrfs_grab_root() could return a NULL in case args->root would be a NULL or if there are zero references. Then we check if the root pointer stored in the inode still exists. The whole call chain is for iget:

  btrfs_iget
    btrfs_iget_path
      btrfs_iget_locked
        iget5_locked
          btrfs_init_locked_inode

which is called from many contexts where the root pointer is used and we can safely assume it has enough references.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 | btrfs: handle invalid root reference found in may_destroy_subvol() | David Sterba | 1 | -1/+8
may_destroy_subvol() looks up a root by a key, which allows an inexact search when key->offset is -1. Such an item is never expected to be found, as it would break the allowed range of a root id. Signed-off-by: David Sterba <dsterba@suse.com>
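The inexact-search idiom can be illustrated over a sorted array. This is a sketch, not the btrfs b-tree code. `toy_key`, `toy_search()` and `toy_last_item()` are hypothetical, and the return convention (0 for an exact match, 1 plus an insertion slot otherwise) is modelled on how the search is described here. Searching with the maximum offset should never match exactly, so the caller steps back one slot and treats an exact match as an invalid item.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_key {
	uint64_t objectid;
	uint64_t offset;
};

static int key_cmp(const struct toy_key *a, const struct toy_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Returns 0 on an exact match, 1 otherwise; *slot is the match or insertion position. */
static int toy_search(const struct toy_key *items, size_t n,
		      const struct toy_key *key, size_t *slot)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int c = key_cmp(&items[mid], key);

		if (c == 0) {
			*slot = mid;
			return 0;
		}
		if (c < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	*slot = lo;
	return 1;
}

/* Returns 0 and sets *found to the last item of objectid, 1 if none, -1 if an invalid item exists. */
static int toy_last_item(const struct toy_key *items, size_t n,
			 uint64_t objectid, const struct toy_key **found)
{
	struct toy_key key = { objectid, UINT64_MAX };  /* offset of (u64)-1 */
	size_t slot;

	assert(found != NULL);
	if (toy_search(items, n, &key, &slot) == 0)
		return -1;  /* an item with offset -1 should not exist */
	if (slot == 0 || items[slot - 1].objectid != objectid)
		return 1;
	*found = &items[slot - 1];
	return 0;
}
```

Returning an error for the exact match, instead of asserting that it cannot happen, is the design choice the commit subject points at: an unexpected on-disk item becomes a handled condition.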
2024-03-04 | btrfs: push errors up from add_async_extent() | David Sterba | 1 | -5/+8
The memory allocation error in add_async_extent() is not handled properly. Return an error and push the BUG_ON to the caller. Handling it there is not trivial, so at least make it visible. Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 | btrfs: remove do_list variable at btrfs_clear_delalloc_extent() | Filipe Manana | 1 | -3/+3
The "do_list" variable has a rather confusing name, so remove it and directly use btrfs_is_free_space_inode() instead. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 | btrfs: remove do_list variable at btrfs_set_delalloc_extent() | Filipe Manana | 1 | -2/+1
The "do_list" variable is only used once, plus its name/meaning is a bit confusing, so remove it and directly use btrfs_is_free_space_inode(). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>