Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This update brings a few minor performance improvements, otherwise
there's a lot of refactoring, cleanups and other sort of not user
visible changes.
Performance improvements:
- inline b-tree locking functions, improvement in metadata-heavy
changes
- relax locking on a range that's being reflinked, allows read
operations to run in parallel
- speed up NOCOW write checks (throughput +9% on a sample test)
- extent locking ranges have been reduced in several places, namely
around delayed ref processing
Core:
- more page to folio conversions:
- relocation
- send
- compression
- inline extent handling
- super block write and wait
- extent_map structure optimizations:
- reduced structure size
- code simplifications
- add shrinker for allocated objects, the numbers can go high and
could exhaust memory on smaller systems (reported) as they may
not get an opportunity to be freed fast enough
- extent locking optimizations:
- reduce locking ranges where it does not seem to be necessary and
are safe due to other means of synchronization
- potential improvements due to lower contention,
allocation/freeing and state management operations of extent
state tracking structures
- delayed ref cleanups and simplifications
- updated trace points
- improved error handling, warnings and assertions
- cleanups and refactoring, unification of error handling paths"
* tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (122 commits)
btrfs: qgroup: fix initialization of auto inherit array
btrfs: count super block write errors in device instead of tracking folio error state
btrfs: use the folio iterator in btrfs_end_super_write()
btrfs: convert super block writes to folio in write_dev_supers()
btrfs: convert super block writes to folio in wait_dev_supers()
bio: Export bio_add_folio_nofail to modules
btrfs: remove duplicate included header from fs.h
btrfs: add a cached state to extent_clear_unlock_delalloc
btrfs: push extent lock down in submit_one_async_extent
btrfs: push lock_extent down in cow_file_range()
btrfs: move can_cow_file_range_inline() outside of the extent lock
btrfs: push lock_extent into cow_file_range_inline
btrfs: push extent lock into cow_file_range
btrfs: push extent lock into run_delalloc_cow
btrfs: remove unlock_extent from run_delalloc_compressed
btrfs: push extent lock down in run_delalloc_nocow
btrfs: adjust while loop condition in run_delalloc_nocow
btrfs: push extent lock into run_delalloc_nocow
btrfs: push the extent lock into btrfs_run_delalloc_range
btrfs: lock extent when doing inline extent in compression
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
means we now have six free FMODE_* bits in total (but bit #6
already got used for FMODE_WRITE_RESTRICTED)
- Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)
- Add fd_raw cleanup class so we can make use of automatic cleanup
provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well
- Optimize seq_puts()
- Simplify __seq_puts()
- Add new anon_inode_getfile_fmode() api to allow specifying f_mode
instead of open-coding it in multiple places
- Annotate struct file_handle with __counted_by() and use
struct_size()
- Warn in get_file() whether f_count resurrection from zero is
attempted (epoll/drm discussion)
- Folio-sophize aio
- Export the subvolume id in statx() for both btrfs and bcachefs
- Relax linkat(AT_EMPTY_PATH) requirements
- Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
for dup*() equality replacing kcmp()
Cleanups:
- Compile out swapfile inode checks when swap isn't enabled
- Use (1 << n) notation for FMODE_* bitshifts for clarity
- Remove redundant variable assignment in fs/direct-io
- Cleanup uses of strncpy in orangefs
- Speed up and cleanup writeback
- Move fsparam_string_empty() helper into header since it's currently
open-coded in multiple places
- Add kernel-doc comments to proc_create_net_data_write()
- Don't needlessly read dentry->d_flags twice
Fixes:
- Fix out-of-range warning in nilfs2
- Fix ecryptfs overflow due to wrong encryption packet size
calculation
- Fix overly long line in xfs file_operations (follow-up to FMODE_*
cleanup)
- Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
to FMODE_* cleanup)
- Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
cleanup)
- Fix stable offset api to prevent endless loops
- Fix afs file server rotations
- Prevent xattr node from overflowing the eraseblock in jffs2
- Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
operation instead of .open() operation since this caused userspace
regressions"
* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
afs: Fix fileserver rotation getting stuck
selftests: add F_DUPDFD_QUERY selftests
fcntl: add F_DUPFD_QUERY fcntl()
file: add fd_raw cleanup class
fs: WARN when f_count resurrection is attempted
seq_file: Simplify __seq_puts()
seq_file: Optimize seq_puts()
proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
fs: Create anon_inode_getfile_fmode()
xfs: don't call xfs_file_open from xfs_dir_open
xfs: drop fop_flags for directories
xfs: fix overly long line in the file_operations
shmem: Fix shmem_rename2()
libfs: Add simple_offset_rename() API
libfs: Fix simple_offset_rename_exchange()
jffs2: prevent xattr node from overflowing the eraseblock
vfs, swap: compile out IS_SWAPFILE() on swapless configs
vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
fs/direct-io: remove redundant assignment to variable retval
fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
...
|
|
Now that we have the lock_extent tightly coupled with
extent_clear_unlock_delalloc we can add a cached state to
extent_clear_unlock_delalloc and benefit from skipping the extra lookup
when we're doing cow.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We don't need to include the time we spend in the allocator under our
extent lock protection, move it after the allocator and make sure we
lock the extent in the error case to ensure we're not clearing these
bits without the extent lock held.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that we've got the extent lock pushed into cow_file_range() we can
push it further down into the allocation loop. This allows us to only
hold the extent lock during the dropping of the extent map range and
inserting the ordered extent.
This makes the error case a little trickier as we'll now have to lock
the range before clearing any of the other extent bits for the range,
but this is the error path so is less performance critical.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These checks aren't reliant on the extent lock. Move this up into
cow_file_range_inline(), and then update encoded writes to call this
check before calling __cow_file_range_inline(). This will allow us to
skip the extent lock if we're not able to inline the given extent.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that we've pushed the lock_extent() into cow_file_range() we can
push the extent locking into cow_file_range_inline() and move the
lock_extent in cow_file_range() to after we call
cow_file_range_inline().
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that cow_file_range is the only function that is called with the
range locked, push this call into cow_file_range so we can further
narrow the scope.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is used by zoned but also as the fallback for uncompressed extents
when we fail to compress the ranges. Push the extent lock into
run_dealloc_cow(), and adjust the compression case to take the extent
lock after calling run_delalloc_cow().
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since we immediately unlock the extent range when we enter
run_delalloc_compressed() simply move the lock_extent() down to cover
cow_file_range() and then remove the unlock_extent() from
run_delalloc_compressed.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
run_delalloc_nocow is a little special because we use the file extents
to see if we can nocow a range. We don't actually need the protection
of the extent lock to look at the file extents at this point however.
We are currently holding the page lock for this range, so we are
protected from anybody who would simultaneously be modifying the file
extent items for this range.
* mmap() - we're holding the page lock.
* buffered writes - we're holding the page lock.
* direct writes - we're holding the page lock and direct IO has to flush
page cache before it's able to continue.
* fallocate() - all callers flush the range and wait on ordered extents
while holding the inode lock and the mmap lock, so we are again saved
by the page lock.
We want to use the extent lock to protect
1) The mapping tree for the given range.
2) The ordered extents for the given range.
3) The io_tree for the given range.
Push the extent lock down to cover these operations. In the
fallback_to_cow() case we simply lock before doing anything and rely on
the cow_file_range() helper to handle it's range properly.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have the following pattern
while (1) {
if (cur_offset > end)
break;
}
Which is just
while (cur_offset <= end) {
...
}
so adjust the code to be more clear.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
run_delalloc_nocow is a bit special as it walks through the file extents
for the inode and determines what it can nocow and what it can't. This
is the more complicated area for extent locking, so start with this
function.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We want to limit the scope of the extent lock to be around operations
that can change in flight. Currently we hold the extent lock through
the entire writepage operation, which isn't really necessary.
We want to protect to make sure nobody has updated DELALLOC. In
find_lock_delalloc_range we must lock the range in order to validate the
contents of our io_tree. However once we've done that we're safe to
unlock the range and continue, as we have the page lock already held for
the range.
We are protected from all operations at this point.
* mmap() - we're holding the page lock, thus are protected.
* buffered writes - again, we're protected because we take the page lock
for the first and last page in our range for buffered writes so we
won't create new delalloc ranges in this area.
* direct IO - we invalidate pagecache before attempting to write a new
area, which requires the page lock, so again are protected once we're
holding the page lock on this range.
Additionally this behavior actually already exists for compressed, we
unlock the range as soon as we start to process the async extents, and
re-lock it during compression. So this is completely safe, and makes
the locking more consistent.
Make this simple by just pushing the extent lock into
btrfs_run_delalloc_range. From there followup patches will push the
lock further down into its users.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We currently don't lock the extent when we're doing a
cow_file_range_inline() for a compressed extent. This isn't a problem
necessarily, but it's inconsistent with the rest of our usage of
cow_file_range_inline(). This also leads to some extra weird logic
around whether the extent is locked or not. Fix this to lock the extent
before calling cow_file_range_inline() in compression to make it
consistent with the rest of the inline users. In future patches this
will be pushed down into the cow_file_range_inline() helper, so we're
fine with the quick and dirty locking here. This patch exists to make
the behavior change obvious.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We duplicate the extent cleanup for cow_file_range_inline() in the cow
and compressed case. The encoded case doesn't need to do cleanup the
same way, so rename cow_file_range_inline to __cow_file_range_inline and
then make cow_file_range_inline handle the extent cleanup appropriately,
and update the callers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since 4750af3bbe5d ("btrfs: prevent extent_clear_unlock_delalloc() to
unlock page not locked by __process_pages_contig()") we have been
unlocking the locked page manually instead of via
extent_clear_unlock_delalloc() because of subpage blocksize support.
However we actually disable inline extent creation for subpage blocksize
support, so this behavior isn't necessary. Remove this code and
comment, if at some point the subpage blocksize code grows support for
inline extents this can be re-evaluated.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we have a lot of duplicated checks of
if (start == 0 && fs_info->sectorsize == PAGE_SIZE)
cow_file_range_inline();
Instead of duplicating this check everywhere, consolidate all of the
inline extent logic into a helper which documents all of the checks and
then use that helper inside of cow_file_range_inline(). With this we
can clean up all of the calls to either unconditionally call
cow_file_range_inline(), or at least reduce the checks we're doing
before we call cow_file_range_inline();
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the cow path we will clone the reloc csums for relocated data
extents, and if there's an error we already have an ordered extent and
rely on the ordered extent finishing to clean everything up.
There's a problem however, we don't mark the ordered extent with an
error, we pretend like everything was just fine. If we were at the end
of our range we won't actually bubble up this error anywhere, and we
could end up inserting an extent that doesn't have csums where it should
have them.
Fix this by adding a helper to mark the ordered extent with an error,
and then use this when we fail to lookup the csums in
btrfs_reloc_clone_csums. Use this helper in the other place where we
use the same pattern while we're here.
This will prevent us from erroneously inserting the extent that doesn't
have the required checksums.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function create_io_em() is called before we submit an IO, to update
the in-memory extent map for the involved range.
This patch changes the following aspects:
- Does not allow BTRFS_ORDERED_NOCOW type
For real NOCOW (excluding NOCOW writes into preallocated ranges)
writes, we never call create_io_em(), as we does not need to update
the extent map at all.
So remove the sanity check allowing BTRFS_ORDERED_NOCOW type.
- Add extra sanity checks
* PREALLOC
- @block_len == len
For uncompressed writes.
* REGULAR
- @block_len == @orig_block_len == @ram_bytes == @len
We're creating a new uncompressed extent, and referring all of it.
- @orig_start == @start
We haven no offset inside the extent.
* COMPRESSED
- valid @compress_type
- @len <= @ram_bytes
This is to co-operate with encoded writes, which can cause a new
file extent referring only part of a uncompressed extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently try_release_extent_mapping() as an int return type, but we
use it as a boolean. Its only caller, the release folio callback, also
returns a boolean which corresponds to try_release_extent_mapping()'s
return value. So change its return value type to bool as well as its
helper try_release_extent_state().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
A comment from Filipe on one of my previous cleanups brought my
attention to a new helper we have for getting the root id of a root,
which makes it easier to read in the code.
The changes where made with the following Coccinelle semantic patch:
// <smpl>
@@
expression E,E1;
@@
(
E->root_key.objectid = E1
|
- E->root_key.objectid
+ btrfs_root_id(E)
)
// </smpl>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently btrfs_prune_dentries() has open code to find the first inode in
a root with a minimum inode number. Remove that code and make it use the
helper btrfs_find_first_inode() for that task.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Export the relocation private helper find_next_inode() to inode.c, as this
same logic is also used at btrfs_prune_dentries() and will be used by an
upcoming change that adds an extent map shrinker. The next patch will
change btrfs_prune_dentries() to use this helper.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of passing fs_info and extent map tree arguments to
btrfs_add_extent_mapping(), we can pass an inode instead, as extent maps
are always inserted in the extent map tree of an inode, and the fs_info
can be extracted from the inode (inode->root->fs_info). The only exception
is in the self tests where we allocate an extent map tree and then use it
to insert/update/remove extent maps. However the tests can be changed to
use a test inode and then use the inode's extent map tree.
So change btrfs_add_extent_mapping() to have an inode as an argument
instead of a fs_info and an extent map tree. This reduces the number of
parameters and will also be needed for an upcoming change.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The csum_exist_in_range() function is now too trivial and is only used in
one place, so open code it in its single caller.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.
But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.
So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.
The following test with fio was used to measure performance:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
direct=1
numjobs=16
fallocate=posix
time_based
runtime=300
[file1]
size=8G
ioengine=io_uring
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run on a release kernel (Debian's default kernel config).
The results before this patch:
WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec
The results after this patch:
WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All the callers of btrfs_lookup_csums_list() pass a value of 0 as the
"search_commit" parameter. So remove it and make the function behave as
to always search from the regular root.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_page_mkwrite() is a struct vm_operations_struct callback and we
define that structure in file.c. Currently the function is in inode.c and
has to be exported to be used in file.c, which makes no sense because it's
not used anywhere else. So move btrfs_page_mkwrite() from inode.c and into
file.c.
While at it do a few minor style changes:
1) Capitalize the first word of every comment and end each sentence with
punctuation;
2) Avoid splitting some statements into two lines when everything fits in
85 characters or less.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At btrfs_finish_one_ordered() it's pointless to assign 0 to the 'ret'
variable because if it has a non-zero value (error), we have already
jumped to the 'out' label. So remove that redundant assignment.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The mod_start and mod_len fields of struct extent_map were introduced by
commit 4e2f84e63dc1 ("Btrfs: improve fsync by filtering extents that we
want") in order to avoid too low performance when fsyncing a file that
keeps getting extent maps merge, because it resulted in each fsync logging
again csum ranges that were already merged before.
We don't need this anymore as extent maps in the list of modified extents
are never merged with other extent maps and once we log an extent map we
remove it from the list of modified extent maps, so it's never logged
twice.
So remove the mod_start and mod_len fields from struct extent_map and use
instead the start and len fields when logging checksums in the fast fsync
path. This also makes EXTENT_FLAG_FILLING unused so remove it as well.
Running the reproducer from the commit mentioned before, with a larger
number of extents and against a null block device, so that IO is fast
and we can better see any impact from searching checksums items and
logging them, gave the following results from dd:
Before this change:
409600000 bytes (410 MB, 391 MiB) copied, 22.948 s, 17.8 MB/s
After this change:
409600000 bytes (410 MB, 391 MiB) copied, 22.9997 s, 17.8 MB/s
So no changes in throughput.
The test was done in a release kernel (non-debug, Debian's default kernel
config) and its steps are the following:
$ mkfs.btrfs -f /dev/nullb0
$ mount /dev/sdb /mnt
$ dd if=/dev/zero of=/mnt/foobar bs=4k count=100000 oflag=sync
$ umount /mnt
This also reduces the size of struct extent_map from 128 bytes down to 112
bytes, so now we can have 36 extents maps per 4K page instead of 32.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For both compression and decompression paths, we always require a
"struct page **pages" and "unsigned long nr_pages", this involves quite
some part of the btrfs compression paths:
- All the compression entry points
- compressed_bio structure
This affects both compression and decompression.
- async_extent structure
Unfortunately with all those involved parts, there is no good way to
split the conversion into smaller patches while still passing compiling.
So do this in one big conversion in one go.
Please note this is direct page->folio conversion, no change on the page
sized folio requirement yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since insert_inline_extent() now only accepts a single page, it's much
easier to convert it to use folio interfaces.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since our inline extent cannot accept anything larger than a sector,
there is really no need to pass all the compressed pages to
insert_inline_extent().
And just in case, expand the ASSERT()s to make sure we only try inline
with compressed size no larger than sectorsize.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we have two wrappers to allocate and free a page for
compression usage:
- btrfs_alloc_compr_page()
- btrfs_free_compr_page()
The allocator would try to grab a page from the pool, and only allocate
a new page if the pool is empty.
The reclaimer would check if the pool is full, and if not full it would
put the page into the pool.
This patch converts both helpers to use folio interfaces, and allowing
further conversion of compression path to folios.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no point in having a static writepages callback in inode.c that
does nothing besides calling extent_writepages from extent_io.c.
So just remove the callback at inode.c and rename extent_writepages()
to btrfs_writepages().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no point in having a static readahead callback in inode.c that
does nothing besides calling extent_readahead() from extent_io.c.
So just remove the callback at inode.c and rename extent_readahead()
to btrfs_readahead().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In commit b4ccace878f4 ("btrfs: refactor submit_compressed_extents()"), if
an async extent compressed but failed to find enough space, we changed
from falling back to an uncompressed write to just failing the write
altogether. The principle was that if there's not enough space to write
the compressed version of the data, there can't possibly be enough space
to write the larger, uncompressed version of the data.
However, this isn't necessarily true: due to fragmentation, there could
be enough discontiguous free blocks to write the uncompressed version,
but not enough contiguous free blocks to write the smaller but
unsplittable compressed version.
This has occurred to an internal workload which relied on write()'s
return value indicating there was space. While rare, it has happened a
few times.
Thus, in order to prevent early ENOSPC, re-add a fallback to
uncompressed writing.
Fixes: b4ccace878f4 ("btrfs: refactor submit_compressed_extents()")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Co-developed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently, this call site in btrfs_clear_delalloc_extent() only converts
the reservation. We are marking it not delalloc, so I don't think it
makes sense to keep the rsv around. This is a path where we are not
sure to join a transaction, so it leads to incorrect free-ing during
umount.
Helps with the pass rate of generic/269 and generic/475.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Create subvolume, create snapshot and delete subvolume all use
btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
done to the parent subvolume's fs tree, which cannot be mediated in the
normal way via start_transaction. When quota groups (squota or qgroups)
are enabled, this reserves qgroup metadata of type PREALLOC. Once the
operation is associated to a transaction, we convert PREALLOC to
PERTRANS, which gets cleared in bulk at the end of the transaction.
However, the error paths of these three operations were not implementing
this lifecycle correctly. They unconditionally converted the PREALLOC to
PERTRANS in a generic cleanup step regardless of errors or whether the
operation was fully associated to a transaction or not. This resulted in
error paths occasionally converting this rsv to PERTRANS without calling
record_root_in_trans successfully, which meant that unless that root got
recorded in the transaction by some other thread, the end of the
transaction would not free that root's PERTRANS, leaking it. Ultimately,
this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
for the leaked reservation.
The fix is to ensure that every qgroup PREALLOC reservation observes the
following properties:
1. any failure before record_root_in_trans is called successfully
results in freeing the PREALLOC reservation.
2. after record_root_in_trans, we convert to PERTRANS, and now the
transaction owns freeing the reservation.
This patch enforces those properties on the three operations. Without
it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
runs on my system. With this patch, it ran successfully 1000 times in a
row.
Fixes: e85fde5162bf ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add a new statx field for (sub)volume identifiers, as implemented by
btrfs and bcachefs.
This includes bcachefs support; we'll definitely want btrfs support as
well.
Link: https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240308022914.196982-1-kent.overstreet@linux.dev
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The helpers btrfs_del_delalloc_inode() and __btrfs_del_delalloc_inode()
don't follow the pattern when the "__" helper does a special case and
are in fact reversed regarding the naming. We can merge them into one as
there's only one place that needs to be open coded.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The purpose of the BUG_ON is not clear. The helper btrfs_grab_root()
could return a NULL in case args->root would be a NULL or if there are
zero references. Then we check if the root pointer stored in the inode
still exists.
The whole call chain is for iget:
btrfs_iget
btrfs_iget_path
btrfs_iget_locked
iget5_locked
btrfs_init_locked_inode
which is called from many contexts where we the root pointer is used and
we can safely assume has enough references.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The may_destroy_subvol() looks up a root by a key, allowing to do an
inexact search when key->offset is -1. It's never expected to find such
item, as it would break the allowed range of a root id.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The memory allocation error in add_async_extent() is not handled
properly, return an error and push the BUG_ON to the caller. Handling it
there is not trivial so at least make it visible.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The "do_list" variable has a rather confusing name, so remove it and
directly use btrfs_is_free_space_inode() instead.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The "do_list" variable is only used once, plus its name/meaning is a bit
confusing, so remove it and directory use btrfs_is_free_space_inode().
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|