summaryrefslogtreecommitdiff
path: root/fs/btrfs/file.c
AgeCommit message (Collapse)AuthorFilesLines
2023-03-30iov_iter: add iter_iovec() helperJens Axboe1-3/+8
This returns a pointer to the current iovec entry in the iterator. Only useful with ITER_IOVEC right now, but it prepares us to treat ITER_UBUF and ITER_IOVEC identically for the first segment. Rename struct iov_iter->iov to iov_iter->__iov to find any potentially troublesome spots, and also to prevent anyone from adding new code that accesses iter->iov directly. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-13btrfs: remove the wait argument to btrfs_start_ordered_extentChristoph Hellwig1-1/+1
Given that wait is always set to 1, so remove the argument. Last use of wait with 0 was in 0c304304feab ("Btrfs: remove csum_bytes_left"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-16btrfs: fix invalid leaf access due to inline extent during lseekFilipe Manana1-3/+10
During lseek, for SEEK_DATA and SEEK_HOLE modes, we access the disk_bytenr of an extent without checking its type. However inline extents have their data starting the offset of the disk_bytenr field, so accessing that field when we have an inline extent can result in either of the following: 1) Interpret the inline extent's data as a disk_bytenr value; 2) In case the inline data is less than 8 bytes, we access part of some other item in the leaf, or unused space in the leaf; 3) In case the inline data is less than 8 bytes and the extent item is the first item in the leaf, we can access beyond the leaf's limit. So fix this by not accessing the disk_bytenr field if we have an inline extent. Fixes: b6e833567ea1 ("btrfs: make hole and data seeking a lot more efficient") Reported-by: Matthias Schoepfer <matthias.schoepfer@googlemail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216908 Link: https://lore.kernel.org/linux-btrfs/7f25442f-b121-2a3a-5a3d-22bcaae83cd4@leemhuis.info/ CC: stable@vger.kernel.org # 6.1 Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-03btrfs: fix off-by-one in delalloc search during lseekFilipe Manana1-1/+1
During lseek, when searching for delalloc in a range that represents a hole and that range has a length of 1 byte, we end up not doing the actual delalloc search in the inode's io tree, resulting in not correctly reporting the offset with data or a hole. This actually only happens when the start offset is 0 because with any other start offset we round it down by sector size. Reproducer: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -q 0 1" /mnt/sdc/foo $ xfs_io -c "seek -d 0" /mnt/sdc/foo Whence Result DATA EOF It should have reported an offset of 0 instead of EOF. Fix this by updating btrfs_find_delalloc_in_range() and count_range_bits() to deal with inclusive ranges properly. These functions are already supposed to work with inclusive end offsets, they just got it wrong in a couple places due to off-by-one mistakes. A test case for fstests will be added later. Reported-by: Joan Bruguera Micó <joanbrugueram@gmail.com> Link: https://lore.kernel.org/linux-btrfs/20221223020509.457113-1-joanbrugueram@gmail.com/ Fixes: b6e833567ea1 ("btrfs: make hole and data seeking a lot more efficient") CC: stable@vger.kernel.org # 6.1 Tested-by: Joan Bruguera Micó <joanbrugueram@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: do not BUG_ON() on ENOMEM when dropping extent items for a rangeFilipe Manana1-2/+8
If we get -ENOMEM while dropping file extent items in a given range, at btrfs_drop_extents(), due to failure to allocate memory when attempting to increment the reference count for an extent or drop the reference count, we handle it with a BUG_ON(). This is excessive, instead we can simply abort the transaction and return the error to the caller. In fact most callers of btrfs_drop_extents(), directly or indirectly, already abort the transaction if btrfs_drop_extents() returns any error. Also, we already have error paths at btrfs_drop_extents() that may return -ENOMEM and in those cases we abort the transaction, like for example anything that changes the b+tree may return -ENOMEM due to a failure to allocate a new extent buffer when COWing an existing extent buffer, such as a call to btrfs_duplicate_item() for example. So replace the BUG_ON() calls with proper logic to abort the transaction and return the error. Reported-by: syzbot+0b1fb6b0108c27419f9f@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000089773e05ee4b9cb4@google.com/ CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use cached state when looking for delalloc ranges with lseekFilipe Manana1-8/+32
During lseek (SEEK_HOLE/DATA), whenever we find a hole or prealloc extent, we will look for delalloc in that range, and one of the things we do for that is to find out ranges in the inode's io_tree marked with EXTENT_DELALLOC, using calls to count_range_bits(). Typically there's a single, or few, searches in the io_tree for delalloc per lseek call. However it's common for applications to keep calling lseek with SEEK_HOLE and SEEK_DATA to find where extents and holes are in a file, read the extents and skip holes in order to avoid unnecessary IO and save disk space by preserving holes. One popular user is the cp utility from coreutils. Starting with coreutils 9.0, cp uses SEEK_HOLE and SEEK_DATA to iterate over the extents of a file. Before 9.0, it used fiemap to figure out where holes and extents are in the source file. Another popular user is the tar utility when used with the --sparse / -S option to detect and preserve holes. Given that the pattern is to keep calling lseek with a start offset that matches the returned offset from the previous lseek call, we can benefit from caching the last extent state visited in count_range_bits() and use it for the next count_range_bits() from the next lseek call. Example, the following strace excerpt from running tar: $ strace tar cJSvf foo.tar.xz qemu_disk_file.raw (...) lseek(5, 125019574272, SEEK_HOLE) = 125024989184 lseek(5, 125024989184, SEEK_DATA) = 125024993280 lseek(5, 125024993280, SEEK_HOLE) = 125025239040 lseek(5, 125025239040, SEEK_DATA) = 125025255424 lseek(5, 125025255424, SEEK_HOLE) = 125025353728 lseek(5, 125025353728, SEEK_DATA) = 125025357824 lseek(5, 125025357824, SEEK_HOLE) = 125026766848 lseek(5, 125026766848, SEEK_DATA) = 125026770944 lseek(5, 125026770944, SEEK_HOLE) = 125027053568 (...) Shows that pattern, which is the same as with cp from coreutils 9.0+. So start using a cached state for the delalloc searches in lseek, and store it in struct file's private data so that it can be reused across lseek calls. This change is part of a patchset that is comprised of the following patches: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek The following test was run before and after applying the whole patchset: $ cat test-cp.sh #!/bin/bash DEV=/dev/sdh MNT=/mnt/sdh # coreutils 8.32, cp uses fiemap to detect holes and extents #CP_PROG=/usr/bin/cp # coreutils 9.1, cp uses SEEK_HOLE/DATA to detect holes and extents CP_PROG=/home/fdmanana/git/hub/coreutils/src/cp umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount $DEV $MNT FILE_SIZE=$((1024 * 1024 * 1024)) echo "Creating file with a size of $((FILE_SIZE / 1024 / 1024))M" # Create a very sparse file, where each extent has a length of 4K and # is preceded by a 4K hole and followed by another 4K hole. start=$(date +%s%N) echo -n > $MNT/foobar for ((off = 0; off < $FILE_SIZE; off += 8192)); do xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null echo -ne "\r$off / $FILE_SIZE ..." done end=$(date +%s%N) echo -e "\nFile created ($(( (end - start) / 1000000 )) milliseconds)" start=$(date +%s%N) $CP_PROG $MNT/foobar /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "cp took $dur milliseconds with data/metadata cached and delalloc" # Flush all delalloc. sync start=$(date +%s%N) $CP_PROG $MNT/foobar /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "cp took $dur milliseconds with data/metadata cached and no delalloc" # Unmount and mount again to test the case without any metadata # loaded in memory. umount $MNT mount $DEV $MNT start=$(date +%s%N) $CP_PROG $MNT/foobar /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "cp took $dur milliseconds without data/metadata cached and no delalloc" umount $MNT The results, running on a box with a non-debug kernel (Debian's default kernel config), were the following: 128M file, before patchset: cp took 16574 milliseconds with data/metadata cached and delalloc cp took 122 milliseconds with data/metadata cached and no delalloc cp took 20144 milliseconds without data/metadata cached and no delalloc 128M file, after patchset: cp took 6277 milliseconds with data/metadata cached and delalloc cp took 109 milliseconds with data/metadata cached and no delalloc cp took 210 milliseconds without data/metadata cached and no delalloc 512M file, before patchset: cp took 14369 milliseconds with data/metadata cached and delalloc cp took 429 milliseconds with data/metadata cached and no delalloc cp took 88034 milliseconds without data/metadata cached and no delalloc 512M file, after patchset: cp took 12106 milliseconds with data/metadata cached and delalloc cp took 427 milliseconds with data/metadata cached and no delalloc cp took 824 milliseconds without data/metadata cached and no delalloc 1G file, before patchset: cp took 10074 milliseconds with data/metadata cached and delalloc cp took 886 milliseconds with data/metadata cached and no delalloc cp took 181261 milliseconds without data/metadata cached and no delalloc 1G file, after patchset: cp took 3320 milliseconds with data/metadata cached and delalloc cp took 880 milliseconds with data/metadata cached and no delalloc cp took 1801 milliseconds without data/metadata cached and no delalloc Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use cached state when looking for delalloc ranges with fiemapFilipe Manana1-3/+7
During fiemap, whenever we find a hole or prealloc extent, we will look for delalloc in that range, and one of the things we do for that is to find out ranges in the inode's io_tree marked with EXTENT_DELALLOC, using calls to count_range_bits(). Since we process file extents from left to right, if we have a file with several holes or prealloc extents, we benefit from keeping a cached extent state record for calls to count_range_bits(). Most of the time the last extent state record we visited in one call to count_range_bits() matches the first extent state record we will use in the next call to count_range_bits(), so there's a benefit here. So use an extent state record to cache results from count_range_bits() calls during fiemap. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: allow passing a cached state record to count_range_bits()Filipe Manana1-1/+2
An inode's io_tree can be quite large and there are cases where due to delalloc it can have thousands of extent state records, which makes the red black tree have a depth of 10 or more, making the operation of count_range_bits() slow if we repeatedly call it for a range that starts where, or after, the previous one we called it for. Such use cases are when searching for delalloc in a file range that corresponds to a hole or a prealloc extent, which is done during lseek SEEK_HOLE/DATA and fiemap. So introduce a cached state parameter to count_range_bits() which we use to store the last extent state record we visited, and then allow the caller to pass it again on its next call to count_range_bits(). The next patches in the series will make fiemap and lseek use the new parameter. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: search for delalloc more efficiently during lseek/fiemapFilipe Manana1-104/+48
During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range that corresponds to a hole or a prealloc extent, we have to check if there's any delalloc in the range. We do it by searching for delalloc ranges in the inode's io_tree (for unflushed delalloc) and in the inode's extent map tree (for delalloc that is flushing). We avoid searching the extent map tree if the number of outstanding extents is 0, as in that case we can't have extent maps for our search range in the tree that correspond to delalloc that is flushing. However if we have any unflushed delalloc, due to buffered writes or mmap writes, then the outstanding extents counter is not 0 and we'll search the extent map tree. The tree may be large because it can have lots of extent maps that were loaded by reads or created by previous writes, therefore taking a significant time to search the tree, specially if have a file with a lot of holes and/or prealloc extents. We can improve on this by instead of searching the extent map tree, searching the ordered extents tree of the inode, since when delalloc is flushing we create an ordered extent along with the new extent map, while holding the respective file range locked in the inode's io_tree. The ordered extents tree is typically much smaller, since ordered extents have a short life and get removed from the tree once they are completed, while extent maps can stay for a very long time in the extent map tree, either created by previous writes or loaded by read operations. So use the ordered extents tree instead of the extent maps tree. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: skip unnecessary delalloc searches during lseek/fiemapFilipe Manana1-1/+7
During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range that corresponds to a hole or a prealloc extent, if we find that there is no delalloc marked in the inode's io_tree but there is delalloc due to an extent map in the io tree, then on the next iteration that calls find_delalloc_subrange() we can skip searching the io tree again, since on the first call we had no delalloc in the io tree for the whole range. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add an early exit when searching for delalloc range for lseek/fiemapFilipe Manana1-6/+16
During fiemap and lseek (SEEK_HOLE/DATA), when looking for delalloc in a range corresponding to a hole or a prealloc extent, if we found the whole range marked as delalloc in the inode's io_tree, then we can terminate immediately and avoid searching the extent map tree. If not, and if the found delalloc starts at the same offset of our search start but ends before our search range's end, then we can adjust the search range for the search in the extent map tree. So implement those changes. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: pass btrfs_inode to btrfs_inode_unlockDavid Sterba1-16/+16
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: pass btrfs_inode to btrfs_inode_lockDavid Sterba1-8/+8
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: update stale comment for nowait direct IO writesFilipe Manana1-2/+2
If when doing a direct IO write we need to fallback to buffered IO, we this comment at btrfs_direct_write() that says we can't directly fallback to buffered IO if we have a NOWAIT iocb, because we have no support for NOWAIT buffered writes. That is not true anymore, as support for NOWAIT buffered writes was added recently in commit 926078b21db9 ("btrfs: enable nowait async buffered writes"). However we still can't fallback to a buffered write in case we have a NOWAIT iocb, because we'll need to flush delalloc and wait for it to complete after doing the buffered write, and that can block for several reasons, the main reason being waiting for IO to complete. So update the comment to mention all that. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move super_block specific helpers into super.hJosef Bacik1-0/+1
This will make syncing fs.h to user space a little easier if we can pull the super block specific helpers out of fs.h and put them in super.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move file prototypes to file.hJosef Bacik1-0/+1
Move these out of ctree.h into file.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move ioctl prototypes into ioctl.hJosef Bacik1-0/+1
Move these out of ctree.h into ioctl.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move file-item prototypes into their own headerJosef Bacik1-0/+1
Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move the auto defrag code to defrag.cJosef Bacik1-340/+0
This currently exists in file.c, move it to the more natural location in defrag.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ reformat comments ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move extent-tree helpers into their own header fileJosef Bacik1-0/+1
Move all the extent tree related prototypes to extent-tree.h out of ctree.h, and then go include it everywhere needed so everything compiles. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move accessor helpers into accessors.hJosef Bacik1-0/+1
This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move fs wide helpers out of ctree.hJosef Bacik1-0/+1
We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: skip unnecessary delalloc search during fiemap and lseekFilipe Manana1-13/+20
During fiemap and lseek (hole and data seeking), there's no point in iterating the inode's io tree to count delalloc bits if the inode's delalloc bytes counter has a value of zero, as that counter is updated whenever we set a range for delalloc or clear a range from delalloc. So skip the counting and io tree iteration if the inode's delalloc bytes counter has a value of zero. This helps save time when processing a file range corresponding to a hole or prealloc (unwritten) extent. This patch is part of a series comprised of the following patches: btrfs: get the next extent map during fiemap/lseek more efficiently btrfs: skip unnecessary extent map searches during fiemap and lseek btrfs: skip unnecessary delalloc search during fiemap and lseek The following test was performed on a release kernel (Debian's default kernel config) before and after applying those 3 patches. # Wrapper to call fiemap in extent count only mode. # (struct fiemap::fm_extent_count set to 0) $ cat fiemap.c #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <fcntl.h> #include <errno.h> #include <string.h> #include <sys/ioctl.h> #include <linux/fs.h> #include <linux/fiemap.h> int main(int argc, char **argv) { struct fiemap fiemap = { 0 }; int fd; if (argc != 2) { printf("usage: %s <path>\n", argv[0]); return 1; } fd = open(argv[1], O_RDONLY); if (fd < 0) { fprintf(stderr, "error opening file: %s\n", strerror(errno)); return 1; } /* fiemap.fm_extent_count set to 0, to count extents only. */ fiemap.fm_length = FIEMAP_MAX_OFFSET; if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) { fprintf(stderr, "fiemap error: %s\n", strerror(errno)); return 1; } close(fd); printf("fm_mapped_extents = %d\n", fiemap.fm_mapped_extents); return 0; } $ gcc -o fiemap fiemap.c And the wrapper shell script that creates a file with many holes and runs fiemap against it: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT FILE_SIZE=$((1 * 1024 * 1024 * 1024)) echo -n > $MNT/foobar for ((off = 0; off < $FILE_SIZE; off += 8192)); do xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null done # flush all delalloc sync start=$(date +%s%N) ./fiemap $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds" umount $MNT Result before applying patchset: fm_mapped_extents = 131072 fiemap took 63 milliseconds Result after applying patchset: fm_mapped_extents = 131072 fiemap took 39 milliseconds (-38.1%) Running the same test for a 512M file instead of a 1G file, gave the following results. Result before applying patchset: fm_mapped_extents = 65536 fiemap took 29 milliseconds Result after applying patchset: fm_mapped_extents = 65536 fiemap took 20 milliseconds (-31.0%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: skip unnecessary extent map searches during fiemap and lseekFilipe Manana1-0/+12
If we have no outstanding extents it means we don't have any extent maps corresponding to delalloc that is flushing, as when an ordered extent is created we increment the number of outstanding extents to 1 and when we remove the ordered extent we decrement them by 1. So skip extent map tree searches if the number of outstanding ordered extents is 0, saving time as the tree is not empty if we have previously made some reads or flushed delalloc, as in those cases it can have a very large number of extent maps for files with many extents. This helps save time when processing a file range corresponding to a hole or prealloc (unwritten) extent. The next patch in the series has a performance test in its changelog and its subject is: "btrfs: skip unnecessary delalloc search during fiemap and lseek" Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: get the next extent map during fiemap/lseek more efficientlyFilipe Manana1-17/+27
At find_delalloc_subrange(), when we need to get the next extent map, we do a full search on the extent map tree (a red black tree). This is fine but it's a lot more efficient to simply use rb_next(), which typically requires iterating over less nodes of the tree and never needs to compare the ranges of nodes with the one we are looking for. So add a public helper to extent_map.{h,c} to get the extent map that immediately follows another extent map, using rb_next(), and use that helper at find_delalloc_subrange(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use cached_state for btrfs_check_nocow_lockJosef Bacik1-3/+6
Now that try_lock_extent() takes a cached_state, plumb the cached_state through btrfs_try_lock_ordered_range() and then use a cached_state in btrfs_check_nocow_lock everywhere to avoid extra tree searches on the extent_io_tree. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add a cached_state to try_lock_extentJosef Bacik1-1/+2
With nowait becoming more pervasive throughout our codebase go ahead and add a cached_state to try_lock_extent(). This allows us to be faster about clearing the locked area if we have contention, and then gives us the same optimization for unlock if we are able to lock the range. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: fix inode reserve space leak due to nowait buffered writeFilipe Manana1-1/+3
During a nowait buffered write, if we fail to balance dirty pages we exit btrfs_buffered_write() without releasing the delalloc space reserved for an extent, resulting in leaking space from the inode's block reserve. So fix that by releasing the delalloc space for the extent when balancing dirty pages fails. Reported-by: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/all/202210111304.d369bc32-yujie.liu@intel.com Fixes: 965f47aeb5de ("btrfs: make btrfs_buffered_write nowait compatible") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: fix nowait buffered write returning -ENOSPCFilipe Manana1-0/+3
If we are doing a buffered write in NOWAIT context and we can't reserve metadata space due to -ENOSPC, then we should return -EAGAIN so that we retry the write in a context allowed to block and do metadata reservation with flushing, which might succeed this time due to the allowed flushing. Returning -ENOSPC while in NOWAIT context simply makes some writes fail with -ENOSPC when they would likely succeed after switching from NOWAIT context to blocking context. That is unexpected behaviour and even fio complains about it with a warning like this: fio: io_u error on file /mnt/sdi/task_0.0.0: No space left on device: write offset=1535705088, buflen=65536 fio: pid=592630, err=28/file:io_u.c:1846, func=io_u error, error=No space left on device The fio's job config is this: [global] bs=64K ioengine=io_uring iodepth=1 size=2236962133 nr_files=1 filesize=2236962133 direct=0 runtime=10 fallocate=posix io_size=2236962133 group_reporting time_based [task_0] rw=randwrite directory=/mnt/sdi numjobs=4 So fix this by returning -EAGAIN if we are in NOWAIT context and the metadata reservation failed with -ENOSPC. Fixes: 304e45acdb8f ("btrfs: plumb NOWAIT through the write path") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-10-31btrfs: fix lost file sync on direct IO write with nowait and dsync iocbFilipe Manana1-6/+16
When doing a direct IO write using a iocb with nowait and dsync set, we end up not syncing the file once the write completes. This is because we tell iomap to not call generic_write_sync(), which would result in calling btrfs_sync_file(), in order to avoid a deadlock since iomap can call it while we are holding the inode's lock and btrfs_sync_file() needs to acquire the inode's lock. The deadlock happens only if the write happens synchronously, when iomap_dio_rw() calls iomap_dio_complete() before it returns. Instead we do the sync ourselves at btrfs_do_write_iter(). For a nowait write however we can end up not doing the sync ourselves at at btrfs_do_write_iter() because the write could have been queued, and therefore we get -EIOCBQUEUED returned from iomap in such case. That makes us skip the sync call at btrfs_do_write_iter(), as we don't do it for any error returned from btrfs_direct_write(). We can't simply do the call even if -EIOCBQUEUED is returned, since that would block the task waiting for IO, both for the data since there are bios still in progress as well as potentially blocking when joining a log transaction and when syncing the log (writing log trees, super blocks, etc). So let iomap do the sync call itself and in order to avoid deadlocks for the case of synchronous writes (without nowait), use __iomap_dio_rw() and have ourselves call iomap_dio_complete() after unlocking the inode. A test case will later be sent for fstests, after this is fixed in Linus' tree. Fixes: 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes") Reported-by: Марк Коренберг <socketpair@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAEmTpZGRKbzc16fWPvxbr6AfFsQoLmz-Lcg-7OgJOZDboJ+SGQ@mail.gmail.com/ CC: stable@vger.kernel.org # 6.0+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: add helper to replace extent map range with a new extent mapFilipe Manana1-7/+1
We have several places that need to drop all the extent maps in a given file range and then add a new extent map for that range. Currently they call btrfs_drop_extent_map_range() to delete all extent maps in the range and then keep trying to add the new extent map in a loop that keeps retrying while the insertion of the new extent map fails with -EEXIST. So instead of repeating this logic, add a helper to extent_map.c that does these steps and name it btrfs_replace_extent_map_range(). Also add a comment about why the retry loop is necessary. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: move btrfs_drop_extent_cache() to extent_map.cFilipe Manana1-187/+3
The function btrfs_drop_extent_cache() doesn't really belong at file.c because what it does is drop a range of extent maps for a file range. It directly allocates and manipulates extent maps, by dropping, splitting and replacing them in an extent map tree, so it should be located at extent_map.c, where all manipulations of an extent map tree and its extent maps are supposed to be done. So move it out of file.c and into extent_map.c. Additionally do the following changes: 1) Rename it into btrfs_drop_extent_map_range(), as this makes it more clear about what it does. The term "cache" is a bit confusing as it's not widely used, "extent maps" or "extent mapping" is much more common; 2) Change its 'skip_pinned' argument from int to bool; 3) Turn several of its local variables from int to bool, since they are used as booleans; 4) Move the declaration of some variables out of the function's main scope and into the scopes where they are used; 5) Remove pointless assignment of false to 'modified' early in the while loop, as later that variable is set and it's not used before that second assignment; 6) Remove checks for NULL before calling free_extent_map(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: fix missed extent on fsync after dropping extent mapsFilipe Manana1-12/+46
When dropping extent maps for a range, through btrfs_drop_extent_cache(), if we find an extent map that starts before our target range and/or ends before the target range, and we are not able to allocate extent maps for splitting that extent map, then we don't fail and simply remove the entire extent map from the inode's extent map tree. This is generally fine, because in case anyone needs to access the extent map, it can just load it again later from the respective file extent item(s) in the subvolume btree. However, if that extent map is new and is in the list of modified extents, then a fast fsync will miss the parts of the extent that were outside our range (that needed to be split), therefore not logging them. Fix that by marking the inode for a full fsync. This issue was introduced after removing BUG_ON()s triggered when the split extent map allocations failed, done by commit 7014cdb49305ed ("Btrfs: btrfs_drop_extent_cache should never fail"), back in 2012, and the fast fsync path already existed but was very recent. Also, in the case where we could allocate extent maps for the split operations but then fail to add a split extent map to the tree, mark the inode for a full fsync as well. This is not supposed to ever fail, and we assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is not set), it's the correct thing to do to make sure a fast fsync will not miss a new extent. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: enable nowait async buffered writesStefan Roesch1-2/+2
Enable nowait async buffered writes in btrfs_do_write_iter() and btrfs_file_open(). In this version encoded buffered writes have the optimization not enabled. Encoded writes are enabled by using an ioctl. io_uring currently does not support ioctls. This might be enabled in the future. Performance results: For fio the following results have been obtained with a queue depth of 1 and 4k block size (runtime 600 secs): sequential writes: without patch with patch libaio psync iops: 55k 134k 117K 148K bw: 221MB/s 538MB/s 469MB/s 592MB/s clat: 15286ns 82ns 994ns 6340ns For an io depth of 1, the new patch improves throughput by over two times (compared to the existing behavior, where buffered writes are processed by an io-worker process) and also the latency is considerably reduced. To achieve the same or better performance with the existing code an io depth of 4 is required. Increasing the iodepth further does not lead to improvements. The tests have been run like this: ./fio --name=seq-writers --ioengine=psync --iodepth=1 --rw=write \ --bs=4k --direct=0 --size=100000m --time_based --runtime=600 \ --numjobs=1 --filename=... ./fio --name=seq-writers --ioengine=io_uring --iodepth=1 --rw=write \ --bs=4k --direct=0 --size=100000m --time_based --runtime=600 \ --numjobs=1 --filename=... ./fio --name=seq-writers --ioengine=libaio --iodepth=1 --rw=write \ --bs=4k --direct=0 --size=100000m --time_based --runtime=600 \ --numjobs=1 --filename=... Testing: This patch has been tested with xfstests, fsx, fio. xfstests shows no new diffs compared to running without the patch series. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: make btrfs_buffered_write nowait compatibleStefan Roesch1-2/+5
We need to avoid unconditionally calling balance_dirty_pages_ratelimited as it could wait for some reason. Use balance_dirty_pages_ratelimited_flags with the BDP_ASYNC in case the buffered write is nowait, returning EAGAIN eventually. It also moves the function after the again label. This can cause the function to be called a bit later, but this should have no impact in the real world. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: plumb NOWAIT through the write pathStefan Roesch1-6/+13
We have everywhere setup for nowait, plumb NOWAIT through the write path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: make lock_and_cleanup_extent_if_need nowait compatibleStefan Roesch1-3/+16
Add the nowait parameter to lock_and_cleanup_extent_if_need(). If the nowait parameter is specified we try to lock the extent in nowait mode. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: make prepare_pages nowait compatibleStefan Roesch1-8/+35
Add nowait parameter to the prepare_pages function. In case nowait is specified for an async buffered write request, do a nowait allocation or return -EAGAIN. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: make btrfs_check_nocow_lock nowait compatibleJosef Bacik1-11/+22
Now all the helpers that btrfs_check_nocow_lock uses handle nowait, add a nowait flag to btrfs_check_nocow_lock so it can be used by the write path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: add the ability to use NO_FLUSH for data reservationsJosef Bacik1-1/+1
In order to accommodate NOWAIT IOCB's we need to be able to do NO_FLUSH data reservations, so plumb this through the delalloc reservation system. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: make can_nocow_extent nowait compatibleJosef Bacik1-1/+1
If we have NOWAIT specified on our IOCB and we're writing into a PREALLOC or NOCOW extent then we need to be able to tell can_nocow_extent that we don't want to wait on any locks or metadata IO. Fix can_nocow_extent to allow for NOWAIT. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: open code and remove btrfs_inode_sectorsize helperJosef Bacik1-6/+5
This is defined in btrfs_inode.h, and dereferences btrfs_root and btrfs_fs_info, both of which aren't defined in btrfs_inode.h. Additionally, in many places we already have root or fs_info, so this helper often makes the code harder to read. So delete the helper and simply open code it in the few places that we use it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITSJosef Bacik1-1/+1
Instead of taking up a whole argument to indicate we're clearing everything in a range, simply add another EXTENT bit to control this, and then update all the callers to drop this argument from the clear_extent_bit variants. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: unify the lock/unlock extent variantsJosef Bacik1-24/+22
We have two variants of lock/unlock extent, one set that takes a cached state, another that does not. This is slightly annoying, and generally speaking there are only a few places where we don't have a cached state. Simplify this by making lock_extent/unlock_extent the only variant and make it take a cached state, then convert all the callers appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: remove the wake argument from clear_extent_bitsJosef Bacik1-1/+1
This is only used in the case that we are clearing EXTENT_LOCKED, so infer this value from the bits passed in instead of taking it as an argument. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: make fiemap more efficient and accurate reporting extent sharednessFilipe Manana1-8/+8
The current fiemap implementation does not scale very well with the number of extents a file has. This is both because the main algorithm to find out the extents has a high algorithmic complexity and because for each extent we have to check if it's shared. This second part, checking if an extent is shared, is significantly improved by the two previous patches in this patchset, while the first part is improved by this specific patch. Every now and then we get reports from users mentioning fiemap is too slow or even unusable for files with a very large number of extents, such as the two recent reports referred to by the Link tags at the bottom of this change log. To understand why the part of finding which extents a file has is very inefficient, consider the example of doing a full ranged fiemap against a file that has over 100K extents (normal for example for a file with more than 10G of data and using compression, which limits the extent size to 128K). When we enter fiemap at extent_fiemap(), the following happens: 1) Before entering the main loop, we call get_extent_skip_holes() to get the first extent map. This leads us to btrfs_get_extent_fiemap(), which in turn calls btrfs_get_extent(), to find the first extent map that covers the file range [0, LLONG_MAX). btrfs_get_extent() will first search the inode's extent map tree, to see if we have an extent map there that covers the range. If it does not find one, then it will search the inode's subvolume b+tree for a fitting file extent item. After finding the file extent item, it will allocate an extent map, fill it in with information extracted from the file extent item, and add it to the inode's extent map tree (which requires a search for insertion in the tree). 2) Then we enter the main loop at extent_fiemap(), emit the details of the extent, and call again get_extent_skip_holes(), with a start offset matching the end of the extent map we previously processed. We end up at btrfs_get_extent() again, will search the extent map tree and then search the subvolume b+tree for a file extent item if we could not find an extent map in the extent tree. We allocate an extent map, fill it in with the details in the file extent item, and then insert it into the extent map tree (yet another search in this tree). 3) The second step is repeated over and over, until we have processed the whole file range. Each iteration ends at btrfs_get_extent(), which does a red black tree search on the extent map tree, then searches the subvolume b+tree, allocates an extent map and then does another search in the extent map tree in order to insert the extent map. In the best scenario we have all the extent maps already in the extent tree, and so for each extent we do a single search on a red black tree, so we have a complexity of O(n log n). In the worst scenario we don't have any extent map already loaded in the extent map tree, or have very few already there. In this case the complexity is much higher since we do: - A red black tree search on the extent map tree, which has O(log n) complexity, initially very fast since the tree is empty or very small, but as we end up allocating extent maps and adding them to the tree when we don't find them there, each subsequent search on the tree gets slower, since it's getting bigger and bigger after each iteration. - A search on the subvolume b+tree, also O(log n) complexity, but it has items for all inodes in the subvolume, not just items for our inode. Plus on a filesystem with concurrent operations on other inodes, we can block doing the search due to lock contention on b+tree nodes/leaves. - Allocate an extent map - this can block, and can also fail if we are under serious memory pressure. - Do another search on the extent maps red black tree, with the goal of inserting the extent map we just allocated. Again, after every iteration this tree is getting bigger by 1 element, so after many iterations the searches are slower and slower. - We will not need the allocated extent map anymore, so it's pointless to add it to the extent map tree. It's just wasting time and memory. In short we end up searching the extent map tree multiple times, on a tree that is growing bigger and bigger after each iteration. And besides that we visit the same leaf of the subvolume b+tree many times, since a leaf with the default size of 16K can easily have more than 200 file extent items. This is very inefficient overall. This patch changes the algorithm to instead iterate over the subvolume b+tree, visiting each leaf only once, and only searching in the extent map tree for file ranges that have holes or prealloc extents, in order to figure out if we have delalloc there. It will never allocate an extent map and add it to the extent map tree. This is very similar to what was previously done for the lseek's hole and data seeking features. Also, the current implementation relying on extent maps for figuring out which extents we have is not correct. This is because extent maps can be merged even if they represent different extents - we do this to minimize memory utilization and keep extent map trees smaller. For example if we have two extents that are contiguous on disk, once we load the two extent maps, they get merged into a single one - however if only one of the extents is shared, we end up reporting both as shared or both as not shared, which is incorrect. This reproducer triggers that bug: $ cat fiemap-bug.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT # Create a file with two 256K extents. # Since there is no other write activity, they will be contiguous, # and their extent maps merged, despite having two distinct extents. xfs_io -f -c "pwrite -S 0xab 0 256K" \ -c "fsync" \ -c "pwrite -S 0xcd 256K 256K" \ -c "fsync" \ $MNT/foo # Now clone only the second extent into another file. xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar # Filefrag will report a single 512K extent, and say it's not shared. echo filefrag -v $MNT/foo umount $MNT Running the reproducer: $ ./fiemap-bug.sh wrote 262144/262144 bytes at offset 0 256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec) wrote 262144/262144 bytes at offset 262144 256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec) linked 262144/262144 bytes at offset 0 256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec) Filesystem type is: 9123683e File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 127: 3328.. 3455: 128: last,eof /mnt/sdj/foo: 1 extent found We end up reporting that we have a single 512K that is not shared, however we have two 256K extents, and the second one is shared. Changing the reproducer to clone instead the first extent into file 'bar', makes us report a single 512K extent that is shared, which is algo incorrect since we have two 256K extents and only the first one is shared. This patch is part of a larger patchset that is comprised of the following patches: btrfs: allow hole and data seeking to be interruptible btrfs: make hole and data seeking a lot more efficient btrfs: remove check for impossible block start for an extent map at fiemap btrfs: remove zero length check when entering fiemap btrfs: properly flush delalloc when entering fiemap btrfs: allow fiemap to be interruptible btrfs: rename btrfs_check_shared() to a more descriptive name btrfs: speedup checking for extent sharedness during fiemap btrfs: skip unnecessary extent buffer sharedness checks during fiemap btrfs: make fiemap more efficient and accurate reporting extent sharedness The patchset was tested on a machine running a non-debug kernel (Debian's default config) and compared the tests below on a branch without the patchset versus the same branch with the whole patchset applied. The following test for a large compressed file without holes: $ cat fiemap-perf-test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount -o compress=lzo $DEV $MNT # 40G gives 327680 128K file extents (due to compression). xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata not cached)" start=$(date +%s%N) filefrag $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "fiemap took $dur milliseconds (metadata cached)" umount $MNT Before patchset: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 3597 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 2107 milliseconds (metadata cached) After patchset: $ ./fiemap-perf-test.sh (...) /mnt/sdi/foobar: 327680 extents found fiemap took 1214 milliseconds (metadata not cached) /mnt/sdi/foobar: 327680 extents found fiemap took 684 milliseconds (metadata cached) That's a speedup of about 3x for both cases (no metadata cached and all metadata cached). The test provided by Pavel (first Link tag at the bottom), which uses files with a large number of holes, was also used to measure the gains, and it consists on a small C program and a shell script to invoke it. The C program is the following: $ cat pavels-test.c #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <fcntl.h> #include <sys/stat.h> #include <sys/time.h> #include <sys/ioctl.h> #include <linux/fs.h> #include <linux/fiemap.h> #define FILE_INTERVAL (1<<13) /* 8Kb */ long long interval(struct timeval t1, struct timeval t2) { long long val = 0; val += (t2.tv_usec - t1.tv_usec); val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000; return val; } int main(int argc, char **argv) { struct fiemap fiemap = {}; struct timeval t1, t2; char data = 'a'; struct stat st; int fd, off, file_size = FILE_INTERVAL; if (argc != 3 && argc != 2) { printf("usage: %s <path> [size]\n", argv[0]); return 1; } if (argc == 3) file_size = atoi(argv[2]); if (file_size < FILE_INTERVAL) file_size = FILE_INTERVAL; file_size -= file_size % FILE_INTERVAL; fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644); if (fd < 0) { perror("open"); return 1; } for (off = 0; off < file_size; off += FILE_INTERVAL) { if (pwrite(fd, &data, 1, off) != 1) { perror("pwrite"); close(fd); return 1; } } if (ftruncate(fd, file_size)) { perror("ftruncate"); close(fd); return 1; } if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; } printf("size: %ld\n", st.st_size); printf("actual size: %ld\n", st.st_blocks * 512); fiemap.fm_length = FIEMAP_MAX_OFFSET; gettimeofday(&t1, NULL); if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) { perror("fiemap"); close(fd); return 1; } gettimeofday(&t2, NULL); printf("fiemap: fm_mapped_extents = %d\n", fiemap.fm_mapped_extents); printf("time = %lld us\n", interval(t1, t2)); close(fd); return 0; } $ gcc -o pavels_test pavels_test.c And the wrapper shell script: $ cat fiemap-pavels-test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f -O no-holes $DEV mount $DEV $MNT echo echo "*********** 256M ***********" echo ./pavels-test $MNT/testfile $((1 << 28)) echo ./pavels-test $MNT/testfile $((1 << 28)) echo echo "*********** 512M ***********" echo ./pavels-test $MNT/testfile $((1 << 29)) echo ./pavels-test $MNT/testfile $((1 << 29)) echo echo "*********** 1G ***********" echo ./pavels-test $MNT/testfile $((1 << 30)) echo ./pavels-test $MNT/testfile $((1 << 30)) umount $MNT Running his reproducer before applying the patchset: *********** 256M *********** size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = 4003133 us size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = 4895330 us *********** 512M *********** size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 30123675 us size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 33450934 us *********** 1G *********** size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 131072 time = 224924074 us size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 131072 time = 217239242 us Running it after applying the patchset: *********** 256M *********** size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = 29475 us size: 268435456 actual size: 134217728 fiemap: fm_mapped_extents = 32768 time = 29307 us *********** 512M *********** size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 58996 us size: 536870912 actual size: 268435456 fiemap: fm_mapped_extents = 65536 time = 59115 us *********** 1G *********** size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 116251 time = 124141 us size: 1073741824 actual size: 536870912 fiemap: fm_mapped_extents = 131072 time = 119387 us The speedup is massive, both on the first fiemap call and on the second one as well, as his test creates files with many holes and small extents (every extent follows a hole and precedes another hole). For the 256M file we go from 4 seconds down to 29 milliseconds in the first run, and then from 4.9 seconds down to 29 milliseconds again in the second run, a speedup of 138x and 169x, respectively. For the 512M file we go from 30.1 seconds down to 59 milliseconds in the first run, and then from 33.5 seconds down to 59 milliseconds again in the second run, a speedup of 510x and 568x, respectively. For the 1G file, we go from 225 seconds down to 124 milliseconds in the first run, and then from 217 seconds down to 119 milliseconds in the second run, a speedup of 1815x and 1824x, respectively. Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/ Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: make hole and data seeking a lot more efficientFilipe Manana1-31/+406
The current implementation of hole and data seeking for llseek does not scale well in regards to the number of extents and the distance between the start offset and the next hole or extent. This is due to a very high algorithmic complexity. Often we also get reports of btrfs' hole and data seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link tag at the bottom). In order to better understand it, lets consider the case where the start offset is 0, we are seeking for a hole and the file size is 16G. Between file offset 0 and the first hole in the file there are 100K extents - this is common for large files, specially if we have compression enabled, since the maximum extent size is limited to 128K. The steps take by the main loop of the current algorithm are the following: 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which calls btrfs_get_extent(). This will first lookup for an extent map in the inode's extent map tree (a red black tree). If the extent map is not loaded in memory, then it will do a lookup for the corresponding file extent item in the subvolume's b+tree, create an extent map based on the contents of the file extent item and then add the extent map to the extent map tree of the inode; 2) The second iteration calls btrfs_get_extent_fiemap() again, this time with a start offset matching the end offset of the previous extent. Again, btrfs_get_extent() will first search the extent map tree, and if it doesn't find an extent map there, it will again search in the b+tree of the subvolume for a matching file extent item, build an extent map based on the file extent item, and add the extent map to to the extent map tree of the inode; 3) This repeats over and over until we find the first hole (when seeking for holes) or until we find the first extent (when seeking for data). If there no extent maps loaded in memory for each iteration, then on each iteration we do 1 extent map tree search, 1 b+tree search, plus 1 more extent map tree traversal to insert an extent map - plus we allocate memory for the extent map. On each iteration we are growing the size of the extent map tree, making each future search slower, and also visiting the same b+tree leaves over and over again - taking into account with the default leaf size of 16K we can fit more than 200 file extent items in a leaf - so we can visit the same b+tree leaf 200+ times, on each visit walking down a path from the root to the leaf. So it's easy to see that what we have now doesn't scale well. Also, it loads an extent map for every file extent item into memory, which is not efficient - we should add extents maps only when doing IO (writing or reading file data). This change implements a new algorithm which scales much better, and works like this: 1) We iterate over the subvolume's b+tree, visiting each leaf that has file extent items once and only once; 2) For any file extent items found, that don't represent holes or prealloc extents, it will not search the extent map tree - there's no need at all for that - an extent map is just an in-memory representation of a file extent item; 3) When a hole is found, or a prealloc extent, it will check if there's delalloc for its range. For this it will search for EXTENT_DELALLOC bits in the inode's io tree and check the extent map tree - this is for accounting for unflushed delalloc and for flushed delalloc (the period between running delalloc and ordered extent completion), respectively. This is similar to what the current implementation does when it finds a hole or prealloc extent, but without creating extent maps and adding them to the extent map tree in case they are not loaded in memory; 4) It never allocates extent maps, or adds extent maps to the inode's extent map tree. This not only saves memory and time (from the tree insertions and allocations), but also eliminates the possibility of -ENOMEM due to allocating too many extent maps. Part of this new code will also be used later for fiemap (which also suffers similar scalability problems). The following test example can be used to quickly measure the efficiency before and after this patch: $ cat test-seek-hole.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount -o compress=lzo $DEV $MNT # 16G file -> 131073 compressed extents. xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar # Leave a 1M hole at file offset 15G. xfs_io -c "fpunch 15G 1M" $MNT/foobar # Unmount and mount again, so that we can test when there's no # metadata cached in memory. umount $MNT mount -o compress=lzo $DEV $MNT # Test seeking for hole from offset 0 (hole is at offset 15G). start=$(date +%s%N) xfs_io -c "seek -h 0" $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Took $dur milliseconds to seek first hole (metadata not cached)" echo start=$(date +%s%N) xfs_io -c "seek -h 0" $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Took $dur milliseconds to seek first hole (metadata cached)" echo umount $MNT Before this change: $ ./test-seek-hole.sh (...) Whence Result HOLE 16106127360 Took 176 milliseconds to seek first hole (metadata not cached) Whence Result HOLE 16106127360 Took 17 milliseconds to seek first hole (metadata cached) After this change: $ ./test-seek-hole.sh (...) Whence Result HOLE 16106127360 Took 43 milliseconds to seek first hole (metadata not cached) Whence Result HOLE 16106127360 Took 13 milliseconds to seek first hole (metadata cached) That's about 4x faster when no metadata is cached and about 30% faster when all metadata is cached. In practice the differences may often be significantly higher, either due to a higher number of extents in a file or because the subvolume's b+tree is much bigger than in this example, where we only have one file. Link: https://lwn.net/Articles/718805/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: allow hole and data seeking to be interruptibleFilipe Manana1-0/+4
Doing hole or data seeking on a file with a very large number of extents can take a long time, and we have reports of it being too slow (such as at LSFMM from 2017, see the Link below). So make it interruptible. Link: https://lwn.net/Articles/718805/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: log conflicting inodes without holding log mutex of the initial inodeFilipe Manana1-0/+1
When logging an inode, if we detect the inode has a reference that conflicts with some other inode that got renamed, we log that other inode while holding the log mutex of the current inode. We then find out if there are other inodes that conflict with the first conflicting inode, and log them while under the log mutex of the original inode. This is fine because the recursion can only happen once. For the upcoming work where we directly log delayed items without flushing them first to the subvolume tree, this recursion adds a lot of complexity and it's hard to keep lockdep happy about it. So collect a list of conflicting inodes and then log the inodes after unlocking the log mutex of the inode we started with. Also limit the maximum number of conflict inodes we log to 10, to avoid spending too much time logging (and maybe allocating too many list elements too), as typically we don't have more than 1 or 2 conflicting inodes - if we go over the limit, simply fallback to a transaction commit. It is possible to have a very long list of conflicting inodes to be intentionally created by a user if he/she creates a very long succession of renames like this: (...) rename E to F rename D to E rename C to D rename B to C rename A to B touch A (create a new file named A) fsync A If that happened for a sequence of hundreds or thousands of renames, it could massively slow down the logging and cause other secondary effects like for example blocking other fsync operations and transaction commits for a very long time (assuming it wouldn't run into -ENOSPC or -ENOMEM first). However such cases are very uncommon to happen in practice, nevertheless it's better to be prepared for them and avoid chaos. Such long sequence of conflicting inodes could be created before this change. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent()Omar Sandoval1-2/+2
btrfs_insert_file_extent() is only ever used to insert holes, so rename it and remove the redundant parameters. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>