summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2021-08-23btrfs: make relocate_one_page() handle subpage caseQu Wenruo1-29/+77
For subpage case, one page of data reloc inode can contain several file extents, like this: |<--- File extent A --->| FE B | FE C |<--- File extent D -->| |<--------- Page --------->| We can no longer use PAGE_SIZE directly for various operations. This patch will relocate_one_page() to handle subpage case by: - Iterating through all extents of a cluster when marking pages When marking pages dirty and delalloc, we need to check the cluster extent boundary. Now we introduce a loop to go extent by extent of a page, until we either finished the last extent, or reach the page end. By this, regular sectorsize == PAGE_SIZE can still work as usual, since we will do that loop only once. - Iteration start from max(page_start, extent_start) Since we can have the following case: | FE B | FE C |<--- File extent D -->| |<--------- Page --------->| Thus we can't always start from page_start, but do a max(page_start, extent_start) - Iteration end when the cluster is exhausted Similar to previous case, the last file extent can end before the page end: |<--- File extent A --->| FE B | FE C | |<--------- Page --------->| In this case, we need to manually exit the loop after we have finished the last extent of the cluster. - Reserve metadata space for each extent range Since now we can hit multiple ranges in one page, we should reserve metadata for each range, not simply PAGE_SIZE. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: reloc: factor out relocation page read and dirty partQu Wenruo1-105/+93
In function relocate_file_extent_cluster(), we have a big loop for marking all involved page delalloc. That part is long enough to be contained in one function, so this patch will move that code chunk into a new function, relocate_one_page(). This also provides enough space for later subpage work. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: rework lzo_decompress_bio() to make it subpage compatibleQu Wenruo1-106/+90
For the initial subpage support, although we won't support compressed write, we still need to support compressed read. But for lzo_decompress_bio() it has several problems: - The abuse of PAGE_SIZE for boundary detection For subpage case, we should follow sectorsize to detect the padding zeros. Using PAGE_SIZE will cause subpage compress read to skip certain bytes, and causing read error. - Too many helper variables There are half a dozen helper variables, which is only making things harder to read This patch will rework lzo_decompress_bio() to make it work for subpage: - Use sectorsize to do boundary check, while still use PAGE_SIZE for page switching This allows us to have the same on-disk format for 4K sectorsize fs, while take advantage of larger page size. - Use two main cursors Only @cur_in and @cur_out is utilized as the main cursor. The helper variables will only be declared inside the loop, and only 2 helper variables needed. - Introduce a helper function to copy compressed segment payload Introduce a new helper, copy_compressed_segment(), to copy a compressed segment to workspace buffer. This function will handle the page switching. Now the net result is, with all the excessive comments and new helper function, the refactored code is still smaller, and easier to read. For other decompression code, they have no special padding rule, thus no need to bother for initial subpage support, but will be refactored to the same style later. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: rework btrfs_decompress_buf2page()Qu Wenruo5-99/+76
There are several bugs inside the function btrfs_decompress_buf2page() - @start_byte doesn't take bvec.bv_offset into consideration Thus it can't handle case where the target range is not page aligned. - Too many helper variables There are tons of helper variables, @buf_offset, @current_buf_start, @start_byte, @prev_start_byte, @working_bytes, @bytes. This hurts anyone who wants to read the function. - No obvious main cursor for the iteartion A new problem caused by previous problem. - Comments for parameter list makes no sense Like @buf_start is the offset to @buf, or offset inside the full decompressed extent? (Spoiler alert, the later case) And @total_out acts more like @buf_start + @size_of_buf. The worst is @disk_start. The real meaning of it is the file offset of the full decompressed extent. This patch will rework the whole function by: - Add a proper comment with ASCII art to explain the parameter list - Rework parameter list The old @buf_start is renamed to @decompressed, to show how many bytes are already decompressed inside the full decompressed extent. The old @total_out is replaced by @buf_len, which is the decompressed data size. For old @disk_start and @bio, just pass @compressed_bio in. - Use single main cursor The main cursor will be @cur_file_offset, to show what's the current file offset. Other helper variables will be declared inside the main loop, and only minimal amount of helper variables: * offset_inside_decompressed_buf: The only real helper * copy_start_file_offset: File offset we start memcpy * bvec_file_offset: File offset of current bvec Even with all these extensive comments, the final function is still smaller than the original function, which is definitely a win. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: grab correct extent map for subpage compressed extent readQu Wenruo1-3/+6
[BUG] When subpage compressed read write support is enabled, btrfs/038 always fails with EIO. A simplified script can easily trigger the problem: mkfs.btrfs -f -s 4k $dev mount $dev $mnt -o compress=lzo xfs_io -f -c "truncate 118811" $mnt/foo xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null sync btrfs subvolume snapshot -r $mnt $mnt/mysnap1 xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null sync xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null sync btrfs subvolume snapshot -r $mnt $mnt/mysnap2 cat $mnt/mysnap2/foo # Above cat will fail due to EIO [CAUSE] The problem is in btrfs_submit_compressed_read(). When it tries to grab the extent map of the read range, it uses the following call: em = lookup_extent_mapping(em_tree, page_offset(bio_first_page_all(bio)), fs_info->sectorsize); The problem is in the page_offset(bio_first_page_all(bio)) part. The offending inode has the following file extent layout item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13680640 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53 generation 8 type 1 (regular) extent data disk byte 0 nr 0 item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13676544 nr 4096 extent data offset 0 nr 53248 ram 86016 extent compression 2 (lzo) And the bio passed in has the following parameters: page_offset(bio_first_page_all(bio)) = 131072 bio_first_bvec_all(bio)->bv_offset = 65536 If we use page_offset(bio_first_page_all(bio) without adding bv_offset, we will get an extent map for file offset 131072, not 196608. This means we read uncompressed data from disk, and later decompression will definitely fail. [FIX] Take bv_offset into consideration when trying to grab an extent map. And add an ASSERT() to ensure we're really getting a compressed extent. Thankfully this won't affect anything but subpage, thus we only need to ensure this patch get merged before we enabled basic subpage support. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: disable compressed readahead for subpageQu Wenruo1-0/+10
For current subpage support, we only support 64K page size with 4K sector size. This makes compressed readahead less effective, as maximum compressed extent size is only 128K, 2x the page size. On the other hand, the function add_ra_bio_pages() is still assuming sectorsize == PAGE_SIZE, and code change may affect 4K page size systems. So for now, let's disable subpage compressed readahead for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: subpage: check if there are compressed extents inside one pageQu Wenruo1-0/+14
[BUG] When testing experimental subpage compressed write support, it hits a NULL pointer dereference inside read path: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018 pc : __pi_memcmp+0x28/0x1ec lr : check_data_csum+0xd0/0x274 [btrfs] Call trace: __pi_memcmp+0x28/0x1ec btrfs_verify_data_csum+0xf4/0x244 [btrfs] end_bio_extent_readpage+0x1d0/0x6b0 [btrfs] bio_endio+0x15c/0x1dc end_workqueue_fn+0x44/0x64 [btrfs] btrfs_work_helper+0x74/0x250 [btrfs] process_one_work+0x1d4/0x47c worker_thread+0x180/0x400 kthread+0x11c/0x120 ret_from_fork+0x10/0x30 Code: 54000261 d100044c d343fd8c f8408403 (f8408424) ---[ end trace 9e2c59f33ea40866 ]--- [CAUSE] When reading two compressed extents inside the same page, like the following layout, we trigger above crash: 0 32K 64K |-------|\\\\\\\| | \- Compressed extent (A) \--------- Compressed extent (B) For compressed read, we don't need to populate its io_bio->csum, as we rely on compressed_bio->csum to verify the compressed data, and then copy the decompressed to inode pages. Normally btrfs_verify_data_csum() skip such page by checking and clearing its PageChecked flag But since that flag is still for the full page, when endio for inode page range [0, 32K) gets executed, it clears PageChecked flag for the full page. Then when endio for inode page range [32K, 64K) gets executed, since the page no longer has PageChecked flag, it just continues checking, even though io_bio->csum is NULL. [FIX] Thankfully there are only two users of PageChecked bit: - Cow fixup Since subpage has its own way to trace page dirty (dirty_bitmap) and ordered bit (ordered_bitmap), it should never trigger cow fixup. - Compressed read We can distinguish such read by just checking io_bio->csum. So just check io_bio->csum before doing the verification to avoid such NULL pointer dereference. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: reset this_bio_flag to avoid inheriting old flagsQu Wenruo1-1/+1
In btrfs_do_readpage(), we never reset @this_bio_flag after we hit a compressed extent. This is fine, as for PAGE_SIZE == sectorsize case, we can only have one sector for one page, thus @this_bio_flag will only be set at most once. But for subpage case, after hitting a compressed extent, @this_bio_flag will always have EXTENT_BIO_COMPRESSED bit, even we're reading a regular extent. This will lead to various read errors, and causing new ASSERT() in incoming subpage patches, which adds more strict check in btrfs_submit_compressed_read(). Fix it by declaring @this_bio_flag inside the main loop and reset its value for each iteration. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: constify and cleanup variables in comparatorsDavid Sterba5-16/+15
Comparators just read the data and thus get const parameters. This should be also preserved by the local variables, update all comparators passed to sort or bsearch. Cleanups: - unnecessary casts are dropped - btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern and 'inline' is dropped as the function address is taken Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: simplify data stripe calculation helpersDavid Sterba1-13/+2
There are two helpers doing the same calculations based on nparity and ncopies. calc_data_stripes can be simplified into one expression, so far we don't have profile with both copies and parity, so there's no effective change. calc_stripe_length should reuse the helper and not repeat the same calculation. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: merge alloc_device helpersDavid Sterba1-41/+25
The device allocation is split to two functions, but one just calls the other and they're very far in the file. Merge them together. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: uninline btrfs_bg_flags_to_raid_indexDavid Sterba2-26/+27
The helper does a simple translation from block group flags to index to the btrfs_raid_array table. There's no apparent reason to inline the function, the translation happens usually once per function and is not called in a loop. Making it a proper function saves quite some binary code (x86_64, release config): text data bss dec hex filename 1164011 19253 14912 1198176 124860 pre/btrfs.ko 1161559 19253 14912 1195724 123ecc post/btrfs.ko DELTA: -2451 Also add the const attribute as there are no side effects, this could help compiler to optimize a few things without the function body. Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: tree-checker: add missing stripe checks for raid1c3/4 profilesDavid Sterba1-0/+4
The stripe checks for raid1c3/raid1c4 are missing in the sequence in btrfs_check_chunk_valid. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: tree-checker: use table values for stripe checksDavid Sterba1-6/+11
There are hardcoded values in several checks regarding chunks and stripe constraints. We have that defined in the raid table and ought to use it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: make btrfs_next_leaf static inlineDavid Sterba2-11/+12
btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it to header to avoid the function call overhead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: remove uptodate parameter from btrfs_dec_test_first_ordered_pendingDavid Sterba3-6/+3
In commit e65f152e4348 ("btrfs: refactor how we finish ordered extent io for endio functions") there was last caller not using 1 for the uptodate parameter. Now there's only one, passing 1, so we can remove it and simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_orderedDavid Sterba3-7/+7
The uptodate parameter should be bool, change the type. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: remove unused start and end parameters from btrfs_run_delalloc_range()Qu Wenruo3-6/+5
Since commit d75855b4518b ("btrfs: Remove extent_io_ops::writepage_start_hook") removes the writepage_start_hook() and adds btrfs_writepage_cow_fixup() function, there is no need to follow the old hook parameters. Remove the @start and @end hook, since currently the fixup check is full page check, it doesn't need @start and @end hook. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: introduce btrfs_lookup_match_dirMarcos Paulo de Souza1-37/+39
btrfs_search_slot is called in multiple places in dir-item.c to search for a dir entry, and then calling btrfs_match_dir_name to return a btrfs_dir_item. In order to reduce the number of callers of btrfs_search_slot, create a common function that looks for the dir key, and if found call btrfs_match_dir_item_name. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: remove unneeded return variable in btrfs_lookup_file_extentMarcos Paulo de Souza1-3/+2
We can return from btrfs_search_slot directly which also shows that it follows the same return value convention. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: use btrfs_next_leaf instead of btrfs_next_item when slots > nritemsMarcos Paulo de Souza2-2/+2
After calling btrfs_search_slot is a common practice to check if the slot found isn't bigger than number of slots in the current leaf, and if so, search for the same key in the next leaf by calling btrfs_next_leaf, which calls btrfs_next_old_leaf to do the job. Calling btrfs_next_item in the same situation would end up in the same code flow, since * btrfs_next_item * btrfs_next_old_item * if slot >= nritems(curr_leaf) btrfs_next_old_leaf Change btrfs_verify_dev_extents and calculate_emulated_zone_size functions to use btrfs_next_leaf in the same situation. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: remove ignore_offset argument from btrfs_find_all_roots()Filipe Manana4-27/+17
Currently all the callers of btrfs_find_all_roots() pass a value of false for its ignore_offset argument. This makes the argument pointless and we can remove it and make btrfs_find_all_roots() always pass false as the ignore_offset argument for btrfs_find_all_roots_safe(). So just do that. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: avoid unnecessary lock and leaf splits when updating inode in the logFilipe Manana1-5/+34
During a fast fsync, if we have already fsynced the file before and in the current transaction, we can make the inode item update more efficient and avoid acquiring a write lock on the leaf's parent. To update the inode item we are always using btrfs_insert_empty_item() to get a path pointing to the inode item, which calls btrfs_search_slot() with an "ins_len" argument of 'sizeof(struct btrfs_inode_item) + sizeof(struct btrfs_item)', and that always results in the search taking a write lock on the level 1 node that is the parent of the leaf that contains the inode item. This adds unnecessary lock contention on log trees when we have multiple fsyncs in parallel against inodes in the same subvolume, which has a very significant impact due to the fact that log trees are short lived and their height very rarely goes beyond level 2. Also, by using btrfs_insert_empty_item() when we need to update the inode item, we also end up splitting the leaf of the existing inode item when the leaf has an amount of free space smaller than the size of an inode item. Improve this by using btrfs_seach_slot(), with a 0 "ins_len" argument, when we know the inode item already exists in the log. This avoids these two inefficiencies. The following script, using fio, was used to perform the tests: $ cat fio-test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-d single -m single" if [ $# -ne 4 ]; then echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE" exit 1 fi NUM_JOBS=$1 FILE_SIZE=$2 FSYNC_FREQ=$3 BLOCK_SIZE=$4 cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=$FSYNC_FREQ fallocate=none group_reporting=1 direct=0 bs=$BLOCK_SIZE ioengine=sync size=$FILE_SIZE directory=$MNT numjobs=$NUM_JOBS EOF echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo echo "mount options: $MOUNT_OPTIONS" echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The tests were done on a physical machine, with 12 cores, 64G of RAM, using a NVMEe device and using a non-debug kernel config (the default one from Debian). The summary line from fio is provided below for each test run. With 8 jobs, file size 256M, fsync frequency of 4 and a block size of 4K: Before: WRITE: bw=28.3MiB/s (29.7MB/s), 28.3MiB/s-28.3MiB/s (29.7MB/s-29.7MB/s), io=2048MiB (2147MB), run=72297-72297msec After: WRITE: bw=28.7MiB/s (30.1MB/s), 28.7MiB/s-28.7MiB/s (30.1MB/s-30.1MB/s), io=2048MiB (2147MB), run=71411-71411msec +1.4% throughput, -1.2% runtime With 16 jobs, file size 256M, fsync frequency of 4 and a block size of 4K: Before: WRITE: bw=40.0MiB/s (42.0MB/s), 40.0MiB/s-40.0MiB/s (42.0MB/s-42.0MB/s), io=4096MiB (4295MB), run=99980-99980msec After: WRITE: bw=40.9MiB/s (42.9MB/s), 40.9MiB/s-40.9MiB/s (42.9MB/s-42.9MB/s), io=4096MiB (4295MB), run=97933-97933msec +2.2% throughput, -2.1% runtime The changes are small but it's possible to be better on faster hardware as in the test machine used disk utilization was pretty much 100% during the whole time the tests were running (observed with 'iostat -xz 1'). The tests also included the previous patch with the subject of: "btrfs: avoid unnecessary log mutex contention when syncing log". So they compared a branch without that patch and without this patch versus a branch with these two patches applied. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: remove unnecessary list head initialization when syncing logFilipe Manana1-2/+0
One of the last steps of syncing the log is to remove all log contexts from the root's list of contexts, done at btrfs_remove_all_log_ctxs(). There we iterate over all the contexts in the list and delete each one from the list, and after that we call INIT_LIST_HEAD() on the list. That is unnecessary since at that point the list is empty. So just remove the INIT_LIST_HEAD() call. It's not needed, increases code size (bloat-o-meter reported a delta of -122 for btrfs_sync_log() after this change) and increases two critical sections delimited by log mutexes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: avoid unnecessary log mutex contention when syncing logFilipe Manana1-4/+10
When syncing the log we acquire the root's log mutex just to update the root's last_log_commit. This is unnecessary because: 1) At this point there can only be one task updating this value, which is the task committing the current log transaction. Any task that enters btrfs_sync_log() has to wait for the previous log transaction to commit and wait for the current log transaction to commit if someone else already started it (in this case it never reaches to the point of updating last_log_commit, as that is done by the committing task); 2) All readers of the root's last_log_commit don't acquire the root's log mutex. This is to avoid blocking the readers, potentially for too long and because getting a stale value of last_log_commit does not cause any functional problem, in the worst case getting a stale value results in logging an inode unnecessarily. Plus it's actually very rare to get a stale value that results in unnecessarily logging the inode. So in order to avoid unnecessary contention on the root's log mutex, which is used for several different purposes, like starting/joining a log transaction and starting writeback of a log transaction, stop acquiring the log mutex for updating the root's last_log_commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: remove racy and unnecessary inode transaction update when using no-holesFilipe Manana1-7/+5
When using the NO_HOLES feature and expanding the size of an inode, we update the inode's last_trans, last_sub_trans and last_log_commit fields at maybe_insert_hole() so that a fsync does know that the inode needs to be logged (by making sure that btrfs_inode_in_log() returns false). This happens for expanding truncate operations, buffered writes, direct IO writes and when cloning extents to an offset greater than the inode's i_size. However the way we do it is racy, because in between setting the inode's last_sub_trans and last_log_commit fields, the log transaction ID that was assigned to last_sub_trans might be committed before we read the root's last_log_commit and assign that value to last_log_commit. If that happens it would make a future call to btrfs_inode_in_log() return true. This is a race that should be extremely unlikely to be hit in practice, and it is the same that was described by commit bc0939fcfab0d7 ("btrfs: fix race between marking inode needs to be logged and log syncing"). The fix would simply be to set last_log_commit to the value we assigned to last_sub_trans minus 1, like it was done in that commit. However updating these two fields plus the last_trans field is pointless here because all the callers of btrfs_cont_expand() (which is the only caller of maybe_insert_hole()) always call btrfs_set_inode_last_trans() or btrfs_update_inode() after calling btrfs_cont_expand(). Calling either btrfs_set_inode_last_trans() or btrfs_update_inode() guarantees that the next fsync will log the inode, as it makes btrfs_inode_in_log() return false. So just remove the code that explicitly sets the inode's last_trans, last_sub_trans and last_log_commit fields. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: stop doing GFP_KERNEL memory allocations in the ref verify toolFilipe Manana2-17/+5
In commit 351cbf6e4410e7 ("btrfs: use nofs allocations for running delayed items") we wrapped all btree updates when running delayed items with memalloc_nofs_save() and memalloc_nofs_restore(), due to a lock inversion detected by lockdep involving reclaim and the mutex of delayed nodes. The problem is because the ref verify tool does some memory allocations with GFP_KERNEL, which can trigger reclaim and reclaim can trigger inode eviction, which requires locking the mutex of an inode's delayed node. On the other hand the ref verify tool is called when allocating metadata extents as part of operations that modify a btree, which is a problem when running delayed nodes, where we do btree updates while holding the mutex of a delayed node. This is what caused the lockdep warning. Instead of wrapping every btree update when running delayed nodes, change the ref verify tool to never do GFP_KERNEL allocations, because: 1) We get less repeated code, which at the moment does not even have a comment mentioning why we need to setup the NOFS context, which is a recommended good practice as mentioned at Documentation/core-api/gfp_mask-from-fs-io.rst 2) The ref verify tool is something meant only for debugging and not something that should be enabled on non-debug / non-development kernels; 3) We may have yet more places outside delayed-inode.c where we have similar problem: doing btree updates while holding some lock and then having the GFP_KERNEL memory allocations, from the ref verify tool, trigger reclaim and trying again to acquire the same lock through the reclaim path. Or we could get more such cases in the future, therefore this change prevents getting into similar cases when using the ref verify tool. Curiously most of the memory allocations done by the ref verify tool were already using GFP_NOFS, except a few ones for no apparent reason. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: improve the batch insertion of delayed itemsFilipe Manana1-133/+79
When we insert the delayed items of an inode, which corresponds to the directory index keys for a directory (key type BTRFS_DIR_INDEX_KEY), we do the following: 1) Pick the first delayed item from the rbtree and insert it into the fs/subvolume btree, using btrfs_insert_empty_item() for that; 2) Without releasing the path returned by btrfs_insert_empty_item(), keep collecting as many consecutive delayed items from the rbtree as possible, as long as each one's BTRFS_DIR_INDEX_KEY key is the immediate successor of the previously picked item and as long as they fit in the available space of the leaf the path points to; 3) Then insert all the collected items into the leaf; 4) Release the reserve metadata space for each collected item and release each item (implies deleting from the rbtree); 5) Unlock the path. While this is much better than inserting items one by one, it can be improved in a few aspects: 1) Instead of adding items based on the remaining free space of the leaf, collect as many items that can fit in a leaf and bulk insert them. This results in less and larger batches, reducing the total amount of time to insert the delayed items. For example when adding 100K files to a directory, we ended up creating 1658 batches with very variable sizes ranging from 1 item to 118 items, on a filesystem with a node/leaf size of 16K. After this change, we end up with 839 batches, with the vast majority of them having exactly 120 items; 2) We do the search for more items to batch, by iterating the rbtree, while holding a write lock on the leaf; 3) While still holding the leaf locked, we are releasing the reserved metadata for each item and then deleting each item, keeping a write lock on the leaf for longer than necessary. Releasing the delayed items one by one can take a significant amount of time, because deleting them from the rbtree can often be a bit slow when the deletion results in rebalancing the rbtree. So change this so that we try to create larger batches, with a total item size up to the maximum a leaf can support, and by unlocking the leaf immediately after inserting the items, releasing the reserved metadata space of each item and releasing each item without holding the write lock on the leaf. The following script that runs fs_mark was used to test this change: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-m single -d single" FILES=1000000 THREADS=16 FILE_SIZE=0 echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t 16" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT It was run on machine with 12 cores, 64G of ram, using a NVMe device and using a non-debug kernel config (Debian's default config). Results before this change: FSUse% Count Size Files/sec App Overhead 1 16000000 0 76182.1 72223046 3 32000000 0 62746.9 80776528 5 48000000 0 77029.0 93022381 6 64000000 0 73691.6 95251075 8 80000000 0 66288.0 85089634 Results after this change: FSUse% Count Size Files/sec App Overhead 1 16000000 0 79049.5 (+3.7%) 69700824 3 32000000 0 65248.9 (+3.9%) 80583693 5 48000000 0 77991.4 (+1.2%) 90040908 6 64000000 0 75096.8 (+1.9%) 89862241 8 80000000 0 66926.8 (+1.0%) 84429169 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: rescue: allow ibadroots to skip bad extent tree when reading block ↵Qu Wenruo1-0/+19
group items When extent tree gets corrupted, normally it's not extent tree root, but one toasted tree leaf/node. In that case, rescue=ibadroots mount option won't help as it can only handle the extent tree root corruption. This patch will enhance the behavior by: - Allow fill_dummy_bgs() to ignore -EEXIST error This means we may have some block group items read from disk, but then hit some error halfway. - Fallback to fill_dummy_bgs() if any error gets hit in btrfs_read_block_groups() Of course, this still needs rescue=ibadroots mount option. With that, rescue=ibadroots can handle extent tree corruption more gracefully and allow a better recover chance. Reported-by: Zhenyu Wu <wuzy001@gmail.com> Link: https://www.spinics.net/lists/linux-btrfs/msg114424.html Reviewed-by: Su Yue <l@damenly.su> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: pass NULL as trans to btrfs_search_slot if we only want to searchMarcos Paulo de Souza2-2/+2
Using a transaction in btrfs_search_slot is only useful when we are searching to add or modify the tree. When the function is used for searching, insert length and mod arguments are 0, there is no need to use a transaction. No functional changes, changing for consistency. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: continue readahead of siblings even if target node is in memoryFilipe Manana1-5/+8
At reada_for_search(), when attempting to readahead a node or leaf's siblings, we skip the readahead of the siblings if the node/leaf is already in memory. That is probably fine for the READA_FORWARD and READA_BACK readahead types, as they are used on contexts where we end up reading some consecutive leaves, but usually not the whole btree. However for a READA_FORWARD_ALWAYS mode, currently only used for full send operations, it does not make sense to skip the readahead if the target node or leaf is already loaded in memory, since we know the caller is visiting every node and leaf of the btree in ascending order. So change the behaviour to not skip the readahead when the target node is already in memory and the readahead mode is READA_FORWARD_ALWAYS. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk, with 32GiB of RAM and using a non-debug kernel config (Debian's default config). $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)" } file_count=2000000 add_files $file_count 0 4 echo echo "Creating snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT The duration of the full send operations, in seconds, were the following: Before this change: 85 seconds After this change: 76 seconds (-11.2%) Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: check-integrity: drop kmap/kunmap for block pagesDavid Sterba1-8/+3
The pages in block_ctx have never been allocated from highmem (in btrfsic_read_block) so the mapping is pointless and can be removed. Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: compression: drop kmap/kunmap from generic helpersDavid Sterba2-4/+2
The pages in compressed_pages are not from highmem anymore so we can drop the mapping for checksum calculation and inline extent. Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: compression: drop kmap/kunmap from zstdDavid Sterba1-18/+9
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is simply page_address and kunmap is a no-op. Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: compression: drop kmap/kunmap from zlibDavid Sterba1-25/+11
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is simply page_address and kunmap is a no-op. Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: compression: drop kmap/kunmap from lzoDavid Sterba1-28/+10
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is simply page_address and kunmap is a no-op. Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: drop from __GFP_HIGHMEM all allocationsDavid Sterba5-15/+14
The highmem flag is used for allocating pages for compression and for raid56 pages. The high memory makes sense on 32bit systems but is not without problems. On 64bit system's it's just another layer of wrappers. The time the pages are allocated for compression or raid56 is relatively short (about a transaction commit), so the pages are not blocked indefinitely. As the number of pages depends on the amount of data being written/read, there's a theoretical problem. A fast device on a 32bit system could use most of the low memory pool, while with the highmem allocation that would not happen. This was possibly the original idea long time ago, but nowadays we optimize for 64bit systems. This patch removes all usage of the __GFP_HIGHMEM flag for page allocation, the kmap/kunmap are still in place and will be removed in followup patches. Remaining is masking out the bit in alloc_extent_state and __lookup_free_space_inode, that can safely stay. Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: cleanup fs_devices pointer usage in btrfs_trim_fsAnand Jain1-5/+5
Drop variable 'devices' (used only once) and add new variable for the fs_devices, so it is used at two locations within btrfs_trim_fs() function and also helps to access fs_devices->devices. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: remove max argument from generic_bin_searchMarcos Paulo de Souza1-11/+7
Both callers use btrfs_header_nritems to feed the max argument. Remove the argument and let generic_bin_search call it itself. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: make btrfs_finish_chunk_alloc private to block-group.cNikolay Borisov3-96/+91
One of the final things that must be done to add a new chunk is inserting its device extent items in the device tree. They describe the portion of allocated device physical space during phase 1 of chunk allocation. This is currently done in btrfs_finish_chunk_alloc whose name isn't very informative. What's more, this function is only used in block-group.c but is defined as public. There isn't anything special about it that would warrant it being defined in volumes.c. Just move btrfs_finish_chunk_alloc and alloc_chunk_dev_extent to block-group.c, make the former static and rename both functions to insert_dev_extents and insert_dev_extent respectively. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: check-integrity: drop unnecessary function prototypesAnand Jain1-49/+0
The function prototypes below aren't necessary as the functions are first defined before called. Remove them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: add special case to setget helpers for 64k pagesDavid Sterba1-4/+4
On 64K pages the size of the extent_buffer::pages array is 1 and compilation with -Warray-bounds warns due to kaddr = page_address(eb->pages[idx + 1]); when reading byte range crossing page boundary. This does never actually overflow the array because on 64K because all the data fit in one page and bounds are checked by check_setget_bounds. To fix the reported overflows and warnings add a compile-time condition that will allow compiler to eliminate the dead code that reads from the idx + 1 page. Link: https://lore.kernel.org/lkml/20210623083901.1d49d19d@canb.auug.org.au/ CC: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: zoned: remove max_zone_append_size logicJohannes Thumshirn4-24/+0
There used to be a patch in the original series for zoned support which limited the extent size to max_zone_append_size, but this patch has been dropped somewhere around v9. We've decided to go the opposite direction, instead of limiting extents in the first place we split them before submission to comply with the device's limits. Remove the related code, btrfs_fs_info::max_zone_append_size and btrfs_zoned_device_info::max_zone_append_size. This also removes the workaround for dm-crypt introduced in 1d68128c107a ("btrfs: zoned: fail mount if the device does not support zone append") because the fix has been merged as f34ee1dce642 ("dm crypt: Fix zoned block device support"). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-18Merge tag 'for-5.14-rc6-tag' of ↵Linus Torvalds1-2/+8
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more fix for cross-rename, adding a missing check for directory and subvolume, this could lead to a crash" * tag 'for-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: prevent rename2 from exchanging a subvol with a directory from different parents
2021-08-16btrfs: prevent rename2 from exchanging a subvol with a directory from ↵NeilBrown1-2/+8
different parents Cross-rename lacks a check when that would prevent exchanging a directory and subvolume from different parent subvolume. This causes data inconsistencies and is caught before commit by tree-checker, turning the filesystem to read-only. Calling the renameat2 with RENAME_EXCHANGE flags like renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1)) on two paths: namesrc = dir1/subvol1/dir2 namedest = subvol2/subvol3 will cause key order problem with following write time tree-checker report: [1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258 [1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5 [1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561 [1194842.338772] item 0 key (256 1 0) itemoff 16123 itemsize 160 [1194842.338793] inode generation 3 size 16 mode 40755 [1194842.338801] item 1 key (256 12 256) itemoff 16111 itemsize 12 [1194842.338809] item 2 key (256 84 2248503653) itemoff 16077 itemsize 34 [1194842.338817] dir oid 258 type 2 [1194842.338823] item 3 key (256 84 2363071922) itemoff 16043 itemsize 34 [1194842.338830] dir oid 257 type 2 [1194842.338836] item 4 key (256 96 2) itemoff 16009 itemsize 34 [1194842.338843] item 5 key (256 96 3) itemoff 15975 itemsize 34 [1194842.338852] item 6 key (257 1 0) itemoff 15815 itemsize 160 [1194842.338863] inode generation 6 size 8 mode 40755 [1194842.338869] item 7 key (257 12 256) itemoff 15801 itemsize 14 [1194842.338876] item 8 key (257 84 2505409169) itemoff 15767 itemsize 34 [1194842.338883] dir oid 256 type 2 [1194842.338888] item 9 key (257 96 2) itemoff 15733 itemsize 34 [1194842.338895] item 10 key (258 12 256) itemoff 15719 itemsize 14 [1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected [1194842.339245] ------------[ cut here ]------------ [1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs] [1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793 [1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 [1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs] [1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs] [1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282 [1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978 [1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970 [1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003 [1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000 [1194842.512092] FS: 0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000 [1194842.512095] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0 [1194842.512103] Call Trace: [1194842.512128] ? run_one_async_free+0x10/0x10 [btrfs] [1194842.631729] btree_csum_one_bio+0x1ac/0x1d0 [btrfs] [1194842.631837] run_one_async_start+0x18/0x30 [btrfs] [1194842.631938] btrfs_work_helper+0xd5/0x1d0 [btrfs] [1194842.647482] process_one_work+0x262/0x5e0 [1194842.647520] worker_thread+0x4c/0x320 [1194842.655935] ? process_one_work+0x5e0/0x5e0 [1194842.655946] kthread+0x135/0x160 [1194842.655953] ? set_kthread_struct+0x40/0x40 [1194842.655965] ret_from_fork+0x1f/0x30 [1194842.672465] irq event stamp: 1729 [1194842.672469] hardirqs last enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0 [1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0 [1194842.672482] softirqs last enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a [1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0 The corrupted data will not be written, and filesystem can be unmounted and mounted again (all changes since the last commit will be lost). Add the missing check for new_ino so that all non-subvolumes must reside under the same parent subvolume. There's an exception allowing to exchange two subvolumes from any parents as the directory representing a subvolume is only a logical link and does not have any other structures related to the parent subvolume, unlike files, directories etc, that are always in the inode namespace of the parent subvolume. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.7+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-30Merge tag 'for-5.14-rc3-tag' of ↵Linus Torvalds4-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix -Warray-bounds warning, to help external patchset to make it default treewide - fix writeable device accounting (syzbot report) - fix fsync and log replay after a rename and inode eviction - fix potentially lost error code when submitting multiple bios for compressed range * tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: calculate number of eb pages properly in csum_tree_block btrfs: fix rw device counting in __btrfs_free_extra_devids btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction btrfs: mark compressed range uptodate only if all bio succeed
2021-07-29btrfs: calculate number of eb pages properly in csum_tree_blockDavid Sterba1-1/+1
Building with -Warray-bounds on systems with 64K pages there's a warning: fs/btrfs/disk-io.c: In function ‘csum_tree_block’: fs/btrfs/disk-io.c:226:34: warning: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Warray-bounds] 226 | kaddr = page_address(buf->pages[i]); | ~~~~~~~~~~^~~ ./include/linux/mm.h:1630:48: note: in definition of macro ‘page_address’ 1630 | #define page_address(page) lowmem_page_address(page) | ^~~~ In file included from fs/btrfs/ctree.h:32, from fs/btrfs/disk-io.c:23: fs/btrfs/extent_io.h:98:15: note: while referencing ‘pages’ 98 | struct page *pages[1]; | ^~~~~ The compiler has no way to know that in that case the nodesize is exactly PAGE_SIZE, so the resulting number of pages will be correct (1). Let's use num_extent_pages that makes the case nodesize == PAGE_SIZE explicitly 1. Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-28btrfs: fix rw device counting in __btrfs_free_extra_devidsDesmond Cheong Zhi Xi1-0/+1
When removing a writeable device in __btrfs_free_extra_devids, the rw device count should be decremented. This error was caught by Syzbot which reported a warning in close_fs_devices: WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168 Modules linked in: CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168 RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293 RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000 RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128 R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000 R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a FS: 00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180 open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693 btrfs_fill_super fs/btrfs/super.c:1382 [inline] btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x86/0x270 fs/super.c:1498 fc_mount fs/namespace.c:993 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023 btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x86/0x270 fs/super.c:1498 do_new_mount fs/namespace.c:2905 [inline] path_mount+0x196f/0x2be0 fs/namespace.c:3235 do_mount fs/namespace.c:3248 [inline] __do_sys_mount fs/namespace.c:3456 [inline] __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433 do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae Because fs_devices->rw_devices was not 0 after closing all devices. Here is the call trace that was observed: btrfs_mount_root(): btrfs_scan_one_device(): device_list_add(); <---------------- device added btrfs_open_devices(): open_fs_devices(): btrfs_open_one_device(); <-------- writable device opened, rw device count ++ btrfs_fill_super(): open_ctree(): btrfs_free_extra_devids(): __btrfs_free_extra_devids(); <--- writable device removed, rw device count not decremented fail_tree_roots: btrfs_close_devices(): close_fs_devices(); <------- rw device count off by 1 As a note, prior to commit cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device"), rw_devices was decremented on removing a writable device in __btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit was not set for the device. However, this check does not need to be reinstated as it is now redundant and incorrect. In __btrfs_free_extra_devids, we skip removing the device if it is the target for replacement. This is done by checking whether device->devid == BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check, and so it's redundant to test for that bit. Additionally, following commit 82372bc816d7 ("Btrfs: make the logic of source device removing more clear"), rw_devices is incremented whenever a writeable device is added to the alloc list (including the target device in btrfs_dev_replace_finishing), so all removals of writable devices from the alloc list should also be accompanied by a decrement to rw_devices. Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device") CC: stable@vger.kernel.org # 5.10+ Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-28btrfs: fix lost inode on log replay after mix of fsync, rename and inode ↵Filipe Manana1-2/+2
eviction When checking if we need to log the new name of a renamed inode, we are checking if the inode and its parent inode have been logged before, and if not we don't log the new name. The check however is buggy, as it directly compares the logged_trans field of the inodes versus the ID of the current transaction. The problem is that logged_trans is a transient field, only stored in memory and never persisted in the inode item, so if an inode was logged before, evicted and reloaded, its logged_trans field is set to a value of 0, meaning the check will return false and the new name of the renamed inode is not logged. If the old parent directory was previously fsynced and we deleted the logged directory entries corresponding to the old name, we end up with a log that when replayed will delete the renamed inode. The following example triggers the problem: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/A $ mkdir /mnt/B $ echo -n "hello world" > /mnt/A/foo $ sync # Add some new file to A and fsync directory A. $ touch /mnt/A/bar $ xfs_io -c "fsync" /mnt/A # Now trigger inode eviction. We are only interested in triggering # eviction for the inode of directory A. $ echo 2 > /proc/sys/vm/drop_caches # Move foo from directory A to directory B. # This deletes the directory entries for foo in A from the log, and # does not add the new name for foo in directory B to the log, because # logged_trans of A is 0, which is less than the current transaction ID. $ mv /mnt/A/foo /mnt/B/foo # Now make an fsync to anything except A, B or any file inside them, # like for example create a file at the root directory and fsync this # new file. This syncs the log that contains all the changes done by # previous rename operation. $ touch /mnt/baz $ xfs_io -c "fsync" /mnt/baz <power fail> # Mount the filesystem and replay the log. $ mount /dev/sdc /mnt # Check the filesystem content. $ ls -1R /mnt /mnt/: A B baz /mnt/A: bar /mnt/B: $ # File foo is gone, it's neither in A/ nor in B/. Fix this by using the inode_logged() helper at btrfs_log_new_name(), which safely checks if an inode was logged before in the current transaction. A test case for fstests will follow soon. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-28btrfs: mark compressed range uptodate only if all bio succeedGoldwyn Rodrigues1-1/+1
In compression write endio sequence, the range which the compressed_bio writes is marked as uptodate if the last bio of the compressed (sub)bios is completed successfully. There could be previous bio which may have failed which is recorded in cb->errors. Set the writeback range as uptodate only if cb->errors is zero, as opposed to checking only the last bio's status. Backporting notes: in all versions up to 4.4 the last argument is always replaced by "!cb->errors". CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>