summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2022-01-03btrfs: do not special case the extent root for switch commit rootsJosef Bacik1-6/+2
This is a leftover from when we used to independently swap the extent root's commit root and the fs tree commit roots. At the time I simply changed the helper to a list_add. There's actually no reason to not add the extent root to the switch commit root at this point, we don't care about the order we do the switching since it's all done under the commit_root_sem. If we re-mark the extent root dirty after adding it to the switch_commits list we'll see that BTRFS_ROOT_DIRTY isn't set and then list_move it back onto the dirty list, and then we'll redo the tree update and everything will be ok. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: use chunk_root in find_free_extent_update_loopJosef Bacik1-1/+1
We're only using this to start the transaction with to possibly allocate a chunk. It doesn't really matter which root to use, but with extent tree v2 we'll need a bytenr to look up a extent root which makes the usage of the extent_root awkward here. Simply change it to the chunk_root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: make remove_extent_backref pass the rootJosef Bacik1-8/+10
With extent tree v2 we'll have a different extent root based on where the bytenr is located, so adjust the remove_extent_backref() helper and it's helpers to pass the extent_root around. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: add a btrfs_block_group_root() helperJosef Bacik2-9/+15
With extent tree v2 we will have a separate root to hold the block group items. Add a btrfs_block_group_root() that will return the appropriate root given the flags of the fs, and convert all functions that need to modify block group items to use the helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove BUG_ON(!eie) in find_parent_nodesJosef Bacik1-3/+11
If we're looking for leafs that point to a data extent we want to record the extent items that point at our bytenr. At this point we have the reference and we know for a fact that this leaf should have a reference to our bytenr. However if there's some sort of corruption we may not find any references to our leaf, and thus could end up with eie == NULL. Replace this BUG_ON() with an ASSERT() and then return -EUCLEAN for the mortals. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove BUG_ON() in find_parent_nodes()Josef Bacik1-1/+6
We search for an extent entry with .offset = -1, which shouldn't be a thing, but corruption happens. Add an ASSERT() for the developers, return -EUCLEAN for mortals. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove SANITY_TESTS check form find_parent_nodesJosef Bacik1-4/+0
We define __TRANS_DUMMY always, so this extra ifdef stuff is not needed. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: move comment in find_parent_nodes()Josef Bacik1-7/+4
This comment was much closer to the related code when it was originally added, but has slowly migrated north far from its ancestral lands. Move it back down with its people. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: pass the root to add_keyed_refsJosef Bacik1-4/+5
We pass in the path, but use btrfs_next_item() using the root we searched with. Pass the root down to add_keyed_refs() instead of the fs_info so we can continue to use the same root we searched with. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove trans_handle->rootJosef Bacik2-2/+0
Nobody is using this anymore, remove it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: pass fs_info to trace_btrfs_transaction_commitJosef Bacik2-3/+3
The root on the trans->root can be anything, and generally we're committing from the transaction kthread so it's usually the tree_root. Change this to just take an fs_info, and to maintain compatibility simply put the ROOT_TREE_OBJECTID as the root objectid for the tracepoint. This will allow use to remove trans->root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: rework async transaction committingJosef Bacik5-61/+18
Currently we do this awful thing where we get another ref on a trans handle, async off that handle and commit the transaction from that work. Because we do this we have to mess with current->journal_info and the freeze counting stuff. We already have an async thing to kick for the transaction commit, the transaction kthread. Replace this work struct with a flag on the fs_info to tell the kthread to go ahead and commit even if it's before our timeout. Then we can drastically simplify the async transaction commit path. Note: this can be simplified and functionality based on the pending operation COMMIT. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add note ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove unused BTRFS_FS_BARRIER flagJosef Bacik2-2/+0
This is no longer used, the -o nobarrier is handled by BTRFS_MOUNT_NOBARRIER. Remove the flag. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: eliminate if in main loop in tree_search_offsetNikolay Borisov1-7/+4
Reshuffle the code inside the first loop of tree_search_offset so that one if() is eliminated and the becomes more linear. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: don't check stripe length if the profile is not stripe basedQu Wenruo1-1/+6
[BUG] When debugging calc_bio_boundaries(), I found that even for RAID1 metadata, we're following stripe length to calculate stripe boundary. # mkfs.btrfs -m raid1 -d raid1 /dev/test/scratch[12] # mount /dev/test/scratch /mnt/btrfs # xfs_io -f -c "pwrite 0 64K" /mnt/btrfs/file # umount Above very basic operations will make calc_bio_boundaries() to report the following result: submit_extent_page: r/i=1/1 file_offset=22036480 len_to_stripe_boundary=49152 submit_extent_page: r/i=1/1 file_offset=30474240 len_to_stripe_boundary=65536 ... submit_extent_page: r/i=1/1 file_offset=30523392 len_to_stripe_boundary=16384 submit_extent_page: r/i=1/1 file_offset=30457856 len_to_stripe_boundary=16384 submit_extent_page: r/i=5/257 file_offset=0 len_to_stripe_boundary=65536 submit_extent_page: r/i=5/257 file_offset=65536 len_to_stripe_boundary=65536 submit_extent_page: r/i=1/1 file_offset=30490624 len_to_stripe_boundary=49152 submit_extent_page: r/i=1/1 file_offset=30507008 len_to_stripe_boundary=32768 Where "r/i" is the rootid and inode, 1/1 means they metadata. The remaining names match the member used in kernel. Even all data/metadata are using RAID1, we're still following stripe length. [CAUSE] This behavior is caused by a wrong condition in btrfs_get_io_geometry(): if (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) { /* Fill using stripe_len */ len = min_t(u64, em->len - offset, max_len); } else { len = em->len - offset; } This means, only for SINGLE we will not follow stripe_len. However for profiles like RAID1*, DUP, they don't need to bother stripe_len. This can lead to unnecessary bio split for RAID1*/DUP profiles, and can even be a blockage for future zoned RAID support. [FIX] Introduce one single-use macro, BTRFS_BLOCK_GROUP_STRIPE_MASK, and change the condition to only calculate the length using stripe length for stripe based profiles. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: get next entry in tree_search_offset before doing checksNikolay Borisov1-5/+4
This is a small optimisation since the currently 'entry' is already checked in the if () {} else if {} construct above the loop. In essence the first iteration of the final while loop is redundant. To eliminate this extra check simply get the next entry at the beginning of the loop. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: add self test for bytes_index free space cacheJosef Bacik1-0/+181
I noticed a few corner cases when looking at my bytes_index patch for obvious bugs, so add a bunch of tests to validate proper behavior of the bytes_index tree. A couple of basic tests to make sure it puts things in the correct order, and then more complicated tests to make sure it re-arranges bitmap entries properly and does the right thing when we try to make allocations. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: index free space entries on sizeJosef Bacik2-44/+139
Currently we index free space on offset only, because usually we have a hint from the allocator that we want to honor for locality reasons. However if we fail to use this hint we have to go back to a brute force search through the free space entries to find a large enough extent. With sufficiently fragmented free space this becomes quite expensive, as we have to linearly search all of the free space entries to find if we have a part that's long enough. To fix this add a cached rb tree to index based on free space entry bytes. This will allow us to quickly look up the largest chunk in the free space tree for this block group, and stop searching once we've found an entry that is too small to satisfy our allocation. We simply choose to use this tree if we're searching from the beginning of the block group, as we know we do not care about locality at that point. I wrote an allocator test that creates a 10TiB ram backed null block device and then fallocates random files until the file system is full. I think go through and delete all of the odd files. Then I spawn 8 threads that fallocate 64MiB files (1/2 our extent size cap) until the file system is full again. I use bcc's funclatency to measure the latency of find_free_extent. The baseline results are nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 10356 |**** | 512 -> 1023 : 58242 |************************* | 1024 -> 2047 : 74418 |******************************** | 2048 -> 4095 : 90393 |****************************************| 4096 -> 8191 : 79119 |*********************************** | 8192 -> 16383 : 35614 |*************** | 16384 -> 32767 : 13418 |***** | 32768 -> 65535 : 12811 |***** | 65536 -> 131071 : 17090 |******* | 131072 -> 262143 : 26465 |*********** | 262144 -> 524287 : 40179 |***************** | 524288 -> 1048575 : 55469 |************************ | 1048576 -> 2097151 : 48807 |********************* | 2097152 -> 4194303 : 26744 |*********** | 4194304 -> 8388607 : 35351 |*************** | 8388608 -> 16777215 : 13918 |****** | 16777216 -> 33554431 : 21 | | avg = 908079 nsecs, total: 580889071441 nsecs, count: 639690 And the patch results are nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 6883 |** | 512 -> 1023 : 54346 |********************* | 1024 -> 2047 : 79170 |******************************** | 2048 -> 4095 : 98890 |****************************************| 4096 -> 8191 : 81911 |********************************* | 8192 -> 16383 : 27075 |********** | 16384 -> 32767 : 14668 |***** | 32768 -> 65535 : 13251 |***** | 65536 -> 131071 : 15340 |****** | 131072 -> 262143 : 26715 |********** | 262144 -> 524287 : 43274 |***************** | 524288 -> 1048575 : 53870 |********************* | 1048576 -> 2097151 : 55368 |********************** | 2097152 -> 4194303 : 41036 |**************** | 4194304 -> 8388607 : 24927 |********** | 8388608 -> 16777215 : 33 | | 16777216 -> 33554431 : 9 | | avg = 623599 nsecs, total: 397259314759 nsecs, count: 637042 There's a little variation in the amount of calls done because of timing of the threads with metadata requirements, but the avg, total, and count's are relatively consistent between runs (usually within 2-5% of each other). As you can see here we have around a 30% decrease in average latency with a 30% decrease in overall time spent in find_free_extent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: only use ->max_extent_size if it is set in the bitmapJosef Bacik1-1/+25
While adding self tests for my space index change I was hitting a problem where the space indexed tree wasn't returning the expected ->max_extent_size. This is because we will skip searching any entry that doesn't have ->bytes >= the amount of bytes we want. However we'll still set the max_extent_size based on that entry. The problem is if we don't search the bitmap we won't have ->max_extent_size set properly, so we can't really trust it. This doesn't really result in a problem per-se, it can just result in us not finding contiguous area that may exist. Fix the max_extent_size helper to return ->bytes if ->max_extent_size isn't set, and add a big comment explaining why we're doing this. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove unnecessary @nr_written parametersQu Wenruo1-16/+11
We use @nr_written to record how many pages have been started by btrfs_run_delalloc_range(). Currently there are only two cases that would populate @nr_written: - Inline extent creation - Compressed write But both cases will also set @page_started to one. In fact, in writepage_delalloc() we have the following code, showing that @nr_written is really only utilized for above two cases: /* did the fill delalloc function already unlock and start * the IO? */ if (page_started) { /* * we've unlocked the page, so we can't update * the mapping's writeback index, just update * nr_to_write. */ wbc->nr_to_write -= nr_written; return 1; } But for such cases, writepage_delalloc() will return 1, and exit __extent_writepage() without going through __extent_writepage_io(). Thus this means, inside __extent_writepage_io(), we always get @nr_written as 0. So this patch is going to remove the unnecessary parameter from the following functions: - writepage_delalloc() As @nr_written passed in is always the initial value 0. Although inside that function, we still need a local @nr_written to update wbc->nr_to_write. - __extent_writepage_io() As explained above, @nr_written passed in can only be 0. This also means we can remove one update_nr_written() call. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: change root to fs_info for btrfs_reserve_metadata_bytesJosef Bacik13-30/+30
We used to need the root for btrfs_reserve_metadata_bytes to check the orphan cleanup state, but we no longer need that, we simply need the fs_info. Change btrfs_reserve_metadata_bytes() to use the fs_info, and change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the same as they simply call btrfs_reserve_metadata_bytes() and then manipulate the block_rsv that is being used. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: get rid of root->orphan_cleanup_stateJosef Bacik3-11/+3
Now that we don't care about the stage of the orphan_cleanup_state, simply replace it with a bit on ->state to make sure we don't call the orphan cleanup every time we wander into this root. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove global rsv stealing logic for orphan cleanupJosef Bacik1-7/+0
This is very old code before we were stealing from the global reserve during evict. We have proper ways to steal from the global reserve while we're evicting, so rip out this code as it's no longer necessary. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: make BTRFS_RESERVE_FLUSH_EVICT use the global rsv stealing codeJosef Bacik2-12/+19
I forgot to convert this over when I introduced the global reserve stealing code to the space flushing code. Evict was simply trying to make its reservation and then if it failed it would steal from the global rsv, which is racey because it's outside of the normal ticketing code. Fix this by setting ticket->steal if we are BTRFS_RESERVE_FLUSH_EVICT, and then make the priority flushing path do the steal for us. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: check ticket->steal in steal_from_global_block_rsvJosef Bacik1-2/+4
We're going to use this helper in the priority flushing loop, move this check into the helper to simplify the logic. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: check for priority ticket granting before flushingJosef Bacik1-1/+14
Since we're dropping locks before we enter the priority flushing loops we could have had our ticket granted before we got the space_info->lock. So add this check to avoid doing some extra flushing in the priority flushing cases. The case in priority_reclaim_metadata_space is an optimization. Think we came in to reserve, we didn't have the space, we added our ticket to the list. But at the same time somebody was waiting on the space_info lock to add space and do btrfs_try_granting_ticket(), so we drop the lock, get satisfied, come in to do our loop, and we have been satisfied. This is the priority reclaim path, so to_reclaim could be !0 still because we may have only satisfied the priority tickets and still left non priority tickets on the list. We would then have to_reclaim but ->bytes == 0. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add note about the optimization ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: handle priority ticket failures in their respective helpersJosef Bacik1-25/+21
Currently the error case for the priority tickets is handled where we deal with all of the tickets, priority and non-priority. This is OK in general, but it makes for some awkward locking. We take and drop the space_info->lock back to back because of these different types of tickets. Rework the code to handle priority ticket failures in their respective helpers. This allows us to be less wonky with our space_info->lock usage, and means that the main handler simply has to check ticket->error, as the ticket is guaranteed to be off any list and completely handled by the time it exits one of the handlers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: zoned: cache reported zone during mountNaohiro Aota5-13/+87
When mounting a device, we are reporting the zones twice: once for checking the zone attributes in btrfs_get_dev_zone_info and once for loading block groups' zone info in btrfs_load_block_group_zone_info(). With a lot of block groups, that leads to a lot of REPORT ZONE commands and slows down the mount process. This patch introduces a zone info cache in struct btrfs_zoned_device_info. The cache is populated while in btrfs_get_dev_zone_info() and used for btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE commands. The zone cache is then released after loading the block groups, as it will not be much effective during the run time. Benchmark: Mount an HDD with 57,007 block groups Before patch: 171.368 seconds After patch: 64.064 seconds While it still takes a minute due to the slowness of loading all the block groups, the patch reduces the mount time by 1/3. Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/ Fixes: 5b316468983d ("btrfs: get zone information of zoned block devices") CC: stable@vger.kernel.org Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove unused parameter fs_devices from btrfs_init_workqueuesSu Yue1-3/+2
Since commit ba8a9d079543 ("Btrfs: delete the entire async bio submission framework") removed submit workqueues, the parameter fs_devices is not used anymore. Remove it, no functional changes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: reduce the scope of the tree log mutex during transaction commitFilipe Manana1-34/+42
In the transaction commit path we are acquiring the tree log mutex too early and we have a stale comment because: 1) It mentions a function named btrfs_commit_tree_roots(), which does not exists anymore, it was the old name of commit_cowonly_roots(), renamed a very long time ago by commit 5d4f98a28c7d33 ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)")); 2) It mentions that we need to acquire the tree log mutex at that point to ensure we have no running log writers. That is not correct anymore, for many years at least, since we are guaranteed that we do not have any log writers at that point simply because we have set the state of the transaction to TRANS_STATE_COMMIT_DOING and have waited for all writers to complete - meaning no one can log until we change the state of the transaction to TRANS_STATE_UNBLOCKED. Any attempts to join the transaction or start a new one will block until we do that state transition; 3) The comment mentions a "trans mutex" which doesn't exists since 2011, commit a4abeea41adf ("Btrfs: kill trans_mutex") removed it; 4) The current use of the tree log mutex is to ensure proper serialization of super block writes - if someone started a new transaction and uses it for logging, it will wait for the previous transaction to write its super block before writing the super block when attempting to sync the log. So acquire the tree log mutex only when it's absolutely needed, before setting the transaction state to TRANS_STATE_UNBLOCKED, fix and move the stale comment, add some assertions and new comments where appropriate. Also, this has no effect on concurrency or performance, since the new start of the critical section is still when the transaction is in the state TRANS_STATE_COMMIT_DOING. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: consolidate device_list_mutex in prepare_sprout to its parentAnand Jain1-18/+51
btrfs_prepare_sprout() splices seed devices into its own struct fs_devices, so that its parent function btrfs_init_new_device() can add the new sprout device to fs_info->fs_devices. Both btrfs_prepare_sprout() and btrfs_init_new_device() need device_list_mutex. But they are holding it separately, thus create a small race window. Close it and hold device_list_mutex across both functions btrfs_init_new_device() and btrfs_prepare_sprout(). Split btrfs_prepare_sprout() into btrfs_init_sprout() and btrfs_setup_sprout(). This split is essential because device_list_mutex must not be held for allocations in btrfs_init_sprout() but must be held for btrfs_setup_sprout(). So now a common device_list_mutex can be used between btrfs_init_new_device() and btrfs_setup_sprout(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: switch seeding_dev in init_new_device to boolAnand Jain1-2/+2
Declare int seeding_dev as a bool. Also, move its declaration a line below to adjust packing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: send: remove unused type parameter to iterate_inode_ref_tOmar Sandoval1-19/+13
Again, I don't think this was ever used since iterate_dir_item() is only used for xattrs. No functional change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: send: remove unused found_type parameter to lookup_dir_item_inode()Omar Sandoval1-7/+3
As far as I can tell, this was never used. No functional change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: rename btrfs_item_end_nr to btrfs_item_data_endJosef Bacik3-10/+10
The name btrfs_item_end_nr() is a bit of a misnomer, as it's actually the offset of the end of the data the item points to. In fact all of the helpers that we use btrfs_item_end_nr() use data in their name, like BTRFS_LEAF_DATA_SIZE() and leaf_data(). Rename to btrfs_item_data_end() to make it clear what this helper is giving us. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove the btrfs_item_end() helperJosef Bacik1-11/+5
We're only using btrfs_item_end() from btrfs_item_end_nr(), so this can be collapsed. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: drop the _nr from the item helpersJosef Bacik24-218/+191
Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), and then rework the _nr() helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: introduce item_nr token variant helpersJosef Bacik2-40/+57
The last remaining place where we have the pattern of item = btrfs_item_nr(slot) <do something with the item> are the token helpers. Handle this by introducing token helpers that will do the btrfs_item_nr() work inside of the helper itself, and then convert all users of the btrfs_item token helpers to the new _nr() variants. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: make btrfs_file_extent_inline_item_len take a slotJosef Bacik2-4/+3
Instead of getting the btrfs_item for this, simply pass in the slot of the item and then use the btrfs_item_size_nr() helper inside of btrfs_file_extent_inline_item_len(). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: add btrfs_set_item_*_nr() helpersJosef Bacik2-15/+21
We have the pattern of item = btrfs_item_nr(slot); btrfs_set_item_*(leaf, item); in a bunch of places in our code. Fix this by adding btrfs_set_item_*_nr() helpers which will do the appropriate work, and replace those calls with btrfs_set_item_*_nr(leaf, slot); Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: use btrfs_item_size_nr/btrfs_item_offset_nr everywhereJosef Bacik8-49/+19
We have this pattern in a lot of places item = btrfs_item_nr(slot); btrfs_item_size(leaf, item); when we could simply use btrfs_item_size(leaf, slot); Fix all callers of btrfs_item_size() and btrfs_item_offset() to use the _nr variation of the helpers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove no longer needed logic for replaying directory deletesFilipe Manana1-89/+69
Now that we log only dir index keys when logging a directory, we no longer need to deal with dir item keys in the log replay code for replaying directory deletes. This is also true for the case when we replay a log tree created by a kernel that still logs dir items. So remove the remaining code of the replay of directory deletes algorithm that deals with dir item keys. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: only copy dir index keys when logging a directoryFilipe Manana2-231/+182
Currently, when logging a directory, we copy both dir items and dir index items from the fs/subvolume tree to the log tree. Both items have exactly the same data (same struct btrfs_dir_item), the difference lies in the key values, where a dir index key contains the index number of a directory entry while the dir item key does not, as it's used for doing fast lookups of an entry by name, while the former is used for sorting entries when listing a directory. We can exploit that and log only the dir index items, since they contain all the information needed to correctly add, replace and delete directory entries when replaying a log tree. Logging only the dir index items is also backward and forward compatible: an unpatched kernel (without this change) can correctly replay a log tree generated by a patched kernel (with this patch), and a patched kernel can correctly replay a log tree generated by an unpatched kernel. The backward compatibility is ensured because: 1) For inserting a new dentry: a dentry is only inserted when we find a new dir index key - we can only insert if we know the dir index offset, which is encoded in the dir index key's offset; 2) For deleting dentries: during log replay, before adding or replacing dentries, we first replay dentry deletions. Whenever we find a dir item key or a dir index key in the subvolume/fs tree that is not logged in a range for which the log tree is authoritative, we do the unlink of the dentry, which removes both the existing dir item key and the dir index key. Therefore logging just dir index keys is enough to ensure dentry deletions are correctly replayed; 3) For dentry replacements: they work when we log only dir index keys and this is mostly due to a combination of 1) and 2). If we replace a dentry with name "foobar" to point from inode A to inode B, then we know the dir index key for the new dentry is different from the old one, as it has an index number (key offset) larger than the old one. This results in replaying a deletion, through replay_dir_deletes(), that causes the old dentry to be removed, both the dir item key and the dir index key, as mentioned at 2). Then when processing the new dir index key, we add the new dentry, adding both a new dir item key and a new index key pointing to inode B, as stated in 1). The forward compatibility, the ability for a patched kernel to replay a log created by an older, unpatched kernel, comes from the changes required for making sure we are able to replay a log that only contains dir index keys - we simply ignore every dir item key we find. So modify directory logging to log only dir index items, and modify the log replay process to ignore dir item keys, from log trees created by an unpatched kernel, and process only with dir index keys. This reduces the amount of logged metadata by about half, and therefore the time spent logging or fsyncing large directories (less CPU time and less IO). The following test script was used to measure this change: #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=1000000 NUM_FILE_DELETES=10000 mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" echo umount $MNT The tests were run on a physical machine, with a non-debug kernel (Debian's default kernel config), for different values of $NUM_NEW_FILES and $NUM_FILE_DELETES, and the results were the following: ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 8412 ms after adding 1000000 files dir fsync took 500 ms after deleting 10000 files ** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 4252 ms after adding 1000000 files (-49.5%) dir fsync took 269 ms after deleting 10000 files (-46.2%) ** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 745 ms after adding 100000 files dir fsync took 59 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 404 ms after adding 100000 files (-45.8%) dir fsync took 31 ms after deleting 1000 files (-47.5%) ** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 67 ms after adding 10000 files dir fsync took 9 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 36 ms after adding 10000 files (-46.3%) dir fsync took 5 ms after deleting 1000 files (-44.4%) ** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 9 ms after adding 1000 files dir fsync took 4 ms after deleting 100 files ** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 7 ms after adding 1000 files (-22.2%) dir fsync took 3 ms after deleting 100 files (-25.0%) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove spurious unlock/lock of unused_bgs_lockNikolay Borisov1-2/+0
Since both unused block groups and reclaim bgs lists are protected by unused_bgs_lock then free them in the same critical section without doing an extra unlock/lock pair. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: fix deadlock between quota enable and other quota operationsFilipe Manana1-0/+19
When enabling quotas, we attempt to commit a transaction while holding the mutex fs_info->qgroup_ioctl_lock. This can result on a deadlock with other quota operations such as: - qgroup creation and deletion, ioctl BTRFS_IOC_QGROUP_CREATE; - adding and removing qgroup relations, ioctl BTRFS_IOC_QGROUP_ASSIGN. This is because these operations join a transaction and after that they attempt to lock the mutex fs_info->qgroup_ioctl_lock. Acquiring that mutex after joining or starting a transaction is a pattern followed everywhere in qgroups, so the quota enablement operation is the one at fault here, and should not commit a transaction while holding that mutex. Fix this by making the transaction commit while not holding the mutex. We are safe from two concurrent tasks trying to enable quotas because we are serialized by the rw semaphore fs_info->subvol_sem at btrfs_ioctl_quota_ctl(), which is the only call site for enabling quotas. When this deadlock happens, it produces a trace like the following: INFO: task syz-executor:25604 blocked for more than 143 seconds. Not tainted 5.15.0-rc6 #4 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor state:D stack:24800 pid:25604 ppid: 24873 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 btrfs_commit_transaction+0x994/0x2e90 fs/btrfs/transaction.c:2201 btrfs_quota_enable+0x95c/0x1790 fs/btrfs/qgroup.c:1120 btrfs_ioctl_quota_ctl fs/btrfs/ioctl.c:4229 [inline] btrfs_ioctl+0x637e/0x7b70 fs/btrfs/ioctl.c:5010 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f86920b2c4d RSP: 002b:00007f868f61ac58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f86921d90a0 RCX: 00007f86920b2c4d RDX: 0000000020005e40 RSI: 00000000c0109428 RDI: 0000000000000008 RBP: 00007f869212bd80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f86921d90a0 R13: 00007fff6d233e4f R14: 00007fff6d233ff0 R15: 00007f868f61adc0 INFO: task syz-executor:25628 blocked for more than 143 seconds. Not tainted 5.15.0-rc6 #4 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor state:D stack:29080 pid:25628 ppid: 24873 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425 __mutex_lock_common kernel/locking/mutex.c:669 [inline] __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729 btrfs_remove_qgroup+0xb7/0x7d0 fs/btrfs/qgroup.c:1548 btrfs_ioctl_qgroup_create fs/btrfs/ioctl.c:4333 [inline] btrfs_ioctl+0x683c/0x7b70 fs/btrfs/ioctl.c:5014 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Reported-by: Hao Sun <sunhao.th@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CACkBjsZQF19bQ1C6=yetF3BvL10OSORpFUcWXTP6HErshDB4dQ@mail.gmail.com/ Fixes: 340f1aa27f36 ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable") CC: stable@vger.kernel.org # 4.19 Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW rangeFilipe Manana1-64/+78
When doing a direct IO write against a file range that either has preallocated extents in that range or has regular extents and the file has the NOCOW attribute set, the write fails with -ENOSPC when all of the following conditions are met: 1) There are no data blocks groups with enough free space matching the size of the write; 2) There's not enough unallocated space for allocating a new data block group; 3) The extents in the target file range are not shared, neither through snapshots nor through reflinks. This is wrong because a NOCOW write can be done in such case, and in fact it's possible to do it using a buffered IO write, since when failing to allocate data space, the buffered IO path checks if a NOCOW write is possible. The failure in direct IO write path comes from the fact that early on, at btrfs_dio_iomap_begin(), we try to allocate data space for the write and if it that fails we return the error and stop - we never check if we can do NOCOW. But later, at btrfs_get_blocks_direct_write(), we check if we can do a NOCOW write into the range, or a subset of the range, and then release the previously reserved data space. Fix this by doing the data reservation only if needed, when we must COW, at btrfs_get_blocks_direct_write() instead of doing it at btrfs_dio_iomap_begin(). This also simplifies a bit the logic and removes the inneficiency of doing unnecessary data reservations. The following example test script reproduces the problem: $ cat dio-nocow-enospc.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj # Use a small fixed size (1G) filesystem so that it's quick to fill # it up. # Make sure the mixed block groups feature is not enabled because we # later want to not have more space available for allocating data # extents but still have enough metadata space free for the file writes. mkfs.btrfs -f -b $((1024 * 1024 * 1024)) -O ^mixed-bg $DEV mount $DEV $MNT # Create our test file with the NOCOW attribute set. touch $MNT/foobar chattr +C $MNT/foobar # Now fill in all unallocated space with data for our test file. # This will allocate a data block group that will be full and leave # no (or a very small amount of) unallocated space in the device, so # that it will not be possible to allocate a new block group later. echo echo "Creating test file with initial data..." xfs_io -c "pwrite -S 0xab -b 1M 0 900M" $MNT/foobar # Now try a direct IO write against file range [0, 10M[. # This should succeed since this is a NOCOW file and an extent for the # range was previously allocated. echo echo "Trying direct IO write over allocated space..." xfs_io -d -c "pwrite -S 0xcd -b 10M 0 10M" $MNT/foobar umount $MNT When running the test: $ ./dio-nocow-enospc.sh (...) Creating test file with initial data... wrote 943718400/943718400 bytes at offset 0 900 MiB, 900 ops; 0:00:01.43 (625.526 MiB/sec and 625.5265 ops/sec) Trying direct IO write over allocated space... pwrite: No space left on device A test case for fstests will follow, testing both this direct IO write scenario as well as the buffered IO write scenario to make it less likely to get future regressions on the buffered IO case. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-18Merge tag 'for-5.16-rc5-tag' of ↵Linus Torvalds10-24/+54
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes, almost all error handling one-liners and for stable. - regression fix in directory logging items - regression fix of extent buffer status bits handling after an error - fix memory leak in error handling path in tree-log - fix freeing invalid anon device number when handling errors during subvolume creation - fix warning when freeing leaf after subvolume creation failure - fix missing blkdev put in device scan error handling - fix invalid delayed ref after subvolume creation failure" * tag 'for-5.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix missing blkdev_put() call in btrfs_scan_one_device() btrfs: fix warning when freeing leaf after subvolume creation failure btrfs: fix invalid delayed ref after subvolume creation failure btrfs: check WRITE_ERR when trying to read an extent buffer btrfs: fix missing last dir item offset update when logging directory btrfs: fix double free of anon_dev after failure to create subvolume btrfs: fix memory leak in __add_inode_ref()
2021-12-15btrfs: fix missing blkdev_put() call in btrfs_scan_one_device()Shin'ichiro Kawasaki1-2/+4
The function btrfs_scan_one_device() calls blkdev_get_by_path() and blkdev_put() to get and release its target block device. However, when btrfs_sb_log_location_bdev() fails, blkdev_put() is not called and the block device is left without clean up. This triggered failure of fstests generic/085. Fix the failure path of btrfs_sb_log_location_bdev() to call blkdev_put(). Fixes: 12659251ca5df ("btrfs: implement log-structured superblock for ZONED mode") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-15btrfs: fix warning when freeing leaf after subvolume creation failureFilipe Manana1-3/+4
When creating a subvolume, at ioctl.c:create_subvol(), if we fail to insert the root item for the new subvolume into the root tree, we can trigger the following warning: [78961.741046] WARNING: CPU: 0 PID: 4079814 at fs/btrfs/extent-tree.c:3357 btrfs_free_tree_block+0x2af/0x310 [btrfs] [78961.743344] Modules linked in: [78961.749440] dm_snapshot dm_thin_pool (...) [78961.773648] CPU: 0 PID: 4079814 Comm: fsstress Not tainted 5.16.0-rc4-btrfs-next-108 #1 [78961.775198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [78961.777266] RIP: 0010:btrfs_free_tree_block+0x2af/0x310 [btrfs] [78961.778398] Code: 17 00 48 85 (...) [78961.781067] RSP: 0018:ffffaa4001657b28 EFLAGS: 00010202 [78961.781877] RAX: 0000000000000213 RBX: ffff897f8a796910 RCX: 0000000000000000 [78961.782780] RDX: 0000000000000000 RSI: 0000000011004000 RDI: 00000000ffffffff [78961.783764] RBP: ffff8981f490e800 R08: 0000000000000001 R09: 0000000000000000 [78961.784740] R10: 0000000000000000 R11: 0000000000000001 R12: ffff897fc963fcc8 [78961.785665] R13: 0000000000000001 R14: ffff898063548000 R15: ffff898063548000 [78961.786620] FS: 00007f31283c6b80(0000) GS:ffff8982ace00000(0000) knlGS:0000000000000000 [78961.787717] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [78961.788598] CR2: 00007f31285c3000 CR3: 000000023fcc8003 CR4: 0000000000370ef0 [78961.789568] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [78961.790585] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [78961.791684] Call Trace: [78961.792082] <TASK> [78961.792359] create_subvol+0x5d1/0x9a0 [btrfs] [78961.793054] btrfs_mksubvol+0x447/0x4c0 [btrfs] [78961.794009] ? preempt_count_add+0x49/0xa0 [78961.794705] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs] [78961.795712] ? _copy_from_user+0x66/0xa0 [78961.796382] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [78961.797392] btrfs_ioctl+0xd1e/0x35c0 [btrfs] [78961.798172] ? __slab_free+0x10a/0x360 [78961.798820] ? rcu_read_lock_sched_held+0x12/0x60 [78961.799664] ? lock_release+0x223/0x4a0 [78961.800321] ? lock_acquired+0x19f/0x420 [78961.800992] ? rcu_read_lock_sched_held+0x12/0x60 [78961.801796] ? trace_hardirqs_on+0x1b/0xe0 [78961.802495] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [78961.803358] ? kmem_cache_free+0x321/0x3c0 [78961.804071] ? __x64_sys_ioctl+0x83/0xb0 [78961.804711] __x64_sys_ioctl+0x83/0xb0 [78961.805348] do_syscall_64+0x3b/0xc0 [78961.805969] entry_SYSCALL_64_after_hwframe+0x44/0xae [78961.806830] RIP: 0033:0x7f31284bc957 [78961.807517] Code: 3c 1c 48 f7 d8 (...) This is because we are calling btrfs_free_tree_block() on an extent buffer that is dirty. Fix that by cleaning the extent buffer, with btrfs_clean_tree_block(), before freeing it. This was triggered by test case generic/475 from fstests. Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-15btrfs: fix invalid delayed ref after subvolume creation failureFilipe Manana6-22/+31
When creating a subvolume, at ioctl.c:create_subvol(), if we fail to insert the new root's root item into the root tree, we are freeing the metadata extent we reserved for the new root to prevent a metadata extent leak, as we don't abort the transaction at that point (since there is nothing at that point that is irreversible). However we allocated the metadata extent for the new root which we are creating for the new subvolume, so its delayed reference refers to the ID of this new root. But when we free the metadata extent we pass the root of the subvolume where the new subvolume is located to btrfs_free_tree_block() - this is incorrect because this will generate a delayed reference that refers to the ID of the parent subvolume's root, and not to ID of the new root. This results in a failure when running delayed references that leads to a transaction abort and a trace like the following: [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs] [3868.739857] Code: 68 0f 85 e6 fb ff (...) [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246 [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002 [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88 [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88 [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000 [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88 [3868.750725] FS: 00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000 [3868.752275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0 [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [3868.757803] Call Trace: [3868.758281] <TASK> [3868.758655] ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs] [3868.759827] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs] [3868.761047] btrfs_run_delayed_refs+0x86/0x210 [btrfs] [3868.762069] ? lock_acquired+0x19f/0x420 [3868.762829] btrfs_commit_transaction+0x69/0xb20 [btrfs] [3868.763860] ? _raw_spin_unlock+0x29/0x40 [3868.764614] ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs] [3868.765870] create_subvol+0x1d8/0x9a0 [btrfs] [3868.766766] btrfs_mksubvol+0x447/0x4c0 [btrfs] [3868.767669] ? preempt_count_add+0x49/0xa0 [3868.768444] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs] [3868.769639] ? _copy_from_user+0x66/0xa0 [3868.770391] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [3868.771495] btrfs_ioctl+0xd1e/0x35c0 [btrfs] [3868.772364] ? __slab_free+0x10a/0x360 [3868.773198] ? rcu_read_lock_sched_held+0x12/0x60 [3868.774121] ? lock_release+0x223/0x4a0 [3868.774863] ? lock_acquired+0x19f/0x420 [3868.775634] ? rcu_read_lock_sched_held+0x12/0x60 [3868.776530] ? trace_hardirqs_on+0x1b/0xe0 [3868.777373] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [3868.778280] ? kmem_cache_free+0x321/0x3c0 [3868.779011] ? __x64_sys_ioctl+0x83/0xb0 [3868.779718] __x64_sys_ioctl+0x83/0xb0 [3868.780387] do_syscall_64+0x3b/0xc0 [3868.781059] entry_SYSCALL_64_after_hwframe+0x44/0xae [3868.781953] RIP: 0033:0x7f281c59e957 [3868.782585] Code: 3c 1c 48 f7 d8 4c (...) [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957 [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003 [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36 [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003 [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc [3868.793765] </TASK> [3868.794037] irq event stamp: 0 [3868.794548] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.797086] softirqs last enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0 [3868.799284] ---[ end trace be24c7002fe27747 ]--- [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2 [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627 [3868.802056] item 0 key (237436928 169 0) itemoff 16250 itemsize 33 [3868.802863] extent refs 1 gen 1265 flags 2 [3868.803447] ref#0: tree block backref root 1610 (...) [3869.064354] item 114 key (241008640 169 0) itemoff 12488 itemsize 33 [3869.065421] extent refs 1 gen 1268 flags 2 [3869.066115] ref#0: tree block backref root 1689 (...) [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622 owner 0 offset 0 [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry Fix this by passing the new subvolume's root ID to btrfs_free_tree_block(). This requires changing the root argument of btrfs_free_tree_block() from struct btrfs_root * to a u64, since at this point during the subvolume creation we have not yet created the struct btrfs_root for the new subvolume, and btrfs_free_tree_block() only needs a root ID and nothing else from a struct btrfs_root. This was triggered by test case generic/475 from fstests. Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>