path: root/fs
Age | Commit message | Author | Files | Lines
2023-06-19 | btrfs: handle tree backref walk error properly | Qu Wenruo | 2 | -15/+29
[BUG]
Smatch reports the following errors related to commit ("btrfs: output affected files when relocation fails"):

  fs/btrfs/inode.c:283 print_data_reloc_error() error: uninitialized symbol 'ref_level'.

[CAUSE]
That part of the code is mostly copied from scrub, but unfortunately the scrub code has never done the error handling properly.

The offending code looks like this:

  do {
          ret = tree_backref_for_extent();
          btrfs_warn_rl();
  } while (ret != 1);

There are several problems involved:

- No error handling
  If tree_backref_for_extent() fails, we would output the same error again and again and never exit, as the loop requires ret == 1 to exit.

- Always do one extra output
  tree_backref_for_extent() only returns > 0 if there are no more backref items. This means that after the last item we hit, we would output an invalid error message for the ret > 0 case.

[FIX]
Fix the old code by:

- Move @ref_root and @ref_level into the if branch
  And do not initialize them, so we can catch such uninitialized values just like what we do in inode.c.

- Explicitly check the return value of tree_backref_for_extent()
  And handle the ret < 0 and ret > 0 cases properly.

- No more do {} while () loop
  Instead use a while (true) {} loop, since we handle @ret manually.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
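The loop shape described in [FIX] can be sketched in userspace C as follows. This is only an illustration, not the kernel code: mock_next_backref() is a hypothetical stand-in for tree_backref_for_extent(), assumed to return 0 when it yields an item, > 0 when no items remain, and < 0 on error.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical iterator state; not a kernel structure. */
struct mock_walk {
	int remaining;	/* items still to be returned */
	int fail_at;	/* fail when 'remaining' equals this, -1 = never fail */
};

/* Stand-in for tree_backref_for_extent(): 0 = item, > 0 = done, < 0 = error. */
static int mock_next_backref(struct mock_walk *w)
{
	if (w->remaining == w->fail_at)
		return -5;
	if (w->remaining == 0)
		return 1;
	w->remaining--;
	return 0;
}

/*
 * Walk all backrefs with explicit handling of every return value.
 * Returns the number of items that would be reported; *err receives
 * the first error (or 0).
 */
static int walk_backrefs(struct mock_walk *w, int *err)
{
	int reported = 0;

	*err = 0;
	while (true) {
		int ret = mock_next_backref(w);

		if (ret < 0) {		/* error: report once and stop */
			*err = ret;
			break;
		}
		if (ret > 0)		/* no more items: nothing to print */
			break;
		reported++;		/* a valid item: this is where the warning goes */
	}
	return reported;
}
```

With this shape, an error ends the loop instead of repeating the same message forever, and the "no more items" result never produces an extra message.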
2023-06-19 | btrfs: don't hold an extra reference for redirtied buffers | Christoph Hellwig | 7 | -42/+4
When btrfs_redirty_list_add redirties a buffer, it also acquires an extra reference that is released on transaction commit. But this is not required as buffers that are dirty or under writeback are never freed (look for calls to extent_buffer_under_io()). Remove the extra reference and the infrastructure used to drop it again. History behind redirty logic: In the first place, it used releasing_list to hold all the to-be-released extent buffers, and decided which buffers to re-dirty at the commit time. Then, in a later version, the behaviour got changed to re-dirty a necessary buffer and add the re-dirtied one to the list in btrfs_free_tree_block(). In short, the list was there mostly for the patch series' historical reason. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [ add Naohiro's comment regarding history ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: fix dirty_metadata_bytes for redirtied buffers | Christoph Hellwig | 3 | -10/+6
dirty_metadata_bytes is decremented in both places that clear the dirty bit in a buffer, but only incremented in btrfs_mark_buffer_dirty, which means that a buffer that is redirtied using btrfs_redirty_list_add won't be added to dirty_metadata_bytes, but it will be subtracted when written out, leading to an inconsistency in the counter. Move the dirty_metadata_bytes accounting from btrfs_mark_buffer_dirty into set_extent_buffer_dirty to also account for the redirty case, and remove the now unused set_extent_buffer_dirty return value. Fixes: d3575156f662 ("btrfs: zoned: redirty released extent buffers") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
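The accounting rule can be illustrated with a small userspace model. All names here (mock_fs, mock_eb, mock_set_dirty, mock_clear_dirty) are hypothetical, and a plain long stands in for whatever counter type the kernel uses; the point is only that the increment lives in the same low-level helper that sets the dirty bit, so every path that dirties a buffer is counted.

```c
#include <assert.h>
#include <stdbool.h>

struct mock_fs {
	long dirty_metadata_bytes;	/* simplified counter */
};

struct mock_eb {
	bool dirty;
	long len;
	struct mock_fs *fs;
};

/* Setting the dirty bit and bumping the counter happen in one place. */
static void mock_set_dirty(struct mock_eb *eb)
{
	if (!eb->dirty) {
		eb->dirty = true;
		eb->fs->dirty_metadata_bytes += eb->len;
	}
}

/* Clearing the dirty bit always undoes exactly one increment. */
static void mock_clear_dirty(struct mock_eb *eb)
{
	if (eb->dirty) {
		eb->dirty = false;
		eb->fs->dirty_metadata_bytes -= eb->len;
	}
}
```

Because both helpers only act on a state change, dirtying, cleaning and redirtying a buffer in any order leaves the counter balanced.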
2023-06-19 | btrfs: unexport btrfs_run_discard_work and make it static | Johannes Thumshirn | 2 | -18/+17
Mark btrfs_run_discard_work static and move it above its callers. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: rename del_ptr to btrfs_del_ptr and export it | Josef Bacik | 2 | -8/+10
This exists internal to ctree.c, however btrfs check needs to use it for some of its operations. I'd rather not duplicate that code inside of btrfs check as this is low level and I want to keep this code in one place, so rename the function to btrfs_del_ptr and export it so that it can be used inside of btrfs-progs safely. Add a comment to make sure this doesn't get removed by a future cleanup. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: add a btrfs_csum_type_size helper | Josef Bacik | 2 | -1/+8
This is needed in btrfs-progs for the tools that convert the checksum types for file systems and a few other things. We don't have it in the kernel as we just want to get the size for the super block's type. However I don't want to have to manually add this every time we sync ctree.c into btrfs-progs, so add the helper in the kernel with a note so it doesn't get removed by a later cleanup. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: add __KERNEL__ check for btrfs_no_printk | Josef Bacik | 1 | -0/+7
We want to override this in btrfs-progs, so wrap this in the __KERNEL__ check so we can easily sync this to btrfs-progs and have our local version of btrfs_no_printk do the work. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: move split_flags/combine_flags helpers to inode-item.h | Josef Bacik | 3 | -17/+17
These are more related to the inode item flags on disk than the in-memory btrfs_inode, move the helpers to inode-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: move btrfs_verify_level_key into tree-checker.c | Josef Bacik | 4 | -60/+60
This is more a buffer validation helper, move it into the tree-checker files where it makes more sense. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: add __btrfs_check_node helper | Josef Bacik | 2 | -12/+18
This helper returns a btrfs_tree_block_status for the various errors, and then btrfs_check_node() will return -EUCLEAN if it gets anything other than BTRFS_TREE_BLOCK_CLEAN which will be used by the kernel. In the future btrfs-progs will use this helper instead. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: extend btrfs_leaf_check to return btrfs_tree_block_status | Josef Bacik | 2 | -13/+29
Instead of blanket returning -EUCLEAN for all the failures in btrfs_check_leaf, use btrfs_tree_block_status and return the appropriate status for each failure. Rename the helper to __btrfs_check_leaf and then make a wrapper of btrfs_check_leaf that will return -EUCLEAN to non-clean error codes. This will allow us to have the __btrfs_check_leaf variant in btrfs-progs while keeping the behavior in the kernel consistent. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
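The split between a status-returning checker and an errno-returning wrapper might look roughly like the sketch below. The enum values, struct mock_leaf and both function names are illustrative assumptions, not the kernel definitions; only the pattern (detailed status inside, a blanket -EUCLEAN for kernel callers) comes from the commit message.

```c
#include <assert.h>
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, used as a fallback on other platforms */
#endif

/* Hypothetical status codes modelled on btrfs_tree_block_status. */
enum mock_block_status {
	MOCK_BLOCK_CLEAN,
	MOCK_BLOCK_INVALID_NRITEMS,
	MOCK_BLOCK_INVALID_ITEM,
};

/* Hypothetical, heavily simplified leaf. */
struct mock_leaf {
	unsigned int nritems;
	int items_valid;
};

/* Detailed variant: tells the caller what exactly is wrong. */
static enum mock_block_status mock_check_leaf_status(const struct mock_leaf *leaf)
{
	if (leaf->nritems == 0)
		return MOCK_BLOCK_INVALID_NRITEMS;
	if (!leaf->items_valid)
		return MOCK_BLOCK_INVALID_ITEM;
	return MOCK_BLOCK_CLEAN;
}

/* Wrapper: collapses every non-clean status into -EUCLEAN. */
static int mock_check_leaf(const struct mock_leaf *leaf)
{
	if (mock_check_leaf_status(leaf) != MOCK_BLOCK_CLEAN)
		return -EUCLEAN;
	return 0;
}
```

A repair tool can call the detailed variant to decide whether a corruption is fixable, while existing callers of the wrapper see no behavior change.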
2023-06-19 | btrfs: use btrfs_tree_block_status for leaf item errors | Josef Bacik | 1 | -7/+12
We have a variety of item specific errors that can occur. For now simply put these under the umbrella of BTRFS_TREE_BLOCK_INVALID_ITEM, this can be fleshed out as we need in the future. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: add btrfs_tree_block_status definitions to tree-checker.h | Josef Bacik | 1 | -0/+13
We use this in btrfs-progs to determine if we can fix different types of corruptions. We don't care about this in the kernel, however it would be good to share this code between the kernel and btrfs-progs, so add the status definitions so we can start converting the tree-checker code over to using these status flags instead of blanket returning -EUCLEAN. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: simplify btrfs_check_leaf_* helpers into a single helper | Josef Bacik | 3 | -32/+13
We have two helpers for checking leaves, because we have an extra check for debugging in btrfs_mark_buffer_dirty(), and at that stage we may have item data that isn't consistent yet. However we can handle this case internally in the helper: if BTRFS_HEADER_FLAG_WRITTEN is set we know the buffer should be internally consistent, otherwise we need to skip checking the item data. Simplify this down to a single helper and handle the item data checking logic internally to the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: remove level argument from btrfs_set_block_flags | Josef Bacik | 3 | -9/+5
We just pass in btrfs_header_level(eb) for the level, and we're passing in the eb already, so simply get the level from the eb inside of btrfs_set_block_flags. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c | Josef Bacik | 4 | -21/+21
This is completely related to block rsv's, move it out of the free space cache code and into block-rsv.c. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read | Qu Wenruo | 3 | -0/+55
For P/Q stripe scrub, we have quite some duplicated read IO:

- Data stripes read for verification
  This is triggered by the scrub_submit_initial_read() inside scrub_raid56_parity_stripe().

- Data stripes read (again) for P/Q stripe verification
  This is triggered by scrub_assemble_read_bios() from scrub_rbio().
  Although we can hit the rbio cache and avoid unnecessary reads, the chance is very low, as scrub would easily flush the whole rbio cache.

This means that even if we're just scrubbing a single P/Q stripe, we would read the data stripes twice in the best case scenario. If we need to recover some data stripes, it would cause more reads on the same data stripes, again and again.

However, before we call raid56_parity_submit_scrub_rbio() we already have all data stripes repaired and their contents ready to use. But the RAID56 cache is unaware of the scrub cache, thus the RAID56 layer itself still needs to re-read the data stripes.

To avoid such cache misses, this patch would:

- Introduce a new helper, raid56_parity_cache_data_pages()
  This function would grab the pages from an array, and copy the content to the rbio, marking all the involved sectors uptodate.
  The page copy is unavoidable because the cache pages of the rbio are all self managed, thus can not utilize outside pages without screwing up the lifespan.

- Use the repaired data stripes as cache inside scrub_raid56_parity_stripe()
  By this, we ensure all the data sectors of the scrub rbio are already uptodate, with no need to read them again from disk.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: assert tree lock is held when removing free space entries | Filipe Manana | 1 | -10/+24
Removing a free space entry from an in memory space cache requires having the corresponding btrfs_free_space_ctl's 'tree_lock' held. We have several code paths that remove an entry, so add assertions where appropriate to verify we are holding the lock, as the lock is acquired by some other function up in the call chain, which makes it easy to miss in the future. Note: for this to work we need to lock the local btrfs_free_space_ctl at load_free_space_cache(), which was not being done because it's local, declared on the stack, so no other task has access to it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: assert tree lock is held when linking free space | Filipe Manana | 1 | -0/+2
When linking a free space entry, at link_free_space(), the caller should be holding the spinlock 'tree_lock' of the given btrfs_free_space_ctl argument, which is necessary for manipulating the red black tree of free space entries (done by tree_insert_offset(), which already asserts the lock is held) and for manipulating the 'free_space', 'free_extents', 'discardable_extents' and 'discardable_bytes' counters of the given struct btrfs_free_space_ctl. So assert that the spinlock 'tree_lock' of the given btrfs_free_space_ctl is held by the current task. We have multiple code paths that end up calling link_free_space(), and all currently take the lock before calling it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
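The idea of asserting the lock inside the helper, rather than trusting every caller, can be modelled in userspace as below. Everything here is a simplified assumption: struct mock_ctl and the mock_* functions are hypothetical, a boolean replaces the real spinlock, and assert() stands in for the kernel's lock assertion (lockdep_assert_held() is the usual kernel facility for this kind of check).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct btrfs_free_space_ctl. */
struct mock_ctl {
	bool tree_lock_held;	/* models the 'tree_lock' spinlock state */
	long free_space;
	int free_extents;
};

static void mock_lock(struct mock_ctl *ctl)
{
	assert(!ctl->tree_lock_held);
	ctl->tree_lock_held = true;
}

static void mock_unlock(struct mock_ctl *ctl)
{
	assert(ctl->tree_lock_held);
	ctl->tree_lock_held = false;
}

/* The helper documents and enforces its locking requirement itself. */
static void mock_link_free_space(struct mock_ctl *ctl, long bytes)
{
	assert(ctl->tree_lock_held);	/* caller must hold the lock */
	ctl->free_space += bytes;
	ctl->free_extents++;
}
```

A new caller that forgets to take the lock then fails loudly in testing instead of silently corrupting the counters.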
2023-06-19 | btrfs: assert tree lock is held when searching for free space entries | Filipe Manana | 1 | -0/+2
When searching for a free space entry by offset, at tree_search_offset(), we are supposed to have the btrfs_free_space_ctl's 'tree_lock' held, so assert that. We have multiple callers of tree_search_offset(), and all currently hold the necessary lock before calling it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: assert proper locks are held at tree_insert_offset() | Filipe Manana | 1 | -6/+19
There are multiple code paths leading to tree_insert_offset(), and each path takes the necessary locks before tree_insert_offset() is called, since they do other things that require those locks to be held. This makes it easy to miss the locking somewhere, so make tree_insert_offset() assert that the required locks are being held by the calling task. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: simplify arguments to tree_insert_offset() | Filipe Manana | 1 | -21/+16
For the in-memory component of space caching (free space cache and free space tree), three of the arguments passed to tree_insert_offset() can always be taken from the new free space entry that we are about to add. So simplify tree_insert_offset() to take the new entry instead of the 'offset', 'node' and 'bitmap' arguments. This also makes further changes simpler. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: use precomputed end offsets at do_trimming() | Filipe Manana | 1 | -1/+1
There are two computations of end offsets at do_trimming() that are not necessary, as they were previously computed and stored in local const variables. So just use the variables instead, to make the source code shorter and easier to read. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: avoid searching twice for previous node when merging free space entries | Filipe Manana | 1 | -3/+6
At try_merge_free_space(), avoid calling rb_prev() twice to find the previous node, as that requires looping through the red black tree, so store the result of the rb_prev() call and then use it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: avoid extra memory allocation when copying free space cache | Filipe Manana | 1 | -2/+4
At copy_free_space_cache(), we add a new entry to the block group's ctl before we free the entry from the temporary ctl. Adding a new entry requires the allocation of a new struct btrfs_free_space, so we can avoid a temporary extra allocation by freeing the entry from the temporary ctl before we add a new entry to the main ctl, which possibly also reduces the chances for a memory allocation failure in case of very high memory pressure. So just do that. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: simplify transid initialization in btrfs_ioctl_wait_sync | Tom Rix | 1 | -5/+4
A small code simplification: move the default value of transid to its initialization and remove the else-statement. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: output affected files when relocation fails | Qu Wenruo | 3 | -0/+208
[PROBLEM]
When relocation fails (mostly due to checksum mismatch), we only get very cryptic error messages like:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

The end user has to decipher the above messages and use various tools to locate the affected files and find a way to fix the problem (mostly deleting the file). This is not easy work even for an experienced developer, not to mention end users.

[SCRUB IS DOING BETTER]
By contrast, scrub provides much better error messages:

  BTRFS error (device dm-4): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
  BTRFS warning (device dm-4): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
  BTRFS info (device dm-4): scrub: finished on devid 1 with status: 0

These point the end user directly to the affected files.

[IMPROVEMENT]
Instead of the generic data checksum error messages, which do not do a good job for data reloc inodes, this patch introduces a scrub-like solution based on backref walking.

When a sector fails its checksum for a data reloc inode, we go through the following workflow:

- Get the real logical bytenr
  For a data reloc inode, the file offset is the offset inside the block group. Thus the real logical bytenr is @file_off + @block_group->start.

- Do an extent type check
  If it's a tree block it's much easier to handle: just go through all the tree block backrefs.

- Do a backref walk and inode path resolution for data extents
  This is mostly the same as scrub. But unfortunately we can not reuse the same function as the output format is different.

Now the new output is more user friendly:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 logical 13631488 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS warning (device dm-4): checksum error at logical 13631488 mirror 1 root 5 inode 257 offset 0 length 4096 links 1 (path: file)
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
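The first step of that workflow, turning a data reloc inode file offset into a real logical bytenr, is a single addition. The sketch below uses a hypothetical struct mock_block_group and helper name; only the formula @file_off + @block_group->start is taken from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the block group being relocated. */
struct mock_block_group {
	uint64_t start;		/* logical start of the block group */
	uint64_t length;
};

/*
 * For a data reloc inode the file offset is relative to the block
 * group, so the logical bytenr is the block group start plus that
 * offset.
 */
static uint64_t mock_reloc_logical(const struct mock_block_group *bg,
				   uint64_t file_off)
{
	return bg->start + file_off;
}
```

This matches the sample log: block group 13631488 with "off 0" gives "logical 13631488".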
2023-06-19 | btrfs: remove hipri_workers workqueue | Christoph Hellwig | 4 | -11/+2
Now that btrfs_wq_submit_bio is never called for synchronous I/O, the hipri_workers workqueue is not used anymore and can be removed. Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: determine synchronous writers from bio or writeback control | Christoph Hellwig | 5 | -19/+3
The writeback_control structure already passes down the information about a writeback being synchronous from the core VM code, and that information is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags helper.

Use that information to decide whether checksum calculation is offloaded to a workqueue, instead of the btrfs_inode::sync_writers field, which not only bloats the inode but also has too wide a scope, being inode wide instead of limited to the actual writeback request.

The sync writes were set in:

- btrfs_do_write_iter - regular IO, sync status is set
- start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL mode
- btrfs_write_marked_extents - write marked extents, writeback with WB_SYNC_ALL mode

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: submit IO synchronously for fast checksum implementations | Christoph Hellwig | 1 | -13/+7
Most modern hardware supports very fast accelerated crc32c calculation. If that is supported, the CPU overhead of the checksum calculation is very limited, and offloading the calculation to special worker threads has a lot of overhead for no gain.

E.g. on an Intel Optane device, offloading actually slows down even 1M buffered writes with fio considerably:

Unpatched:

  write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets

With synchronous CRCs:

  write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets

The unpatched run showed a lot of variation, going down as low as 1100MB/s, while the synchronous CRC version has about the same peak write speed but much smaller dips, and fewer kworkers churning around. Both tests had fio saturated at 100% CPU.

(thanks to Jens Axboe via Chris Mason for the benchmarking)

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
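Read together with the REQ_SYNC entry above, the offload decision could be pictured roughly as in this sketch. It is a simplified assumption of the policy, not the kernel function: MOCK_REQ_SYNC, struct mock_bio and mock_should_offload_csum() are hypothetical names, and the real code may check further conditions.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_REQ_SYNC (1u << 0)	/* hypothetical flag modelled on REQ_SYNC */

/* Hypothetical, minimal bio: only the operation flags matter here. */
struct mock_bio {
	unsigned int opf;
};

/*
 * Decide whether checksum calculation goes to a worker thread.
 * - Fast (hardware accelerated) checksums: compute inline, offloading
 *   only adds overhead.
 * - Synchronous writeback: compute inline, the submitter is waiting.
 * - Otherwise: offload.
 */
static bool mock_should_offload_csum(const struct mock_bio *bio,
				     bool csum_impl_fast)
{
	if (csum_impl_fast)
		return false;
	if (bio->opf & MOCK_REQ_SYNC)
		return false;
	return true;
}
```

Note that the decision depends only on the individual request and a filesystem-wide property, with no per-inode state involved.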
2023-06-19 | btrfs: use SECTOR_SHIFT to convert LBA to physical offset | Anand Jain | 4 | -6/+6
Using SECTOR_SHIFT to convert LBA to physical address makes it more readable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: use SECTOR_SHIFT to convert physical offset to LBA | Anand Jain | 5 | -6/+8
Using SECTOR_SHIFT to convert a physical address to an LBA makes the code more readable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
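Both conversions are plain shifts by SECTOR_SHIFT, which is 9 in the kernel (512-byte sectors). The helper names below are hypothetical and exist only to show the two directions side by side in userspace.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, same value the kernel uses */

/* LBA (in 512-byte sectors) to byte offset. */
static inline uint64_t mock_lba_to_offset(uint64_t lba)
{
	return lba << SECTOR_SHIFT;
}

/* Byte offset to LBA; any remainder inside the sector is dropped. */
static inline uint64_t mock_offset_to_lba(uint64_t offset)
{
	return offset >> SECTOR_SHIFT;
}
```

Compared with a bare "<< 9" or "/ 512", the named constant states the unit being converted.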
2023-06-19 | btrfs: improve leaf dump and error handling | Qu Wenruo | 2 | -68/+59
Improve the leaf dump behavior by:

- Always dump the leaf first, then the error message

- Output the slot number if possible
  Especially in __btrfs_free_extent() the leaf dump of the extent tree can be pretty large. With an extra slot number it's much easier to locate the problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: print-tree: pass const extent buffer pointer | Qu Wenruo | 5 | -14/+14
Since the print-tree infrastructure only prints the content of a tree block, we can make it accept a const extent buffer pointer. This removes a forced type conversion in extent-tree, where we convert a const extent buffer pointer to a regular one just to avoid a compiler warning. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: export bitmap_test_range_all_{set,zero} | Naohiro Aota | 3 | -28/+26
bitmap_test_range_all_{set,zero} defined in subpage.c are useful for other components. Move them to misc.h and use them in zoned.c. Also, as find_next{,_zero}_bit take/return "unsigned long" instead of "unsigned int", convert the type to "unsigned long". While at it, also rewrite the "if (...) return true; else return false;" pattern and add const to the input bitmap. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
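To show what such range tests compute, here is a userspace sketch with hypothetical names. Note the swapped-in technique: the kernel helpers are built on find_next_bit()/find_next_zero_bit(), while this version uses a plain per-bit loop so that it compiles without kernel headers. It does follow the commit message in using unsigned long indices, a const bitmap, and a direct boolean return.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MOCK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Test a single bit in an array of unsigned long words. */
static bool mock_test_bit(const unsigned long *bitmap, unsigned long nr)
{
	return (bitmap[nr / MOCK_BITS_PER_LONG] >> (nr % MOCK_BITS_PER_LONG)) & 1UL;
}

/* True if every bit in [start, start + nbits) is set. */
static bool mock_range_all_set(const unsigned long *bitmap,
			       unsigned long start, unsigned long nbits)
{
	for (unsigned long i = start; i < start + nbits; i++)
		if (!mock_test_bit(bitmap, i))
			return false;
	return true;
}

/* True if every bit in [start, start + nbits) is clear. */
static bool mock_range_all_zero(const unsigned long *bitmap,
				unsigned long start, unsigned long nbits)
{
	for (unsigned long i = start; i < start + nbits; i++)
		if (mock_test_bit(bitmap, i))
			return false;
	return true;
}
```

The caller is responsible for keeping start + nbits within the bitmap, as with the kernel bitmap helpers.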
2023-06-19 | btrfs: tag as unlikely the key comparison when checking sibling keys | Filipe Manana | 1 | -1/+1
When checking sibling keys, before moving keys from one node/leaf to a sibling node/leaf, it's very unexpected to have the last key of the left sibling greater than or equal to the first key of the right sibling, as that means we have a (serious) corruption that breaks the key ordering properties of a b+tree. Since this is unexpected, surround the comparison with the unlikely macro, which helps the compiler generate better code for the most expected case (no existing b+tree corruption). This is also what we do for other unexpected cases of invalid key ordering (like at btrfs_set_item_key_safe()). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
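A userspace sketch of the pattern follows. The unlikely() definition mirrors the usual kernel one and relies on the GCC/Clang builtin __builtin_expect; struct mock_key and both functions are hypothetical simplifications, with keys compared by objectid, then type, then offset, and -EUCLEAN assumed as the error for a detected ordering violation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, used as a fallback on other platforms */
#endif

/* Branch hint: tells the compiler the condition is expected to be false. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical simplified key. */
struct mock_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Returns < 0, 0 or > 0 like a comparator. */
static int mock_comp_keys(const struct mock_key *a, const struct mock_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* The left sibling's last key must sort strictly before the right's first. */
static int mock_check_sibling_keys(const struct mock_key *left_last,
				   const struct mock_key *right_first)
{
	if (unlikely(mock_comp_keys(left_last, right_first) >= 0))
		return -EUCLEAN;	/* ordering violated: corruption */
	return 0;
}
```

The hint changes no behavior; it only lets the compiler lay out the common, non-corrupted path as the fall-through case.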
2023-06-19 | btrfs: make btrfs_free_device() static | Filipe Manana | 2 | -2/+1
The function btrfs_free_device() is never used outside of volumes.c, so make it static and remove its prototype declaration at volumes.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: don't commit transaction for every subvol create | Sweet Tea Dorminy | 1 | -4/+3
Recently a Meta-internal workload encountered subvolume creation taking up to 2s each, significantly slower than directory creation. As they were hoping to be able to use subvolumes instead of directories, and were looking to create hundreds, this was a significant issue. After Josef investigated, it turned out to be due to the transaction commit currently performed at the end of subvolume creation. This change improves the workload by not doing transaction commit for every subvolume creation, and merely requiring a transaction commit on fsync. In the worst case, of doing a subvolume create and fsync in a loop, this should require an equal amount of time to the current scheme; and in the best case, the internal workload creating hundreds of subvolumes before fsyncing is greatly improved. While it would be nice to be able to use the log tree and use the normal fsync path, log tree replay can't deal with new subvolume inodes presently. It's possible that there's some reason that the transaction commit is necessary for correctness during subvolume creation; however, git logs indicate that the commit dates back to the beginning of subvolume creation, and there are no notes on why it would be necessary. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | btrfs: unexport btrfs_prev_leaf() | Filipe Manana | 2 | -81/+81
btrfs_prev_leaf() is not used outside ctree.c, so there's no need to export it at ctree.h - just make it static at ctree.c and move its definition above btrfs_search_slot_for_read(), since that function calls it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 | ovl: port to new mount api | Christian Brauner | 1 | -205/+193
We recently ported util-linux to the new mount api. Now the mount(8) tool will by default use the new mount api. While trying hard to fall back to the old mount api gracefully there are still cases where we run into issues that are difficult to handle nicely. Now with mount(8) and libmount supporting the new mount api I expect an increase in the number of bug reports and issues we're going to see with filesystems that don't yet support the new mount api. So it's time we rectify this. When ovl_fill_super() fails before setting sb->s_root, we need to clean up sb->s_fs_info. The logic is a bit convoluted but tl;dr: If sget_fc() has succeeded fc->s_fs_info will have been transferred to sb->s_fs_info. So by the time ->fill_super()/ovl_fill_super() is called fc->s_fs_info is NULL; consequently fs_context->free() won't call ovl_free_fs(). If we fail before sb->s_root is set then ->put_super(), which would call ovl_free_fs(), won't be called. IOW, if we fail in ->fill_super() before sb->s_root is set we have to clean it up ourselves. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 | ovl: factor out ovl_parse_options() helper | Amir Goldstein | 2 | -116/+135
For parsing a single mount option. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 | ovl: store enum redirect_mode in config instead of a string | Amir Goldstein | 6 | -84/+103
Do all the logic to set the mode during mount options parsing and do not keep the option string around. Use a constant_table to translate from enum redirect mode to string in preparation for new mount api option parsing. The mount option "off" is translated to either "follow" or "nofollow", depending on the "redirect_always_follow" build/module config, so in effect, there are only three possible redirect modes. This results in a minor change to the string that is displayed in show_options() - when redirect_dir is enabled by default and the user mounts with the option "redirect_dir=off", instead of displaying the mode "redirect_dir=off" in show_options(), the displayed mode will be either "redirect_dir=follow" or "redirect_dir=nofollow", depending on the value of "redirect_always_follow" build/module config. The displayed mode reflects the effective mode, so mounting overlayfs again with the displayed redirect_dir option will result in the same effective and displayed mode. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
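The translation described here can be sketched in userspace as a name/value table used in both directions. All identifiers are hypothetical (the mock table type only imitates the name/value layout of the kernel's constant_table), and the three modes are taken from the commit message's statement that "off" collapses into "follow" or "nofollow".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical effective redirect modes. */
enum mock_redirect_mode {
	MOCK_REDIRECT_FOLLOW,
	MOCK_REDIRECT_NOFOLLOW,
	MOCK_REDIRECT_ON,
};

/* Name/value pair, terminated by a NULL name. */
struct mock_constant_table {
	const char *name;
	int value;
};

static const struct mock_constant_table mock_redirect_modes[] = {
	{ "follow",   MOCK_REDIRECT_FOLLOW },
	{ "nofollow", MOCK_REDIRECT_NOFOLLOW },
	{ "on",       MOCK_REDIRECT_ON },
	{ NULL, 0 },
};

/* Enum to string, as a show_options()-style routine would need. */
static const char *mock_redirect_mode_name(enum mock_redirect_mode mode)
{
	for (const struct mock_constant_table *p = mock_redirect_modes; p->name; p++)
		if (p->value == (int)mode)
			return p->name;
	return NULL;
}

/* String to enum; "off" is resolved to its effective mode at parse time. */
static int mock_parse_redirect(const char *opt, bool always_follow,
			       enum mock_redirect_mode *out)
{
	if (strcmp(opt, "off") == 0) {
		*out = always_follow ? MOCK_REDIRECT_FOLLOW : MOCK_REDIRECT_NOFOLLOW;
		return 0;
	}
	for (const struct mock_constant_table *p = mock_redirect_modes; p->name; p++) {
		if (strcmp(p->name, opt) == 0) {
			*out = (enum mock_redirect_mode)p->value;
			return 0;
		}
	}
	return -1;	/* unknown option value */
}
```

Since only the effective mode is stored, printing it and parsing the printed string again yields the same mode, which is the round-trip property the commit message describes.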
2023-06-19 | ovl: pass ovl_fs to xino helpers | Amir Goldstein | 4 | -31/+43
Internal ovl methods should use ovl_fs and not sb as much as possible. Use a constant_table to translate from enum xino mode to string in preparation for new mount api option parsing. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 | ovl: clarify ovl_get_root() semantics | Amir Goldstein | 1 | -1/+3
Change the semantics to take a reference on upperdentry instead of transferring the reference. This is needed for the upcoming port to the new mount api. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 | ovl: negate the ofs->share_whiteout boolean | Amir Goldstein | 3 | -7/+4
The default common case is that whiteout sharing is enabled. Change to storing the negated no_shared_whiteout state, so we will not need to initialize it. This is the first step towards removing all config and feature initializations out of ovl_fill_super(). Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 | ovl: check type and offset of struct vfsmount in ovl_entry | Christian Brauner | 1 | -0/+9
Porting overlayfs to the new mount api I started experiencing random crashes that couldn't be explained easily. So after much debugging and reasoning it became clear that struct ovl_entry requires the pointer to struct vfsmount to be the first member and of type struct vfsmount. During the port I added a new member at the beginning of struct ovl_entry which broke all over the place in the form of random crashes and cache corruptions. While there's a comment in ovl_free_fs() to the effect of "Hack! Reuse ofs->layers as a vfsmount array before freeing it" there's no such comment on struct ovl_entry which makes this easy to trip over. Add a comment and two static asserts for both the offset and the type of pointer in struct ovl_entry. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
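Compile-time checks of this kind can be written in standard C11 as shown below. The struct and function names are hypothetical stand-ins, and _Generic is used for the type check here as a portable substitute; the kernel has its own helper macros for that purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real structures. */
struct mock_vfsmount {
	int dummy;
};

struct mock_layer {
	/* Must stay the first member: code elsewhere relies on it. */
	struct mock_vfsmount *mnt;
	int idx;
};

/* Fails the build if someone adds a member in front of 'mnt'. */
_Static_assert(offsetof(struct mock_layer, mnt) == 0,
	       "mnt must be the first member of struct mock_layer");

/* Fails the build if the type of 'mnt' is changed. */
_Static_assert(_Generic(((struct mock_layer *)0)->mnt,
			struct mock_vfsmount *: 1,
			default: 0),
	       "mnt must be of type struct mock_vfsmount *");

/*
 * Example of code that depends on the layout: a pointer to the struct
 * is also a pointer to its first member.
 */
static struct mock_vfsmount *mock_first_mount(const struct mock_layer *layer)
{
	return *(struct mock_vfsmount *const *)layer;
}
```

With the asserts in place, a layout change that would break the pointer reuse becomes a build error instead of a runtime crash.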
2023-06-19 | ovl: implement lazy lookup of lowerdata in data-only layers | Amir Goldstein | 7 | -14/+107
Defer lookup of lowerdata in the data-only layers to first data access or before copy up. We perform lowerdata lookup before copy up even if copy up is metadata only copy up. We can further optimize this lookup later if needed. We do best effort lazy lookup of lowerdata for d_real_inode(), because this interface does not expect errors. The only current in-tree caller of d_real_inode() is trace_uprobe and this caller is likely to be followed by reading from the file, before placing uprobes on an offset within the file, so lowerdata should be available when setting the uprobe. Tested-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 | ovl: prepare for lazy lookup of lowerdata inode | Amir Goldstein | 4 | -5/+33
Make the code handle the case of numlower > 1 and missing lowerdata dentry gracefully. Missing lowerdata dentry is an indication for lazy lookup of lowerdata and in that case the lowerdata_redirect path is stored in ovl_inode. Following commits will defer lookup and perform the lazy lookup on access. Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 | ovl: prepare to store lowerdata redirect for lazy lowerdata lookup | Amir Goldstein | 6 | -4/+24
Prepare to allow ovl_lookup() to leave the last entry in a non-dir lowerstack empty to signify lazy lowerdata lookup. In this case, ovl_lookup() stores the redirect path from metacopy to lowerdata in ovl_inode, which is going to be used later to perform the lazy lowerdata lookup. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 | ovl: implement lookup in data-only layers | Amir Goldstein | 1 | -1/+72
Lookup in data-only layers only for a lower metacopy with an absolute redirect xattr. The metacopy xattr is not checked on files found in the data-only layers and redirect xattrs are not followed in the data-only layers. Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>