summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2023-06-19btrfs: use btrfs_tree_block_status for leaf item errorsJosef Bacik1-7/+12
We have a variety of item specific errors that can occur. For now simply put these under the umbrella of BTRFS_TREE_BLOCK_INVALID_ITEM, this can be fleshed out as we need in the future. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: add btrfs_tree_block_status definitions to tree-checker.hJosef Bacik1-0/+13
We use this in btrfs-progs to determine if we can fix different types of corruptions. We don't care about this in the kernel, however it would be good to share this code between the kernel and btrfs-progs, so add the status definitions so we can start converting the tree-checker code over to using these status flags instead of blanket returning -EUCLEAN. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: simplify btrfs_check_leaf_* helpers into a single helperJosef Bacik3-32/+13
We have two helpers for checking leaves, because we have an extra check for debugging in btrfs_mark_buffer_dirty(), and at that stage we may have item data that isn't consistent yet. However we can handle this case internally in the helper, if BTRFS_HEADER_FLAG_WRITTEN is set we know the buffer should be internally consistent, otherwise we need to skip checking the item data. Simplify this helper down a single helper and handle the item data checking logic internally to the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: remove level argument from btrfs_set_block_flagsJosef Bacik3-9/+5
We just pass in btrfs_header_level(eb) for the level, and we're passing in the eb already, so simply get the level from the eb inside of btrfs_set_block_flags. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.cJosef Bacik4-21/+21
This is completely related to block rsv's, move it out of the free space cache code and into block-rsv.c. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: scrub: use recovered data stripes as cache to avoid unnecessary readQu Wenruo3-0/+55
For P/Q stripe scrub, we have quite some duplicated read IO: - Data stripes read for verification This is triggered by the scrub_submit_initial_read() inside scrub_raid56_parity_stripe(). - Data stripes read (again) for P/Q stripe verification This is triggered by scrub_assemble_read_bios() from scrub_rbio(). Although we can have hit rbio cache and avoid unnecessary read, the chance is very low, as scrub would easily flush the whole rbio cache. This means, even we're just scrubbing a single P/Q stripe, we would read the data stripes twice for the best case scenario. If we need to recover some data stripes, it would cause more reads on the same data stripes, again and again. However before we call raid56_parity_submit_scrub_rbio() we already have all data stripes repaired and their contents ready to use. But RAID56 cache is unaware about the scrub cache, thus RAID56 layer itself still needs to re-read the data stripes. To avoid such cache miss, this patch would: - Introduce a new helper, raid56_parity_cache_data_pages() This function would grab the pages from an array, and copy the content to the rbio, marking all the involved sectors uptodate. The page copy is unavoidable because of the cache pages of rbio are all self managed, thus can not utilize outside pages without screwing up the lifespan. - Use the repaired data stripes as cache inside scrub_raid56_parity_stripe() By this, we ensure all the data sectors of the scrub rbio are already uptodate, and no need to read them again from disk. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: assert tree lock is held when removing free space entriesFilipe Manana1-10/+24
Removing a free space entry from an in memory space cache requires having the corresponding btrfs_free_space_ctl's 'tree_lock' held. We have several code paths that remove an entry, so add assertions where appropriate to verify we are holding the lock, as the lock is acquired by some other function up in the call chain, which makes it easy to miss in the future. Note: for this to work we need to lock the local btrfs_free_space_ctl at load_free_space_cache(), which was not being done because it's local, declared on the stack, so no other task has access to it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: assert tree lock is held when linking free spaceFilipe Manana1-0/+2
When linking a free space entry, at link_free_space(), the caller should be holding the spinlock 'tree_lock' of the given btrfs_free_space_ctl argument, which is necessary for manipulating the red black tree of free space entries (done by tree_insert_offset(), which already asserts the lock is held) and for manipulating the 'free_space', 'free_extents', 'discardable_extents' and 'discardable_bytes' counters of the given struct btrfs_free_space_ctl. So assert that the spinlock 'tree_lock' of the given btrfs_free_space_ctl is held by the current task. We have multiple code paths that end up calling link_free_space(), and all currently take the lock before calling it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: assert tree lock is held when searching for free space entriesFilipe Manana1-0/+2
When searching for a free space entry by offset, at tree_search_offset(), we are supposed to have the btrfs_free_space_ctl's 'tree_lock' held, so assert that. We have multiple callers of tree_search_offset(), and all currently hold the necessary lock before calling it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: assert proper locks are held at tree_insert_offset()Filipe Manana1-6/+19
There are multiple code paths leading to tree_insert_offset(), and each path takes the necessary locks before tree_insert_offset() is called, since they do other things that require those locks to be held. This makes it easy to miss the locking somewhere, so make tree_insert_offset() assert that the required locks are being held by the calling task. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: simplify arguments to tree_insert_offset()Filipe Manana1-21/+16
For the in-memory component of space caching (free space cache and free space tree), three of the arguments passed to tree_insert_offset() can always be taken from the new free space entry that we are about to add. So simplify tree_insert_offset() to take the new entry instead of the 'offset', 'node' and 'bitmap' arguments. This will also allow to make further changes simpler. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: use precomputed end offsets at do_trimming()Filipe Manana1-1/+1
The are two computations of end offsets at do_trimming() that are not necessary, as they were previously computed and stored in local const variables. So just use the variables instead, to make the source code shorter and easier to read. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: avoid searching twice for previous node when merging free space entriesFilipe Manana1-3/+6
At try_merge_free_space(), avoid calling twice rb_prev() to find the previous node, as that requires looping through the red black tree, so store the result of the rb_prev() call and then use it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: avoid extra memory allocation when copying free space cacheFilipe Manana1-2/+4
At copy_free_space_cache(), we add a new entry to the block group's ctl before we free the entry from the temporary ctl. Adding a new entry requires the allocation of a new struct btrfs_free_space, so we can avoid a temporary extra allocation by freeing the entry from the temporary ctl before we add a new entry to the main ctl, which possibly also reduces the chances for a memory allocation failure in case of very high memory pressure. So just do that. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: simplify transid initialization in btrfs_ioctl_wait_syncTom Rix1-5/+4
A small code simplification, move the default value of transid to its initialization and remove the else-statement. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: output affected files when relocation failsQu Wenruo3-0/+208
[PROBLEM] When relocation fails (mostly due to checksum mismatch), we only got very cryptic error messages like: BTRFS info (device dm-4): relocating block group 13631488 flags data BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 csum 0x373e1ae3 expected csum 0x98757625 mirror 1 BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 BTRFS info (device dm-4): balance: ended with status: -5 The end user has to decipher the above messages and use various tools to locate the affected files and find a way to fix the problem (mostly deleting the file). This is not an easy work even for experienced developer, not to mention the end users. [SCRUB IS DOING BETTER] By contrast, scrub is providing much better error messages: BTRFS error (device dm-4): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488 BTRFS warning (device dm-4): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file) BTRFS info (device dm-4): scrub: finished on devid 1 with status: 0 Which provides the affected files directly to the end user. [IMPROVEMENT] Instead of the generic data checksum error messages, which is not doing a good job for data reloc inodes, this patch introduce a scrub like backref walking based solution. When a sector fails its checksum for data reloc inode, we go the following workflow: - Get the real logical bytenr For data reloc inode, the file offset is the offset inside the block group. Thus the real logical bytenr is @file_off + @block_group->start. - Do an extent type check If it's tree blocks it's much easier to handle, just go through all the tree block backref. - Do a backref walk and inode path resolution for data extents This is mostly the same as scrub. But unfortunately we can not reuse the same function as the output format is different. Now the new output would be more user friendly: BTRFS info (device dm-4): relocating block group 13631488 flags data BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 logical 13631488 csum 0x373e1ae3 expected csum 0x98757625 mirror 1 BTRFS warning (device dm-4): checksum error at logical 13631488 mirror 1 root 5 inode 257 offset 0 length 4096 links 1 (path: file) BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0 BTRFS info (device dm-4): balance: ended with status: -5 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: remove hipri_workers workqueueChristoph Hellwig4-11/+2
Now that btrfs_wq_submit_bio is never called for synchronous I/O, the hipri_workers workqueue is not used anymore and can be removed. Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: determine synchronous writers from bio or writeback controlChristoph Hellwig5-19/+3
The writeback_control structure already passes down the information about a writeback being synchronous from the core VM code, and thus information is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags helper. Use that information to decide if checksums calculation is offloaded to a workqueue instead of btrfs_inode::sync_writers field that not only bloats the inode but also has too wide scope, being inode wide instead of limited to the actual writeback request. The sync writes were set in: - btrfs_do_write_iter - regular IO, sync status is set - start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL mode - btrfs_write_marked_extents - write marked extents, writeback with WB_SYNC_ALL mode Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: submit IO synchronously for fast checksum implementationsChristoph Hellwig1-13/+7
Most modern hardware supports very fast accelerated crc32c calculation. If that is supported the CPU overhead of the checksum calculation is very limited, and offloading the calculation to special worker threads has a lot of overhead for no gain. E.g. on an Intel Optane device is actually very much slows down even 1M buffered writes with fio: Unpatched: write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets With synchronous CRCs: write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets With a lot of variation during the unpatched run going down as low as 1100MB/s, while the synchronous CRC version has about the same peak write speed but much lower dips, and fewer kworkers churning around. Both tests had fio saturated at 100% CPU. (thanks to Jens Axboe via Chris Mason for the benchmarking) Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: use SECTOR_SHIFT to convert LBA to physical offsetAnand Jain4-6/+6
Using SECTOR_SHIFT to convert LBA to physical address makes it more readable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: use SECTOR_SHIFT to convert physical offset to LBAAnand Jain5-6/+8
Use SECTOR_SHIFT while converting a physical address to an LBA, makes it more readable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: improve leaf dump and error handlingQu Wenruo2-68/+59
Improve the leaf dump behavior by: - Always dump the leaf first, then the error message - Output the slot number if possible Especially in __btrfs_free_extent() the leaf dump of extent tree can be pretty large. With an extra slot number it's much easier to locate the problem. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: print-tree: pass const extent buffer pointerQu Wenruo5-14/+14
Since print-tree infrastructure only prints the content of a tree block, we can make them to accept const extent buffer pointer. This removes a forced type convert in extent-tree, where we convert a const extent buffer pointer to regular one, just to avoid compiler warning. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: export bitmap_test_range_all_{set,zero}Naohiro Aota3-28/+26
bitmap_test_range_all_{set,zero} defined in subpage.c are useful for other components. Move them to misc.h and use them in zoned.c. Also, as find_next{,_zero}_bit take/return "unsigned long" instead of "unsigned int", convert the type to "unsigned long". While at it, also rewrite the "if (...) return true; else return false;" pattern and add const to the input bitmap. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: tag as unlikely the key comparison when checking sibling keysFilipe Manana1-1/+1
When checking siblings keys, before moving keys from one node/leaf to a sibling node/leaf, it's very unexpected to have the last key of the left sibling greater than or equals to the first key of the right sibling, as that means we have a (serious) corruption that breaks the key ordering properties of a b+tree. Since this is unexpected, surround the comparison with the unlikely macro, which helps the compiler generate better code for the most expected case (no existing b+tree corruption). This is also what we do for other unexpected cases of invalid key ordering (like at btrfs_set_item_key_safe()). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: make btrfs_free_device() staticFilipe Manana2-2/+1
The function btrfs_free_device() is never used outside of volumes.c, so make it static and remove its prototype declaration at volumes.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: don't commit transaction for every subvol createSweet Tea Dorminy1-4/+3
Recently a Meta-internal workload encountered subvolume creation taking up to 2s each, significantly slower than directory creation. As they were hoping to be able to use subvolumes instead of directories, and were looking to create hundreds, this was a significant issue. After Josef investigated, it turned out to be due to the transaction commit currently performed at the end of subvolume creation. This change improves the workload by not doing transaction commit for every subvolume creation, and merely requiring a transaction commit on fsync. In the worst case, of doing a subvolume create and fsync in a loop, this should require an equal amount of time to the current scheme; and in the best case, the internal workload creating hundreds of subvolumes before fsyncing is greatly improved. While it would be nice to be able to use the log tree and use the normal fsync path, log tree replay can't deal with new subvolume inodes presently. It's possible that there's some reason that the transaction commit is necessary for correctness during subvolume creation; however, git logs indicate that the commit dates back to the beginning of subvolume creation, and there are no notes on why it would be necessary. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: unexport btrfs_prev_leaf()Filipe Manana2-81/+81
btrfs_prev_leaf() is not used outside ctree.c, so there's no need to export it at ctree.h - just make it static at ctree.c and move its definition above btrfs_search_slot_for_read(), since that function calls it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-16Merge tag 'for-6.4-rc6-tag' of ↵Linus Torvalds3-10/+22
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two fixes for NOCOW files, a regression fix in scrub and an assertion fix: - NOCOW fixes: - keep length of iomap direct io request in case of a failure - properly pass mode of extent reference checking, this can break some cases for swapfile - fix error value confusion when scrubbing a stripe - convert assertion to a proper error handling when loading global roots, reported by syzbot" * tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: scrub: fix a return value overwrite in scrub_stripe() btrfs: do not ASSERT() on duplicated global roots btrfs: can_nocow_file_extent should pass down args->strict from callers btrfs: fix iomap_begin length for nocow writes
2023-06-14btrfs: scrub: fix a return value overwrite in scrub_stripe()Qu Wenruo1-1/+1
[RETURN VALUE OVERWRITE] Inside scrub_stripe(), we would submit all the remaining stripes after iterating all extents. But since flush_scrub_stripes() can return error, we need to avoid overwriting the existing @ret if there is any error. However the existing check is doing the wrong check: ret2 = flush_scrub_stripes(); if (!ret2) ret = ret2; This would overwrite the existing @ret to 0 as long as the final flush detects no critical errors. [FIX] We should check @ret other than @ret2 in that case. Fixes: 8eb3dd17eadd ("btrfs: dev-replace: error out if we have unrepaired metadata error during") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13btrfs: do not ASSERT() on duplicated global rootsQu Wenruo1-2/+8
[BUG] Syzbot reports a reproducible ASSERT() when using rescue=usebackuproot mount option on a corrupted fs. The full report can be found here: https://syzkaller.appspot.com/bug?extid=c4614eae20a166c25bf0 BTRFS error (device loop0: state C): failed to load root csum assertion failed: !tmp, in fs/btrfs/disk-io.c:1103 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3664! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3608 Comm: syz-executor356 Not tainted 6.0.0-rc7-syzkaller-00029-g3800a713b607 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022 RIP: 0010:assertfail+0x1a/0x1c fs/btrfs/ctree.h:3663 RSP: 0018:ffffc90003aaf250 EFLAGS: 00010246 RAX: 0000000000000032 RBX: 0000000000000000 RCX: f21c13f886638400 RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000 RBP: ffff888021c640a0 R08: ffffffff816bd38d R09: ffffed10173667f1 R10: ffffed10173667f1 R11: 1ffff110173667f0 R12: dffffc0000000000 R13: ffff8880229c21f7 R14: ffff888021c64060 R15: ffff8880226c0000 FS: 0000555556a73300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a2637d7a00 CR3: 00000000709c4000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_global_root_insert+0x1a7/0x1b0 fs/btrfs/disk-io.c:1103 load_global_roots_objectid+0x482/0x8c0 fs/btrfs/disk-io.c:2467 load_global_roots fs/btrfs/disk-io.c:2501 [inline] btrfs_read_roots fs/btrfs/disk-io.c:2528 [inline] init_tree_roots+0xccb/0x203c fs/btrfs/disk-io.c:2939 open_ctree+0x1e53/0x33df fs/btrfs/disk-io.c:3574 btrfs_fill_super+0x1c6/0x2d0 fs/btrfs/super.c:1456 btrfs_mount_root+0x885/0x9a0 fs/btrfs/super.c:1824 legacy_get_tree+0xea/0x180 fs/fs_context.c:610 vfs_get_tree+0x88/0x270 fs/super.c:1530 fc_mount fs/namespace.c:1043 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1073 btrfs_mount+0x3d3/0xbb0 fs/btrfs/super.c:1884 [CAUSE] Since the introduction of global roots, we handle csum/extent/free-space-tree roots as global roots, even if no extent-tree-v2 feature is enabled. So for regular csum/extent/fst roots, we load them into fs_info::global_root_tree rb tree. And we should not expect any conflicts in that rb tree, thus we have an ASSERT() inside btrfs_global_root_insert(). But rescue=usebackuproot can break the assumption, as we will try to load those trees again and again as long as we have bad roots and have backup roots slot remaining. So in that case we can have conflicting roots in the rb tree, and triggering the ASSERT() crash. [FIX] We can safely remove that ASSERT(), as the caller will properly put the offending root. To make further debugging easier, also add two explicit error messages: - Error message for conflicting global roots - Error message when using backup roots slot Reported-by: syzbot+a694851c6ab28cbcfb9c@syzkaller.appspotmail.com Fixes: abed4aaae4f7 ("btrfs: track the csum, extent, and free space trees in a rb tree") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13btrfs: can_nocow_file_extent should pass down args->strict from callersChris Mason1-1/+1
Commit 619104ba453ad0 ("btrfs: move common NOCOW checks against a file extent into a helper") changed our call to btrfs_cross_ref_exist() to always pass false for the 'strict' parameter. We're passing this down through the stack so that we can do a full check for cross references during swapfile activation. With strict always false, this test fails: btrfs subvol create swappy chattr +C swappy fallocate -l1G swappy/swapfile chmod 600 swappy/swapfile mkswap swappy/swapfile btrfs subvol snap swappy swapsnap btrfs subvol del -C swapsnap btrfs fi sync / sync;sync;sync swapon swappy/swapfile The fix is to just use args->strict, and everyone except swapfile activation is passing false. Fixes: 619104ba453ad0 ("btrfs: move common NOCOW checks against a file extent into a helper") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13btrfs: fix iomap_begin length for nocow writesChristoph Hellwig1-6/+12
can_nocow_extent can reduce the len passed in, which needs to be propagated to btrfs_dio_iomap_begin so that iomap does not submit more data then is mapped. This problems exists since the btrfs_get_blocks_direct helper was added in commit c5794e51784a ("btrfs: Factor out write portion of btrfs_get_blocks_direct"), but the ordered_extent splitting added in commit b73a6fd1b1ef ("btrfs: split partial dio bios before submit") added a WARN_ON that made a syzkaller test fail. Reported-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com Fixes: c5794e51784a ("btrfs: Factor out write portion of btrfs_get_blocks_direct") CC: stable@vger.kernel.org # 6.1+ Tested-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-12Merge tag 'for-6.4-rc6-tag' of ↵Linus Torvalds3-11/+30
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A more fixes and regression fixes: - in subpage mode, fix crash when repairing metadata at the end of a stripe - properly enable async discard when remounting from read-only to read-write - scrub regression fixes: - respect read-only scrub when attempting to do a repair - fix reporting of found errors, the stats don't get properly accounted after a stripe repair" * tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: scrub: also report errors hit during the initial read btrfs: scrub: respect the read-only flag during repair btrfs: properly enable async discard when switching from RO->RW btrfs: subpage: fix a crash in metadata repair path
2023-06-12block: replace fmode_t with a block-specific type for block open flagsChristoph Hellwig4-15/+15
The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: add a sb_open_mode helperChristoph Hellwig1-4/+1
Add a helper to return the open flags for blkdev_get_by* for passed in super block flags instead of open coding the logic in many places. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: use the holder as indication for exclusive opensChristoph Hellwig4-27/+25
The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12btrfs: don't pass a holder for non-exclusive blkdev_get_by_pathChristoph Hellwig3-21/+15
Passing a holder to blkdev_get_by_path when FMODE_EXCL isn't set doesn't make sense, so pass NULL instead and remove the holder argument from the call chains the only end up in non-FMODE_EXCL blkdev_get_by_path calls. Exclusive mode for device scanning is not used since commit 50d281fc434c ("btrfs: scan device in non-exclusive mode")". Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20230608110258.189493-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-10backing_dev: remove current->backing_dev_infoChristoph Hellwig1-5/+1
Patch series "cleanup the filemap / direct I/O interaction", v4. This series cleans up some of the generic write helper calling conventions and the page cache writeback / invalidation for direct I/O. This is a spinoff from the no-bufferhead kernel project, for which we'll want to an use iomap based buffered write path in the block layer. This patch (of 12): The last user of current->backing_dev_info disappeared in commit b9b1335e6403 ("remove bdi_congested() and wb_congested() and related functions"). Remove the field and all assignments to it. Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-08btrfs: scrub: also report errors hit during the initial readQu Wenruo1-6/+18
[BUG] After the recent scrub rework introduced in commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"), btrfs scrub no longer reports repaired errors any more: # mkfs.btrfs -f $dev -d DUP # mount $dev $mnt # xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file # umount $dev # xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror # mount $dev $mnt # btrfs scrub start -BR $mnt scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218 Scrub started: Tue Jun 6 14:56:50 2023 Status: finished Duration: 0:00:00 data_extents_scrubbed: 2 tree_extents_scrubbed: 18 data_bytes_scrubbed: 131072 tree_bytes_scrubbed: 294912 read_errors: 0 csum_errors: 0 <<< No errors here verify_errors: 0 [...] uncorrectable_errors: 0 unverified_errors: 0 corrected_errors: 16 <<< Only corrected errors last_physical: 2723151872 This can confuse btrfs-progs, as it relies on the csum_errors to determine if there is anything wrong. While on v6.3.x kernels, the report is different: csum_errors: 16 <<< verify_errors: 0 [...] uncorrectable_errors: 0 unverified_errors: 0 corrected_errors: 16 <<< [CAUSE] In the reworked scrub, we update the scrub progress inside scrub_stripe_report_errors(), using various bitmaps to update the result. For example for csum_errors, we use bitmap_weight() of stripe->csum_error_bitmap. Unfortunately at that stage, all error bitmaps (except init_error_bitmap) are the result of the latest repair attempt, thus if the stripe is fully repaired, those error bitmaps will all be empty, resulting the above output mismatch. To fix this, record the number of errors into stripe->init_nr_*_errors. Since we don't really care about where those errors are, we only need to record the number of errors. Then in scrub_stripe_report_errors(), use those initial numbers to update the progress other than using the latest error bitmaps. Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-08btrfs: scrub: respect the read-only flag during repairQu Wenruo1-1/+1
[BUG] With recent scrub rework, the scrub operation no longer respects the read-only flag passed by "-r" option of "btrfs scrub start" command. # mkfs.btrfs -f -d raid1 $dev1 $dev2 # mount $dev1 $mnt # xfs_io -f -d -c "pwrite -b 128K -S 0xaa 0 128k" $mnt/file # sync # xfs_io -c "pwrite -S 0xff $phy1 64k" $dev1 # xfs_io -c "pwrite -S 0xff $((phy2 + 65536)) 64k" $dev2 # mount $dev1 $mnt -o ro # btrfs scrub start -BrRd $mnt Scrub device $dev1 (id 1) done Scrub started: Tue Jun 6 09:59:14 2023 Status: finished Duration: 0:00:00 [...] corrected_errors: 16 <<< Still has corrupted sectors last_physical: 1372585984 Scrub device $dev2 (id 2) done Scrub started: Tue Jun 6 09:59:14 2023 Status: finished Duration: 0:00:00 [...] corrected_errors: 16 <<< Still has corrupted sectors last_physical: 1351614464 # btrfs scrub start -BrRd $mnt Scrub device $dev1 (id 1) done Scrub started: Tue Jun 6 10:00:17 2023 Status: finished Duration: 0:00:00 [...] corrected_errors: 0 <<< No more errors last_physical: 1372585984 Scrub device $dev2 (id 2) done [...] corrected_errors: 0 <<< No more errors last_physical: 1372585984 [CAUSE] In the newly reworked scrub code, repair is always submitted no matter if we're doing a read-only scrub. [FIX] Fix it by skipping the write submission if the scrub is a read-only one. Unfortunately for the report part, even for a read-only scrub we will still report it as corrected errors, as we know it's repairable, even we won't really submit the write. Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-06btrfs: properly enable async discard when switching from RO->RWChris Mason1-0/+6
The async discard uses the BTRFS_FS_DISCARD_RUNNING bit in the fs_info to force discards off when the filesystem has aborted or we're generally not able to run discards. This gets flipped on when we're mounted rw, and also when we go from ro->rw. Commit 63a7cb13071842 ("btrfs: auto enable discard=async when possible") enabled async discard by default, and this meant "mount -o ro /dev/xxx /yyy" had async discards turned on. Unfortunately, this meant our check in btrfs_remount_cleanup() would see that discards are already on: /* If we toggled discard async */ if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) && btrfs_test_opt(fs_info, DISCARD_ASYNC)) btrfs_discard_resume(fs_info); So, we'd never call btrfs_discard_resume() when remounting the root filesystem from ro->rw. drgn shows this really nicely: import os import sys from drgn.helpers.linux.fs import path_lookup from drgn import NULL, Object, Type, cast def btrfs_sb(sb): return cast("struct btrfs_fs_info *", sb.s_fs_info) if len(sys.argv) == 1: path = "/" else: path = sys.argv[1] fs_info = cast("struct btrfs_fs_info *", path_lookup(prog, path).mnt.mnt_sb.s_fs_info) BTRFS_FS_DISCARD_RUNNING = 1 << prog['BTRFS_FS_DISCARD_RUNNING'] if fs_info.flags & BTRFS_FS_DISCARD_RUNNING: print("discard running flag is on") else: print("discard running flag is off") [root]# mount | grep nvme /dev/nvme0n1p3 on / type btrfs (rw,relatime,compress-force=zstd:3,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/) [root]# ./discard_running.drgn discard running flag is off [root]# mount -o remount,discard=sync / [root]# mount -o remount,discard=async / [root]# ./discard_running.drgn discard running flag is on The fix is to call btrfs_discard_resume() when we're going from ro->rw. It already checks to make sure the async discard flag is on, so it'll do the right thing. Fixes: 63a7cb13071842 ("btrfs: auto enable discard=async when possible") CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-05btrfs: subpage: fix a crash in metadata repair pathQu Wenruo1-4/+5
[BUG] Test case btrfs/027 would crash with subpage (64K page size, 4K sectorsize) with the following dying messages: debug: map_length=16384 length=65536 type=metadata|raid6(0x104) assertion failed: map_length >= length, in fs/btrfs/volumes.c:8093 ------------[ cut here ]------------ kernel BUG at fs/btrfs/messages.c:259! Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: btrfs_assertfail+0x28/0x2c [btrfs] btrfs_map_repair_block+0x150/0x2b8 [btrfs] btrfs_repair_io_failure+0xd4/0x31c [btrfs] btrfs_read_extent_buffer+0x150/0x16c [btrfs] read_tree_block+0x38/0xbc [btrfs] read_tree_root_path+0xfc/0x1bc [btrfs] btrfs_get_root_ref.part.0+0xd4/0x3a8 [btrfs] open_ctree+0xa30/0x172c [btrfs] btrfs_mount_root+0x3c4/0x4a4 [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xec vfs_kern_mount.part.0+0x90/0xd4 vfs_kern_mount+0x14/0x28 btrfs_mount+0x114/0x418 [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xec path_mount+0x3e0/0xb64 __arm64_sys_mount+0x200/0x2d8 invoke_syscall+0x48/0x114 el0_svc_common.constprop.0+0x60/0x11c do_el0_svc+0x38/0x98 el0_svc+0x40/0xa8 el0t_64_sync_handler+0xf4/0x120 el0t_64_sync+0x190/0x194 Code: aa0403e2 b0fff060 91010000 959c2024 (d4210000) [CAUSE] In btrfs/027 we test RAID6 with missing devices, in this particular case, we're repairing a metadata at the end of a data stripe. But at btrfs_repair_io_failure(), we always pass a full PAGE for repair, and for subpage case this can cross stripe boundary and lead to the above BUG_ON(). This metadata repair code is always there, since the introduction of subpage support, but this can trigger BUG_ON() after the bio split ability at btrfs_map_bio(). [FIX] Instead of passing the old PAGE_SIZE, we calculate the correct length based on the eb size and page size for both regular and subpage cases. CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-05block: introduce holder opsChristoph Hellwig2-4/+4
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-03Merge tag 'for-6.4-rc4-tag' of ↵Linus Torvalds2-20/+32
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One regression fix. The rewrite of scrub code in 6.4 broke device replace in zoned mode, some of the writes could happen out of order so this had to be adjusted for all cases" * tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix dev-replace after the scrub rework
2023-06-01btrfs: zoned: fix dev-replace after the scrub reworkQu Wenruo2-20/+32
[BUG] After commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"), scrub no longer works for zoned device at all. Even an empty zoned btrfs cannot be replaced: # mkfs.btrfs -f /dev/nvme0n1 # mount /dev/nvme0n1 /mnt/btrfs # btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs Resetting device zones /dev/nvme1n1 (160 zones) ... ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error And we can hit kernel crash related to that: BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2 BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 BUG: kernel NULL pointer dereference, address: 00000000000000a8 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40 Call Trace: <IRQ> btrfs_lookup_ordered_extent+0x31/0x190 btrfs_record_physical_zoned+0x18/0x40 btrfs_simple_end_io+0xaf/0xc0 blk_update_request+0x153/0x4c0 blk_mq_end_request+0x15/0xd0 nvme_poll_cq+0x1d3/0x360 nvme_irq+0x39/0x80 __handle_irq_event_percpu+0x3b/0x190 handle_irq_event+0x2f/0x70 handle_edge_irq+0x7c/0x210 __common_interrupt+0x34/0xa0 common_interrupt+0x7d/0xa0 </IRQ> <TASK> asm_common_interrupt+0x22/0x40 [CAUSE] Dev-replace reuses scrub code to iterate all extents and write the existing content back to the new device. And for zoned devices, we call fill_writer_pointer_gap() to make sure all the writes into the zoned device is sequential, even if there may be some gaps between the writes. However we have several different bugs all related to zoned dev-replace: - We are using ZONE_APPEND operation for metadata style write back For zoned devices, btrfs has two ways to write data: * ZONE_APPEND for data This allows higher queue depth, but will not be able to know where the write would land. Thus needs to grab the real on-disk physical location in it's endio. * WRITE for metadata This requires single queue depth (new writes can only be submitted after previous one finished), and all writes must be sequential. For scrub, we go single queue depth, but still goes with ZONE_APPEND, which requires btrfs_bio::inode being populated. This is the cause of that crash. - No correct tracing of write_pointer After a write finished, we should forward sctx->write_pointer, or fill_writer_pointer_gap() would not work properly and cause more than necessary zero out, and fill the whole zone prematurely. - Incorrect physical bytenr passed to fill_writer_pointer_gap() In scrub_write_sectors(), one call site passes logical address, which is completely wrong. The other call site passes physical address of current sector, but we should pass the physical address of the btrfs_bio we're submitting. This is the cause of the -EIO errors. [FIX] - Do not use ZONE_APPEND for btrfs_submit_repair_write(). - Manually forward sctx->write_pointer after successful writeback - Use the physical address of the to-be-submitted btrfs_bio for fill_writer_pointer_gap() Now zoned device replace would work as expected. Reported-by: Christoph Hellwig <hch@lst.de> Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-31Merge tag 'for-6.4-rc4-tag' of ↵Linus Torvalds3-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "One bug fix and two build warning fixes: - call proper end bio callback for metadata RAID0 in a rare case of an unaligned block - fix uninitialized variable (reported by gcc 10.2) - fix warning about potential access beyond array bounds on mips64 with 64k pages (runtime check would not allow that)" * tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix csum_tree_block page iteration to avoid tripping on -Werror=array-bounds btrfs: fix an uninitialized variable warning in btrfs_log_inode btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_work
2023-05-27btrfs: fix csum_tree_block page iteration to avoid tripping on ↵pengfuyuan1-1/+1
-Werror=array-bounds When compiling on a MIPS 64-bit machine we get these warnings: In file included from ./arch/mips/include/asm/cacheflush.h:13, from ./include/linux/cacheflush.h:5, from ./include/linux/highmem.h:8, from ./include/linux/bvec.h:10, from ./include/linux/blk_types.h:10, from ./include/linux/blkdev.h:9, from fs/btrfs/disk-io.c:7: fs/btrfs/disk-io.c: In function ‘csum_tree_block’: fs/btrfs/disk-io.c:100:34: error: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Werror=array-bounds] 100 | kaddr = page_address(buf->pages[i]); | ~~~~~~~~~~^~~ ./include/linux/mm.h:2135:48: note: in definition of macro ‘page_address’ 2135 | #define page_address(page) lowmem_page_address(page) | ^~~~ cc1: all warnings being treated as errors We can check if i overflows to solve the problem. However, this doesn't make much sense, since i == 1 and num_pages == 1 doesn't execute the body of the loop. In addition, i < num_pages can also ensure that buf->pages[i] will not cross the boundary. Unfortunately, this doesn't help with the problem observed here: gcc still complains. To fix this add a compile-time condition for the extent buffer page array size limit, which would eventually lead to eliminating the whole for loop. CC: stable@vger.kernel.org # 5.10+ Signed-off-by: pengfuyuan <pengfuyuan@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-27btrfs: fix an uninitialized variable warning in btrfs_log_inodeShida Zhang1-1/+1
This fixes the following warning reported by gcc 10.2.1 under x86_64: ../fs/btrfs/tree-log.c: In function ‘btrfs_log_inode’: ../fs/btrfs/tree-log.c:6211:9: error: ‘last_range_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 6211 | ret = insert_dir_log_key(trans, log, path, key.objectid, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6212 | first_dir_index, last_dir_index); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../fs/btrfs/tree-log.c:6161:6: note: ‘last_range_start’ was declared here 6161 | u64 last_range_start; | ^~~~~~~~~~~~~~~~ This might be a false positive fixed in later compiler versions but we want to have it fixed. Reported-by: k2ci <kernel-bot@kylinos.cn> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-27btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_workChristoph Hellwig1-1/+1
When I implemented the storage layer bio splitting, I was under the assumption that we'll never split metadata bios. But Qu reminded me that this can actually happen with very old file systems with unaligned metadata chunks and RAID0. I still haven't seen such a case in practice, but we better handled this case, especially as it is fairly easily to do not calling the ->end_іo method directly in btrfs_end_io_work, and using the proper btrfs_orig_bbio_end_io helper instead. In addition to the old file system with unaligned metadata chunks case documented in the commit log, the combination of the new scrub code with Johannes pending raid-stripe-tree also triggers this case. We spent some time debugging it and found that this patch solves the problem. Fixes: 103c19723c80 ("btrfs: split the bio submission path into a separate file") CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>