summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2024-03-01btrfs: defrag: avoid unnecessary defrag caused by incorrect extent sizeQu Wenruo1-1/+1
commit e42b9d8b9ea2672811285e6a7654887ff64d23f3 upstream. [BUG] With the following file extent layout, defrag would do unnecessary IO and result more on-disk space usage. # mkfs.btrfs -f $dev # mount $dev $mnt # xfs_io -f -c "pwrite 0 40m" $mnt/foobar # sync # xfs_io -f -c "pwrite 40m 16k" $mnt/foobar # sync Above command would lead to the following file extent layout: item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53 generation 7 type 1 (regular) extent data disk byte 298844160 nr 41943040 extent data offset 0 nr 41943040 ram 41943040 extent compression 0 (none) item 7 key (257 EXTENT_DATA 41943040) itemoff 15763 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13631488 nr 16384 extent data offset 0 nr 16384 ram 16384 extent compression 0 (none) Which is mostly fine. We can allow the final 16K to be merged with the previous 40M, but it's upon the end users' preference. But if we defrag the file using the default parameters, it would result worse file layout: # btrfs filesystem defrag $mnt/foobar # sync item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53 generation 7 type 1 (regular) extent data disk byte 298844160 nr 41943040 extent data offset 0 nr 8650752 ram 41943040 extent compression 0 (none) item 7 key (257 EXTENT_DATA 8650752) itemoff 15763 itemsize 53 generation 9 type 1 (regular) extent data disk byte 340787200 nr 33292288 extent data offset 0 nr 33292288 ram 33292288 extent compression 0 (none) item 8 key (257 EXTENT_DATA 41943040) itemoff 15710 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13631488 nr 16384 extent data offset 0 nr 16384 ram 16384 extent compression 0 (none) Note the original 40M extent is still there, but a new 32M extent is created for no benefit at all. [CAUSE] There is an existing check to make sure we won't defrag a large enough extent (the threshold is by default 32M). But the check is using the length to the end of the extent: range_len = em->len - (cur - em->start); /* Skip too large extent */ if (range_len >= extent_thresh) goto next; This means, for the first 8MiB of the extent, the range_len is always smaller than the default threshold, and would not be defragged. But after the first 8MiB, the remaining part would fit the requirement, and be defragged. Such different behavior inside the same extent caused the above problem, and we should avoid different defrag decision inside the same extent. [FIX] Instead of using @range_len, just use @em->len, so that we have a consistent decision among the same file extent. Now with this fix, we won't touch the extent, thus not making it any worse. Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 0cb5950f3f3b ("btrfs: fix deadlock when reserving space during defrag") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23btrfs: don't drop extent_map for free space inode on write errorJosef Bacik1-2/+17
commit 5571e41ec6e56e35f34ae9f5b3a335ef510e0ade upstream. While running the CI for an unrelated change I hit the following panic with generic/648 on btrfs_holes_spacecache. assertion failed: block_start != EXTENT_MAP_HOLE, in fs/btrfs/extent_io.c:1385 ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:1385! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 2695096 Comm: fsstress Kdump: loaded Tainted: G W 6.8.0-rc2+ #1 RIP: 0010:__extent_writepage_io.constprop.0+0x4c1/0x5c0 Call Trace: <TASK> extent_write_cache_pages+0x2ac/0x8f0 extent_writepages+0x87/0x110 do_writepages+0xd5/0x1f0 filemap_fdatawrite_wbc+0x63/0x90 __filemap_fdatawrite_range+0x5c/0x80 btrfs_fdatawrite_range+0x1f/0x50 btrfs_write_out_cache+0x507/0x560 btrfs_write_dirty_block_groups+0x32a/0x420 commit_cowonly_roots+0x21b/0x290 btrfs_commit_transaction+0x813/0x1360 btrfs_sync_file+0x51a/0x640 __x64_sys_fdatasync+0x52/0x90 do_syscall_64+0x9c/0x190 entry_SYSCALL_64_after_hwframe+0x6e/0x76 This happens because we fail to write out the free space cache in one instance, come back around and attempt to write it again. However on the second pass through we go to call btrfs_get_extent() on the inode to get the extent mapping. Because this is a new block group, and with the free space inode we always search the commit root to avoid deadlocking with the tree, we find nothing and return a EXTENT_MAP_HOLE for the requested range. This happens because the first time we try to write the space cache out we hit an error, and on an error we drop the extent mapping. This is normal for normal files, but the free space cache inode is special. We always expect the extent map to be correct. Thus the second time through we end up with a bogus extent map. Since we're deprecating this feature, the most straightforward way to fix this is to simply skip dropping the extent map range for this failed range. I shortened the test by using error injection to stress the area to make it easier to reproduce. With this patch in place we no longer panic with my error injection test. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23btrfs: reject encoded write if inode has nodatasum flag setFilipe Manana1-0/+7
commit 1bd96c92c6a0a4d43815eb685c15aa4b78879dc9 upstream. Currently we allow an encoded write against inodes that have the NODATASUM flag set, either because they are NOCOW files or they were created while the filesystem was mounted with "-o nodatasum". This results in having compressed extents without corresponding checksums, which is a filesystem inconsistency reported by 'btrfs check'. For example, running btrfs/281 with MOUNT_OPTIONS="-o nodatacow" triggers this and 'btrfs check' errors out with: [1/7] checking root items [2/7] checking extents [3/7] checking free space tree [4/7] checking fs roots root 256 inode 257 errors 1040, bad file extent, some csum missing root 256 inode 258 errors 1040, bad file extent, some csum missing ERROR: errors found in fs roots (...) So reject encoded writes if the target inode has NODATASUM set. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23btrfs: don't reserve space for checksums when writing to nocow filesFilipe Manana1-10/+19
commit feefe1f49d26bad9d8997096e3a200280fa7b1c5 upstream. Currently when doing a write to a file we always reserve metadata space for inserting data checksums. However we don't need to do it if we have a nodatacow file (-o nodatacow mount option or chattr +C) or if checksums are disabled (-o nodatasum mount option), as in that case we are only adding unnecessary pressure to metadata reservations. For example on x86_64, with the default node size of 16K, a 4K buffered write into a nodatacow file is reserving 655360 bytes of metadata space, as it's accounting for checksums. After this change, which stops reserving space for checksums if we have a nodatacow file or checksums are disabled, we only need to reserve 393216 bytes of metadata. CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23btrfs: send: return EOPNOTSUPP on unknown flagsDavid Sterba1-1/+1
commit f884a9f9e59206a2d41f265e7e403f080d10b493 upstream. When some ioctl flags are checked we return EOPNOTSUPP, like for BTRFS_SCRUB_SUPPORTED_FLAGS, BTRFS_SUBVOL_CREATE_ARGS_MASK or fallocate modes. The EINVAL is supposed to be for a supported but invalid values or combination of options. Fix that when checking send flags so it's consistent with the rest. CC: stable@vger.kernel.org # 4.14+ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5rryOLzp3EKq8RTbjMHMHeaJubfpsVLF6H4qJnKCUR1w@mail.gmail.com/ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23btrfs: forbid deleting live subvol qgroupBoris Burkov1-0/+14
commit a8df35619948bd8363d330c20a90c9a7fbff28c0 upstream. If a subvolume still exists, forbid deleting its qgroup 0/subvolid. This behavior generally leads to incorrect behavior in squotas and doesn't have a legitimate purpose. Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23btrfs: do not ASSERT() if the newly created subvolume already got readQu Wenruo1-2/+11
commit e03ee2fe873eb68c1f9ba5112fee70303ebf9dfb upstream. [BUG] There is a syzbot crash, triggered by the ASSERT() during subvolume creation: assertion failed: !anon_dev, in fs/btrfs/disk-io.c:1319 ------------[ cut here ]------------ kernel BUG at fs/btrfs/disk-io.c:1319! invalid opcode: 0000 [#1] PREEMPT SMP KASAN RIP: 0010:btrfs_get_root_ref.part.0+0x9aa/0xa60 <TASK> btrfs_get_new_fs_root+0xd3/0xf0 create_subvol+0xd02/0x1650 btrfs_mksubvol+0xe95/0x12b0 __btrfs_ioctl_snap_create+0x2f9/0x4f0 btrfs_ioctl_snap_create+0x16b/0x200 btrfs_ioctl+0x35f0/0x5cf0 __x64_sys_ioctl+0x19d/0x210 do_syscall_64+0x3f/0xe0 entry_SYSCALL_64_after_hwframe+0x63/0x6b ---[ end trace 0000000000000000 ]--- [CAUSE] During create_subvol(), after inserting root item for the newly created subvolume, we would trigger btrfs_get_new_fs_root() to get the btrfs_root of that subvolume. The idea here is, we have preallocated an anonymous device number for the subvolume, thus we can assign it to the new subvolume. But there is really nothing preventing things like backref walk to read the new subvolume. If that happens before we call btrfs_get_new_fs_root(), the subvolume would be read out, with a new anonymous device number assigned already. In that case, we would trigger ASSERT(), as we really expect no one to read out that subvolume (which is not yet accessible from the fs). But things like backref walk is still possible to trigger the read on the subvolume. Thus our assumption on the ASSERT() is not correct in the first place. [FIX] Fix it by removing the ASSERT(), and just free the @anon_dev, reset it to 0, and continue. If the subvolume tree is read out by something else, it should have already get a new anon_dev assigned thus we only need to free the preallocated one. Reported-by: Chenyuan Yang <chenyuan0y@gmail.com> Fixes: 2dfb1e43f57d ("btrfs: preallocate anon block device at first phase of snapshot creation") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23btrfs: forbid creating subvol qgroupsBoris Burkov1-0/+5
commit 0c309d66dacddf8ce939b891d9ead4a8e21ad6f0 upstream. Creating a qgroup 0/subvolid leads to various races and it isn't helpful, because you can't specify a subvol id when creating a subvol, so you can't be sure it will be the right one. Any requirements on the automatic subvol can be gratified by using a higher level qgroup and the inheritance parameters of subvol creation. Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23btrfs: do not delete unused block group if it may be used soonFilipe Manana1-0/+46
commit f4a9f219411f318ae60d6ff7f129082a75686c6c upstream. Before deleting a block group that is in the list of unused block groups (fs_info->unused_bgs), we check if the block group became used before deleting it, as extents from it may have been allocated after it was added to the list. However even if the block group was not yet used, there may be tasks that have only reserved space and have not yet allocated extents, and they might be relying on the availability of the unused block group in order to allocate extents. The reservation works first by increasing the "bytes_may_use" field of the corresponding space_info object (which may first require flushing delayed items, allocating a new block group, etc), and only later a task does the actual allocation of extents. For metadata we usually don't end up using all reserved space, as we are pessimistic and typically account for the worst cases (need to COW every single node in a path of a tree at maximum possible height, etc). For data we usually reserve the exact amount of space we're going to allocate later, except when using compression where we always reserve space based on the uncompressed size, as compression is only triggered when writeback starts so we don't know in advance how much space we'll actually need, or if the data is compressible. So don't delete an unused block group if the total size of its space_info object minus the block group's size is less then the sum of used space and space that may be used (space_info->bytes_may_use), as that means we have tasks that reserved space and may need to allocate extents from the block group. In this case, besides skipping the deletion, re-add the block group to the list of unused block groups so that it may be reconsidered later, in case the tasks that reserved space end up not needing to allocate extents from it. Allowing the deletion of the block group while we have reserved space, can result in tasks failing to allocate metadata extents (-ENOSPC) while under a transaction handle, resulting in a transaction abort, or failure during writeback for the case of data extents. CC: stable@vger.kernel.org # 6.0+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23btrfs: add and use helper to check if block group is usedFilipe Manana2-2/+8
commit 1693d5442c458ae8d5b0d58463b873cd879569ed upstream. Add a helper function to determine if a block group is being used and make use of it at btrfs_delete_unused_bgs(). This helper will also be used in future code changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-01btrfs: zoned: optimize hint byte for zoned allocatorNaohiro Aota1-0/+18
[ Upstream commit 02444f2ac26eae6385a65fcd66915084d15dffba ] Writing sequentially to a huge file on btrfs on a SMR HDD revealed a decline of the performance (220 MiB/s to 30 MiB/s after 500 minutes). The performance goes down because of increased latency of the extent allocation, which is induced by a traversing of a lot of full block groups. So, this patch optimizes the ffe_ctl->hint_byte by choosing a block group with sufficient size from the active block group list, which does not contain full block groups. After applying the patch, the performance is maintained well. Fixes: 2eda57089ea3 ("btrfs: zoned: implement sequential extent allocation") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-01btrfs: zoned: factor out prepare_allocation_zoned()Naohiro Aota1-13/+19
[ Upstream commit b271fee9a41ca1474d30639fd6cc912c9901d0f8 ] Factor out prepare_allocation_zoned() for further extension. While at it, optimize the if-branch a bit. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 02444f2ac26e ("btrfs: zoned: optimize hint byte for zoned allocator") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-01btrfs: don't abort filesystem when attempting to snapshot deleted subvolumeOmar Sandoval1-0/+3
commit 7081929ab2572920e94d70be3d332e5c9f97095a upstream. If the source file descriptor to the snapshot ioctl refers to a deleted subvolume, we get the following abort: BTRFS: Transaction aborted (error -2) WARNING: CPU: 0 PID: 833 at fs/btrfs/transaction.c:1875 create_pending_snapshot+0x1040/0x1190 [btrfs] Modules linked in: pata_acpi btrfs ata_piix libata scsi_mod virtio_net blake2b_generic xor net_failover virtio_rng failover scsi_common rng_core raid6_pq libcrc32c CPU: 0 PID: 833 Comm: t_snapshot_dele Not tainted 6.7.0-rc6 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014 RIP: 0010:create_pending_snapshot+0x1040/0x1190 [btrfs] RSP: 0018:ffffa09c01337af8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff9982053e7c78 RCX: 0000000000000027 RDX: ffff99827dc20848 RSI: 0000000000000001 RDI: ffff99827dc20840 RBP: ffffa09c01337c00 R08: 0000000000000000 R09: ffffa09c01337998 R10: 0000000000000003 R11: ffffffffb96da248 R12: fffffffffffffffe R13: ffff99820535bb28 R14: ffff99820b7bd000 R15: ffff99820381ea80 FS: 00007fe20aadabc0(0000) GS:ffff99827dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000559a120b502f CR3: 00000000055b6000 CR4: 00000000000006f0 Call Trace: <TASK> ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? __warn+0x81/0x130 ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? report_bug+0x171/0x1a0 ? handle_bug+0x3a/0x70 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? create_pending_snapshot+0x1040/0x1190 [btrfs] create_pending_snapshots+0x92/0xc0 [btrfs] btrfs_commit_transaction+0x66b/0xf40 [btrfs] btrfs_mksubvol+0x301/0x4d0 [btrfs] btrfs_mksnapshot+0x80/0xb0 [btrfs] __btrfs_ioctl_snap_create+0x1c2/0x1d0 [btrfs] btrfs_ioctl_snap_create_v2+0xc4/0x150 [btrfs] btrfs_ioctl+0x8a6/0x2650 [btrfs] ? kmem_cache_free+0x22/0x340 ? do_sys_openat2+0x97/0xe0 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x46/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7fe20abe83af RSP: 002b:00007ffe6eff1360 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe20abe83af RDX: 00007ffe6eff23c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fe20ad16cd0 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffe6eff13c0 R14: 00007fe20ad45000 R15: 0000559a120b6d58 </TASK> ---[ end trace 0000000000000000 ]--- BTRFS: error (device vdc: state A) in create_pending_snapshot:1875: errno=-2 No such entry BTRFS info (device vdc: state EA): forced readonly BTRFS warning (device vdc: state EA): Skipping commit of aborted transaction. BTRFS: error (device vdc: state EA) in cleanup_transaction:2055: errno=-2 No such entry This happens because create_pending_snapshot() initializes the new root item as a copy of the source root item. This includes the refs field, which is 0 for a deleted subvolume. The call to btrfs_insert_root() therefore inserts a root with refs == 0. btrfs_get_new_fs_root() then finds the root and returns -ENOENT if refs == 0, which causes create_pending_snapshot() to abort. Fix it by checking the source root's refs before attempting the snapshot, but after locking subvol_sem to avoid racing with deletion. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-01btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_argsQu Wenruo1-0/+4
commit 173431b274a9a54fc10b273b46e67f46bcf62d2e upstream. Add extra sanity check for btrfs_ioctl_defrag_range_args::flags. This is not really to enhance fuzzing tests, but as a preparation for future expansion on btrfs_ioctl_defrag_range_args. In the future we're going to add new members, allowing more fine tuning for btrfs defrag. Without the -ENONOTSUPP error, there would be no way to detect if the kernel supports those new defrag features. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-01btrfs: don't warn if discard range is not aligned to sectorDavid Sterba1-1/+2
commit a208b3f132b48e1f94f620024e66fea635925877 upstream. There's a warning in btrfs_issue_discard() when the range is not aligned to 512 bytes, originally added in 4d89d377bbb0 ("btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries"). We can't do sub-sector writes anyway so the adjustment is the only thing that we can do and the warning is unnecessary. CC: stable@vger.kernel.org # 4.19+ Reported-by: syzbot+4a4f1eba14eb5c3417d1@syzkaller.appspotmail.com Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-01btrfs: tree-checker: fix inline ref size in error messagesChung-Chiang Cheng1-1/+1
commit f398e70dd69e6ceea71463a5380e6118f219197e upstream. The error message should accurately reflect the size rather than the type. Fixes: f82d1c7ca8ae ("btrfs: tree-checker: Add EXTENT_ITEM and METADATA_ITEM check") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-01btrfs: ref-verify: free ref cache before clearing mount optFedor Pchelkin1-2/+4
commit f03e274a8b29d1d1c1bbd7f764766cb5ca537ab7 upstream. As clearing REF_VERIFY mount option indicates there were some errors in a ref-verify process, a ref cache is not relevant anymore and should be freed. btrfs_free_ref_cache() requires REF_VERIFY option being set so call it just before clearing the mount option. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Reported-by: syzbot+be14ed7728594dc8bd42@syzkaller.appspotmail.com Fixes: fd708b81d972 ("Btrfs: add a extent ref verify tool") CC: stable@vger.kernel.org # 5.4+ Closes: https://lore.kernel.org/lkml/000000000000e5a65c05ee832054@google.com/ Reported-by: syzbot+c563a3c79927971f950f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/0000000000007fe09705fdc6086c@google.com/ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-01btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume ↵Omar Sandoval1-9/+13
being deleted commit 3324d0547861b16cf436d54abba7052e0c8aa9de upstream. Sweet Tea spotted a race between subvolume deletion and snapshotting that can result in the root item for the snapshot having the BTRFS_ROOT_SUBVOL_DEAD flag set. The race is: Thread 1 | Thread 2 ----------------------------------------------|---------- btrfs_delete_subvolume | btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)| |btrfs_mksubvol | down_read(subvol_sem) | create_snapshot | ... | create_pending_snapshot | copy root item from source down_write(subvol_sem) | This flag is only checked in send and swap activate, which this would cause to fail mysteriously. create_snapshot() now checks the root refs to reject a deleted subvolume, so we can fix this by locking subvol_sem earlier so that the BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically. CC: stable@vger.kernel.org # 4.14+ Reported-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-01btrfs: zoned: fix lock ordering in btrfs_zone_activate()Naohiro Aota1-6/+2
commit b18f3b60b35a8c01c9a2a0f0d6424c6d73971dc3 upstream. The btrfs CI reported a lockdep warning as follows by running generic generic/129. WARNING: possible circular locking dependency detected 6.7.0-rc5+ #1 Not tainted ------------------------------------------------------ kworker/u5:5/793427 is trying to acquire lock: ffff88813256d028 (&cache->lock){+.+.}-{2:2}, at: btrfs_zone_finish_one_bg+0x5e/0x130 but task is already holding lock: ffff88810a23a318 (&fs_info->zone_active_bgs_lock){+.+.}-{2:2}, at: btrfs_zone_finish_one_bg+0x34/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&fs_info->zone_active_bgs_lock){+.+.}-{2:2}: ... -> #0 (&cache->lock){+.+.}-{2:2}: ... This is because we take fs_info->zone_active_bgs_lock after a block_group's lock in btrfs_zone_activate() while doing the opposite in other places. Fix the issue by expanding the fs_info->zone_active_bgs_lock's critical section and taking it before a block_group's lock. Fixes: a7e1ac7bdc5a ("btrfs: zoned: reserve zones for an active metadata/system block group") CC: stable@vger.kernel.org # 6.6 Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-01btrfs: scrub: avoid use-after-free when chunk length is not 64K alignedQu Wenruo1-7/+22
commit f546c4282673497a06ecb6190b50ae7f6c85b02f upstream. [BUG] There is a bug report that, on a ext4-converted btrfs, scrub leads to various problems, including: - "unable to find chunk map" errors BTRFS info (device vdb): scrub: started on devid 1 BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096 BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056 This would lead to unrepariable errors. - Use-after-free KASAN reports: ================================================================== BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0 Read of size 8 at addr ffff8881013c9040 by task btrfs/909 CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023 Call Trace: <TASK> dump_stack_lvl+0x43/0x60 print_report+0xcf/0x640 kasan_report+0xa6/0xd0 __blk_rq_map_sg+0x18f/0x7c0 virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff] virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff] blk_mq_flush_plug_list.part.0+0x780/0x860 __blk_flush_plug+0x1ba/0x220 blk_finish_plug+0x3b/0x60 submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965] __x64_sys_ioctl+0xbd/0x100 do_syscall_64+0x5d/0xe0 entry_SYSCALL_64_after_hwframe+0x63/0x6b RIP: 0033:0x7f47e5e0952b - Crash, mostly due to above use-after-free [CAUSE] The converted fs has the following data chunk layout: item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80 length 86016 owner 2 stripe_len 65536 type DATA|single For above logical bytenr 2214744064, it's at the chunk end (2214658048 + 86016 = 2214744064). This means btrfs_submit_bio() would split the bio, and trigger endio function for both of the two halves. However scrub_submit_initial_read() would only expect the endio function to be called once, not any more. This means the first endio function would already free the bbio::bio, leaving the bvec freed, thus the 2nd endio call would lead to use-after-free. [FIX] - Make sure scrub_read_endio() only updates bits in its range Since we may read less than 64K at the end of the chunk, we should not touch the bits beyond chunk boundary. - Make sure scrub_submit_initial_read() only to read the chunk range This is done by calculating the real number of sectors we need to read, and add sector-by-sector to the bio. Thankfully the scrub read repair path won't need extra fixes: - scrub_stripe_submit_repair_read() With above fixes, we won't update error bit for range beyond chunk, thus scrub_stripe_submit_repair_read() should never submit any read beyond the chunk. Reported-by: Rongrong <i@rong.moe> Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") Tested-by: Rongrong <i@rong.moe> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [ Use min_t() to fix a compiling error due to difference types ] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-01btrfs: sysfs: validate scrub_speed_max valueDavid Disseldorp1-0/+4
commit 2b0122aaa800b021e36027d7f29e206f87c761d6 upstream. The value set as scrub_speed_max accepts size with suffixes (k/m/g/t/p/e) but we should still validate it for trailing characters, similar to what we do with chunk_size_store. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: David Disseldorp <ddiss@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-01btrfs: free qgroup pertrans reserve on transaction abortBoris Burkov4-4/+34
[ Upstream commit b321a52cce062ec7ed385333a33905d22159ce36 ] If we abort a transaction, we never run the code that frees the pertrans qgroup reservation. This results in warnings on unmount as that reservation has been leaked. The leak isn't a huge issue since the fs is read-only, but it's better to clean it up when we know we can/should. Do it during the cleanup_transaction step of aborting. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-01btrfs: qgroup: use qgroup_iterator in qgroup_convert_meta()Qu Wenruo1-22/+10
[ Upstream commit 0913445082496c2b29668ee26521401b273838b8 ] With the new qgroup_iterator_add() and qgroup_iterator_clean(), we can get rid of the ulist and its GFP_ATOMIC memory allocation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b321a52cce06 ("btrfs: free qgroup pertrans reserve on transaction abort") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-01btrfs: qgroup: iterate qgroups without memory allocation for qgroup_reserve()Qu Wenruo2-32/+38
[ Upstream commit 686c4a5a42635e0d2889e3eb461c554fd0b616b4 ] Qgroup heavily relies on ulist to go through all the involved qgroups, but since we're using ulist inside fs_info->qgroup_lock spinlock, this means we're doing a lot of GFP_ATOMIC allocations. This patch reduces the GFP_ATOMIC usage for qgroup_reserve() by eliminating the memory allocation completely. This is done by moving the needed memory to btrfs_qgroup::iterator list_head, so that we can put all the involved qgroup into a on-stack list, thus eliminating the need to allocate memory while holding spinlock. The only cost is the slightly higher memory usage, but considering the reduce GFP_ATOMIC during a hot path, it should still be acceptable. Function qgroup_reserve() is the perfect start point for this conversion. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b321a52cce06 ("btrfs: free qgroup pertrans reserve on transaction abort") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-20btrfs: don't clear qgroup reserved bit in release_folioBoris Burkov1-1/+2
commit a86805504b88f636a6458520d85afdf0634e3c6b upstream. The EXTENT_QGROUP_RESERVED bit is used to "lock" regions of the file for duplicate reservations. That is two writes to that range in one transaction shouldn't create two reservations, as the reservation will only be freed once when the write finally goes down. Therefore, it is never OK to clear that bit without freeing the associated qgroup reserve. At this point, we don't want to be freeing the reserve, so mask off the bit. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-20btrfs: fix qgroup_free_reserved_data int overflowBoris Burkov6-25/+31
commit 9e65bfca24cf1d77e4a5c7a170db5867377b3fe7 upstream. The reserved data counter and input parameter is a u64, but we inadvertently accumulate it in an int. Overflowing that int results in freeing the wrong amount of data and breaking reserve accounting. Unfortunately, this overflow rot spreads from there, as the qgroup release/free functions rely on returning an int to take advantage of negative values for error codes. Therefore, the full fix is to return the "released" or "freed" amount by a u64 argument and to return 0 or negative error code via the return value. Most of the call sites simply ignore the return value, though some of them handle the error and count the returned bytes. Change all of them accordingly. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-20btrfs: free qgroup reserve when ORDERED_IOERR is setBoris Burkov1-1/+3
commit f63e1164b90b385cd832ff0fdfcfa76c3cc15436 upstream. An ordered extent completing is a critical moment in qgroup reserve handling, as the ownership of the reservation is handed off from the ordered extent to the delayed ref. In the happy path we release (unlock) but do not free (decrement counter) the reservation, and the delayed ref drives the free. However, on an error, we don't create a delayed ref, since there is no ref to add. Therefore, free on the error path. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-20btrfs: do not allow non subvolume root targets for snapshotJosef Bacik1-0/+9
commit a8892fd71933126ebae3d60aec5918d4dceaae76 upstream. Our btrfs subvolume snapshot <source> <destination> utility enforces that <source> is the root of the subvolume, however this isn't enforced in the kernel. Update the kernel to also enforce this limitation to avoid problems with other users of this ioctl that don't have the appropriate checks in place. Reported-by: Martin Michaelis <code@mgjm.de> CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08btrfs: fix 64bit compat send ioctl arguments not initializing version memberDavid Sterba1-0/+1
commit 5de0434bc064606d6b7467ec3e5ad22963a18c04 upstream. When the send protocol versioning was added in 5.16 e77fbf990316 ("btrfs: send: prepare for v2 protocol"), the 32/64bit compat code was not updated (added by 2351f431f727 ("btrfs: fix send ioctl on 32bit with 64bit kernel")), missing the version struct member. The compat code is probably rarely used, nobody reported any bugs. Found by tool https://github.com/jirislaby/clang-struct . Fixes: e77fbf990316 ("btrfs: send: prepare for v2 protocol") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08btrfs: free the allocated memory if btrfs_alloc_page_array() failsQu Wenruo1-3/+8
commit 94dbf7c0871f7ae6349ba4b0341ce8f5f98a071d upstream. [BUG] If btrfs_alloc_page_array() fail to allocate all pages but part of the slots, then the partially allocated pages would be leaked in function btrfs_submit_compressed_read(). [CAUSE] As explicitly stated, if btrfs_alloc_page_array() returned -ENOMEM, caller is responsible to free the partially allocated pages. For the existing call sites, most of them are fine: - btrfs_raid_bio::stripe_pages Handled by free_raid_bio(). - extent_buffer::pages[] Handled btrfs_release_extent_buffer_pages(). - scrub_stripe::pages[] Handled by release_scrub_stripe(). But there is one exception in btrfs_submit_compressed_read(), if btrfs_alloc_page_array() failed, we didn't cleanup the array and freed the array pointer directly. Initially there is still the error handling in commit dd137dd1f2d7 ("btrfs: factor out allocating an array of pages"), but later in commit 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio"), the error handling is removed, leading to the possible memory leak. [FIX] This patch would add back the error handling first, then to prevent such situation from happening again, also Make btrfs_alloc_page_array() to free the allocated pages as a extra safety net, then we don't need to add the error handling to btrfs_submit_compressed_read(). Fixes: 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio") CC: stable@vger.kernel.org # 6.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08btrfs: make error messages more clear when getting a chunk mapFilipe Manana1-3/+4
commit 7d410d5efe04e42a6cd959bfe6d59d559fdf8b25 upstream. When getting a chunk map, at btrfs_get_chunk_map(), we do some sanity checks to verify we found a chunk map and that map found covers the logical address the caller passed in. However the messages aren't very clear in the sense that don't mention the issue is with a chunk map and one of them prints the 'length' argument as if it were the end offset of the requested range (while the in the string format we use %llu-%llu which suggests a range, and the second %llu-%llu is actually a range for the chunk map). So improve these two details in the error messages. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08btrfs: send: ensure send_fd is writableJann Horn1-1/+1
commit 0ac1d13a55eb37d398b63e6ff6db4a09a2c9128c upstream. kernel_write() requires the caller to ensure that the file is writable. Let's do that directly after looking up the ->send_fd. We don't need a separate bailout path because the "out" path already does fput() if ->send_filp is non-NULL. This has no security impact for two reasons: - the ioctl requires CAP_SYS_ADMIN - __kernel_write() bails out on read-only files - but only since 5.8, see commit a01ac27be472 ("fs: check FMODE_WRITE in __kernel_write") Reported-and-tested-by: syzbot+12e098239d20385264d3@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=12e098239d20385264d3 Fixes: 31db9f7c23fb ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08btrfs: fix off-by-one when checking chunk map includes logical addressFilipe Manana1-1/+1
commit 5fba5a571858ce2d787fdaf55814e42725bfa895 upstream. At btrfs_get_chunk_map() we get the extent map for the chunk that contains the given logical address stored in the 'logical' argument. Then we do sanity checks to verify the extent map contains the logical address. One of these checks verifies if the extent map covers a range with an end offset behind the target logical address - however this check has an off-by-one error since it will consider an extent map whose start offset plus its length matches the target logical address as inclusive, while the fact is that the last byte it covers is behind the target logical address (by 1). So fix this condition by using '<=' rather than '<' when comparing the extent map's "start + length" against the target logical address. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod()Bragatheswaran Manickavel1-0/+2
commit f91192cd68591c6b037da345bc9fcd5e50540358 upstream. In btrfs_ref_tree_mod(), when !parent 're' was allocated through kmalloc(). In the following code, if an error occurs, the execution will be redirected to 'out' or 'out_unlock' and the function will be exited. However, on some of the paths, 're' are not deallocated and may lead to memory leaks. For example: lookup_block_entry() for 'be' returns NULL, the out label will be invoked. During that flow ref and 'ra' are freed but not 're', which can potentially lead to a memory leak. CC: stable@vger.kernel.org # 5.10+ Reported-and-tested-by: syzbot+d66de4cbf532749df35f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d66de4cbf532749df35f Signed-off-by: Bragatheswaran Manickavel <bragathemanick0908@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08btrfs: add dmesg output for first mount and last unmount of a filesystemQu Wenruo2-1/+5
commit 2db313205f8b96eea467691917138d646bb50aef upstream. There is a feature request to add dmesg output when unmounting a btrfs. There are several alternative methods to do the same thing, but with their own problems: - Use eBPF to watch btrfs_put_super()/open_ctree() Not end user friendly, they have to dip their head into the source code. - Watch for directory /sys/fs/<uuid>/ This is way more simple, but still requires some simple device -> uuid lookups. And a script needs to use inotify to watch /sys/fs/. Compared to all these, directly outputting the information into dmesg would be the most simple one, with both device and UUID included. And since we're here, also add the output when mounting a filesystem for the first time for parity. A more fine grained monitoring of subvolume mounts should be done by another layer, like audit. Now mounting a btrfs with all default mkfs options would look like this: [81.906566] BTRFS info (device dm-8): first mount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2 [81.907494] BTRFS info (device dm-8): using crc32c (crc32c-intel) checksum algorithm [81.908258] BTRFS info (device dm-8): using free space tree [81.912644] BTRFS info (device dm-8): auto enabling async discard [81.913277] BTRFS info (device dm-8): checking UUID tree [91.668256] BTRFS info (device dm-8): last unmount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2 CC: stable@vger.kernel.org # 5.4+ Link: https://github.com/kdave/btrfs-progs/issues/689 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28btrfs: zoned: wait for data BG to be finished on direct IO allocationNaohiro Aota1-0/+7
commit 776a838f1fa95670c1c1cf7109a898090b473fa3 upstream. Running the fio command below on a ZNS device results in "Resource temporarily unavailable" error. $ sudo fio --name=w --directory=/mnt --filesize=1GB --bs=16MB --numjobs=16 \ --rw=write --ioengine=libaio --iodepth=128 --direct=1 fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=117440512, buflen=16777216 fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=134217728, buflen=16777216 ... This happens because -EAGAIN error returned from btrfs_reserve_extent() called from btrfs_new_extent_direct() is spilling over to the userland. btrfs_reserve_extent() returns -EAGAIN when there is no active zone available. Then, the caller should wait for some other on-going IO to finish a zone and retry the allocation. This logic is already implemented for buffered write in cow_file_range(), but it is missing for the direct IO counterpart. Implement the same logic for it. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 2ce543f47843 ("btrfs: zoned: wait until zone is finished when allocation didn't progress") CC: stable@vger.kernel.org # 6.1+ Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28btrfs: don't arbitrarily slow down delalloc if we're committingJosef Bacik1-3/+0
commit 11aeb97b45ad2e0040cbb2a589bc403152526345 upstream. We have a random schedule_timeout() if the current transaction is committing, which seems to be a holdover from the original delalloc reservation code. Remove this, we have the proper flushing stuff, we shouldn't be hoping for random timing things to make everything work. This just induces latency for no reason. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28btrfs: abort transaction on generation mismatch when marking eb as dirtyFilipe Manana25-169/+205
[ Upstream commit 50564b651d01c19ce732819c5b3c3fd60707188e ] When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20btrfs: make found_logical_ret parameter mandatory for function ↵Qu Wenruo1-3/+7
queue_scrub_stripe() [ Upstream commit 47e2b06b7b5cb356a987ba3429550c3a89ea89d6 ] [BUG] There is a compilation warning reported on commit ae76d8e3e135 ("btrfs: scrub: fix grouping of read IO"), where gcc (14.0.0 20231022 experimental) is reporting the following uninitialized variable: fs/btrfs/scrub.c: In function ‘scrub_simple_mirror.isra’: fs/btrfs/scrub.c:2075:29: error: ‘found_logical’ may be used uninitialized [-Werror=maybe-uninitialized[https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wmaybe-uninitialized]] 2075 | cur_logical = found_logical + BTRFS_STRIPE_LEN; fs/btrfs/scrub.c:2040:21: note: ‘found_logical’ was declared here 2040 | u64 found_logical; | ^~~~~~~~~~~~~ [CAUSE] This is a false alert, as @found_logical is passed as parameter @found_logical_ret of function queue_scrub_stripe(). As long as queue_scrub_stripe() returned 0, we would update @found_logical_ret. And if queue_scrub_stripe() returned >0 or <0, the caller would not utilized @found_logical, thus there should be nothing wrong. Although the triggering gcc is still experimental, it looks like the extra check on "if (found_logical_ret)" can sometimes confuse the compiler. Meanwhile the only caller of queue_scrub_stripe() is always passing a valid pointer, there is no need for such check at all. [FIX] Although the report itself is a false alert, we can still make it more explicit by: - Replace the check for @found_logical_ret with ASSERT() - Initialize @found_logical to U64_MAX - Add one extra ASSERT() to make sure @found_logical got updated Link: https://lore.kernel.org/linux-btrfs/87fs1x1p93.fsf@gentoo.org/ Fixes: ae76d8e3e135 ("btrfs: scrub: fix grouping of read IO") Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20btrfs: use u64 for buffer sizes in the tree search ioctlsFilipe Manana1-5/+5
[ Upstream commit dec96fc2dcb59723e041416b8dc53e011b4bfc2e ] In the tree search v2 ioctl we use the type size_t, which is an unsigned long, to track the buffer size in the local variable 'buf_size'. An unsigned long is 32 bits wide on a 32 bits architecture. The buffer size defined in struct btrfs_ioctl_search_args_v2 is a u64, so when we later try to copy the local variable 'buf_size' to the argument struct, when the search returns -EOVERFLOW, we copy only 32 bits which will be a problem on big endian systems. Fix this by using a u64 type for the buffer sizes, not only at btrfs_ioctl_tree_search_v2(), but also everywhere down the call chain so that we can use the u64 at btrfs_ioctl_tree_search_v2(). Fixes: cc68a8a5a433 ("btrfs: new ioctl TREE_SEARCH_V2") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/linux-btrfs/ce6f4bd6-9453-4ffe-ba00-cee35495e10f@moroto.mountain/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-23Merge tag 'for-6.6-rc7-tag' of ↵Linus Torvalds5-15/+33
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more fix for a problem with snapshot of a newly created subvolume that can lead to inconsistent data under some circumstances. Kernel 6.5 added a performance optimization to skip transaction commit for subvolume creation but this could end up with newer data on disk but not linked to other structures. The fix itself is an added condition, the rest of the patch is a parameter added to several functions" * tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix unwritten extent buffer after snapshotting a new subvolume
2023-10-23btrfs: fix unwritten extent buffer after snapshotting a new subvolumeFilipe Manana5-15/+33
When creating a snapshot of a subvolume that was created in the current transaction, we can end up not persisting a dirty extent buffer that is referenced by the snapshot, resulting in IO errors due to checksum failures when trying to read the extent buffer later from disk. A sequence of steps that leads to this is the following: 1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical address 36007936, for the leaf/root of a new subvolume that has an ID of 291. We mark the extent buffer as dirty, and at this point the subvolume tree has a single node/leaf which is also its root (level 0); 2) We no longer commit the transaction used to create the subvolume at create_subvol(). We used to, but that was recently removed in commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create"); 3) The transaction used to create the subvolume has an ID of 33, so the extent buffer 36007936 has a generation of 33; 4) Several updates happen to subvolume 291 during transaction 33, several files created and its tree height changes from 0 to 1, so we end up with a new root at level 1 and the extent buffer 36007936 is now a leaf of that new root node, which is extent buffer 36048896. The commit root remains as 36007936, since we are still at transaction 33; 5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and we end up at transaction.c:create_pending_snapshot(), in the critical section of a transaction commit. There we COW the root of subvolume 291, which is extent buffer 36048896. The COW operation returns extent buffer 36048896, since there's no need to COW because the extent buffer was created in this transaction and it was not written yet. The we call btrfs_copy_root() against the root node 36048896. During this operation we allocate a new extent buffer to turn into the root node of the snapshot, copy the contents of the root node 36048896 into this snapshot root extent buffer, set the owner to 292 (the ID of the snapshot), etc, and then we call btrfs_inc_ref(). This will create a delayed reference for each leaf pointed by the root node with a reference root of 292 - this includes a reference for the leaf 36007936. After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state. Then we call btrfs_insert_dir_item(), to create the directory entry in in the tree of subvolume 291 that points to the snapshot. This ends up needing to modify leaf 36007936 to insert the respective directory items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state, we need to COW the leaf. We end up at btrfs_force_cow_block() and then at update_ref_for_cow(). At update_ref_for_cow() we call btrfs_block_can_be_shared() which returns false, despite the fact the leaf 36007936 is shared - the subvolume's root and the snapshot's root point to that leaf. The reason that it incorrectly returns false is because the commit root of the subvolume is extent buffer 36007936 - it was the initial root of the subvolume when we created it. So btrfs_block_can_be_shared() which has the following logic: int btrfs_block_can_be_shared(struct btrfs_root *root, struct extent_buffer *buf) { if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) && buf != root->node && buf != root->commit_root && (btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item) || btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))) return 1; return 0; } Returns false (0) since 'buf' (extent buffer 36007936) matches the root's commit root. As a result, at update_ref_for_cow(), we don't check for the number of references for extent buffer 36007936, we just assume it's not shared and therefore that it has only 1 reference, so we set the local variable 'refs' to 1. Later on, in the final if-else statement at update_ref_for_cow(): static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, struct extent_buffer *cow, int *last_ref) { (...) if (refs > 1) { (...) } else { (...) btrfs_clear_buffer_dirty(trans, buf); *last_ref = 1; } } So we mark the extent buffer 36007936 as not dirty, and as a result we don't write it to disk later in the transaction commit, despite the fact that the snapshot's root points to it. Attempting to access the leaf or dumping the tree for example shows that the extent buffer was not written: $ btrfs inspect-internal dump-tree -t 292 /dev/sdb btrfs-progs v6.2.2 file tree key (292 ROOT_ITEM 33) node 36110336 level 1 items 2 free space 119 generation 33 owner 292 node 36110336 flags 0x1(WRITTEN) backref revision 1 checksum stored a8103e3e checksum calced a8103e3e fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07 key (256 INODE_ITEM 0) block 36007936 gen 33 key (257 EXTENT_DATA 0) block 36052992 gen 33 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 total bytes 107374182400 bytes used 38572032 uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 The respective on disk region is full of zeroes as the device was trimmed at mkfs time. Obviously 'btrfs check' also detects and complains about this: $ btrfs check /dev/sdb Opening filesystem to check... Checking filesystem on /dev/sdb UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 generation: 33 (33) [1/7] checking root items [2/7] checking extents checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 owner ref check failed [36007936 4096] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 The following tree block(s) is corrupted in tree 292: tree block bytenr: 36110336, level: 1, node key: (256, 1, 0) root 292 root dir 256 not found ERROR: errors found in fs roots found 38572032 bytes used, error(s) found total csum bytes: 16048 total tree bytes: 1265664 total fs tree bytes: 1118208 total extent tree bytes: 65536 btree space waste bytes: 562598 file data blocks allocated: 65978368 referenced 36569088 Fix this by updating btrfs_block_can_be_shared() to consider that an extent buffer may be shared if it matches the commit root and if its generation matches the current transaction's generation. This can be reproduced with the following script: $ cat test.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi # Use a filesystem with a 64K node size so that we have the same node # size on every machine regardless of its page size (on x86_64 default # node size is 16K due to the 4K page size, while on PPC it's 64K by # default). This way we can make sure we are able to create a btree for # the subvolume with a height of 2. mkfs.btrfs -f -n 64K $DEV mount $DEV $MNT btrfs subvolume create $MNT/subvol # Create a few empty files on the subvolume, this bumps its btree # height to 2 (root node at level 1 and 2 leaves). for ((i = 1; i <= 300; i++)); do echo -n > $MNT/subvol/file_$i done btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap umount $DEV btrfs check $DEV Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment): $ ./test.sh Create subvolume '/mnt/sdi/subvol' Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap' Opening filesystem to check... Checking filesystem on /dev/sdi UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1 [1/7] checking root items [2/7] checking extents parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure owner ref check failed [30539776 65536] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0) Wrong generation of child node/leaf, wanted: 5, have: 7 root 257 root dir 256 not found ERROR: errors found in fs roots found 917504 bytes used, error(s) found total csum bytes: 0 total tree bytes: 851968 total fs tree bytes: 393216 total extent tree bytes: 65536 btree space waste bytes: 736550 file data blocks allocated: 0 referenced 0 A test case for fstests will follow soon. Fixes: 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create") CC: stable@vger.kernel.org # 6.5+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-19Merge tag 'for-6.6-rc6-tag' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "Fix a bug in chunk size decision that could lead to suboptimal placement and filling patterns" * tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix stripe length calculation for non-zoned data chunk allocation
2023-10-15btrfs: fix stripe length calculation for non-zoned data chunk allocationZygo Blaxell1-1/+1
Commit f6fca3917b4d "btrfs: store chunk size in space-info struct" broke data chunk allocations on non-zoned multi-device filesystems when using default chunk_size. Commit 5da431b71d4b "btrfs: fix the max chunk size and stripe length calculation" partially fixed that, and this patch completes the fix for that case. After commit f6fca3917b4d and 5da431b71d4b, the sequence of events for a data chunk allocation on a non-zoned filesystem is: 1. btrfs_create_chunk calls init_alloc_chunk_ctl, which copies space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len unmodified. Before f6fca3917b4d, ctl->max_stripe_len value was 1 GiB for non-zoned data chunks and not configurable. 2. btrfs_create_chunk calls gather_device_info which consumes and produces more fields of chunk_ctl. 3. gather_device_info multiplies ctl->max_stripe_len by ctl->dev_stripes (which is 1 in all cases except dup) and calls find_free_dev_extent with that number as num_bytes. 4. find_free_dev_extent locates the first dev_extent hole on a device which is at least as large as num_bytes. With default max_chunk_size from f6fca3917b4d, it finds the first hole which is longer than 10 GiB, or the largest hole if that hole is shorter than 10 GiB. This is different from the pre-f6fca3917b4d behavior, where num_bytes is 1 GiB, and find_free_dev_extent may choose a different hole. 5. gather_device_info repeats step 4 with all devices to find the first or largest dev_extent hole that can be allocated on each device. 6. gather_device_info sorts the device list by the hole size on each device, using total unallocated space on each device to break ties, then returns to btrfs_create_chunk with the list. 7. btrfs_create_chunk calls decide_stripe_size_regular. 8. decide_stripe_size_regular finds the largest stripe_len that fits across the first nr_devs device dev_extent holes that were found by gather_device_info (and satisfies other constraints on stripe_len that are not relevant here). 9. decide_stripe_size_regular caps the length of the stripe it computed at 1 GiB. This cap appeared in 5da431b71d4b to correct one of the other regressions introduced in f6fca3917b4d. 10. btrfs_create_chunk creates a new chunk with the above computed size and number of devices. At step 4, gather_device_info() has found a location where stripe up to 10 GiB in length could be allocated on several devices, and selected which devices should have a dev_extent allocated on them, but at step 9, only 1 GiB of the space that was found on each device can be used. This mismatch causes new suboptimal chunk allocation cases that did not occur in pre-f6fca3917b4d kernels. Consider a filesystem using raid1 profile with 3 devices. After some balances, device 1 has 10x 1 GiB unallocated space, while devices 2 and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of space, but distributed across different numbers of dev_extent holes. For visualization, let's ignore all the chunks that were allocated before this point, and focus on the remaining holes: Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated) Device 2: [__________] (10 GiB contig unallocated) Device 3: [__________] (10 GiB contig unallocated) Before f6fca3917b4d, the allocator would fill these optimally by allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3 ([13]), or 2 and 3 ([23]): [after 0 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [__________] (10 GiB) Device 3: [__________] (10 GiB) [after 1 chunk allocation] Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] Device 2: [12] [_________] (9 GiB) Device 3: [__________] (10 GiB) [after 2 chunk allocations] Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB) Device 2: [12] [_________] (9 GiB) Device 3: [13] [_________] (9 GiB) [after 3 chunk allocations] Device 1: [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB) Device 2: [12] [12] [________] (8 GiB) Device 3: [13] [_________] (9 GiB) [...] [after 12 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB) [after 13 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB) [after 14 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB) [after 15 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full) This allocates all of the space with no waste. The sorting function used by gather_device_info considers free space holes above 1 GiB in length to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently long hole on each device, all the holes appear equal in the sort, and the comparison falls back to sorting devices by total free space. This keeps usable space on each device equal so they can all be filled completely. After f6fca3917b4d, the allocator prefers the devices with larger holes over the devices with more free space, so it makes bad allocation choices: [after 1 chunk allocation] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [_________] (9 GiB) Device 3: [23] [_________] (9 GiB) [after 2 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [________] (8 GiB) Device 3: [23] [23] [________] (8 GiB) [after 3 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [23] [_______] (7 GiB) Device 3: [23] [23] [23] [_______] (7 GiB) [...] [after 9 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) [after 10 chunk allocations] Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) [after 11 chunk allocations] Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [13] (full) No further allocations are possible, with 8 GiB wasted (4 GiB of data space). The sort in gather_device_info now considers free space in holes longer than 1 GiB to be distinct, so it will prefer devices 2 and 3 over device 1 until all but 1 GiB is allocated on devices 2 and 3. At that point, with only 1 GiB unallocated on every device, the largest hole length on each device is equal at 1 GiB, so the sort finally moves to ordering the devices with the most free space, but by this time it is too late to make use of the free space on device 1. Note that it's possible to contrive a case where the pre-f6fca3917b4d allocator fails the same way, but these cases generally have extensive dev_extent fragmentation as a precondition (e.g. many holes of 768M in length on one device, and few holes 1 GiB in length on the others). With the regression in f6fca3917b4d, bad chunk allocation can occur even under optimal conditions, when all dev_extent holes are exact multiples of stripe_len in length, as in the example above. Also note that post-f6fca3917b4d kernels do treat dev_extent holes larger than 10 GiB as equal, so the bad behavior won't show up on a freshly formatted filesystem; however, as the filesystem ages and fills up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem can show up as a failure to balance after adding or removing devices, or an unexpected shortfall in available space due to unequal allocation. To fix the regression and make data chunk allocation work again, set ctl->max_stripe_len back to the original SZ_1G, or space_info->chunk_size if that's smaller (the latter can happen if the user set space_info->chunk_size to less than 1 GiB via sysfs, or it's a 32 MiB system chunk with a hardcoded chunk_size and stripe_len). While researching the background of the earlier commits, I found that an identical fix was already proposed at: https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/ The previous review missed one detail: ctl->max_stripe_len is used before decide_stripe_size_regular() is called, when it is too late for the changes in that function to have any effect. ctl->max_stripe_len is not used directly by decide_stripe_size_regular(), but the parameter does heavily influence the per-device free space data presented to the function. Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct") CC: stable@vger.kernel.org # 6.1+ Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-11Merge tag 'for-6.6-rc5-tag' of ↵Linus Torvalds3-6/+2
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A revert of recent mount option parsing fix, this breaks mounts with security options. The second patch is a flexible array annotation" * tag 'for-6.6-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size() Revert "btrfs: reject unknown mount options early"
2023-10-11btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size()Gustavo A. R. Silva2-2/+2
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). While there, use struct_size() helper, instead of the open-coded version, to calculate the size for the allocation of the whole flexible structure, including of course, the flexible-array member. This code was found with the help of Coccinelle, and audited and fixed manually. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-10Revert "btrfs: reject unknown mount options early"David Sterba1-4/+0
This reverts commit 5f521494cc73520ffac18ede0758883b9aedd018. The patch breaks mounts with security mount options like $ mount -o context=system_u:object_r:root_t:s0 /dev/sdX /mn mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sdX, missing codepage or helper program, ... We cannot reject all unknown options in btrfs_parse_subvol_options() as intended, the security options can be present at this point and it's not possible to enumerate them in a future proof way. This means unknown mount options are silently accepted like before when the filesystem is mounted with either -o subvol=/path or as followup mounts of the same device. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-06Merge tag 'for-6.6-rc4-tag' of ↵Linus Torvalds4-17/+47
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - reject unknown mount options - adjust transaction abort error message level - fix one more build warning with -Wmaybe-uninitialized - proper error handling in several COW-related cases * tag 'for-6.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: error out when reallocating block for defrag using a stale transaction btrfs: error when COWing block from a root that is being deleted btrfs: error out when COWing block using a stale transaction btrfs: always print transaction aborted messages with an error level btrfs: reject unknown mount options early btrfs: fix some -Wmaybe-uninitialized warnings in ioctl.c
2023-10-04btrfs: error out when reallocating block for defrag using a stale transactionFilipe Manana1-2/+16
At btrfs_realloc_node() we have these checks to verify we are not using a stale transaction (a past transaction with an unblocked state or higher), and the only thing we do is to trigger two WARN_ON(). This however is a critical problem, highly unexpected and if it happens it's most likely due to a bug, so we should error out and turn the fs into error state so that such issue is much more easily noticed if it's triggered. The problem is critical because in btrfs_realloc_node() we COW tree blocks, and using such stale transaction will lead to not persisting the extent buffers used for the COW operations, as allocating tree block adds the range of the respective extent buffers to the ->dirty_pages iotree of the transaction, and a stale transaction, in the unlocked state or higher, will not flush dirty extent buffers anymore, therefore resulting in not persisting the tree block and resource leaks (not cleaning the dirty_pages iotree for example). So do the following changes: 1) Return -EUCLEAN if we find a stale transaction; 2) Turn the fs into error state, with error -EUCLEAN, so that no transaction can be committed, and generate a stack trace; 3) Combine both conditions into a single if statement, as both are related and have the same error message; 4) Mark the check as unlikely, since this is not expected to ever happen. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-04btrfs: error when COWing block from a root that is being deletedFilipe Manana1-3/+7
At btrfs_cow_block() we check if the block being COWed belongs to a root that is being deleted and if so we log an error message. However this is an unexpected case and it indicates a bug somewhere, so we should return an error and abort the transaction. So change this in the following ways: 1) Abort the transaction with -EUCLEAN, so that if the issue ever happens it can easily be noticed; 2) Change the logged message level from error to critical, and change the message itself to print the block's logical address and the ID of the root; 3) Return -EUCLEAN to the caller; 4) As this is an unexpected scenario, that should never happen, mark the check as unlikely, allowing the compiler to potentially generate better code. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>