| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit b17b79ff896305fd74980a5f72afec370ee88ca4 ]
[BUG]
When recovering relocation at mount time, merge_reloc_root() and
btrfs_drop_snapshot() both use BUG_ON(level == 0) to guard against
an impossible state: a non-zero drop_progress combined with a zero
drop_level in a root_item, which can be triggered:
------------[ cut here ]------------
kernel BUG at fs/btrfs/relocation.c:1545!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 283 ... Tainted: 6.18.0+ #16 PREEMPT(voluntary)
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2
RIP: 0010:merge_reloc_root+0x1266/0x1650 fs/btrfs/relocation.c:1545
Code: ffff0000 00004589 d7e9acfa ffffe8a1 79bafebe 02000000
Call Trace:
merge_reloc_roots+0x295/0x890 fs/btrfs/relocation.c:1861
btrfs_recover_relocation+0xd6e/0x11d0 fs/btrfs/relocation.c:4195
btrfs_start_pre_rw_mount+0xa4d/0x1810 fs/btrfs/disk-io.c:3130
open_ctree+0x5824/0x5fe0 fs/btrfs/disk-io.c:3640
btrfs_fill_super fs/btrfs/super.c:987 [inline]
btrfs_get_tree_super fs/btrfs/super.c:1951 [inline]
btrfs_get_tree_subvol fs/btrfs/super.c:2094 [inline]
btrfs_get_tree+0x111c/0x2190 fs/btrfs/super.c:2128
vfs_get_tree+0x9a/0x370 fs/super.c:1758
fc_mount fs/namespace.c:1199 [inline]
do_new_mount_fc fs/namespace.c:3642 [inline]
do_new_mount fs/namespace.c:3718 [inline]
path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
do_mount fs/namespace.c:4041 [inline]
__do_sys_mount fs/namespace.c:4229 [inline]
__se_sys_mount fs/namespace.c:4206 [inline]
__x64_sys_mount+0x282/0x320 fs/namespace.c:4206
...
RIP: 0033:0x7f969c9a8fde
Code: 0f1f4000 48c7c2b0 fffffff7 d8648902 b8ffffff ffc3660f
---[ end trace 0000000000000000 ]---
The bug is reproducible on 7.0.0-rc2-next-20260310 with our dynamic
metadata fuzzing tool that corrupts btrfs metadata at runtime.
[CAUSE]
A non-zero drop_progress.objectid means an interrupted
btrfs_drop_snapshot() left a resume point on disk, and in that case
drop_level must be greater than 0 because the checkpoint is only
saved at internal node levels.
Although this invariant is enforced when the kernel writes the root
item, it is not validated when the root item is read back from disk.
That allows on-disk corruption to provide an invalid state with
drop_progress.objectid != 0 and drop_level == 0.
When relocation recovery later processes such a root item,
merge_reloc_root() reads drop_level and hits BUG_ON(level == 0). The
same invalid metadata can also trigger the corresponding BUG_ON() in
btrfs_drop_snapshot().
[FIX]
Fix this by validating the root_item invariant in tree-checker when
reading root items from disk: if drop_progress.objectid is non-zero,
drop_level must also be non-zero. Reject such malformed metadata with
-EUCLEAN before it reaches merge_reloc_root() or btrfs_drop_snapshot()
and triggers the BUG_ON.
After the fix, the same corruption is correctly rejected by tree-checker
and the BUG_ON is no longer triggered.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f9a4e3015db1aeafbef407650eb8555445ca943e ]
Currently our qgroup ioctls don't reserve any space, they just do a
transaction join, which does not reserve any space, neither for the quota
tree updates nor for the delayed refs generated when updating the quota
tree. The quota root uses the global block reserve, which is fine most of
the time since we don't expect a lot of updates to the quota root, or to
be too close to -ENOSPC such that other critical metadata updates need to
resort to the global reserve.
However this is not optimal, as not reserving proper space may result in a
transaction abort due to not reserving space for delayed refs and then
abusing the use of the global block reserve.
For example, the following reproducer (which is unlikely to model any
real world use case, but just to illustrate the problem), triggers such a
transaction abort due to -ENOSPC when running delayed refs:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
umount $DEV &> /dev/null
# Limit device to 1G so that it's much faster to reproduce the issue.
mkfs.btrfs -f -b 1G $DEV
mount -o commit=600 $DEV $MNT
fallocate -l 800M $MNT/filler
btrfs quota enable $MNT
for ((i = 1; i <= 400000; i++)); do
btrfs qgroup create 1/$i $MNT
done
umount $MNT
When running this, we can see in dmesg/syslog that a transaction abort
happened:
[436.490] BTRFS error (device nullb0): failed to run delayed ref for logical 30408704 num_bytes 16384 type 176 action 1 ref_mod 1: -28
[436.493] ------------[ cut here ]------------
[436.494] BTRFS: Transaction aborted (error -28)
[436.495] WARNING: fs/btrfs/extent-tree.c:2247 at btrfs_run_delayed_refs+0xd9/0x110 [btrfs], CPU#4: umount/2495372
[436.497] Modules linked in: btrfs loop (...)
[436.508] CPU: 4 UID: 0 PID: 2495372 Comm: umount Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
[436.510] Tainted: [W]=WARN
[436.511] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[436.513] RIP: 0010:btrfs_run_delayed_refs+0xdf/0x110 [btrfs]
[436.514] Code: 0f 82 ea (...)
[436.518] RSP: 0018:ffffd511850b7d78 EFLAGS: 00010292
[436.519] RAX: 00000000ffffffe4 RBX: ffff8f120dad37e0 RCX: 0000000002040001
[436.520] RDX: 0000000000000002 RSI: 00000000ffffffe4 RDI: ffffffffc090fd80
[436.522] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffffffc04d1867
[436.523] R10: ffff8f18dc1fffa8 R11: 0000000000000003 R12: ffff8f173aa89400
[436.524] R13: 0000000000000000 R14: ffff8f173aa89400 R15: 0000000000000000
[436.526] FS: 00007fe59045d840(0000) GS:ffff8f192e22e000(0000) knlGS:0000000000000000
[436.527] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436.528] CR2: 00007fe5905ff2b0 CR3: 000000060710a002 CR4: 0000000000370ef0
[436.530] Call Trace:
[436.530] <TASK>
[436.530] btrfs_commit_transaction+0x73/0xc00 [btrfs]
[436.531] ? btrfs_attach_transaction_barrier+0x1e/0x70 [btrfs]
[436.532] sync_filesystem+0x7a/0x90
[436.533] generic_shutdown_super+0x28/0x180
[436.533] kill_anon_super+0x12/0x40
[436.534] btrfs_kill_super+0x12/0x20 [btrfs]
[436.534] deactivate_locked_super+0x2f/0xb0
[436.534] cleanup_mnt+0xea/0x180
[436.535] task_work_run+0x58/0xa0
[436.535] exit_to_user_mode_loop+0xed/0x480
[436.536] ? __x64_sys_umount+0x68/0x80
[436.536] do_syscall_64+0x2a5/0xf20
[436.537] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[436.537] RIP: 0033:0x7fe5906b6217
[436.538] Code: 0d 00 f7 (...)
[436.540] RSP: 002b:00007ffcd87a61f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[436.541] RAX: 0000000000000000 RBX: 00005618b9ecadc8 RCX: 00007fe5906b6217
[436.541] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005618b9ecb100
[436.542] RBP: 0000000000000000 R08: 00007ffcd87a4fe0 R09: 00000000ffffffff
[436.544] R10: 0000000000000103 R11: 0000000000000246 R12: 00007fe59081626c
[436.544] R13: 00005618b9ecb100 R14: 0000000000000000 R15: 00005618b9ecacc0
[436.545] </TASK>
[436.545] ---[ end trace 0000000000000000 ]---
Fix this by changing the qgroup ioctls to use start transaction instead of
joining so that proper space is reserved for the delayed refs generated
for the updates to the quota root. This way we don't get any transaction
abort.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 77603ab10429fe713a03345553ca8dbbfb1d91c6 ]
Shin'ichiro reported sporadic hangs when running generic/013 in our CI
system. When enabling lockdep, there is a lockdep splat when calling
btrfs_get_dev_zone_info_all_devices() in the mount path that can be
triggered by i.e. generic/013:
======================================================
WARNING: possible circular locking dependency detected
7.0.0-rc1+ #355 Not tainted
------------------------------------------------------
mount/1043 is trying to acquire lock:
ffff8881020b5470 (&vblk->vdev_mutex){+.+.}-{4:4}, at: virtblk_report_zones+0xda/0x430
but task is already holding lock:
ffff888102a738e0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: btrfs_get_dev_zone_info_all_devices+0x45/0x90
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&fs_devs->device_list_mutex){+.+.}-{4:4}:
__mutex_lock+0xa3/0x1360
btrfs_create_pending_block_groups+0x1f4/0x9d0
__btrfs_end_transaction+0x3e/0x2e0
btrfs_zoned_reserve_data_reloc_bg+0x2f8/0x390
open_ctree+0x1934/0x23db
btrfs_get_tree.cold+0x105/0x26c
vfs_get_tree+0x28/0xb0
__do_sys_fsconfig+0x324/0x680
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #3 (btrfs_trans_num_extwriters){++++}-{0:0}:
join_transaction+0xc2/0x5c0
start_transaction+0x17c/0xbc0
btrfs_zoned_reserve_data_reloc_bg+0x2b4/0x390
open_ctree+0x1934/0x23db
btrfs_get_tree.cold+0x105/0x26c
vfs_get_tree+0x28/0xb0
__do_sys_fsconfig+0x324/0x680
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #2 (btrfs_trans_num_writers){++++}-{0:0}:
lock_release+0x163/0x4b0
__btrfs_end_transaction+0x1c7/0x2e0
btrfs_dirty_inode+0x6f/0xd0
touch_atime+0xe5/0x2c0
btrfs_file_mmap_prepare+0x65/0x90
__mmap_region+0x4b9/0xf00
mmap_region+0xf7/0x120
do_mmap+0x43d/0x610
vm_mmap_pgoff+0xd6/0x190
ksys_mmap_pgoff+0x7e/0xc0
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #1 (&mm->mmap_lock){++++}-{4:4}:
__might_fault+0x68/0xa0
_copy_to_user+0x22/0x70
blkdev_copy_zone_to_user+0x22/0x40
virtblk_report_zones+0x282/0x430
blkdev_report_zones_ioctl+0xfd/0x130
blkdev_ioctl+0x20f/0x2c0
__x64_sys_ioctl+0x86/0xd0
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #0 (&vblk->vdev_mutex){+.+.}-{4:4}:
__lock_acquire+0x1522/0x2680
lock_acquire+0xd5/0x2f0
__mutex_lock+0xa3/0x1360
virtblk_report_zones+0xda/0x430
blkdev_report_zones_cached+0x162/0x190
btrfs_get_dev_zones+0xdc/0x2e0
btrfs_get_dev_zone_info+0x219/0xe80
btrfs_get_dev_zone_info_all_devices+0x62/0x90
open_ctree+0x1200/0x23db
btrfs_get_tree.cold+0x105/0x26c
vfs_get_tree+0x28/0xb0
__do_sys_fsconfig+0x324/0x680
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of:
&vblk->vdev_mutex --> btrfs_trans_num_extwriters --> &fs_devs->device_list_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_devs->device_list_mutex);
lock(btrfs_trans_num_extwriters);
lock(&fs_devs->device_list_mutex);
lock(&vblk->vdev_mutex);
*** DEADLOCK ***
3 locks held by mount/1043:
#0: ffff88811063e878 (&fc->uapi_mutex){+.+.}-{4:4}, at: __do_sys_fsconfig+0x2ae/0x680
#1: ffff88810cb9f0e8 (&type->s_umount_key#31/1){+.+.}-{4:4}, at: alloc_super+0xc0/0x3e0
#2: ffff888102a738e0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: btrfs_get_dev_zone_info_all_devices+0x45/0x90
stack backtrace:
CPU: 2 UID: 0 PID: 1043 Comm: mount Not tainted 7.0.0-rc1+ #355 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x80
print_circular_bug.cold+0x18d/0x1d8
check_noncircular+0x10d/0x130
__lock_acquire+0x1522/0x2680
? vmap_small_pages_range_noflush+0x3ef/0x820
lock_acquire+0xd5/0x2f0
? virtblk_report_zones+0xda/0x430
? lock_is_held_type+0xcd/0x130
__mutex_lock+0xa3/0x1360
? virtblk_report_zones+0xda/0x430
? virtblk_report_zones+0xda/0x430
? __pfx_copy_zone_info_cb+0x10/0x10
? virtblk_report_zones+0xda/0x430
virtblk_report_zones+0xda/0x430
? __pfx_copy_zone_info_cb+0x10/0x10
blkdev_report_zones_cached+0x162/0x190
? __pfx_copy_zone_info_cb+0x10/0x10
btrfs_get_dev_zones+0xdc/0x2e0
btrfs_get_dev_zone_info+0x219/0xe80
btrfs_get_dev_zone_info_all_devices+0x62/0x90
open_ctree+0x1200/0x23db
btrfs_get_tree.cold+0x105/0x26c
? rcu_is_watching+0x18/0x50
vfs_get_tree+0x28/0xb0
__do_sys_fsconfig+0x324/0x680
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f615e27a40e
RSP: 002b:00007fff11b18fb8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
RAX: ffffffffffffffda RBX: 000055572e92ab10 RCX: 00007f615e27a40e
RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
RBP: 00007fff11b19100 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000055572e92bc40 R14: 00007f615e3faa60 R15: 000055572e92bd08
</TASK>
Don't hold the device_list_mutex while calling into
btrfs_get_dev_zone_info() in btrfs_get_dev_zone_info_all_devices() to
mitigate the issue. This is safe, as no other thread can touch the device
list at the moment of execution.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1c37d896b12dfd0d4c96e310b0033c6676933917 ]
Whenever we get an error updating the device stats item for a device in
btrfs_run_dev_stats() we allow the loop to go to the next device, and if
updating the stats item for the next device succeeds, we end up losing
the error we had from the previous device.
Fix this by breaking out of the loop once we get an error and make sure
it's returned to the caller. Since we are in the transaction commit path
(and in the critical section actually), returning the error will result
in a transaction abort.
Fixes: 733f4fbbc108 ("Btrfs: read device stats on mount, write modified ones during commit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a4376d9a5d4c9610e69def3fc0b32c86a7ab7a41 ]
When create_space_info_sub_group() allocates elements of
space_info->sub_group[], kobject_init_and_add() is called for each
element via btrfs_sysfs_add_space_info_type(). However, when
check_removing_space_info() frees these elements, it does not call
btrfs_sysfs_remove_space_info() on them. As a result, kobject_put() is
not called and the associated kobj->name objects are leaked.
This memory leak is reproduced by running the blktests test case
zbd/009 on kernels built with CONFIG_DEBUG_KMEMLEAK. The kmemleak
feature reports the following error:
unreferenced object 0xffff888112877d40 (size 16):
comm "mount", pid 1244, jiffies 4294996972
hex dump (first 16 bytes):
64 61 74 61 2d 72 65 6c 6f 63 00 c4 c6 a7 cb 7f data-reloc......
backtrace (crc 53ffde4d):
__kmalloc_node_track_caller_noprof+0x619/0x870
kstrdup+0x42/0xc0
kobject_set_name_vargs+0x44/0x110
kobject_init_and_add+0xcf/0x150
btrfs_sysfs_add_space_info_type+0xfc/0x210 [btrfs]
create_space_info_sub_group.constprop.0+0xfb/0x1b0 [btrfs]
create_space_info+0x211/0x320 [btrfs]
btrfs_init_space_info+0x15a/0x1b0 [btrfs]
open_ctree+0x33c7/0x4a50 [btrfs]
btrfs_get_tree.cold+0x9f/0x1ee [btrfs]
vfs_get_tree+0x87/0x2f0
vfs_cmd_create+0xbd/0x280
__do_sys_fsconfig+0x3df/0x990
do_syscall_64+0x136/0x1540
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To avoid the leak, call btrfs_sysfs_remove_space_info() instead of
kfree() for the elements.
Fixes: f92ee31e031c ("btrfs: introduce btrfs_space_info sub-group")
Link: https://lore.kernel.org/linux-block/b9488881-f18d-4f47-91a5-3c9bf63955a5@wdc.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b52fe51f724385b3ed81e37e510a4a33107e8161 ]
Fix the superblock offset mismatch error message in
btrfs_validate_super(): we changed it so that it considers all the
superblocks, but the message still assumes we're only looking at the
first one.
The change from %u to %llu is because we're changing from a constant to
a u64.
Fixes: 069ec957c35e ("btrfs: Refactor btrfs_check_super_valid")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5131fa077f9bb386a1b901bf5b247041f0ec8f80 ]
We have recently observed a number of subvolumes with broken dentries.
ls-ing the parent dir looks like:
drwxrwxrwt 1 root root 16 Jan 23 16:49 .
drwxr-xr-x 1 root root 24 Jan 23 16:48 ..
d????????? ? ? ? ? ? broken_subvol
and similarly stat-ing the file fails.
In this state, deleting the subvol fails with ENOENT, but attempting to
create a new file or subvol over it errors out with EEXIST and even
aborts the fs. Which leaves us a bit stuck.
dmesg contains a single notable error message reading:
"could not do orphan cleanup -2"
2 is ENOENT and the error comes from the failure handling path of
btrfs_orphan_cleanup(), with the stack leading back up to
btrfs_lookup().
btrfs_lookup
btrfs_lookup_dentry
btrfs_orphan_cleanup // prints that message and returns -ENOENT
After some detailed inspection of the internal state, it became clear
that:
- there are no orphan items for the subvol
- the subvol is otherwise healthy looking, it is not half-deleted or
anything, there is no drop progress, etc.
- the subvol was created a while ago and does the meaningful first
btrfs_orphan_cleanup() call that sets BTRFS_ROOT_ORPHAN_CLEANUP much
later.
- after btrfs_orphan_cleanup() fails, btrfs_lookup_dentry() returns -ENOENT,
which results in a negative dentry for the subvolume via
d_splice_alias(NULL, dentry), leading to the observed behavior. The
bug can be mitigated by dropping the dentry cache, at which point we
can successfully delete the subvolume if we want.
i.e.,
btrfs_lookup()
btrfs_lookup_dentry()
if (!sb_rdonly(inode->vfs_inode)->vfs_inode)
btrfs_orphan_cleanup(sub_root)
test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
btrfs_search_slot() // finds orphan item for inode N
...
prints "could not do orphan cleanup -2"
if (inode == ERR_PTR(-ENOENT))
inode = NULL;
return d_splice_alias(NULL, dentry) // NEGATIVE DENTRY for valid subvolume
btrfs_orphan_cleanup() does test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
on the root when it runs, so it cannot run more than once on a given
root, so something else must run concurrently. However, the obvious
routes to deleting an orphan when nlinks goes to 0 should not be able to
run without first doing a lookup into the subvolume, which should run
btrfs_orphan_cleanup() and set the bit.
The final important observation is that create_subvol() calls
d_instantiate_new() but does not set BTRFS_ROOT_ORPHAN_CLEANUP, so if
the dentry cache gets dropped, the next lookup into the subvolume will
make a real call into btrfs_orphan_cleanup() for the first time. This
opens up the possibility of concurrently deleting the inode/orphan items
but most typical evict() paths will be holding a reference on the parent
dentry (child dentry holds parent->d_lockref.count via dget in
d_alloc(), released in __dentry_kill()) and prevent the parent from
being removed from the dentry cache.
The one exception is delayed iputs. Ordered extent creation calls
igrab() on the inode. If the file is unlinked and closed while those
refs are held, iput() in __dentry_kill() decrements i_count but does
not trigger eviction (i_count > 0). The child dentry is freed and the
subvol dentry's d_lockref.count drops to 0, making it evictable while
the inode is still alive.
Since there are two races (the race between writeback and unlink and
the race between lookup and delayed iputs), and there are too many moving
parts, the following three diagrams show the complete picture.
(Only the second and third are races)
Phase 1:
Create Subvol in dentry cache without BTRFS_ROOT_ORPHAN_CLEANUP set
btrfs_mksubvol()
lookup_one_len()
__lookup_slow()
d_alloc_parallel()
__d_alloc() // d_lockref.count = 1
create_subvol(dentry)
// doesn't touch the bit..
d_instantiate_new(dentry, inode) // dentry in cache with d_lockref.count == 1
Phase 2:
Create a delayed iput for a file in the subvol but leave the subvol in
state where its dentry can be evicted (d_lockref.count == 0)
T1 (task) T2 (writeback) T3 (OE workqueue)
write() // dirty pages
btrfs_writepages()
btrfs_run_delalloc_range()
cow_file_range()
btrfs_alloc_ordered_extent()
igrab() // i_count: 1 -> 2
btrfs_unlink_inode()
btrfs_orphan_add()
close()
__fput()
dput()
finish_dput()
__dentry_kill()
dentry_unlink_inode()
iput() // 2 -> 1
--parent->d_lockref.count // 1 -> 0; evictable
finish_ordered_fn()
btrfs_finish_ordered_io()
btrfs_put_ordered_extent()
btrfs_add_delayed_iput()
Phase 3:
Once the delayed iput is pending and the subvol dentry is evictable,
the shrinker can free it, causing the next lookup to go through
btrfs_lookup() and call btrfs_orphan_cleanup() for the first time.
If the cleaner kthread processes the delayed iput concurrently, the
two race:
T1 (shrinker) T2 (cleaner kthread) T3 (lookup)
super_cache_scan()
prune_dcache_sb()
__dentry_kill()
// subvol dentry freed
btrfs_run_delayed_iputs()
iput() // i_count -> 0
evict() // sets I_FREEING
btrfs_evict_inode()
// truncation loop
btrfs_lookup()
btrfs_lookup_dentry()
btrfs_orphan_cleanup()
// first call (bit never set)
btrfs_iget()
// blocks on I_FREEING
btrfs_orphan_del()
// inode freed
// returns -ENOENT
btrfs_del_orphan_item()
// -ENOENT
// "could not do orphan cleanup -2"
d_splice_alias(NULL, dentry)
// negative dentry for valid subvol
The most straightforward fix is to ensure the invariant that a dentry
for a subvolume can exist if and only if that subvolume has
BTRFS_ROOT_ORPHAN_CLEANUP set on its root (and is known to have no
orphans or ran btrfs_orphan_cleanup()).
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fc1cd1f18c34f91e78362f9629ab9fd43b9dcab9 ]
Fix tree-checker error message to report "invalid root drop_level"
instead of the misleading "invalid root level".
Fixes: 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9573a365ff9ff45da9222d3fe63695ce562beb24 ]
If we log the parent directory of a conflicting inode, we are not logging
the new dentries of the directory, so when we finish we have the parent
directory's inode marked as logged but we did not log its new dentries.
As a consequence if the parent directory is explicitly fsynced later and
it does not have any new changes since we logged it, the fsync is a no-op
and after a power failure the new dentries are missing.
Example scenario:
$ mkdir foo
$ sync
$rmdir foo
$ mkdir dir1
$ mkdir dir2
# A file with the same name and parent as the directory we just deleted
# and was persisted in a past transaction. So the deleted directory's
# inode is a conflicting inode of this new file's inode.
$ touch foo
$ ln foo dir2/link
# The fsync on dir2 will log the parent directory (".") because the
# conflicting inode (deleted directory) does not exists anymore, but it
# it does not log its new dentries (dir1).
$ xfs_io -c "fsync" dir2
# This fsync on the parent directory is no-op, since the previous fsync
# logged it (but without logging its new dentries).
$ xfs_io -c "fsync" .
<power failure>
# After log replay dir1 is missing.
Fix this by ensuring we log new dir dentries whenever we log the parent
directory of a no longer existing conflicting inode.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/182055fa-e9ce-4089-9f5f-4b8a23e8dd91@gmail.com/
Fixes: a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 0f475ee0ebce5c9492b260027cd95270191675fa upstream.
If we failed to update the root we don't abort the transaction, which is
wrong since we already used the transaction to remove an item from the
uuid tree.
Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree")
CC: stable@vger.kernel.org # 3.12+
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
try_release_subpage_extent_buffer()
commit b2840e33127ce0eea880504b7f133e780f567a9b upstream.
Call rcu_read_lock() before exiting the loop in
try_release_subpage_extent_buffer() because there is a rcu_read_unlock()
call past the loop.
This has been detected by the Clang thread-safety analyzer.
Fixes: ad580dfa388f ("btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()")
CC: stable@vger.kernel.org # 6.18+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 87f2c46003fce4d739138aab4af1942b1afdadac upstream.
If the set received ioctl fails due to an item overflow when attempting to
add the BTRFS_UUID_KEY_RECEIVED_SUBVOL we have to abort the transaction
since we did some metadata updates before.
This means that if a user calls this ioctl with the same received UUID
field for a lot of subvolumes, we will hit the overflow, trigger the
transaction abort and turn the filesystem into RO mode. A malicious user
could exploit this, and this ioctl does not even requires that a user
has admin privileges (CAP_SYS_ADMIN), only that he/she owns the subvolume.
Fix this by doing an early check for item overflow before starting a
transaction. This is also race safe because we are holding the subvol_sem
semaphore in exclusive (write) mode.
A test case for fstests will follow soon.
Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree")
CC: stable@vger.kernel.org # 3.12+
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2d1ababdedd4ba38867c2500eb7f95af5ddeeef7 upstream.
If we attempt to create several files with names that result in the same
hash, we have to pack them in same dir item and that has a limit inherent
to the leaf size. However if we reach that limit, we trigger a transaction
abort and turns the filesystem into RO mode. This allows for a malicious
user to disrupt a system, without the need to have administration
privileges/capabilities.
Reproducer:
$ cat exploit-hash-collisions.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
# Use smallest node size to make the test faster and require fewer file
# names that result in hash collision.
mkfs.btrfs -f --nodesize 4K $DEV
mount $DEV $MNT
# List of names that result in the same crc32c hash for btrfs.
declare -a names=(
'foobar'
'%a8tYkxfGMLWRGr55QSeQc4PBNH9PCLIvR6jZnkDtUUru1t@RouaUe_L:@xGkbO3nCwvLNYeK9vhE628gss:T$yZjZ5l-Nbd6CbC$M=hqE-ujhJICXyIxBvYrIU9-TDC'
'AQci3EUB%shMsg-N%frgU:02ByLs=IPJU0OpgiWit5nexSyxZDncY6WB:=zKZuk5Zy0DD$Ua78%MelgBuMqaHGyKsJUFf9s=UW80PcJmKctb46KveLSiUtNmqrMiL9-Y0I_l5Fnam04CGIg=8@U:Z'
'CvVqJpJzueKcuA$wqwePfyu7VxuWNN3ho$p0zi2H8QFYK$7YlEqOhhb%:hHgjhIjW5vnqWHKNP4'
'ET:vk@rFU4tsvMB0$C_p=xQHaYZjvoF%-BTc%wkFW8yaDAPcCYoR%x$FH5O:'
'HwTon%v7SGSP4FE08jBwwiu5aot2CFKXHTeEAa@38fUcNGOWvE@Mz6WBeDH_VooaZ6AgsXPkVGwy9l@@ZbNXabUU9csiWrrOp0MWUdfi$EZ3w9GkIqtz7I_eOsByOkBOO'
'Ij%2VlFGXSuPvxJGf5UWy6O@1svxGha%b@=%wjkq:CIgE6u7eJOjmQY5qTtxE2Rjbis9@us'
'KBkjG5%9R8K9sOG8UTnAYjxLNAvBmvV5vz3IiZaPmKuLYO03-6asI9lJ_j4@6Xo$KZicaLWJ3Pv8XEwVeUPMwbHYWwbx0pYvNlGMO9F:ZhHAwyctnGy%_eujl%WPd4U2BI7qooOSr85J-C2V$LfY'
'NcRfDfuUQ2=zP8K3CCF5dFcpfiOm6mwenShsAb_F%n6GAGC7fT2JFFn:c35X-3aYwoq7jNX5$ZJ6hI3wnZs$7KgGi7wjulffhHNUxAT0fRRLF39vJ@NvaEMxsMO'
'Oj42AQAEzRoTxa5OuSKIr=A_lwGMy132v4g3Pdq1GvUG9874YseIFQ6QU'
'Ono7avN5GjC:_6dBJ_'
'WHmN2gnmaN-9dVDy4aWo:yNGFzz8qsJyJhWEWcud7$QzN2D9R0efIWWEdu5kwWr73NZm4=@CoCDxrrZnRITr-kGtU_cfW2:%2_am'
'WiFnuTEhAG9FEC6zopQmj-A-$LDQ0T3WULz%ox3UZAPybSV6v1Z$b4L_XBi4M4BMBtJZpz93r9xafpB77r:lbwvitWRyo$odnAUYlYMmU4RvgnNd--e=I5hiEjGLETTtaScWlQp8mYsBovZwM2k'
'XKyH=OsOAF3p%uziGF_ZVr$ivrvhVgD@1u%5RtrV-gl_vqAwHkK@x7YwlxX3qT6WKKQ%PR56NrUBU2dOAOAdzr2=5nJuKPM-T-$ZpQfCL7phxQbUcb:BZOTPaFExc-qK-gDRCDW2'
'd3uUR6OFEwZr%ns1XH_@tbxA@cCPmbBRLdyh7p6V45H$P2$F%w0RqrD3M0g8aGvWpoTFMiBdOTJXjD:JF7=h9a_43xBywYAP%r$SPZi%zDg%ql-KvkdUCtF9OLaQlxmd'
'ePTpbnit%hyNm@WELlpKzNZYOzOTf8EQ$sEfkMy1VOfIUu3coyvIr13-Y7Sv5v-Ivax2Go_GQRFMU1b3362nktT9WOJf3SpT%z8sZmM3gvYQBDgmKI%%RM-G7hyrhgYflOw%z::ZRcv5O:lDCFm'
'evqk743Y@dvZAiG5J05L_ROFV@$2%rVWJ2%3nxV72-W7$e$-SK3tuSHA2mBt$qloC5jwNx33GmQUjD%akhBPu=VJ5g$xhlZiaFtTrjeeM5x7dt4cHpX0cZkmfImndYzGmvwQG:$euFYmXn$_2rA9mKZ'
'gkgUtnihWXsZQTEkrMAWIxir09k3t7jk_IK25t1:cy1XWN0GGqC%FrySdcmU7M8MuPO_ppkLw3=Dfr0UuBAL4%GFk2$Ma10V1jDRGJje%Xx9EV2ERaWKtjpwiZwh0gCSJsj5UL7CR8RtW5opCVFKGGy8Cky'
'hNgsG_8lNRik3PvphqPm0yEH3P%%fYG:kQLY=6O-61Wa6nrV_WVGR6TLB09vHOv%g4VQRP8Gzx7VXUY1qvZyS'
'isA7JVzN12xCxVPJZ_qoLm-pTBuhjjHMvV7o=F:EaClfYNyFGlsfw-Kf%uxdqW-kwk1sPl2vhbjyHU1A6$hz'
'kiJ_fgcdZFDiOptjgH5PN9-PSyLO4fbk_:u5_2tz35lV_iXiJ6cx7pwjTtKy-XGaQ5IefmpJ4N_ZqGsqCsKuqOOBgf9LkUdffHet@Wu'
'lvwtxyhE9:%Q3UxeHiViUyNzJsy:fm38pg_b6s25JvdhOAT=1s0$pG25x=LZ2rlHTszj=gN6M4zHZYr_qrB49i=pA--@WqWLIuX7o1S_SfS@2FSiUZN'
'rC24cw3UBDZ=5qJBUMs9e$=S4Y94ni%Z8639vnrGp=0Hv4z3dNFL0fBLmQ40=EYIY:Z=SLc@QLMSt2zsss2ZXrP7j4='
'uwGl2s-fFrf@GqS=DQqq2I0LJSsOmM%xzTjS:lzXguE3wChdMoHYtLRKPvfaPOZF2fER@j53evbKa7R%A7r4%YEkD=kicJe@SFiGtXHbKe4gCgPAYbnVn'
'UG37U6KKua2bgc:IHzRs7BnB6FD:2Mt5Cc5NdlsW%$1tyvnfz7S27FvNkroXwAW:mBZLA1@qa9WnDbHCDmQmfPMC9z-Eq6QT0jhhPpqyymaD:R02ghwYo%yx7SAaaq-:x33LYpei$5g8DMl3C'
'y2vjek0FE1PDJC0qpfnN:x8k2wCFZ9xiUF2ege=JnP98R%wxjKkdfEiLWvQzmnW'
'8-HCSgH5B%K7P8_jaVtQhBXpBk:pE-$P7ts58U0J@iR9YZntMPl7j$s62yAJO@_9eanFPS54b=UTw$94C-t=HLxT8n6o9P=QnIxq-f1=Ne2dvhe6WbjEQtc'
'YPPh:IFt2mtR6XWSmjHptXL_hbSYu8bMw-JP8@PNyaFkdNFsk$M=xfL6LDKCDM-mSyGA_2MBwZ8Dr4=R1D%7-mCaaKGxb990jzaagRktDTyp'
'9hD2ApKa_t_7x-a@GCG28kY:7$M@5udI1myQ$x5udtggvagmCQcq9QXWRC5hoB0o-_zHQUqZI5rMcz_kbMgvN5jr63LeYA4Cj-c6F5Ugmx6DgVf@2Jqm%MafecpgooqreJ53P-QTS'
)
# Now create files with all those names in the same parent directory.
# It should not fail since a 4K leaf has enough space for them.
for name in "${names[@]}"; do
touch $MNT/$name
done
# Now add one more file name that causes a crc32c hash collision.
# This should fail, but it should not turn the filesystem into RO mode
# (which could be exploited by malicious users) due to a transaction
# abort.
touch $MNT/'W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt'
# Check that we are able to create another file, with a name that does not cause
# a crc32c hash collision.
echo -n "hello world" > $MNT/baz
# Unmount and mount again, verify file baz exists and with the right content.
umount $MNT
mount $DEV $MNT
echo "File baz content: $(cat $MNT/baz)"
umount $MNT
When running the reproducer:
$ ./exploit-hash-collisions.sh
(...)
touch: cannot touch '/mnt/sdi/W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt': Value too large for defined data type
./exploit-hash-collisions.sh: line 57: /mnt/sdi/baz: Read-only file system
cat: /mnt/sdi/baz: No such file or directory
File baz content:
And the transaction abort stack trace in dmesg/syslog:
$ dmesg
(...)
[758240.509761] ------------[ cut here ]------------
[758240.510668] BTRFS: Transaction aborted (error -75)
[758240.511577] WARNING: fs/btrfs/inode.c:6854 at btrfs_create_new_inode+0x805/0xb50 [btrfs], CPU#6: touch/888644
[758240.513513] Modules linked in: btrfs dm_zero (...)
[758240.523221] CPU: 6 UID: 0 PID: 888644 Comm: touch Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
[758240.524621] Tainted: [W]=WARN
[758240.525037] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[758240.526331] RIP: 0010:btrfs_create_new_inode+0x80b/0xb50 [btrfs]
[758240.527093] Code: 0f 82 cf (...)
[758240.529211] RSP: 0018:ffffce64418fbb48 EFLAGS: 00010292
[758240.529935] RAX: 00000000ffffffd3 RBX: 0000000000000000 RCX: 00000000ffffffb5
[758240.531040] RDX: 0000000d04f33e06 RSI: 00000000ffffffb5 RDI: ffffffffc0919dd0
[758240.531920] RBP: ffffce64418fbc10 R08: 0000000000000000 R09: 00000000ffffffb5
[758240.532928] R10: 0000000000000000 R11: ffff8e52c0000000 R12: ffff8e53eee7d0f0
[758240.533818] R13: ffff8e57f70932a0 R14: ffff8e5417629568 R15: 0000000000000000
[758240.534664] FS: 00007f1959a2a740(0000) GS:ffff8e5b27cae000(0000) knlGS:0000000000000000
[758240.535821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[758240.536644] CR2: 00007f1959b10ce0 CR3: 000000012a2cc005 CR4: 0000000000370ef0
[758240.537517] Call Trace:
[758240.537828] <TASK>
[758240.538099] btrfs_create_common+0xbf/0x140 [btrfs]
[758240.538760] path_openat+0x111a/0x15b0
[758240.539252] do_filp_open+0xc2/0x170
[758240.539699] ? preempt_count_add+0x47/0xa0
[758240.540200] ? __virt_addr_valid+0xe4/0x1a0
[758240.540800] ? __check_object_size+0x1b3/0x230
[758240.541661] ? alloc_fd+0x118/0x180
[758240.542315] do_sys_openat2+0x70/0xd0
[758240.543012] __x64_sys_openat+0x50/0xa0
[758240.543723] do_syscall_64+0x50/0xf20
[758240.544462] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[758240.545397] RIP: 0033:0x7f1959abc687
[758240.546019] Code: 48 89 fa (...)
[758240.548522] RSP: 002b:00007ffe16ff8690 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[758240.566278] RAX: ffffffffffffffda RBX: 00007f1959a2a740 RCX: 00007f1959abc687
[758240.567068] RDX: 0000000000000941 RSI: 00007ffe16ffa333 RDI: ffffffffffffff9c
[758240.567860] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[758240.568707] R10: 00000000000001b6 R11: 0000000000000202 R12: 0000561eec7c4b90
[758240.569712] R13: 0000561eec7c311f R14: 00007ffe16ffa333 R15: 0000000000000000
[758240.570758] </TASK>
[758240.571040] ---[ end trace 0000000000000000 ]---
[758240.571681] BTRFS: error (device sdi state A) in btrfs_create_new_inode:6854: errno=-75 unknown
[758240.572899] BTRFS info (device sdi state EA): forced readonly
Fix this by checking for hash collision, and if the adding a new name is
possible, early in btrfs_create_new_inode() before we do any tree updates,
so that we don't need to abort the transaction if we cannot add the new
name due to the leaf size limit.
A test case for fstests will be sent soon.
Fixes: caae78e03234 ("btrfs: move common inode creation code into btrfs_create_new_inode()")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e1b18b959025e6b5dbad668f391f65d34b39595a upstream.
Currently a user can trigger a transaction abort by snapshotting a
previously received snapshot a bunch of times until we reach a
BTRFS_UUID_KEY_RECEIVED_SUBVOL item overflow (the maximum item size we
can store in a leaf). This is very likely not common in practice, but
if it happens, it turns the filesystem into RO mode. The snapshot, send
and set_received_subvol and subvol_setflags (used by receive) don't
require CAP_SYS_ADMIN, just inode_owner_or_capable(). A malicious user
could use this to turn a filesystem into RO mode and disrupt a system.
Reproducer script:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
# Use smallest node size to make the test faster.
mkfs.btrfs -f --nodesize 4K $DEV
mount $DEV $MNT
# Create a subvolume and set it to RO so that it can be used for send.
btrfs subvolume create $MNT/sv
touch $MNT/sv/foo
btrfs property set $MNT/sv ro true
# Send and receive the subvolume into snaps/sv.
mkdir $MNT/snaps
btrfs send $MNT/sv | btrfs receive $MNT/snaps
# Now snapshot the received subvolume, which has a received_uuid, a
# lot of times to trigger the leaf overflow.
total=500
for ((i = 1; i <= $total; i++)); do
echo -ne "\rCreating snapshot $i/$total"
btrfs subvolume snapshot -r $MNT/snaps/sv $MNT/snaps/sv_$i > /dev/null
done
echo
umount $MNT
When running the test:
$ ./test.sh
(...)
Create subvolume '/mnt/sdi/sv'
At subvol /mnt/sdi/sv
At subvol sv
Creating snapshot 496/500ERROR: Could not create subvolume: Value too large for defined data type
Creating snapshot 497/500ERROR: Could not create subvolume: Read-only file system
Creating snapshot 498/500ERROR: Could not create subvolume: Read-only file system
Creating snapshot 499/500ERROR: Could not create subvolume: Read-only file system
Creating snapshot 500/500ERROR: Could not create subvolume: Read-only file system
And in dmesg/syslog:
$ dmesg
(...)
[251067.627338] BTRFS warning (device sdi): insert uuid item failed -75 (0x4628b21c4ac8d898, 0x2598bee2b1515c91) type 252!
[251067.629212] ------------[ cut here ]------------
[251067.630033] BTRFS: Transaction aborted (error -75)
[251067.630871] WARNING: fs/btrfs/transaction.c:1907 at create_pending_snapshot.cold+0x52/0x465 [btrfs], CPU#10: btrfs/615235
[251067.632851] Modules linked in: btrfs dm_zero (...)
[251067.644071] CPU: 10 UID: 0 PID: 615235 Comm: btrfs Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
[251067.646165] Tainted: [W]=WARN
[251067.646733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[251067.648735] RIP: 0010:create_pending_snapshot.cold+0x55/0x465 [btrfs]
[251067.649984] Code: f0 48 0f (...)
[251067.653313] RSP: 0018:ffffce644908fae8 EFLAGS: 00010292
[251067.653987] RAX: 00000000ffffff01 RBX: ffff8e5639e63a80 RCX: 00000000ffffffd3
[251067.655042] RDX: ffff8e53faa76b00 RSI: 00000000ffffffb5 RDI: ffffffffc0919750
[251067.656077] RBP: ffffce644908fbd8 R08: 0000000000000000 R09: ffffce644908f820
[251067.657068] R10: ffff8e5adc1fffa8 R11: 0000000000000003 R12: ffff8e53c0431bd0
[251067.658050] R13: ffff8e5414593600 R14: ffff8e55efafd000 R15: 00000000ffffffb5
[251067.659019] FS: 00007f2a4944b3c0(0000) GS:ffff8e5b27dae000(0000) knlGS:0000000000000000
[251067.660115] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[251067.660943] CR2: 00007ffc5aa57898 CR3: 00000005813a2003 CR4: 0000000000370ef0
[251067.661972] Call Trace:
[251067.662292] <TASK>
[251067.662653] create_pending_snapshots+0x97/0xc0 [btrfs]
[251067.663413] btrfs_commit_transaction+0x26e/0xc00 [btrfs]
[251067.664257] ? btrfs_qgroup_convert_reserved_meta+0x35/0x390 [btrfs]
[251067.665238] ? _raw_spin_unlock+0x15/0x30
[251067.665837] ? record_root_in_trans+0xa2/0xd0 [btrfs]
[251067.666531] btrfs_mksubvol+0x330/0x580 [btrfs]
[251067.667145] btrfs_mksnapshot+0x74/0xa0 [btrfs]
[251067.667827] __btrfs_ioctl_snap_create+0x194/0x1d0 [btrfs]
[251067.668595] btrfs_ioctl_snap_create_v2+0x107/0x130 [btrfs]
[251067.669479] btrfs_ioctl+0x1580/0x2690 [btrfs]
[251067.670093] ? count_memcg_events+0x6d/0x180
[251067.670849] ? handle_mm_fault+0x1a0/0x2a0
[251067.671652] __x64_sys_ioctl+0x92/0xe0
[251067.672406] do_syscall_64+0x50/0xf20
[251067.673129] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[251067.674096] RIP: 0033:0x7f2a495648db
[251067.674812] Code: 00 48 89 (...)
[251067.678227] RSP: 002b:00007ffc5aa57840 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[251067.679691] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2a495648db
[251067.681145] RDX: 00007ffc5aa588b0 RSI: 0000000050009417 RDI: 0000000000000004
[251067.682511] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
[251067.683842] R10: 000000000000000a R11: 0000000000000246 R12: 00007ffc5aa59910
[251067.685176] R13: 00007ffc5aa588b0 R14: 0000000000000004 R15: 0000000000000006
[251067.686524] </TASK>
[251067.686972] ---[ end trace 0000000000000000 ]---
[251067.687890] BTRFS: error (device sdi state A) in create_pending_snapshot:1907: errno=-75 unknown
[251067.689049] BTRFS info (device sdi state EA): forced readonly
[251067.689054] BTRFS warning (device sdi state EA): Skipping commit of aborted transaction.
[251067.690119] BTRFS: error (device sdi state EA) in cleanup_transaction:2043: errno=-75 unknown
[251067.702028] BTRFS info (device sdi state EA): last unmount of filesystem 46dc3975-30a2-4a69-a18f-418b859cccda
Fix this by ignoring -EOVERFLOW errors from btrfs_uuid_tree_add() in the
snapshot creation code when attempting to add the
BTRFS_UUID_KEY_RECEIVED_SUBVOL item. This is OK because it's not critical
and we are still able to delete the snapshot, as snapshot/subvolume
deletion ignores if a BTRFS_UUID_KEY_RECEIVED_SUBVOL is missing (see
inode.c:btrfs_delete_subvolume()). As for send/receive, we can still do
send/receive operations since it always peeks the first root ID in the
existing BTRFS_UUID_KEY_RECEIVED_SUBVOL (it could peek any since all
snapshots have the same content), and even if the key is missing, it
falls back to searching by BTRFS_UUID_KEY_SUBVOL key.
A test case for fstests will be sent soon.
Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree")
CC: stable@vger.kernel.org # 3.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
btrfs_chunk_map_num_copies()
commit f15fb3d41543244d1179f423da4a4832a55bc050 upstream.
Fix a chunk map leak in btrfs_map_block(): if we return early with -EINVAL,
we're not freeing the chunk map that we've just looked up.
Fixes: 0ae653fbec2b ("btrfs: reduce chunk_map lookups in btrfs_map_block()")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b8883b61f2fc50dcf22938cbed40fec05020552f ]
btrfs_set_periodic_reclaim_ready() requires space_info->lock to be held,
as enforced by lockdep_assert_held(). However, btrfs_reclaim_sweep() was
calling it after do_reclaim_sweep() returns, at which point
space_info->lock is no longer held.
Fix this by explicitly acquiring space_info->lock before clearing the
periodic reclaim ready flag in btrfs_reclaim_sweep().
Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-btrfs/20260208182556.891815-1-clm@meta.com/
Fixes: 19eff93dc738 ("btrfs: fix periodic reclaim condition")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 587bb33b10bda645a1028c1737ad3992b3d7cf61 ]
Commit d7f67ac9a928 ("btrfs: relax block-group-tree feature dependency
checks") introduced a regression when it comes to handling unsupported
incompat or compat_ro flags. Beforehand we only printed the flags that
we didn't recognize, afterwards we printed them all, which is less
useful. Fix the error handling so it behaves like it used to.
Fixes: d7f67ac9a928 ("btrfs: relax block-group-tree feature dependency checks")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1c7e9111f4e6d6d42bc47759c9af1ef91f03ac2c ]
Fix the error message in btrfs_delete_subvolume() if we can't delete a
subvolume because it has an active swapfile: we were printing the number
of the parent rather than the target.
Fixes: 60021bd754c6 ("btrfs: prevent subvol with swapfile from being deleted")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 44e2fda66427a0442d8d2c0e6443256fb458ab6b ]
Commit b471965fdb2d ("btrfs: fix replace/scrub failure with
metadata_uuid") fixed the comparison in scrub_verify_one_metadata() to
use metadata_uuid rather than fsid, but left the warning as it was. Fix
it so it matches what we're doing.
Fixes: b471965fdb2d ("btrfs: fix replace/scrub failure with metadata_uuid")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a10172780526c2002e062102ad4f2aabac495889 ]
Fix a copy-paste error in check_extent_data_ref(): we're printing root
as in the message above, we should be printing objectid.
Fixes: f333a3c7e832 ("btrfs: tree-checker: validate dref root and objectid")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 511dc8912ae3e929c1a182f5e6b2326516fd42a0 ]
Fix the error message in check_dev_extent_item(), when an overlapping
stripe is encountered. For dev extents, objectid is the disk number and
offset the physical address, so prev_key->objectid should actually be
prev_key->offset.
(I can't take any credit for this one - this was discovered by Chris and
his friend Claude.)
Reported-by: Chris Mason <clm@fb.com>
Fixes: 008e2512dc56 ("btrfs: tree-checker: add dev extent item checks")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3cf0f35779d364cf2003c617bb7f3f3e41023372 ]
Fix the error message in btrfs_delete_delayed_dir_index() if
__btrfs_add_delayed_item() fails: the message says root, inode, index,
error, but we're actually passing index, root, inode, error.
Fixes: adc1ef55dc04 ("btrfs: add details to error messages at btrfs_delete_delayed_dir_index()")
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3f501412f2079ca14bf68a18d80a2b7a823f1f64 ]
In this function the 'pages' object is never freed in the hopes that it is
picked up by btrfs_uring_read_finished() whenever that executes in the
future. But that's just the happy path. Along the way previous
allocations might have gone wrong, or we might not get -EIOCBQUEUED from
btrfs_encoded_read_regular_fill_pages(). In all these cases, we go to a
cleanup section that frees all memory allocated by this function without
assuming any deferred execution, and this also needs to happen for the
'pages' allocation.
Fixes: 34310c442e17 ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)")
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 912d1c6680bdb40b72b1b9204706f32b6eb842c3 ]
Commit 93bba24d4b5a ("btrfs: Enhance btrfs_trim_fs function to handle
error better") intended to make device trimming continue even if one
device fails, tracking failures and reporting them at the end. However,
it used 'break' instead of 'continue', causing the loop to exit on the
first device failure.
Fix this by replacing 'break' with 'continue'.
Fixes: 93bba24d4b5a ("btrfs: Enhance btrfs_trim_fs function to handle error better")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 52ee9965d09b2c56a027613db30d1fb20d623861 ]
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.
However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
Then, that will cause an error due to mismatch of the write pointers.
We can fixup this case by moving the alloc_offset to the corresponding
write pointer position.
Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e2d848649e64de39fc1b9c64002629b4daa1105d ]
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.
However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
Then, that will cause an error due to mismatch of the write pointers.
We can fixup this case by moving the alloc_offset to the corresponding
write pointer position.
Fixes: c0d90a79e8e6 ("btrfs: zoned: fix alloc_offset calculation for partly conventional block groups")
CC: stable@vger.kernel.org # 6.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dda3ec9ee6b3e120603bff1b798f25b51e54ac5d ]
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.
However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
Then, that will cause an error due to mismatch of the write pointers.
We can fixup this case by moving the alloc_offset to the corresponding
write pointer position.
Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 19eff93dc738e8afaa59cb374b44bb5a162e6c2d ]
Problems with current implementation:
1. reclaimable_bytes is signed while chunk_sz is unsigned, causing
negative reclaimable_bytes to trigger reclaim unexpectedly
2. The "space must be freed between scans" assumption breaks the
two-scan requirement: first scan marks block groups, second scan
reclaims them. Without the second scan, no reclamation occurs.
Instead, track actual reclaim progress: pause reclaim when block groups
will be reclaimed, and resume only when progress is made. This ensures
reclaim continues until no further progress can be made. And resume
periodic reclaim when there's enough free space.
And we take care if reclaim is making any progress now, so it's
unnecessary to set periodic_reclaim_ready to false when failed to reclaim
a block group.
Fixes: 813d4c6422516 ("btrfs: prevent pathological periodic reclaim loops")
CC: stable@vger.kernel.org # 6.12+
Suggested-by: Boris Burkov <boris@bur.io>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8ceaad6cd6e7fa5f73b0b2796a2e85d75d37e9f3 ]
[BUG]
There is a bug report that when btrfs hits ENOSPC error in a critical
path, btrfs flips RO (this part is expected, although the ENOSPC bug
still needs to be addressed).
The problem is after the RO flip, if there is a read repair pending, we
can hit the ASSERT() inside btrfs_repair_io_failure() like the following:
BTRFS info (device vdc): relocating block group 30408704 flags metadata|raid1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -28)
WARNING: fs/btrfs/extent-tree.c:3235 at __btrfs_free_extent.isra.0+0x453/0xfd0, CPU#1: btrfs/383844
Modules linked in: kvm_intel kvm irqbypass
[...]
---[ end trace 0000000000000000 ]---
BTRFS info (device vdc state EA): 2 enospc errors during balance
BTRFS info (device vdc state EA): balance: ended with status: -30
BTRFS error (device vdc state EA): parent transid verify failed on logical 30556160 mirror 2 wanted 8 found 6
BTRFS error (device vdc state EA): bdev /dev/nvme0n1 errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
[...]
assertion failed: !(fs_info->sb->s_flags & SB_RDONLY) :: 0, in fs/btrfs/bio.c:938
------------[ cut here ]------------
assertion failed: !(fs_info->sb->s_flags & SB_RDONLY) :: 0, in fs/btrfs/bio.c:938
kernel BUG at fs/btrfs/bio.c:938!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 868 Comm: kworker/u8:13 Tainted: G W N 6.19.0-rc6+ #4788 PREEMPT(full)
Tainted: [W]=WARN, [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
Workqueue: btrfs-endio simple_end_io_work
RIP: 0010:btrfs_repair_io_failure.cold+0xb2/0x120
RSP: 0000:ffffc90001d2bcf0 EFLAGS: 00010246
RAX: 0000000000000051 RBX: 0000000000001000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8305cf42 RDI: 00000000ffffffff
RBP: 0000000000000002 R08: 00000000fffeffff R09: ffffffff837fa988
R10: ffffffff8327a9e0 R11: 6f69747265737361 R12: ffff88813018d310
R13: ffff888168b8a000 R14: ffffc90001d2bd90 R15: ffff88810a169000
FS: 0000000000000000(0000) GS:ffff8885e752c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
------------[ cut here ]------------
[CAUSE]
The cause of -ENOSPC error during the test case btrfs/124 is still
unknown, although it's known that we still have cases where metadata can
be over-committed but can not be fulfilled correctly, thus if we hit
such ENOSPC error inside a critical path, we have no choice but abort
the current transaction.
This will mark the fs read-only.
The problem is inside the btrfs_repair_io_failure() path that we require
the fs not to be mount read-only. This is normally fine, but if we are
doing a read-repair meanwhile the fs flips RO due to a critical error,
we can enter btrfs_repair_io_failure() with super block set to
read-only, thus triggering the above crash.
[FIX]
Just replace the ASSERT() with a proper return if the fs is already
read-only.
Reported-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-btrfs/20260126045555.GB31641@lst.de/
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit be6324a809dbda76d5fdb23720ad9b20e5c1905c ]
We search with offset (u64)-1 which should never match exactly.
Previously this was handled with BUG(). Now logs an error
and return -EUCLEAN.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Adarsh Das <adarshdas950@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bfb670b9183b0e4ba660aff2e396ec1cc01d0761 ]
When a fatal signal is pending or the process is freezing,
btrfs_trim_block_group() and btrfs_trim_free_extents() return -ERESTARTSYS.
Currently this is treated as a regular error: the loops continue to the
next iteration and count it as a block group or device failure.
Instead, break out of the loops immediately and return -ERESTARTSYS to
userspace without counting it as a failure. Also skip the device loop
entirely if the block group loop was interrupted.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7c2830f00c3e086292c1ee9f27b61efaf8e76c9a ]
[BACKGROUND]
Inspired by a recent kernel bug report, which is related to direct IO
buffer modification during writeback, that leads to contents mismatch of
different RAID1 mirrors.
[CAUSE AND PROBLEMS]
The root cause is exactly the same explained in commit 968f19c5b1b7
("btrfs: always fallback to buffered write if the inode requires
checksum"), that we can not trust direct IO buffer which can be modified
halfway during writeback.
Unlike data checksum verification, if this happened on inodes without
data checksum but has the data has extra mirrors, it will lead to
stealth data mismatch on different mirrors.
This will be way harder to detect without data checksum.
Furthermore for RAID56, we can even have data without checksum and data
with checksum mixed inside the same full stripe.
In that case if the direct IO buffer got changed halfway for the
nodatasum part, the data with checksum immediately lost its ability to
recover, e.g.:
" " = Good old data or parity calculated using good old data
"X" = Data modified during writeback
0 32K 64K
Data 1 | | Has csum
Data 2 |XXXXXXXXXXXXXXXX | No csum
Parity | |
In above case, the parity is calculated using data 1 (has csum, from
page cache, won't change during writeback), and old data 2 (has no csum,
direct IO write).
After parity is calculated, but before submission to the storage, direct
IO buffer of data 2 is modified, causing the range [0, 32K) of data 2
has a different content.
Now all data is submitted to the storage, and the fs got fully synced.
Then the device of data 1 is lost, has to be rebuilt from data 2 and
parity. But since the data 2 has some modified data, and the parity is
calculated using old data, the recovered data is no the same for data 1,
causing data checksum mismatch.
[FIX]
Fix the problem by checking the data allocation profile.
If our data allocation profile is either RAID0 or SINGLE, we can allow
true zero-copy direct IO and the end user is fully responsible for any
race.
However this is not going to fix all situations, as it's still possible
to race with balance where the fs got a new data profile after the data
allocation profile check.
But this fix should still greatly reduce the window of the original bug.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c7d1d4ff56744074e005771aff193b927392d51f ]
There is no need to BUG(), we can just return an error and log an error
message.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ecb7c2484cfc83a93658907580035a8adf1e0a92 ]
If btrfs_search_slot_for_read() returns 1, it means we did not find any
key greater than or equals to the key we asked for, meaning we have
reached the end of the tree and therefore the path is not valid. If
this happens we need to break out of the loop and stop, instead of
continuing and accessing an invalid path.
Fixes: 5223cc60b40a ("btrfs: drop the path before adding qgroup items when enabling qgroups")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2155d0c0a761a56ce7ede83a26eb23ea0f935260 ]
When initializing the delayed refs block reserve for a transaction handle
we are passing a type of BTRFS_BLOCK_RSV_DELOPS, which is meant for
delayed items and not for delayed refs. The correct type for delayed refs
is BTRFS_BLOCK_RSV_DELREFS.
On release of any excess space reserved in a local delayed refs reserve,
we also should transfer that excess space to the global block reserve
(it it's full, we return to the space info for general availability).
By initializing a transaction's local delayed refs block reserve with a
type of BTRFS_BLOCK_RSV_DELOPS, we were also causing any excess space
released from the delayed block reserve (fs_info->delayed_block_rsv, used
for delayed inodes and items) to be transferred to the global block
reserve instead of the global delayed refs block reserve. This was an
unintentional change in commit 28270e25c69a ("btrfs: always reserve space
for delayed refs when starting transaction"), but it's not particularly
serious as things tend to cancel out each other most of the time and it's
relatively rare to be anywhere near exhaustion of the global reserve.
Fix this by initializing a transaction's local delayed refs reserve with
a type of BTRFS_BLOCK_RSV_DELREFS and making btrfs_block_rsv_release()
attempt to transfer unused space from such a reserve into the global block
reserve, just as we did before that commit for when the block reserve is
a delayed refs rsv.
Reported-by: Alex Lyakas <alex.lyakas@zadara.com>
Link: https://lore.kernel.org/linux-btrfs/CAOcd+r0FHG5LWzTSu=LknwSoqxfw+C00gFAW7fuX71+Z5AfEew@mail.gmail.com/
Fixes: 28270e25c69a ("btrfs: always reserve space for delayed refs when starting transaction")
Reviewed-by: Alex Lyakas <alex.lyakas@zadara.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5870ec7c8fe57a8b2c65005e5da5efc054faa3e6 ]
Block group size classes are managed consistently everywhere.
Currently, btrfs_use_block_group_size_class() sets a block group's size
class to specialize it for a specific allocation size. However, this
size class remains "stale" even if the block group becomes completely
empty (both used and reserved bytes reach zero).
This happens in two scenarios:
1. When space reservations are freed (e.g., due to errors or transaction
aborts) via btrfs_free_reserved_bytes().
2. When the last extent in a block group is freed via
btrfs_update_block_group().
While size classes are advisory, a stale size class can cause
find_free_extent to unnecessarily skip candidate block groups during
initial search loops. This undermines the purpose of size classes to
reduce fragmentation by keeping block groups restricted to a specific
size class when they could be reused for any size.
Fix this by resetting the size class to BTRFS_BG_SZ_NONE whenever a
block group's used and reserved counts both reach zero. This ensures
that empty block groups are fully available for any allocation size in
the next cycle.
Fixes: 52bb7a2166af ("btrfs: introduce size class to block group allocator")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b14c5e04bd0f722ed631845599d52d03fcae1bc1 ]
I have been observing a number of systems aborting at
insert_dev_extents() in btrfs_create_pending_block_groups(). The
following is a sample stack trace of such an abort coming from forced
chunk allocation (typically behind CONFIG_BTRFS_EXPERIMENTAL) but this
can theoretically happen to any DUP chunk allocation.
[81.801] ------------[ cut here ]------------
[81.801] BTRFS: Transaction aborted (error -17)
[81.801] WARNING: fs/btrfs/block-group.c:2876 at btrfs_create_pending_block_groups+0x721/0x770 [btrfs], CPU#1: bash/319
[81.802] Modules linked in: virtio_net btrfs xor zstd_compress raid6_pq null_blk
[81.803] CPU: 1 UID: 0 PID: 319 Comm: bash Kdump: loaded Not tainted 6.19.0-rc6+ #319 NONE
[81.803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014
[81.804] RIP: 0010:btrfs_create_pending_block_groups+0x723/0x770 [btrfs]
[81.806] RSP: 0018:ffffa36241a6bce8 EFLAGS: 00010282
[81.806] RAX: 000000000000000d RBX: ffff8e699921e400 RCX: 0000000000000000
[81.807] RDX: 0000000002040001 RSI: 00000000ffffffef RDI: ffffffffc0608bf0
[81.807] RBP: 00000000ffffffef R08: ffff8e69830f6000 R09: 0000000000000007
[81.808] R10: ffff8e699921e5e8 R11: 0000000000000000 R12: ffff8e6999228000
[81.808] R13: ffff8e6984d82000 R14: ffff8e69966a69c0 R15: ffff8e69aa47b000
[81.809] FS: 00007fec6bdd9740(0000) GS:ffff8e6b1b379000(0000) knlGS:0000000000000000
[81.809] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[81.810] CR2: 00005604833670f0 CR3: 0000000116679000 CR4: 00000000000006f0
[81.810] Call Trace:
[81.810] <TASK>
[81.810] __btrfs_end_transaction+0x3e/0x2b0 [btrfs]
[81.811] btrfs_force_chunk_alloc_store+0xcd/0x140 [btrfs]
[81.811] kernfs_fop_write_iter+0x15f/0x240
[81.812] vfs_write+0x264/0x500
[81.812] ksys_write+0x6c/0xe0
[81.812] do_syscall_64+0x66/0x770
[81.812] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[81.813] RIP: 0033:0x7fec6be66197
[81.814] RSP: 002b:00007fffb159dd30 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[81.815] RAX: ffffffffffffffda RBX: 00007fec6bdd9740 RCX: 00007fec6be66197
[81.815] RDX: 0000000000000002 RSI: 0000560483374f80 RDI: 0000000000000001
[81.816] RBP: 0000560483374f80 R08: 0000000000000000 R09: 0000000000000000
[81.816] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[81.817] R13: 00007fec6bfb85c0 R14: 00007fec6bfb5ee0 R15: 00005604833729c0
[81.817] </TASK>
[81.817] irq event stamp: 20039
[81.818] hardirqs last enabled at (20047): [<ffffffff99a68302>] __up_console_sem+0x52/0x60
[81.818] hardirqs last disabled at (20056): [<ffffffff99a682e7>] __up_console_sem+0x37/0x60
[81.819] softirqs last enabled at (19470): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
[81.819] softirqs last disabled at (19463): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
[81.820] ---[ end trace 0000000000000000 ]---
[81.820] BTRFS: error (device dm-7 state A) in btrfs_create_pending_block_groups:2876: errno=-17 Object already exists
Inspecting these aborts with drgn, I observed a pattern of overlapping
chunk_maps. Note how stripe 1 of the first chunk overlaps in physical
address with stripe 0 of the second chunk.
Physical Start Physical End Length Logical Type Stripe
----------------------------------------------------------------------------------------------------
0x0000000102500000 0x0000000142500000 1.0G 0x0000000641d00000 META|DUP 0/2
0x0000000142500000 0x0000000182500000 1.0G 0x0000000641d00000 META|DUP 1/2
0x0000000142500000 0x0000000182500000 1.0G 0x0000000601d00000 META|DUP 0/2
0x0000000182500000 0x00000001c2500000 1.0G 0x0000000601d00000 META|DUP 1/2
Now how could this possibly happen? All chunk allocation is protected by
the chunk_mutex so racing allocations should see a consistent view of
the CHUNK_ALLOCATED bit in the chunk allocation extent-io-tree
(device->alloc_state as set by chunk_map_device_set_bits()) The tree
itself is protected by a spin lock, and clearing/setting the bits is
always protected by fs_info->mapping_tree_lock, so no race is apparent.
It turns out that there is a subtle bug in the logic regarding chunk
allocations that have happened in the current transaction, known as
"pending extents". The chunk allocation as defined in
find_free_dev_extent() is a loop which searches the commit root of the
dev_root and looks for gaps between DEV_EXTENT items. For those gaps, it
then checks alloc_state bitmap for any pending extents and adjusts the
hole that it finds accordingly. However, the logic in that adjustment
assumes that the first pending extent is the only one in that range.
e.g., given a layout with two non-consecutive pending extents in a hole
passed to dev_extent_hole_check() via *hole_start and *hole_size:
|----pending A----| real hole |----pending B----|
| candidate hole |
*hole_start *hole_start + *hole_size
the code incorrectly returns a "hole" from the end of pending extent A
until the passed in hole end, failing to account for pending B.
However, it is not entirely obvious that it is actually possible to
produce such a layout. I was able to reproduce it, but with some
contortions: I continued to use the force chunk allocation sysfs file
and I introduced a long delay (10 seconds) into the start of the cleaner
thread. I also prevented the unused bgs cleaning logic from ever
deleting metadata bgs. These help make it easier to deterministically
produce the condition but shouldn't really matter if you imagine the
conditions happening by race/luck. Allocations/frees can happen
concurrently with the cleaner thread preparing to process an unused
extent and both create some used chunks with an unused chunk
interleaved, all during one transaction. Then btrfs_delete_unused_bgs()
sees the unused one and clears it, leaving a range with several pending
chunk allocations and a gap in the middle.
The basic idea is that the unused_bgs cleanup work happens on a worker
so if we allocate 3 block groups in one transaction, then the cleaner
work kicked off by the previous transaction comes through and deletes
the middle one of the 3, then the commit root shows no dev extents and
we have the bad pattern in the extent-io-tree. One final consideration
is that the code happens to loop to the next hole if there are no more
extents at all, so we need one more dev extent way past the area we are
working in. Something like the following demonstrates the technique:
# push the BG frontier out to 20G
fallocate -l 20G $mnt/foo
# allocate one more that will prevent the "no more dev extents" luck
fallocate -l 1G $mnt/sticky
# sync
sync
# clear out the allocation area
rm $mnt/foo
sync
_cleaner
# let everything quiesce
sleep 20
sync
# dev tree should have one bg 20G out and the rest at the beginning..
# sort of like an empty FS but with a random sticky chunk.
# kick off the cleaner in the background, remember it will sleep 10s
# before doing interesting work
_cleaner &
sleep 3
# create 3 trivial block groups, all empty, all immediately marked as unused.
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
echo 1 > "$(_btrfs_sysfs_space_info $dev data)/force_chunk_alloc"
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
# let the cleaner thread definitely finish, it will remove the data bg
sleep 10
# this allocation sees the non-consecutive pending metadata chunks with
# data chunk gap of 1G and allocates a 2G extent in that hole. ENOSPC!
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
As for the fix, it is not that obvious. I could not see a trivial way to
do it even by adding backup loops into find_free_dev_extent(), so I
opted to change the semantics of dev_extent_hole_check() to not stop
looping until it finds a sufficiently big hole. For clarity, this also
required changing the helper function contains_pending_extent() into two
new helpers which find the first pending extent and the first suitable
hole in a range.
I attempted to clean up the documentation and range calculations to be
as consistent and clear as possible for the future.
I also looked at the zoned case and concluded that the loop there is
different and not to be unified with this one. As far as I can tell, the
zoned check will only further constrain the hole so looping back to find
more holes is acceptable. Though given that zoned really only appends, I
find it highly unlikely that it is susceptible to this bug.
Fixes: 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole")
Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Closes: https://lore.kernel.org/linux-btrfs/q7760374-q1p4-029o-5149-26p28421s468@tzk.arg/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3a1f4264daed4b419c325a7fe35e756cada3cf82 ]
When the incompat flag EXTENT_TREE_V2 is set, we unconditionally add the
block group tree to the switch_commits list before calling
switch_commit_roots, as we do for the tree root and the chunk root.
However, the block group tree uses normal root dirty tracking and in any
transaction that does an allocation and dirties a block group, the block
group root will already be linked to a list by the dirty_list field and
this use of list_add_tail() is invalid and corrupts the prev/next
members of block_group_root->dirty_list.
This is apparent on a subsequent list_del on the prev if we enable
CONFIG_DEBUG_LIST:
[32.1571] ------------[ cut here ]------------
[32.1572] list_del corruption. next->prev should beffff958890202538, but was ffff9588992bd538. (next=ffff958890201538)
[32.1575] WARNING: lib/list_debug.c:65 at 0x0, CPU#3: sync/607
[32.1583] CPU: 3 UID: 0 PID: 607 Comm: sync Not tainted 6.18.0 #24PREEMPT(none)
[32.1585] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS1.17.0-4.fc41 04/01/2014
[32.1587] RIP: 0010:__list_del_entry_valid_or_report+0x108/0x120
[32.1593] RSP: 0018:ffffaa288287fdd0 EFLAGS: 00010202
[32.1594] RAX: 0000000000000001 RBX: ffff95889326e800 RCX:ffff958890201538
[32.1596] RDX: ffff9588992bd538 RSI: ffff958890202538 RDI:ffffffff82a41e00
[32.1597] RBP: ffff958890202538 R08: ffffffff828fc1e8 R09:00000000ffffefff
[32.1599] R10: ffffffff8288c200 R11: ffffffff828e4200 R12:ffff958890201538
[32.1601] R13: ffff95889326e958 R14: ffff958895c24000 R15:ffff958890202538
[32.1603] FS: 00007f0c28eb5740(0000) GS:ffff958af2bd2000(0000)knlGS:0000000000000000
[32.1605] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32.1607] CR2: 00007f0c28e8a3cc CR3: 0000000109942005 CR4:0000000000370ef0
[32.1609] Call Trace:
[32.1610] <TASK>
[32.1611] switch_commit_roots+0x82/0x1d0 [btrfs]
[32.1615] btrfs_commit_transaction+0x968/0x1550 [btrfs]
[32.1618] ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs]
[32.1621] __iterate_supers+0xe8/0x190
[32.1622] ? __pfx_sync_fs_one_sb+0x10/0x10
[32.1623] ksys_sync+0x63/0xb0
[32.1624] __do_sys_sync+0xe/0x20
[32.1625] do_syscall_64+0x73/0x450
[32.1626] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[32.1627] RIP: 0033:0x7f0c28d05d2b
[32.1632] RSP: 002b:00007ffc9d988048 EFLAGS: 00000246 ORIG_RAX:00000000000000a2
[32.1634] RAX: ffffffffffffffda RBX: 00007ffc9d988228 RCX:00007f0c28d05d2b
[32.1636] RDX: 00007f0c28e02301 RSI: 00007ffc9d989b21 RDI:00007f0c28dba90d
[32.1637] RBP: 0000000000000001 R08: 0000000000000001 R09:0000000000000000
[32.1639] R10: 0000000000000000 R11: 0000000000000246 R12:000055b96572cb80
[32.1641] R13: 000055b96572b19f R14: 00007f0c28dfa434 R15:000055b96572b034
[32.1643] </TASK>
[32.1644] irq event stamp: 0
[32.1644] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[32.1646] hardirqs last disabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260
[32.1648] softirqs last enabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260
[32.1650] softirqs last disabled at (0): [<0000000000000000>] 0x0
[32.1652] ---[ end trace 0000000000000000 ]---
Furthermore, this list corruption eventually (when we happen to add a
new block group) results in getting the switch_commits and
dirty_cowonly_roots lists mixed up and attempting to call update_root
on the tree root which can't be found in the tree root, resulting in a
transaction abort:
[87.8269] BTRFS critical (device nvme1n1): unable to find root key (1 0 0) in tree 1
[87.8272] ------------[ cut here ]------------
[87.8274] BTRFS: Transaction aborted (error -117)
[87.8275] WARNING: fs/btrfs/root-tree.c:153 at 0x0, CPU#4: sync/703
[87.8285] CPU: 4 UID: 0 PID: 703 Comm: sync Not tainted 6.18.0 #25 PREEMPT(none)
[87.8287] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-4.fc41 04/01/2014
[87.8289] RIP: 0010:btrfs_update_root+0x296/0x790 [btrfs]
[87.8295] RSP: 0018:ffffa58d035dfd60 EFLAGS: 00010282
[87.8297] RAX: ffff9a59126ddb68 RBX: ffff9a59126dc000 RCX: 0000000000000000
[87.8299] RDX: 0000000000000000 RSI: 00000000ffffff8b RDI: ffffffffc0b28270
[87.8301] RBP: ffff9a5904aec000 R08: 0000000000000000 R09: 00000000ffffefff
[87.8303] R10: ffffffff9ac8c200 R11: ffffffff9ace4200 R12: 0000000000000001
[87.8305] R13: ffff9a59041740e8 R14: ffff9a5904aec1f7 R15: ffff9a590fdefaf0
[87.8307] FS: 00007f54cde6b740(0000) GS:ffff9a5b5a81c000(0000) knlGS:0000000000000000
[87.8309] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87.8310] CR2: 00007f54cde403cc CR3: 0000000112902004 CR4: 0000000000370ef0
[87.8312] Call Trace:
[87.8313] <TASK>
[87.8314] ? _raw_spin_unlock+0x23/0x40
[87.8315] commit_cowonly_roots+0x1ad/0x250 [btrfs]
[87.8317] ? btrfs_commit_transaction+0x79b/0x1560 [btrfs]
[87.8320] btrfs_commit_transaction+0x8aa/0x1560 [btrfs]
[87.8322] ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs]
[87.8325] __iterate_supers+0xf1/0x170
[87.8326] ? __pfx_sync_fs_one_sb+0x10/0x10
[87.8327] ksys_sync+0x63/0xb0
[87.8328] __do_sys_sync+0xe/0x20
[87.8329] do_syscall_64+0x73/0x450
[87.8330] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[87.8331] RIP: 0033:0x7f54cdd05d2b
[87.8336] RSP: 002b:00007fff1b58ff78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[87.8338] RAX: ffffffffffffffda RBX: 00007fff1b590158 RCX: 00007f54cdd05d2b
[87.8340] RDX: 00007f54cde02301 RSI: 00007fff1b592b66 RDI: 00007f54cddba90d
[87.8342] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
[87.8344] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e07ca96b80
[87.8346] R13: 000055e07ca9519f R14: 00007f54cddfa434 R15: 000055e07ca95034
[87.8348] </TASK>
[87.8348] irq event stamp: 0
[87.8349] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[87.8351] hardirqs last disabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0
[87.8353] softirqs last enabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0
[87.8355] softirqs last disabled at (0): [<0000000000000000>] 0x0
[87.8357] ---[ end trace 0000000000000000 ]---
[87.8358] BTRFS: error (device nvme1n1 state A) in btrfs_update_root:153: errno=-117 Filesystem corrupted
[87.8360] BTRFS info (device nvme1n1 state EA): forced readonly
[87.8362] BTRFS warning (device nvme1n1 state EA): Skipping commit of aborted transaction.
[87.8364] BTRFS: error (device nvme1n1 state EA) in cleanup_transaction:2037: errno=-117 Filesystem corrupted
Since the block group tree was pulled out of the extent tree and uses
normal root dirty tracking, remove the offending extra list_add. This
fixes the list corruption and the resulting fs corruption.
Fixes: 14033b08a029 ("btrfs: don't save block group root into super block")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 51b1fcf71c88c3c89e7dcf07869c5de837b1f428 ]
If we fail to delete the second qgroup relation item, we end up returning
success or -ENOENT in case the first item does not exist, instead of
returning the error from the second item deletion.
Fixes: 73798c465b66 ("btrfs: qgroup: Try our best to delete qgroup relations")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b39b26e017c7889181cb84032e22bef72e81cf29 ]
In case of a zoned RAID, it can happen that a data write is targeting a
sequential write required zone and a conventional zone. In this case the
bio will be marked as REQ_OP_ZONE_APPEND but for the conventional zone,
this needs to be REQ_OP_WRITE.
The setting of REQ_OP_ZONE_APPEND is deferred to the last possible time in
btrfs_submit_dev_bio(), but the decision if we can use zone append is
cached in btrfs_bio.
CC: Naohiro Aota <naohiro.aota@wdc.com>
Fixes: e9b9b911e03c ("btrfs: add raid stripe tree to features enabled with debug config")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A regression fix for a memory leak when raid56 is used"
* tag 'for-6.19-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap
|
|
We allocate the bitmap but we never free it in free_raid_bio_pointers().
Fix this by adding a bitmap_free() call against the stripe_uptodate_bitmap
of a raid bio.
Fixes: 1810350b04ef ("btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap")
Reported-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-btrfs/20260126045315.GA31641@lst.de/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix leaked folio refcount on s390x when using hw zlib compression
acceleration
- remove own threshold from ->writepages() which could collide with
cgroup limits and lead to a deadlock when metadadata are not written
because the amount is under the internal limit
* tag 'for-6.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zlib: fix the folio leak on S390 hardware acceleration
btrfs: do not strictly require dirty metadata threshold for metadata writepages
|
|
[BUG]
After commit aa60fe12b4f4 ("btrfs: zlib: refactor S390x HW acceleration
buffer preparation"), we no longer release the folio of the page cache
of folio returned by btrfs_compress_filemap_get_folio() for S390
hardware acceleration path.
[CAUSE]
Before that commit, we call kumap_local() and folio_put() after handling
each folio.
Although the timing is not ideal (it release previous folio at the
beginning of the loop, and rely on some extra cleanup out of the loop),
it at least handles the folio release correctly.
Meanwhile the refactored code is easier to read, it lacks the call to
release the filemap folio.
[FIX]
Add the missing folio_put() for copy_data_into_buffer().
CC: linux-s390@vger.kernel.org # 6.18+
Fixes: aa60fe12b4f4 ("btrfs: zlib: refactor S390x HW acceleration buffer preparation")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is an internal report that over 1000 processes are
waiting at the io_schedule_timeout() of balance_dirty_pages(), causing
a system hang and trigger a kernel coredump.
The kernel is v6.4 kernel based, but the root problem still applies to
any upstream kernel before v6.18.
[CAUSE]
From Jan Kara for his wisdom on the dirty page balance behavior first.
This cgroup dirty limit was what was actually playing the role here
because the cgroup had only a small amount of memory and so the dirty
limit for it was something like 16MB.
Dirty throttling is responsible for enforcing that nobody can dirty
(significantly) more dirty memory than there's dirty limit. Thus when
a task is dirtying pages it periodically enters into balance_dirty_pages()
and we let it sleep there to slow down the dirtying.
When the system is over dirty limit already (either globally or within
a cgroup of the running task), we will not let the task exit from
balance_dirty_pages() until the number of dirty pages drops below the
limit.
So in this particular case, as I already mentioned, there was a cgroup
with relatively small amount of memory and as a result with dirty limit
set at 16MB. A task from that cgroup has dirtied about 28MB worth of
pages in btrfs btree inode and these were practically the only dirty
pages in that cgroup.
So that means the only way to reduce the dirty pages of that cgroup is
to writeback the dirty pages of btrfs btree inode, and only after that
those processes can exit balance_dirty_pages().
Now back to the btrfs part, btree_writepages() is responsible for
writing back dirty btree inode pages.
The problem here is, there is a btrfs internal threshold that if the
btree inode's dirty bytes are below the 32M threshold, it will not
do any writeback.
This behavior is to batch as much metadata as possible so we won't write
back those tree blocks and then later re-COW them again for another
modification.
This internal 32MiB is higher than the existing dirty page size (28MiB),
meaning no writeback will happen, causing a deadlock between btrfs and
cgroup:
- Btrfs doesn't want to write back btree inode until more dirty pages
- Cgroup/MM doesn't want more dirty pages for btrfs btree inode
Thus any process touching that btree inode is put into sleep until
the number of dirty pages is reduced.
Thanks Jan Kara a lot for the analysis of the root cause.
[ENHANCEMENT]
Since kernel commit b55102826d7d ("btrfs: set AS_KERNEL_FILE on the
btree_inode"), btrfs btree inode pages will only be charged to the root
cgroup which should have a much larger limit than btrfs' 32MiB
threshold.
So it should not affect newer kernels.
But for all current LTS kernels, they are all affected by this problem,
and backporting the whole AS_KERNEL_FILE may not be a good idea.
Even for newer kernels I still think it's a good idea to get
rid of the internal threshold at btree_writepages(), since for most cases
cgroup/MM has a better view of full system memory usage than btrfs' fixed
threshold.
For internal callers using btrfs_btree_balance_dirty() since that
function is already doing internal threshold check, we don't need to
bother them.
But for external callers of btree_writepages(), just respect their
requests and write back whatever they want, ignoring the internal
btrfs threshold to avoid such deadlock on btree inode dirty page
balancing.
CC: stable@vger.kernel.org
CC: Jan Kara <jack@suse.cz>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- protect reading super block vs setting block size externally (found
by syzbot)
- make sure no transaction is started in read-only mode even with some
rescue mount option combinations
- fix checksum calculation of backup super blocks when block-group-tree
is enabled
- more extensive mount-time checks of device items that could be left
after device replace and attempting degraded mount
- fix build warning with -Wmaybe-uninitialized on loongarch64-gcc 12
* tag 'for-6.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add extra device item checks at mount
btrfs: fix missing fields in superblock backup with BLOCK_GROUP_TREE
btrfs: reject new transactions if the fs is fully read-only
btrfs: sync read disk super and set block size
btrfs: fix Wmaybe-uninitialized warning in replay_one_buffer()
|
|
[BUG]
There is a bug report where after a dev-replace, the replace source
device with devid 4 is properly erased (dump tree shows it's the old
devid 4), but the target device is still using devid 0.
When the user tries to mount the fs degraded, the mount failed with the
following errors:
BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 5 transid 1394395 /dev/sda (8:0) scanned by btrfs (261)
BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 6 transid 1394395 /dev/sde (8:64) scanned by btrfs (261)
BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 0 transid 1394395 /dev/sdd (8:48) scanned by btrfs (261)
BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 3 transid 1394395 /dev/sdf (8:80) scanned by btrfs (261)
BTRFS info (device sdd): first mount of filesystem 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
BTRFS info (device sdd): using crc32c (crc32c-intel) checksum algorithm
BTRFS warning (device sdd): devid 4 uuid 01e2081c-9c2a-4071-b9f4-e1b27e571ff5 is missing
BTRFS info (device sdd): bdev <missing disk> errs: wr 84994544, rd 15567, flush 65872, corrupt 0, gen 0
BTRFS info (device sdd): bdev /dev/sdd errs: wr 71489901, rd 0, flush 30001, corrupt 0, gen 0
BTRFS error (device sdd): replace without active item, run 'device scan --forget' on the target device
BTRFS error (device sdd): failed to init dev_replace: -117
BTRFS error (device sdd): open_ctree failed: -117
[CAUSE]
The devid 0 didn't get its devid updated is its own problem, here I'm
only focusing on the mount failure itself.
The mount is not caused by the missing device, as the fs has RAID1C3 for
metadata and RAID10 for data, thus is completely able to tolerate one
missing device.
The device tree shows the dev-replace has properly finished:
item 7 key (0 DEV_REPLACE 0) itemoff 15931 itemsize 72
src devid -1 cursor left 11091821199360 cursor right 11091821199360 mode ALWAYS
state FINISHED write errors 0 uncorrectable read errors 0
^^^^^^^^
And the chunk tree shows there is no devid 0:
leaf 37980736602112 items 23 free space 12548 generation 1394388 owner CHUNK_TREE
leaf 37980736602112 flags 0x1(WRITTEN) backref revision 1
fs uuid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
chunk uuid d074c661-6311-4570-b59f-a5c83fd37f8e
item 0 key (DEV_ITEMS DEV_ITEM 3) itemoff 16185 itemsize 98
devid 3 total_bytes 20000588955648 bytes_used 8282877984768
io_align 4096 io_width 4096 sector_size 4096 type 0
generation 0 start_offset 0 dev_group 0
seek_speed 0 bandwidth 0
uuid 0d596b69-fb0d-4031-b4af-a301d0868b8b
fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
...
Which shows the first device is devid 3.
But there is indeed /dev/sdd with devid 0:
superblock: bytenr=65536, device=/dev/sdd
---------------------------------------------------------
csum_type 0 (crc32c)
csum_size 4
csum 0xd4bed87e [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9
...
uuid_tree_generation 1394388
dev_item.uuid ee6532ad-5442-45f7-87fb-7703e29ed934
dev_item.fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 [match]
dev_item.type 0
dev_item.total_bytes 20000588955648
dev_item.bytes_used 8292541661184
dev_item.io_align 0
dev_item.io_width 0
dev_item.sector_size 0
dev_item.devid 0 <<<
So this means device scan will register sdd as devid 0 into the fs, then
during btrfs_init_dev_replace(), we located the replace progress item,
found the previous replace is finished, but we still need to check if
the dev-replace target device (devid 0) exists.
If that device exists, we error out showing that error message.
But to be honest the end user may not really remember which device is
the replace target device, thus not sure what to do in the next step.
[ENHANCEMENT]
To make the error more obvious, and tell the end user which devices
should be unregistered:
- Introduce BTRFS_DEV_STATE_ITEM_FOUND flag
During device item read from the chunk tree, set the flag for each
found device item.
- Verify there is no device without the above flag during mount
Even missing device should have that flag set.
If we found a device without that flag set, it means it's an
unexpected one and should be rejected.
- More detailed error message on what to do next
This will show all unexpected devices and tell the end user to use
'btrfs dev scan --forget' to forget them or remove them before mount.
There is an example dmesg where a device of a valid filesystem is modified to
have devid 0, then try degraded mount:
BTRFS info (device dm-6): first mount of filesystem 7c873869-844c-4b39-bd75-a96148bf4656
BTRFS info (device dm-6): using crc32c checksum algorithm
BTRFS warning (device dm-6): devid 3 uuid b4a9f35b-db42-4ac4-b55a-cbf81d3b9683 is missing
BTRFS error (device dm-6): devid 0 path /dev/mapper/test-scratch3 is registered but not found in chunk tree
BTRFS error (device dm-6): please remove above devices or use 'btrfs device scan --forget <dev>' to unregister them before mount
BTRFS error (device dm-6): open_ctree failed: -117
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When the BLOCK_GROUP_TREE compat_ro flag is set, the extent root and
csum root fields are getting missed.
This is because EXTENT_TREE_V2 treated these differently, and when
they were split off this special-casing was mistakenly assigned to
BGT rather than the rump EXTENT_TREE_V2. There's no reason why the
existence of the block group tree should mean that we don't record the
details of the last commit's extent root and csum root.
Fix the code in backup_super_roots() so that the correct check gets
made.
Fixes: 1c56ab991903 ("btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is a bug report where a heavily fuzzed fs is mounted with all
rescue mount options, which leads to the following warnings during
unmount:
BTRFS: Transaction aborted (error -22)
Modules linked in:
CPU: 0 UID: 0 PID: 9758 Comm: repro.out Not tainted
6.19.0-rc5-00002-gb71e635feefc #7 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:find_free_extent_update_loop fs/btrfs/extent-tree.c:4208 [inline]
RIP: 0010:find_free_extent+0x52f0/0x5d20 fs/btrfs/extent-tree.c:4611
Call Trace:
<TASK>
btrfs_reserve_extent+0x2cd/0x790 fs/btrfs/extent-tree.c:4705
btrfs_alloc_tree_block+0x1e1/0x10e0 fs/btrfs/extent-tree.c:5157
btrfs_force_cow_block+0x578/0x2410 fs/btrfs/ctree.c:517
btrfs_cow_block+0x3c4/0xa80 fs/btrfs/ctree.c:708
btrfs_search_slot+0xcad/0x2b50 fs/btrfs/ctree.c:2130
btrfs_truncate_inode_items+0x45d/0x2350 fs/btrfs/inode-item.c:499
btrfs_evict_inode+0x923/0xe70 fs/btrfs/inode.c:5628
evict+0x5f4/0xae0 fs/inode.c:837
__dentry_kill+0x209/0x660 fs/dcache.c:670
finish_dput+0xc9/0x480 fs/dcache.c:879
shrink_dcache_for_umount+0xa0/0x170 fs/dcache.c:1661
generic_shutdown_super+0x67/0x2c0 fs/super.c:621
kill_anon_super+0x3b/0x70 fs/super.c:1289
btrfs_kill_super+0x41/0x50 fs/btrfs/super.c:2127
deactivate_locked_super+0xbc/0x130 fs/super.c:474
cleanup_mnt+0x425/0x4c0 fs/namespace.c:1318
task_work_run+0x1d4/0x260 kernel/task_work.c:233
exit_task_work include/linux/task_work.h:40 [inline]
do_exit+0x694/0x22f0 kernel/exit.c:971
do_group_exit+0x21c/0x2d0 kernel/exit.c:1112
__do_sys_exit_group kernel/exit.c:1123 [inline]
__se_sys_exit_group kernel/exit.c:1121 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1121
x64_sys_call+0x2210/0x2210 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xe8/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x44f639
Code: Unable to access opcode bytes at 0x44f60f.
RSP: 002b:00007ffc15c4e088 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00000000004c32f0 RCX: 000000000044f639
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
RBP: 0000000000000001 R08: ffffffffffffffc0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004c32f0
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
</TASK>
Since rescue mount options will mark the full fs read-only, there should
be no new transaction triggered.
But during unmount we will evict all inodes, which can trigger a new
transaction, and triggers warnings on a heavily corrupted fs.
[CAUSE]
Btrfs allows new transaction even on a read-only fs, this is to allow
log replay happen even on read-only mounts, just like what ext4/xfs do.
However with rescue mount options, the fs is fully read-only and cannot
be remounted read-write, thus in that case we should also reject any new
transactions.
[FIX]
If we find the fs has rescue mount options, we should treat the fs as
error, so that no new transaction can be started.
Reported-by: Jiaming Zhang <r772577952@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CANypQFYw8Nt8stgbhoycFojOoUmt+BoZ-z8WJOZVxcogDdwm=Q@mail.gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When the user performs a btrfs mount, the block device is not set
correctly. The user sets the block size of the block device to 0x4000
by executing the BLKBSZSET command.
Since the block size change also changes the mapping->flags value, this
further affects the result of the mapping_min_folio_order() calculation.
Let's analyze the following two scenarios:
Scenario 1: Without executing the BLKBSZSET command, the block size is
0x1000, and mapping_min_folio_order() returns 0;
Scenario 2: After executing the BLKBSZSET command, the block size is
0x4000, and mapping_min_folio_order() returns 2.
do_read_cache_folio() allocates a folio before the BLKBSZSET command
is executed. This results in the allocated folio having an order value
of 0. Later, after BLKBSZSET is executed, the block size increases to
0x4000, and the mapping_min_folio_order() calculation result becomes 2.
This leads to two undesirable consequences:
1. filemap_add_folio() triggers a VM_BUG_ON_FOLIO(folio_order(folio) <
mapping_min_folio_order(mapping)) assertion.
2. The syzbot report [1] shows a null pointer dereference in
create_empty_buffers() due to a buffer head allocation failure.
Synchronization should be established based on the inode between the
BLKBSZSET command and read cache page to prevent inconsistencies in
block size or mapping flags before and after folio allocation.
[1]
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:create_empty_buffers+0x4d/0x480 fs/buffer.c:1694
Call Trace:
folio_create_buffers+0x109/0x150 fs/buffer.c:1802
block_read_full_folio+0x14c/0x850 fs/buffer.c:2403
filemap_read_folio+0xc8/0x2a0 mm/filemap.c:2496
do_read_cache_folio+0x266/0x5c0 mm/filemap.c:4096
do_read_cache_page mm/filemap.c:4162 [inline]
read_cache_page_gfp+0x29/0x120 mm/filemap.c:4195
btrfs_read_disk_super+0x192/0x500 fs/btrfs/volumes.c:1367
Reported-by: syzbot+b4a2af3000eaa84d95d5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b4a2af3000eaa84d95d5
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|