[ Upstream commit 8a1ae2781dee9fc21ca82db682d37bea4bd074ad ]
Now that lockdep stays enabled through our entire CI runs, I started
seeing the following stack in generic/475:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 2171864 at fs/btrfs/discard.c:604 btrfs_discard_update_discardable+0x98/0xb0
CPU: 1 PID: 2171864 Comm: kworker/u4:0 Not tainted 5.19.0-rc8+ #789
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: btrfs-cache btrfs_work_helper
RIP: 0010:btrfs_discard_update_discardable+0x98/0xb0
RSP: 0018:ffffb857c2f7bad0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8c85c605c200 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffff86807c5b RDI: ffffffff868a831e
RBP: ffff8c85c4c54000 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8c85c66932f0 R11: 0000000000000001 R12: ffff8c85c3899010
R13: ffff8c85d5be4f40 R14: ffff8c85c4c54000 R15: ffff8c86114bfa80
FS: 0000000000000000(0000) GS:ffff8c863bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2e7f168160 CR3: 000000010289a004 CR4: 0000000000370ee0
Call Trace:
__btrfs_remove_free_space_cache+0x27/0x30
load_free_space_cache+0xad2/0xaf0
caching_thread+0x40b/0x650
? lock_release+0x137/0x2d0
btrfs_work_helper+0xf2/0x3e0
? lock_is_held_type+0xe2/0x140
process_one_work+0x271/0x590
? process_one_work+0x590/0x590
worker_thread+0x52/0x3b0
? process_one_work+0x590/0x590
kthread+0xf0/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
This is the code:
ctl = block_group->free_space_ctl;
discard_ctl = &block_group->fs_info->discard_ctl;
lockdep_assert_held(&ctl->tree_lock);
We have a temporary free space ctl for loading the free space cache, in
order to avoid having allocations happen while we're loading the cache.
When we hit an error we free it all up; however, this also calls
btrfs_discard_update_discardable(), which requires
block_group->free_space_ctl->tree_lock to be held. Since this is our
temporary ctl, that lock isn't held. Fix this by calling
__btrfs_remove_free_space_cache_locked() instead, so that we only clean
up the entries and do not touch the discardable stats.
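As an illustration, a minimal sketch of the error path in
load_free_space_cache() after the fix (surrounding logic abbreviated; the
locking shown is an assumption):

	spin_lock(&tmp_ctl.tree_lock);
	/*
	 * Clean up only the free space entries of the temporary ctl; the
	 * discardable stats belong to block_group->free_space_ctl.
	 */
	__btrfs_remove_free_space_cache_locked(&tmp_ctl);
	spin_unlock(&tmp_ctl.tree_lock);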
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dbecac26630014d336a8e5ea67096ff18210fb9c ]
btrfs currently prints information about the space cache or free space tree
being in use on every remount, regardless of whether such remount actually
enabled or disabled one of these features.
This is unnecessary, since remount options that change the state of these
features already print the appropriate notice explicitly.
Let's instead print such unconditional information just on an initial mount,
to avoid filling the kernel log when, for example, laptop-mode-tools
remounts the fs on some events.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f9eab5f0bba76742af654f33d517bf62a0db8f12 ]
[BUG]
The following script shows that, although scrub can detect super block
errors, it never tries to fix them:
mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
xfs_io -c "pwrite 67108864 4k" $dev2
mount $dev1 $mnt
btrfs scrub start -B $dev2
btrfs scrub start -Br $dev2
umount $mnt
The first scrub reports the super error correctly:
scrub done for f3289218-abd3-41ac-a630-202f766c0859
Scrub started: Tue Aug 2 14:44:11 2022
Status: finished
Duration: 0:00:00
Total to scrub: 1.26GiB
Rate: 0.00B/s
Error summary: super=1
Corrected: 0
Uncorrectable: 0
Unverified: 0
But the second read-only scrub still reports the same super error:
Scrub started: Tue Aug 2 14:44:11 2022
Status: finished
Duration: 0:00:00
Total to scrub: 1.26GiB
Rate: 0.00B/s
Error summary: super=1
Corrected: 0
Uncorrectable: 0
Unverified: 0
[CAUSE]
The comment already shows that super block errors can easily be fixed by
committing a transaction:
/*
* If we find an error in a super block, we just report it.
* They will get written with the next transaction commit
* anyway
*/
But the truth is, such an assumption does not always hold, and since scrub
should try to repair every error it finds (except in read-only scrub),
we should actively commit a transaction to fix this.
[FIX]
Just commit a transaction if we found any super block errors, after
everything else is done.
We cannot do this just after scrub_supers(), as
btrfs_commit_transaction() will try to pause and wait for the running
scrub, so we cannot call it with scrub_lock held.
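A minimal sketch of how this can look at the end of btrfs_scrub_dev(),
after scrub_lock has been released (error handling abbreviated; the field
names are assumed from the scrub context):

	if (sctx->stat.super_errors && !sctx->readonly) {
		struct btrfs_trans_handle *trans;

		/* Committing rewrites all super blocks, fixing bad copies. */
		trans = btrfs_start_transaction(fs_info->tree_root, 0);
		if (!IS_ERR(trans))
			ret = btrfs_commit_transaction(trans);
	}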
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e69bf81c9a339f1b2c041b112a6fbb9f60fc9340 ]
[PROBLEM]
Unlike data/metadata corruption, if scrub detected some error in the
super block, the only error message is from the updated device status:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
This is not helpful at all.
[CAUSE]
Unlike data/metadata error reporting, there is no visible report in
the kernel dmesg for super block errors.
In fact, the return value of scrub_checksum_super() is intentionally
ignored, thus scrub_handle_errored_block() will never be called for
super blocks.
[FIX]
Make super block errors output an error message; now the full
dmesg looks like this:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
BTRFS info (device dm-1): scrub: started on devid 2
This fix involves:
- Move the super_errors reporting to scrub_handle_errored_block()
  This allows the device status message to show up after the super block
  error message. But we no longer distinguish super block corruption
  from generation mismatch; both are now counted as corruption.
- Properly check the return value from scrub_checksum_super()
- Add extra super block error reporting for scrub_print_warning().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 62cd9d4474282a1eb84f945955c56cbfc42e1ffe ]
There is an internal report on hitting the following ASSERT() in
recalculate_thresholds():
ASSERT(ctl->total_bitmaps <= max_bitmaps);
Above @max_bitmaps is calculated using the following variables:
- bytes_per_bg
  8 * 4096 * 4096 (128M) for x86_64/x86.
- block_group->length
  The length of the block group.
@max_bitmaps is the rounded-up value of block_group->length / 128M.
Normally one free space cache should not have more bitmaps than the above
value, but when it does happen, the ASSERT() can be triggered if
CONFIG_BTRFS_ASSERT is also enabled.
But the ASSERT() itself won't provide enough info to know what is going
wrong.
Is the bg too small, so that it only allows one bitmap?
Or is there something else wrong?
So although I haven't found extra reports or a crash dump to do further
investigation, add the extra info to make it more helpful to debug.
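A sketch of what the extra debug output could look like (the field names
and message text are illustrative; the actual output may differ):

	if (ctl->total_bitmaps > max_bitmaps)
		btrfs_err(block_group->fs_info,
	"invalid free space ctl: bg start=%llu len=%llu total_bitmaps=%u unit=%u max_bitmaps=%llu bytes_per_bg=%llu",
			  block_group->start, block_group->length,
			  ctl->total_bitmaps, ctl->unit, max_bitmaps,
			  bytes_per_bg);
	ASSERT(ctl->total_bitmaps <= max_bitmaps);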
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
btrfs_init_new_buffer
commit cbddcc4fa3443fe8cfb2ff8e210deb1f6a0eea38 upstream.
syzbot is reporting uninit-value in btrfs_clean_tree_block() [1], for
commit bc877d285ca3dba2 ("btrfs: Deduplicate extent_buffer init code")
missed that btrfs_set_header_generation() in btrfs_init_new_buffer() must
not be moved to after clean_tree_block() because clean_tree_block() is
calling btrfs_header_generation() since commit 55c69072d6bd5be1 ("Btrfs:
Fix extent_buffer usage when nodesize != leafsize").
Since memzero_extent_buffer() will reset the "struct btrfs_header" part, we
can't move btrfs_set_header_generation() to before memzero_extent_buffer().
Just re-add btrfs_set_header_generation() before btrfs_clean_tree_block().
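A sketch of the resulting order in btrfs_init_new_buffer() (abbreviated;
the helpers are the ones named above):

	/* Set the generation before btrfs_clean_tree_block() reads it. */
	btrfs_set_header_generation(buf, trans->transid);
	btrfs_clean_tree_block(buf);

	/* (unrelated initialization elided) */

	/* memzero wipes the header, so the generation must be set again. */
	memzero_extent_buffer(buf, 0, sizeof(struct btrfs_header));
	btrfs_set_header_level(buf, level);
	btrfs_set_header_bytenr(buf, buf->start);
	btrfs_set_header_generation(buf, trans->transid);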
Link: https://syzkaller.appspot.com/bug?extid=fba8e2116a12609b6c59 [1]
Reported-by: syzbot <syzbot+fba8e2116a12609b6c59@syzkaller.appspotmail.com>
Fixes: bc877d285ca3dba2 ("btrfs: Deduplicate extent_buffer init code")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cef7820d6abf8d61f8e1db411eae3c712f6d72a2 upstream.
When dropping extent maps for a range, through btrfs_drop_extent_cache(),
if we find an extent map that starts before our target range and/or ends
after the target range, and we are not able to allocate extent maps for
splitting that extent map, then we don't fail and simply remove the entire
extent map from the inode's extent map tree.
This is generally fine, because in case anyone needs to access the extent
map, it can just load it again later from the respective file extent
item(s) in the subvolume btree. However, if that extent map is new and is
in the list of modified extents, then a fast fsync will miss the parts of
the extent that were outside our range (that needed to be split),
therefore not logging them. Fix that by marking the inode for a full
fsync. This issue was introduced after removing BUG_ON()s triggered when
the split extent map allocations failed, done by commit 7014cdb49305ed
("Btrfs: btrfs_drop_extent_cache should never fail"), back in 2012, at
which point the fast fsync path already existed but was very recent.
Also, in the case where we could allocate extent maps for the split
operations but then fail to add a split extent map to the tree, mark the
inode for a full fsync as well. This is not supposed to ever fail, and we
assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is
not set), it's the correct thing to do to make sure a fast fsync will not
miss a new extent.
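A sketch of the failure handling described above (flow abbreviated; the
label is illustrative):

	split = alloc_extent_map();
	if (!split) {
		/*
		 * No memory to split the extent map: drop it whole, but
		 * force a full fsync so the parts outside the dropped
		 * range are not missed by a fast fsync.
		 */
		btrfs_set_inode_full_sync(inode);
		goto remove_em;
	}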
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 331cd9461412e103d07595a10289de90004ac890 upstream.
When enabling quotas, at btrfs_quota_enable(), after committing the
transaction, we change fs_info->quota_root to point to the quota root we
created and set BTRFS_FS_QUOTA_ENABLED at fs_info->flags. Then we try
to start the qgroup rescan worker, first by initializing it with a call
to qgroup_rescan_init() - however, if that fails, we end up freeing the
quota root while leaving fs_info->quota_root pointing to it, which can
later result in a use-after-free somewhere else.
We have previously set the flags BTRFS_FS_QUOTA_ENABLED and
BTRFS_QGROUP_STATUS_FLAG_ON, so we can only fail with -EINPROGRESS at
btrfs_quota_enable(), which is possible if someone already called the
quota rescan ioctl, and therefore started the rescan worker.
So fix this by ignoring an -EINPROGRESS and asserting we can't get any
other error.
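A sketch of the fix near the end of btrfs_quota_enable() (the
qgroup_rescan_init() argument list is abbreviated):

	ret = qgroup_rescan_init(fs_info, 0, 1);
	if (ret) {
		/*
		 * We have already set BTRFS_FS_QUOTA_ENABLED and
		 * BTRFS_QGROUP_STATUS_FLAG_ON, so a concurrent rescan
		 * ioctl is the only way this can fail.
		 */
		ASSERT(ret == -EINPROGRESS);
		ret = 0;
	}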
Reported-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/linux-btrfs/20220823015931.421355-1-yebin10@huawei.com/
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 81d5d61454c365718655cfc87d8200c84e25d596 upstream.
Currently there are two corner cases not handling compat RO flags
correctly:
- Remount
  We can still mount the fs RO with compat RO flags, then remount it RW.
  We should not allow any write into a fs with unsupported RO flags.
- Block group item search
  In fact, a behavior/on-disk format change to the extent tree should not
  need a full incompat flag.
  And since we can ensure a fs with unsupported RO flags never gets any
  writes (with the above case fixed), we can even skip the block group
  item search at mount time.
This patch enhances the handling of unsupported RO compat flags by:
- Rejecting read-write remount if there are unsupported RO compat flags
- Going directly to dummy block group items for unsupported RO compat flags
In fact, only changes to the chunk/subvolume/root/csum trees should need
incompat flags.
The latter part should allow a future change to the extent tree to use
compat RO flags.
Thus this patch also needs to be backported to all stable trees.
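A sketch of the remount check (the error message text is illustrative):

	/* In btrfs_remount(), when going from RO to RW: */
	if (!(*flags & SB_RDONLY) &&
	    (btrfs_super_compat_ro_flags(fs_info->super_copy) &
	     ~BTRFS_FEATURE_COMPAT_RO_SUPP)) {
		btrfs_err(fs_info,
		"cannot remount read-write due to unsupported optional flags");
		ret = -EINVAL;
		goto restore;
	}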
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b0c582233a8563f3c4228df838cdc67a8807ec78 upstream.
With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs for
read-only mmapped files, such as shared libraries. However, the kernel
makes no attempt to actually align those mappings on 2MB boundaries,
which makes it impossible to use those THPs most of the time. This issue
applies to general file mapping THP as well as existing setups using
CONFIG_READ_ONLY_THP_FOR_FS. This is easily fixed by using
thp_get_unmapped_area for the unmapped_area function in btrfs, which
is what ext2, ext4, fuse, and xfs all use.
Initially btrfs had been left out in commit 8c07fc452ac0 ("btrfs: fix
alignment of VMA for memory mapped files on THP") as btrfs does not support
DAX. However, commit 1854bc6e2420 ("mm/readahead: Align file mappings
for non-DAX") removed the DAX requirement. We should now be able to call
thp_get_unmapped_area() for btrfs.
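Concretely, the change amounts to wiring up the callback in btrfs's
file_operations (a sketch; only the relevant fields are shown):

	const struct file_operations btrfs_file_operations = {
		/* other callbacks unchanged */
		.mmap		   = btrfs_file_mmap,
		.get_unmapped_area = thp_get_unmapped_area,
	};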
The problem can be seen in /proc/PID/smaps where THPeligible is set to 0
on mappings to eligible shared object files as shown below.
Before this patch:
7fc6a7e18000-7fc6a80cc000 r-xp 00000000 00:1e 199856
/usr/lib64/libcrypto.so.1.1.1k
Size: 2768 kB
THPeligible: 0
VmFlags: rd ex mr mw me
With this patch the library is mapped at a 2MB aligned address:
7fbdfe200000-7fbdfe4b4000 r-xp 00000000 00:1e 199856
/usr/lib64/libcrypto.so.1.1.1k
Size: 2768 kB
THPeligible: 1
VmFlags: rd ex mr mw me
This fixes the alignment of VMAs for any mmap of a file that has the
rd and ex permissions and size >= 2MB. The VMA alignment and
THPeligible field for anonymous memory are handled separately and
are thus not affected by this change.
CC: stable@vger.kernel.org # 5.18+
Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2dd7e7bc02829eded71be2342a93dc035f5223f9 upstream.
Before sending REQ_OP_ZONE_FINISH to a zone, we need to ensure that
ongoing IOs have already finished. Otherwise, we will see a "Zone Is Full"
error for those IOs, as the ZONE_FINISH command makes the zone full.
We ensure that with btrfs_wait_block_group_reservations() and
btrfs_wait_ordered_roots() for a data block group. And, for a metadata
block group, the comparison of alloc_offset vs meta_write_pointer mostly
ensures the IOs for the allocated region have already been sent. However,
there can still be a small time frame where the IOs have been sent but not
yet completed.
Introduce wait_eb_writebacks() to ensure such IOs are completed for a
metadata block group. It walks the buffer_radix to find extent buffers in
the block group and calls wait_on_extent_buffer_writeback() on them.
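A simplified sketch of such a helper (the real code must also handle
radix-tree deref retries):

	static void wait_eb_writebacks(struct btrfs_block_group *block_group)
	{
		struct btrfs_fs_info *fs_info = block_group->fs_info;
		const u64 end = block_group->start + block_group->length;
		struct radix_tree_iter iter;
		struct extent_buffer *eb;
		void __rcu **slot;

		rcu_read_lock();
		radix_tree_for_each_slot(slot, &fs_info->buffer_radix, &iter,
				block_group->start >> fs_info->sectorsize_bits) {
			eb = radix_tree_deref_slot(slot);
			if (!eb)
				continue;
			if (eb->start >= end)
				break;
			/* Drop RCU before sleeping on the writeback bit. */
			slot = radix_tree_iter_resume(slot, &iter);
			rcu_read_unlock();
			wait_on_extent_buffer_writeback(eb);
			rcu_read_lock();
		}
		rcu_read_unlock();
	}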
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
CC: stable@vger.kernel.org # 5.19+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a362bb864b8db4861977d00bd2c3222503ccc34b upstream.
Often when running generic/562 from fstests we can hang during unmount,
resulting in a trace like this:
Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00
Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds.
Sep 07 11:55:32 debian9 kernel: Not tainted 6.0.0-rc2-btrfs-next-122 #1
Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 07 11:55:32 debian9 kernel: task:umount state:D stack: 0 pid:49438 ppid: 25683 flags:0x00004000
Sep 07 11:55:32 debian9 kernel: Call Trace:
Sep 07 11:55:32 debian9 kernel: <TASK>
Sep 07 11:55:32 debian9 kernel: __schedule+0x3c8/0xec0
Sep 07 11:55:32 debian9 kernel: ? rcu_read_lock_sched_held+0x12/0x70
Sep 07 11:55:32 debian9 kernel: schedule+0x5d/0xf0
Sep 07 11:55:32 debian9 kernel: schedule_timeout+0xf1/0x130
Sep 07 11:55:32 debian9 kernel: ? lock_release+0x224/0x4a0
Sep 07 11:55:32 debian9 kernel: ? lock_acquired+0x1a0/0x420
Sep 07 11:55:32 debian9 kernel: ? trace_hardirqs_on+0x2c/0xd0
Sep 07 11:55:32 debian9 kernel: __wait_for_common+0xac/0x200
Sep 07 11:55:32 debian9 kernel: ? usleep_range_state+0xb0/0xb0
Sep 07 11:55:32 debian9 kernel: __flush_work+0x26d/0x530
Sep 07 11:55:32 debian9 kernel: ? flush_workqueue_prep_pwqs+0x140/0x140
Sep 07 11:55:32 debian9 kernel: ? trace_clock_local+0xc/0x30
Sep 07 11:55:32 debian9 kernel: __cancel_work_timer+0x11f/0x1b0
Sep 07 11:55:32 debian9 kernel: ? close_ctree+0x12b/0x5b3 [btrfs]
Sep 07 11:55:32 debian9 kernel: ? __trace_bputs+0x10b/0x170
Sep 07 11:55:32 debian9 kernel: close_ctree+0x152/0x5b3 [btrfs]
Sep 07 11:55:32 debian9 kernel: ? evict_inodes+0x166/0x1c0
Sep 07 11:55:32 debian9 kernel: generic_shutdown_super+0x71/0x120
Sep 07 11:55:32 debian9 kernel: kill_anon_super+0x14/0x30
Sep 07 11:55:32 debian9 kernel: btrfs_kill_super+0x12/0x20 [btrfs]
Sep 07 11:55:32 debian9 kernel: deactivate_locked_super+0x2e/0xa0
Sep 07 11:55:32 debian9 kernel: cleanup_mnt+0x100/0x160
Sep 07 11:55:32 debian9 kernel: task_work_run+0x59/0xa0
Sep 07 11:55:32 debian9 kernel: exit_to_user_mode_prepare+0x1a6/0x1b0
Sep 07 11:55:32 debian9 kernel: syscall_exit_to_user_mode+0x16/0x40
Sep 07 11:55:32 debian9 kernel: do_syscall_64+0x48/0x90
Sep 07 11:55:32 debian9 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7
Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7
Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0
Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570
Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000
Sep 07 11:55:32 debian9 kernel: </TASK>
What happens is the following:
1) The cleaner kthread tries to start a transaction to delete an unused
block group, but the metadata reservation can not be satisfied right
away, so a reservation ticket is created and it starts the async
metadata reclaim task (fs_info->async_reclaim_work);
2) Writeback for all the filler inodes with an i_size of 2K starts
(generic/562 creates a lot of 2K files with the goal of filling
metadata space). We try to create an inline extent for them, but we
fail when trying to insert the inline extent with -ENOSPC (at
cow_file_range_inline()) - since this is not critical, we fallback
to non-inline mode (back to cow_file_range()), reserve extents, create
extent maps and create the ordered extents;
3) An unmount starts, enters close_ctree();
4) The async reclaim task is flushing stuff, entering the flush states one
by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current
delayed iputs.
After running the delayed iputs and before calling
btrfs_wait_on_delayed_iputs(), one or more ordered extents complete,
and btrfs_add_delayed_iput() is called for each one through
btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results
in bumping fs_info->nr_delayed_iputs from 0 to some positive value.
So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting
for fs_info->nr_delayed_iputs to become 0;
5) The current transaction is committed by the transaction kthread, we then
start unpinning extents and end up calling btrfs_try_granting_tickets()
through unpin_extent_range(), since we released some space.
This results in satisfying the ticket created by the cleaner kthread at
step 1, waking up the cleaner kthread;
6) At close_ctree() we ask the cleaner kthread to park;
7) The cleaner kthread starts the transaction, deletes the unused block
group, and then calls kthread_should_park(), which returns true, so it
parks. And at this point we have the delayed iputs added by the
completion of the ordered extents still pending;
8) Then later at close_ctree(), when we call:
cancel_work_sync(&fs_info->async_reclaim_work);
We hang forever, since the cleaner was parked and no one else can run
delayed iputs after that, while the reclaim task is waiting for the
remaining delayed iputs to be completed.
Fix this by waiting for all ordered extents to complete and running the
delayed iputs before attempting to stop the async reclaim tasks. Note that
we can not wait for ordered extents with btrfs_wait_ordered_roots() (or
other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE
flag to be set on an ordered extent, but the delayed iput is added after
that, when doing the final btrfs_put_ordered_extent(). So instead wait for
the work queues used for executing ordered extent completion to be empty,
which works because we do the final put on an ordered extent at
btrfs_finish_ordered_io() (while we are in the unmount context).
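A sketch of the resulting ordering in close_ctree() (the workqueue names
are assumed from mainline of that era):

	/*
	 * Ordered extents do their final put in btrfs_finish_ordered_io(),
	 * which queues the delayed iput, so drain those workqueues first
	 * and then run the iputs before stopping the reclaim tasks.
	 */
	btrfs_flush_workqueue(fs_info->endio_write_workers);
	btrfs_flush_workqueue(fs_info->endio_freespace_worker);
	btrfs_run_delayed_iputs(fs_info);

	cancel_work_sync(&fs_info->async_reclaim_work);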
Fixes: d6fd0ae25c6495 ("Btrfs: fix missing delayed iputs on unmount")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8a1f1e3d1eecf9d2359a2709e276743a67e145db upstream.
During early unmount, at close_ctree(), we try to stop the block group
reclaim task with cancel_work_sync(), but that may hang if the block group
reclaim task is currently at btrfs_relocate_block_group() waiting for the
flag BTRFS_FS_UNFINISHED_DROPS to be cleared from fs_info->flags. During
unmount we only clear that flag later, after trying to stop the block
group reclaim task.
Fix that by clearing BTRFS_FS_UNFINISHED_DROPS before trying to stop the
block group reclaim task and after setting BTRFS_FS_CLOSING_START, so that
if the reclaim task is waiting on that bit, it will stop immediately after
being woken, because it sees the filesystem is closing (with a call to
btrfs_fs_closing()), and then returns immediately with -EINTR.
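A sketch of the ordering in close_ctree() (helper names assumed from
mainline):

	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);

	/*
	 * Wake anyone waiting on BTRFS_FS_UNFINISHED_DROPS; they will see
	 * btrfs_fs_closing() return true and bail out with -EINTR.
	 */
	btrfs_wake_unfinished_drop(fs_info);

	cancel_work_sync(&fs_info->reclaim_bgs_work);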
Fixes: 31e70e527806c5 ("btrfs: fix hang during unmount when block group reclaim task is running")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 5da431b71d4b9be3c8cf6786eff9e2d41a5f9f65 ]
[BEHAVIOR CHANGE]
Since commit f6fca3917b4d ("btrfs: store chunk size in space-info
struct"), btrfs can no longer create data chunks larger than 1G:
mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4
mount $dev1 $mnt
btrfs balance start --full $mnt
btrfs balance start --full $mnt
umount $mnt
btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2
Before that offending commit, what we got was a 4G data chunk:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
Now what we get is only a 1G data chunk:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176
length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
This will multiply the number of data chunks by the number of devices,
which not only increases system chunk usage but also greatly increases
mount time.
Without a proper reason, we should not change the max chunk size.
[CAUSE]
Previously, we set the max data chunk size to 10G, while the max data
stripe length was 1G.
Commit f6fca3917b4d ("btrfs: store chunk size in space-info struct")
completely ignored the 10G limit and used the 1G max stripe limit
instead, causing the above shrink in max data chunk size.
[FIX]
Restore the max data chunk size to 10G, and in decide_stripe_size_regular()
limit stripe_size to 1G manually.
This should only affect data chunks, as for metadata chunks we always set
the max stripe size to the same value as the max chunk size (256M or 1G
depending on fs size).
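A sketch of the two limits after the fix (constant names as in mainline):

	/* Data chunks are capped at 10G again... */
	if (flags & BTRFS_BLOCK_GROUP_DATA)
		space_info->chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE; /* 10G */

	/* ...while in decide_stripe_size_regular() each stripe stays <= 1G: */
	ctl->stripe_size = min_t(u64, ctl->stripe_size, SZ_1G);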
Now the same script gives the same old result:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6ca64ac2763149fb66c0b4bf12f5e0977a88e51d ]
Since commit 6a921de58992 ("btrfs: zoned: introduce
space_info->active_total_bytes"), we're only counting the bytes of a
block group on an active zone as usable for metadata writes. But on an
SMR drive, we don't have active zones and we short-circuit some of the
logic.
This leads to an error on mount, because we cannot reserve space for
metadata writes.
Fix this by also setting the BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE bit in the
block-group's runtime flag if the zone is a conventional zone.
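A sketch of the change in the zone info loading code (the flag and helper
names are assumed from mainline):

	/* Conventional zones have no activation state; treat them as active. */
	if (!btrfs_dev_is_sequential(device, physical))
		set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &cache->runtime_flags);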
Fixes: 6a921de58992 ("btrfs: zoned: introduce space_info->active_total_bytes")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit d5b81ced74afded85619ffbbe9c32ba9d82c9b1e upstream.
The commit 2ce543f47843 ("btrfs: zoned: wait until zone is finished when
allocation didn't progress") implemented a zone finish waiting mechanism
in the write path of zoned mode. However, using
wait_var_event()/wake_up_all() on fs_info->zone_finish_wait is wrong and
wait_var_event() just hangs because no one ever wakes it up once it goes
to sleep.
Instead, we can simply use wait_on_bit_io() and clear_and_wake_up_bit()
on fs_info->flags with a proper barrier installed.
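A sketch of the two sides (the flag name BTRFS_FS_NEED_ZONE_FINISH is an
assumption):

	/* Waiter, in the zoned write path: */
	wait_on_bit_io(&fs_info->flags, BTRFS_FS_NEED_ZONE_FINISH,
		       TASK_UNINTERRUPTIBLE);

	/* Waker, once a zone has been finished (implies the needed barrier): */
	clear_and_wake_up_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);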
Fixes: 2ce543f47843 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cac5c44c48c9fb9cc31bea15ebd9ef0c6462314f upstream.
The commit 7d7672bc5d10 ("btrfs: convert count_max_extents() to use
fs_info->max_extent_size") introduced a division by
fs_info->max_extent_size. This max_extent_size is initialized with the
max zone append limit of the device btrfs runs on. However, in zone
emulation mode, the device is not zoned, so its zone append limit is
zero. This resulted in a zero value of fs_info->max_extent_size and
caused a division-by-zero error.
Fix the error by setting a non-zero pseudo value for the max zone append
limit in zone emulation mode. Set the pseudo value based on max_segments
as suggested in the commit c2ae7b772ef4 ("btrfs: zoned: revive
max_zone_append_bytes").
Fixes: 7d7672bc5d10 ("btrfs: convert count_max_extents() to use fs_info->max_extent_size")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 899b7f69f244e539ea5df1b4d756046337de44a5 ]
We're seeing a weird problem in production where we have overlapping
extent items in the extent tree. It's unclear where these are coming
from, and in debugging we realized there's no check in the tree checker
for this sort of problem. Add a check to the tree-checker to make sure
that the extents do not overlap each other.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b40130b23ca4a08c5785d5a3559805916bddba3c ]
We have been hitting the following lockdep splat with btrfs/187 recently
WARNING: possible circular locking dependency detected
5.19.0-rc8+ #775 Not tainted
------------------------------------------------------
btrfs/752500 is trying to acquire lock:
ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
but task is already holding lock:
ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_init_new_buffer+0x7d/0x2c0
btrfs_alloc_tree_block+0x120/0x3b0
__btrfs_cow_block+0x136/0x600
btrfs_cow_block+0x10b/0x230
btrfs_search_slot+0x53b/0xb70
btrfs_lookup_inode+0x2a/0xa0
__btrfs_update_delayed_inode+0x5f/0x280
btrfs_async_run_delayed_root+0x24c/0x290
btrfs_work_helper+0xf2/0x3e0
process_one_work+0x271/0x590
worker_thread+0x52/0x3b0
kthread+0xf0/0x120
ret_from_fork+0x1f/0x30
-> #1 (btrfs-tree-01){++++}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_search_slot+0x3c3/0xb70
do_relocation+0x10c/0x6b0
relocate_tree_blocks+0x317/0x6d0
relocate_block_group+0x1f1/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of:
btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(btrfs-tree-01/1);
                               lock(btrfs-tree-01);
                               lock(btrfs-tree-01/1);
  lock(btrfs-treloc-02#2);
*** DEADLOCK ***
7 locks held by btrfs/752500:
#0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
#1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
#2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
#3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
#4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
stack backtrace:
CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x56/0x73
check_noncircular+0xd6/0x100
? lock_is_held_type+0xe2/0x140
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
? __btrfs_tree_lock+0x24/0x110
down_write_nested+0x41/0x80
? __btrfs_tree_lock+0x24/0x110
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
? lock_release+0x137/0x2d0
? _raw_spin_unlock+0x29/0x50
? release_extent_buffer+0x128/0x180
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
? lock_is_held_type+0xe2/0x140
? lock_is_held_type+0xe2/0x140
? __x64_sys_ioctl+0x88/0xc0
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
This isn't necessarily new; it's just tricky to hit in practice. There
are two competing things going on here. With relocation we create a
snapshot of every fs tree with a reloc tree. Any extent buffers that
get initialized here are initialized with the reloc root lockdep key.
However since it is a snapshot, any blocks that are currently in cache
that originally belonged to the fs tree will have the normal tree
lockdep key set. This creates the lock dependency of
reloc tree -> normal tree
for the extent buffer locking during the first phase of the relocation
as we walk down the reloc root to relocate blocks.
However this is problematic because the final phase of the relocation is
merging the reloc root into the original fs root. This involves
searching down to any keys that exist in the original fs root and then
swapping the relocated block and the original fs root block. We have to
search down to the fs root first, and then go search the reloc root for
the block we need to replace. This creates the dependency of
normal tree -> reloc tree
which is why lockdep complains.
Additionally, even if we were to fix this particular mismatch with a
different nesting for the merge case, we're still slotting in a block
that has an owner of the reloc root objectid into a normal tree, so that
block will have its lockdep key set to the tree reloc root, and create a
lockdep splat later on when we wander into that block from the fs root.
Unfortunately the only solution here is to make sure we do not set the
lockdep key to the reloc tree lockdep key normally, and then reset any
blocks we wander into from the reloc root when we're doing the merge.
This solves the problem of having reloc tree keys intermixed with normal
tree keys, and then allows us to make sure that in the merge case we
maintain the lock order of
normal tree -> reloc tree
We handle this by setting a bit on the reloc root when we do the search
for the block we want to relocate, and any block we search into or COW
at that point gets set to the reloc tree key. This works correctly
because we only ever COW down to the parent node, so we aren't resetting
the key for the block we're linking into the fs root.
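A sketch of the reset helper this describes (names approximate the patch):

	void btrfs_maybe_reset_lockdep_class(struct btrfs_root *root,
					     struct extent_buffer *eb)
	{
		if (test_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &root->state))
			btrfs_set_buffer_lockdep_class(root->root_key.objectid,
						       eb,
						       btrfs_header_level(eb));
	}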
With this patch we no longer have the lockdep splat in btrfs/187.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0a27a0474d146eb79e09ec88bf0d4229f4cfc1b8 ]
These definitions exist in disk-io.c, which is not related to the
locking. Move this over to locking.h/c where it makes more sense.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit e6e3dec6c3c288d556b991a85d5d8e3ee71e9046 upstream.
When punching a hole into a file range that is adjacent with a hole and we
are not using the no-holes feature, we expand the range of the adjacent
file extent item that represents a hole, to save metadata space.
However we don't update the generation of hole file extent item, which
means a full fsync will not log that file extent item if the fsync happens
in a later transaction (since commit 7f30c07288bb9e ("btrfs: stop copying
old file extents when doing a full fsync")).
For example, if we do this:
$ mkfs.btrfs -f -O ^no-holes /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar
$ sync
We end up with 2 file extent items in our file:
1) One that represents the hole for the file range [0, 2M), with a
generation of 7;
2) Another one that represents an extent covering the range [2M, 4M).
After that if we do the following:
$ xfs_io -c "fpunch 2M 2M" /mnt/foobar
We end up with a single file extent item in the file, which represents a
hole for the range [0, 4M) and with a generation of 7 - because we end up
dropping the data extent for the range [2M, 4M) and then updating the file
extent item that represented the hole at [0, 2M), increasing its length
from 2M to 4M.
Then doing a full fsync and power failing:
$ xfs_io -c "fsync" /mnt/foobar
<power failure>
will result in the full fsync not logging the file extent item that
represents the hole for the range [0, 4M), because its generation is 7,
which is lower than the generation of the current transaction (8).
As a consequence, after mounting again the filesystem (after log replay),
the region [2M, 4M) does not have a hole, it still points to the
previous data extent.
So fix this by always updating the generation of existing file extent
items representing holes when we merge/expand them. This solves the
problem and it's the same approach as when we merge prealloc extents that
got written (at btrfs_mark_extent_written()). Setting the generation to
the current transaction's generation is also what we do when merging
the new hole extent map with the previous one or the next one.
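A sketch of the fix in the hole merging/expanding paths of fs/btrfs/file.c
(fi is the file extent item being expanded; the surrounding update is
abbreviated and the exact call site is an assumption):

	btrfs_set_file_extent_num_bytes(leaf, fi, end - key.offset);
	/* New: keep the generation current so a full fsync logs the hole. */
	btrfs_set_file_extent_generation(leaf, fi, trans->transid);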
A test case for fstests, covering both cases of hole file extent item
merging (to the left and to the right), will be sent soon.
Fixes: 7f30c07288bb9e ("btrfs: stop copying old file extents when doing a full fsync")
CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9ea0106a7a3d8116860712e3f17cd52ce99f6707 upstream.
In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if
the path is invalid. In this case, btrfs_get_dev_args_from_path()
returns directly without freeing args->uuid and args->fsid allocated
earlier, which causes a memory leak.
To fix these possible leaks, call btrfs_put_dev_args_from_path() to clean
up the memory when btrfs_get_bdev_and_sb() fails.
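A sketch of the fixed error path (the btrfs_get_bdev_and_sb() argument
list is abbreviated):

	ret = btrfs_get_bdev_and_sb(path, FMODE_READ, fs_info->bdev_holder, 0,
				    &bdev, &disk_super);
	if (ret) {
		btrfs_put_dev_args_from_path(args); /* frees args->uuid/fsid */
		return ret;
	}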
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Fixes: faa775c41d655 ("btrfs: add a btrfs_get_dev_args_from_path helper")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Zixuan Fu <r33s3n6@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b51111271b0352aa596c5ae8faf06939e91b3b68 upstream.
For a filesystem which has the btrfs read-only property set to true, all
write operations, including xattr, should be denied. However, security
xattrs can still be changed even if the btrfs ro property is true.
This happens because xattr_permission() does not have any restrictions
on security.*, system.* and, in some cases, trusted.* from the VFS, and
the decision is left to the underlying filesystem. See comments in
xattr_permission() for more details.
This patch checks if the root is read-only before performing the set
xattr operation.
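A sketch of the check in btrfs's xattr set handler:

	if (btrfs_root_readonly(root))
		return -EROFS;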
Testcase:
DEV=/dev/vdb
MNT=/mnt
mkfs.btrfs -f $DEV
mount $DEV $MNT
echo "file one" > $MNT/f1
setfattr -n "security.one" -v 2 $MNT/f1
btrfs property set /mnt ro true
setfattr -n "security.one" -v 1 $MNT/f1
umount $MNT
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ced8ecf026fd8084cf175530ff85c76d6085d715 upstream.
When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:
1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
and also marked as free in the free space tree, ranges that were
marked as allocated twice in the extent tree, or ranges that were
marked as free twice in the free space tree. If the latter made it
onto disk, the next reboot would hit the BUG_ON() in
add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
in-memory space cache (dumped with drgn) disagreed with the free
space tree.
All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.
Specifically, the race is as follows:
1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
add_to_free_space_tree() adds the deleted extent back to the free
space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
btrfs_cache_block_group() queues up the block group to get cached.
block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
switch_commit_roots(). It sets block_group->last_byte_to_unpin to
block_group->progress, which is block_group->start because the block
group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
were already switched, load_free_space_tree() sees the deleted extent
as free and adds it to the space cache. It finishes caching and sets
block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
switch_commit_roots(). This time, the block group has already been
cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
btrfs_finish_extent_commit(), which calls unpin_extent_range() for
the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
transaction B!), so it adds the deleted extent to the space cache
again!
This explains all of our symptoms above:
* If the sequence of events is exactly as described above, when the free
space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
and 11, then step 11 will silently re-add that space to the space
cache as free even though it is actually allocated. Then, if that
space is allocated *again*, the free space tree will be corrupted
(namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
to get worse as extents are deleted and reallocated.
The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.
The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.
A few extra changes fall out of this change. Namely, this fix does the
following, with step 2 being the crucial fix:
1. Factor btrfs_caching_ctl_wait_done() out of
btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
btrfs_wait_space_cache_v1_finished() to
btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
`bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
used outside of block-group.c anymore.
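A fragment sketching the crucial change (step 2 above) in
btrfs_cache_block_group():

	/* Wait for caching to finish regardless of the space cache option. */
	if (wait)
		ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);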
Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f2c3bec215694fb8bc0ef5010f2a758d1906fc2d upstream.
If the replace target device reappears after the suspended replace is
cancelled, it blocks the mount operation as it can't find the matching
replace-item in the metadata. As shown below:
BTRFS error (device sda5): replace devid present without an active replace item
To overcome this situation, the user can run the command
btrfs device scan --forget <replace target device>
and try the mount command again. Also, to avoid repeating the issue, the
superblock on the devid=0 device must be wiped:
wipefs -a device-path-to-devid=0
This patch adds an info message for when this situation occurs.
Reported-by: Samuel Greiner <samuel@balkonien.org>
Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@balkonien.org/T/
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 59a3991984dbc1fc47e5651a265c5200bd85464e upstream.
If the filesystem is mounted with the replace operation in a suspended
state and we try to cancel the suspended replace operation, we hit the
assert. The assert came from commit fe97e2e173af ("btrfs: dev-replace:
replace's scrub must not be running in suspended state"), which was
actually not required. So just remove it.
$ mount /dev/sda5 /btrfs
BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing
BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded'
$ mount -o degraded /dev/sda5 /btrfs <-- success.
$ btrfs replace cancel /btrfs
kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131
kernel: ------------[ cut here ]------------
kernel: kernel BUG at fs/btrfs/ctree.h:3750!
After the patch:
$ btrfs replace cancel /btrfs
BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled
Fixes: fe97e2e173af ("btrfs: dev-replace: replace's scrub must not be running in suspended state")
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 47bf225a8d2cccb15f7e8d4a1ed9b757dd86afd7 upstream.
At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end
up returning from the function with a value of 0 (success). This happens
because the function returns the value stored in the variable 'err',
which is 0, while the error value we got from btrfs_search_slot() is
stored in the 'ret' variable.
So fix it by setting 'err' to the error value.
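A sketch of the fixed call site in btrfs_del_root_ref():

	ret = btrfs_search_slot(trans, tree_root, &key, path, -1, 1);
	if (ret < 0) {
		err = ret;	/* previously lost, so the function returned 0 */
		goto out;
	}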
Fixes: 8289ed9f93bef2 ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 769030e11847c5412270c0726ff21d3a1f0a3131 upstream.
During log replay, at add_link(), we may drop (unlink) a reference from
some inode in the subvolume tree if that reference conflicts with a new
reference found in the log for the inode we are currently processing.
After the unlink, if the link count has decreased from 1 to 0, then we
increment the link count to prevent the inode from being deleted if it's
evicted by an iput() call, because we may have references to add to that
inode later on (and we will fixup its link count later during log replay).
However incrementing the link count from 0 to 1 triggers a warning:
$ cat fs/inode.c
(...)
void inc_nlink(struct inode *inode)
{
	if (unlikely(inode->i_nlink == 0)) {
		WARN_ON(!(inode->i_state & I_LINKABLE));
		atomic_long_dec(&inode->i_sb->s_remove_count);
	}
(...)
The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
never set during log replay.
Most of the time, the warning isn't triggered even if we dropped the last
reference of the conflicting inode, and this is because:
1) The conflicting inode was previously marked for fixup, through a call
to link_to_fixup_dir(), which increments the inode's link count;
2) And the last iput() on the inode has not triggered eviction of the
inode, nor was eviction triggered after the iput(). So at add_link(),
even if we unlink the last reference of the inode, its link count ends
up being 1 and not 0.
So this means that if eviction is triggered after link_to_fixup_dir() is
called, at add_link() we will read the inode back from the subvolume tree
and have it with a correct link count, matching the number of references
it has on the subvolume tree. So if when we are at add_link() the inode
has exactly one reference only, its link count is 1, and after the unlink
its link count becomes 0.
So fix this by using set_nlink() instead of inc_nlink(), as the former
accepts a transition from 0 to 1 and it's what we use in other similar
contexts (like at link_to_fixup_dir()).
Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
bump the link count from 0 to 1.
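A sketch of the substitution:

	/* Was: inc_nlink(inode); which WARNs on a 0 -> 1 transition. */
	set_nlink(inode, inode->i_nlink + 1);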
The warning is actually harmless, but it may scare users. Josef also ran
into it recently.
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7a6b75b79902e47f46328b57733f2604774fa2d9 upstream.
During log replay, when processing inode references, if we get an error
when looking up an extended reference at __add_inode_ref(), we ignore
it and proceed, returning success (0) if no other error happens after the
lookup.
exists and it encodes some name not in the log, we need to unlink it,
otherwise the filesystem state will not match the state it had after the
last fsync.
So just make __add_inode_ref() return the error it gets from the extended
reference lookup.
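A sketch of the fix inside __add_inode_ref() (argument list abbreviated):

	extref = btrfs_lookup_inode_extref(NULL, root, path, name, namelen,
					   inode_objectid, parent_objectid,
					   0, 0);
	if (IS_ERR(extref))
		return PTR_ERR(extref);	/* previously the error was ignored */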
Fixes: f186373fef005c ("btrfs: extended inode refs")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 74944c873602a3ed8d16ff7af3f64af80c0f9dac upstream.
With the automatic block group reclaim code we will preemptively try to
mark the block group RO before we start the relocation. We do this to
make sure we should actually try to relocate the block group.
However, if we hit an error during the actual relocation, we won't clean
up our RO counter and the block group will remain RO. This was observed
internally with filesystems reporting less space available from df when
we had failed background relocations.
Fix this by doing the dec_ro in the error case.
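A sketch of the error path in the reclaim worker (bg is the block group
being relocated):

	ret = btrfs_relocate_chunk(fs_info, bg->start);
	if (ret)
		btrfs_dec_block_group_ro(bg); /* undo the preemptive RO marking */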
Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 85f02d6c856b9f3a0acf5219de6e32f58b9778eb upstream.
In btrfs_relocate_block_group(), the rc is allocated. Then
btrfs_relocate_block_group() calls
relocate_block_group()
  prepare_to_relocate()
    set_reloc_control()
which assigns rc to the variable fs_info->reloc_ctl. Before
prepare_to_relocate() returns, it calls
btrfs_commit_transaction()
  btrfs_start_dirty_block_groups()
    btrfs_alloc_path()
      kmem_cache_zalloc()
which may fail, for example (or other errors could happen). When the
failure occurs, btrfs_relocate_block_group() detects the error, frees
rc, and doesn't set fs_info->reloc_ctl to NULL. After that, in
btrfs_init_reloc_root(), rc is retrieved from fs_info->reloc_ctl and
then used, which may cause a use-after-free bug.
This possible bug can be triggered by calling btrfs_ioctl_balance()
before calling btrfs_ioctl_defrag().
To fix this possible bug, in prepare_to_relocate(), check if
btrfs_commit_transaction() fails. If the failure occurs, call
unset_reloc_control() to set fs_info->reloc_ctl to NULL.
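A sketch of the fix at the end of prepare_to_relocate():

	ret = btrfs_commit_transaction(trans);
	if (ret)
		unset_reloc_control(rc); /* don't leave fs_info->reloc_ctl dangling */
	return ret;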
The error log in our fault-injection testing is shown as follows:
[ 58.751070] BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
...
[ 58.753577] Call Trace:
...
[ 58.755800] kasan_report+0x45/0x60
[ 58.756066] btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
[ 58.757304] record_root_in_trans+0x792/0xa10 [btrfs]
[ 58.757748] btrfs_record_root_in_trans+0x463/0x4f0 [btrfs]
[ 58.758231] start_transaction+0x896/0x2950 [btrfs]
[ 58.758661] btrfs_defrag_root+0x250/0xc00 [btrfs]
[ 58.759083] btrfs_ioctl_defrag+0x467/0xa00 [btrfs]
[ 58.759513] btrfs_ioctl+0x3c95/0x114e0 [btrfs]
...
[ 58.768510] Allocated by task 23683:
[ 58.768777] ____kasan_kmalloc+0xb5/0xf0
[ 58.769069] __kmalloc+0x227/0x3d0
[ 58.769325] alloc_reloc_control+0x10a/0x3d0 [btrfs]
[ 58.769755] btrfs_relocate_block_group+0x7aa/0x1e20 [btrfs]
[ 58.770228] btrfs_relocate_chunk+0xf1/0x760 [btrfs]
[ 58.770655] __btrfs_balance+0x1326/0x1f10 [btrfs]
[ 58.771071] btrfs_balance+0x3150/0x3d30 [btrfs]
[ 58.771472] btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
[ 58.771902] btrfs_ioctl+0x4caa/0x114e0 [btrfs]
...
[ 58.773337] Freed by task 23683:
...
[ 58.774815] kfree+0xda/0x2b0
[ 58.775038] free_reloc_control+0x1d6/0x220 [btrfs]
[ 58.775465] btrfs_relocate_block_group+0x115c/0x1e20 [btrfs]
[ 58.775944] btrfs_relocate_chunk+0xf1/0x760 [btrfs]
[ 58.776369] __btrfs_balance+0x1326/0x1f10 [btrfs]
[ 58.776784] btrfs_balance+0x3150/0x3d30 [btrfs]
[ 58.777185] btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
[ 58.777621] btrfs_ioctl+0x4caa/0x114e0 [btrfs]
...
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Zixuan Fu <r33s3n6@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f6065f8edeb25f4a9dfe0b446030ad995a84a088 upstream.
[BUG]
There is a small workload which will always fail with recent kernel:
(A simplified version from btrfs/125 test case)
mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
mount $dev1 $mnt
xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
sync
umount $mnt
btrfs dev scan -u $dev3
mount -o degraded $dev1 $mnt
xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
umount $mnt
btrfs dev scan
mount $dev1 $mnt
btrfs balance start --full-balance $mnt
umount $mnt
The failure is always the same: failing to read some tree blocks:
BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
...
[CAUSE]
With the recently added debug output, we can see all RAID56 operations
related to full stripe 38928384:
56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
Before we enter raid56_parity_recover(), we have triggered some metadata
writes for the full stripe 38928384, which led us to read all the
sectors of that full stripe from disk.
Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to
avoid unnecessary read.
This means, for that full stripe, after any partial write, we will have
stale data, along with P/Q calculated using that stale data.
Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
which has data stripes" we haven't submitted all the corrupted P/Q to disk.
When we really need to recover a certain range, i.e. in
raid56_parity_recover(), we will use the cached rbio along with its
cached sectors (the full stripe is all cached).
This explains why no raid56_scrub_read_recover() event is
triggered.
Since we have the cached P/Q which is calculated using the stale data,
the recovered one will just be stale.
In our particular test case, it will always return the same incorrect
metadata, thus causing the same error message "parent transid verify
failed on 39010304 wanted 9 found 7" again and again.
[BTRFS DESTRUCTIVE RMW PROBLEM]
Test case btrfs/125 (and above workload) always has its trouble with
the destructive read-modify-write (RMW) cycle:
         0        32K      64K
Data1:   | Good   | Good   |
Data2:   | Bad    | Bad    |
Parity:  | Good   | Good   |
In the above case, if we trigger any write into Data1, we will use the
bad data in Data2 to re-generate parity, killing the only chance to
recover Data2; Data2 is then lost forever.
This destructive RMW cycle is not specific to btrfs RAID56, but there
are some btrfs-specific behaviors that make the case even worse:
- Btrfs will cache sectors for unrelated vertical stripes.
In the above example, even if we're only writing into the 0~32K range,
btrfs will still read the data range (32K ~ 64K) of Data1, and
(64K ~ 128K) of Data2.
This behavior exists to cache sectors for a later update.
Incidentally, commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio()
subpage compatible") has a bug which makes RAID56 never trust the
cached sectors, slightly improving the situation for recovery.
Unfortunately, the follow-up fix "btrfs: update stripe_sectors::uptodate
in steal_rbio" reverts that behavior back to the old one.
- Btrfs raid56 partial writes will update all P/Q sectors and cache them.
This means that even if the data at (64K ~ 96K) of Data2 is free space
and only (96K ~ 128K) of Data2 is really stale data, a write into that
(96K ~ 128K) range will still update all the parity sectors of the
full stripe.
This unnecessary behavior completely kills the chance of recovery.
Thankfully, an unrelated optimization, "btrfs: only write the sectors
in the vertical stripe which has data stripes", prevents
submitting the write bio for untouched vertical sectors.
That optimization keeps the on-disk P/Q untouched, preserving a chance
of later recovery.
[FIX]
Although we have no good way to completely fix the destructive RMW
(unless we go full scrub for each partial write), we can still limit the
damage.
With patch "btrfs: only write the sectors in the vertical stripe which
has data stripes" now we won't really submit the P/Q of unrelated
vertical stripes, so the on-disk P/Q should still be fine.
All we really need to do now is drop all the cached sectors when doing
recovery.
This way, we have a chance to read the original P/Q from disk and
recover the stale data, while still keeping the cache to speed up the
regular write path.
In fact, just dropping all the cache for the recovery path is good
enough to allow the test case btrfs/125, along with the small script
above, to pass reliably.
The lack of metadata writes after the degraded mount, together with
the forced metadata COW, is what saves us this time.
So this patch fixes the behavior by not trusting any cache in
__raid56_parity_recover(), solving the problem while still keeping the
cache useful.
But please note that passing this test DOES NOT mean we have solved the
destructive RMW problem; we just do damage control a little better.
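A conceptual sketch of the recovery-path change (illustrative only; the
field names follow the stripe_sectors::uptodate naming mentioned above,
nr_sectors is assumed):
/*
 * Sketch: before recovering, mark every cached sector as not uptodate
 * so the recovery path re-reads the on-disk data and P/Q instead of
 * trusting possibly stale cached sectors.
 */
static void rbio_drop_cache_for_recovery(struct btrfs_raid_bio *rbio)
{
        int i;

        for (i = 0; i < rbio->nr_sectors; i++)
                rbio->stripe_sectors[i].uptodate = 0;
}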
Related patches:
- btrfs: only write the sectors in the vertical stripe
- d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage compatible")
- btrfs: update stripe_sectors::uptodate in steal_rbio
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bd8f7e627703ca5707833d623efcd43f104c7b3f upstream.
If we have only an 8K partial write at the beginning of a full RAID56
stripe, we will write the following contents:
                    0  8K            32K            64K
Disk 1  (data):     |XX|             |              |
Disk 2  (data):     |                |              |
Disk 3  (parity):   |XXXXXXXXXXXXXXX|XXXXXXXXXXXXXXX|
|X| means the sector will be written back to disk.
Note that although we won't write any sectors to disk 2, we will still
write the full 64KiB of parity to disk.
This behavior is fine for now, but not for the future (especially for
RAID56J, as we waste quite some space to journal the unused parity
stripes).
So here we also utilize btrfs_raid_bio::dbitmap: any time we
queue a higher level bio into an rbio, we update rbio::dbitmap to
indicate which vertical stripes we need to write back.
Then at finish_rmw(), we check dbitmap to see whether we need to write
any sector of each vertical stripe.
So after the patch, the above example only leads to the following
writeback pattern:
                    0  8K            32K            64K
Disk 1  (data):     |XX|             |              |
Disk 2  (data):     |                |              |
Disk 3  (parity):   |XX|             |              |
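A conceptual sketch of the dbitmap bookkeeping (illustrative; dbitmap
and finish_rmw() are from the description above, the helper name and
the fields around them are assumed):
/* Sketch: record which vertical stripes a queued bio touches. */
static void rbio_mark_vertical_stripes(struct btrfs_raid_bio *rbio,
                                       u32 offset_in_stripe, u32 len)
{
        u32 first = offset_in_stripe >> rbio->sectorsize_bits;
        u32 nsectors = len >> rbio->sectorsize_bits;

        bitmap_set(rbio->dbitmap, first, nsectors);
}

/* Sketch of the finish_rmw() side: skip untouched vertical stripes. */
for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) {
        if (!test_bit(sectornr, rbio->dbitmap))
                continue;       /* keep on-disk P/Q for this stripe */
        /* ... queue the data and P/Q sectors for writeback ... */
}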
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 723df2bcc9e166ac7fb82b3932a53e09415dfcde ]
When logging a new name, in case of a rename, we pin the log before
changing it. We then either delete a directory entry from the log or
insert a key range item to mark the old name for deletion on log replay.
However when doing one of those log changes we may have another task that
started writing out the log (at btrfs_sync_log()) and it started before
we pinned the log root. So we may end up changing a log tree while its
writeback is being started by another task syncing the log. This can lead
to inconsistencies in a log tree and other unexpected results during log
replay, because we can get some committed node pointing to a node/leaf
that ends up not getting written to disk before the next log commit.
The problem, conceptually, started to happen in commit 88d2beec7e53fc
("btrfs: avoid logging all directory changes during renames"), because
there we started to update the log without joining its current transaction
first.
However the problem only became visible with commit 259c4b96d78dda
("btrfs: stop doing unnecessary log updates during a rename"), and that is
because we used to pin the log at btrfs_rename() and then before entering
btrfs_log_new_name(), when unlinking the old dentry, we ended up at
btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both
of them join the current log transaction, effectively waiting for any log
transaction writeout (due to acquiring the root's log_mutex). This made it
safe even after leaving the current log transaction, because we remained
with the log pinned when we called btrfs_log_new_name().
Then in commit 259c4b96d78dda ("btrfs: stop doing unnecessary log updates
during a rename"), we removed the log pinning from btrfs_rename() and
stopped calling btrfs_del_inode_ref_in_log() and
btrfs_del_dir_entries_in_log() during the rename, and started to do all
the needed work at btrfs_log_new_name(), but without joining the current
log transaction, only pinning the log, which is racy because another task
may have started writeout of the log tree right before we pinned the log.
Both commits landed in kernel 5.18, so it doesn't make any practical
difference which should be blamed, but I'm blaming the second commit only
because with the first one, by chance, the problem did not happen due to
the fact we joined the log transaction after pinning the log and unpinned
it only after calling btrfs_log_new_name().
So make btrfs_log_new_name() join the current log transaction instead of
pinning it, so that we never do log updates while its writeout is starting.
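Conceptually, the change in btrfs_log_new_name() looks like this sketch
(simplified; the return convention of join_running_log_trans() is
assumed here for illustration):
/* Before: only pinned the log, racing with btrfs_sync_log(). */
btrfs_pin_log_trans(root);

/* After: join the running log transaction, which serializes with any
 * writeout through the root's log_mutex.  If no log transaction is
 * running, there is nothing in the log to update. */
if (join_running_log_trans(root))
        return;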
Fixes: 259c4b96d78dda ("btrfs: stop doing unnecessary log updates during a rename")
CC: stable@vger.kernel.org # 5.18+
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2ce543f478433a0eec0f72090d7e814f1d53d456 ]
When the allocated position doesn't progress, we cannot submit IOs to
finish a block group, but there should be ongoing IOs that will finish
one. So, in that case, we wait for a zone to be finished and retry the
allocation after that.
Introduce a new flag, BTRFS_FS_NEED_ZONE_FINISH, for fs_info->flags to
indicate that we need a zone finish before we can proceed. The flag is
set when the allocator detects it cannot activate a new block group,
and it is cleared once a zone is finished.
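As a rough sketch of the flag protocol (set_bit/clear_bit are the
generic kernel bitops; the wait/wake mechanism and the wait-queue name
zone_finish_wait are assumed for illustration):
/* Allocator: could not activate a new block group. */
set_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);
/* ... wait until a zone finish clears the flag, then retry ... */
wait_event(fs_info->zone_finish_wait,
           !test_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags));

/* Zone finish completion: allow waiters to retry allocation. */
clear_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);
wake_up_all(&fs_info->zone_finish_wait);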
CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 898793d992c23dac6126a6a94ad893eae1a2c9df ]
cow_file_range() works in an all-or-nothing way: if it fails to allocate
an extent for a part of the given region, it gives up the whole region,
including the successfully allocated parts. Accordingly,
run_delalloc_zoned() writes out data for the region only when
cow_file_range() successfully allocates all of it.
This all-or-nothing allocation and write-out become problematic when the
available space in all the block groups gets tight under the active zone
restriction. btrfs_reserve_extent() tries hard to utilize the space left
in the active block groups, but finally gives up and fails with
-ENOSPC. However, if we send the IOs for the successfully allocated
region, we can finish a zone and continue the rest of the allocation in
a newly allocated block group.
This patch implements partial write-out for run_delalloc_zoned(). With
this patch applied, cow_file_range() returns -EAGAIN to tell the caller
to do something to make further allocation progress, and reports the
successfully allocated region via done_offset. Furthermore, the zoned
extent allocator returns -EAGAIN to tell cow_file_range() to go back to
the caller side.
Strictly speaking, we still need to wait for an IO to complete to
continue the allocation; the next patch implements that part.
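A heavily simplified sketch of the resulting flow in
run_delalloc_zoned() (the real signatures differ;
submit_written_range() is a hypothetical placeholder):
while (start <= end) {
        ret = cow_file_range(inode, locked_page, start, end,
                             &page_started, &nr_written, &done_offset);
        if (ret == -EAGAIN) {
                /* Write out [start, done_offset]; the completed IO can
                 * finish a zone and make room for a new block group. */
                submit_written_range(inode, start, done_offset);
                start = done_offset + 1;
                continue;
        }
        break;  /* success, or a real error */
}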
CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b6a98021e4019c562a23ad151a7e40adfa9f91e5 ]
There are two places where allocating a chunk is not enough: both try
to ensure space by allocating a chunk, but to meet the condition on
active_total_bytes we also need to activate a block group there.
CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b0931513913633044ed6e3800334c28433c007b0 ]
For metadata space on a zoned filesystem, reaching ALLOC_CHUNK{,_FORCE}
means we don't have enough space left in active_total_bytes. In this
case, before allocating a new chunk, we can try to activate an existing
block group.
Also, allocating a chunk is not enough to grant a ticket for metadata
space on a zoned filesystem; we also need to activate the block group
to increase active_total_bytes.
btrfs_zoned_activate_one_bg() implements the activation feature. It
activates a block group by (maybe) finishing another block group, and
it gives up if it cannot finish any block group.
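A conceptual sketch of the caller side (illustrative; the return
convention of btrfs_zoned_activate_one_bg() is assumed here as
1 = activated, 0 = nothing to activate, <0 = error):
ret = btrfs_zoned_activate_one_bg(fs_info, space_info, true);
if (ret < 0)
        return ret;
if (ret > 0)
        return 0;       /* a block group was activated, retry the ticket */
/* ret == 0: nothing could be activated, fall back to chunk allocation */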
CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6a921de589926a350634e6e279f43fa5b9dbf5ba ]
The active_total_bytes, like total_bytes, accounts for the total bytes
of the active block groups in the space_info.
With the introduction of active_total_bytes, we can check whether the
reserved bytes can be written to the block groups without activating a
new one. The check is necessary for metadata allocation on a zoned
filesystem: we cannot finish a block group, which may require waiting
for the current transaction, from the metadata allocation context.
Instead, we need to ensure the ongoing allocation (reserved bytes) fits
in the active block groups.
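As a rough sketch, the check amounts to the following (illustrative
helper; the fields besides active_total_bytes are the usual space_info
counters):
/* Sketch: does this reservation fit in already-active block groups? */
static bool reservation_fits_active(struct btrfs_space_info *sinfo,
                                    u64 bytes)
{
        u64 used = sinfo->bytes_used + sinfo->bytes_reserved;

        return used + bytes <= sinfo->active_total_bytes;
}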
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f6fca3917b4d99d8c13901738afec35f570a3c2f ]
The chunk size is stored in the btrfs_space_info structure. It is
initialized at mount and then used whenever a chunk is allocated.
A new helper is added to update the current chunk size; it makes it
possible to expose chunk_size as a sysfs setting.
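The updater is essentially a one-liner; a sketch (names follow the
description above, and WRITE_ONCE pairs with a READ_ONCE on the reader
side):
/* Sketch: update the per-space-info chunk size, e.g. from sysfs. */
void btrfs_update_space_info_chunk_size(struct btrfs_space_info *space_info,
                                        u64 chunk_size)
{
        WRITE_ONCE(space_info->chunk_size, chunk_size);
}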
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename and merge helpers, switch atomic type to u64, style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 79417d040f4f77b19c701bccc23013b9cdac358d ]
Metadata overcommit makes space reservations flexible, but it is also
harmful to active zone tracking. Since we cannot finish a block group
from the metadata allocation context, we might fail to activate a new
block group and thus be unable to actually write out the overcommitted
reservations.
So, disable metadata overcommit for zoned filesystems. We will ensure
the reservations are under active_total_bytes in the following patches.
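Conceptually, the change is a single early bailout in the overcommit
check; a sketch (the surrounding function and its arguments are
illustrative):
static bool can_overcommit(struct btrfs_fs_info *fs_info,
                           struct btrfs_space_info *space_info,
                           u64 bytes)
{
        /* No metadata overcommit on zoned: we may be unable to
         * activate a block group to back the overcommitted bytes. */
        if (btrfs_is_zoned(fs_info))
                return false;
        /* ... regular overcommit heuristics ... */
        return true;
}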
CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 393f646e34c18b85d0f41272bfcbd475ae3a0d34 ]
When we run out of active zones and not enough space is left in any
block group, we need to finish one block group to make room to activate
a new one.
However, we cannot do this for metadata block groups because we can cause a
deadlock by waiting for a running transaction commit. So, do that only for
a data block group.
Furthermore, the block group to be finished has two requirements. First,
the block group must not have reserved bytes left. Having reserved bytes
means we have an allocated region but did not yet send bios for it. If that
region is allocated by the thread calling btrfs_zone_finish(), it results
in a deadlock.
Second, the block group to be finished must not be a SYSTEM block
group. Finishing a SYSTEM block group easily breaks further chunk
allocation by nullifying the SYSTEM free space.
In certain cases, we cannot find any zone finish candidate, or
btrfs_zone_finish() may fail. In that case, we fall back to splitting
the allocation bytes and filling the last spaces left in the block
groups.
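A conceptual sketch of the candidate filter (illustrative; the list
iteration details are assumed):
list_for_each_entry(bg, &space_info->block_groups[index], list) {
        if (bg->reserved)       /* allocated but IO not yet submitted:
                                 * finishing now could deadlock */
                continue;
        if (bg->flags & BTRFS_BLOCK_GROUP_SYSTEM)
                continue;       /* keep SYSTEM space for chunk alloc */
        ret = btrfs_zone_finish(bg);
        if (!ret)
                break;          /* made room to activate a new bg */
}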
CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bb9950d3df7169a673c594d38fb74e241ed4fb2a ]
In preparation for a later patch, convert the return type from bool to
int and return errors. No functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7d7672bc5d1038c745716c397d892d21e29de71c ]
If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the
number of extents needed, btrfs releases too much of the metadata
reservation on its way to writing out the data.
Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
convert count_max_extents() to use it instead, fixing the calculation of
the metadata reservation.
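After the change, count_max_extents() is roughly the following (sketch
based on the description above; div_u64 is the usual kernel helper):
static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
{
        /* Round up: a partial tail still needs its own extent. */
        return div_u64(size + fs_info->max_extent_size - 1,
                       fs_info->max_extent_size);
}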
CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f7b12a62f008a3041f42f2426983e59a6a0a3c59 ]
On a zoned filesystem, data write-out is limited by max_zone_append_size,
and a large ordered extent is split according to the size of a bio. OTOH,
the number of extents to be written is calculated using
BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
metadata bytes needed to update and/or create the metadata items.
The metadata reservation is done at, e.g., btrfs_buffered_write() and
then released as the estimation changes. Thus, if the number of extents
increases massively, the reserved metadata can run out.
Such an increase in the number of extents easily occurs on a zoned
filesystem if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size, and it
causes the following warning on a small-RAM environment with metadata
over-commit disabled (done in the following patch).
[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G W 5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS: 0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166] <TASK>
[75721.739886] btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545] ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520] ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431] ? memcpy+0x4e/0x60
[75721.781931] split_leaf+0x433/0x12d0 [btrfs]
[75721.788392] ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636] ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759] ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156] btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300] ? lock_downgrade+0x7c0/0x7c0
[75721.824411] ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456] ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149] ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945] ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960] ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429] btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313] ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085] ? lock_release+0x552/0xf80
[75721.881957] ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886] ? __kasan_check_write+0x14/0x20
[75721.895152] ? do_raw_read_unlock+0x44/0x80
[75721.901323] ? _raw_write_lock_irq+0x60/0x80
[75721.907983] ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166] ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918] ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166] ? _raw_write_unlock+0x23/0x40
[75721.935116] ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041] btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906] ? try_to_wake_up+0x30/0x14a0
[75721.955700] ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661] ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111] ? lock_acquire+0x41b/0x4c0
[75721.974982] finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639] btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184] ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643] process_one_work+0x815/0x1460
[75722.000444] ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643] ? do_raw_spin_trylock+0xbb/0x190
[75722.013086] worker_thread+0x59a/0xeb0
[75722.018511] kthread+0x2ac/0x360
[75722.023428] ? process_one_work+0x1460/0x1460
[75722.029431] ? kthread_complete_and_exit+0x30/0x30
[75722.036044] ret_from_fork+0x22/0x30
[75722.041255] </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---
To fix the estimation, introduce fs_info->max_extent_size to replace
BTRFS_MAX_EXTENT_SIZE, which allows setting a different size for regular
vs zoned filesystems.
Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On a
zoned filesystem, set it to fs_info->max_zone_append_size.
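As a sketch, the initialization amounts to the following (illustrative
placement, names per the description above):
/* Default for regular filesystems. */
fs_info->max_extent_size = BTRFS_MAX_EXTENT_SIZE;

/* On zoned filesystems, a bio (and thus an extent) cannot exceed
 * the zone append limit. */
if (btrfs_is_zoned(fs_info))
        fs_info->max_extent_size = fs_info->max_zone_append_size;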
CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c2ae7b772ef4e86c5ddf3fd47bf59045ae96a414 ]
This patch is basically a revert of commit 5a80d1c6a270 ("btrfs: zoned:
remove max_zone_append_size logic"), but without the unnecessary ASSERT
and check. The max_zone_append_size will be used in later commits as a
hint to estimate the number of extents needed to cover a
delalloc/writeback region.
The size of a ZONE APPEND bio is also limited by queue_max_segments(),
so this commit takes it into account when calculating
max_zone_append_size. Technically, a bio can be larger than
queue_max_segments() * PAGE_SIZE if the pages are contiguous, but it is
safe to treat "queue_max_segments() * PAGE_SIZE" as an upper limit on
the extent size when calculating the number of extents needed to write
data.
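The resulting limit, as a sketch (queue_max_zone_append_sectors() and
queue_max_segments() are block layer helpers; the min_t() combination
follows the description above):
max_zone_append_size = min_t(u64,
        (u64)queue_max_zone_append_sectors(queue) << SECTOR_SHIFT,
        (u64)queue_max_segments(queue) << PAGE_SHIFT);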
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e26b04c4c91925dba57324db177a24e18e2d0013 ]
Commit 6f93e834fa7c seemingly inadvertently moved the code responsible
for flagging the filesystem as having BIG_METADATA to a place where
setting the flag was essentially lost. This means that
filesystems created with kernels containing this bug (starting with 5.15)
can potentially be mounted by older (pre-3.4) kernels. In reality, the
chances of this happening are low because other incompat flags have
been introduced in the meantime. Still, the correct behavior is to set
the INCOMPAT_BIG_METADATA flag and persist it in the superblock.
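Roughly, the fix ensures something like the following runs on every
mount (sketch; the helper names are the existing superblock accessors):
features = btrfs_super_incompat_flags(disk_super);
if (!(features & BTRFS_FEATURE_INCOMPAT_BIG_METADATA)) {
        features |= BTRFS_FEATURE_INCOMPAT_BIG_METADATA;
        btrfs_set_super_incompat_flags(disk_super, features);
}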
Fixes: 6f93e834fa7c ("btrfs: fix upper limit for max_inline for page size 64K")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1314ca78b2c35d3e7d0f097268a2ee6dc0d369ef ]
If you try to force a chunk allocation but race with another chunk
allocation, you will end up waiting on the chunk allocation that just
occurred and then allocate another chunk. With many threads all doing
this at once you can massively over-allocate chunks.
Fix this by resetting force to NO_FORCE: that way, if we think we need
to allocate we still can; otherwise we don't force another chunk
allocation if one is already happening.
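A sketch of the idea (simplified; CHUNK_ALLOC_NO_FORCE is the existing
enum value behind "NO_FORCE"):
/* We waited for a concurrent allocation to complete.  Downgrade the
 * request so we re-evaluate whether allocation is still needed
 * instead of unconditionally allocating another chunk. */
if (force == CHUNK_ALLOC_FORCE)
        force = CHUNK_ALLOC_NO_FORCE;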
Reviewed-by: Filipe Manana <fdmanana@suse.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 71aa147b4d9d81fa65afa6016f50d7818b64a54f ]
When cow_file_range() fails in the middle of the allocation loop, it
unlocks the pages but leaves the ordered extents intact. Thus, we need
to call btrfs_cleanup_ordered_extents() to finish the created ordered
extents.
Also, we need to call end_extent_writepage() if locked_page is available
because btrfs_cleanup_ordered_extents() never processes the region on
the locked_page.
Furthermore, if locked_page is unavailable, we need to set an error on
the mapping before unlocking the pages, so that the errno is properly
propagated to user space.
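Put together, the error path looks roughly like this sketch
(simplified; the argument lists are abbreviated):
if (ret) {
        /* Finish the ordered extents created before the failure. */
        btrfs_cleanup_ordered_extents(inode, locked_page, start,
                                      end - start + 1);
        if (locked_page)
                /* The locked page region is not covered above. */
                end_extent_writepage(locked_page, ret, page_start, page_end);
        else
                /* Propagate the errno to user space via the mapping. */
                mapping_set_error(inode->vfs_inode.i_mapping, ret);
}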
CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9ce7466f372d83054c7494f6b3e4b9abaf3f0355 ]
There is a hung_task report on zoned btrfs like below.
https://github.com/naota/linux/issues/59
[726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
[726.329839] Not tainted 5.16.0-rc1+ #1
[726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[726.331603] task:rocksdb:high0 state:D stack: 0 pid:11085 ppid: 11082 flags:0x00000000
[726.331608] Call Trace:
[726.331611] <TASK>
[726.331614] __schedule+0x2e5/0x9d0
[726.331622] schedule+0x58/0xd0
[726.331626] io_schedule+0x3f/0x70
[726.331629] __folio_lock+0x125/0x200
[726.331634] ? find_get_entries+0x1bc/0x240
[726.331638] ? filemap_invalidate_unlock_two+0x40/0x40
[726.331642] truncate_inode_pages_range+0x5b2/0x770
[726.331649] truncate_inode_pages_final+0x44/0x50
[726.331653] btrfs_evict_inode+0x67/0x480
[726.331658] evict+0xd0/0x180
[726.331661] iput+0x13f/0x200
[726.331664] do_unlinkat+0x1c0/0x2b0
[726.331668] __x64_sys_unlink+0x23/0x30
[726.331670] do_syscall_64+0x3b/0xc0
[726.331674] entry_SYSCALL_64_after_hwframe+0x44/0xae
[726.331677] RIP: 0033:0x7fb9490a171b
[726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
[726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
[726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
[726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
[726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
[726.331693] </TASK>
While debugging the issue, we found that running fstests generic/551 on
a 5GB non-zoned null_blk device in emulated zoned mode also hit a
similar hang.
We can also reproduce the same symptom with an error-injected
cow_file_range() setup.
The hang occurs when cow_file_range() fails in the middle of
allocation. cow_file_range() called from do_allocation_zoned() can
split the given region ([start, end]) for allocation depending on
current block group usage. When btrfs can allocate bytes for one part
of the split regions but fails for another (e.g. because of
-ENOSPC), we return the error leaving the pages in the succeeded regions
locked. Technically, this occurs only when @unlock == 0; otherwise, we
unlock the pages in an allocated region after creating an ordered
extent.
Considering that the callers of cow_file_range(unlock=0) won't write out
the pages, we can unlock the pages on error exit from
cow_file_range(). So, we can ensure all the pages except @locked_page
are unlocked in the error case.
In summary, cow_file_range() now behaves like this:
- page_started == 1 (return value)
  - All the pages are unlocked. IO is started.
- unlock == 1
  - All the pages except @locked_page are unlocked in any case
- unlock == 0
  - On success, all the pages are locked for writing them out
  - On failure, all the pages except @locked_page are unlocked
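A rough sketch of the error exit (illustrative; the clear_bits and
page_ops values are elided):
/* On error, unlock every page in the range except @locked_page so no
 * one is left waiting on pages that will never be written. */
if (ret && unlock == 0)
        extent_clear_unlock_delalloc(inode, start, end, locked_page,
                                     clear_bits, page_ops | PAGE_UNLOCK);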
Fixes: 42c011000963 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|