Age | Commit message (Collapse) | Author | Files | Lines |
|
commit 35fec1089ebb5617f85884d3fa6a699ce6337a75 upstream.
If do_zone_finish() is called with a filesystem that has missing devices
(e.g. a RAID file system mounted in degraded mode) it is accessing the
btrfs_device::zone_info pointer, which will not be set if the device
in question is missing.
Check if the device is present (by checking if it has a valid block device
pointer associated) and if not, skip zone finishing for it.
Fixes: 4dcbb8ab31c1 ("btrfs: zoned: make zone finishing multi stripe capable")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2bbc4a45e5eb6b868357c1045bf6f38f6ba576e0 upstream.
If btrfs_zone_activate() is called with a filesystem that has missing
devices (e.g. a RAID file system mounted in degraded mode) it is accessing
the btrfs_device::zone_info pointer, which will not be set if the device in
question is missing.
Check if the device is present (by checking if it has a valid block
device pointer associated) and if not, skip zone activation for it.
Fixes: f9a912a3c45f ("btrfs: zoned: make zone activation multi stripe capable")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 009ca358486ded9b4822eddb924009b6848d7271 upstream.
If we fail to add the chunk map to the fs mapping tree we exit
test_rmap_block() without freeing the chunk map. Fix this by adding a
call to btrfs_free_chunk_map() before exiting the test function if the
call to btrfs_add_chunk_map() failed.
Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
workers
commit 4c782247b89376a83fa132f7d45d6977edae0629 upstream.
At close_ctree() after we have ran delayed iputs either through explicitly
calling btrfs_run_delayed_iputs() or later during the call to
btrfs_commit_super() or btrfs_error_commit_super(), we assert that the
delayed iputs list is empty.
When we have compressed writes this assertion may fail because delayed
iputs may have been added to the list after we last ran delayed iputs.
This happens like this:
1) We have a compressed write bio executing;
2) We enter close_ctree() and flush the fs_info->endio_write_workers
queue which is the queue used for running ordered extent completion;
3) The compressed write bio finishes and enters
btrfs_finish_compressed_write_work(), where it calls
btrfs_finish_ordered_extent() which in turn calls
btrfs_queue_ordered_fn(), which queues a work item in the
fs_info->endio_write_workers queue that we have flushed before;
4) At close_ctree() we proceed, run all existing delayed iputs and
call btrfs_commit_super() (which also runs delayed iputs), but before
we run the following assertion below:
ASSERT(list_empty(&fs_info->delayed_iputs))
A delayed iput is added by the step below...
5) The ordered extent completion job queued in step 3 runs and results in
creating a delayed iput when dropping the last reference of the ordered
extent (a call to btrfs_put_ordered_extent() made from
btrfs_finish_one_ordered());
6) At this point the delayed iputs list is not empty, so the assertion at
close_ctree() fails.
Fix this by flushing the fs_info->compressed_write_workers queue at
close_ctree() before flushing the fs_info->endio_write_workers queue,
respecting the queue dependency as the later is responsible for the
execution of ordered extent completion.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7511e29cf1355b2c47d0effb39e463119913e2f6 ]
As far as I can tell, these calls of list_del_init() on bg_list cannot
run concurrently with btrfs_mark_bg_unused() or btrfs_mark_bg_to_reclaim(),
as they are in transaction error paths and situations where the block
group is readonly.
However, if there is any chance at all of racing with mark_bg_unused(),
or a different future user of bg_list, better to be safe than sorry.
Otherwise we risk the following interleaving (bg_list refcount in parens)
T1 (some random op) T2 (btrfs_mark_bg_unused)
!list_empty(&bg->bg_list); (1)
list_del_init(&bg->bg_list); (1)
list_move_tail (1)
btrfs_put_block_group (0)
btrfs_delete_unused_bgs
bg = list_first_entry
list_del_init(&bg->bg_list);
btrfs_put_block_group(bg); (-1)
Ultimately, this results in a broken ref count that hits zero one deref
early and the real final deref underflows the refcount, resulting in a WARNING.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 01af106a076352182b2916b143fc50272600bd81 ]
It is meaningless to shift a byte count by folio_shift(). The folio index
is in units of PAGE_SIZE, not folio_size(). We can use folio_contains()
to make this work for arbitrary-order folios, so remove the assertion
that the folios are of order 0.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fdef89ce6fada462aef9cb90a140c93c8c209f0f ]
At btrfs_qgroup_cleanup_dropped_subvolume() all we want to commit the
current transaction in order to have all the qgroup rfer/excl numbers up
to date. However we are using btrfs_start_transaction(), which joins the
current transaction if there is one that is not yet committing, but also
starts a new one if there is none or if the current one is already
committing (its state is >= TRANS_STATE_COMMIT_START). This later case
results in unnecessary IO, wasting time and a pointless rotation of the
backup roots in the super block.
So instead of using btrfs_start_transaction() followed by a
btrfs_commit_transaction(), use btrfs_commit_current_transaction() which
achieves our purpose and avoids starting and committing new transactions.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 35d99c68af40a8ca175babc5a89ef7e2226fb3ca upstream.
Add btrfs_free_chunk_map() to free the memory allocated
by btrfs_alloc_chunk_map() if btrfs_add_chunk_map() fails.
Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5a4041f2c47247575a6c2e53ce14f7b0ac946c33 upstream.
Running generic/751 on the for-next branch often results in a hang like
below. They are both stack by locking an extent. This suggests someone
forget to unlock an extent.
INFO: task kworker/u128:1:12 blocked for more than 323 seconds.
Not tainted 6.13.0-BTRFS-ZNS+ #503
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:1 state:D stack:0 pid:12 tgid:12 ppid:2 flags:0x00004000
Workqueue: btrfs-fixup btrfs_work_helper [btrfs]
Call Trace:
<TASK>
__schedule+0x534/0xdd0
schedule+0x39/0x140
__lock_extent+0x31b/0x380 [btrfs]
? __pfx_autoremove_wake_function+0x10/0x10
btrfs_writepage_fixup_worker+0xf1/0x3a0 [btrfs]
btrfs_work_helper+0xff/0x480 [btrfs]
? lock_release+0x178/0x2c0
process_one_work+0x1ee/0x570
? srso_return_thunk+0x5/0x5f
worker_thread+0x1d1/0x3b0
? __pfx_worker_thread+0x10/0x10
kthread+0x10b/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/u134:0:184 blocked for more than 323 seconds.
Not tainted 6.13.0-BTRFS-ZNS+ #503
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u134:0 state:D stack:0 pid:184 tgid:184 ppid:2 flags:0x00004000
Workqueue: writeback wb_workfn (flush-btrfs-4)
Call Trace:
<TASK>
__schedule+0x534/0xdd0
schedule+0x39/0x140
__lock_extent+0x31b/0x380 [btrfs]
? __pfx_autoremove_wake_function+0x10/0x10
find_lock_delalloc_range+0xdb/0x260 [btrfs]
writepage_delalloc+0x12f/0x500 [btrfs]
? srso_return_thunk+0x5/0x5f
extent_write_cache_pages+0x232/0x840 [btrfs]
btrfs_writepages+0x72/0x130 [btrfs]
do_writepages+0xe7/0x260
? srso_return_thunk+0x5/0x5f
? lock_acquire+0xd2/0x300
? srso_return_thunk+0x5/0x5f
? find_held_lock+0x2b/0x80
? wbc_attach_and_unlock_inode.part.0+0x102/0x250
? wbc_attach_and_unlock_inode.part.0+0x102/0x250
__writeback_single_inode+0x5c/0x4b0
writeback_sb_inodes+0x22d/0x550
__writeback_inodes_wb+0x4c/0xe0
wb_writeback+0x2f6/0x3f0
wb_workfn+0x32a/0x510
process_one_work+0x1ee/0x570
? srso_return_thunk+0x5/0x5f
worker_thread+0x1d1/0x3b0
? __pfx_worker_thread+0x10/0x10
kthread+0x10b/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
This happens because we have another success path for the zoned mode. When
there is no active zone available, btrfs_reserve_extent() returns
-EAGAIN. In this case, we have two reactions.
(1) If the given range is never allocated, we can only wait for someone
to finish a zone, so wait on BTRFS_FS_NEED_ZONE_FINISH bit and retry
afterward.
(2) Or, if some allocations are already done, we must bail out and let
the caller to send IOs for the allocation. This is because these IOs
may be necessary to finish a zone.
The commit 06f364284794 ("btrfs: do proper folio cleanup when
cow_file_range() failed") moved the unlock code from the inside of the
loop to the outside. So, previously, the allocated extents are unlocked
just after the allocation and so before returning from the function.
However, they are no longer unlocked on the case (2) above. That caused
the hang issue.
Fix the issue by modifying the 'end' to the end of the allocated
range. Then, we can exit the loop and the same unlock code can properly
handle the case.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes: 06f364284794 ("btrfs: do proper folio cleanup when cow_file_range() failed")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
size
commit efa11fd269c139e29b71ec21bc9c9c0063fde40d upstream.
[BUG]
When running generic/418 with a btrfs whose block size < page size
(subpage cases), it always fails.
And the following minimal reproducer is more than enough to trigger it
reliably:
workload()
{
mkfs.btrfs -s 4k -f $dev > /dev/null
dmesg -C
mount $dev $mnt
$fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest
ret=$?
umount $mnt
stop_trace
if [ $ret -ne 0 ]; then
fail
fi
}
for (( i = 0; i < 1024; i++)); do
echo "=== $i/$runtime ==="
workload
done
[CAUSE]
With extra trace printk added to the following functions:
- btrfs_buffered_write()
* Which folio is touched
* The file offset (start) where the buffered write is at
* How many bytes are copied
* The content of the write (the first 2 bytes)
- submit_one_sector()
* Which folio is touched
* The position inside the folio
* The content of the page cache (the first 2 bytes)
- pagecache_isize_extended()
* The parameters of the function itself
* The parameters of the folio_zero_range()
Which are enough to show the problem:
22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101
22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101
22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101
22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101
22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192
22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000
22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101
The tool dio-invalidate-cache will start 3 threads, each doing a buffered
write with 0x01 at offset 0, 4096 and 8192, do a fsync, then do a direct read,
and compare the read buffer with the write buffer.
Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into
the page cache.
But at submit_one_sector(), at file offset 4096, the content is zeroed
out, by pagecache_isize_extended().
The race happens like this:
Thread A is writing into range [4K, 8K).
Thread B is writing into range [8K, 12k).
Thread A | Thread B
-------------------------------------+------------------------------------
btrfs_buffered_write() | btrfs_buffered_write()
|- old_isize = 4K; | |- old_isize = 4096;
|- btrfs_inode_lock() | |
|- write into folio range [4K, 8K) | |
|- pagecache_isize_extended() | |
| extend isize from 4096 to 8192 | |
| no folio_zero_range() called | |
|- btrfs_inode_lock() | |
| |- btrfs_inode_lock()
| |- write into folio range [8K, 12K)
| |- pagecache_isize_extended()
| | calling folio_zero_range(4K, 8K)
| | This is caused by the old_isize is
| | grabbed too early, without any
| | inode lock.
| |- btrfs_inode_unlock()
The @old_isize is grabbed without inode lock, causing race between two
buffered write threads and making pagecache_isize_extended() to zero
range which is still containing cached data.
And this is only affecting subpage btrfs, because for regular blocksize
== page size case, the function pagecache_isize_extended() will do
nothing if the block size >= page size.
[FIX]
Grab the old i_size while holding the inode lock.
This means each buffered write thread will have a stable view of the
old inode size, thus avoid the above race.
CC: stable@vger.kernel.org # 5.15+
Fixes: 5e8b9ef30392 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 59f37036bb7ab3d554c24abc856aabca01126414 upstream.
At btrfs_scan_root() we are accessing the inode's root (and fs_info) in a
call to btrfs_fs_closing() after we have scheduled the inode for a delayed
iput, and that can result in a use-after-free on the inode in case the
cleaner kthread does the iput before we dereference the inode in the call
to btrfs_fs_closing().
Fix this by using the fs_info stored already in a local variable instead
of doing inode->root->fs_info.
Fixes: 102044384056 ("btrfs: make the extent map shrinker run asynchronously as a work queue job")
CC: stable@vger.kernel.org # 6.13+
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 15b3b3254d1453a8db038b7d44b311a2d6c71f98 upstream.
The extent map shrinker now runs in the system unbound workqueue and no
longer in kswapd context so it can directly do an iput() on inodes even
if that blocks or needs to acquire any lock (we aren't holding any locks
when requesting the delayed iput from the shrinker). So we don't need to
add a delayed iput, wake up the cleaner and delegate the iput() to the
cleaner, which also adds extra contention on the spinlock that protects
the delayed iputs list.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c6c9c4d56483d941f567eb921434c25fc6086dfa upstream.
If there are inodes that don't have any loaded extent maps, we end up
grabbing a reference on them and later adding a delayed iput, which wakes
up the cleaner and makes it do unnecessary work. This is common when for
example the inodes were open only to run stat(2) or all their extent maps
were already released through the folio release callback
(btrfs_release_folio()) or released by a previous run of the shrinker, or
directories which never have extent maps.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.13+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8bf334beb3496da3c3fbf3daf3856f7eec70dacc ]
[BUG]
If submit_one_sector() failed inside extent_writepage_io() for sector
size < page size cases (e.g. 4K sector size and 64K page size), then
we can hit double ordered extent accounting error.
This should be very rare, as submit_one_sector() only fails when we
failed to grab the extent map, and such extent map should exist inside
the memory and has been pinned.
[CAUSE]
For example we have the following folio layout:
0 4K 32K 48K 60K 64K
|//| |//////| |///|
Where |///| is the dirty range we need to writeback. The 3 different
dirty ranges are submitted for regular COW.
Now we hit the following sequence:
- submit_one_sector() returned 0 for [0, 4K)
- submit_one_sector() returned 0 for [32K, 48K)
- submit_one_sector() returned error for [60K, 64K)
- btrfs_mark_ordered_io_finished() called for the whole folio
This will mark the following ranges as finished:
* [0, 4K)
* [32K, 48K)
Both ranges have their IO already submitted, this cleanup will
lead to double accounting.
* [60K, 64K)
That's the correct cleanup.
The only good news is, this error is only theoretical, as the target
extent map is always pinned, thus we should directly grab it from
memory, other than reading it from the disk.
[FIX]
Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
range, which can touch ranges we should not touch, instead
move the error handling inside extent_writepage_io().
So that we can cleanup exact sectors that ought to be submitted but failed.
This provides much more accurate cleanup, avoiding the double accounting.
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 72dad8e377afa50435940adfb697e070d3556670 ]
[BUG]
When running btrfs with block size (4K) smaller than page size (64K,
aarch64), there is a very high chance to crash the kernel at
generic/750, with the following messages:
(before the call traces, there are 3 extra debug messages added)
BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental
BTRFS info (device dm-3): checking UUID tree
hrtimer: interrupt took 5451385 ns
BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28
BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28
BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
------------[ cut here ]------------
WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs]
CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G OE 6.13.0-rc1-custom+ #89
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
Call trace:
can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
extent_writepage+0x10c/0x3b8 [btrfs]
extent_write_cache_pages+0x21c/0x4e8 [btrfs]
btrfs_writepages+0x94/0x160 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x74/0xa0
start_delalloc_inodes+0x17c/0x3b0 [btrfs]
btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
shrink_delalloc+0x11c/0x280 [btrfs]
flush_space+0x288/0x328 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x228/0x680
worker_thread+0x1bc/0x360
kthread+0x100/0x118
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G W OE 6.13.0-rc1-custom+ #89
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs_work_helper [btrfs] (btrfs-endio-write)
pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : process_one_work+0x110/0x680
lr : worker_thread+0x1bc/0x360
Call trace:
process_one_work+0x110/0x680 (P)
worker_thread+0x1bc/0x360 (L)
worker_thread+0x1bc/0x360
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 2-3
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: 0x275bb9540000 from 0xffff800080000000
PHYS_OFFSET: 0xffff8fbba0000000
CPU features: 0x100,00000070,00801250,8201720b
[CAUSE]
The above warning is triggered immediately after the delalloc range
failure, this happens in the following sequence:
- Range [1568K, 1636K) is dirty
1536K 1568K 1600K 1636K 1664K
| |/////////|////////| |
Where 1536K, 1600K and 1664K are page boundaries (64K page size)
- Enter extent_writepage() for page 1536K
- Enter run_delalloc_nocow() with locked page 1536K and range
[1568K, 1636K)
This is due to the inode having preallocated extents.
- Enter cow_file_range() with locked page 1536K and range
[1568K, 1636K)
- btrfs_reserve_extent() only reserved two extents
The main loop of cow_file_range() only reserved two data extents,
Now we have:
1536K 1568K 1600K 1636K 1664K
| |<-->|<--->|/|///////| |
1584K 1596K
Range [1568K, 1596K) has an ordered extent reserved.
- btrfs_reserve_extent() failed inside cow_file_range() for file offset
1596K
This is already a bug in our space reservation code, but for now let's
focus on the error handling path.
Now cow_file_range() returned -ENOSPC.
- btrfs_run_delalloc_range() do error cleanup <<< ROOT CAUSE
Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range
[1568K, 1636K)
Function btrfs_cleanup_ordered_extents() normally needs to skip the
ranges inside the folio, as it will normally be cleaned up by
extent_writepage().
Such split error handling is already problematic in the first place.
What's worse is the folio range skipping itself, which is not taking
subpage cases into consideration at all, it will only skip the range
if the page start >= the range start.
In our case, the page start < the range start, since for subpage cases
we can have delalloc ranges inside the folio but not covering the
folio.
So it doesn't skip the page range at all.
This means all the ordered extents, both [1568K, 1584K) and
[1584K, 1596K) will be marked as IOERR.
And these two ordered extents have no more pending ios, they are marked
finished, and *QUEUED* to be deleted from the io tree.
- extent_writepage() do error cleanup
Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).
Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the
deletion from io tree is async, it may or may not happen at this
time.
If the ranges have not yet been removed, we will do double cleaning on
those ranges, triggering the above ordered extent warnings.
In theory there are other bugs, like the cleanup in extent_writepage()
can cause double accounting on ranges that are submitted asynchronously
(compression for example).
But that's much harder to trigger because normally we do not mix regular
and compression delalloc ranges.
[FIX]
The folio range split is already buggy and not subpage compatible, it
was introduced a long time ago where subpage support was not even considered.
So instead of splitting the ordered extents cleanup into the folio range
and out of folio range, do all the cleanup inside writepage_delalloc().
- Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
btrfs_run_delalloc_range()
- Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc()
failed
So all ordered extents are only cleaned up by
btrfs_run_delalloc_range().
- Handle the ranges that already have ordered extents allocated
If part of the folio already has ordered extent allocated, and
btrfs_run_delalloc_range() failed, we also need to cleanup that range.
Now we have a concentrated error handling for ordered extents during
btrfs_run_delalloc_range().
Fixes: d1051d6ebf8e ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 8bf334beb349 ("btrfs: fix double accounting race when extent_writepage_io() failed")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 011a9a1f244656cc3cbde47edba2b250f794d440 ]
As extent_writepage() is internal helper we should use our inode type,
so change it from struct inode.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 8bf334beb349 ("btrfs: fix double accounting race when extent_writepage_io() failed")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit acc18e1c1d8c0d59d793cf87790ccfcafb1bf5f0 ]
After commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire
read") we can now trigger a race between a task doing a direct IO write
and readahead. When this race is triggered it results in tasks getting
stale data when they attempt do a buffered read (including the task that
did the direct IO write).
This race can be sporadically triggered with test case generic/418, failing
like this:
$ ./check generic/418
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025
MKFS_OPTIONS -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad)
# --- tests/generic/418.out 2020-06-10 19:29:03.850519863 +0100
# +++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad 2025-02-03 15:42:36.974609476 +0000
@@ -1,2 +1,5 @@
QA output created by 418
+cmpbuf: offset 0: Expected: 0x1, got 0x0
+[6:0] FAIL - comparison failed, offset 24576
+diotest -wp -b 4096 -n 8 -i 4 failed at loop 3
Silence is golden
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad' to see the entire diff)
Ran: generic/418
Failures: generic/418
Failed 1 of 1 tests
The race happens like this:
1) A file has a prealloc extent for the range [16K, 28K);
2) Task A starts a direct IO write against file range [24K, 28K).
At the start of the direct IO write it invalidates the page cache at
__iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file
offset 24K;
3) Task A enters btrfs_dio_iomap_begin() and locks the extent range
[24K, 28K);
4) Task B starts a readahead for file range [16K, 28K), entering
btrfs_readahead().
First it attempts to read the page at offset 16K by entering
btrfs_do_readpage(), where it calls get_extent_map(), locks the range
[16K, 20K) and gets the extent map for the range [16K, 28K), caching
it into the 'em_cached' variable declared in the local stack of
btrfs_readahead(), and then unlocks the range [16K, 20K).
Since the extent map has the prealloc flag, at btrfs_do_readpage() we
zero out the page's content and don't submit any bio to read the page
from the extent.
Then it attempts to read the page at offset 20K entering
btrfs_do_readpage() where we reuse the previously cached extent map
(decided by get_extent_map()) since it spans the page's range and
it's still in the inode's extent map tree.
Just like for the previous page, we zero out the page's content since
the extent map has the prealloc flag set.
Then it attempts to read the page at offset 24K entering
btrfs_do_readpage() where we reuse the previously cached extent map
(decided by get_extent_map()) since it spans the page's range and
it's still in the inode's extent map tree.
Just like for the previous pages, we zero out the page's content since
the extent map has the prealloc flag set. Note that we didn't lock the
extent range [24K, 28K), so we didn't synchronize with the ongoing
direct IO write being performed by task A;
5) Task A enters btrfs_create_dio_extent() and creates an ordered extent
for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and
BTRFS_ORDERED_PREALLOC set;
6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin();
7) The ordered extent enters btrfs_finish_one_ordered() and locks the
range [24K, 28K);
8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries
to invalidate the page at offset 24K by calling
kiocb_invalidate_post_direct_write(), resulting in a call chain that
ends up at btrfs_release_folio().
The btrfs_release_folio() call ends up returning false because the range
for the page at file offset 24K is currently locked by the task doing
the ordered extent completion in the previous step (7), so we have:
btrfs_release_folio() ->
__btrfs_release_folio() ->
try_release_extent_mapping() ->
try_release_extent_state()
This last function checking that the range is locked and returning false
and propagating it up to btrfs_release_folio().
So this results in a failure to invalidate the page and
kiocb_invalidate_post_direct_write() triggers this message logged in
dmesg:
Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
After this we leave the page cache with stale data for the file range
[24K, 28K), filled with zeroes instead of the data written by direct IO
write (all bytes with a 0x01 value), so any task attempting to read with
buffered IO, including the task that did the direct IO write, will get
all bytes in the range with a 0x00 value instead of the written data.
Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(),
at the two callers of btrfs_do_readpage() instead of doing it at
get_extent_map(), just like we did before commit ac325fc2aad5 ("btrfs: do
not hold the extent lock for entire read"), and unlocking the range after
all the calls to btrfs_do_readpage(). This way we never reuse a cached
extent map without flushing any pending ordered extents from a concurrent
direct IO write.
Fixes: ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 06de96faf795b5c276a3be612da6b08c6112e747 ]
The double underscore naming scheme does not apply here, there's only
only get_extent_map(). As the definition is changed also pass the struct
btrfs_inode.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: acc18e1c1d8c ("btrfs: fix stale page cache after race between readahead and direct IO write")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit da2dccd7451de62b175fb8f0808d644959e964c7 upstream.
At btrfs_write_check() if our file's i_size is not sector size aligned and
we have a write that starts at an offset larger than the i_size that falls
within the same page of the i_size, then we end up not zeroing the file
range [i_size, write_offset).
The code is this:
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
/* Expand hole size to cover write data, preventing empty gap */
loff_t end_pos = round_up(pos + count, fs_info->sectorsize);
ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos);
if (ret)
return ret;
}
So if our file's i_size is 90269 bytes and a write at offset 90365 bytes
comes in, we get 'start_pos' set to 90112 bytes, which is less than the
i_size and therefore we don't zero out the range [90269, 90365) by
calling btrfs_cont_expand().
This is an old bug introduced in commit 9036c10208e1 ("Btrfs: update hole
handling v2"), from 2008, and the buggy code got moved around over the
years.
Fix this by discarding 'start_pos' and comparing against the write offset
('pos') without any alignment.
This bug was recently exposed by test case generic/363 which tests this
scenario by polluting ranges beyond EOF with an mmap write and than verify
that after a file increases we get zeroes for the range which is supposed
to be a hole and not what we wrote with the previous mmaped write.
We're only seeing this exposed now because generic/363 used to run only
on xfs until last Sunday's fstests update.
The test was failing like this:
$ ./check generic/363
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025
MKFS_OPTIONS -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
generic/363 0s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad)
# --- tests/generic/363.out 2025-02-05 15:31:14.013646509 +0000
# +++ /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad 2025-02-05 17:25:33.112630781 +0000
@@ -1 +1,46 @@
QA output created by 363
+READ BAD DATA: offset = 0xdcad, size = 0xd921, fname = /home/fdmanana/btrfs-tests/dev/junk
+OFFSET GOOD BAD RANGE
+0x1609d 0x0000 0x3104 0x0
+operation# (mod 256) for the bad data may be 4
+0x1609e 0x0000 0x0472 0x1
+operation# (mod 256) for the bad data may be 4
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/363.out /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad' to see the entire diff)
Ran: generic/363
Failures: generic/363
Failed 1 of 1 tests
Fixes: 9036c10208e1 ("Btrfs: update hole handling v2")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit dc14ba10781bd2629835696b7cc1febf914768e9 ]
Don't use btrfs_set_item_key_safe() to modify the keys in the RAID
stripe-tree, as this can lead to corruption of the tree, which is caught
by the checks in btrfs_set_item_key_safe():
BTRFS info (device nvme1n1): leaf 49168384 gen 15 total ptrs 194 free space 8329 owner 12
BTRFS info (device nvme1n1): refs 2 lock_owner 1030 current 1030
[ snip ]
item 105 key (354549760 230 20480) itemoff 14587 itemsize 16
stride 0 devid 5 physical 67502080
item 106 key (354631680 230 4096) itemoff 14571 itemsize 16
stride 0 devid 1 physical 88559616
item 107 key (354631680 230 32768) itemoff 14555 itemsize 16
stride 0 devid 1 physical 88555520
item 108 key (354717696 230 28672) itemoff 14539 itemsize 16
stride 0 devid 2 physical 67604480
[ snip ]
BTRFS critical (device nvme1n1): slot 106 key (354631680 230 32768) new key (354635776 230 4096)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:2602!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 1 UID: 0 PID: 1055 Comm: fsstress Not tainted 6.13.0-rc1+ #1464
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0xf7/0x270
Code: <snip>
RSP: 0018:ffffc90001337ab0 EFLAGS: 00010287
RAX: 0000000000000000 RBX: ffff8881115fd000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00000000ffffffff
RBP: ffff888110ed6f50 R08: 00000000ffffefff R09: ffffffff8244c500
R10: 00000000ffffefff R11: 00000000ffffffff R12: ffff888100586000
R13: 00000000000000c9 R14: ffffc90001337b1f R15: ffff888110f23b58
FS: 00007f7d75c72740(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa811652c60 CR3: 0000000111398001 CR4: 0000000000370eb0
Call Trace:
<TASK>
? __die_body.cold+0x14/0x1a
? die+0x2e/0x50
? do_trap+0xca/0x110
? do_error_trap+0x65/0x80
? btrfs_set_item_key_safe+0xf7/0x270
? exc_invalid_op+0x50/0x70
? btrfs_set_item_key_safe+0xf7/0x270
? asm_exc_invalid_op+0x1a/0x20
? btrfs_set_item_key_safe+0xf7/0x270
btrfs_partially_delete_raid_extent+0xc4/0xe0
btrfs_delete_raid_extent+0x227/0x240
__btrfs_free_extent.isra.0+0x57f/0x9c0
? exc_coproc_segment_overrun+0x40/0x40
__btrfs_run_delayed_refs+0x2fa/0xe80
btrfs_run_delayed_refs+0x81/0xe0
btrfs_commit_transaction+0x2dd/0xbe0
? preempt_count_add+0x52/0xb0
btrfs_sync_file+0x375/0x4c0
do_fsync+0x39/0x70
__x64_sys_fsync+0x13/0x20
do_syscall_64+0x54/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f7d7550ef90
Code: <snip>
RSP: 002b:00007ffd70237248 EFLAGS: 00000202 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f7d7550ef90
RDX: 000000000000013a RSI: 000000000040eb28 RDI: 0000000000000004
RBP: 000000000000001b R08: 0000000000000078 R09: 00007ffd7023725c
R10: 00007f7d75400390 R11: 0000000000000202 R12: 028f5c28f5c28f5c
R13: 8f5c28f5c28f5c29 R14: 000000000040b520 R15: 00007f7d75c726c8
</TASK>
While the root cause of the tree order corruption isn't clear, using
btrfs_duplicate_item() to copy the item and then adjusting both the key
and the per-device physical addresses is a safe way to counter this
problem.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6a4730b325aaa48f7a5d5ba97aff0a955e2d9cec ]
This BUG_ON is meant to catch backref cache problems, but these can
arise from either bugs in the backref cache or corruption in the extent
tree. Fix it to be a proper error.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
btrfs_drop_extents()
[ Upstream commit 5324c4e10e9c2ce307a037e904c0d9671d7137d9 ]
A data race occurs when the function `insert_ordered_extent_file_extent()`
and the function `btrfs_inode_safe_disk_i_size_write()` are executed
concurrently. The function `insert_ordered_extent_file_extent()` is not
locked when reading inode->disk_i_size, causing
`btrfs_inode_safe_disk_i_size_write()` to cause data competition when
writing inode->disk_i_size, thus affecting the value of `modify_tree`.
The specific call stack that appears during testing is as follows:
============DATA_RACE============
btrfs_drop_extents+0x89a/0xa060 [btrfs]
insert_reserved_file_extent+0xb54/0x2960 [btrfs]
insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs]
btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs]
btrfs_finish_ordered_io+0x37/0x60 [btrfs]
finish_ordered_fn+0x3e/0x50 [btrfs]
btrfs_work_helper+0x9c9/0x27a0 [btrfs]
process_scheduled_works+0x716/0xf10
worker_thread+0xb6a/0x1190
kthread+0x292/0x330
ret_from_fork+0x4d/0x80
ret_from_fork_asm+0x1a/0x30
============OTHER_INFO============
btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs]
btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs]
btrfs_finish_ordered_io+0x37/0x60 [btrfs]
finish_ordered_fn+0x3e/0x50 [btrfs]
btrfs_work_helper+0x9c9/0x27a0 [btrfs]
process_scheduled_works+0x716/0xf10
worker_thread+0xb6a/0x1190
kthread+0x292/0x330
ret_from_fork+0x4d/0x80
ret_from_fork_asm+0x1a/0x30
=================================
The main purpose of the check of the inode's disk_i_size is to avoid
taking write locks on a btree path when we have a write at or beyond
EOF, since in these cases we don't expect to find extent items in the
root to drop. However if we end up taking write locks due to a data
race on disk_i_size, everything is still correct, we only add extra
lock contention on the tree in case there's concurrency from other tasks.
If the race causes us to not take write locks when we actually need them,
then everything is functionally correct as well, since if we find out we
have extent items to drop and we took read locks (modify_tree set to 0),
we release the path and retry again with write locks.
Since this data race does not affect the correctness of the function,
it is a harmless data race, use data_race() to check inode->disk_i_size.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e2f0943cf37305dbdeaf9846e3c941451bcdef63 ]
When we are trying to join the current transaction and if it's aborted,
we read its 'aborted' field after unlocking fs_info->trans_lock and
without holding any extra reference count on it. This means that a
concurrent task that is aborting the transaction may free the transaction
before we read its 'aborted' field, leading to a use-after-free.
Fix this by reading the 'aborted' field while holding fs_info->trans_lock
since any freeing task must first acquire that lock and set
fs_info->running_transaction to NULL before freeing the transaction.
This was reported by syzbot and Dmitry with the following stack traces
from KASAN:
==================================================================
BUG: KASAN: slab-use-after-free in join_transaction+0xd9b/0xda0 fs/btrfs/transaction.c:278
Read of size 4 at addr ffff888011839024 by task kworker/u4:9/1128
CPU: 0 UID: 0 PID: 1128 Comm: kworker/u4:9 Not tainted 6.13.0-rc7-syzkaller-00019-gc45323b7560e #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Workqueue: events_unbound btrfs_async_reclaim_data_space
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x169/0x550 mm/kasan/report.c:489
kasan_report+0x143/0x180 mm/kasan/report.c:602
join_transaction+0xd9b/0xda0 fs/btrfs/transaction.c:278
start_transaction+0xaf8/0x1670 fs/btrfs/transaction.c:697
flush_space+0x448/0xcf0 fs/btrfs/space-info.c:803
btrfs_async_reclaim_data_space+0x159/0x510 fs/btrfs/space-info.c:1321
process_one_work kernel/workqueue.c:3236 [inline]
process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3317
worker_thread+0x870/0xd30 kernel/workqueue.c:3398
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
Allocated by task 5315:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
__kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
kasan_kmalloc include/linux/kasan.h:260 [inline]
__kmalloc_cache_noprof+0x243/0x390 mm/slub.c:4329
kmalloc_noprof include/linux/slab.h:901 [inline]
join_transaction+0x144/0xda0 fs/btrfs/transaction.c:308
start_transaction+0xaf8/0x1670 fs/btrfs/transaction.c:697
btrfs_create_common+0x1b2/0x2e0 fs/btrfs/inode.c:6572
lookup_open fs/namei.c:3649 [inline]
open_last_lookups fs/namei.c:3748 [inline]
path_openat+0x1c03/0x3590 fs/namei.c:3984
do_filp_open+0x27f/0x4e0 fs/namei.c:4014
do_sys_openat2+0x13e/0x1d0 fs/open.c:1402
do_sys_open fs/open.c:1417 [inline]
__do_sys_creat fs/open.c:1495 [inline]
__se_sys_creat fs/open.c:1489 [inline]
__x64_sys_creat+0x123/0x170 fs/open.c:1489
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 5336:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582
poison_slab_object mm/kasan/common.c:247 [inline]
__kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
kasan_slab_free include/linux/kasan.h:233 [inline]
slab_free_hook mm/slub.c:2353 [inline]
slab_free mm/slub.c:4613 [inline]
kfree+0x196/0x430 mm/slub.c:4761
cleanup_transaction fs/btrfs/transaction.c:2063 [inline]
btrfs_commit_transaction+0x2c97/0x3720 fs/btrfs/transaction.c:2598
insert_balance_item+0x1284/0x20b0 fs/btrfs/volumes.c:3757
btrfs_balance+0x992/0x10c0 fs/btrfs/volumes.c:4633
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff888011839000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 36 bytes inside of
freed 2048-byte region [ffff888011839000, ffff888011839800)
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11838
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000040 ffff88801ac42000 ffffea0000493400 dead000000000002
raw: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
head: 00fff00000000040 ffff88801ac42000 ffffea0000493400 dead000000000002
head: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
head: 00fff00000000003 ffffea0000460e01 ffffffffffffffff 0000000000000000
head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 57, tgid 57 (kworker/0:2), ts 67248182943, free_ts 67229742023
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1558
prep_new_page mm/page_alloc.c:1566 [inline]
get_page_from_freelist+0x365c/0x37a0 mm/page_alloc.c:3476
__alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4753
alloc_pages_mpol_noprof+0x3e1/0x780 mm/mempolicy.c:2269
alloc_slab_page+0x6a/0x110 mm/slub.c:2423
allocate_slab+0x5a/0x2b0 mm/slub.c:2589
new_slab mm/slub.c:2642 [inline]
___slab_alloc+0xc27/0x14a0 mm/slub.c:3830
__slab_alloc+0x58/0xa0 mm/slub.c:3920
__slab_alloc_node mm/slub.c:3995 [inline]
slab_alloc_node mm/slub.c:4156 [inline]
__do_kmalloc_node mm/slub.c:4297 [inline]
__kmalloc_node_track_caller_noprof+0x2e9/0x4c0 mm/slub.c:4317
kmalloc_reserve+0x111/0x2a0 net/core/skbuff.c:609
__alloc_skb+0x1f3/0x440 net/core/skbuff.c:678
alloc_skb include/linux/skbuff.h:1323 [inline]
alloc_skb_with_frags+0xc3/0x820 net/core/skbuff.c:6612
sock_alloc_send_pskb+0x91a/0xa60 net/core/sock.c:2884
sock_alloc_send_skb include/net/sock.h:1803 [inline]
mld_newpack+0x1c3/0xaf0 net/ipv6/mcast.c:1747
add_grhead net/ipv6/mcast.c:1850 [inline]
add_grec+0x1492/0x19a0 net/ipv6/mcast.c:1988
mld_send_cr net/ipv6/mcast.c:2114 [inline]
mld_ifc_work+0x691/0xd90 net/ipv6/mcast.c:2651
page last free pid 5300 tgid 5300 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1127 [inline]
free_unref_page+0xd3f/0x1010 mm/page_alloc.c:2659
__slab_free+0x2c2/0x380 mm/slub.c:4524
qlink_free mm/kasan/quarantine.c:163 [inline]
qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179
kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286
__kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329
kasan_slab_alloc include/linux/kasan.h:250 [inline]
slab_post_alloc_hook mm/slub.c:4119 [inline]
slab_alloc_node mm/slub.c:4168 [inline]
__do_kmalloc_node mm/slub.c:4297 [inline]
__kmalloc_noprof+0x236/0x4c0 mm/slub.c:4310
kmalloc_noprof include/linux/slab.h:905 [inline]
kzalloc_noprof include/linux/slab.h:1037 [inline]
fib_create_info+0xc14/0x25b0 net/ipv4/fib_semantics.c:1435
fib_table_insert+0x1f6/0x1f20 net/ipv4/fib_trie.c:1231
fib_magic+0x3d8/0x620 net/ipv4/fib_frontend.c:1112
fib_add_ifaddr+0x40c/0x5e0 net/ipv4/fib_frontend.c:1156
fib_netdev_event+0x375/0x490 net/ipv4/fib_frontend.c:1494
notifier_call_chain+0x1a5/0x3f0 kernel/notifier.c:85
__dev_notify_flags+0x207/0x400
dev_change_flags+0xf0/0x1a0 net/core/dev.c:9045
do_setlink+0xc90/0x4210 net/core/rtnetlink.c:3109
rtnl_changelink net/core/rtnetlink.c:3723 [inline]
__rtnl_newlink net/core/rtnetlink.c:3875 [inline]
rtnl_newlink+0x1bb6/0x2210 net/core/rtnetlink.c:4012
Memory state around the buggy address:
ffff888011838f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888011838f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888011839000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888011839080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888011839100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Reported-by: syzbot+45212e9d87a98c3f5b42@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/678e7da5.050a0220.303755.007c.GAE@google.com/
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/linux-btrfs/CACT4Y+ZFBdo7pT8L2AzM=vegZwjp-wNkVJZQf0Ta3vZqtExaSw@mail.gmail.com/
Fixes: 871383be592b ("btrfs: add missing unlocks to transaction abort paths")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c9c863793395cf0a66c2778a29d72c48c02fbb66 ]
[BUG]
There is a bug report that btrfs outputs the following error message:
BTRFS info (device nvme0n1p2): qgroup scan completed (inconsistency flag cleared)
BTRFS warning (device nvme0n1p2): failed to cleanup qgroup 0/1179: -2
[CAUSE]
The error itself is pretty harmless, and the end user should ignore it.
When a subvolume is fully dropped, btrfs will call
btrfs_qgroup_cleanup_dropped_subvolume() to delete the qgroup.
However if a qgroup rescan happened before a subvolume fully dropped,
qgroup for that subvolume will not be re-created, as rescan will only
create new qgroup if there is a BTRFS_ROOT_REF_KEY found.
But before we drop a subvolume, the subvolume is unlinked thus there is no
BTRFS_ROOT_REF_KEY.
In that case, btrfs_remove_qgroup() will fail with -ENOENT and trigger
the above error message.
[FIX]
Just ignore -ENOENT error from btrfs_remove_qgroup() inside
btrfs_qgroup_cleanup_dropped_subvolume().
Reported-by: John Shand <jshand2013@gmail.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1236056
Fixes: 839d6ea4f86d ("btrfs: automatically remove the subvolume qgroup")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
abort
[ Upstream commit 0d85f5c2dd91df6b5da454406756f463ba923b69 ]
If while we are doing a direct IO write a transaction abort happens, we
mark all existing ordered extents with the BTRFS_ORDERED_IOERR flag (done
at btrfs_destroy_ordered_extents()), and then after that if we enter
btrfs_split_ordered_extent() and the ordered extent has bytes left
(meaning we have a bio that doesn't cover the whole ordered extent, see
details at btrfs_extract_ordered_extent()), we will fail on the following
assertion at btrfs_split_ordered_extent():
ASSERT(!(flags & ~BTRFS_ORDERED_TYPE_FLAGS));
because the BTRFS_ORDERED_IOERR flag is set and the definition of
BTRFS_ORDERED_TYPE_FLAGS is just the union of all flags that identify the
type of write (regular, nocow, prealloc, compressed, direct IO, encoded).
Fix this by returning an error from btrfs_extract_ordered_extent() if we
find the BTRFS_ORDERED_IOERR flag in the ordered extent. The error will
be the error that resulted in the transaction abort or -EIO if no
transaction abort happened.
This was recently reported by syzbot with the following trace:
FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 1
CPU: 0 UID: 0 PID: 5321 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
fail_dump lib/fault-inject.c:53 [inline]
should_fail_ex+0x3b0/0x4e0 lib/fault-inject.c:154
should_failslab+0xac/0x100 mm/failslab.c:46
slab_pre_alloc_hook mm/slub.c:4072 [inline]
slab_alloc_node mm/slub.c:4148 [inline]
__do_kmalloc_node mm/slub.c:4297 [inline]
__kmalloc_noprof+0xdd/0x4c0 mm/slub.c:4310
kmalloc_noprof include/linux/slab.h:905 [inline]
kzalloc_noprof include/linux/slab.h:1037 [inline]
btrfs_chunk_alloc_add_chunk_item+0x244/0x1100 fs/btrfs/volumes.c:5742
reserve_chunk_space+0x1ca/0x2c0 fs/btrfs/block-group.c:4292
check_system_chunk fs/btrfs/block-group.c:4319 [inline]
do_chunk_alloc fs/btrfs/block-group.c:3891 [inline]
btrfs_chunk_alloc+0x77b/0xf80 fs/btrfs/block-group.c:4187
find_free_extent_update_loop fs/btrfs/extent-tree.c:4166 [inline]
find_free_extent+0x42d1/0x5810 fs/btrfs/extent-tree.c:4579
btrfs_reserve_extent+0x422/0x810 fs/btrfs/extent-tree.c:4672
btrfs_new_extent_direct fs/btrfs/direct-io.c:186 [inline]
btrfs_get_blocks_direct_write+0x706/0xfa0 fs/btrfs/direct-io.c:321
btrfs_dio_iomap_begin+0xbb7/0x1180 fs/btrfs/direct-io.c:525
iomap_iter+0x697/0xf60 fs/iomap/iter.c:90
__iomap_dio_rw+0xeb9/0x25b0 fs/iomap/direct-io.c:702
btrfs_dio_write fs/btrfs/direct-io.c:775 [inline]
btrfs_direct_write+0x610/0xa30 fs/btrfs/direct-io.c:880
btrfs_do_write_iter+0x2a0/0x760 fs/btrfs/file.c:1397
do_iter_readv_writev+0x600/0x880
vfs_writev+0x376/0xba0 fs/read_write.c:1050
do_pwritev fs/read_write.c:1146 [inline]
__do_sys_pwritev2 fs/read_write.c:1204 [inline]
__se_sys_pwritev2+0x196/0x2b0 fs/read_write.c:1195
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f1281f85d29
RSP: 002b:00007f12819fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000148
RAX: ffffffffffffffda RBX: 00007f1282176080 RCX: 00007f1281f85d29
RDX: 0000000000000001 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 00007f12819fe090 R08: 0000000000000000 R09: 0000000000000003
R10: 0000000000007000 R11: 0000000000000246 R12: 0000000000000002
R13: 0000000000000000 R14: 00007f1282176080 R15: 00007ffcb9e23328
</TASK>
BTRFS error (device loop0 state A): Transaction aborted (error -12)
BTRFS: error (device loop0 state A) in btrfs_chunk_alloc_add_chunk_item:5745: errno=-12 Out of memory
BTRFS info (device loop0 state EA): forced readonly
assertion failed: !(flags & ~BTRFS_ORDERED_TYPE_FLAGS), in fs/btrfs/ordered-data.c:1234
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:1234!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5321 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:btrfs_split_ordered_extent+0xd8d/0xe20 fs/btrfs/ordered-data.c:1234
RSP: 0018:ffffc9000d1df2b8 EFLAGS: 00010246
RAX: 0000000000000057 RBX: 000000000006a000 RCX: 9ce21886c4195300
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: 0000000000000091 R08: ffffffff817f0a3c R09: 1ffff92001a3bdf4
R10: dffffc0000000000 R11: fffff52001a3bdf5 R12: 1ffff1100a45f401
R13: ffff8880522fa018 R14: dffffc0000000000 R15: 000000000006a000
FS: 00007f12819fe6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000557750bd7da8 CR3: 00000000400ea000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
btrfs_extract_ordered_extent fs/btrfs/direct-io.c:702 [inline]
btrfs_dio_submit_io+0x4be/0x6d0 fs/btrfs/direct-io.c:737
iomap_dio_submit_bio fs/iomap/direct-io.c:85 [inline]
iomap_dio_bio_iter+0x1022/0x1740 fs/iomap/direct-io.c:447
__iomap_dio_rw+0x13b7/0x25b0 fs/iomap/direct-io.c:703
btrfs_dio_write fs/btrfs/direct-io.c:775 [inline]
btrfs_direct_write+0x610/0xa30 fs/btrfs/direct-io.c:880
btrfs_do_write_iter+0x2a0/0x760 fs/btrfs/file.c:1397
do_iter_readv_writev+0x600/0x880
vfs_writev+0x376/0xba0 fs/read_write.c:1050
do_pwritev fs/read_write.c:1146 [inline]
__do_sys_pwritev2 fs/read_write.c:1204 [inline]
__se_sys_pwritev2+0x196/0x2b0 fs/read_write.c:1195
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f1281f85d29
RSP: 002b:00007f12819fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000148
RAX: ffffffffffffffda RBX: 00007f1282176080 RCX: 00007f1281f85d29
RDX: 0000000000000001 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 00007f12819fe090 R08: 0000000000000000 R09: 0000000000000003
R10: 0000000000007000 R11: 0000000000000246 R12: 0000000000000002
R13: 0000000000000000 R14: 00007f1282176080 R15: 00007ffcb9e23328
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:btrfs_split_ordered_extent+0xd8d/0xe20 fs/btrfs/ordered-data.c:1234
RSP: 0018:ffffc9000d1df2b8 EFLAGS: 00010246
RAX: 0000000000000057 RBX: 000000000006a000 RCX: 9ce21886c4195300
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: 0000000000000091 R08: ffffffff817f0a3c R09: 1ffff92001a3bdf4
R10: dffffc0000000000 R11: fffff52001a3bdf5 R12: 1ffff1100a45f401
R13: ffff8880522fa018 R14: dffffc0000000000 R15: 000000000006a000
FS: 00007f12819fe6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000557750bd7da8 CR3: 00000000400ea000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
In this case the transaction abort was due to (an injected) memory
allocation failure when attempting to allocate a new chunk.
Reported-by: syzbot+f60d8337a5c8e8d92a77@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6777f2dd.050a0220.178762.0045.GAE@google.com/
Fixes: 52b1fdca23ac ("btrfs: handle completed ordered extents in btrfs_split_ordered_extent")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a216542027b892e6651c1b4e076012140d04afaf ]
When COWing a relocation tree path, at relocation.c:replace_path(), we
can trigger a lockdep splat while we are in the btrfs_search_slot() call
against the relocation root. This happens in that callchain at
ctree.c:read_block_for_search() when we happen to find a child extent
buffer already loaded through the fs tree with a lockdep class set to
the fs tree. So when we attempt to lock that extent buffer through a
relocation tree we have to reset the lockdep class to the class for a
relocation tree, since a relocation tree has extent buffers that used
to belong to a fs tree and may currently be already loaded (we swap
extent buffers between the two trees at the end of replace_path()).
However we are missing calls to btrfs_maybe_reset_lockdep_class() to reset
the lockdep class at ctree.c:read_block_for_search() before we read lock
an extent buffer, just like we did for btrfs_search_slot() in commit
b40130b23ca4 ("btrfs: fix lockdep splat with reloc root extent buffers").
So add the missing btrfs_maybe_reset_lockdep_class() calls before the
attempts to read lock an extent buffer at ctree.c:read_block_for_search().
The lockdep splat was reported by syzbot and it looks like this:
======================================================
WARNING: possible circular locking dependency detected
6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Not tainted
------------------------------------------------------
syz.0.0/5335 is trying to acquire lock:
ffff8880545dbc38 (btrfs-tree-01){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
but task is already holding lock:
ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-treloc-02/1){+.+.}-{4:4}:
reacquire_held_locks+0x3eb/0x690 kernel/locking/lockdep.c:5374
__lock_release kernel/locking/lockdep.c:5563 [inline]
lock_release+0x396/0xa30 kernel/locking/lockdep.c:5870
up_write+0x79/0x590 kernel/locking/rwsem.c:1629
btrfs_force_cow_block+0x14b3/0x1fd0 fs/btrfs/ctree.c:660
btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755
btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153
replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
__btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (btrfs-tree-01/1){+.+.}-{4:4}:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
down_write_nested+0xa2/0x220 kernel/locking/rwsem.c:1693
btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
btrfs_init_new_buffer fs/btrfs/extent-tree.c:5052 [inline]
btrfs_alloc_tree_block+0x41c/0x1440 fs/btrfs/extent-tree.c:5132
btrfs_force_cow_block+0x526/0x1fd0 fs/btrfs/ctree.c:573
btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755
btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153
btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4351
btrfs_insert_empty_item fs/btrfs/ctree.h:688 [inline]
btrfs_insert_inode_ref+0x2bb/0xf80 fs/btrfs/inode-item.c:330
btrfs_rename_exchange fs/btrfs/inode.c:7990 [inline]
btrfs_rename2+0xcb7/0x2b90 fs/btrfs/inode.c:8374
vfs_rename+0xbdb/0xf00 fs/namei.c:5067
do_renameat2+0xd94/0x13f0 fs/namei.c:5224
__do_sys_renameat2 fs/namei.c:5258 [inline]
__se_sys_renameat2 fs/namei.c:5255 [inline]
__x64_sys_renameat2+0xce/0xe0 fs/namei.c:5255
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (btrfs-tree-01){++++}-{4:4}:
check_prev_add kernel/locking/lockdep.c:3161 [inline]
check_prevs_add kernel/locking/lockdep.c:3280 [inline]
validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
__lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649
btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline]
read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610
btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237
replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
__btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
btrfs-tree-01 --> btrfs-tree-01/1 --> btrfs-treloc-02/1
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-treloc-02/1);
lock(btrfs-tree-01/1);
lock(btrfs-treloc-02/1);
rlock(btrfs-tree-01);
*** DEADLOCK ***
8 locks held by syz.0.0/5335:
#0: ffff88801e3ae420 (sb_writers#13){.+.+}-{0:0}, at: mnt_want_write_file+0x5e/0x200 fs/namespace.c:559
#1: ffff888052c760d0 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: __btrfs_balance+0x4c2/0x26b0 fs/btrfs/volumes.c:4183
#2: ffff888052c74850 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x775/0xd90 fs/btrfs/relocation.c:4086
#3: ffff88801e3ae610 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xf11/0x1ad0 fs/btrfs/relocation.c:1659
#4: ffff888052c76470 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288
#5: ffff888052c76498 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288
#6: ffff8880545db878 (btrfs-tree-01/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
#7: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
stack backtrace:
CPU: 0 UID: 0 PID: 5335 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074
check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206
check_prev_add kernel/locking/lockdep.c:3161 [inline]
check_prevs_add kernel/locking/lockdep.c:3280 [inline]
validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
__lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649
btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline]
read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610
btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237
replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
__btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f1ac6985d29
Code: ff ff c3 (...)
RSP: 002b:00007f1ac63fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f1ac6b76160 RCX: 00007f1ac6985d29
RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000007
RBP: 00007f1ac6a01b08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000001 R14: 00007f1ac6b76160 R15: 00007fffda145a88
</TASK>
Reported-by: syzbot+63913e558c084f7f8fdc@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/677b3014.050a0220.3b53b0.0064.GAE@google.com/
Fixes: 99785998ed1c ("btrfs: reduce lock contention when eb cache miss for btree search")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit c2b47df81c8e20a8e8cd94f0d7df211137ae94ed upstream.
[BUG]
With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash
with the following VM_BUG_ON_FOLIO():
BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28
BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28
page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664
aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774"
flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff)
page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2992!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G OE 6.12.0-rc7-custom+ #87
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : folio_clear_dirty_for_io+0x128/0x258
lr : folio_clear_dirty_for_io+0x128/0x258
Call trace:
folio_clear_dirty_for_io+0x128/0x258
btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs]
__process_folios_contig+0x154/0x268 [btrfs]
extent_clear_unlock_delalloc+0x5c/0x80 [btrfs]
run_delalloc_nocow+0x5f8/0x760 [btrfs]
btrfs_run_delalloc_range+0xa8/0x220 [btrfs]
writepage_delalloc+0x230/0x4c8 [btrfs]
extent_writepage+0xb8/0x358 [btrfs]
extent_write_cache_pages+0x21c/0x4e8 [btrfs]
btrfs_writepages+0x94/0x150 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x88/0xc8
start_delalloc_inodes+0x178/0x3a8 [btrfs]
btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
shrink_delalloc+0x114/0x280 [btrfs]
flush_space+0x250/0x2f8 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x164/0x408
worker_thread+0x25c/0x388
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
The first two lines of extra debug messages show the problem is caused
by the error handling of run_delalloc_nocow().
E.g. we have the following dirtied range (4K blocksize 4K page size):
0 16K 32K
|//////////////////////////////////////|
| Pre-allocated |
And the range [0, 16K) has a preallocated extent.
- Enter run_delalloc_nocow() for range [0, 16K)
Which found range [0, 16K) is preallocated, can do the proper NOCOW
write.
- Enter fallback_to_fow() for range [16K, 32K)
Since the range [16K, 32K) is not backed by preallocated extent, we
have to go COW.
- cow_file_range() failed for range [16K, 32K)
So cow_file_range() will do the clean up by clearing folio dirty,
unlock the folios.
Now the folios in range [16K, 32K) is unlocked.
- Enter extent_clear_unlock_delalloc() from run_delalloc_nocow()
Which is called with PAGE_START_WRITEBACK to start page writeback.
But folios can only be marked writeback when it's properly locked,
thus this triggered the VM_BUG_ON_FOLIO().
Furthermore there is another hidden but common bug that
run_delalloc_nocow() is not clearing the folio dirty flags in its error
handling path.
This is the common bug shared between run_delalloc_nocow() and
cow_file_range().
[FIX]
- Clear folio dirty for range [@start, @cur_offset)
Introduce a helper, cleanup_dirty_folios(), which
will find and lock the folio in the range, clear the dirty flag and
start/end the writeback, with the extra handling for the
@locked_folio.
- Introduce a helper to clear folio dirty, start and end writeback
- Introduce a helper to record the last failed COW range end
This is to trace which range we should skip, to avoid double
unlocking.
- Skip the failed COW range for the error handling
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 06f364284794f149d2abc167c11d556cf20c954b upstream.
[BUG]
When testing with COW fixup marked as BUG_ON() (this is involved with the
new pin_user_pages*() change, which should not result new out-of-band
dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW
fixup path.
This BUG_ON() happens just after a failed btrfs_run_delalloc_range():
BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1444!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G OE 6.12.0-rc7-custom+ #86
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : extent_writepage_io+0x2d4/0x308 [btrfs]
lr : extent_writepage_io+0x2d4/0x308 [btrfs]
Call trace:
extent_writepage_io+0x2d4/0x308 [btrfs]
extent_writepage+0x218/0x330 [btrfs]
extent_write_cache_pages+0x1d4/0x4b0 [btrfs]
btrfs_writepages+0x94/0x150 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x88/0xc8
start_delalloc_inodes+0x180/0x3b0 [btrfs]
btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
shrink_delalloc+0x114/0x280 [btrfs]
flush_space+0x250/0x2f8 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x164/0x408
worker_thread+0x25c/0x388
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
That failure is mostly from cow_file_range(), where we can hit -ENOSPC.
Although the -ENOSPC is already a bug related to our space reservation
code, let's just focus on the error handling.
For example, we have the following dirty range [0, 64K) of an inode,
with 4K sector size and 4K page size:
0 16K 32K 48K 64K
|///////////////////////////////////////|
|#######################################|
Where |///| means page are still dirty, and |###| means the extent io
tree has EXTENT_DELALLOC flag.
- Enter extent_writepage() for page 0
- Enter btrfs_run_delalloc_range() for range [0, 64K)
- Enter cow_file_range() for range [0, 64K)
- Function btrfs_reserve_extent() only reserved one 16K extent
So we created extent map and ordered extent for range [0, 16K)
0 16K 32K 48K 64K
|////////|//////////////////////////////|
|<- OE ->|##############################|
And range [0, 16K) has its delalloc flag cleared.
But since we haven't yet submit any bio, involved 4 pages are still
dirty.
- Function btrfs_reserve_extent() returns with -ENOSPC
Now we have to run error cleanup, which will clear all
EXTENT_DELALLOC* flags and clear the dirty flags for the remaining
ranges:
0 16K 32K 48K 64K
|////////| |
| | |
Note that range [0, 16K) still has its pages dirty.
- Some time later, writeback is triggered again for the range [0, 16K)
since the page range still has dirty flags.
- btrfs_run_delalloc_range() will do nothing because there is no
EXTENT_DELALLOC flag.
- extent_writepage_io() finds page 0 has no ordered flag
Which falls into the COW fixup path, triggering the BUG_ON().
Unfortunately this error handling bug dates back to the introduction of
btrfs. Thankfully with the abuse of COW fixup, at least it won't crash
the kernel.
[FIX]
Instead of immediately unlocking the extent and folios, we keep the extent
and folios locked until either erroring out or the whole delalloc range
finished.
When the whole delalloc range finished without error, we just unlock the
whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked
cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared.
And the involved folios will be properly submitted, with their dirty
flags cleared during submission.
For the error path, it will be a little more complex:
- The range with ordered extent allocated (range (1))
We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining
flags are cleaned up by
btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered().
For folios we finish the IO (clear dirty, start writeback and
immediately finish the writeback) and unlock the folios.
- The range with reserved extent but no ordered extent (range(2))
- The range we never touched (range(3))
For both range (2) and range(3) the behavior is not changed.
Now even if cow_file_range() failed halfway with some successfully
reserved extents/ordered extents, we will keep all folios clean, so
there will be no future writeback triggered on them.
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d0f038104fa37380e2a725e669508e43d0c503e9 upstream.
There is a recent ML report that mounting a large fs backed by hardware
RAID56 controller (with one device missing) took too much time, and
systemd seems to kill the mount attempt.
In that case, the only error message is:
BTRFS error (device sdj): open_ctree failed
There is no reason on why the failure happened, making it very hard to
understand the reason.
At least output the error number (in the particular case it should be
-EINTR) to provide some clue.
Link: https://lore.kernel.org/linux-btrfs/9b9c4d2810abcca2f9f76e32220ed9a90febb235.camel@scientia.org/
Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
Cc: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 396294d1afee65a203d6cabd843d0782e5d7388e ]
We're dumping the locked bitmap into the @checked_bitmap variable,
printing incorrect values during debug.
Thankfully even during my development I haven't hit a case where I need
to dump the locked bitmap. But for the sake of consistency, fix it by
dupping the locked bitmap into @locked_bitmap variable for output.
Fixes: 75258f20fb70 ("btrfs: subpage: dump extra subpage bitmaps for debug")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c0def46dec9c547679a25fe7552c4bcbec0b0dd2 ]
[WARNING]
There are several warnings about the recently introduced qgroup
auto-removal that it triggers WARN_ON() for the non-zero rfer/excl
numbers, e.g:
------------[ cut here ]------------
WARNING: CPU: 67 PID: 2882 at fs/btrfs/qgroup.c:1854 btrfs_remove_qgroup+0x3df/0x450
CPU: 67 UID: 0 PID: 2882 Comm: btrfs-cleaner Kdump: loaded Not tainted 6.11.6-300.fc41.x86_64 #1
RIP: 0010:btrfs_remove_qgroup+0x3df/0x450
Call Trace:
<TASK>
btrfs_qgroup_cleanup_dropped_subvolume+0x97/0xc0
btrfs_drop_snapshot+0x44e/0xa80
btrfs_clean_one_deleted_snapshot+0xc3/0x110
cleaner_kthread+0xd8/0x130
kthread+0xd2/0x100
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>
---[ end trace 0000000000000000 ]---
BTRFS warning (device sda): to be deleted qgroup 0/319 has non-zero numbers, rfer 258478080 rfer_cmpr 258478080 excl 0 excl_cmpr 0
[CAUSE]
Although the root cause is still unclear, as if qgroup is consistent a
fully dropped subvolume (with extra transaction committed) should lead
to all zero numbers for the qgroup.
My current guess is the subvolume drop triggered the new subtree drop
threshold thus marked qgroup inconsistent, then rescan cleared it but
some corner case is not properly handled during subvolume dropping.
But at least for this particular case, since it's only the rfer/excl not
properly reset to 0, and qgroup is already marked inconsistent, there is
nothing to be worried for the end users.
The user space tool utilizing qgroup would queue a rescan to handle
everything, so the kernel wanring is a little overkilled.
[ENHANCEMENT]
Enhance the warning inside btrfs_remove_qgroup() by:
- Only do WARN() if CONFIG_BTRFS_DEBUG is enabled
As explained the kernel can handle inconsistent qgroups by simply do a
rescan, there is nothing to bother the end users.
- Treat the reserved space leak the same as non-zero numbers
By outputting the values and trigger a WARN() if it's a debug build.
So far I haven't experienced any case related to reserved space so I
hope we will never need to bother them.
Fixes: 839d6ea4f86d ("btrfs: automatically remove the subvolume qgroup")
Link: https://github.com/kdave/btrfs-progs/issues/922
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
- handle d_path() errors when canonicalizing device mapper paths during
device scan
* tag 'for-6.13-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add the missing error handling inside get_canonical_dev_path
|
|
Inside function get_canonical_dev_path(), we call d_path() to get the
final device path.
But d_path() can return error, and in that case the next strscpy() call
will trigger an invalid memory access.
Add back the missing error handling for d_path().
Reported-by: Boris Burkov <boris@bur.io>
Fixes: 7e06de7c83a7 ("btrfs: canonicalize the device path before adding it")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes.
Besides the one-liners in Btrfs there's fix to the io_uring and
encoded read integration (added in this development cycle). The update
to io_uring provides more space for the ongoing command that is then
used in Btrfs to handle some cases.
- io_uring and encoded read:
- provide stable storage for io_uring command data
- make a copy of encoded read ioctl call, reuse that in case the
call would block and will be called again
- properly initialize zlib context for hardware compression on s390
- fix max extent size calculation on filesystems with non-zoned
devices
- fix crash in scrub on crafted image due to invalid extent tree"
* tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zlib: fix avail_in bytes for s390 zlib HW compression path
btrfs: zoned: calculate max_extent_size properly on non-zoned setup
btrfs: avoid NULL pointer dereference if no valid extent tree
btrfs: don't read from userspace twice in btrfs_uring_encoded_read()
io_uring: add io_uring_cmd_get_async_data helper
io_uring/cmd: add per-op data to struct io_uring_cmd_data
io_uring/cmd: rename struct uring_cache to io_uring_cmd_data
|
|
Since the input data length passed to zlib_compress_folios() can be
arbitrary, always setting strm.avail_in to a multiple of PAGE_SIZE may
cause read-in bytes to exceed the input range. Currently this triggers
an assert in btrfs_compress_folios() on the debug kernel (see below).
Fix strm.avail_in calculation for S390 hardware acceleration path.
assertion failed: *total_in <= orig_len, in fs/btrfs/compression.c:1041
------------[ cut here ]------------
kernel BUG at fs/btrfs/compression.c:1041!
monitor event: 0040 ilc:2 [#1] PREEMPT SMP
CPU: 16 UID: 0 PID: 325 Comm: kworker/u273:3 Not tainted 6.13.0-20241204.rc1.git6.fae3b21430ca.300.fc41.s390x+debug #1
Hardware name: IBM 3931 A01 703 (z/VM 7.4.0)
Workqueue: btrfs-delalloc btrfs_work_helper
Krnl PSW : 0704d00180000000 0000021761df6538 (btrfs_compress_folios+0x198/0x1a0)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
Krnl GPRS: 0000000080000000 0000000000000001 0000000000000047 0000000000000000
0000000000000006 ffffff01757bb000 000001976232fcc0 000000000000130c
000001976232fcd0 000001976232fcc8 00000118ff4a0e30 0000000000000001
00000111821ab400 0000011100000000 0000021761df6534 000001976232fb58
Krnl Code: 0000021761df6528: c020006f5ef4 larl %r2,0000021762be2310
0000021761df652e: c0e5ffbd09d5 brasl %r14,00000217615978d8
#0000021761df6534: af000000 mc 0,0
>0000021761df6538: 0707 bcr 0,%r7
0000021761df653a: 0707 bcr 0,%r7
0000021761df653c: 0707 bcr 0,%r7
0000021761df653e: 0707 bcr 0,%r7
0000021761df6540: c004004bb7ec brcl 0,000002176276d518
Call Trace:
[<0000021761df6538>] btrfs_compress_folios+0x198/0x1a0
([<0000021761df6534>] btrfs_compress_folios+0x194/0x1a0)
[<0000021761d97788>] compress_file_range+0x3b8/0x6d0
[<0000021761dcee7c>] btrfs_work_helper+0x10c/0x160
[<0000021761645760>] process_one_work+0x2b0/0x5d0
[<000002176164637e>] worker_thread+0x20e/0x3e0
[<000002176165221a>] kthread+0x15a/0x170
[<00000217615b859c>] __ret_from_fork+0x3c/0x60
[<00000217626e72d2>] ret_from_fork+0xa/0x38
INFO: lockdep is turned off.
Last Breaking-Event-Address:
[<0000021761597924>] _printk+0x4c/0x58
Kernel panic - not syncing: Fatal exception: panic_on_oops
Fixes: fd1e75d0105d ("btrfs: make compression path to be subpage compatible")
CC: stable@vger.kernel.org # 6.12+
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since commit 559218d43ec9 ("block: pre-calculate max_zone_append_sectors"),
queue_limits's max_zone_append_sectors is default to be 0 and it is only
updated when there is a zoned device. So, we have
lim->max_zone_append_sectors = 0 when there is no zoned device in the
filesystem.
That leads to fs_info->max_zone_append_size and thus
fs_info->max_extent_size to be 0, which is wrong and can for example
lead to a divide by zero in count_max_extents().
Fix this by only capping fs_info->max_extent_size to
fs_info->max_zone_append_size when it is non-zero.
Based on a patch from Naohiro Aota <naohiro.aota@wdc.com>, from which
much of this commit message is stolen as well.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 559218d43ec9 ("block: pre-calculate max_zone_append_sectors")
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
Syzbot reported a crash with the following call trace:
BTRFS info (device loop0): scrub: started on devid 1
BUG: kernel NULL pointer dereference, address: 0000000000000208
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 106e70067 P4D 106e70067 PUD 107143067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 UID: 0 PID: 689 Comm: repro Kdump: loaded Tainted: G O 6.13.0-rc4-custom+ #206
Tainted: [O]=OOT_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:find_first_extent_item+0x26/0x1f0 [btrfs]
Call Trace:
<TASK>
scrub_find_fill_first_stripe+0x13d/0x3b0 [btrfs]
scrub_simple_mirror+0x175/0x260 [btrfs]
scrub_stripe+0x5d4/0x6c0 [btrfs]
scrub_chunk+0xbb/0x170 [btrfs]
scrub_enumerate_chunks+0x2f4/0x5f0 [btrfs]
btrfs_scrub_dev+0x240/0x600 [btrfs]
btrfs_ioctl+0x1dc8/0x2fa0 [btrfs]
? do_sys_openat2+0xa5/0xf0
__x64_sys_ioctl+0x97/0xc0
do_syscall_64+0x4f/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
[CAUSE]
The reproducer is using a corrupted image where extent tree root is
corrupted, thus forcing to use "rescue=all,ro" mount option to mount the
image.
Then it triggered a scrub, but since scrub relies on extent tree to find
where the data/metadata extents are, scrub_find_fill_first_stripe()
relies on an non-empty extent root.
But unfortunately scrub_find_fill_first_stripe() doesn't really expect
an NULL pointer for extent root, it use extent_root to grab fs_info and
triggered a NULL pointer dereference.
[FIX]
Add an extra check for a valid extent root at the beginning of
scrub_find_fill_first_stripe().
The new error path is introduced by 42437a6386ff ("btrfs: introduce
mount option rescue=ignorebadroots"), but that's pretty old, and later
commit b979547513ff ("btrfs: scrub: introduce helper to find and fill
sector info for a scrub_stripe") changed how we do scrub.
So for kernels older than 6.6, the fix will need manual backport.
Reported-by: syzbot+339e9dbe3a2ca419b85d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/67756935.050a0220.25abdd.0a12.GAE@google.com/
Fixes: 42437a6386ff ("btrfs: introduce mount option rescue=ignorebadroots")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we return -EAGAIN the first time because we need to block,
btrfs_uring_encoded_read() will get called twice. Take a copy of args,
the iovs, and the iter the first time, as by the time we are called the
second time these may have gone out of scope.
Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: 34310c442e17 ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)")
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes that accumulated over the last two weeks, fixing some
user reported problems:
- swapfile fixes:
- conditional reschedule in the activation loop
- fix race with memory mapped file when activating
- make activation loop interruptible
- rework and fix extent sharing checks
- folio fixes:
- in send, recheck folio mapping after unlock
- in relocation, recheck folio mapping after unlock
- fix waiting for encoded read io_uring requests
- fix transaction atomicity when enabling simple quotas
- move COW block trace point before the block gets freed
- print various sizes in sysfs with correct endianity"
* tag 'for-6.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs: fix direct super block member reads
btrfs: fix transaction atomicity bug when enabling simple quotas
btrfs: avoid monopolizing a core when activating a swap file
btrfs: allow swap activation to be interruptible
btrfs: fix swap file activation failure due to extents that used to be shared
btrfs: fix race with memory mapped writes when activating swap file
btrfs: check folio mapping after unlock in put_file_data()
btrfs: check folio mapping after unlock in relocate_one_folio()
btrfs: fix use-after-free when COWing tree bock and tracing is enabled
btrfs: fix use-after-free waiting for encoded read endios
|
|
The following sysfs entries are reading super block member directly,
which can have a different endian and cause wrong values:
- sys/fs/btrfs/<uuid>/nodesize
- sys/fs/btrfs/<uuid>/sectorsize
- sys/fs/btrfs/<uuid>/clone_alignment
Thankfully those values (nodesize and sectorsize) are always aligned
inside the btrfs_super_block, so it won't trigger unaligned read errors,
just endian problems.
Fix them by using the native cached members instead.
Fixes: df93589a1737 ("btrfs: export more from FS_INFO to sysfs")
CC: stable@vger.kernel.org
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Set squota incompat bit before committing the transaction that enables
the feature.
With the config CONFIG_BTRFS_ASSERT enabled, an assertion
failure occurs regarding the simple quota feature.
[5.596534] assertion failed: btrfs_fs_incompat(fs_info, SIMPLE_QUOTA), in fs/btrfs/qgroup.c:365
[5.597098] ------------[ cut here ]------------
[5.597371] kernel BUG at fs/btrfs/qgroup.c:365!
[5.597946] CPU: 1 UID: 0 PID: 268 Comm: mount Not tainted 6.13.0-rc2-00031-gf92f4749861b #146
[5.598450] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[5.599008] RIP: 0010:btrfs_read_qgroup_config+0x74d/0x7a0
[5.604303] <TASK>
[5.605230] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.605538] ? exc_invalid_op+0x56/0x70
[5.605775] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.606066] ? asm_exc_invalid_op+0x1f/0x30
[5.606441] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.606741] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.607038] ? try_to_wake_up+0x317/0x760
[5.607286] open_ctree+0xd9c/0x1710
[5.607509] btrfs_get_tree+0x58a/0x7e0
[5.608002] vfs_get_tree+0x2e/0x100
[5.608224] fc_mount+0x16/0x60
[5.608420] btrfs_get_tree+0x2f8/0x7e0
[5.608897] vfs_get_tree+0x2e/0x100
[5.609121] path_mount+0x4c8/0xbc0
[5.609538] __x64_sys_mount+0x10d/0x150
The issue can be easily reproduced using the following reproducer:
root@q:linux# cat repro.sh
set -e
mkfs.btrfs -q -f /dev/sdb
mount /dev/sdb /mnt/btrfs
btrfs quota enable -s /mnt/btrfs
umount /mnt/btrfs
mount /dev/sdb /mnt/btrfs
The issue is that when enabling quotas, at btrfs_quota_enable(), we set
BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE at fs_info->qgroup_flags and persist
it in the quota root in the item with the key BTRFS_QGROUP_STATUS_KEY, but
we only set the incompat bit BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA after we
commit the transaction used to enable simple quotas.
This means that if after that transaction commit we unmount the filesystem
without starting and committing any other transaction, or we have a power
failure, the next time we mount the filesystem we will find the flag
BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE set in the item with the key
BTRFS_QGROUP_STATUS_KEY but we will not find the incompat bit
BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA set in the superblock, triggering an
assertion failure at:
btrfs_read_qgroup_config() -> qgroup_read_enable_gen()
To fix this issue, set the BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA flag
immediately after setting the BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE.
This ensures that both flags are flushed to disk within the same
transaction.
Fixes: 182940f4f4db ("btrfs: qgroup: add new quota mode for simple quotas")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During swap activation we iterate over the extents of a file and we can
have many thousands of them, so we can end up in a busy loop monopolizing
a core. Avoid this by doing a voluntary reschedule after processing each
extent.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do a unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).
However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared - that is not the
case however, we can have a single reference but it's not inlined - the
reason we do this is to be fast and avoid inspecting non-inline references
which may be located in another leaf of the extent tree, slowing down
write paths.
The following test script reproduces the bug:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
NUM_CLONES=50
umount $DEV &> /dev/null
run_test()
{
local sync_after_add_reflinks=$1
local sync_after_remove_reflinks=$2
mkfs.btrfs -f $DEV > /dev/null
#mkfs.xfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foo
chmod 0600 $MNT/foo
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo &> /dev/null
xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
mkswap $MNT/foo
for ((i = 1; i <= $NUM_CLONES; i++)); do
touch $MNT/foo_clone_$i
chmod 0600 $MNT/foo_clone_$i
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo_clone_$i &> /dev/null
cp --reflink=always $MNT/foo $MNT/foo_clone_$i
done
if [ $sync_after_add_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Remove the original file and all clones except the last.
rm -f $MNT/foo
for ((i = 1; i < $NUM_CLONES; i++)); do
rm -f $MNT/foo_clone_$i
done
if [ $sync_after_remove_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Now use the last clone as a swap file. It should work since
# its extent are not shared anymore.
swapon $MNT/foo_clone_${NUM_CLONES}
swapoff $MNT/foo_clone_${NUM_CLONES}
umount $MNT
}
echo -e "\nTest without sync after creating and removing clones"
run_test 0 0
echo -e "\nTest with sync after creating clones"
run_test 1 0
echo -e "\nTest with sync after removing clones"
run_test 0 1
echo -e "\nTest with sync after creating and removing clones"
run_test 1 1
Running the test:
$ ./test.sh
Test without sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after creating clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764
Test with sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Fix this by reworking btrfs_swap_activate() to instead of using extent
maps and checking for shared extents with can_nocow_extent(), iterate
over the inode's file extent items and use the accurate
btrfs_is_data_extent_shared().
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.
Fix this by acquiring the inode's mmap lock before we flush delalloc.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When we call btrfs_read_folio() we get an unlocked folio, so it is possible
for a different thread to concurrently modify folio->mapping. We must
check that this hasn't happened once we do have the lock.
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When we call btrfs_read_folio() to bring a folio uptodate, we unlock the
folio. The result of that is that a different thread can modify the
mapping (like remove it with invalidate) before we call folio_lock().
This results in an invalid page and we need to try again.
In particular, if we are relocating concurrently with aborting a
transaction, this can result in a crash like the following:
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 0 P4D 0
Oops: 0000 [#1] SMP
CPU: 76 PID: 1411631 Comm: kworker/u322:5
Workqueue: events_unbound btrfs_reclaim_bgs_work
RIP: 0010:set_page_extent_mapped+0x20/0xb0
RSP: 0018:ffffc900516a7be8 EFLAGS: 00010246
RAX: ffffea009e851d08 RBX: ffffea009e0b1880 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffc900516a7b90 RDI: ffffea009e0b1880
RBP: 0000000003573000 R08: 0000000000000001 R09: ffff88c07fd2f3f0
R10: 0000000000000000 R11: 0000194754b575be R12: 0000000003572000
R13: 0000000003572fff R14: 0000000000100cca R15: 0000000005582fff
FS: 0000000000000000(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000407d00f002 CR4: 00000000007706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x78/0xc0
? page_fault_oops+0x2a8/0x3a0
? __switch_to+0x133/0x530
? wq_worker_running+0xa/0x40
? exc_page_fault+0x63/0x130
? asm_exc_page_fault+0x22/0x30
? set_page_extent_mapped+0x20/0xb0
relocate_file_extent_cluster+0x1a7/0x940
relocate_data_extent+0xaf/0x120
relocate_block_group+0x20f/0x480
btrfs_relocate_block_group+0x152/0x320
btrfs_relocate_chunk+0x3d/0x120
btrfs_reclaim_bgs_work+0x2ae/0x4e0
process_scheduled_works+0x184/0x370
worker_thread+0xc6/0x3e0
? blk_add_timer+0xb0/0xb0
kthread+0xae/0xe0
? flush_tlb_kernel_range+0x90/0x90
ret_from_fork+0x2f/0x40
? flush_tlb_kernel_range+0x90/0x90
ret_from_fork_asm+0x11/0x20
</TASK>
This occurs because cleanup_one_transaction() calls
destroy_delalloc_inodes() which calls invalidate_inode_pages2() which
takes the folio_lock before setting mapping to NULL. We fail to check
this, and subsequently call set_extent_mapping(), which assumes that
mapping != NULL (in fact it asserts that in debug mode)
Note that the "fixes" patch here is not the one that introduced the
race (the very first iteration of this code from 2009) but a more recent
change that made this particular crash happen in practice.
Fixes: e7f1326cc24e ("btrfs: set page extent mapped after read_folio in relocate_one_page")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When a COWing a tree block, at btrfs_cow_block(), and we have the
tracepoint trace_btrfs_cow_block() enabled and preemption is also enabled
(CONFIG_PREEMPT=y), we can trigger a use-after-free in the COWed extent
buffer while inside the tracepoint code. This is because in some paths
that call btrfs_cow_block(), such as btrfs_search_slot(), we are holding
the last reference on the extent buffer @buf so btrfs_force_cow_block()
drops the last reference on the @buf extent buffer when it calls
free_extent_buffer_stale(buf), which schedules the release of the extent
buffer with RCU. This means that if we are on a kernel with preemption,
the current task may be preempted before calling trace_btrfs_cow_block()
and the extent buffer already released by the time trace_btrfs_cow_block()
is called, resulting in a use-after-free.
Fix this by moving the trace_btrfs_cow_block() from btrfs_cow_block() to
btrfs_force_cow_block() before the COWed extent buffer is freed.
This also has a side effect of invoking the tracepoint in the tree defrag
code, at defrag.c:btrfs_realloc_node(), since btrfs_force_cow_block() is
called there, but this is fine and it was actually missing there.
Reported-by: syzbot+8517da8635307182c8a5@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6759a9b9.050a0220.1ac542.000d.GAE@google.com/
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Fix a use-after-free in the I/O completion path for encoded reads by
using a completion instead of a wait_queue for synchronizing the
destruction of 'struct btrfs_encoded_read_private'.
Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- tree-checker catches invalid number of inline extent references
- zoned mode fixes:
- enhance zone append IO command so it also detects emulated writes
- handle bio splitting at sectorsize boundary
- when deleting a snapshot, fix a condition for visiting nodes in reloc
trees
* tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: reject inline extent items with 0 ref count
btrfs: split bios to the fs sector size boundary
btrfs: use bio_is_zone_append() in the completion handler
btrfs: fix improper generation check in snapshot delete
|