summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2025-04-20btrfs: zoned: fix zone finishing with missing devicesJohannes Thumshirn1-0/+3
commit 35fec1089ebb5617f85884d3fa6a699ce6337a75 upstream. If do_zone_finish() is called with a filesystem that has missing devices (e.g. a RAID file system mounted in degraded mode) it is accessing the btrfs_device::zone_info pointer, which will not be set if the device in question is missing. Check if the device is present (by checking if it has a valid block device pointer associated) and if not, skip zone finishing for it. Fixes: 4dcbb8ab31c1 ("btrfs: zoned: make zone finishing multi stripe capable") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20btrfs: zoned: fix zone activation with missing devicesJohannes Thumshirn1-0/+3
commit 2bbc4a45e5eb6b868357c1045bf6f38f6ba576e0 upstream. If btrfs_zone_activate() is called with a filesystem that has missing devices (e.g. a RAID file system mounted in degraded mode) it is accessing the btrfs_device::zone_info pointer, which will not be set if the device in question is missing. Check if the device is present (by checking if it has a valid block device pointer associated) and if not, skip zone activation for it. Fixes: f9a912a3c45f ("btrfs: zoned: make zone activation multi stripe capable") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20btrfs: tests: fix chunk map leak after failure to add it to the treeFilipe Manana1-0/+1
commit 009ca358486ded9b4822eddb924009b6848d7271 upstream. If we fail to add the chunk map to the fs mapping tree we exit test_rmap_block() without freeing the chunk map. Fix this by adding a call to btrfs_free_chunk_map() before exiting the test function if the call to btrfs_add_chunk_map() failed. Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20btrfs: fix non-empty delayed iputs list on unmount due to compressed write ↵Filipe Manana1-0/+12
workers commit 4c782247b89376a83fa132f7d45d6977edae0629 upstream. At close_ctree() after we have ran delayed iputs either through explicitly calling btrfs_run_delayed_iputs() or later during the call to btrfs_commit_super() or btrfs_error_commit_super(), we assert that the delayed iputs list is empty. When we have compressed writes this assertion may fail because delayed iputs may have been added to the list after we last ran delayed iputs. This happens like this: 1) We have a compressed write bio executing; 2) We enter close_ctree() and flush the fs_info->endio_write_workers queue which is the queue used for running ordered extent completion; 3) The compressed write bio finishes and enters btrfs_finish_compressed_write_work(), where it calls btrfs_finish_ordered_extent() which in turn calls btrfs_queue_ordered_fn(), which queues a work item in the fs_info->endio_write_workers queue that we have flushed before; 4) At close_ctree() we proceed, run all existing delayed iputs and call btrfs_commit_super() (which also runs delayed iputs), but before we run the following assertion below: ASSERT(list_empty(&fs_info->delayed_iputs)) A delayed iput is added by the step below... 5) The ordered extent completion job queued in step 3 runs and results in creating a delayed iput when dropping the last reference of the ordered extent (a call to btrfs_put_ordered_extent() made from btrfs_finish_one_ordered()); 6) At this point the delayed iputs list is not empty, so the assertion at close_ctree() fails. Fix this by flushing the fs_info->compressed_write_workers queue at close_ctree() before flushing the fs_info->endio_write_workers queue, respecting the queue dependency as the later is responsible for the execution of ordered extent completion. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20btrfs: harden block_group::bg_list against list_del() racesBoris Burkov2-0/+20
[ Upstream commit 7511e29cf1355b2c47d0effb39e463119913e2f6 ] As far as I can tell, these calls of list_del_init() on bg_list cannot run concurrently with btrfs_mark_bg_unused() or btrfs_mark_bg_to_reclaim(), as they are in transaction error paths and situations where the block group is readonly. However, if there is any chance at all of racing with mark_bg_unused(), or a different future user of bg_list, better to be safe than sorry. Otherwise we risk the following interleaving (bg_list refcount in parens) T1 (some random op) T2 (btrfs_mark_bg_unused) !list_empty(&bg->bg_list); (1) list_del_init(&bg->bg_list); (1) list_move_tail (1) btrfs_put_block_group (0) btrfs_delete_unused_bgs bg = list_first_entry list_del_init(&bg->bg_list); btrfs_put_block_group(bg); (-1) Ultimately, this results in a broken ref count that hits zero one deref early and the real final deref underflows the refcount, resulting in a WARNING. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22btrfs: fix two misuses of folio_shift()Matthew Wilcox (Oracle)1-6/+5
[ Upstream commit 01af106a076352182b2916b143fc50272600bd81 ] It is meaningless to shift a byte count by folio_shift(). The folio index is in units of PAGE_SIZE, not folio_size(). We can use folio_contains() to make this work for arbitrary-order folios, so remove the assertion that the folios are of order 0. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22btrfs: avoid starting new transaction when cleaning qgroup during subvolume dropFilipe Manana1-5/+1
[ Upstream commit fdef89ce6fada462aef9cb90a140c93c8c209f0f ] At btrfs_qgroup_cleanup_dropped_subvolume() all we want to commit the current transaction in order to have all the qgroup rfer/excl numbers up to date. However we are using btrfs_start_transaction(), which joins the current transaction if there is one that is not yet committing, but also starts a new one if there is none or if the current one is already committing (its state is >= TRANS_STATE_COMMIT_START). This later case results in unnecessary IO, wasting time and a pointless rotation of the backup roots in the super block. So instead of using btrfs_start_transaction() followed by a btrfs_commit_transaction(), use btrfs_commit_current_transaction() which achieves our purpose and avoids starting and committing new transactions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13btrfs: fix a leaked chunk map issue in read_one_chunk()Haoxiang Li1-0/+1
commit 35d99c68af40a8ca175babc5a89ef7e2226fb3ca upstream. Add btrfs_free_chunk_map() to free the memory allocated by btrfs_alloc_chunk_map() if btrfs_add_chunk_map() fails. Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps") CC: stable@vger.kernel.org Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13btrfs: zoned: fix extent range end unlock in cow_file_range()Naohiro Aota1-2/+7
commit 5a4041f2c47247575a6c2e53ce14f7b0ac946c33 upstream. Running generic/751 on the for-next branch often results in a hang like below. They are both stack by locking an extent. This suggests someone forget to unlock an extent. INFO: task kworker/u128:1:12 blocked for more than 323 seconds. Not tainted 6.13.0-BTRFS-ZNS+ #503 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u128:1 state:D stack:0 pid:12 tgid:12 ppid:2 flags:0x00004000 Workqueue: btrfs-fixup btrfs_work_helper [btrfs] Call Trace: <TASK> __schedule+0x534/0xdd0 schedule+0x39/0x140 __lock_extent+0x31b/0x380 [btrfs] ? __pfx_autoremove_wake_function+0x10/0x10 btrfs_writepage_fixup_worker+0xf1/0x3a0 [btrfs] btrfs_work_helper+0xff/0x480 [btrfs] ? lock_release+0x178/0x2c0 process_one_work+0x1ee/0x570 ? srso_return_thunk+0x5/0x5f worker_thread+0x1d1/0x3b0 ? __pfx_worker_thread+0x10/0x10 kthread+0x10b/0x230 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> INFO: task kworker/u134:0:184 blocked for more than 323 seconds. Not tainted 6.13.0-BTRFS-ZNS+ #503 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u134:0 state:D stack:0 pid:184 tgid:184 ppid:2 flags:0x00004000 Workqueue: writeback wb_workfn (flush-btrfs-4) Call Trace: <TASK> __schedule+0x534/0xdd0 schedule+0x39/0x140 __lock_extent+0x31b/0x380 [btrfs] ? __pfx_autoremove_wake_function+0x10/0x10 find_lock_delalloc_range+0xdb/0x260 [btrfs] writepage_delalloc+0x12f/0x500 [btrfs] ? srso_return_thunk+0x5/0x5f extent_write_cache_pages+0x232/0x840 [btrfs] btrfs_writepages+0x72/0x130 [btrfs] do_writepages+0xe7/0x260 ? srso_return_thunk+0x5/0x5f ? lock_acquire+0xd2/0x300 ? srso_return_thunk+0x5/0x5f ? find_held_lock+0x2b/0x80 ? wbc_attach_and_unlock_inode.part.0+0x102/0x250 ? wbc_attach_and_unlock_inode.part.0+0x102/0x250 __writeback_single_inode+0x5c/0x4b0 writeback_sb_inodes+0x22d/0x550 __writeback_inodes_wb+0x4c/0xe0 wb_writeback+0x2f6/0x3f0 wb_workfn+0x32a/0x510 process_one_work+0x1ee/0x570 ? srso_return_thunk+0x5/0x5f worker_thread+0x1d1/0x3b0 ? __pfx_worker_thread+0x10/0x10 kthread+0x10b/0x230 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> This happens because we have another success path for the zoned mode. When there is no active zone available, btrfs_reserve_extent() returns -EAGAIN. In this case, we have two reactions. (1) If the given range is never allocated, we can only wait for someone to finish a zone, so wait on BTRFS_FS_NEED_ZONE_FINISH bit and retry afterward. (2) Or, if some allocations are already done, we must bail out and let the caller to send IOs for the allocation. This is because these IOs may be necessary to finish a zone. The commit 06f364284794 ("btrfs: do proper folio cleanup when cow_file_range() failed") moved the unlock code from the inside of the loop to the outside. So, previously, the allocated extents are unlocked just after the allocation and so before returning from the function. However, they are no longer unlocked on the case (2) above. That caused the hang issue. Fix the issue by modifying the 'end' to the end of the allocated range. Then, we can exit the loop and the same unlock code can properly handle the case. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Fixes: 06f364284794 ("btrfs: do proper folio cleanup when cow_file_range() failed") CC: stable@vger.kernel.org Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07btrfs: fix data overwriting bug during buffered write when block size < page ↵Qu Wenruo1-1/+8
size commit efa11fd269c139e29b71ec21bc9c9c0063fde40d upstream. [BUG] When running generic/418 with a btrfs whose block size < page size (subpage cases), it always fails. And the following minimal reproducer is more than enough to trigger it reliably: workload() { mkfs.btrfs -s 4k -f $dev > /dev/null dmesg -C mount $dev $mnt $fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest ret=$? umount $mnt stop_trace if [ $ret -ne 0 ]; then fail fi } for (( i = 0; i < 1024; i++)); do echo "=== $i/$runtime ===" workload done [CAUSE] With extra trace printk added to the following functions: - btrfs_buffered_write() * Which folio is touched * The file offset (start) where the buffered write is at * How many bytes are copied * The content of the write (the first 2 bytes) - submit_one_sector() * Which folio is touched * The position inside the folio * The content of the page cache (the first 2 bytes) - pagecache_isize_extended() * The parameters of the function itself * The parameters of the folio_zero_range() Which are enough to show the problem: 22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101 22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101 22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101 22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101 22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192 22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000 22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101 The tool dio-invalidate-cache will start 3 threads, each doing a buffered write with 0x01 at offset 0, 4096 and 8192, do a fsync, then do a direct read, and compare the read buffer with the write buffer. Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into the page cache. But at submit_one_sector(), at file offset 4096, the content is zeroed out, by pagecache_isize_extended(). The race happens like this: Thread A is writing into range [4K, 8K). Thread B is writing into range [8K, 12k). Thread A | Thread B -------------------------------------+------------------------------------ btrfs_buffered_write() | btrfs_buffered_write() |- old_isize = 4K; | |- old_isize = 4096; |- btrfs_inode_lock() | | |- write into folio range [4K, 8K) | | |- pagecache_isize_extended() | | | extend isize from 4096 to 8192 | | | no folio_zero_range() called | | |- btrfs_inode_lock() | | | |- btrfs_inode_lock() | |- write into folio range [8K, 12K) | |- pagecache_isize_extended() | | calling folio_zero_range(4K, 8K) | | This is caused by the old_isize is | | grabbed too early, without any | | inode lock. | |- btrfs_inode_unlock() The @old_isize is grabbed without inode lock, causing race between two buffered write threads and making pagecache_isize_extended() to zero range which is still containing cached data. And this is only affecting subpage btrfs, because for regular blocksize == page size case, the function pagecache_isize_extended() will do nothing if the block size >= page size. [FIX] Grab the old i_size while holding the inode lock. This means each buffered write thread will have a stable view of the old inode size, thus avoid the above race. CC: stable@vger.kernel.org # 5.15+ Fixes: 5e8b9ef30392 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07btrfs: fix use-after-free on inode when scanning root during em shrinkingFilipe Manana1-2/+1
commit 59f37036bb7ab3d554c24abc856aabca01126414 upstream. At btrfs_scan_root() we are accessing the inode's root (and fs_info) in a call to btrfs_fs_closing() after we have scheduled the inode for a delayed iput, and that can result in a use-after-free on the inode in case the cleaner kthread does the iput before we dereference the inode in the call to btrfs_fs_closing(). Fix this by using the fs_info stored already in a local variable instead of doing inode->root->fs_info. Fixes: 102044384056 ("btrfs: make the extent map shrinker run asynchronously as a work queue job") CC: stable@vger.kernel.org # 6.13+ Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07btrfs: do regular iput instead of delayed iput during extent map shrinkingFilipe Manana1-1/+1
commit 15b3b3254d1453a8db038b7d44b311a2d6c71f98 upstream. The extent map shrinker now runs in the system unbound workqueue and no longer in kswapd context so it can directly do an iput() on inodes even if that blocks or needs to acquire any lock (we aren't holding any locks when requesting the delayed iput from the shrinker). So we don't need to add a delayed iput, wake up the cleaner and delegate the iput() to the cleaner, which also adds extra contention on the spinlock that protects the delayed iputs list. Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/ CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07btrfs: skip inodes without loaded extent maps when shrinking extent mapsFilipe Manana1-21/+57
commit c6c9c4d56483d941f567eb921434c25fc6086dfa upstream. If there are inodes that don't have any loaded extent maps, we end up grabbing a reference on them and later adding a delayed iput, which wakes up the cleaner and makes it do unnecessary work. This is common when for example the inodes were open only to run stat(2) or all their extent maps were already released through the folio release callback (btrfs_release_folio()) or released by a previous run of the shrinker, or directories which never have extent maps. Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/ CC: stable@vger.kernel.org # 6.13+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-27btrfs: fix double accounting race when extent_writepage_io() failedQu Wenruo1-13/+24
[ Upstream commit 8bf334beb3496da3c3fbf3daf3856f7eec70dacc ] [BUG] If submit_one_sector() failed inside extent_writepage_io() for sector size < page size cases (e.g. 4K sector size and 64K page size), then we can hit double ordered extent accounting error. This should be very rare, as submit_one_sector() only fails when we failed to grab the extent map, and such extent map should exist inside the memory and has been pinned. [CAUSE] For example we have the following folio layout: 0 4K 32K 48K 60K 64K |//| |//////| |///| Where |///| is the dirty range we need to writeback. The 3 different dirty ranges are submitted for regular COW. Now we hit the following sequence: - submit_one_sector() returned 0 for [0, 4K) - submit_one_sector() returned 0 for [32K, 48K) - submit_one_sector() returned error for [60K, 64K) - btrfs_mark_ordered_io_finished() called for the whole folio This will mark the following ranges as finished: * [0, 4K) * [32K, 48K) Both ranges have their IO already submitted, this cleanup will lead to double accounting. * [60K, 64K) That's the correct cleanup. The only good news is, this error is only theoretical, as the target extent map is always pinned, thus we should directly grab it from memory, other than reading it from the disk. [FIX] Instead of calling btrfs_mark_ordered_io_finished() for the whole folio range, which can touch ranges we should not touch, instead move the error handling inside extent_writepage_io(). So that we can cleanup exact sectors that ought to be submitted but failed. This provides much more accurate cleanup, avoiding the double accounting. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-27btrfs: fix double accounting race when btrfs_run_delalloc_range() failedQu Wenruo2-13/+49
[ Upstream commit 72dad8e377afa50435940adfb697e070d3556670 ] [BUG] When running btrfs with block size (4K) smaller than page size (64K, aarch64), there is a very high chance to crash the kernel at generic/750, with the following messages: (before the call traces, there are 3 extra debug messages added) BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental BTRFS info (device dm-3): checking UUID tree hrtimer: interrupt took 5451385 ns BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28 BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28 BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs] CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G OE 6.13.0-rc1-custom+ #89 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs] lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] Call trace: can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P) can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L) btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs] extent_writepage+0x10c/0x3b8 [btrfs] extent_write_cache_pages+0x21c/0x4e8 [btrfs] btrfs_writepages+0x94/0x160 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x74/0xa0 start_delalloc_inodes+0x17c/0x3b0 [btrfs] btrfs_start_delalloc_roots+0x17c/0x288 [btrfs] shrink_delalloc+0x11c/0x280 [btrfs] flush_space+0x288/0x328 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x228/0x680 worker_thread+0x1bc/0x360 kthread+0x100/0x118 ret_from_fork+0x10/0x20 ---[ end trace 0000000000000000 ]--- BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0 BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0 CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G W OE 6.13.0-rc1-custom+ #89 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: btrfs_work_helper [btrfs] (btrfs-endio-write) pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : process_one_work+0x110/0x680 lr : worker_thread+0x1bc/0x360 Call trace: process_one_work+0x110/0x680 (P) worker_thread+0x1bc/0x360 (L) worker_thread+0x1bc/0x360 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs SMP: failed to stop secondary CPUs 2-3 Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: 0x275bb9540000 from 0xffff800080000000 PHYS_OFFSET: 0xffff8fbba0000000 CPU features: 0x100,00000070,00801250,8201720b [CAUSE] The above warning is triggered immediately after the delalloc range failure, this happens in the following sequence: - Range [1568K, 1636K) is dirty 1536K 1568K 1600K 1636K 1664K | |/////////|////////| | Where 1536K, 1600K and 1664K are page boundaries (64K page size) - Enter extent_writepage() for page 1536K - Enter run_delalloc_nocow() with locked page 1536K and range [1568K, 1636K) This is due to the inode having preallocated extents. - Enter cow_file_range() with locked page 1536K and range [1568K, 1636K) - btrfs_reserve_extent() only reserved two extents The main loop of cow_file_range() only reserved two data extents, Now we have: 1536K 1568K 1600K 1636K 1664K | |<-->|<--->|/|///////| | 1584K 1596K Range [1568K, 1596K) has an ordered extent reserved. - btrfs_reserve_extent() failed inside cow_file_range() for file offset 1596K This is already a bug in our space reservation code, but for now let's focus on the error handling path. Now cow_file_range() returned -ENOSPC. - btrfs_run_delalloc_range() do error cleanup <<< ROOT CAUSE Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range [1568K, 1636K) Function btrfs_cleanup_ordered_extents() normally needs to skip the ranges inside the folio, as it will normally be cleaned up by extent_writepage(). Such split error handling is already problematic in the first place. What's worse is the folio range skipping itself, which is not taking subpage cases into consideration at all, it will only skip the range if the page start >= the range start. In our case, the page start < the range start, since for subpage cases we can have delalloc ranges inside the folio but not covering the folio. So it doesn't skip the page range at all. This means all the ordered extents, both [1568K, 1584K) and [1584K, 1596K) will be marked as IOERR. And these two ordered extents have no more pending ios, they are marked finished, and *QUEUED* to be deleted from the io tree. - extent_writepage() do error cleanup Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K). Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the deletion from io tree is async, it may or may not happen at this time. If the ranges have not yet been removed, we will do double cleaning on those ranges, triggering the above ordered extent warnings. In theory there are other bugs, like the cleanup in extent_writepage() can cause double accounting on ranges that are submitted asynchronously (compression for example). But that's much harder to trigger because normally we do not mix regular and compression delalloc ranges. [FIX] The folio range split is already buggy and not subpage compatible, it was introduced a long time ago where subpage support was not even considered. So instead of splitting the ordered extents cleanup into the folio range and out of folio range, do all the cleanup inside writepage_delalloc(). - Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in btrfs_run_delalloc_range() - Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc() failed So all ordered extents are only cleaned up by btrfs_run_delalloc_range(). - Handle the ranges that already have ordered extents allocated If part of the folio already has ordered extent allocated, and btrfs_run_delalloc_range() failed, we also need to cleanup that range. Now we have a concentrated error handling for ordered extents during btrfs_run_delalloc_range(). Fixes: d1051d6ebf8e ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 8bf334beb349 ("btrfs: fix double accounting race when extent_writepage_io() failed") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-27btrfs: use btrfs_inode in extent_writepage()David Sterba1-7/+7
[ Upstream commit 011a9a1f244656cc3cbde47edba2b250f794d440 ] As extent_writepage() is internal helper we should use our inode type, so change it from struct inode. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 8bf334beb349 ("btrfs: fix double accounting race when extent_writepage_io() failed") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21btrfs: fix stale page cache after race between readahead and direct IO writeFilipe Manana1-3/+15
[ Upstream commit acc18e1c1d8c0d59d793cf87790ccfcafb1bf5f0 ] After commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read") we can now trigger a race between a task doing a direct IO write and readahead. When this race is triggered it results in tasks getting stale data when they attempt do a buffered read (including the task that did the direct IO write). This race can be sporadically triggered with test case generic/418, failing like this: $ ./check generic/418 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025 MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad) # --- tests/generic/418.out 2020-06-10 19:29:03.850519863 +0100 # +++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad 2025-02-03 15:42:36.974609476 +0000 @@ -1,2 +1,5 @@ QA output created by 418 +cmpbuf: offset 0: Expected: 0x1, got 0x0 +[6:0] FAIL - comparison failed, offset 24576 +diotest -wp -b 4096 -n 8 -i 4 failed at loop 3 Silence is golden ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad' to see the entire diff) Ran: generic/418 Failures: generic/418 Failed 1 of 1 tests The race happens like this: 1) A file has a prealloc extent for the range [16K, 28K); 2) Task A starts a direct IO write against file range [24K, 28K). At the start of the direct IO write it invalidates the page cache at __iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file offset 24K; 3) Task A enters btrfs_dio_iomap_begin() and locks the extent range [24K, 28K); 4) Task B starts a readahead for file range [16K, 28K), entering btrfs_readahead(). First it attempts to read the page at offset 16K by entering btrfs_do_readpage(), where it calls get_extent_map(), locks the range [16K, 20K) and gets the extent map for the range [16K, 28K), caching it into the 'em_cached' variable declared in the local stack of btrfs_readahead(), and then unlocks the range [16K, 20K). Since the extent map has the prealloc flag, at btrfs_do_readpage() we zero out the page's content and don't submit any bio to read the page from the extent. Then it attempts to read the page at offset 20K entering btrfs_do_readpage() where we reuse the previously cached extent map (decided by get_extent_map()) since it spans the page's range and it's still in the inode's extent map tree. Just like for the previous page, we zero out the page's content since the extent map has the prealloc flag set. Then it attempts to read the page at offset 24K entering btrfs_do_readpage() where we reuse the previously cached extent map (decided by get_extent_map()) since it spans the page's range and it's still in the inode's extent map tree. Just like for the previous pages, we zero out the page's content since the extent map has the prealloc flag set. Note that we didn't lock the extent range [24K, 28K), so we didn't synchronize with the ongoing direct IO write being performed by task A; 5) Task A enters btrfs_create_dio_extent() and creates an ordered extent for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and BTRFS_ORDERED_PREALLOC set; 6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin(); 7) The ordered extent enters btrfs_finish_one_ordered() and locks the range [24K, 28K); 8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries to invalidate the page at offset 24K by calling kiocb_invalidate_post_direct_write(), resulting in a call chain that ends up at btrfs_release_folio(). The btrfs_release_folio() call ends up returning false because the range for the page at file offset 24K is currently locked by the task doing the ordered extent completion in the previous step (7), so we have: btrfs_release_folio() -> __btrfs_release_folio() -> try_release_extent_mapping() -> try_release_extent_state() This last function checking that the range is locked and returning false and propagating it up to btrfs_release_folio(). So this results in a failure to invalidate the page and kiocb_invalidate_post_direct_write() triggers this message logged in dmesg: Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! After this we leave the page cache with stale data for the file range [24K, 28K), filled with zeroes instead of the data written by direct IO write (all bytes with a 0x01 value), so any task attempting to read with buffered IO, including the task that did the direct IO write, will get all bytes in the range with a 0x00 value instead of the written data. Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(), at the two callers of btrfs_do_readpage() instead of doing it at get_extent_map(), just like we did before commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read"), and unlocking the range after all the calls to btrfs_do_readpage(). This way we never reuse a cached extent map without flushing any pending ordered extents from a concurrent direct IO write. Fixes: ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21btrfs: rename __get_extent_map() and pass btrfs_inodeDavid Sterba1-8/+7
[ Upstream commit 06de96faf795b5c276a3be612da6b08c6112e747 ] The double underscore naming scheme does not apply here, there's only only get_extent_map(). As the definition is changed also pass the struct btrfs_inode. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: acc18e1c1d8c ("btrfs: fix stale page cache after race between readahead and direct IO write") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21btrfs: fix hole expansion when writing at an offset beyond EOFFilipe Manana1-3/+1
commit da2dccd7451de62b175fb8f0808d644959e964c7 upstream. At btrfs_write_check() if our file's i_size is not sector size aligned and we have a write that starts at an offset larger than the i_size that falls within the same page of the i_size, then we end up not zeroing the file range [i_size, write_offset). The code is this: start_pos = round_down(pos, fs_info->sectorsize); oldsize = i_size_read(inode); if (start_pos > oldsize) { /* Expand hole size to cover write data, preventing empty gap */ loff_t end_pos = round_up(pos + count, fs_info->sectorsize); ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos); if (ret) return ret; } So if our file's i_size is 90269 bytes and a write at offset 90365 bytes comes in, we get 'start_pos' set to 90112 bytes, which is less than the i_size and therefore we don't zero out the range [90269, 90365) by calling btrfs_cont_expand(). This is an old bug introduced in commit 9036c10208e1 ("Btrfs: update hole handling v2"), from 2008, and the buggy code got moved around over the years. Fix this by discarding 'start_pos' and comparing against the write offset ('pos') without any alignment. This bug was recently exposed by test case generic/363 which tests this scenario by polluting ranges beyond EOF with an mmap write and than verify that after a file increases we get zeroes for the range which is supposed to be a hole and not what we wrote with the previous mmaped write. We're only seeing this exposed now because generic/363 used to run only on xfs until last Sunday's fstests update. The test was failing like this: $ ./check generic/363 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025 MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 generic/363 0s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad) # --- tests/generic/363.out 2025-02-05 15:31:14.013646509 +0000 # +++ /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad 2025-02-05 17:25:33.112630781 +0000 @@ -1 +1,46 @@ QA output created by 363 +READ BAD DATA: offset = 0xdcad, size = 0xd921, fname = /home/fdmanana/btrfs-tests/dev/junk +OFFSET GOOD BAD RANGE +0x1609d 0x0000 0x3104 0x0 +operation# (mod 256) for the bad data may be 4 +0x1609e 0x0000 0x0472 0x1 +operation# (mod 256) for the bad data may be 4 ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/363.out /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad' to see the entire diff) Ran: generic/363 Failures: generic/363 Failed 1 of 1 tests Fixes: 9036c10208e1 ("Btrfs: update hole handling v2") CC: stable@vger.kernel.org Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-17btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extentsJohannes Thumshirn1-5/+21
[ Upstream commit dc14ba10781bd2629835696b7cc1febf914768e9 ] Don't use btrfs_set_item_key_safe() to modify the keys in the RAID stripe-tree, as this can lead to corruption of the tree, which is caught by the checks in btrfs_set_item_key_safe(): BTRFS info (device nvme1n1): leaf 49168384 gen 15 total ptrs 194 free space 8329 owner 12 BTRFS info (device nvme1n1): refs 2 lock_owner 1030 current 1030 [ snip ] item 105 key (354549760 230 20480) itemoff 14587 itemsize 16 stride 0 devid 5 physical 67502080 item 106 key (354631680 230 4096) itemoff 14571 itemsize 16 stride 0 devid 1 physical 88559616 item 107 key (354631680 230 32768) itemoff 14555 itemsize 16 stride 0 devid 1 physical 88555520 item 108 key (354717696 230 28672) itemoff 14539 itemsize 16 stride 0 devid 2 physical 67604480 [ snip ] BTRFS critical (device nvme1n1): slot 106 key (354631680 230 32768) new key (354635776 230 4096) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:2602! Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 1 UID: 0 PID: 1055 Comm: fsstress Not tainted 6.13.0-rc1+ #1464 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0xf7/0x270 Code: <snip> RSP: 0018:ffffc90001337ab0 EFLAGS: 00010287 RAX: 0000000000000000 RBX: ffff8881115fd000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00000000ffffffff RBP: ffff888110ed6f50 R08: 00000000ffffefff R09: ffffffff8244c500 R10: 00000000ffffefff R11: 00000000ffffffff R12: ffff888100586000 R13: 00000000000000c9 R14: ffffc90001337b1f R15: ffff888110f23b58 FS: 00007f7d75c72740(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa811652c60 CR3: 0000000111398001 CR4: 0000000000370eb0 Call Trace: <TASK> ? __die_body.cold+0x14/0x1a ? die+0x2e/0x50 ? do_trap+0xca/0x110 ? do_error_trap+0x65/0x80 ? btrfs_set_item_key_safe+0xf7/0x270 ? exc_invalid_op+0x50/0x70 ? btrfs_set_item_key_safe+0xf7/0x270 ? asm_exc_invalid_op+0x1a/0x20 ? btrfs_set_item_key_safe+0xf7/0x270 btrfs_partially_delete_raid_extent+0xc4/0xe0 btrfs_delete_raid_extent+0x227/0x240 __btrfs_free_extent.isra.0+0x57f/0x9c0 ? exc_coproc_segment_overrun+0x40/0x40 __btrfs_run_delayed_refs+0x2fa/0xe80 btrfs_run_delayed_refs+0x81/0xe0 btrfs_commit_transaction+0x2dd/0xbe0 ? preempt_count_add+0x52/0xb0 btrfs_sync_file+0x375/0x4c0 do_fsync+0x39/0x70 __x64_sys_fsync+0x13/0x20 do_syscall_64+0x54/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f7d7550ef90 Code: <snip> RSP: 002b:00007ffd70237248 EFLAGS: 00000202 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f7d7550ef90 RDX: 000000000000013a RSI: 000000000040eb28 RDI: 0000000000000004 RBP: 000000000000001b R08: 0000000000000078 R09: 00007ffd7023725c R10: 00007f7d75400390 R11: 0000000000000202 R12: 028f5c28f5c28f5c R13: 8f5c28f5c28f5c29 R14: 000000000040b520 R15: 00007f7d75c726c8 </TASK> While the root cause of the tree order corruption isn't clear, using btrfs_duplicate_item() to copy the item and then adjusting both the key and the per-device physical addresses is a safe way to counter this problem. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-17btrfs: convert BUG_ON in btrfs_reloc_cow_block() to proper error handlingJosef Bacik1-2/+12
[ Upstream commit 6a4730b325aaa48f7a5d5ba97aff0a955e2d9cec ] This BUG_ON is meant to catch backref cache problems, but these can arise from either bugs in the backref cache or corruption in the extent tree. Fix it to be a proper error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-17btrfs: fix data race when accessing the inode's disk_i_size at ↵Hao-ran Zheng1-1/+1
btrfs_drop_extents() [ Upstream commit 5324c4e10e9c2ce307a037e904c0d9671d7137d9 ] A data race occurs when the function `insert_ordered_extent_file_extent()` and the function `btrfs_inode_safe_disk_i_size_write()` are executed concurrently. The function `insert_ordered_extent_file_extent()` is not locked when reading inode->disk_i_size, causing `btrfs_inode_safe_disk_i_size_write()` to cause data competition when writing inode->disk_i_size, thus affecting the value of `modify_tree`. The specific call stack that appears during testing is as follows: ============DATA_RACE============ btrfs_drop_extents+0x89a/0xa060 [btrfs] insert_reserved_file_extent+0xb54/0x2960 [btrfs] insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs] btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs] btrfs_finish_ordered_io+0x37/0x60 [btrfs] finish_ordered_fn+0x3e/0x50 [btrfs] btrfs_work_helper+0x9c9/0x27a0 [btrfs] process_scheduled_works+0x716/0xf10 worker_thread+0xb6a/0x1190 kthread+0x292/0x330 ret_from_fork+0x4d/0x80 ret_from_fork_asm+0x1a/0x30 ============OTHER_INFO============ btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs] btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs] btrfs_finish_ordered_io+0x37/0x60 [btrfs] finish_ordered_fn+0x3e/0x50 [btrfs] btrfs_work_helper+0x9c9/0x27a0 [btrfs] process_scheduled_works+0x716/0xf10 worker_thread+0xb6a/0x1190 kthread+0x292/0x330 ret_from_fork+0x4d/0x80 ret_from_fork_asm+0x1a/0x30 ================================= The main purpose of the check of the inode's disk_i_size is to avoid taking write locks on a btree path when we have a write at or beyond EOF, since in these cases we don't expect to find extent items in the root to drop. However if we end up taking write locks due to a data race on disk_i_size, everything is still correct, we only add extra lock contention on the tree in case there's concurrency from other tasks. If the race causes us to not take write locks when we actually need them, then everything is functionally correct as well, since if we find out we have extent items to drop and we took read locks (modify_tree set to 0), we release the path and retry again with write locks. Since this data race does not affect the correctness of the function, it is a harmless data race, use data_race() to check inode->disk_i_size. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-17btrfs: fix use-after-free when attempting to join an aborted transactionFilipe Manana1-1/+3
[ Upstream commit e2f0943cf37305dbdeaf9846e3c941451bcdef63 ] When we are trying to join the current transaction and if it's aborted, we read its 'aborted' field after unlocking fs_info->trans_lock and without holding any extra reference count on it. This means that a concurrent task that is aborting the transaction may free the transaction before we read its 'aborted' field, leading to a use-after-free. Fix this by reading the 'aborted' field while holding fs_info->trans_lock since any freeing task must first acquire that lock and set fs_info->running_transaction to NULL before freeing the transaction. This was reported by syzbot and Dmitry with the following stack traces from KASAN: ================================================================== BUG: KASAN: slab-use-after-free in join_transaction+0xd9b/0xda0 fs/btrfs/transaction.c:278 Read of size 4 at addr ffff888011839024 by task kworker/u4:9/1128 CPU: 0 UID: 0 PID: 1128 Comm: kworker/u4:9 Not tainted 6.13.0-rc7-syzkaller-00019-gc45323b7560e #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Workqueue: events_unbound btrfs_async_reclaim_data_space Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0x169/0x550 mm/kasan/report.c:489 kasan_report+0x143/0x180 mm/kasan/report.c:602 join_transaction+0xd9b/0xda0 fs/btrfs/transaction.c:278 start_transaction+0xaf8/0x1670 fs/btrfs/transaction.c:697 flush_space+0x448/0xcf0 fs/btrfs/space-info.c:803 btrfs_async_reclaim_data_space+0x159/0x510 fs/btrfs/space-info.c:1321 process_one_work kernel/workqueue.c:3236 [inline] process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3317 worker_thread+0x870/0xd30 kernel/workqueue.c:3398 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Allocated by task 5315: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394 kasan_kmalloc include/linux/kasan.h:260 [inline] __kmalloc_cache_noprof+0x243/0x390 mm/slub.c:4329 kmalloc_noprof include/linux/slab.h:901 [inline] join_transaction+0x144/0xda0 fs/btrfs/transaction.c:308 start_transaction+0xaf8/0x1670 fs/btrfs/transaction.c:697 btrfs_create_common+0x1b2/0x2e0 fs/btrfs/inode.c:6572 lookup_open fs/namei.c:3649 [inline] open_last_lookups fs/namei.c:3748 [inline] path_openat+0x1c03/0x3590 fs/namei.c:3984 do_filp_open+0x27f/0x4e0 fs/namei.c:4014 do_sys_openat2+0x13e/0x1d0 fs/open.c:1402 do_sys_open fs/open.c:1417 [inline] __do_sys_creat fs/open.c:1495 [inline] __se_sys_creat fs/open.c:1489 [inline] __x64_sys_creat+0x123/0x170 fs/open.c:1489 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 5336: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2353 [inline] slab_free mm/slub.c:4613 [inline] kfree+0x196/0x430 mm/slub.c:4761 cleanup_transaction fs/btrfs/transaction.c:2063 [inline] btrfs_commit_transaction+0x2c97/0x3720 fs/btrfs/transaction.c:2598 insert_balance_item+0x1284/0x20b0 fs/btrfs/volumes.c:3757 btrfs_balance+0x992/0x10c0 fs/btrfs/volumes.c:4633 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f The buggy address belongs to the object at ffff888011839000 which belongs to the cache kmalloc-2k of size 2048 The buggy address is located 36 bytes inside of freed 2048-byte region [ffff888011839000, ffff888011839800) The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11838 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff) page_type: f5(slab) raw: 00fff00000000040 ffff88801ac42000 ffffea0000493400 dead000000000002 raw: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000 head: 00fff00000000040 ffff88801ac42000 ffffea0000493400 dead000000000002 head: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000 head: 00fff00000000003 ffffea0000460e01 ffffffffffffffff 0000000000000000 head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 57, tgid 57 (kworker/0:2), ts 67248182943, free_ts 67229742023 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1558 prep_new_page mm/page_alloc.c:1566 [inline] get_page_from_freelist+0x365c/0x37a0 mm/page_alloc.c:3476 __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4753 alloc_pages_mpol_noprof+0x3e1/0x780 mm/mempolicy.c:2269 alloc_slab_page+0x6a/0x110 mm/slub.c:2423 allocate_slab+0x5a/0x2b0 mm/slub.c:2589 new_slab mm/slub.c:2642 [inline] ___slab_alloc+0xc27/0x14a0 mm/slub.c:3830 __slab_alloc+0x58/0xa0 mm/slub.c:3920 __slab_alloc_node mm/slub.c:3995 [inline] slab_alloc_node mm/slub.c:4156 [inline] __do_kmalloc_node mm/slub.c:4297 [inline] __kmalloc_node_track_caller_noprof+0x2e9/0x4c0 mm/slub.c:4317 kmalloc_reserve+0x111/0x2a0 net/core/skbuff.c:609 __alloc_skb+0x1f3/0x440 net/core/skbuff.c:678 alloc_skb include/linux/skbuff.h:1323 [inline] alloc_skb_with_frags+0xc3/0x820 net/core/skbuff.c:6612 sock_alloc_send_pskb+0x91a/0xa60 net/core/sock.c:2884 sock_alloc_send_skb include/net/sock.h:1803 [inline] mld_newpack+0x1c3/0xaf0 net/ipv6/mcast.c:1747 add_grhead net/ipv6/mcast.c:1850 [inline] add_grec+0x1492/0x19a0 net/ipv6/mcast.c:1988 mld_send_cr net/ipv6/mcast.c:2114 [inline] mld_ifc_work+0x691/0xd90 net/ipv6/mcast.c:2651 page last free pid 5300 tgid 5300 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1127 [inline] free_unref_page+0xd3f/0x1010 mm/page_alloc.c:2659 __slab_free+0x2c2/0x380 mm/slub.c:4524 qlink_free mm/kasan/quarantine.c:163 [inline] qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179 kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286 __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329 kasan_slab_alloc include/linux/kasan.h:250 [inline] slab_post_alloc_hook mm/slub.c:4119 [inline] slab_alloc_node mm/slub.c:4168 [inline] __do_kmalloc_node mm/slub.c:4297 [inline] __kmalloc_noprof+0x236/0x4c0 mm/slub.c:4310 kmalloc_noprof include/linux/slab.h:905 [inline] kzalloc_noprof include/linux/slab.h:1037 [inline] fib_create_info+0xc14/0x25b0 net/ipv4/fib_semantics.c:1435 fib_table_insert+0x1f6/0x1f20 net/ipv4/fib_trie.c:1231 fib_magic+0x3d8/0x620 net/ipv4/fib_frontend.c:1112 fib_add_ifaddr+0x40c/0x5e0 net/ipv4/fib_frontend.c:1156 fib_netdev_event+0x375/0x490 net/ipv4/fib_frontend.c:1494 notifier_call_chain+0x1a5/0x3f0 kernel/notifier.c:85 __dev_notify_flags+0x207/0x400 dev_change_flags+0xf0/0x1a0 net/core/dev.c:9045 do_setlink+0xc90/0x4210 net/core/rtnetlink.c:3109 rtnl_changelink net/core/rtnetlink.c:3723 [inline] __rtnl_newlink net/core/rtnetlink.c:3875 [inline] rtnl_newlink+0x1bb6/0x2210 net/core/rtnetlink.c:4012 Memory state around the buggy address: ffff888011838f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888011838f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff888011839000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888011839080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888011839100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Reported-by: syzbot+45212e9d87a98c3f5b42@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/678e7da5.050a0220.303755.007c.GAE@google.com/ Reported-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/linux-btrfs/CACT4Y+ZFBdo7pT8L2AzM=vegZwjp-wNkVJZQf0Ta3vZqtExaSw@mail.gmail.com/ Fixes: 871383be592b ("btrfs: add missing unlocks to transaction abort paths") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-17btrfs: do not output error message if a qgroup has been already cleaned upQu Wenruo1-1/+4
[ Upstream commit c9c863793395cf0a66c2778a29d72c48c02fbb66 ] [BUG] There is a bug report that btrfs outputs the following error message: BTRFS info (device nvme0n1p2): qgroup scan completed (inconsistency flag cleared) BTRFS warning (device nvme0n1p2): failed to cleanup qgroup 0/1179: -2 [CAUSE] The error itself is pretty harmless, and the end user should ignore it. When a subvolume is fully dropped, btrfs will call btrfs_qgroup_cleanup_dropped_subvolume() to delete the qgroup. However if a qgroup rescan happened before a subvolume fully dropped, qgroup for that subvolume will not be re-created, as rescan will only create new qgroup if there is a BTRFS_ROOT_REF_KEY found. But before we drop a subvolume, the subvolume is unlinked thus there is no BTRFS_ROOT_REF_KEY. In that case, btrfs_remove_qgroup() will fail with -ENOENT and trigger the above error message. [FIX] Just ignore -ENOENT error from btrfs_remove_qgroup() inside btrfs_qgroup_cleanup_dropped_subvolume(). Reported-by: John Shand <jshand2013@gmail.com> Link: https://bugzilla.suse.com/show_bug.cgi?id=1236056 Fixes: 839d6ea4f86d ("btrfs: automatically remove the subvolume qgroup") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-17btrfs: fix assertion failure when splitting ordered extent after transaction ↵Filipe Manana1-0/+12
abort [ Upstream commit 0d85f5c2dd91df6b5da454406756f463ba923b69 ] If while we are doing a direct IO write a transaction abort happens, we mark all existing ordered extents with the BTRFS_ORDERED_IOERR flag (done at btrfs_destroy_ordered_extents()), and then after that if we enter btrfs_split_ordered_extent() and the ordered extent has bytes left (meaning we have a bio that doesn't cover the whole ordered extent, see details at btrfs_extract_ordered_extent()), we will fail on the following assertion at btrfs_split_ordered_extent(): ASSERT(!(flags & ~BTRFS_ORDERED_TYPE_FLAGS)); because the BTRFS_ORDERED_IOERR flag is set and the definition of BTRFS_ORDERED_TYPE_FLAGS is just the union of all flags that identify the type of write (regular, nocow, prealloc, compressed, direct IO, encoded). Fix this by returning an error from btrfs_extract_ordered_extent() if we find the BTRFS_ORDERED_IOERR flag in the ordered extent. The error will be the error that resulted in the transaction abort or -EIO if no transaction abort happened. This was recently reported by syzbot with the following trace: FAULT_INJECTION: forcing a failure. name failslab, interval 1, probability 0, space 0, times 1 CPU: 0 UID: 0 PID: 5321 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 fail_dump lib/fault-inject.c:53 [inline] should_fail_ex+0x3b0/0x4e0 lib/fault-inject.c:154 should_failslab+0xac/0x100 mm/failslab.c:46 slab_pre_alloc_hook mm/slub.c:4072 [inline] slab_alloc_node mm/slub.c:4148 [inline] __do_kmalloc_node mm/slub.c:4297 [inline] __kmalloc_noprof+0xdd/0x4c0 mm/slub.c:4310 kmalloc_noprof include/linux/slab.h:905 [inline] kzalloc_noprof include/linux/slab.h:1037 [inline] btrfs_chunk_alloc_add_chunk_item+0x244/0x1100 fs/btrfs/volumes.c:5742 reserve_chunk_space+0x1ca/0x2c0 fs/btrfs/block-group.c:4292 check_system_chunk fs/btrfs/block-group.c:4319 [inline] do_chunk_alloc fs/btrfs/block-group.c:3891 [inline] btrfs_chunk_alloc+0x77b/0xf80 fs/btrfs/block-group.c:4187 find_free_extent_update_loop fs/btrfs/extent-tree.c:4166 [inline] find_free_extent+0x42d1/0x5810 fs/btrfs/extent-tree.c:4579 btrfs_reserve_extent+0x422/0x810 fs/btrfs/extent-tree.c:4672 btrfs_new_extent_direct fs/btrfs/direct-io.c:186 [inline] btrfs_get_blocks_direct_write+0x706/0xfa0 fs/btrfs/direct-io.c:321 btrfs_dio_iomap_begin+0xbb7/0x1180 fs/btrfs/direct-io.c:525 iomap_iter+0x697/0xf60 fs/iomap/iter.c:90 __iomap_dio_rw+0xeb9/0x25b0 fs/iomap/direct-io.c:702 btrfs_dio_write fs/btrfs/direct-io.c:775 [inline] btrfs_direct_write+0x610/0xa30 fs/btrfs/direct-io.c:880 btrfs_do_write_iter+0x2a0/0x760 fs/btrfs/file.c:1397 do_iter_readv_writev+0x600/0x880 vfs_writev+0x376/0xba0 fs/read_write.c:1050 do_pwritev fs/read_write.c:1146 [inline] __do_sys_pwritev2 fs/read_write.c:1204 [inline] __se_sys_pwritev2+0x196/0x2b0 fs/read_write.c:1195 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f1281f85d29 RSP: 002b:00007f12819fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000148 RAX: ffffffffffffffda RBX: 00007f1282176080 RCX: 00007f1281f85d29 RDX: 0000000000000001 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00007f12819fe090 R08: 0000000000000000 R09: 0000000000000003 R10: 0000000000007000 R11: 0000000000000246 R12: 0000000000000002 R13: 0000000000000000 R14: 00007f1282176080 R15: 00007ffcb9e23328 </TASK> BTRFS error (device loop0 state A): Transaction aborted (error -12) BTRFS: error (device loop0 state A) in btrfs_chunk_alloc_add_chunk_item:5745: errno=-12 Out of memory BTRFS info (device loop0 state EA): forced readonly assertion failed: !(flags & ~BTRFS_ORDERED_TYPE_FLAGS), in fs/btrfs/ordered-data.c:1234 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ordered-data.c:1234! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 UID: 0 PID: 5321 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:btrfs_split_ordered_extent+0xd8d/0xe20 fs/btrfs/ordered-data.c:1234 RSP: 0018:ffffc9000d1df2b8 EFLAGS: 00010246 RAX: 0000000000000057 RBX: 000000000006a000 RCX: 9ce21886c4195300 RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000 RBP: 0000000000000091 R08: ffffffff817f0a3c R09: 1ffff92001a3bdf4 R10: dffffc0000000000 R11: fffff52001a3bdf5 R12: 1ffff1100a45f401 R13: ffff8880522fa018 R14: dffffc0000000000 R15: 000000000006a000 FS: 00007f12819fe6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000557750bd7da8 CR3: 00000000400ea000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_extract_ordered_extent fs/btrfs/direct-io.c:702 [inline] btrfs_dio_submit_io+0x4be/0x6d0 fs/btrfs/direct-io.c:737 iomap_dio_submit_bio fs/iomap/direct-io.c:85 [inline] iomap_dio_bio_iter+0x1022/0x1740 fs/iomap/direct-io.c:447 __iomap_dio_rw+0x13b7/0x25b0 fs/iomap/direct-io.c:703 btrfs_dio_write fs/btrfs/direct-io.c:775 [inline] btrfs_direct_write+0x610/0xa30 fs/btrfs/direct-io.c:880 btrfs_do_write_iter+0x2a0/0x760 fs/btrfs/file.c:1397 do_iter_readv_writev+0x600/0x880 vfs_writev+0x376/0xba0 fs/read_write.c:1050 do_pwritev fs/read_write.c:1146 [inline] __do_sys_pwritev2 fs/read_write.c:1204 [inline] __se_sys_pwritev2+0x196/0x2b0 fs/read_write.c:1195 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f1281f85d29 RSP: 002b:00007f12819fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000148 RAX: ffffffffffffffda RBX: 00007f1282176080 RCX: 00007f1281f85d29 RDX: 0000000000000001 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00007f12819fe090 R08: 0000000000000000 R09: 0000000000000003 R10: 0000000000007000 R11: 0000000000000246 R12: 0000000000000002 R13: 0000000000000000 R14: 00007f1282176080 R15: 00007ffcb9e23328 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:btrfs_split_ordered_extent+0xd8d/0xe20 fs/btrfs/ordered-data.c:1234 RSP: 0018:ffffc9000d1df2b8 EFLAGS: 00010246 RAX: 0000000000000057 RBX: 000000000006a000 RCX: 9ce21886c4195300 RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000 RBP: 0000000000000091 R08: ffffffff817f0a3c R09: 1ffff92001a3bdf4 R10: dffffc0000000000 R11: fffff52001a3bdf5 R12: 1ffff1100a45f401 R13: ffff8880522fa018 R14: dffffc0000000000 R15: 000000000006a000 FS: 00007f12819fe6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000557750bd7da8 CR3: 00000000400ea000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 In this case the transaction abort was due to (an injected) memory allocation failure when attempting to allocate a new chunk. Reported-by: syzbot+f60d8337a5c8e8d92a77@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6777f2dd.050a0220.178762.0045.GAE@google.com/ Fixes: 52b1fdca23ac ("btrfs: handle completed ordered extents in btrfs_split_ordered_extent") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-17btrfs: fix lockdep splat while merging a relocation rootFilipe Manana1-0/+2
[ Upstream commit a216542027b892e6651c1b4e076012140d04afaf ] When COWing a relocation tree path, at relocation.c:replace_path(), we can trigger a lockdep splat while we are in the btrfs_search_slot() call against the relocation root. This happens in that callchain at ctree.c:read_block_for_search() when we happen to find a child extent buffer already loaded through the fs tree with a lockdep class set to the fs tree. So when we attempt to lock that extent buffer through a relocation tree we have to reset the lockdep class to the class for a relocation tree, since a relocation tree has extent buffers that used to belong to a fs tree and may currently be already loaded (we swap extent buffers between the two trees at the end of replace_path()). However we are missing calls to btrfs_maybe_reset_lockdep_class() to reset the lockdep class at ctree.c:read_block_for_search() before we read lock an extent buffer, just like we did for btrfs_search_slot() in commit b40130b23ca4 ("btrfs: fix lockdep splat with reloc root extent buffers"). So add the missing btrfs_maybe_reset_lockdep_class() calls before the attempts to read lock an extent buffer at ctree.c:read_block_for_search(). The lockdep splat was reported by syzbot and it looks like this: ====================================================== WARNING: possible circular locking dependency detected 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Not tainted ------------------------------------------------------ syz.0.0/5335 is trying to acquire lock: ffff8880545dbc38 (btrfs-tree-01){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 but task is already holding lock: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-treloc-02/1){+.+.}-{4:4}: reacquire_held_locks+0x3eb/0x690 kernel/locking/lockdep.c:5374 __lock_release kernel/locking/lockdep.c:5563 [inline] lock_release+0x396/0xa30 kernel/locking/lockdep.c:5870 up_write+0x79/0x590 kernel/locking/rwsem.c:1629 btrfs_force_cow_block+0x14b3/0x1fd0 fs/btrfs/ctree.c:660 btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755 btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #1 (btrfs-tree-01/1){+.+.}-{4:4}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_write_nested+0xa2/0x220 kernel/locking/rwsem.c:1693 btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 btrfs_init_new_buffer fs/btrfs/extent-tree.c:5052 [inline] btrfs_alloc_tree_block+0x41c/0x1440 fs/btrfs/extent-tree.c:5132 btrfs_force_cow_block+0x526/0x1fd0 fs/btrfs/ctree.c:573 btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755 btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153 btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4351 btrfs_insert_empty_item fs/btrfs/ctree.h:688 [inline] btrfs_insert_inode_ref+0x2bb/0xf80 fs/btrfs/inode-item.c:330 btrfs_rename_exchange fs/btrfs/inode.c:7990 [inline] btrfs_rename2+0xcb7/0x2b90 fs/btrfs/inode.c:8374 vfs_rename+0xbdb/0xf00 fs/namei.c:5067 do_renameat2+0xd94/0x13f0 fs/namei.c:5224 __do_sys_renameat2 fs/namei.c:5258 [inline] __se_sys_renameat2 fs/namei.c:5255 [inline] __x64_sys_renameat2+0xce/0xe0 fs/namei.c:5255 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (btrfs-tree-01){++++}-{4:4}: check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649 btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline] read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610 btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f other info that might help us debug this: Chain exists of: btrfs-tree-01 --> btrfs-tree-01/1 --> btrfs-treloc-02/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-treloc-02/1); lock(btrfs-tree-01/1); lock(btrfs-treloc-02/1); rlock(btrfs-tree-01); *** DEADLOCK *** 8 locks held by syz.0.0/5335: #0: ffff88801e3ae420 (sb_writers#13){.+.+}-{0:0}, at: mnt_want_write_file+0x5e/0x200 fs/namespace.c:559 #1: ffff888052c760d0 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: __btrfs_balance+0x4c2/0x26b0 fs/btrfs/volumes.c:4183 #2: ffff888052c74850 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x775/0xd90 fs/btrfs/relocation.c:4086 #3: ffff88801e3ae610 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xf11/0x1ad0 fs/btrfs/relocation.c:1659 #4: ffff888052c76470 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288 #5: ffff888052c76498 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288 #6: ffff8880545db878 (btrfs-tree-01/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 #7: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 stack backtrace: CPU: 0 UID: 0 PID: 5335 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206 check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649 btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline] read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610 btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f1ac6985d29 Code: ff ff c3 (...) RSP: 002b:00007f1ac63fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f1ac6b76160 RCX: 00007f1ac6985d29 RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000007 RBP: 00007f1ac6a01b08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000001 R14: 00007f1ac6b76160 R15: 00007fffda145a88 </TASK> Reported-by: syzbot+63913e558c084f7f8fdc@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/677b3014.050a0220.3b53b0.0064.GAE@google.com/ Fixes: 99785998ed1c ("btrfs: reduce lock contention when eb cache miss for btree search") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-08btrfs: do proper folio cleanup when run_delalloc_nocow() failedQu Wenruo2-6/+102
commit c2b47df81c8e20a8e8cd94f0d7df211137ae94ed upstream. [BUG] With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash with the following VM_BUG_ON_FOLIO(): BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28 BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28 page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664 aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774" flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff) page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio)) ------------[ cut here ]------------ kernel BUG at mm/page-writeback.c:2992! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G OE 6.12.0-rc7-custom+ #87 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : folio_clear_dirty_for_io+0x128/0x258 lr : folio_clear_dirty_for_io+0x128/0x258 Call trace: folio_clear_dirty_for_io+0x128/0x258 btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs] __process_folios_contig+0x154/0x268 [btrfs] extent_clear_unlock_delalloc+0x5c/0x80 [btrfs] run_delalloc_nocow+0x5f8/0x760 [btrfs] btrfs_run_delalloc_range+0xa8/0x220 [btrfs] writepage_delalloc+0x230/0x4c8 [btrfs] extent_writepage+0xb8/0x358 [btrfs] extent_write_cache_pages+0x21c/0x4e8 [btrfs] btrfs_writepages+0x94/0x150 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x88/0xc8 start_delalloc_inodes+0x178/0x3a8 [btrfs] btrfs_start_delalloc_roots+0x174/0x280 [btrfs] shrink_delalloc+0x114/0x280 [btrfs] flush_space+0x250/0x2f8 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x164/0x408 worker_thread+0x25c/0x388 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000) ---[ end trace 0000000000000000 ]--- [CAUSE] The first two lines of extra debug messages show the problem is caused by the error handling of run_delalloc_nocow(). E.g. we have the following dirtied range (4K blocksize 4K page size): 0 16K 32K |//////////////////////////////////////| | Pre-allocated | And the range [0, 16K) has a preallocated extent. - Enter run_delalloc_nocow() for range [0, 16K) Which found range [0, 16K) is preallocated, can do the proper NOCOW write. - Enter fallback_to_fow() for range [16K, 32K) Since the range [16K, 32K) is not backed by preallocated extent, we have to go COW. - cow_file_range() failed for range [16K, 32K) So cow_file_range() will do the clean up by clearing folio dirty, unlock the folios. Now the folios in range [16K, 32K) is unlocked. - Enter extent_clear_unlock_delalloc() from run_delalloc_nocow() Which is called with PAGE_START_WRITEBACK to start page writeback. But folios can only be marked writeback when it's properly locked, thus this triggered the VM_BUG_ON_FOLIO(). Furthermore there is another hidden but common bug that run_delalloc_nocow() is not clearing the folio dirty flags in its error handling path. This is the common bug shared between run_delalloc_nocow() and cow_file_range(). [FIX] - Clear folio dirty for range [@start, @cur_offset) Introduce a helper, cleanup_dirty_folios(), which will find and lock the folio in the range, clear the dirty flag and start/end the writeback, with the extra handling for the @locked_folio. - Introduce a helper to clear folio dirty, start and end writeback - Introduce a helper to record the last failed COW range end This is to trace which range we should skip, to avoid double unlocking. - Skip the failed COW range for the error handling CC: stable@vger.kernel.org Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-08btrfs: do proper folio cleanup when cow_file_range() failedQu Wenruo1-33/+30
commit 06f364284794f149d2abc167c11d556cf20c954b upstream. [BUG] When testing with COW fixup marked as BUG_ON() (this is involved with the new pin_user_pages*() change, which should not result new out-of-band dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW fixup path. This BUG_ON() happens just after a failed btrfs_run_delalloc_range(): BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28 ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:1444! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G OE 6.12.0-rc7-custom+ #86 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : extent_writepage_io+0x2d4/0x308 [btrfs] lr : extent_writepage_io+0x2d4/0x308 [btrfs] Call trace: extent_writepage_io+0x2d4/0x308 [btrfs] extent_writepage+0x218/0x330 [btrfs] extent_write_cache_pages+0x1d4/0x4b0 [btrfs] btrfs_writepages+0x94/0x150 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x88/0xc8 start_delalloc_inodes+0x180/0x3b0 [btrfs] btrfs_start_delalloc_roots+0x174/0x280 [btrfs] shrink_delalloc+0x114/0x280 [btrfs] flush_space+0x250/0x2f8 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x164/0x408 worker_thread+0x25c/0x388 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000) ---[ end trace 0000000000000000 ]--- [CAUSE] That failure is mostly from cow_file_range(), where we can hit -ENOSPC. Although the -ENOSPC is already a bug related to our space reservation code, let's just focus on the error handling. For example, we have the following dirty range [0, 64K) of an inode, with 4K sector size and 4K page size: 0 16K 32K 48K 64K |///////////////////////////////////////| |#######################################| Where |///| means page are still dirty, and |###| means the extent io tree has EXTENT_DELALLOC flag. - Enter extent_writepage() for page 0 - Enter btrfs_run_delalloc_range() for range [0, 64K) - Enter cow_file_range() for range [0, 64K) - Function btrfs_reserve_extent() only reserved one 16K extent So we created extent map and ordered extent for range [0, 16K) 0 16K 32K 48K 64K |////////|//////////////////////////////| |<- OE ->|##############################| And range [0, 16K) has its delalloc flag cleared. But since we haven't yet submit any bio, involved 4 pages are still dirty. - Function btrfs_reserve_extent() returns with -ENOSPC Now we have to run error cleanup, which will clear all EXTENT_DELALLOC* flags and clear the dirty flags for the remaining ranges: 0 16K 32K 48K 64K |////////| | | | | Note that range [0, 16K) still has its pages dirty. - Some time later, writeback is triggered again for the range [0, 16K) since the page range still has dirty flags. - btrfs_run_delalloc_range() will do nothing because there is no EXTENT_DELALLOC flag. - extent_writepage_io() finds page 0 has no ordered flag Which falls into the COW fixup path, triggering the BUG_ON(). Unfortunately this error handling bug dates back to the introduction of btrfs. Thankfully with the abuse of COW fixup, at least it won't crash the kernel. [FIX] Instead of immediately unlocking the extent and folios, we keep the extent and folios locked until either erroring out or the whole delalloc range finished. When the whole delalloc range finished without error, we just unlock the whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared. And the involved folios will be properly submitted, with their dirty flags cleared during submission. For the error path, it will be a little more complex: - The range with ordered extent allocated (range (1)) We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining flags are cleaned up by btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered(). For folios we finish the IO (clear dirty, start writeback and immediately finish the writeback) and unlock the folios. - The range with reserved extent but no ordered extent (range(2)) - The range we never touched (range(3)) For both range (2) and range(3) the behavior is not changed. Now even if cow_file_range() failed halfway with some successfully reserved extents/ordered extents, we will keep all folios clean, so there will be no future writeback triggered on them. CC: stable@vger.kernel.org Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-08btrfs: output the reason for open_ctree() failureQu Wenruo1-1/+1
commit d0f038104fa37380e2a725e669508e43d0c503e9 upstream. There is a recent ML report that mounting a large fs backed by hardware RAID56 controller (with one device missing) took too much time, and systemd seems to kill the mount attempt. In that case, the only error message is: BTRFS error (device sdj): open_ctree failed There is no reason on why the failure happened, making it very hard to understand the reason. At least output the error number (in the particular case it should be -EINTR) to provide some clue. Link: https://lore.kernel.org/linux-btrfs/9b9c4d2810abcca2f9f76e32220ed9a90febb235.camel@scientia.org/ Reported-by: Christoph Anton Mitterer <calestyo@scientia.org> Cc: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-08btrfs: subpage: fix the bitmap dump of the locked flagsQu Wenruo1-2/+4
[ Upstream commit 396294d1afee65a203d6cabd843d0782e5d7388e ] We're dumping the locked bitmap into the @checked_bitmap variable, printing incorrect values during debug. Thankfully even during my development I haven't hit a case where I need to dump the locked bitmap. But for the sake of consistency, fix it by dupping the locked bitmap into @locked_bitmap variable for output. Fixes: 75258f20fb70 ("btrfs: subpage: dump extra subpage bitmaps for debug") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-08btrfs: improve the warning and error message for btrfs_remove_qgroup()Qu Wenruo1-5/+16
[ Upstream commit c0def46dec9c547679a25fe7552c4bcbec0b0dd2 ] [WARNING] There are several warnings about the recently introduced qgroup auto-removal that it triggers WARN_ON() for the non-zero rfer/excl numbers, e.g: ------------[ cut here ]------------ WARNING: CPU: 67 PID: 2882 at fs/btrfs/qgroup.c:1854 btrfs_remove_qgroup+0x3df/0x450 CPU: 67 UID: 0 PID: 2882 Comm: btrfs-cleaner Kdump: loaded Not tainted 6.11.6-300.fc41.x86_64 #1 RIP: 0010:btrfs_remove_qgroup+0x3df/0x450 Call Trace: <TASK> btrfs_qgroup_cleanup_dropped_subvolume+0x97/0xc0 btrfs_drop_snapshot+0x44e/0xa80 btrfs_clean_one_deleted_snapshot+0xc3/0x110 cleaner_kthread+0xd8/0x130 kthread+0xd2/0x100 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]--- BTRFS warning (device sda): to be deleted qgroup 0/319 has non-zero numbers, rfer 258478080 rfer_cmpr 258478080 excl 0 excl_cmpr 0 [CAUSE] Although the root cause is still unclear, as if qgroup is consistent a fully dropped subvolume (with extra transaction committed) should lead to all zero numbers for the qgroup. My current guess is the subvolume drop triggered the new subtree drop threshold thus marked qgroup inconsistent, then rescan cleared it but some corner case is not properly handled during subvolume dropping. But at least for this particular case, since it's only the rfer/excl not properly reset to 0, and qgroup is already marked inconsistent, there is nothing to be worried for the end users. The user space tool utilizing qgroup would queue a rescan to handle everything, so the kernel wanring is a little overkilled. [ENHANCEMENT] Enhance the warning inside btrfs_remove_qgroup() by: - Only do WARN() if CONFIG_BTRFS_DEBUG is enabled As explained the kernel can handle inconsistent qgroups by simply do a rescan, there is nothing to bother the end users. - Treat the reserved space leak the same as non-zero numbers By outputting the values and trigger a WARN() if it's a debug build. So far I haven't experienced any case related to reserved space so I hope we will never need to bother them. Fixes: 839d6ea4f86d ("btrfs: automatically remove the subvolume qgroup") Link: https://github.com/kdave/btrfs-progs/issues/922 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-16Merge tag 'for-6.13-rc7-tag' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: - handle d_path() errors when canonicalizing device mapper paths during device scan * tag 'for-6.13-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add the missing error handling inside get_canonical_dev_path
2025-01-13btrfs: add the missing error handling inside get_canonical_dev_pathQu Wenruo1-0/+4
Inside function get_canonical_dev_path(), we call d_path() to get the final device path. But d_path() can return error, and in that case the next strscpy() call will trigger an invalid memory access. Add back the missing error handling for d_path(). Reported-by: Boris Burkov <boris@bur.io> Fixes: 7e06de7c83a7 ("btrfs: canonicalize the device path before adding it") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-09Merge tag 'for-6.13-rc6-tag' of ↵Linus Torvalds4-61/+74
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes. Besides the one-liners in Btrfs there's fix to the io_uring and encoded read integration (added in this development cycle). The update to io_uring provides more space for the ongoing command that is then used in Btrfs to handle some cases. - io_uring and encoded read: - provide stable storage for io_uring command data - make a copy of encoded read ioctl call, reuse that in case the call would block and will be called again - properly initialize zlib context for hardware compression on s390 - fix max extent size calculation on filesystems with non-zoned devices - fix crash in scrub on crafted image due to invalid extent tree" * tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zlib: fix avail_in bytes for s390 zlib HW compression path btrfs: zoned: calculate max_extent_size properly on non-zoned setup btrfs: avoid NULL pointer dereference if no valid extent tree btrfs: don't read from userspace twice in btrfs_uring_encoded_read() io_uring: add io_uring_cmd_get_async_data helper io_uring/cmd: add per-op data to struct io_uring_cmd_data io_uring/cmd: rename struct uring_cache to io_uring_cmd_data
2025-01-06btrfs: zlib: fix avail_in bytes for s390 zlib HW compression pathMikhail Zaslonko1-2/+2
Since the input data length passed to zlib_compress_folios() can be arbitrary, always setting strm.avail_in to a multiple of PAGE_SIZE may cause read-in bytes to exceed the input range. Currently this triggers an assert in btrfs_compress_folios() on the debug kernel (see below). Fix strm.avail_in calculation for S390 hardware acceleration path. assertion failed: *total_in <= orig_len, in fs/btrfs/compression.c:1041 ------------[ cut here ]------------ kernel BUG at fs/btrfs/compression.c:1041! monitor event: 0040 ilc:2 [#1] PREEMPT SMP CPU: 16 UID: 0 PID: 325 Comm: kworker/u273:3 Not tainted 6.13.0-20241204.rc1.git6.fae3b21430ca.300.fc41.s390x+debug #1 Hardware name: IBM 3931 A01 703 (z/VM 7.4.0) Workqueue: btrfs-delalloc btrfs_work_helper Krnl PSW : 0704d00180000000 0000021761df6538 (btrfs_compress_folios+0x198/0x1a0) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 Krnl GPRS: 0000000080000000 0000000000000001 0000000000000047 0000000000000000 0000000000000006 ffffff01757bb000 000001976232fcc0 000000000000130c 000001976232fcd0 000001976232fcc8 00000118ff4a0e30 0000000000000001 00000111821ab400 0000011100000000 0000021761df6534 000001976232fb58 Krnl Code: 0000021761df6528: c020006f5ef4 larl %r2,0000021762be2310 0000021761df652e: c0e5ffbd09d5 brasl %r14,00000217615978d8 #0000021761df6534: af000000 mc 0,0 >0000021761df6538: 0707 bcr 0,%r7 0000021761df653a: 0707 bcr 0,%r7 0000021761df653c: 0707 bcr 0,%r7 0000021761df653e: 0707 bcr 0,%r7 0000021761df6540: c004004bb7ec brcl 0,000002176276d518 Call Trace: [<0000021761df6538>] btrfs_compress_folios+0x198/0x1a0 ([<0000021761df6534>] btrfs_compress_folios+0x194/0x1a0) [<0000021761d97788>] compress_file_range+0x3b8/0x6d0 [<0000021761dcee7c>] btrfs_work_helper+0x10c/0x160 [<0000021761645760>] process_one_work+0x2b0/0x5d0 [<000002176164637e>] worker_thread+0x20e/0x3e0 [<000002176165221a>] kthread+0x15a/0x170 [<00000217615b859c>] __ret_from_fork+0x3c/0x60 [<00000217626e72d2>] ret_from_fork+0xa/0x38 INFO: lockdep is turned off. Last Breaking-Event-Address: [<0000021761597924>] _printk+0x4c/0x58 Kernel panic - not syncing: Fatal exception: panic_on_oops Fixes: fd1e75d0105d ("btrfs: make compression path to be subpage compatible") CC: stable@vger.kernel.org # 6.12+ Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-06btrfs: zoned: calculate max_extent_size properly on non-zoned setupChristoph Hellwig1-2/+3
Since commit 559218d43ec9 ("block: pre-calculate max_zone_append_sectors"), queue_limits's max_zone_append_sectors is default to be 0 and it is only updated when there is a zoned device. So, we have lim->max_zone_append_sectors = 0 when there is no zoned device in the filesystem. That leads to fs_info->max_zone_append_size and thus fs_info->max_extent_size to be 0, which is wrong and can for example lead to a divide by zero in count_max_extents(). Fix this by only capping fs_info->max_extent_size to fs_info->max_zone_append_size when it is non-zero. Based on a patch from Naohiro Aota <naohiro.aota@wdc.com>, from which much of this commit message is stolen as well. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 559218d43ec9 ("block: pre-calculate max_zone_append_sectors") Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-06btrfs: avoid NULL pointer dereference if no valid extent treeQu Wenruo1-0/+4
[BUG] Syzbot reported a crash with the following call trace: BTRFS info (device loop0): scrub: started on devid 1 BUG: kernel NULL pointer dereference, address: 0000000000000208 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 106e70067 P4D 106e70067 PUD 107143067 PMD 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 UID: 0 PID: 689 Comm: repro Kdump: loaded Tainted: G O 6.13.0-rc4-custom+ #206 Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:find_first_extent_item+0x26/0x1f0 [btrfs] Call Trace: <TASK> scrub_find_fill_first_stripe+0x13d/0x3b0 [btrfs] scrub_simple_mirror+0x175/0x260 [btrfs] scrub_stripe+0x5d4/0x6c0 [btrfs] scrub_chunk+0xbb/0x170 [btrfs] scrub_enumerate_chunks+0x2f4/0x5f0 [btrfs] btrfs_scrub_dev+0x240/0x600 [btrfs] btrfs_ioctl+0x1dc8/0x2fa0 [btrfs] ? do_sys_openat2+0xa5/0xf0 __x64_sys_ioctl+0x97/0xc0 do_syscall_64+0x4f/0x120 entry_SYSCALL_64_after_hwframe+0x76/0x7e </TASK> [CAUSE] The reproducer is using a corrupted image where extent tree root is corrupted, thus forcing to use "rescue=all,ro" mount option to mount the image. Then it triggered a scrub, but since scrub relies on extent tree to find where the data/metadata extents are, scrub_find_fill_first_stripe() relies on an non-empty extent root. But unfortunately scrub_find_fill_first_stripe() doesn't really expect an NULL pointer for extent root, it use extent_root to grab fs_info and triggered a NULL pointer dereference. [FIX] Add an extra check for a valid extent root at the beginning of scrub_find_fill_first_stripe(). The new error path is introduced by 42437a6386ff ("btrfs: introduce mount option rescue=ignorebadroots"), but that's pretty old, and later commit b979547513ff ("btrfs: scrub: introduce helper to find and fill sector info for a scrub_stripe") changed how we do scrub. So for kernels older than 6.6, the fix will need manual backport. Reported-by: syzbot+339e9dbe3a2ca419b85d@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/67756935.050a0220.25abdd.0a12.GAE@google.com/ Fixes: 42437a6386ff ("btrfs: introduce mount option rescue=ignorebadroots") Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-06btrfs: don't read from userspace twice in btrfs_uring_encoded_read()Mark Harmstone1-57/+65
If we return -EAGAIN the first time because we need to block, btrfs_uring_encoded_read() will get called twice. Take a copy of args, the iovs, and the iter the first time, as by the time we are called the second time these may have gone out of scope. Reported-by: Jens Axboe <axboe@kernel.dk> Fixes: 34310c442e17 ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)") Signed-off-by: Mark Harmstone <maharmstone@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-29Merge tag 'for-6.13-rc4-tag' of ↵Linus Torvalds6-56/+130
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes that accumulated over the last two weeks, fixing some user reported problems: - swapfile fixes: - conditional reschedule in the activation loop - fix race with memory mapped file when activating - make activation loop interruptible - rework and fix extent sharing checks - folio fixes: - in send, recheck folio mapping after unlock - in relocation, recheck folio mapping after unlock - fix waiting for encoded read io_uring requests - fix transaction atomicity when enabling simple quotas - move COW block trace point before the block gets freed - print various sizes in sysfs with correct endianity" * tag 'for-6.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: sysfs: fix direct super block member reads btrfs: fix transaction atomicity bug when enabling simple quotas btrfs: avoid monopolizing a core when activating a swap file btrfs: allow swap activation to be interruptible btrfs: fix swap file activation failure due to extents that used to be shared btrfs: fix race with memory mapped writes when activating swap file btrfs: check folio mapping after unlock in put_file_data() btrfs: check folio mapping after unlock in relocate_one_folio() btrfs: fix use-after-free when COWing tree bock and tracing is enabled btrfs: fix use-after-free waiting for encoded read endios
2024-12-24btrfs: sysfs: fix direct super block member readsQu Wenruo1-3/+3
The following sysfs entries are reading super block member directly, which can have a different endian and cause wrong values: - sys/fs/btrfs/<uuid>/nodesize - sys/fs/btrfs/<uuid>/sectorsize - sys/fs/btrfs/<uuid>/clone_alignment Thankfully those values (nodesize and sectorsize) are always aligned inside the btrfs_super_block, so it won't trigger unaligned read errors, just endian problems. Fix them by using the native cached members instead. Fixes: df93589a1737 ("btrfs: export more from FS_INFO to sysfs") CC: stable@vger.kernel.org Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-24btrfs: fix transaction atomicity bug when enabling simple quotasJulian Sun1-2/+1
Set squota incompat bit before committing the transaction that enables the feature. With the config CONFIG_BTRFS_ASSERT enabled, an assertion failure occurs regarding the simple quota feature. [5.596534] assertion failed: btrfs_fs_incompat(fs_info, SIMPLE_QUOTA), in fs/btrfs/qgroup.c:365 [5.597098] ------------[ cut here ]------------ [5.597371] kernel BUG at fs/btrfs/qgroup.c:365! [5.597946] CPU: 1 UID: 0 PID: 268 Comm: mount Not tainted 6.13.0-rc2-00031-gf92f4749861b #146 [5.598450] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [5.599008] RIP: 0010:btrfs_read_qgroup_config+0x74d/0x7a0 [5.604303] <TASK> [5.605230] ? btrfs_read_qgroup_config+0x74d/0x7a0 [5.605538] ? exc_invalid_op+0x56/0x70 [5.605775] ? btrfs_read_qgroup_config+0x74d/0x7a0 [5.606066] ? asm_exc_invalid_op+0x1f/0x30 [5.606441] ? btrfs_read_qgroup_config+0x74d/0x7a0 [5.606741] ? btrfs_read_qgroup_config+0x74d/0x7a0 [5.607038] ? try_to_wake_up+0x317/0x760 [5.607286] open_ctree+0xd9c/0x1710 [5.607509] btrfs_get_tree+0x58a/0x7e0 [5.608002] vfs_get_tree+0x2e/0x100 [5.608224] fc_mount+0x16/0x60 [5.608420] btrfs_get_tree+0x2f8/0x7e0 [5.608897] vfs_get_tree+0x2e/0x100 [5.609121] path_mount+0x4c8/0xbc0 [5.609538] __x64_sys_mount+0x10d/0x150 The issue can be easily reproduced using the following reproducer: root@q:linux# cat repro.sh set -e mkfs.btrfs -q -f /dev/sdb mount /dev/sdb /mnt/btrfs btrfs quota enable -s /mnt/btrfs umount /mnt/btrfs mount /dev/sdb /mnt/btrfs The issue is that when enabling quotas, at btrfs_quota_enable(), we set BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE at fs_info->qgroup_flags and persist it in the quota root in the item with the key BTRFS_QGROUP_STATUS_KEY, but we only set the incompat bit BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA after we commit the transaction used to enable simple quotas. This means that if after that transaction commit we unmount the filesystem without starting and committing any other transaction, or we have a power failure, the next time we mount the filesystem we will find the flag BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE set in the item with the key BTRFS_QGROUP_STATUS_KEY but we will not find the incompat bit BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA set in the superblock, triggering an assertion failure at: btrfs_read_qgroup_config() -> qgroup_read_enable_gen() To fix this issue, set the BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA flag immediately after setting the BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE. This ensures that both flags are flushed to disk within the same transaction. Fixes: 182940f4f4db ("btrfs: qgroup: add new quota mode for simple quotas") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-24btrfs: avoid monopolizing a core when activating a swap fileFilipe Manana1-0/+2
During swap activation we iterate over the extents of a file and we can have many thousands of them, so we can end up in a busy loop monopolizing a core. Avoid this by doing a voluntary reschedule after processing each extent. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-24btrfs: allow swap activation to be interruptibleFilipe Manana1-0/+5
During swap activation we iterate over the extents of a file, then do several checks for each extent, some of which may take some significant time such as checking if an extent is shared. Since a file can have many thousands of extents, this can be a very slow operation and it's currently not interruptible. I had a bug during development of a previous patch that resulted in an infinite loop when iterating the extents, so a core was busy looping and I couldn't cancel the operation, which is very annoying and requires a reboot. So make the loop interruptible by checking for fatal signals at the end of each iteration and stopping immediately if there is one. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-24btrfs: fix swap file activation failure due to extents that used to be sharedFilipe Manana1-27/+69
When activating a swap file, to determine if an extent is shared we use can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper is meant to be quick because it's used in the NOCOW write path, when flushing delalloc and when doing a direct IO write, however it does return some false positives, meaning it may indicate that an extent is shared even if it's no longer the case. For the write path this is fine, we just do a unnecessary COW operation instead of doing a more rigorous check which would be too heavy (calling btrfs_is_data_extent_shared()). However when activating a swap file, the false positives simply result in a failure, which is confusing for users/applications. One particular case where this happens is when a data extent only has 1 reference but that reference is not inlined in the extent item located in the extent tree - this happens when we create more than 33 references for an extent and then delete those 33 references plus every other non-inline reference except one. The function check_committed_ref() assumes that if the size of an extent item doesn't match the size of struct btrfs_extent_item plus the size of an inline reference (plus an owner reference in case simple quotas are enabled), then the extent is shared - that is not the case however, we can have a single reference but it's not inlined - the reason we do this is to be fast and avoid inspecting non-inline references which may be located in another leaf of the extent tree, slowing down write paths. The following test script reproduces the bug: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi NUM_CLONES=50 umount $DEV &> /dev/null run_test() { local sync_after_add_reflinks=$1 local sync_after_remove_reflinks=$2 mkfs.btrfs -f $DEV > /dev/null #mkfs.xfs -f $DEV > /dev/null mount $DEV $MNT touch $MNT/foo chmod 0600 $MNT/foo # On btrfs the file must be NOCOW. chattr +C $MNT/foo &> /dev/null xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo mkswap $MNT/foo for ((i = 1; i <= $NUM_CLONES; i++)); do touch $MNT/foo_clone_$i chmod 0600 $MNT/foo_clone_$i # On btrfs the file must be NOCOW. chattr +C $MNT/foo_clone_$i &> /dev/null cp --reflink=always $MNT/foo $MNT/foo_clone_$i done if [ $sync_after_add_reflinks -ne 0 ]; then # Flush delayed refs and commit current transaction. sync -f $MNT fi # Remove the original file and all clones except the last. rm -f $MNT/foo for ((i = 1; i < $NUM_CLONES; i++)); do rm -f $MNT/foo_clone_$i done if [ $sync_after_remove_reflinks -ne 0 ]; then # Flush delayed refs and commit current transaction. sync -f $MNT fi # Now use the last clone as a swap file. It should work since # its extent are not shared anymore. swapon $MNT/foo_clone_${NUM_CLONES} swapoff $MNT/foo_clone_${NUM_CLONES} umount $MNT } echo -e "\nTest without sync after creating and removing clones" run_test 0 0 echo -e "\nTest with sync after creating clones" run_test 1 0 echo -e "\nTest with sync after removing clones" run_test 0 1 echo -e "\nTest with sync after creating and removing clones" run_test 1 1 Running the test: $ ./test.sh Test without sync after creating and removing clones wrote 1048576/1048576 bytes at offset 0 1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec) Setting up swapspace version 1, size = 1020 KiB (1044480 bytes) no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02 swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument Test with sync after creating clones wrote 1048576/1048576 bytes at offset 0 1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec) Setting up swapspace version 1, size = 1020 KiB (1044480 bytes) no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33 swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument Test with sync after removing clones wrote 1048576/1048576 bytes at offset 0 1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec) Setting up swapspace version 1, size = 1020 KiB (1044480 bytes) no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764 Test with sync after creating and removing clones wrote 1048576/1048576 bytes at offset 0 1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec) Setting up swapspace version 1, size = 1020 KiB (1044480 bytes) no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912 swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument Fix this by reworking btrfs_swap_activate() to instead of using extent maps and checking for shared extents with can_nocow_extent(), iterate over the inode's file extent items and use the accurate btrfs_is_data_extent_shared(). CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-24btrfs: fix race with memory mapped writes when activating swap fileFilipe Manana1-7/+24
When activating the swap file we flush all delalloc and wait for ordered extent completion, so that we don't miss any delalloc and extents before we check that the file's extent layout is usable for a swap file and activate the swap file. We are called with the inode's VFS lock acquired, so we won't race with buffered and direct IO writes, however we can still race with memory mapped writes since they don't acquire the inode's VFS lock. The race window is between flushing all delalloc and locking the whole file's extent range, since memory mapped writes lock an extent range with the length of a page. Fix this by acquiring the inode's mmap lock before we flush delalloc. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-24btrfs: check folio mapping after unlock in put_file_data()Boris Burkov1-0/+6
When we call btrfs_read_folio() we get an unlocked folio, so it is possible for a different thread to concurrently modify folio->mapping. We must check that this hasn't happened once we do have the lock. CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-24btrfs: check folio mapping after unlock in relocate_one_folio()Boris Burkov1-0/+6
When we call btrfs_read_folio() to bring a folio uptodate, we unlock the folio. The result of that is that a different thread can modify the mapping (like remove it with invalidate) before we call folio_lock(). This results in an invalid page and we need to try again. In particular, if we are relocating concurrently with aborting a transaction, this can result in a crash like the following: BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 76 PID: 1411631 Comm: kworker/u322:5 Workqueue: events_unbound btrfs_reclaim_bgs_work RIP: 0010:set_page_extent_mapped+0x20/0xb0 RSP: 0018:ffffc900516a7be8 EFLAGS: 00010246 RAX: ffffea009e851d08 RBX: ffffea009e0b1880 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffc900516a7b90 RDI: ffffea009e0b1880 RBP: 0000000003573000 R08: 0000000000000001 R09: ffff88c07fd2f3f0 R10: 0000000000000000 R11: 0000194754b575be R12: 0000000003572000 R13: 0000000003572fff R14: 0000000000100cca R15: 0000000005582fff FS: 0000000000000000(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000407d00f002 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __die+0x78/0xc0 ? page_fault_oops+0x2a8/0x3a0 ? __switch_to+0x133/0x530 ? wq_worker_running+0xa/0x40 ? exc_page_fault+0x63/0x130 ? asm_exc_page_fault+0x22/0x30 ? set_page_extent_mapped+0x20/0xb0 relocate_file_extent_cluster+0x1a7/0x940 relocate_data_extent+0xaf/0x120 relocate_block_group+0x20f/0x480 btrfs_relocate_block_group+0x152/0x320 btrfs_relocate_chunk+0x3d/0x120 btrfs_reclaim_bgs_work+0x2ae/0x4e0 process_scheduled_works+0x184/0x370 worker_thread+0xc6/0x3e0 ? blk_add_timer+0xb0/0xb0 kthread+0xae/0xe0 ? flush_tlb_kernel_range+0x90/0x90 ret_from_fork+0x2f/0x40 ? flush_tlb_kernel_range+0x90/0x90 ret_from_fork_asm+0x11/0x20 </TASK> This occurs because cleanup_one_transaction() calls destroy_delalloc_inodes() which calls invalidate_inode_pages2() which takes the folio_lock before setting mapping to NULL. We fail to check this, and subsequently call set_extent_mapping(), which assumes that mapping != NULL (in fact it asserts that in debug mode) Note that the "fixes" patch here is not the one that introduced the race (the very first iteration of this code from 2009) but a more recent change that made this particular crash happen in practice. Fixes: e7f1326cc24e ("btrfs: set page extent mapped after read_folio in relocate_one_page") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23btrfs: fix use-after-free when COWing tree bock and tracing is enabledFilipe Manana1-7/+4
When a COWing a tree block, at btrfs_cow_block(), and we have the tracepoint trace_btrfs_cow_block() enabled and preemption is also enabled (CONFIG_PREEMPT=y), we can trigger a use-after-free in the COWed extent buffer while inside the tracepoint code. This is because in some paths that call btrfs_cow_block(), such as btrfs_search_slot(), we are holding the last reference on the extent buffer @buf so btrfs_force_cow_block() drops the last reference on the @buf extent buffer when it calls free_extent_buffer_stale(buf), which schedules the release of the extent buffer with RCU. This means that if we are on a kernel with preemption, the current task may be preempted before calling trace_btrfs_cow_block() and the extent buffer already released by the time trace_btrfs_cow_block() is called, resulting in a use-after-free. Fix this by moving the trace_btrfs_cow_block() from btrfs_cow_block() to btrfs_force_cow_block() before the COWed extent buffer is freed. This also has a side effect of invoking the tracepoint in the tree defrag code, at defrag.c:btrfs_realloc_node(), since btrfs_force_cow_block() is called there, but this is fine and it was actually missing there. Reported-by: syzbot+8517da8635307182c8a5@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6759a9b9.050a0220.1ac542.000d.GAE@google.com/ CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23btrfs: fix use-after-free waiting for encoded read endiosJohannes Thumshirn1-11/+11
Fix a use-after-free in the I/O completion path for encoded reads by using a completion instead of a wait_queue for synchronizing the destruction of 'struct btrfs_encoded_read_private'. Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-19Merge tag 'for-6.13-rc3-tag' of ↵Linus Torvalds4-9/+59
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - tree-checker catches invalid number of inline extent references - zoned mode fixes: - enhance zone append IO command so it also detects emulated writes - handle bio splitting at sectorsize boundary - when deleting a snapshot, fix a condition for visiting nodes in reloc trees * tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: tree-checker: reject inline extent items with 0 ref count btrfs: split bios to the fs sector size boundary btrfs: use bio_is_zone_append() in the completion handler btrfs: fix improper generation check in snapshot delete