summaryrefslogtreecommitdiff
path: root/fs/btrfs/qgroup.c
AgeCommit message (Collapse)AuthorFilesLines
2024-02-23btrfs: forbid deleting live subvol qgroupBoris Burkov1-0/+14
commit a8df35619948bd8363d330c20a90c9a7fbff28c0 upstream. If a subvolume still exists, forbid deleting its qgroup 0/subvolid. This behavior generally leads to incorrect behavior in squotas and doesn't have a legitimate purpose. Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-01btrfs: free qgroup pertrans reserve on transaction abortBoris Burkov1-2/+3
[ Upstream commit b321a52cce062ec7ed385333a33905d22159ce36 ] If we abort a transaction, we never run the code that frees the pertrans qgroup reservation. This results in warnings on unmount as that reservation has been leaked. The leak isn't a huge issue since the fs is read-only, but it's better to clean it up when we know we can/should. Do it during the cleanup_transaction step of aborting. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-01btrfs: qgroup: use qgroup_iterator in qgroup_convert_meta()Qu Wenruo1-22/+10
[ Upstream commit 0913445082496c2b29668ee26521401b273838b8 ] With the new qgroup_iterator_add() and qgroup_iterator_clean(), we can get rid of the ulist and its GFP_ATOMIC memory allocation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b321a52cce06 ("btrfs: free qgroup pertrans reserve on transaction abort") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-01btrfs: qgroup: iterate qgroups without memory allocation for qgroup_reserve()Qu Wenruo1-32/+29
[ Upstream commit 686c4a5a42635e0d2889e3eb461c554fd0b616b4 ] Qgroup heavily relies on ulist to go through all the involved qgroups, but since we're using ulist inside fs_info->qgroup_lock spinlock, this means we're doing a lot of GFP_ATOMIC allocations. This patch reduces the GFP_ATOMIC usage for qgroup_reserve() by eliminating the memory allocation completely. This is done by moving the needed memory to btrfs_qgroup::iterator list_head, so that we can put all the involved qgroup into a on-stack list, thus eliminating the need to allocate memory while holding spinlock. The only cost is the slightly higher memory usage, but considering the reduce GFP_ATOMIC during a hot path, it should still be acceptable. Function qgroup_reserve() is the perfect start point for this conversion. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b321a52cce06 ("btrfs: free qgroup pertrans reserve on transaction abort") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-20btrfs: fix qgroup_free_reserved_data int overflowBoris Burkov1-10/+15
commit 9e65bfca24cf1d77e4a5c7a170db5867377b3fe7 upstream. The reserved data counter and input parameter is a u64, but we inadvertently accumulate it in an int. Overflowing that int results in freeing the wrong amount of data and breaking reserve accounting. Unfortunately, this overflow rot spreads from there, as the qgroup release/free functions rely on returning an int to take advantage of negative values for error codes. Therefore, the full fix is to return the "released" or "freed" amount by a u64 argument and to return 0 or negative error code via the return value. Most of the call sites simply ignore the return value, though some of them handle the error and count the returned bytes. Change all of them accordingly. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28btrfs: abort transaction on generation mismatch when marking eb as dirtyFilipe Manana1-7/+7
[ Upstream commit 50564b651d01c19ce732819c5b3c3fd60707188e ] When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-21btrfs: avoid start and commit empty transaction when flushing qgroupsFilipe Manana1-1/+3
When flushing qgroups, we try to join a running transaction, with btrfs_join_transaction(), and then commit the transaction. However using btrfs_join_transaction() will result in creating a new transaction in case there isn't any running or if there's an existing one already committing. This is pointless as we only need to attach to an existing one that is not committing and in case there's an existing one committing, wait for its commit to complete. Creating and committing an empty transaction is wasteful, pointless IO and unnecessary rotation of the backup roots. So use btrfs_attach_transaction_barrier() instead, to avoid creating and committing empty transactions. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: avoid start and commit empty transaction when starting qgroup rescanFilipe Manana1-7/+8
When starting a qgroup rescan, we try to join a running transaction, with btrfs_join_transaction(), and then commit the transaction. However using btrfs_join_transaction() will result in creating a new transaction in case there isn't any running or if there's an existing one already committing. This is pointless as we only need to attach to an existing one that is not committing and in case there's an existing one committing, wait for its commit to complete. Creating and committing an empty transaction is wasteful, pointless IO and unnecessary rotation of the backup roots. So use btrfs_attach_transaction_barrier() instead, to avoid creating and committing empty transactions. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: fix warning when putting transaction with qgroups enabled after abortFilipe Manana1-0/+1
If we have a transaction abort with qgroups enabled we get a warning triggered when doing the final put on the transaction, like this: [552.6789] ------------[ cut here ]------------ [552.6815] WARNING: CPU: 4 PID: 81745 at fs/btrfs/transaction.c:144 btrfs_put_transaction+0x123/0x130 [btrfs] [552.6817] Modules linked in: btrfs blake2b_generic xor (...) [552.6819] CPU: 4 PID: 81745 Comm: btrfs-transacti Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [552.6819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [552.6819] RIP: 0010:btrfs_put_transaction+0x123/0x130 [btrfs] [552.6821] Code: bd a0 01 00 (...) [552.6821] RSP: 0018:ffffa168c0527e28 EFLAGS: 00010286 [552.6821] RAX: ffff936042caed00 RBX: ffff93604a3eb448 RCX: 0000000000000000 [552.6821] RDX: ffff93606421b028 RSI: ffffffff92ff0878 RDI: ffff93606421b010 [552.6821] RBP: ffff93606421b000 R08: 0000000000000000 R09: ffffa168c0d07c20 [552.6821] R10: 0000000000000000 R11: ffff93608dc52950 R12: ffffa168c0527e70 [552.6821] R13: ffff93606421b000 R14: ffff93604a3eb420 R15: ffff93606421b028 [552.6821] FS: 0000000000000000(0000) GS:ffff93675fb00000(0000) knlGS:0000000000000000 [552.6821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [552.6821] CR2: 0000558ad262b000 CR3: 000000014feda005 CR4: 0000000000370ee0 [552.6822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [552.6822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [552.6822] Call Trace: [552.6822] <TASK> [552.6822] ? __warn+0x80/0x130 [552.6822] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6824] ? report_bug+0x1f4/0x200 [552.6824] ? handle_bug+0x42/0x70 [552.6824] ? exc_invalid_op+0x14/0x70 [552.6824] ? asm_exc_invalid_op+0x16/0x20 [552.6824] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6826] btrfs_cleanup_transaction+0xe7/0x5e0 [btrfs] [552.6828] ? _raw_spin_unlock_irqrestore+0x23/0x40 [552.6828] ? try_to_wake_up+0x94/0x5e0 [552.6828] ? __pfx_process_timeout+0x10/0x10 [552.6828] transaction_kthread+0x103/0x1d0 [btrfs] [552.6830] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] [552.6832] kthread+0xee/0x120 [552.6832] ? __pfx_kthread+0x10/0x10 [552.6832] ret_from_fork+0x29/0x50 [552.6832] </TASK> [552.6832] ---[ end trace 0000000000000000 ]--- This corresponds to this line of code: void btrfs_put_transaction(struct btrfs_transaction *transaction) { (...) WARN_ON(!RB_EMPTY_ROOT( &transaction->delayed_refs.dirty_extent_root)); (...) } The warning happens because btrfs_qgroup_destroy_extent_records(), called in the transaction abort path, we free all entries from the rbtree "dirty_extent_root" with rbtree_postorder_for_each_entry_safe(), but we don't actually empty the rbtree - it's still pointing to nodes that were freed. So set the rbtree's root node to NULL to avoid this warning (assign RB_ROOT). Fixes: 81f7eb00ff5b ("btrfs: destroy qgroup extent records on transaction abort") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: fix race between quota disable and relocationFilipe Manana1-3/+15
If we disable quotas while we have a relocation of a metadata block group that has extents belonging to the quota root, we can cause the relocation to fail with -ENOENT. This is because relocation builds backref nodes for extents of the quota root and later needs to walk the backrefs and access the quota root - however if in between a task disables quotas, it results in deleting the quota root from the root tree (with btrfs_del_root(), called from btrfs_quota_disable(). This can be sporadically triggered by test case btrfs/255 from fstests: $ ./check btrfs/255 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023 MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 btrfs/255 6s ... _check_dmesg: something found in dmesg (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.dmesg) - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad) --- tests/btrfs/255.out 2023-03-02 21:47:53.876609426 +0000 +++ /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad 2023-06-16 10:20:39.267563212 +0100 @@ -1,2 +1,4 @@ QA output created by 255 +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No such file or directory +There may be more info in syslog - try dmesg | tail Silence is golden ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/255.out /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad' to see the entire diff) Ran: btrfs/255 Failures: btrfs/255 Failed 1 of 1 tests To fix this make the quota disable operation take the cleaner mutex, as relocation of a block group also takes this mutex. This is also what we do when deleting a subvolume/snapshot, we take the cleaner mutex in the cleaner kthread (at cleaner_kthread()) and then we call btrfs_del_root() at btrfs_drop_snapshot() while under the protection of the cleaner mutex. Fixes: bed92eae26cc ("Btrfs: qgroup implementation and prototypes") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: fix race when deleting quota root from the dirty cow roots listFilipe Manana1-0/+2
When disabling quotas we are deleting the quota root from the list fs_info->dirty_cowonly_roots without taking the lock that protects it, which is struct btrfs_fs_info::trans_lock. This unsynchronized list manipulation may cause chaos if there's another concurrent manipulation of this list, such as when adding a root to it with ctree.c:add_root_to_dirty_list(). This can result in all sorts of weird failures caused by a race, such as the following crash: [337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI [337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs] [337571.279928] Code: 85 38 06 00 (...) [337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206 [337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000 [337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070 [337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b [337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600 [337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48 [337571.281723] FS: 00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000 [337571.281950] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0 [337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [337571.282874] Call Trace: [337571.283101] <TASK> [337571.283327] ? __die_body+0x1b/0x60 [337571.283570] ? die_addr+0x39/0x60 [337571.283796] ? exc_general_protection+0x22e/0x430 [337571.284022] ? asm_exc_general_protection+0x22/0x30 [337571.284251] ? commit_cowonly_roots+0x11f/0x250 [btrfs] [337571.284531] btrfs_commit_transaction+0x42e/0xf90 [btrfs] [337571.284803] ? _raw_spin_unlock+0x15/0x30 [337571.285031] ? release_extent_buffer+0x103/0x130 [btrfs] [337571.285305] reset_balance_state+0x152/0x1b0 [btrfs] [337571.285578] btrfs_balance+0xa50/0x11e0 [btrfs] [337571.285864] ? __kmem_cache_alloc_node+0x14a/0x410 [337571.286086] btrfs_ioctl+0x249a/0x3320 [btrfs] [337571.286358] ? mod_objcg_state+0xd2/0x360 [337571.286577] ? refill_obj_stock+0xb0/0x160 [337571.286798] ? seq_release+0x25/0x30 [337571.287016] ? __rseq_handle_notify_resume+0x3ba/0x4b0 [337571.287235] ? percpu_counter_add_batch+0x2e/0xa0 [337571.287455] ? __x64_sys_ioctl+0x88/0xc0 [337571.287675] __x64_sys_ioctl+0x88/0xc0 [337571.287901] do_syscall_64+0x38/0x90 [337571.288126] entry_SYSCALL_64_after_hwframe+0x72/0xdc [337571.288352] RIP: 0033:0x7f478aaffe9b So fix this by locking struct btrfs_fs_info::trans_lock before deleting the quota root from that list. Fixes: bed92eae26cc ("Btrfs: qgroup implementation and prototypes") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-28btrfs: fix race between quota disable and quota assign ioctlsFilipe Manana1-1/+10
The quota assign ioctl can currently run in parallel with a quota disable ioctl call. The assign ioctl uses the quota root, while the disable ioctl frees that root, and therefore we can have a use-after-free triggered in the assign ioctl, leading to a trace like the following when KASAN is enabled: [672.723][T736] BUG: KASAN: slab-use-after-free in btrfs_search_slot+0x2962/0x2db0 [672.723][T736] Read of size 8 at addr ffff888022ec0208 by task btrfs_search_sl/27736 [672.724][T736] [672.725][T736] CPU: 1 PID: 27736 Comm: btrfs_search_sl Not tainted 6.3.0-rc3 #37 [672.723][T736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [672.727][T736] Call Trace: [672.728][T736] <TASK> [672.728][T736] dump_stack_lvl+0xd9/0x150 [672.725][T736] print_report+0xc1/0x5e0 [672.720][T736] ? __virt_addr_valid+0x61/0x2e0 [672.727][T736] ? __phys_addr+0xc9/0x150 [672.725][T736] ? btrfs_search_slot+0x2962/0x2db0 [672.722][T736] kasan_report+0xc0/0xf0 [672.729][T736] ? btrfs_search_slot+0x2962/0x2db0 [672.724][T736] btrfs_search_slot+0x2962/0x2db0 [672.723][T736] ? fs_reclaim_acquire+0xba/0x160 [672.722][T736] ? split_leaf+0x13d0/0x13d0 [672.726][T736] ? rcu_is_watching+0x12/0xb0 [672.723][T736] ? kmem_cache_alloc+0x338/0x3c0 [672.722][T736] update_qgroup_status_item+0xf7/0x320 [672.724][T736] ? add_qgroup_rb+0x3d0/0x3d0 [672.739][T736] ? do_raw_spin_lock+0x12d/0x2b0 [672.730][T736] ? spin_bug+0x1d0/0x1d0 [672.737][T736] btrfs_run_qgroups+0x5de/0x840 [672.730][T736] ? btrfs_qgroup_rescan_worker+0xa70/0xa70 [672.738][T736] ? __del_qgroup_relation+0x4ba/0xe00 [672.738][T736] btrfs_ioctl+0x3d58/0x5d80 [672.735][T736] ? tomoyo_path_number_perm+0x16a/0x550 [672.737][T736] ? tomoyo_execute_permission+0x4a0/0x4a0 [672.731][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50 [672.737][T736] ? __sanitizer_cov_trace_switch+0x54/0x90 [672.734][T736] ? do_vfs_ioctl+0x132/0x1660 [672.730][T736] ? vfs_fileattr_set+0xc40/0xc40 [672.730][T736] ? _raw_spin_unlock_irq+0x2e/0x50 [672.732][T736] ? sigprocmask+0xf2/0x340 [672.737][T736] ? __fget_files+0x26a/0x480 [672.732][T736] ? bpf_lsm_file_ioctl+0x9/0x10 [672.738][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50 [672.736][T736] __x64_sys_ioctl+0x198/0x210 [672.736][T736] do_syscall_64+0x39/0xb0 [672.731][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.739][T736] RIP: 0033:0x4556ad [672.742][T736] </TASK> [672.743][T736] [672.748][T736] Allocated by task 27677: [672.743][T736] kasan_save_stack+0x22/0x40 [672.741][T736] kasan_set_track+0x25/0x30 [672.741][T736] __kasan_kmalloc+0xa4/0xb0 [672.749][T736] btrfs_alloc_root+0x48/0x90 [672.746][T736] btrfs_create_tree+0x146/0xa20 [672.744][T736] btrfs_quota_enable+0x461/0x1d20 [672.743][T736] btrfs_ioctl+0x4a1c/0x5d80 [672.747][T736] __x64_sys_ioctl+0x198/0x210 [672.749][T736] do_syscall_64+0x39/0xb0 [672.744][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.756][T736] [672.757][T736] Freed by task 27677: [672.759][T736] kasan_save_stack+0x22/0x40 [672.759][T736] kasan_set_track+0x25/0x30 [672.756][T736] kasan_save_free_info+0x2e/0x50 [672.751][T736] ____kasan_slab_free+0x162/0x1c0 [672.758][T736] slab_free_freelist_hook+0x89/0x1c0 [672.752][T736] __kmem_cache_free+0xaf/0x2e0 [672.752][T736] btrfs_put_root+0x1ff/0x2b0 [672.759][T736] btrfs_quota_disable+0x80a/0xbc0 [672.752][T736] btrfs_ioctl+0x3e5f/0x5d80 [672.756][T736] __x64_sys_ioctl+0x198/0x210 [672.753][T736] do_syscall_64+0x39/0xb0 [672.765][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.769][T736] [672.768][T736] The buggy address belongs to the object at ffff888022ec0000 [672.768][T736] which belongs to the cache kmalloc-4k of size 4096 [672.769][T736] The buggy address is located 520 bytes inside of [672.769][T736] freed 4096-byte region [ffff888022ec0000, ffff888022ec1000) [672.760][T736] [672.764][T736] The buggy address belongs to the physical page: [672.761][T736] page:ffffea00008bb000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22ec0 [672.766][T736] head:ffffea00008bb000 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [672.779][T736] flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) [672.770][T736] raw: 00fff00000010200 ffff888012842140 ffffea000054ba00 dead000000000002 [672.770][T736] raw: 0000000000000000 0000000000040004 00000001ffffffff 0000000000000000 [672.771][T736] page dumped because: kasan: bad access detected [672.778][T736] page_owner tracks the page as allocated [672.777][T736] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd2040(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 88 [672.779][T736] get_page_from_freelist+0x119c/0x2d50 [672.779][T736] __alloc_pages+0x1cb/0x4a0 [672.776][T736] alloc_pages+0x1aa/0x270 [672.773][T736] allocate_slab+0x260/0x390 [672.771][T736] ___slab_alloc+0xa9a/0x13e0 [672.778][T736] __slab_alloc.constprop.0+0x56/0xb0 [672.771][T736] __kmem_cache_alloc_node+0x136/0x320 [672.789][T736] __kmalloc+0x4e/0x1a0 [672.783][T736] tomoyo_realpath_from_path+0xc3/0x600 [672.781][T736] tomoyo_path_perm+0x22f/0x420 [672.782][T736] tomoyo_path_unlink+0x92/0xd0 [672.780][T736] security_path_unlink+0xdb/0x150 [672.788][T736] do_unlinkat+0x377/0x680 [672.788][T736] __x64_sys_unlink+0xca/0x110 [672.789][T736] do_syscall_64+0x39/0xb0 [672.783][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.784][T736] page last free stack trace: [672.787][T736] free_pcp_prepare+0x4e5/0x920 [672.787][T736] free_unref_page+0x1d/0x4e0 [672.784][T736] __unfreeze_partials+0x17c/0x1a0 [672.797][T736] qlist_free_all+0x6a/0x180 [672.796][T736] kasan_quarantine_reduce+0x189/0x1d0 [672.797][T736] __kasan_slab_alloc+0x64/0x90 [672.793][T736] kmem_cache_alloc+0x17c/0x3c0 [672.799][T736] getname_flags.part.0+0x50/0x4e0 [672.799][T736] getname_flags+0x9e/0xe0 [672.792][T736] vfs_fstatat+0x77/0xb0 [672.791][T736] __do_sys_newlstat+0x84/0x100 [672.798][T736] do_syscall_64+0x39/0xb0 [672.796][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.790][T736] [672.791][T736] Memory state around the buggy address: [672.799][T736] ffff888022ec0100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.805][T736] ffff888022ec0180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.802][T736] >ffff888022ec0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.809][T736] ^ [672.809][T736] ffff888022ec0280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.809][T736] ffff888022ec0300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fix this by having the qgroup assign ioctl take the qgroup ioctl mutex before calling btrfs_run_qgroups(), which is what all qgroup ioctls should call. Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAFcO6XN3VD8ogmHwqRk4kbiwtpUSNySu2VAxN8waEPciCHJvMA@mail.gmail.com/ CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: rename btrfs_clean_tree_block to btrfs_clear_buffer_dirtyJosef Bacik1-1/+1
btrfs_clean_tree_block is a misnomer, it's just clear_extent_buffer_dirty with some extra accounting around it. Rename this to btrfs_clear_buffer_dirty to make it more clear it belongs with it's setter, btrfs_mark_buffer_dirty. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: add trans argument to btrfs_clean_tree_blockJosef Bacik1-1/+1
We check the header generation in the extent buffer against the current running transaction id to see if it's safe to clear DIRTY on this buffer. Generally speaking if we're clearing the buffer dirty we're holding the transaction open, but in the case of cleaning up an aborted transaction we don't, so we have extra checks in that path to check the transid. To allow for a future cleanup go ahead and pass in the trans handle so we don't have to rely on ->running_transaction being set. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-16btrfs: fix race between quota rescan and disable leading to NULL pointer derefFilipe Manana1-8/+17
If we have one task trying to start the quota rescan worker while another one is trying to disable quotas, we can end up hitting a race that results in the quota rescan worker doing a NULL pointer dereference. The steps for this are the following: 1) Quotas are enabled; 2) Task A calls the quota rescan ioctl and enters btrfs_qgroup_rescan(). It calls qgroup_rescan_init() which returns 0 (success) and then joins a transaction and commits it; 3) Task B calls the quota disable ioctl and enters btrfs_quota_disable(). It clears the bit BTRFS_FS_QUOTA_ENABLED from fs_info->flags and calls btrfs_qgroup_wait_for_completion(), which returns immediately since the rescan worker is not yet running. Then it starts a transaction and locks fs_info->qgroup_ioctl_lock; 4) Task A queues the rescan worker, by calling btrfs_queue_work(); 5) The rescan worker starts, and calls rescan_should_stop() at the start of its while loop, which results in 0 iterations of the loop, since the flag BTRFS_FS_QUOTA_ENABLED was cleared from fs_info->flags by task B at step 3); 6) Task B sets fs_info->quota_root to NULL; 7) The rescan worker tries to start a transaction and uses fs_info->quota_root as the root argument for btrfs_start_transaction(). This results in a NULL pointer dereference down the call chain of btrfs_start_transaction(). The stack trace is something like the one reported in Link tag below: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f] CPU: 1 PID: 34 Comm: kworker/u4:2 Not tainted 6.1.0-syzkaller-13872-gb6bb9676f216 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 Workqueue: btrfs-qgroup-rescan btrfs_work_helper RIP: 0010:start_transaction+0x48/0x10f0 fs/btrfs/transaction.c:564 Code: 48 89 fb 48 (...) RSP: 0018:ffffc90000ab7ab0 EFLAGS: 00010206 RAX: 0000000000000041 RBX: 0000000000000208 RCX: ffff88801779ba80 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000 RBP: dffffc0000000000 R08: 0000000000000001 R09: fffff52000156f5d R10: fffff52000156f5d R11: 1ffff92000156f5c R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f2bea75b718 CR3: 000000001d0cc000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_qgroup_rescan_worker+0x3bb/0x6a0 fs/btrfs/qgroup.c:3402 btrfs_work_helper+0x312/0x850 fs/btrfs/async-thread.c:280 process_one_work+0x877/0xdb0 kernel/workqueue.c:2289 worker_thread+0xb14/0x1330 kernel/workqueue.c:2436 kthread+0x266/0x300 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 </TASK> Modules linked in: So fix this by having the rescan worker function not attempt to start a transaction if it didn't do any rescan work. Reported-by: syzbot+96977faa68092ad382c4@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000e5454b05f065a803@google.com/ Fixes: e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11btrfs: qgroup: do not warn on record without old_roots populatedQu Wenruo1-2/+12
[BUG] There are some reports from the mailing list that since v6.1 kernel, the WARN_ON() inside btrfs_qgroup_account_extent() gets triggered during rescan: WARNING: CPU: 3 PID: 6424 at fs/btrfs/qgroup.c:2756 btrfs_qgroup_account_extents+0x1ae/0x260 [btrfs] CPU: 3 PID: 6424 Comm: snapperd Tainted: P OE 6.1.2-1-default #1 openSUSE Tumbleweed 05c7a1b1b61d5627475528f71f50444637b5aad7 RIP: 0010:btrfs_qgroup_account_extents+0x1ae/0x260 [btrfs] Call Trace: <TASK> btrfs_commit_transaction+0x30c/0xb40 [btrfs c39c9c546c241c593f03bd6d5f39ea1b676250f6] ? start_transaction+0xc3/0x5b0 [btrfs c39c9c546c241c593f03bd6d5f39ea1b676250f6] btrfs_qgroup_rescan+0x42/0xc0 [btrfs c39c9c546c241c593f03bd6d5f39ea1b676250f6] btrfs_ioctl+0x1ab9/0x25c0 [btrfs c39c9c546c241c593f03bd6d5f39ea1b676250f6] ? __rseq_handle_notify_resume+0xa9/0x4a0 ? mntput_no_expire+0x4a/0x240 ? __seccomp_filter+0x319/0x4d0 __x64_sys_ioctl+0x90/0xd0 do_syscall_64+0x5b/0x80 ? syscall_exit_to_user_mode+0x17/0x40 ? do_syscall_64+0x67/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fd9b790d9bf </TASK> [CAUSE] Since commit e15e9f43c7ca ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"), if our qgroup is already in inconsistent state, we will no longer do the time-consuming backref walk. This can leave some qgroup records without a valid old_roots ulist. Normally this is fine, as btrfs_qgroup_account_extents() would also skip those records if we have NO_ACCOUNTING flag set. But there is a small window, if we have NO_ACCOUNTING flag set, and inserted some qgroup_record without a old_roots ulist, but then the user triggered a qgroup rescan. During btrfs_qgroup_rescan(), we firstly clear NO_ACCOUNTING flag, then commit current transaction. And since we have a qgroup_record with old_roots = NULL, we trigger the WARN_ON() during btrfs_qgroup_account_extents(). [FIX] Unfortunately due to the introduction of NO_ACCOUNTING flag, the assumption that every qgroup_record would have its old_roots populated is no longer correct. Fix the false alerts and drop the WARN_ON(). Reported-by: Lukas Straub <lukasstraub2@web.de> Reported-by: HanatoK <summersnow9403@gmail.com> Fixes: e15e9f43c7ca ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting") CC: stable@vger.kernel.org # 6.1 Link: https://lore.kernel.org/linux-btrfs/2403c697-ddaf-58ad-3829-0335fc89df09@gmail.com/ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-15btrfs: restore BTRFS_SEQ_LAST when looking up qgroup backref lookupJosef Bacik1-0/+1
In the patch a2c8d27e5ee8 ("btrfs: use a structure to pass arguments to backref walking functions") Filipe converted everybody to using a new context struct to use for backref lookups, but accidentally dropped the BTRFS_SEQ_LAST usage that exists for qgroups. Add this back so we have the previous behavior. Fixes: a2c8d27e5ee8 ("btrfs: use a structure to pass arguments to backref walking functions") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move struct btrfs_tree_parent_check out of disk-io.hChristoph Hellwig1-0/+1
Move struct btrfs_tree_parent_check out of disk-io.h so that volumes.h an various .c files don't have to include disk-io.h just for it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ use tree-checker.h for the structure ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: concentrate all tree block parentness check parameters into one structureQu Wenruo1-4/+14
There are several different tree block parentness check parameters used across several helpers: - level Mandatory - transid Under most cases it's mandatory, but there are several backref cases which skips this check. - owner_root - first_key Utilized by most top-down tree search routine. Otherwise can be skipped. Those four members are not always mandatory checks, and some of them are the same u64, which means if some arguments got swapped compiler will not catch it. Furthermore if we're going to further expand the parentness check, we need to modify quite some helpers just to add one more parameter. This patch will concentrate all these members into a structure called btrfs_tree_parent_check, and pass that structure for the following helpers: - btrfs_read_extent_buffer() - read_tree_block() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use a structure to pass arguments to backref walking functionsFilipe Manana1-14/+24
The public backref walking functions have quite a lot of arguments that are passed down the call stack to find_parent_nodes(), the core function of the backref walking code. The next patches in series will need to add even arguments to these functions that should be passed not only to find_parent_nodes(), but also to other functions used by the later (directly or even lower in the call stack). So create a structure to hold all these arguments and state used by the main backref walking function, find_parent_nodes(), and use it as the argument for the public backref walking functions iterate_extent_inodes(), btrfs_find_all_leafs() and btrfs_find_all_roots(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move root tree prototypes to their own headerJosef Bacik1-0/+1
Move all the root-tree.c prototypes to root-tree.h, and then update all the necessary files to include the new header. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move extent-tree helpers into their own header fileJosef Bacik1-0/+1
Move all the extent tree related prototypes to extent-tree.h out of ctree.h, and then go include it everywhere needed so everything compiles. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: sink gfp_t parameter to btrfs_qgroup_trace_extentDavid Sterba1-10/+7
All callers pass GFP_NOFS, we can drop the parameter and use it directly. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move accessor helpers into accessors.hJosef Bacik1-0/+1
This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move fs wide helpers out of ctree.hJosef Bacik1-0/+1
We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-21btrfs: qgroup: fix sleep from invalid context bug in btrfs_qgroup_inherit()ChenXiaoSong1-8/+1
Syzkaller reported BUG as follows: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 Call Trace: <TASK> dump_stack_lvl+0xcd/0x134 __might_resched.cold+0x222/0x26b kmem_cache_alloc+0x2e7/0x3c0 update_qgroup_limit_item+0xe1/0x390 btrfs_qgroup_inherit+0x147b/0x1ee0 create_subvol+0x4eb/0x1710 btrfs_mksubvol+0xfe5/0x13f0 __btrfs_ioctl_snap_create+0x2b0/0x430 btrfs_ioctl_snap_create_v2+0x25a/0x520 btrfs_ioctl+0x2a1c/0x5ce0 __x64_sys_ioctl+0x193/0x200 do_syscall_64+0x35/0x80 Fix this by calling qgroup_dirty() on @dstqgroup, and update limit item in btrfs_run_qgroups() later outside of the spinlock context. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: qgroup: fix a typo in a commentChristophe JAILLET1-1/+1
Add a missing 'r'. s/qgoup/qgroup/ . Codespell does not catch that for some reason. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: skip subtree scan if it's too high to avoid low stall in ↵Qu Wenruo1-0/+19
btrfs_commit_transaction() Btrfs qgroup has a long history of bringing performance penalty in btrfs_commit_transaction(). Although we tried our best to migrate such impact, there is still an unsolved call site, btrfs_drop_snapshot(). This function will find the highest shared tree block and modify its extent ownership to do a subvolume/snapshot dropping. Such change will affect the whole subtree, and cause tons of qgroup dirty extents and stall btrfs_commit_transaction(). To avoid such problem, here we introduce a new sysfs interface, /sys/fs/btrfs/<uuid>/qgroups/drop_subptree_threshold, to determine at whether and at which level we should skip qgroup accounting for subtree dropping. The default value is BTRFS_MAX_LEVEL, thus every subtree drop will go through qgroup accounting, to ensure qgroup numbers are kept as consistent as possible. While for performance sensitive cases, add a way to change the values to more reasonable values like 3, to make any subtree, which is at or higher than level 3, to mark qgroup inconsistent and skip the accounting. The cost is obvious, the qgroup number is no longer consistent, but at least performance is more reasonable, and users have the control. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup ↵Qu Wenruo1-4/+11
accounting The new flag will make btrfs qgroup skip all its time consuming qgroup accounting. The lifespan is the same as BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN, only get cleared after a new rescan. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCANQu Wenruo1-17/+26
Introduce a new runtime flag, BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN, which will inform qgroup rescan to cancel its work asynchronously. This is to address the window when an operation makes qgroup numbers inconsistent (like qgroup inheriting) while a qgroup rescan is running. In that case, qgroup inconsistent flag will be cleared when qgroup rescan finishes. But we changed the ownership of some extents, which means the rescan is already meaningless, and the qgroup inconsistent flag should not be cleared. With the new flag, each time we set INCONSISTENT flag, we also set this new flag to inform any running qgroup rescan to exit immediately, and leaving the INCONSISTENT flag there. The new runtime flag can only be cleared when a new rescan is started. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: introduce BTRFS_QGROUP_STATUS_FLAGS_MASK for later expansionQu Wenruo1-2/+4
Currently we only have 3 qgroup flags: - BTRFS_QGROUP_STATUS_FLAG_ON - BTRFS_QGROUP_STATUS_FLAG_RESCAN - BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT These flags match the on-disk flags used in btrfs_qgroup_status. But we're going to introduce extra runtime flags which will not reach disks. So here we introduce a new mask, BTRFS_QGROUP_STATUS_FLAGS_MASK, to make sure only those flags can reach disks. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: fix race between quota enable and quota rescan ioctlFilipe Manana1-0/+15
When enabling quotas, at btrfs_quota_enable(), after committing the transaction, we change fs_info->quota_root to point to the quota root we created and set BTRFS_FS_QUOTA_ENABLED at fs_info->flags. Then we try to start the qgroup rescan worker, first by initializing it with a call to qgroup_rescan_init() - however if that fails we end up freeing the quota root but we leave fs_info->quota_root still pointing to it, this can later result in a use-after-free somewhere else. We have previously set the flags BTRFS_FS_QUOTA_ENABLED and BTRFS_QGROUP_STATUS_FLAG_ON, so we can only fail with -EINPROGRESS at btrfs_quota_enable(), which is possible if someone already called the quota rescan ioctl, and therefore started the rescan worker. So fix this by ignoring an -EINPROGRESS and asserting we can't get any other error. Reported-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/linux-btrfs/20220823015931.421355-1-yebin10@huawei.com/ CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: avoid blocking on space revervation when doing nowait dio writesFilipe Manana1-2/+3
When doing a NOWAIT direct IO write, if we can NOCOW then it means we can proceed with the non-blocking, NOWAIT path. However reserving the metadata space and qgroup meta space can often result in blocking - flushing delalloc, wait for ordered extents to complete, trigger transaction commits, etc, going against the semantics of a NOWAIT write. So make the NOWAIT write path to try to reserve all the metadata it needs without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT then return -EAGAIN to make the caller fallback to a blocking direct IO write. This is part of a patchset comprised of the following patches: btrfs: avoid blocking on page locks with nowait dio on compressed range btrfs: avoid blocking nowait dio when locking file range btrfs: avoid double nocow check when doing nowait dio writes btrfs: stop allocating a path when checking if cross reference exists btrfs: free path at can_nocow_extent() before checking for checksum items btrfs: release path earlier at can_nocow_extent() btrfs: avoid blocking when allocating context for nowait dio read/write btrfs: avoid blocking on space revervation when doing nowait dio writes The following test was run before and after applying this patchset: $ cat io-uring-nodatacow-test.sh #!/bin/bash DEV=/dev/sdc MNT=/mnt/sdc MOUNT_OPTIONS="-o ssd -o nodatacow" MKFS_OPTIONS="-R free-space-tree -O no-holes" NUM_JOBS=4 FILE_SIZE=8G RUN_TIME=300 cat <<EOF > /tmp/fio-job.ini [io_uring_rw] rw=randrw fsync=0 fallocate=posix group_reporting=1 direct=1 ioengine=io_uring iodepth=64 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 filesize=$FILE_SIZE runtime=$RUN_TIME time_based filename=foobar directory=$MNT numjobs=$NUM_JOBS thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test was run a 12 cores box with 64G of ram, using a non-debug kernel config (Debian's default config) and a spinning disk. Result before the patchset: READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec Result after the patchset: READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec That's about +7.2% throughput for reads and +6.9% for writes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: remove trivial wrapper btrfs_read_buffer()Filipe Manana1-1/+1
The function btrfs_read_buffer() is useless, it just calls btree_read_extent_buffer_pages() with exactly the same arguments. So remove it and rename btree_read_extent_buffer_pages() to btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_" prefix (since it's used outside disk-io.c) and the name is clear enough about what it does. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: qgroup: remove outdated TODO commentsSidong Yang1-12/+0
These comments are old, outdated and not very specific. It seems that it doesn't help to inspire anybody to work on that. So we remove them. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Sidong Yang <realwakka@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: qgroup: remove duplicated check in adding qgroup relationsSidong Yang1-9/+32
Removes duplicated check when adding qgroup relations. btrfs_add_qgroup_relations function adds relations by calling add_relation_rb(). add_relation_rb() checks that member/parentid exists in current qgroup_tree. But it already checked before calling the function. It seems that we don't need to double check. Add new function __add_relation_rb() that adds relations with qgroup structures and makes old function use the new one. And it makes btrfs_add_qgroup_relation() function work without double checks by calling the new function. Signed-off-by: Sidong Yang <realwakka@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comments ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: disable qgroups in extent tree v2Josef Bacik1-0/+6
Backref lookups are going to be drastically different with extent tree v2, disable qgroups until we do the work to add this support for extent tree v2. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: move QUOTA_ENABLED check to rescan_should_stop from ↵Nikolay Borisov1-7/+6
btrfs_qgroup_rescan_worker Instead of having 2 places that short circuit the qgroup leaf scan have everything in the qgroup_rescan_leaf function. In addition to that, also ensure that the inconsistent qgroup flag is set when rescan_should_stop returns true. This both retains the old behavior when -EINTR was set in the body of the loop and at the same time also extends this behavior when scanning is interrupted due to remount or unmount operations. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-02btrfs: qgroup: fix deadlock between rescan worker and remove qgroupSidong Yang1-1/+8
The commit e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker") by Kawasaki resolves deadlock between quota disable and qgroup rescan worker. But also there is a deadlock case like it. It's about enabling or disabling quota and creating or removing qgroup. It can be reproduced in simple script below. for i in {1..100} do btrfs quota enable /mnt & btrfs qgroup create 1/0 /mnt & btrfs qgroup destroy 1/0 /mnt & btrfs quota disable /mnt & done Here's why the deadlock happens: 1) The quota rescan task is running. 2) Task A calls btrfs_quota_disable(), locks the qgroup_ioctl_lock mutex, and then calls btrfs_qgroup_wait_for_completion(), to wait for the quota rescan task to complete. 3) Task B calls btrfs_remove_qgroup() and it blocks when trying to lock the qgroup_ioctl_lock mutex, because it's being held by task A. At that point task B is holding a transaction handle for the current transaction. 4) The quota rescan task calls btrfs_commit_transaction(). This results in it waiting for all other tasks to release their handles on the transaction, but task B is blocked on the qgroup_ioctl_lock mutex while holding a handle on the transaction, and that mutex is being held by task A, which is waiting for the quota rescan task to complete, resulting in a deadlock between these 3 tasks. To resolve this issue, the thread disabling quota should unlock qgroup_ioctl_lock before waiting rescan completion. Move btrfs_qgroup_wait_for_completion() after unlock of qgroup_ioctl_lock. Fixes: e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Sidong Yang <realwakka@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-31btrfs: fix deadlock between quota disable and qgroup rescan workerShin'ichiro Kawasaki1-2/+19
Quota disable ioctl starts a transaction before waiting for the qgroup rescan worker completes. However, this wait can be infinite and results in deadlock because of circular dependency among the quota disable ioctl, the qgroup rescan worker and the other task with transaction such as block group relocation task. The deadlock happens with the steps following: 1) Task A calls ioctl to disable quota. It starts a transaction and waits for qgroup rescan worker completes. 2) Task B such as block group relocation task starts a transaction and joins to the transaction that task A started. Then task B commits to the transaction. In this commit, task B waits for a commit by task A. 3) Task C as the qgroup rescan worker starts its job and starts a transaction. In this transaction start, task C waits for completion of the transaction that task A started and task B committed. This deadlock was found with fstests test case btrfs/115 and a zoned null_blk device. The test case enables and disables quota, and the block group reclaim was triggered during the quota disable by chance. The deadlock was also observed by running quota enable and disable in parallel with 'btrfs balance' command on regular null_blk devices. An example report of the deadlock: [372.469894] INFO: task kworker/u16:6:103 blocked for more than 122 seconds. [372.479944] Not tainted 5.16.0-rc8 #7 [372.485067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [372.493898] task:kworker/u16:6 state:D stack: 0 pid: 103 ppid: 2 flags:0x00004000 [372.503285] Workqueue: btrfs-qgroup-rescan btrfs_work_helper [btrfs] [372.510782] Call Trace: [372.514092] <TASK> [372.521684] __schedule+0xb56/0x4850 [372.530104] ? io_schedule_timeout+0x190/0x190 [372.538842] ? lockdep_hardirqs_on+0x7e/0x100 [372.547092] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [372.555591] schedule+0xe0/0x270 [372.561894] btrfs_commit_transaction+0x18bb/0x2610 [btrfs] [372.570506] ? btrfs_apply_pending_changes+0x50/0x50 [btrfs] [372.578875] ? free_unref_page+0x3f2/0x650 [372.585484] ? finish_wait+0x270/0x270 [372.591594] ? release_extent_buffer+0x224/0x420 [btrfs] [372.599264] btrfs_qgroup_rescan_worker+0xc13/0x10c0 [btrfs] [372.607157] ? lock_release+0x3a9/0x6d0 [372.613054] ? btrfs_qgroup_account_extent+0xda0/0xda0 [btrfs] [372.620960] ? do_raw_spin_lock+0x11e/0x250 [372.627137] ? rwlock_bug.part.0+0x90/0x90 [372.633215] ? lock_is_held_type+0xe4/0x140 [372.639404] btrfs_work_helper+0x1ae/0xa90 [btrfs] [372.646268] process_one_work+0x7e9/0x1320 [372.652321] ? lock_release+0x6d0/0x6d0 [372.658081] ? pwq_dec_nr_in_flight+0x230/0x230 [372.664513] ? rwlock_bug.part.0+0x90/0x90 [372.670529] worker_thread+0x59e/0xf90 [372.676172] ? process_one_work+0x1320/0x1320 [372.682440] kthread+0x3b9/0x490 [372.687550] ? _raw_spin_unlock_irq+0x24/0x50 [372.693811] ? set_kthread_struct+0x100/0x100 [372.700052] ret_from_fork+0x22/0x30 [372.705517] </TASK> [372.709747] INFO: task btrfs-transacti:2347 blocked for more than 123 seconds. [372.729827] Not tainted 5.16.0-rc8 #7 [372.745907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [372.767106] task:btrfs-transacti state:D stack: 0 pid: 2347 ppid: 2 flags:0x00004000 [372.787776] Call Trace: [372.801652] <TASK> [372.812961] __schedule+0xb56/0x4850 [372.830011] ? io_schedule_timeout+0x190/0x190 [372.852547] ? lockdep_hardirqs_on+0x7e/0x100 [372.871761] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [372.886792] schedule+0xe0/0x270 [372.901685] wait_current_trans+0x22c/0x310 [btrfs] [372.919743] ? btrfs_put_transaction+0x3d0/0x3d0 [btrfs] [372.938923] ? finish_wait+0x270/0x270 [372.959085] ? join_transaction+0xc75/0xe30 [btrfs] [372.977706] start_transaction+0x938/0x10a0 [btrfs] [372.997168] transaction_kthread+0x19d/0x3c0 [btrfs] [373.013021] ? btrfs_cleanup_transaction.isra.0+0xfc0/0xfc0 [btrfs] [373.031678] kthread+0x3b9/0x490 [373.047420] ? _raw_spin_unlock_irq+0x24/0x50 [373.064645] ? set_kthread_struct+0x100/0x100 [373.078571] ret_from_fork+0x22/0x30 [373.091197] </TASK> [373.105611] INFO: task btrfs:3145 blocked for more than 123 seconds. [373.114147] Not tainted 5.16.0-rc8 #7 [373.120401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [373.130393] task:btrfs state:D stack: 0 pid: 3145 ppid: 3141 flags:0x00004000 [373.140998] Call Trace: [373.145501] <TASK> [373.149654] __schedule+0xb56/0x4850 [373.155306] ? io_schedule_timeout+0x190/0x190 [373.161965] ? lockdep_hardirqs_on+0x7e/0x100 [373.168469] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [373.175468] schedule+0xe0/0x270 [373.180814] wait_for_commit+0x104/0x150 [btrfs] [373.187643] ? test_and_set_bit+0x20/0x20 [btrfs] [373.194772] ? kmem_cache_free+0x124/0x550 [373.201191] ? btrfs_put_transaction+0x69/0x3d0 [btrfs] [373.208738] ? finish_wait+0x270/0x270 [373.214704] ? __btrfs_end_transaction+0x347/0x7b0 [btrfs] [373.222342] btrfs_commit_transaction+0x44d/0x2610 [btrfs] [373.230233] ? join_transaction+0x255/0xe30 [btrfs] [373.237334] ? btrfs_record_root_in_trans+0x4d/0x170 [btrfs] [373.245251] ? btrfs_apply_pending_changes+0x50/0x50 [btrfs] [373.253296] relocate_block_group+0x105/0xc20 [btrfs] [373.260533] ? mutex_lock_io_nested+0x1270/0x1270 [373.267516] ? btrfs_wait_nocow_writers+0x85/0x180 [btrfs] [373.275155] ? merge_reloc_roots+0x710/0x710 [btrfs] [373.283602] ? btrfs_wait_ordered_extents+0xd30/0xd30 [btrfs] [373.291934] ? kmem_cache_free+0x124/0x550 [373.298180] btrfs_relocate_block_group+0x35c/0x930 [btrfs] [373.306047] btrfs_relocate_chunk+0x85/0x210 [btrfs] [373.313229] btrfs_balance+0x12f4/0x2d20 [btrfs] [373.320227] ? lock_release+0x3a9/0x6d0 [373.326206] ? btrfs_relocate_chunk+0x210/0x210 [btrfs] [373.333591] ? lock_is_held_type+0xe4/0x140 [373.340031] ? rcu_read_lock_sched_held+0x3f/0x70 [373.346910] btrfs_ioctl_balance+0x548/0x700 [btrfs] [373.354207] btrfs_ioctl+0x7f2/0x71b0 [btrfs] [373.360774] ? lockdep_hardirqs_on_prepare+0x410/0x410 [373.367957] ? lockdep_hardirqs_on_prepare+0x410/0x410 [373.375327] ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs] [373.383841] ? find_held_lock+0x2c/0x110 [373.389993] ? lock_release+0x3a9/0x6d0 [373.395828] ? mntput_no_expire+0xf7/0xad0 [373.402083] ? lock_is_held_type+0xe4/0x140 [373.408249] ? vfs_fileattr_set+0x9f0/0x9f0 [373.414486] ? selinux_file_ioctl+0x349/0x4e0 [373.420938] ? trace_raw_output_lock+0xb4/0xe0 [373.427442] ? selinux_inode_getsecctx+0x80/0x80 [373.434224] ? lockdep_hardirqs_on+0x7e/0x100 [373.440660] ? force_qs_rnp+0x2a0/0x6b0 [373.446534] ? lock_is_held_type+0x9b/0x140 [373.452763] ? __blkcg_punt_bio_submit+0x1b0/0x1b0 [373.459732] ? security_file_ioctl+0x50/0x90 [373.466089] __x64_sys_ioctl+0x127/0x190 [373.472022] do_syscall_64+0x3b/0x90 [373.477513] entry_SYSCALL_64_after_hwframe+0x44/0xae [373.484823] RIP: 0033:0x7f8f4af7e2bb [373.490493] RSP: 002b:00007ffcbf936178 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [373.500197] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8f4af7e2bb [373.509451] RDX: 00007ffcbf936220 RSI: 00000000c4009420 RDI: 0000000000000003 [373.518659] RBP: 00007ffcbf93774a R08: 0000000000000013 R09: 00007f8f4b02d4e0 [373.527872] R10: 00007f8f4ae87740 R11: 0000000000000246 R12: 0000000000000001 [373.537222] R13: 00007ffcbf936220 R14: 0000000000000000 R15: 0000000000000002 [373.546506] </TASK> [373.550878] INFO: task btrfs:3146 blocked for more than 123 seconds. [373.559383] Not tainted 5.16.0-rc8 #7 [373.565748] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [373.575748] task:btrfs state:D stack: 0 pid: 3146 ppid: 2168 flags:0x00000000 [373.586314] Call Trace: [373.590846] <TASK> [373.595121] __schedule+0xb56/0x4850 [373.600901] ? __lock_acquire+0x23db/0x5030 [373.607176] ? io_schedule_timeout+0x190/0x190 [373.613954] schedule+0xe0/0x270 [373.619157] schedule_timeout+0x168/0x220 [373.625170] ? usleep_range_state+0x150/0x150 [373.631653] ? mark_held_locks+0x9e/0xe0 [373.637767] ? do_raw_spin_lock+0x11e/0x250 [373.643993] ? lockdep_hardirqs_on_prepare+0x17b/0x410 [373.651267] ? _raw_spin_unlock_irq+0x24/0x50 [373.657677] ? lockdep_hardirqs_on+0x7e/0x100 [373.664103] wait_for_completion+0x163/0x250 [373.670437] ? bit_wait_timeout+0x160/0x160 [373.676585] btrfs_quota_disable+0x176/0x9a0 [btrfs] [373.683979] ? btrfs_quota_enable+0x12f0/0x12f0 [btrfs] [373.691340] ? down_write+0xd0/0x130 [373.696880] ? down_write_killable+0x150/0x150 [373.703352] btrfs_ioctl+0x3945/0x71b0 [btrfs] [373.710061] ? find_held_lock+0x2c/0x110 [373.716192] ? lock_release+0x3a9/0x6d0 [373.722047] ? __handle_mm_fault+0x23cd/0x3050 [373.728486] ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs] [373.737032] ? set_pte+0x6a/0x90 [373.742271] ? do_raw_spin_unlock+0x55/0x1f0 [373.748506] ? lock_is_held_type+0xe4/0x140 [373.754792] ? vfs_fileattr_set+0x9f0/0x9f0 [373.761083] ? selinux_file_ioctl+0x349/0x4e0 [373.767521] ? selinux_inode_getsecctx+0x80/0x80 [373.774247] ? __up_read+0x182/0x6e0 [373.780026] ? count_memcg_events.constprop.0+0x46/0x60 [373.787281] ? up_write+0x460/0x460 [373.792932] ? security_file_ioctl+0x50/0x90 [373.799232] __x64_sys_ioctl+0x127/0x190 [373.805237] do_syscall_64+0x3b/0x90 [373.810947] entry_SYSCALL_64_after_hwframe+0x44/0xae [373.818102] RIP: 0033:0x7f1383ea02bb [373.823847] RSP: 002b:00007fffeb4d71f8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [373.833641] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1383ea02bb [373.842961] RDX: 00007fffeb4d7210 RSI: 00000000c0109428 RDI: 0000000000000003 [373.852179] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078 [373.861408] R10: 00007f1383daec78 R11: 0000000000000202 R12: 00007fffeb4d874a [373.870647] R13: 0000000000493099 R14: 0000000000000001 R15: 0000000000000000 [373.879838] </TASK> [373.884018] Showing all locks held in the system: [373.894250] 3 locks held by kworker/4:1/58: [373.900356] 1 lock held by khungtaskd/63: [373.906333] #0: ffffffff8945ff60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 [373.917307] 3 locks held by kworker/u16:6/103: [373.923938] #0: ffff888127b4f138 ((wq_completion)btrfs-qgroup-rescan){+.+.}-{0:0}, at: process_one_work+0x712/0x1320 [373.936555] #1: ffff88810b817dd8 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x73f/0x1320 [373.951109] #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_qgroup_rescan_worker+0x1f6/0x10c0 [btrfs] [373.964027] 2 locks held by less/1803: [373.969982] #0: ffff88813ed56098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x24/0x80 [373.981295] #1: ffffc90000b3b2e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x9e2/0x1060 [373.992969] 1 lock held by btrfs-transacti/2347: [373.999893] #0: ffff88813d4887a8 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0xe3/0x3c0 [btrfs] [374.015872] 3 locks held by btrfs/3145: [374.022298] #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl_balance+0xc3/0x700 [btrfs] [374.034456] #1: ffff88813d48a0a0 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0xfe5/0x2d20 [btrfs] [374.047646] #2: ffff88813d488838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x354/0x930 [btrfs] [374.063295] 4 locks held by btrfs/3146: [374.069647] #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl+0x38b1/0x71b0 [btrfs] [374.081601] #1: ffff88813d488bb8 (&fs_info->subvol_sem){+.+.}-{3:3}, at: btrfs_ioctl+0x38fd/0x71b0 [btrfs] [374.094283] #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_disable+0xc8/0x9a0 [btrfs] [374.106885] #3: ffff88813d489800 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_disable+0xd5/0x9a0 [btrfs] [374.126780] ============================================= To avoid the deadlock, wait for the qgroup rescan worker to complete before starting the transaction for the quota disable ioctl. Clear BTRFS_FS_QUOTA_ENABLE flag before the wait and the transaction to request the worker to complete. On transaction start failure, set the BTRFS_FS_QUOTA_ENABLE flag again. These BTRFS_FS_QUOTA_ENABLE flag changes can be done safely since the function btrfs_quota_disable is not called concurrently because of fs_info->subvol_sem. Also check the BTRFS_FS_QUOTA_ENABLE flag in qgroup_rescan_init to avoid another qgroup rescan worker to start after the previous qgroup worker completed. CC: stable@vger.kernel.org # 5.4+ Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: stop accessing ->extent_root directlyJosef Bacik1-1/+4
When we start having multiple extent roots we'll need to use a helper to get to the correct extent_root. Rename fs_info->extent_root to _extent_root and convert all of the users of the extent root to using the btrfs_extent_root() helper. This will allow us to easily clean up the remaining direct accesses in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: fix deadlock between quota enable and other quota operationsFilipe Manana1-0/+19
When enabling quotas, we attempt to commit a transaction while holding the mutex fs_info->qgroup_ioctl_lock. This can result on a deadlock with other quota operations such as: - qgroup creation and deletion, ioctl BTRFS_IOC_QGROUP_CREATE; - adding and removing qgroup relations, ioctl BTRFS_IOC_QGROUP_ASSIGN. This is because these operations join a transaction and after that they attempt to lock the mutex fs_info->qgroup_ioctl_lock. Acquiring that mutex after joining or starting a transaction is a pattern followed everywhere in qgroups, so the quota enablement operation is the one at fault here, and should not commit a transaction while holding that mutex. Fix this by making the transaction commit while not holding the mutex. We are safe from two concurrent tasks trying to enable quotas because we are serialized by the rw semaphore fs_info->subvol_sem at btrfs_ioctl_quota_ctl(), which is the only call site for enabling quotas. When this deadlock happens, it produces a trace like the following: INFO: task syz-executor:25604 blocked for more than 143 seconds. Not tainted 5.15.0-rc6 #4 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor state:D stack:24800 pid:25604 ppid: 24873 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 btrfs_commit_transaction+0x994/0x2e90 fs/btrfs/transaction.c:2201 btrfs_quota_enable+0x95c/0x1790 fs/btrfs/qgroup.c:1120 btrfs_ioctl_quota_ctl fs/btrfs/ioctl.c:4229 [inline] btrfs_ioctl+0x637e/0x7b70 fs/btrfs/ioctl.c:5010 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f86920b2c4d RSP: 002b:00007f868f61ac58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f86921d90a0 RCX: 00007f86920b2c4d RDX: 0000000020005e40 RSI: 00000000c0109428 RDI: 0000000000000008 RBP: 00007f869212bd80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f86921d90a0 R13: 00007fff6d233e4f R14: 00007fff6d233ff0 R15: 00007f868f61adc0 INFO: task syz-executor:25628 blocked for more than 143 seconds. Not tainted 5.15.0-rc6 #4 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor state:D stack:29080 pid:25628 ppid: 24873 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425 __mutex_lock_common kernel/locking/mutex.c:669 [inline] __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729 btrfs_remove_qgroup+0xb7/0x7d0 fs/btrfs/qgroup.c:1548 btrfs_ioctl_qgroup_create fs/btrfs/ioctl.c:4333 [inline] btrfs_ioctl+0x683c/0x7b70 fs/btrfs/ioctl.c:5014 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Reported-by: Hao Sun <sunhao.th@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CACkBjsZQF19bQ1C6=yetF3BvL10OSORpFUcWXTP6HErshDB4dQ@mail.gmail.com/ Fixes: 340f1aa27f36 ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable") CC: stable@vger.kernel.org # 4.19 Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-15btrfs: fix invalid delayed ref after subvolume creation failureFilipe Manana1-1/+2
When creating a subvolume, at ioctl.c:create_subvol(), if we fail to insert the new root's root item into the root tree, we are freeing the metadata extent we reserved for the new root to prevent a metadata extent leak, as we don't abort the transaction at that point (since there is nothing at that point that is irreversible). However we allocated the metadata extent for the new root which we are creating for the new subvolume, so its delayed reference refers to the ID of this new root. But when we free the metadata extent we pass the root of the subvolume where the new subvolume is located to btrfs_free_tree_block() - this is incorrect because this will generate a delayed reference that refers to the ID of the parent subvolume's root, and not to ID of the new root. This results in a failure when running delayed references that leads to a transaction abort and a trace like the following: [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs] [3868.739857] Code: 68 0f 85 e6 fb ff (...) [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246 [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002 [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88 [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88 [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000 [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88 [3868.750725] FS: 00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000 [3868.752275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0 [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [3868.757803] Call Trace: [3868.758281] <TASK> [3868.758655] ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs] [3868.759827] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs] [3868.761047] btrfs_run_delayed_refs+0x86/0x210 [btrfs] [3868.762069] ? lock_acquired+0x19f/0x420 [3868.762829] btrfs_commit_transaction+0x69/0xb20 [btrfs] [3868.763860] ? _raw_spin_unlock+0x29/0x40 [3868.764614] ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs] [3868.765870] create_subvol+0x1d8/0x9a0 [btrfs] [3868.766766] btrfs_mksubvol+0x447/0x4c0 [btrfs] [3868.767669] ? preempt_count_add+0x49/0xa0 [3868.768444] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs] [3868.769639] ? _copy_from_user+0x66/0xa0 [3868.770391] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [3868.771495] btrfs_ioctl+0xd1e/0x35c0 [btrfs] [3868.772364] ? __slab_free+0x10a/0x360 [3868.773198] ? rcu_read_lock_sched_held+0x12/0x60 [3868.774121] ? lock_release+0x223/0x4a0 [3868.774863] ? lock_acquired+0x19f/0x420 [3868.775634] ? rcu_read_lock_sched_held+0x12/0x60 [3868.776530] ? trace_hardirqs_on+0x1b/0xe0 [3868.777373] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [3868.778280] ? kmem_cache_free+0x321/0x3c0 [3868.779011] ? __x64_sys_ioctl+0x83/0xb0 [3868.779718] __x64_sys_ioctl+0x83/0xb0 [3868.780387] do_syscall_64+0x3b/0xc0 [3868.781059] entry_SYSCALL_64_after_hwframe+0x44/0xae [3868.781953] RIP: 0033:0x7f281c59e957 [3868.782585] Code: 3c 1c 48 f7 d8 4c (...) [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957 [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003 [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36 [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003 [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc [3868.793765] </TASK> [3868.794037] irq event stamp: 0 [3868.794548] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.797086] softirqs last enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0 [3868.799284] ---[ end trace be24c7002fe27747 ]--- [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2 [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627 [3868.802056] item 0 key (237436928 169 0) itemoff 16250 itemsize 33 [3868.802863] extent refs 1 gen 1265 flags 2 [3868.803447] ref#0: tree block backref root 1610 (...) [3869.064354] item 114 key (241008640 169 0) itemoff 12488 itemsize 33 [3869.065421] extent refs 1 gen 1268 flags 2 [3869.066115] ref#0: tree block backref root 1689 (...) [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622 owner 0 offset 0 [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry Fix this by passing the new subvolume's root ID to btrfs_free_tree_block(). This requires changing the root argument of btrfs_free_tree_block() from struct btrfs_root * to a u64, since at this point during the subvolume creation we have not yet created the struct btrfs_root for the new subvolume, and btrfs_free_tree_block() only needs a root ID and nothing else from a struct btrfs_root. This was triggered by test case generic/475 from fstests. Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: remove ignore_offset argument from btrfs_find_all_roots()Filipe Manana1-4/+4
Currently all the callers of btrfs_find_all_roots() pass a value of false for its ignore_offset argument. This makes the argument pointless and we can remove it and make btrfs_find_all_roots() always pass false as the ignore_offset argument for btrfs_find_all_roots_safe(). So just do that. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-22btrfs: fix lock inversion problem when doing qgroup extent tracingFilipe Manana1-8/+30
At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a NULL value as the transaction handle argument, which makes that function take the commit_root_sem semaphore, which is necessary when we don't hold a transaction handle or any other mechanism to prevent a transaction commit from wiping out commit roots. However btrfs_qgroup_trace_extent_post() can be called in a context where we are holding a write lock on an extent buffer from a subvolume tree, namely from btrfs_truncate_inode_items(), called either during truncate or unlink operations. In this case we end up with a lock inversion problem because the commit_root_sem is a higher level lock, always supposed to be acquired before locking any extent buffer. Lockdep detects this lock inversion problem since we switched the extent buffer locks from custom locks to semaphores, and when running btrfs/158 from fstests, it reported the following trace: [ 9057.626435] ====================================================== [ 9057.627541] WARNING: possible circular locking dependency detected [ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted [ 9057.628961] ------------------------------------------------------ [ 9057.629867] kworker/u16:4/30781 is trying to acquire lock: [ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 9057.632542] but task is already holding lock: [ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs] [ 9057.635255] which lock already depends on the new lock. [ 9057.636292] the existing dependency chain (in reverse order) is: [ 9057.637240] -> #1 (&fs_info->commit_root_sem){++++}-{3:3}: [ 9057.638138] down_read+0x46/0x140 [ 9057.638648] btrfs_find_all_roots+0x41/0x80 [btrfs] [ 9057.639398] btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs] [ 9057.640283] btrfs_add_delayed_data_ref+0x418/0x490 [btrfs] [ 9057.641114] btrfs_free_extent+0x35/0xb0 [btrfs] [ 9057.641819] btrfs_truncate_inode_items+0x424/0xf70 [btrfs] [ 9057.642643] btrfs_evict_inode+0x454/0x4f0 [btrfs] [ 9057.643418] evict+0xcf/0x1d0 [ 9057.643895] do_unlinkat+0x1e9/0x300 [ 9057.644525] do_syscall_64+0x3b/0xc0 [ 9057.645110] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 9057.645835] -> #0 (btrfs-tree-00){++++}-{3:3}: [ 9057.646600] __lock_acquire+0x130e/0x2210 [ 9057.647248] lock_acquire+0xd7/0x310 [ 9057.647773] down_read_nested+0x4b/0x140 [ 9057.648350] __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 9057.649175] btrfs_read_lock_root_node+0x31/0x40 [btrfs] [ 9057.650010] btrfs_search_slot+0x537/0xc00 [btrfs] [ 9057.650849] scrub_print_warning_inode+0x89/0x370 [btrfs] [ 9057.651733] iterate_extent_inodes+0x1e3/0x280 [btrfs] [ 9057.652501] scrub_print_warning+0x15d/0x2f0 [btrfs] [ 9057.653264] scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs] [ 9057.654295] scrub_bio_end_io_worker+0x101/0x2e0 [btrfs] [ 9057.655111] btrfs_work_helper+0xf8/0x400 [btrfs] [ 9057.655831] process_one_work+0x247/0x5a0 [ 9057.656425] worker_thread+0x55/0x3c0 [ 9057.656993] kthread+0x155/0x180 [ 9057.657494] ret_from_fork+0x22/0x30 [ 9057.658030] other info that might help us debug this: [ 9057.659064] Possible unsafe locking scenario: [ 9057.659824] CPU0 CPU1 [ 9057.660402] ---- ---- [ 9057.660988] lock(&fs_info->commit_root_sem); [ 9057.661581] lock(btrfs-tree-00); [ 9057.662348] lock(&fs_info->commit_root_sem); [ 9057.663254] lock(btrfs-tree-00); [ 9057.663690] *** DEADLOCK *** [ 9057.664437] 4 locks held by kworker/u16:4/30781: [ 9057.665023] #0: ffff8e25922a1148 ((wq_completion)btrfs-scrub){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0 [ 9057.666260] #1: ffffabb3451ffe70 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0 [ 9057.667639] #2: ffff8e25922da198 (&ret->mutex){+.+.}-{3:3}, at: scrub_handle_errored_block.isra.0+0x5d2/0x1640 [btrfs] [ 9057.669017] #3: ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs] [ 9057.670408] stack backtrace: [ 9057.670976] CPU: 7 PID: 30781 Comm: kworker/u16:4 Not tainted 5.14.0-rc2-btrfs-next-93 #1 [ 9057.672030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 9057.673492] Workqueue: btrfs-scrub btrfs_work_helper [btrfs] [ 9057.674258] Call Trace: [ 9057.674588] dump_stack_lvl+0x57/0x72 [ 9057.675083] check_noncircular+0xf3/0x110 [ 9057.675611] __lock_acquire+0x130e/0x2210 [ 9057.676132] lock_acquire+0xd7/0x310 [ 9057.676605] ? __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 9057.677313] ? lock_is_held_type+0xe8/0x140 [ 9057.677849] down_read_nested+0x4b/0x140 [ 9057.678349] ? __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 9057.679068] __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 9057.679760] btrfs_read_lock_root_node+0x31/0x40 [btrfs] [ 9057.680458] btrfs_search_slot+0x537/0xc00 [btrfs] [ 9057.681083] ? _raw_spin_unlock+0x29/0x40 [ 9057.681594] ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs] [ 9057.682336] scrub_print_warning_inode+0x89/0x370 [btrfs] [ 9057.683058] ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs] [ 9057.683834] ? scrub_write_block_to_dev_replace+0xb0/0xb0 [btrfs] [ 9057.684632] iterate_extent_inodes+0x1e3/0x280 [btrfs] [ 9057.685316] scrub_print_warning+0x15d/0x2f0 [btrfs] [ 9057.685977] ? ___ratelimit+0xa4/0x110 [ 9057.686460] scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs] [ 9057.687316] scrub_bio_end_io_worker+0x101/0x2e0 [btrfs] [ 9057.688021] btrfs_work_helper+0xf8/0x400 [btrfs] [ 9057.688649] ? lock_is_held_type+0xe8/0x140 [ 9057.689180] process_one_work+0x247/0x5a0 [ 9057.689696] worker_thread+0x55/0x3c0 [ 9057.690175] ? process_one_work+0x5a0/0x5a0 [ 9057.690731] kthread+0x155/0x180 [ 9057.691158] ? set_kthread_struct+0x40/0x40 [ 9057.691697] ret_from_fork+0x22/0x30 Fix this by making btrfs_find_all_roots() never attempt to lock the commit_root_sem when it is called from btrfs_qgroup_trace_extent_post(). We can't just pass a non-NULL transaction handle to btrfs_find_all_roots() from btrfs_qgroup_trace_extent_post(), because that would make backref lookup not use commit roots and acquire read locks on extent buffers, and therefore could deadlock when btrfs_qgroup_trace_extent_post() is called from the btrfs_truncate_inode_items() code path which has acquired a write lock on an extent buffer of the subvolume btree. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22btrfs: send: fix crash when memory allocations trigger reclaimFilipe Manana1-7/+1
When doing a send we don't expect the task to ever start a transaction after the initial check that verifies if commit roots match the regular roots. This is because after that we set current->journal_info with a stub (special value) that signals we are in send context, so that we take a read lock on an extent buffer when reading it from disk and verifying it is valid (its generation matches the generation stored in the parent). This stub was introduced in 2014 by commit a26e8c9f75b0bf ("Btrfs: don't clear uptodate if the eb is under IO") in order to fix a concurrency issue between send and balance. However there is one particular exception where we end up needing to start a transaction and when this happens it results in a crash with a stack trace like the following: [60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80 [60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...) [60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1 [60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015 [60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80 [60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...) [60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246 [60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224 [60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000 [60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002 [60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001 [60015.902408] kernel: FS: 00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000 [60015.902410] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0 [60015.902414] kernel: Call Trace: [60015.902419] kernel: kfence_unprotect+0x13/0x30 [60015.902423] kernel: page_fault_oops+0x89/0x270 [60015.902427] kernel: ? search_module_extables+0xf/0x40 [60015.902431] kernel: ? search_bpf_extables+0x57/0x70 [60015.902435] kernel: kernelmode_fixup_or_oops+0xd6/0xf0 [60015.902437] kernel: __bad_area_nosemaphore+0x142/0x180 [60015.902440] kernel: exc_page_fault+0x67/0x150 [60015.902445] kernel: asm_exc_page_fault+0x1e/0x30 [60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580 [60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...) [60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246 [60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000 [60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039 [60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000 [60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001 [60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000 [60015.902468] kernel: btrfs_commit_inode_delayed_inode+0x5d/0x120 [60015.902473] kernel: btrfs_evict_inode+0x2c5/0x3f0 [60015.902476] kernel: evict+0xd1/0x180 [60015.902480] kernel: inode_lru_isolate+0xe7/0x180 [60015.902483] kernel: __list_lru_walk_one+0x77/0x150 [60015.902487] kernel: ? iput+0x1a0/0x1a0 [60015.902489] kernel: ? iput+0x1a0/0x1a0 [60015.902491] kernel: list_lru_walk_one+0x47/0x70 [60015.902495] kernel: prune_icache_sb+0x39/0x50 [60015.902497] kernel: super_cache_scan+0x161/0x1f0 [60015.902501] kernel: do_shrink_slab+0x142/0x240 [60015.902505] kernel: shrink_slab+0x164/0x280 [60015.902509] kernel: shrink_node+0x2c8/0x6e0 [60015.902512] kernel: do_try_to_free_pages+0xcb/0x4b0 [60015.902514] kernel: try_to_free_pages+0xda/0x190 [60015.902516] kernel: __alloc_pages_slowpath.constprop.0+0x373/0xcc0 [60015.902521] kernel: ? __memcg_kmem_charge_page+0xc2/0x1e0 [60015.902525] kernel: __alloc_pages_nodemask+0x30a/0x340 [60015.902528] kernel: pipe_write+0x30b/0x5c0 [60015.902531] kernel: ? set_next_entity+0xad/0x1e0 [60015.902534] kernel: ? switch_mm_irqs_off+0x58/0x440 [60015.902538] kernel: __kernel_write+0x13a/0x2b0 [60015.902541] kernel: kernel_write+0x73/0x150 [60015.902543] kernel: send_cmd+0x7b/0xd0 [60015.902545] kernel: send_extent_data+0x5a3/0x6b0 [60015.902549] kernel: process_extent+0x19b/0xed0 [60015.902551] kernel: btrfs_ioctl_send+0x1434/0x17e0 [60015.902554] kernel: ? _btrfs_ioctl_send+0xe1/0x100 [60015.902557] kernel: _btrfs_ioctl_send+0xbf/0x100 [60015.902559] kernel: ? enqueue_entity+0x18c/0x7b0 [60015.902562] kernel: btrfs_ioctl+0x185f/0x2f80 [60015.902564] kernel: ? psi_task_change+0x84/0xc0 [60015.902569] kernel: ? _flat_send_IPI_mask+0x21/0x40 [60015.902572] kernel: ? check_preempt_curr+0x2f/0x70 [60015.902576] kernel: ? selinux_file_ioctl+0x137/0x1e0 [60015.902579] kernel: ? expand_files+0x1cb/0x1d0 [60015.902582] kernel: ? __x64_sys_ioctl+0x82/0xb0 [60015.902585] kernel: __x64_sys_ioctl+0x82/0xb0 [60015.902588] kernel: do_syscall_64+0x33/0x40 [60015.902591] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae [60015.902595] kernel: RIP: 0033:0x7f158e38f0ab [60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...) [60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab [60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004 [60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640 [60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000 [60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0 [60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]--- This happens because when writing to the pipe, by calling kernel_write(), we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the gfp flags, which allow reclaim to happen if there is memory pressure. This allocation happens at fs/pipe.c:pipe_write(). If the reclaim is triggered, inode eviction can be triggered and that in turn can result in starting a transaction if the inode has a link count of 0. The transaction start happens early on during eviction, when we call btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if there is currently an open file descriptor for an inode with a link count of 0 and the reclaim task gets a reference on the inode before that descriptor is closed, in which case the reclaim task ends up doing the final iput that triggers the inode eviction. When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers the following assertion at transaction.c:start_transaction(): /* Send isn't supposed to start transactions. */ ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB); And when assertions are not enabled, it triggers a crash since after that assertion we cast current->journal_info into a transaction handle pointer and then dereference it: if (current->journal_info) { WARN_ON(type & TRANS_EXTWRITERS); h = current->journal_info; refcount_inc(&h->use_count); (...) Which obviously results in a crash due to an invalid memory access. The same type of issue can happen during other memory allocations we do directly in the send code with kmalloc (and friends) as they use GFP_KERNEL and therefore may trigger reclaim too, which started to happen since 2016 after commit e780b0d1c1523e ("btrfs: send: use GFP_KERNEL everywhere"). The issue could be solved by setting up a NOFS context for the entire send operation so that reclaim could not be triggered when allocating memory or pages through kernel_write(). However that is not very friendly and we can in fact get rid of the send stub because: 1) The stub was introduced way back in 2014 by commit a26e8c9f75b0bf ("Btrfs: don't clear uptodate if the eb is under IO") to solve an issue exclusive to when send and balance are running in parallel, however there were other problems between balance and send and we do not allow anymore to have balance and send run concurrently since commit 9e967495e0e0ae ("Btrfs: prevent send failures and crashes due to concurrent relocation"). More generically the issues are between send and relocation, and that last commit eliminated only the possibility of having send and balance run concurrently, but shrinking a device also can trigger relocation, and on zoned filesystems we have relocation of partially used block groups triggered automatically as well. The previous patch that has a subject of: "btrfs: ensure relocation never runs while we have send operations running" Addresses all the remaining cases that can trigger relocation. 2) We can actually allow starting and even committing transactions while in a send context if needed because send is not holding any locks that would block the start or the commit of a transaction. So get rid of all the logic added by commit a26e8c9f75b0bf ("Btrfs: don't clear uptodate if the eb is under IO"). We can now always call clear_extent_buffer_uptodate() at verify_parent_transid() since send is the only case that uses commit roots without having a transaction open or without holding the commit_root_sem. Reported-by: Chris Murphy <lists@colorremedies.com> Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22btrfs: fix typos in commentsDavid Sterba1-1/+1
Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28btrfs: fix deadlock when cloning inline extents and using qgroupsFilipe Manana1-1/+1
There are a few exceptional cases where cloning an inline extent needs to copy the inline extent data into a page of the destination inode. When this happens, we end up starting a transaction while having a dirty page for the destination inode and while having the range locked in the destination's inode iotree too. Because when reserving metadata space for a transaction we may need to flush existing delalloc in case there is not enough free space, we have a mechanism in place to prevent a deadlock, which was introduced in commit 3d45f221ce627d ("btrfs: fix deadlock when cloning inline extent and low on free metadata space"). However when using qgroups, a transaction also reserves metadata qgroup space, which can also result in flushing delalloc in case there is not enough available space at the moment. When this happens we deadlock, since flushing delalloc requires locking the file range in the inode's iotree and the range was already locked at the very beginning of the clone operation, before attempting to start the transaction. When this issue happens, stack traces like the following are reported: [72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000 [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142) [72747.556271] Call Trace: [72747.556273] __schedule+0x296/0x760 [72747.556277] schedule+0x3c/0xa0 [72747.556279] io_schedule+0x12/0x40 [72747.556284] __lock_page+0x13c/0x280 [72747.556287] ? generic_file_readonly_mmap+0x70/0x70 [72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs] [72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160 [72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs] [72747.556362] ? update_group_capacity+0x25/0x210 [72747.556366] ? cpumask_next_and+0x1a/0x20 [72747.556391] extent_writepages+0x44/0xa0 [btrfs] [72747.556394] do_writepages+0x41/0xd0 [72747.556398] __writeback_single_inode+0x39/0x2a0 [72747.556403] writeback_sb_inodes+0x1ea/0x440 [72747.556407] __writeback_inodes_wb+0x5f/0xc0 [72747.556410] wb_writeback+0x235/0x2b0 [72747.556414] ? get_nr_inodes+0x35/0x50 [72747.556417] wb_workfn+0x354/0x490 [72747.556420] ? newidle_balance+0x2c5/0x3e0 [72747.556424] process_one_work+0x1aa/0x340 [72747.556426] worker_thread+0x30/0x390 [72747.556429] ? create_worker+0x1a0/0x1a0 [72747.556432] kthread+0x116/0x130 [72747.556435] ? kthread_park+0x80/0x80 [72747.556438] ret_from_fork+0x1f/0x30 [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [72747.566961] Call Trace: [72747.566964] __schedule+0x296/0x760 [72747.566968] ? finish_wait+0x80/0x80 [72747.566970] schedule+0x3c/0xa0 [72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs] [72747.566999] ? finish_wait+0x80/0x80 [72747.567024] lock_extent_bits+0x37/0x90 [btrfs] [72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs] [72747.567051] ? find_get_pages_range_tag+0x2cd/0x380 [72747.567076] __extent_writepage+0x203/0x320 [btrfs] [72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs] [72747.567106] ? update_load_avg+0x7e/0x5f0 [72747.567109] ? enqueue_entity+0xf4/0x6f0 [72747.567134] extent_writepages+0x44/0xa0 [btrfs] [72747.567137] ? enqueue_task_fair+0x93/0x6f0 [72747.567140] do_writepages+0x41/0xd0 [72747.567144] __filemap_fdatawrite_range+0xc7/0x100 [72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs] [72747.567200] process_one_work+0x1aa/0x340 [72747.567202] worker_thread+0x30/0x390 [72747.567205] ? create_worker+0x1a0/0x1a0 [72747.567208] kthread+0x116/0x130 [72747.567211] ? kthread_park+0x80/0x80 [72747.567214] ret_from_fork+0x1f/0x30 [72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000 [72747.569689] Call Trace: [72747.569691] __schedule+0x296/0x760 [72747.569694] schedule+0x3c/0xa0 [72747.569721] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569725] ? finish_wait+0x80/0x80 [72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs] [72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs] [72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs] [72747.569810] ? path_lookupat.isra.48+0x97/0x140 [72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs] [72747.569836] ? __kmalloc+0x16a/0x2c0 [72747.569839] do_iter_readv_writev+0x160/0x1c0 [72747.569843] do_iter_write+0x80/0x1b0 [72747.569847] vfs_writev+0x84/0x140 [72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs] [72747.569873] do_writev+0x65/0x100 [72747.569876] do_syscall_64+0x33/0x40 [72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000 [72747.569903] Call Trace: [72747.569906] __schedule+0x296/0x760 [72747.569909] schedule+0x3c/0xa0 [72747.569936] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569940] ? finish_wait+0x80/0x80 [72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs] [72747.569989] start_transaction+0x279/0x580 [btrfs] [72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs] [72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs] [72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs] [72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs] [72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs] [72747.570126] do_clone_file_range+0xed/0x200 [72747.570131] vfs_clone_file_range+0x37/0x110 [72747.570134] ioctl_file_clone+0x7d/0xb0 [72747.570137] do_vfs_ioctl+0x138/0x630 [72747.570140] __x64_sys_ioctl+0x62/0xc0 [72747.570143] do_syscall_64+0x33/0x40 [72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9 So fix this by skipping the flush of delalloc for an inode that is flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under such a special case of cloning an inline extent, when flushing delalloc during qgroup metadata reservation. The special cases for cloning inline extents were added in kernel 5.7 by by commit 05a5a7621ce66c ("Btrfs: implement full reflink support for inline extents"), while having qgroup metadata space reservation flushing delalloc when low on space was added in kernel 5.9 by commit c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable kernel backports. Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/ Fixes: c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") CC: stable@vger.kernel.org # 5.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28btrfs: do not consider send context as valid when trying to flush qgroupsFilipe Manana1-5/+9
At qgroup.c:try_flush_qgroup() we are asserting that current->journal_info is either NULL or has the value BTRFS_SEND_TRANS_STUB. However allowing for BTRFS_SEND_TRANS_STUB makes no sense because: 1) It is misleading, because send operations are read-only and do not ever need to reserve qgroup space; 2) We already assert that current->journal_info != BTRFS_SEND_TRANS_STUB at transaction.c:start_transaction(); 3) On a kernel without CONFIG_BTRFS_ASSERT=y set, it would result in a crash if try_flush_qgroup() is ever called in a send context, because at transaction.c:start_transaction we cast current->journal_info into a struct btrfs_trans_handle pointer and then dereference it. So just do allow a send context at try_flush_qgroup() and update the comment about it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: move the tree mod log code into its own fileFilipe Manana1-4/+5
The tree modification log, which records modifications done to btrees, is quite large and currently spread all over ctree.c, which is a huge file already. To make things better organized, move all that code into its own separate source and header files. Functions and definitions that are used outside of the module (mostly by ctree.c) are renamed so that they start with a "btrfs_" prefix. Everything else remains unchanged. This makes it easier to go over the tree modification log code every time I need to go read it to fix a bug. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>