summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2021-03-18io_uring: don't leak creds on SQO attach errorPavel Begunkov1-11/+6
Attaching to already dead/dying SQPOLL task is disallowed in io_sq_offload_create(), but cleanup is hand coded by calling io_put_sq_data()/etc., that miss to put ctx->sq_creds. Defer everything to error-path io_sq_thread_finish(), adding ctx->sqd_list in the error case as well as finish will handle it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-18io_uring: use typesafe pointers in io_uring_taskStefan Metzmacher1-2/+2
Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://lore.kernel.org/r/ce2a598e66e48347bb04afbaf2acc67c0cc7971a.1615809009.git.metze@samba.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-18io_uring: remove structures from include/linux/io_uring.hStefan Metzmacher2-1/+25
Link: https://lore.kernel.org/r/8c1d14f3748105f4caeda01716d47af2fa41d11c.1615809009.git.metze@samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-18io_uring: imply MSG_NOSIGNAL for send[msg]()/recv[msg]() callsStefan Metzmacher1-4/+4
We never want to generate any SIGPIPE, -EPIPE only is much better. Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://lore.kernel.org/r/38961085c3ec49fd21550c7788f214d1ff02d2d4.1615908477.git.metze@samba.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-17btrfs: fix subvolume/snapshot deletion not triggered on mountFilipe Manana1-1/+15
During the mount procedure we are calling btrfs_orphan_cleanup() against the root tree, which will find all orphans items in this tree. When an orphan item corresponds to a deleted subvolume/snapshot (instead of an inode space cache), it must not delete the orphan item, because that will cause btrfs_find_orphan_roots() to not find the orphan item and therefore not add the corresponding subvolume root to the list of dead roots, which results in the subvolume's tree never being deleted by the cleanup thread. The same applies to the remount from RO to RW path. Fix this by making btrfs_find_orphan_roots() run before calling btrfs_orphan_cleanup() against the root tree. A test case for fstests will follow soon. Reported-by: Robbie Ko <robbieko@synology.com> Link: https://lore.kernel.org/linux-btrfs/b19f4310-35e0-606e-1eea-2dd84d28c5da@synology.com/ Fixes: 638331fa56caea ("btrfs: fix transaction leak and crash after cleaning up orphans on RO mount") CC: stable@vger.kernel.org # 5.11+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17btrfs: fix build when using M=fs/btrfsDavid Sterba1-4/+6
There are people building the module with M= that's supposed to be used for external modules. This got broken in e9aa7c285d20 ("btrfs: enable W=1 checks for btrfs"). $ make M=fs/btrfs scripts/Makefile.lib:10: *** Recursive variable 'KBUILD_CFLAGS' references itself (eventually). Stop. make: *** [Makefile:1755: modules] Error 2 There's a difference compared to 'make fs/btrfs/btrfs.ko' which needs to rebuild a few more things and also the dependency modules need to be available. It could fail with eg. WARNING: Symbol version dump "Module.symvers" is missing. Modules may not have dependencies or modversions. In some environments it's more convenient to rebuild just the btrfs module by M= so let's make it work. The problem is with recursive variable evaluation in += so the conditional C options are stored in a temporary variable to avoid the recursion. Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17btrfs: do not initialize dev replace for bad dev rootJosef Bacik1-0/+3
While helping Neal fix his broken file system I added a debug patch to catch if we were calling btrfs_search_slot with a NULL root, and this stack trace popped: we tried to search with a NULL root CPU: 0 PID: 1760 Comm: mount Not tainted 5.11.0-155.nealbtrfstest.1.fc34.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020 Call Trace: dump_stack+0x6b/0x83 btrfs_search_slot.cold+0x11/0x1b ? btrfs_init_dev_replace+0x36/0x450 btrfs_init_dev_replace+0x71/0x450 open_ctree+0x1054/0x1610 btrfs_mount_root.cold+0x13/0xfa legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x131/0x3d0 ? legacy_get_tree+0x27/0x40 ? btrfs_show_options+0x640/0x640 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x441/0xa80 __x64_sys_mount+0xf4/0x130 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f644730352e Fix this by not starting the device replace stuff if we do not have a NULL dev root. Reported-by: Neal Gompa <ngompa13@gmail.com> CC: stable@vger.kernel.org # 5.11+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17btrfs: initialize device::fs_info alwaysJosef Bacik1-1/+2
Neal reported a panic trying to use -o rescue=all BUG: kernel NULL pointer dereference, address: 0000000000000030 PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 696 Comm: mount Tainted: G W 5.12.0-rc2+ #296 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:btrfs_device_init_dev_stats+0x1d/0x200 RSP: 0018:ffffafaec1483bb8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff9a5715bcb298 RCX: 0000000000000070 RDX: ffff9a5703248000 RSI: ffff9a57052ea150 RDI: ffff9a5715bca400 RBP: ffff9a57052ea150 R08: 0000000000000070 R09: ffff9a57052ea150 R10: 000130faf0741c10 R11: 0000000000000000 R12: ffff9a5703700000 R13: 0000000000000000 R14: ffff9a5715bcb278 R15: ffff9a57052ea150 FS: 00007f600d122c40(0000) GS:ffff9a577bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 0000000112a46005 CR4: 0000000000370ef0 Call Trace: ? btrfs_init_dev_stats+0x1f/0xf0 ? kmem_cache_alloc+0xef/0x1f0 btrfs_init_dev_stats+0x5f/0xf0 open_ctree+0x10cb/0x1720 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x433/0xa00 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae This happens because when we call btrfs_init_dev_stats we do device->fs_info->dev_root. However device->fs_info isn't initialized because we were only calling btrfs_init_devices_late() if we properly read the device root. However we don't actually need the device root to init the devices, this function simply assigns the devices their ->fs_info pointer properly, so this needs to be done unconditionally always so that we can properly dereference device->fs_info in rescue cases. Reported-by: Neal Gompa <ngompa13@gmail.com> CC: stable@vger.kernel.org # 5.11+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17btrfs: do not initialize dev stats if we have no dev_rootJosef Bacik1-0/+3
Neal reported a panic trying to use -o rescue=all BUG: kernel NULL pointer dereference, address: 0000000000000030 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 4095 Comm: mount Not tainted 5.11.0-0.rc7.149.fc34.x86_64 #1 RIP: 0010:btrfs_device_init_dev_stats+0x4c/0x1f0 RSP: 0018:ffffa60285fbfb68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88b88f806498 RCX: ffff88b82e7a2a10 RDX: ffffa60285fbfb97 RSI: ffff88b82e7a2a10 RDI: 0000000000000000 RBP: ffff88b88f806b3c R08: 0000000000000000 R09: 0000000000000000 R10: ffff88b82e7a2a10 R11: 0000000000000000 R12: ffff88b88f806a00 R13: ffff88b88f806478 R14: ffff88b88f806a00 R15: ffff88b82e7a2a10 FS: 00007f698be1ec40(0000) GS:ffff88b937e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 0000000092c9c006 CR4: 00000000003706f0 Call Trace: ? btrfs_init_dev_stats+0x1f/0xf0 btrfs_init_dev_stats+0x62/0xf0 open_ctree+0x1019/0x15ff btrfs_mount_root.cold+0x13/0xfa legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x131/0x3d0 ? legacy_get_tree+0x27/0x40 ? btrfs_show_options+0x640/0x640 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x441/0xa80 __x64_sys_mount+0xf4/0x130 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f698c04e52e This happens because we unconditionally attempt to initialize device stats on mount, but we may not have been able to read the device root. Fix this by skipping initializing the device stats if we do not have a device root. Reported-by: Neal Gompa <ngompa13@gmail.com> CC: stable@vger.kernel.org # 5.11+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17btrfs: zoned: remove outdated WARN_ON in direct IOJohannes Thumshirn1-4/+0
In btrfs_submit_direct() there's a WAN_ON_ONCE() that will trigger if we're submitting a DIO write on a zoned filesystem but are not using REQ_OP_ZONE_APPEND to submit the IO to the block device. This is a left over from a previous version where btrfs_dio_iomap_begin() didn't use btrfs_use_zone_append() to check for sequential write only zones. It is an oversight from the development phase. In v11 (I think) I've added 08f455593fff ("btrfs: zoned: cache if block group is on a sequential zone") and forgot to remove the WARN_ON_ONCE() for 544d24f9de73 ("btrfs: zoned: enable zone append writing for direct IO"). When developing auto relocation I got hit by the WARN as a block groups where relocated to conventional zone and the dio code calls btrfs_use_zone_append() introduced by 08f455593fff to check if it can use zone append (a.k.a. if it's a sequential zone) or not and sets the appropriate flags for iomap. I've never hit it in testing before, as I was relying on emulation to test the conventional zones code but this one case wasn't hit, because on emulation fs_info->max_zone_append_size is 0 and the WARN doesn't trigger either. Fixes: 544d24f9de73 ("btrfs: zoned: enable zone append writing for direct IO") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone()Chao Yu1-4/+3
In zonefs_open_zone(), if opened zone count is larger than .s_max_open_zones threshold, we missed to recover .i_wr_refcnt, fix this. Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-03-17kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()Oleg Nesterov1-6/+4
Preparation for fixing get_nr_restart_syscall() on X86 for COMPAT. Add a new helper which sets restart_block->fn and calls a dummy arch_set_restart_data() helper. Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210201174641.GA17871@redhat.com
2021-03-16btrfs: always pin deleted leaves when there are active tree mod log usersFilipe Manana1-1/+22
When freeing a tree block we may end up adding its extent back to the free space cache/tree, as long as there are no more references for it, it was created in the current transaction and writeback for it never happened. This is generally fine, however when we have tree mod log operations it can result in inconsistent versions of a btree after unwinding extent buffers with the recorded tree mod log operations. This is because: * We only log operations for nodes (adding and removing key/pointers), for leaves we don't do anything; * This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation for a node that points to a leaf that was deleted; * Before we apply the logged operation to unwind a node, we can have that leaf's extent allocated again, either as a node or as a leaf, and possibly for another btree. This is possible if the leaf was created in the current transaction and writeback for it never started, in which case btrfs_free_tree_block() returns its extent back to the free space cache/tree; * Then, before applying the tree mod log operation, some task allocates the metadata extent just freed before, and uses it either as a leaf or as a node for some btree (can be the same or another one, it does not matter); * After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now get the target node with an item pointing to the metadata extent that now has content different from what it had before the leaf was deleted. It might now belong to a different btree and be a node and not a leaf anymore. As a consequence, the results of searches after the unwinding can be unpredictable and produce unexpected results. So make sure we pin extent buffers corresponding to leaves when there are tree mod log users. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-16btrfs: fix race when cloning extent buffer during rewind of an old rootFilipe Manana1-0/+2
While resolving backreferences, as part of a logical ino ioctl call or fiemap, we can end up hitting a BUG_ON() when replaying tree mod log operations of a root, triggering a stack trace like the following: ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:1210! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G W 5.11.0-2d11c0084b02-misc-next+ #89 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0 Code: 05 48 8d 74 10 (...) RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6 RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20 R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c FS: 00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0 Call Trace: btrfs_search_old_slot+0x265/0x10d0 ? lock_acquired+0xbb/0x600 ? btrfs_search_slot+0x1090/0x1090 ? free_extent_buffer.part.61+0xd7/0x140 ? free_extent_buffer+0x13/0x20 resolve_indirect_refs+0x3e9/0xfc0 ? lock_downgrade+0x3d0/0x3d0 ? __kasan_check_read+0x11/0x20 ? add_prelim_ref.part.11+0x150/0x150 ? lock_downgrade+0x3d0/0x3d0 ? __kasan_check_read+0x11/0x20 ? lock_acquired+0xbb/0x600 ? __kasan_check_write+0x14/0x20 ? do_raw_spin_unlock+0xa8/0x140 ? rb_insert_color+0x30/0x360 ? prelim_ref_insert+0x12d/0x430 find_parent_nodes+0x5c3/0x1830 ? resolve_indirect_refs+0xfc0/0xfc0 ? lock_release+0xc8/0x620 ? fs_reclaim_acquire+0x67/0xf0 ? lock_acquire+0xc7/0x510 ? lock_downgrade+0x3d0/0x3d0 ? lockdep_hardirqs_on_prepare+0x160/0x210 ? lock_release+0xc8/0x620 ? fs_reclaim_acquire+0x67/0xf0 ? lock_acquire+0xc7/0x510 ? poison_range+0x38/0x40 ? unpoison_range+0x14/0x40 ? trace_hardirqs_on+0x55/0x120 btrfs_find_all_roots_safe+0x142/0x1e0 ? find_parent_nodes+0x1830/0x1830 ? btrfs_inode_flags_to_xflags+0x50/0x50 iterate_extent_inodes+0x20e/0x580 ? tree_backref_for_extent+0x230/0x230 ? lock_downgrade+0x3d0/0x3d0 ? read_extent_buffer+0xdd/0x110 ? lock_downgrade+0x3d0/0x3d0 ? __kasan_check_read+0x11/0x20 ? lock_acquired+0xbb/0x600 ? __kasan_check_write+0x14/0x20 ? _raw_spin_unlock+0x22/0x30 ? __kasan_check_write+0x14/0x20 iterate_inodes_from_logical+0x129/0x170 ? iterate_inodes_from_logical+0x129/0x170 ? btrfs_inode_flags_to_xflags+0x50/0x50 ? iterate_extent_inodes+0x580/0x580 ? __vmalloc_node+0x92/0xb0 ? init_data_container+0x34/0xb0 ? init_data_container+0x34/0xb0 ? kvmalloc_node+0x60/0x80 btrfs_ioctl_logical_to_ino+0x158/0x230 btrfs_ioctl+0x205e/0x4040 ? __might_sleep+0x71/0xe0 ? btrfs_ioctl_get_supported_features+0x30/0x30 ? getrusage+0x4b6/0x9c0 ? __kasan_check_read+0x11/0x20 ? lock_release+0xc8/0x620 ? __might_fault+0x64/0xd0 ? lock_acquire+0xc7/0x510 ? lock_downgrade+0x3d0/0x3d0 ? lockdep_hardirqs_on_prepare+0x210/0x210 ? lockdep_hardirqs_on_prepare+0x210/0x210 ? __kasan_check_read+0x11/0x20 ? do_vfs_ioctl+0xfc/0x9d0 ? ioctl_file_clone+0xe0/0xe0 ? lock_downgrade+0x3d0/0x3d0 ? lockdep_hardirqs_on_prepare+0x210/0x210 ? __kasan_check_read+0x11/0x20 ? lock_release+0xc8/0x620 ? __task_pid_nr_ns+0xd3/0x250 ? lock_acquire+0xc7/0x510 ? __fget_files+0x160/0x230 ? __fget_light+0xf2/0x110 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fd1976e2427 Code: 00 00 90 48 8b 05 (...) RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427 RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004 RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120 R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004 R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8 Modules linked in: ---[ end trace ec8931a1c36e57be ]--- (gdb) l *(__tree_mod_log_rewind+0x3b1) 0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210). 1205 * the modification. as we're going backwards, we do the 1206 * opposite of each operation here. 1207 */ 1208 switch (tm->op) { 1209 case MOD_LOG_KEY_REMOVE_WHILE_FREEING: 1210 BUG_ON(tm->slot < n); 1211 fallthrough; 1212 case MOD_LOG_KEY_REMOVE_WHILE_MOVING: 1213 case MOD_LOG_KEY_REMOVE: 1214 btrfs_set_node_key(eb, &tm->key, tm->slot); Here's what happens to hit that BUG_ON(): 1) We have one tree mod log user (through fiemap or the logical ino ioctl), with a sequence number of 1, so we have fs_info->tree_mod_seq == 1; 2) Another task is at ctree.c:balance_level() and we have eb X currently as the root of the tree, and we promote its single child, eb Y, as the new root. Then, at ctree.c:balance_level(), we call: tree_mod_log_insert_root(eb X, eb Y, 1); 3) At tree_mod_log_insert_root() we create tree mod log elements for each slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each with a ->logical pointing to ebX->start. These are placed in an array named tm_list. Lets assume there are N elements (N pointers in eb X); 4) Then, still at tree_mod_log_insert_root(), we create a tree mod log element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set to the level of eb X and ->generation set to the generation of eb X; 5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with tm_list as argument. After that, tree_mod_log_free_eb() calls __tree_mod_log_insert() for each member of tm_list in reverse order, from highest slot in eb X, slot N - 1, to slot 0 of eb X; 6) __tree_mod_log_insert() sets the sequence number of each given tree mod log operation - it increments fs_info->tree_mod_seq and sets fs_info->tree_mod_seq as the sequence number of the given tree mod log operation. This means that for the tm_list created at tree_mod_log_insert_root(), the element corresponding to slot 0 of eb X has the highest sequence number (1 + N), and the element corresponding to the last slot has the lowest sequence number (2); 7) Then, after inserting tm_list's elements into the tree mod log rbtree, the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest sequence number, which is N + 2; 8) Back to ctree.c:balance_level(), we free eb X by calling btrfs_free_tree_block() on it. Because eb X was created in the current transaction, has no other references and writeback did not happen for it, we add it back to the free space cache/tree; 9) Later some other task T allocates the metadata extent from eb X, since it is marked as free space in the space cache/tree, and uses it as a node for some other btree; 10) The tree mod log user task calls btrfs_search_old_slot(), which calls get_old_root(), and finally that calls __tree_mod_log_oldest_root() with time_seq == 1 and eb_root == eb Y; 11) First iteration of the while loop finds the tree mod log element with sequence number N + 2, for the logical address of eb Y and of type MOD_LOG_ROOT_REPLACE; 12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out of the loop, and set root_logical to point to tm->old_root.logical which corresponds to the logical address of eb X; 13) On the next iteration of the while loop, the call to tree_mod_log_search_oldest() returns the smallest tree mod log element for the logical address of eb X, which has a sequence number of 2, an operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to the old slot N - 1 of eb X (eb X had N items in it before being freed); 14) We then break out of the while loop and return the tree mod log operation of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of eb X, to get_old_root(); 15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation and set "logical" to the logical address of eb X, which was the old root. We then call tree_mod_log_search() passing it the logical address of eb X and time_seq == 1; 16) Then before calling tree_mod_log_search(), task T adds a key to eb X, which results in adding a tree mod log operation of type MOD_LOG_KEY_ADD to the tree mod log - this is done at ctree.c:insert_ptr() - but after adding the tree mod log operation and before updating the number of items in eb X from 0 to 1... 17) The task at get_old_root() calls tree_mod_log_search() and gets the tree mod log operation of type MOD_LOG_KEY_ADD just added by task T. Then it enters the following if branch: if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) { (...) } (...) Calls read_tree_block() for eb X, which gets a reference on eb X but does not lock it - task T has it locked. Then it clones eb X while it has nritems set to 0 in its header, before task T sets nritems to 1 in eb X's header. From hereupon we use the clone of eb X which no other task has access to; 18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD mod log operation we just got from tree_mod_log_search() in the previous step and the cloned version of eb X; 19) At __tree_mod_log_rewind(), we set the local variable "n" to the number of items set in eb X's clone, which is 0. Then we enter the while loop, and in its first iteration we process the MOD_LOG_KEY_ADD operation, which just decrements "n" from 0 to (u32)-1, since "n" is declared with a type of u32. At the end of this iteration we call rb_next() to find the next tree mod log operation for eb X, that gives us the mod log operation of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence number of N + 1 (steps 3 to 6); 20) Then we go back to the top of the while loop and trigger the following BUG_ON(): (...) switch (tm->op) { case MOD_LOG_KEY_REMOVE_WHILE_FREEING: BUG_ON(tm->slot < n); fallthrough; (...) Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0. Fix this by taking a read lock on the extent buffer before cloning it at ctree.c:get_old_root(). This should be done regardless of the extent buffer having been freed and reused, as a concurrent task might be modifying it (while holding a write lock on it). Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/ Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-16btrfs: fix slab cache flags for free space tree bitmapDavid Sterba1-1/+1
The free space tree bitmap slab cache is created with SLAB_RED_ZONE but that's a debugging flag and not always enabled. Also the other slabs are created with at least SLAB_MEM_SPREAD that we want as well to average the memory placement cost. Reported-by: Vlastimil Babka <vbabka@suse.cz> Fixes: 3acd48507dc4 ("btrfs: fix allocation of free space cache v1 bitmap pages") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-16Merge tag 'fuse-fixes-5.12-rc4' of ↵Linus Torvalds3-11/+25
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Fix a deadlock and a couple of other bugs" * tag 'fuse-fixes-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: 32-bit user space ioctl compat for fuse device virtiofs: Fail dax mount if device does not support it fuse: fix live lock in fuse_iget()
2021-03-16Merge tag 'nfsd-5.12-2' of ↵Linus Torvalds6-44/+20
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Miscellaneous NFSD fixes for v5.12-rc" * tag 'nfsd-5.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: svcrdma: Revert "svcrdma: Reduce Receive doorbell rate" NFSD: fix error handling in NFSv4.0 callbacks NFSD: fix dest to src mount in inter-server COPY Revert "nfsd4: a client's own opens needn't prevent delegations" Revert "nfsd4: remove check_conflicting_opens warning" rpc: fix NULL dereference on kmalloc failure sunrpc: fix refcount leak for rpc auth modules NFSD: Repair misuse of sv_lock in 5.10.16-rt30. nfsd: don't abort copies early fs: nfsd: fix kconfig dependency warning for NFSD_V4 svcrdma: disable timeouts on rdma backchannel nfsd: Don't keep looking up unhashed files in the nfsd file cache
2021-03-16fanotify: support limited functionality for unprivileged usersAmir Goldstein2-3/+29
Add limited support for unprivileged fanotify groups. An unprivileged users is not allowed to get an open file descriptor in the event nor the process pid of another process. An unprivileged user cannot request permission events, cannot set mount/filesystem marks and cannot request unlimited queue/marks. This enables the limited functionality similar to inotify when watching a set of files and directories for OPEN/ACCESS/MODIFY/CLOSE events, without requiring SYS_CAP_ADMIN privileges. The FAN_REPORT_DFID_NAME init flag, provide a method for an unprivileged listener watching a set of directories (with FAN_EVENT_ON_CHILD) to monitor all changes inside those directories. This typically requires that the listener keeps a map of watched directory fid to dirfd (O_PATH), where fid is obtained with name_to_handle_at() before starting to watch for changes. When getting an event, the reported fid of the parent should be resolved to dirfd and fstatsat(2) with dirfd and name should be used to query the state of the filesystem entry. Link: https://lore.kernel.org/r/20210304112921.3996419-3-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-16fanotify: configurable limits via sysfsAmir Goldstein4-30/+114
fanotify has some hardcoded limits. The only APIs to escape those limits are FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARKS. Allow finer grained tuning of the system limits via sysfs tunables under /proc/sys/fs/fanotify, similar to tunables under /proc/sys/fs/inotify, with some minor differences. - max_queued_events - global system tunable for group queue size limit. Like the inotify tunable with the same name, it defaults to 16384 and applies on initialization of a new group. - max_user_marks - user ns tunable for marks limit per user. Like the inotify tunable named max_user_watches, on a machine with sufficient RAM and it defaults to 1048576 in init userns and can be further limited per containing user ns. - max_user_groups - user ns tunable for number of groups per user. Like the inotify tunable named max_user_instances, it defaults to 128 in init userns and can be further limited per containing user ns. The slightly different tunable names used for fanotify are derived from the "group" and "mark" terminology used in the fanotify man pages and throughout the code. Considering the fact that the default value for max_user_instances was increased in kernel v5.10 from 8192 to 1048576, leaving the legacy fanotify limit of 8192 marks per group in addition to the max_user_marks limit makes little sense, so the per group marks limit has been removed. Note that when a group is initialized with FAN_UNLIMITED_MARKS, its own marks are not accounted in the per user marks account, so in effect the limit of max_user_marks is only for the collection of groups that are not initialized with FAN_UNLIMITED_MARKS. Link: https://lore.kernel.org/r/20210304112921.3996419-2-amir73il@gmail.com Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-16fanotify: limit number of event merge attemptsAmir Goldstein1-0/+6
Event merges are expensive when event queue size is large, so limit the linear search to 128 merge tests. In combination with 128 size hash table, there is a potential to merge with up to 16K events in the hashed queue. Link: https://lore.kernel.org/r/20210304104826.3993892-6-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-16fsnotify: use hash table for faster events mergeAmir Goldstein5-17/+116
In order to improve event merge performance, hash events in a 128 size hash table by the event merge key. The fanotify_event size grows by two pointers, but we just reduced its size by removing the objectid member, so overall its size is increased by one pointer. Permission events and overflow event are not merged so they are also not hashed. Link: https://lore.kernel.org/r/20210304104826.3993892-5-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-16fanotify: mix event info and pid into merge key hashAmir Goldstein2-26/+66
Improve the merge key hash by mixing more values relevant for merge. For example, all FAN_CREATE name events in the same dir used to have the same merge key based on the dir inode. With this change the created file name is mixed into the merge key. The object id that was used as merge key is redundant to the event info so it is no longer mixed into the hash. Permission events are not hashed, so no need to hash their info. Link: https://lore.kernel.org/r/20210304104826.3993892-4-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-16fanotify: reduce event objectid to 29-bit hashAmir Goldstein4-18/+27
objectid is only used by fanotify backend and it is just an optimization for event merge before comparing all fields in event. Move the objectid member from common struct fsnotify_event into struct fanotify_event and reduce it to 29-bit hash to cram it together with the 3-bit event type. Events of different types are never merged, so the combination of event type and hash form a 32-bit key for fast compare of events. This reduces the size of events by one pointer and paves the way for adding hashed queue support for fanotify. Link: https://lore.kernel.org/r/20210304104826.3993892-3-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-16fsnotify: allow fsnotify_{peek,remove}_first_event with empty queueAmir Goldstein3-36/+37
Current code has an assumtion that fsnotify_notify_queue_is_empty() is called to verify that queue is not empty before trying to peek or remove an event from queue. Remove this assumption by moving the fsnotify_notify_queue_is_empty() into the functions, allow them to return NULL value and check return value by all callers. This is a prep patch for multi event queues. Link: https://lore.kernel.org/r/20210304104826.3993892-2-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-16fuse: 32-bit user space ioctl compat for fuse deviceAlessio Balsini1-10/+16
With a 64-bit kernel build the FUSE device cannot handle ioctl requests coming from 32-bit user space. This is due to the ioctl command translation that generates different command identifiers that thus cannot be used for direct comparisons without proper manipulation. Explicitly extract type and number from the ioctl command to enable 32-bit user space compatibility on 64-bit kernel builds. Signed-off-by: Alessio Balsini <balsini@android.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-03-16quota: Add mountpath based quota supportSascha Hauer1-3/+46
Add syscall quotactl_path, a variant of quotactl which allows to specify the mountpath instead of a path of to a block device. The quotactl syscall expects a path to the mounted block device to specify the filesystem to work on. This limits usage to filesystems which actually have a block device. quotactl_path replaces the path to the block device with a path where the filesystem is mounted at. The global Q_SYNC command to sync all filesystems is not supported for this new syscall, otherwise quotactl_path behaves like quotactl. Link: https://lore.kernel.org/r/20210304123541.30749-2-s.hauer@pengutronix.de Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-16btrfs: subpage: make readahead work properlyQu Wenruo1-17/+18
In readahead infrastructure, we are using a lot of hard coded PAGE_SHIFT while we're not doing anything specific to PAGE_SIZE. One of the most affected part is the radix tree operation of btrfs_fs_info::reada_tree. If using PAGE_SHIFT, subpage metadata readahead is broken and does no help reading metadata ahead. Fix the problem by using btrfs_fs_info::sectorsize_bits so that readahead could work for subpage. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-16btrfs: subpage: fix wild pointer access during metadata read failureQu Wenruo1-1/+30
[BUG] When running fstests for btrfs subpage read-write test, it has a very high chance to crash at generic/475 with the following stack: BTRFS warning (device dm-8): direct IO failed ino 510 rw 1,34817 sector 0xcdf0 len 94208 err no 10 Unable to handle kernel paging request at virtual address ffff80001157e7c0 CPU: 2 PID: 687125 Comm: kworker/u12:4 Tainted: G WC 5.12.0-rc2-custom+ #5 Hardware name: Khadas VIM3 (DT) Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs] pc : queued_spin_lock_slowpath+0x1a0/0x390 lr : do_raw_spin_lock+0xc4/0x11c Call trace: queued_spin_lock_slowpath+0x1a0/0x390 _raw_spin_lock+0x68/0x84 btree_readahead_hook+0x38/0xc0 [btrfs] end_bio_extent_readpage+0x504/0x5f4 [btrfs] bio_endio+0x170/0x1a4 end_workqueue_fn+0x3c/0x60 [btrfs] btrfs_work_helper+0x1b0/0x1b4 [btrfs] process_one_work+0x22c/0x430 worker_thread+0x70/0x3a0 kthread+0x13c/0x140 ret_from_fork+0x10/0x30 Code: 910020e0 8b0200c2 f861d884 aa0203e1 (f8246827) [CAUSE] In end_bio_extent_readpage(), if we hit an error during read, we will handle the error differently for data and metadata. For data we queue a repair, while for metadata, we record the error and let the caller choose what to do. But the code is still using page->private to grab extent buffer, which no longer points to extent buffer for subpage metadata pages. Thus this wild pointer access leads to above crash. [FIX] Introduce a helper, find_extent_buffer_readpage(), to grab extent buffer. The difference against find_extent_buffer_nospinlock() is: - Also handles regular sectorsize == PAGE_SIZE case - No extent buffer refs increase/decrease As extent buffer under IO must have non-zero refs, so this is safe Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-16zonefs: Fix O_APPEND async write handlingDamien Le Moal1-10/+68
zonefs updates the size of a sequential zone file inode only on completion of direct writes. When executing asynchronous append writes (with a file open with O_APPEND or using RWF_APPEND), the use of the current inode size in generic_write_checks() to set an iocb offset thus leads to unaligned write if an application issues an append write operation with another write already being executed. Fix this problem by introducing zonefs_write_checks() as a modified version of generic_write_checks() using the file inode wp_offset for an append write iocb offset. Also introduce zonefs_write_check_limits() to replace generic_write_check_limits() call. This zonefs special helper makes sure that the maximum file limit used is the maximum size of the file being accessed. Since zonefs_write_checks() already truncates the iov_iter, the calls to iov_iter_truncate() in zonefs_file_dio_write() and zonefs_file_buffered_write() are removed. Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: <stable@vger.kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-03-16zonefs: prevent use of seq files as swap fileDamien Le Moal1-0/+16
The sequential write constraint of sequential zone file prevent their use as swap files. Only allow conventional zone files to be used as swap files. Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: <stable@vger.kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-03-16Merge tag 'afs-fixes-20210315' of ↵Linus Torvalds7-31/+12
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: - Fix an oops in AFS that can be triggered by accessing one of the afs.yfs.* xattrs against an OpenAFS server - for instance by commands like "cp -a"[1], "rsync -X" or getfattr[2]. These try and copy all of the xattrs. cp and rsync should pay attention to the list in /etc/xattr.conf, but cp doesn't on Ubuntu and rsync doesn't seem to on Ubuntu or Fedora. xattr.conf has been modified upstream[3], and a new version has just been cut that includes it. I've logged a bug against rsync for the problem there[4]. - Stop listing "afs.*" xattrs[5][6][7], but particularly ACL ones[8] so that they don't confuse cp and rsync. This removes them from the list returned by listxattr(), but they're still available to get/set. Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003498.html [1] Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003501.html [2] Link: https://git.savannah.nongnu.org/cgit/attr.git/commit/?id=74da517cc655a82ded715dea7245ce88ebc91b98 [3] Link: https://github.com/WayneD/rsync/issues/163 [4] Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003516.html [5] Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003524.html [6] Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003565.html # v1 Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003568.html [7] Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003570.html [8] Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003571.html # v2 * tag 'afs-fixes-20210315' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Stop listxattr() from listing "afs.*" attributes afs: Fix accessing YFS xattrs on a non-YFS server
2021-03-15afs: Stop listxattr() from listing "afs.*" attributesDavid Howells6-28/+0
afs_listxattr() lists all the available special afs xattrs (i.e. those in the "afs.*" space), no matter what type of server we're dealing with. But OpenAFS servers, for example, cannot deal with some of the extra-capable attributes that AuriStor (YFS) servers provide. Unfortunately, the presence of the afs.yfs.* attributes causes errors[1] for anything that tries to read them if the server is of the wrong type. Fix the problem by removing afs_listxattr() so that none of the special xattrs are listed (AFS doesn't support xattrs). It does mean, however, that getfattr won't list them, though they can still be accessed with getxattr() and setxattr(). This can be tested with something like: getfattr -d -m ".*" /afs/example.com/path/to/file With this change, none of the afs.* attributes should be visible. Changes: ver #2: - Hide all of the afs.* xattrs, not just the ACL ones. Fixes: ae46578b963f ("afs: Get YFS ACLs and information through xattrs") Reported-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003502.html [1] Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003567.html # v1 Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003573.html # v2
2021-03-15afs: Fix accessing YFS xattrs on a non-YFS serverDavid Howells2-3/+12
If someone attempts to access YFS-related xattrs (e.g. afs.yfs.acl) on a file on a non-YFS AFS server (such as OpenAFS), then the kernel will jump to a NULL function pointer because the afs_fetch_acl_operation descriptor doesn't point to a function for issuing an operation on a non-YFS server[1]. Fix this by making afs_wait_for_operation() check that the issue_afs_rpc method is set before jumping to it and setting -ENOTSUPP if not. This fix also covers other potential operations that also only exist on YFS servers. afs_xattr_get/set_yfs() then need to translate -ENOTSUPP to -ENODATA as the former error is internal to the kernel. The bug shows up as an oops like the following: BUG: kernel NULL pointer dereference, address: 0000000000000000 [...] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. [...] Call Trace: afs_wait_for_operation+0x83/0x1b0 [kafs] afs_xattr_get_yfs+0xe6/0x270 [kafs] __vfs_getxattr+0x59/0x80 vfs_getxattr+0x11c/0x140 getxattr+0x181/0x250 ? __check_object_size+0x13f/0x150 ? __fput+0x16d/0x250 __x64_sys_fgetxattr+0x64/0xb0 do_syscall_64+0x49/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fb120a9defe This was triggered with "cp -a" which attempts to copy xattrs, including afs ones, but is easier to reproduce with getfattr, e.g.: getfattr -d -m ".*" /afs/openafs.org/ Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Reported-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: linux-afs@lists.infradead.org Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003498.html [1] Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003566.html # v1 Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003572.html # v2
2021-03-15btrfs: zoned: fix linked list corruption after log root tree allocation failureFilipe Manana1-4/+4
When using a zoned filesystem, while syncing the log, if we fail to allocate the root node for the log root tree, we are not removing the log context we allocated on stack from the list of log contexts of the log root tree. This means after the return from btrfs_sync_log() we get a corrupted linked list. Fix this by allocating the node before adding our stack allocated context to the list of log contexts of the log root tree. Fixes: 3ddebf27fcd3a9 ("btrfs: zoned: reorder log node allocation on zoned filesystem") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-15btrfs: fix qgroup data rsv leak caused by falloc failureQu Wenruo1-5/+19
[BUG] When running fsstress with only falloc workload, and a very low qgroup limit set, we can get qgroup data rsv leak at unmount time. BTRFS warning (device dm-0): qgroup 0/5 has unreleased space, type 0 rsv 20480 BTRFS error (device dm-0): qgroup reserved space leaked The minimal reproducer looks like: #!/bin/bash dev=/dev/test/test mnt="/mnt/btrfs" fsstress=~/xfstests-dev/ltp/fsstress runtime=8 workload() { umount $dev &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f $dev > /dev/null mount $dev $mnt btrfs quota en $mnt btrfs quota rescan -w $mnt btrfs qgroup limit 16m 0/5 $mnt $fsstress -w -z -f creat=10 -f fallocate=10 -p 2 -n 100 \ -d $mnt -v > /tmp/fsstress umount $mnt if dmesg | grep leak ; then echo "!!! FAILED !!!" exit 1 fi } for (( i=0; i < $runtime; i++)); do echo "=== $i/$runtime===" workload done Normally it would fail before round 4. [CAUSE] In function insert_prealloc_file_extent(), we first call btrfs_qgroup_release_data() to know how many bytes are reserved for qgroup data rsv. Then use that @qgroup_released number to continue our work. But after we call btrfs_qgroup_release_data(), we should either queue @qgroup_released to delayed ref or free them manually in error path. Unfortunately, we lack the error handling to free the released bytes, leaking qgroup data rsv. All the error handling function outside won't help at all, as we have released the range, meaning in inode io tree, the EXTENT_QGROUP_RESERVED bit is already cleared, thus all btrfs_qgroup_free_data() call won't free any data rsv. [FIX] Add free_qgroup tag to manually free the released qgroup data rsv. Reported-by: Nikolay Borisov <nborisov@suse.com> Reported-by: David Sterba <dsterba@suse.cz> Fixes: 9729f10a608f ("btrfs: inode: move qgroup reserved space release to the callers of insert_reserved_file_extent()") CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-15btrfs: track qgroup released data in own variable in insert_prealloc_file_extentQu Wenruo1-5/+6
There is a piece of weird code in insert_prealloc_file_extent(), which looks like: ret = btrfs_qgroup_release_data(inode, file_offset, len); if (ret < 0) return ERR_PTR(ret); if (trans) { ret = insert_reserved_file_extent(trans, inode, file_offset, &stack_fi, true, ret); ... } extent_info.is_new_extent = true; extent_info.qgroup_reserved = ret; ... Note how the variable @ret is abused here, and if anyone is adding code just after btrfs_qgroup_release_data() call, it's super easy to overwrite the @ret and cause tons of qgroup related bugs. Fix such abuse by introducing new variable @qgroup_released, so that we won't reuse the existing variable @ret. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-15btrfs: fix wrong offset to zero out range beyond i_sizeQu Wenruo1-1/+1
[BUG] The test generic/091 fails , with the following output: fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W mapped writes DISABLED Seed set to 1 main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling! main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling! skipping zero size read truncating to largest ever: 0xe400 copying to largest ever: 0x1f400 cloning to largest ever: 0x70000 cloning to largest ever: 0x77000 fallocating to largest ever: 0x7a120 Mapped Read: non-zero data past EOF (0x3a7ff) page offset 0x800 is 0xf2e1 <<< ... [CAUSE] In commit c28ea613fafa ("btrfs: subpage: fix the false data csum mismatch error") end_bio_extent_readpage() changes to only zero the range inside the bvec for incoming subpage support. But that commit is using incorrect offset to calculate the start. For subpage, we can have a case that the whole bvec is beyond isize, thus we need to calculate the correct offset. But the offending commit is using @end (bvec end), other than @start (bvec start) to calculate the start offset. This means, we only zero the last byte of the bvec, not from the isize. This stupid bug makes the range beyond isize is not properly zeroed, and failed above test. [FIX] Use correct @start to calculate the range start. Reported-by: kernel test robot <oliver.sang@intel.com> Fixes: c28ea613fafa ("btrfs: subpage: fix the false data csum mismatch error") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-15xfs: also reject BULKSTAT_SINGLE in a mount user namespaceChristoph Hellwig1-0/+6
BULKSTAT_SINGLE exposed the ondisk uids/gids just like bulkstat, and can be called on any inode, including ones not visible in the current mount. Fixes: f736d93d76d3 ("xfs: support idmapped mounts") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-15xfs: force log and push AIL to clear pinned inodes when aborting mountDarrick J. Wong1-46/+44
If we allocate quota inodes in the process of mounting a filesystem but then decide to abort the mount, it's possible that the quota inodes are sitting around pinned by the log. Now that inode reclaim relies on the AIL to flush inodes, we have to force the log and push the AIL in between releasing the quota inodes and kicking off reclaim to tear down all the incore inodes. Do this by extracting the bits we need from the unmount path and reusing them. As an added bonus, failed writes during a failed mount will not retry forever now. This was originally found during a fuzz test of metadata directories (xfs/1546), but the actual symptom was that reclaim hung up on the quota inodes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-03-15io_uring: fix sqpoll cancellation via task_workPavel Begunkov1-1/+5
Running sqpoll cancellations via task_work_run() is a bad idea because it depends on other task works to be run, but those may be locked in currently running task_work_run() because of how it's (splicing the list in batches). Enqueue and run them through a separate callback head, namely struct io_sq_data::park_task_work. As a nice bonus we now precisely control where it's run, that's much safer than guessing where it can happen as it was before. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: add generic callback_head helpersPavel Begunkov1-26/+36
We already have helpers to run/add callback_head but taking ctx and working with ctx->exit_task_work. Extract generic versions of them implemented in terms of struct callback_head, it will be used later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: fix concurrent parkingPavel Begunkov1-2/+11
If io_sq_thread_park() of one task got rescheduled right after set_bit(), before it gets back to mutex_lock() there can happen park()/unpark() by another task with SQPOLL locking again and continuing running never seeing that first set_bit(SHOULD_PARK), so won't even try to put the mutex down for parking. It will get parked eventually when SQPOLL drops the lock for reschedule, but may be problematic and will get in the way of further fixes. Account number of tasks waiting for parking with a new atomic variable park_pending and adjust SHOULD_PARK accordingly. It doesn't entirely replaces SHOULD_PARK bit with this atomic var because it's convenient to have it as a bit in the state and will help to do optimisations later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: halt SQO submission on ctx exitPavel Begunkov1-0/+8
io_sq_thread_finish() is called in io_ring_ctx_free(), so SQPOLL task is potentially running submitting new requests. It's not a disaster because of using a "try" variant of percpu_ref_get, but is far from nice. Remove ctx from the sqd ctx list earlier, before cancellation loop, so SQPOLL can't find it and so won't submit new requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: replace sqd rw_semaphore with mutexPavel Begunkov1-22/+14
The only user of read-locking of sqd->rw_lock is sq_thread itself, which is by definition alone, so we don't really need rw_semaphore, but mutex will do. Replace it with a mutex, and kill read-to-write upgrading and extra task_work handling in io_sq_thread(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: fix complete_post use ctx after freePavel Begunkov1-6/+11
If io_req_complete_post() put not a final ref, we can't rely on the request's ctx ref, and so ctx may potentially be freed while complete_post() is in io_cqring_ev_posted()/etc. In that case get an additional ctx reference, and put it in the end, so protecting following io_cqring_ev_posted(). And also prolong ctx lifetime until spin_unlock happens, as we do with mutexes, so added percpu_ref_get() doesn't race with ctx free. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: fix ->flags races by linked timeoutsPavel Begunkov1-1/+0
It's racy to modify req->flags from a not owning context, e.g. linked timeout calling req_set_fail_links() for the master request might race with that request setting/clearing flags while being executed concurrently. Just remove req_set_fail_links(prev) from io_link_timeout_fn(), io_async_find_and_cancel() and functions down the line take care of setting the fail bit. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15gfs2: Flag a withdraw if init_threads() failsAndrew Price1-1/+3
Interrupting mount with ^C quickly enough can cause the kthread_run() calls in gfs2's init_threads() to fail and the error path leads to a deadlock on the s_umount rwsem. The abridged chain of events is: [mount path] get_tree_bdev() sget_fc() alloc_super() down_write_nested(&s->s_umount, SINGLE_DEPTH_NESTING); [acquired] gfs2_fill_super() gfs2_make_fs_rw() init_threads() kthread_run() ( Interrupted ) [Error path] gfs2_gl_hash_clear() flush_workqueue(glock_workqueue) wait_for_completion() [workqueue context] glock_work_func() run_queue() do_xmote() freeze_go_sync() freeze_super() down_write(&sb->s_umount) [deadlock] In freeze_go_sync() there is a gfs2_withdrawn() check that we can use to make sure freeze_super() is not called in the error path, so add a gfs2_withdraw_delayed() call when init_threads() fails. Ref: https://bugzilla.kernel.org/show_bug.cgi?id=212231 Reported-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-03-15cifs: Fix preauth hash corruptionVincent Whitchurch1-1/+6
smb311_update_preauth_hash() uses the shash in server->secmech without appropriate locking, and this can lead to sessions corrupting each other's preauth hashes. The following script can easily trigger the problem: #!/bin/sh -e NMOUNTS=10 for i in $(seq $NMOUNTS); mkdir -p /tmp/mnt$i umount /tmp/mnt$i 2>/dev/null || : done while :; do for i in $(seq $NMOUNTS); do mount -t cifs //192.168.0.1/test /tmp/mnt$i -o ... & done wait for i in $(seq $NMOUNTS); do umount /tmp/mnt$i done done Usually within seconds this leads to one or more of the mounts failing with the following errors, and a "Bad SMB2 signature for message" is seen in the server logs: CIFS: VFS: \\192.168.0.1 failed to connect to IPC (rc=-13) CIFS: VFS: cifs_mount failed w/return code = -13 Fix it by holding the server mutex just like in the other places where the shashes are used. Fixes: 8bd68c6e47abff34e4 ("CIFS: implement v3.11 preauth integrity") Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> CC: <stable@vger.kernel.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-03-15cifs: update new ACE pointer after populate_new_aces.Shyam Prasad N1-3/+6
After the fix for retaining externally set ACEs with cifsacl and modefromsid,idsfromsid, there was an issue in populating the inherited ACEs after setting the ACEs introduced by these two modes. Fixed this by updating the ACE pointer again after the call to populate_new_aces. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Rohith Surabattula <rohiths@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-03-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds2-17/+14
Merge misc fixes from Andrew Morton: "28 patches. Subsystems affected by this series: mm (memblock, pagealloc, hugetlb, highmem, kfence, oom-kill, madvise, kasan, userfaultfd, memcg, and zram), core-kernel, kconfig, fork, binfmt, MAINTAINERS, kbuild, and ia64" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (28 commits) zram: fix broken page writeback zram: fix return value on writeback_store mm/memcg: set memcg when splitting page mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls mm/userfaultfd: fix memory corruption due to writeprotect kasan: fix KASAN_STACK dependency for HW_TAGS kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC mm/madvise: replace ptrace attach requirement for process_madvise include/linux/sched/mm.h: use rcu_dereference in in_vfork() kfence: fix reports if constant function prefixes exist kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations kfence: fix printk format for ptrdiff_t linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP* MAINTAINERS: exclude uapi directories in API/ABI section binfmt_misc: fix possible deadlock in bm_register_write mm/highmem.c: fix zero_user_segments() with start > end hugetlb: do early cow when page pinned on src mm mm: use is_cow_mapping() across tree where proper ...