summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2020-11-30NFSD: Add tracepoints in nfsd4_decode/encode_compound()Chuck Lever2-2/+79
For troubleshooting purposes, record failures to decode NFSv4 operation arguments and encode operation results. trace_nfsd_compound_decode_err() replaces the dprintk() call sites that are embedded in READ_* macros that are about to be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30NFSD: Add tracepoints in nfsd_dispatch()Chuck Lever2-12/+65
For troubleshooting purposes, record GARBAGE_ARGS and CANT_ENCODE failures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30NFSD: Add common helpers to decode void args and encode void resultsChuck Lever13-86/+64
Start off the conversion to xdr_stream by de-duplicating the functions that decode void arguments and encode void results. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Prepare for xdr_stream-style decoding on the server-sideChuck Lever1-0/+2
A "permanent" struct xdr_stream is allocated in struct svc_rqst so that it is usable by all server-side decoders. A per-rqst scratch buffer is also allocated to handle decoding XDR data items that cross page boundaries. To demonstrate how it will be used, add the first call site for the new svcxdr_init_decode() API. As an additional part of the overall conversion, add symbolic constants for successful and failed XDR operations. Returning "0" is overloaded. Sometimes it means something failed, but sometimes it means success. To make it more clear when XDR decoding functions succeed or fail, introduce symbolic constants. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Add xdr_set_scratch_page() and xdr_reset_scratch_buffer()Chuck Lever10-13/+11
Clean up: De-duplicate some frequently-used code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30nfsd: Fix error return code in nfsd_file_cache_init()Huang Guobin1-0/+1
Fix to return PTR_ERR() error code from the error handling case instead of 0 in function nfsd_file_cache_init(), as done elsewhere in this function. Fixes: 65294c1f2c5e7("nfsd: add a new struct file caching facility to nfsd") Signed-off-by: Huang Guobin <huangguobin4@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30NFSD: Add SPDX header for fs/nfsd/trace.cChuck Lever1-0/+1
Clean up. The file was contributed in 2014 by Christoph Hellwig in commit 31ef83dc0538 ("nfsd: add trace events"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30NFSD: Remove extra "0x" in tracepoint format specifierChuck Lever1-4/+4
Clean up: %p adds its own 0x already. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30NFSD: Clean up the show_nf_may macroChuck Lever1-14/+26
Display all currently possible NFSD_MAY permission flags. Move and rename show_nf_may with a more generic name because the NFSD_MAY permission flags are used in other places besides the file cache. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30nfsd/nfs3: remove unused macro nfsd3_fhandleresAlex Shi1-1/+0
The macro is unused, remove it to tame gcc warning: fs/nfsd/nfs3proc.c:702:0: warning: macro "nfsd3_fhandleres" is not used [-Wunused-macros] Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: linux-nfs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30NFSD: A semicolon is not needed after a switch statement.Tom Rix1-1/+1
Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30NFSD: Invoke svc_encode_result_payload() in "read" NFSD encodersChuck Lever3-12/+42
Have the NFSD encoders annotate the boundaries of every direct-data-placement eligible result data payload. Then change svcrdma to use that annotation instead of the xdr->page_len when handling Write chunks. For NFSv4 on RDMA, that enables the ability to recognize multiple result payloads per compound. This is a pre-requisite for supporting multiple Write chunks per RPC transaction. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Rename svc_encode_read_payload()Chuck Lever1-1/+1
Clean up: "result payload" is a less confusing name for these payloads. "READ payload" reflects only the NFS usage. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-29Merge tag 'efi-urgent-for-v5.10-rc5' of ↵Linus Torvalds2-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI fixes from Borislav Petkov: "More EFI fixes forwarded from Ard Biesheuvel: - revert efivarfs kmemleak fix again - it was a false positive - make CONFIG_EFI_EARLYCON depend on CONFIG_EFI explicitly so it does not pull in other dependencies unnecessarily if CONFIG_EFI is not set - defer attempts to load SSDT overrides from EFI vars until after the efivar layer is up" * tag 'efi-urgent-for-v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi: EFI_EARLYCON should depend on EFI efivarfs: revert "fix memory leak in efivarfs_create()" efi/efivars: Set generic ops before loading SSDT
2020-11-27Merge tag 'io_uring-5.10-2020-11-27' of git://git.kernel.dk/linux-blockLinus Torvalds1-19/+20
Pull io_uring fixes from Jens Axboe: - Out of bounds fix for the cq size cap from earlier this release (Joseph) - iov_iter type check fix (Pavel) - Files grab + cancelation fix (Pavel) * tag 'io_uring-5.10-2020-11-27' of git://git.kernel.dk/linux-block: io_uring: fix files grab/cancel race io_uring: fix ITER_BVEC check io_uring: fix shift-out-of-bounds when round up cq size
2020-11-27Merge tag 'for-5.10-rc5-tag' of ↵Linus Torvalds7-73/+158
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few fixes for various warnings that accumulated over past two weeks: - tree-checker: add missing return values for some errors - lockdep fixes - when reading qgroup config and starting quota rescan - reverse order of quota ioctl lock and VFS freeze lock - avoid accessing potentially stale fs info during device scan, reported by syzbot - add scope NOFS protection around qgroup relation changes - check for running transaction before flushing qgroups - fix tracking of new delalloc ranges for some cases" * tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix lockdep splat when enabling and disabling qgroups btrfs: do nofs allocations when adding and removing qgroup relations btrfs: fix lockdep splat when reading qgroup config on mount btrfs: tree-checker: add missing returns after data_ref alignment checks btrfs: don't access possibly stale fs_info data for printing duplicate device btrfs: tree-checker: add missing return after error in root_item btrfs: qgroup: don't commit transaction when we already hold the handle btrfs: fix missing delalloc new bit for new delalloc ranges
2020-11-26io_uring: fix files grab/cancel racePavel Begunkov1-16/+15
When one task is in io_uring_cancel_files() and another is doing io_prep_async_work() a race may happen. That's because after accounting a request inflight in first call to io_grab_identity() it still may fail and go to io_identity_cow(), which migh briefly keep dangling work.identity and not only. Grab files last, so io_prep_async_work() won't fail if it did get into ->inflight_list. note: the bug shouldn't exist after making io_uring_cancel_files() not poking into other tasks' requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-25efivarfs: revert "fix memory leak in efivarfs_create()"Ard Biesheuvel2-1/+2
The memory leak addressed by commit fe5186cf12e3 is a false positive: all allocations are recorded in a linked list, and freed when the filesystem is unmounted. This leads to double frees, and as reported by David, leads to crashes if SLUB is configured to self destruct when double frees occur. So drop the redundant kfree() again, and instead, mark the offending pointer variable so the allocation is ignored by kmemleak. Cc: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> Fixes: fe5186cf12e3 ("efivarfs: fix memory leak in efivarfs_create()") Reported-by: David Laight <David.Laight@aculab.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-11-25Merge tag '5.10-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2-15/+74
Pull cifs fixes from Steve French: "Four smb3 fixes for stable: one fixes a memleak, the other three address a problem found with decryption offload that can cause a use after free" * tag '5.10-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: Handle error case during offload read path smb3: Avoid Mid pending list corruption smb3: Call cifs reconnect from demultiplex thread cifs: fix a memleak with modefromsid
2020-11-24io_uring: fix ITER_BVEC checkPavel Begunkov1-1/+1
iov_iter::type is a bitmask that also keeps direction etc., so it shouldn't be directly compared against ITER_*. Use proper helper. Fixes: ff6165b2d7f6 ("io_uring: retain iov_iter state over io_read/io_write calls") Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Cc: <stable@vger.kernel.org> # 5.9 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-24io_uring: fix shift-out-of-bounds when round up cq sizeJoseph Qi1-2/+4
Abaci Fuzz reported a shift-out-of-bounds BUG in io_uring_create(): [ 59.598207] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 [ 59.599665] shift exponent 64 is too large for 64-bit type 'long unsigned int' [ 59.601230] CPU: 0 PID: 963 Comm: a.out Not tainted 5.10.0-rc4+ #3 [ 59.602502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 59.603673] Call Trace: [ 59.604286] dump_stack+0x107/0x163 [ 59.605237] ubsan_epilogue+0xb/0x5a [ 59.606094] __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e [ 59.607335] ? lock_downgrade+0x6c0/0x6c0 [ 59.608182] ? rcu_read_lock_sched_held+0xaf/0xe0 [ 59.609166] io_uring_create.cold+0x99/0x149 [ 59.610114] io_uring_setup+0xd6/0x140 [ 59.610975] ? io_uring_create+0x2510/0x2510 [ 59.611945] ? lockdep_hardirqs_on_prepare+0x286/0x400 [ 59.613007] ? syscall_enter_from_user_mode+0x27/0x80 [ 59.614038] ? trace_hardirqs_on+0x5b/0x180 [ 59.615056] do_syscall_64+0x2d/0x40 [ 59.615940] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 59.617007] RIP: 0033:0x7f2bb8a0b239 This is caused by roundup_pow_of_two() if the input entries larger enough, e.g. 2^32-1. For sq_entries, it will check first and we allow at most IORING_MAX_ENTRIES, so it is okay. But for cq_entries, we do round up first, that may overflow and truncate it to 0, which is not the expected behavior. So check the cq size first and then do round up. Fixes: 88ec3211e463 ("io_uring: round-up cq size before comparing with rounded sq size") Reported-by: Abaci Fuzz <abaci@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-23btrfs: fix lockdep splat when enabling and disabling qgroupsFilipe Manana2-9/+53
When running test case btrfs/017 from fstests, lockdep reported the following splat: [ 1297.067385] ====================================================== [ 1297.067708] WARNING: possible circular locking dependency detected [ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted [ 1297.068322] ------------------------------------------------------ [ 1297.068629] btrfs/189080 is trying to acquire lock: [ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.069274] but task is already holding lock: [ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs] [ 1297.070219] which lock already depends on the new lock. [ 1297.071131] the existing dependency chain (in reverse order) is: [ 1297.071721] -> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}: [ 1297.072375] lock_acquire+0xd8/0x490 [ 1297.072710] __mutex_lock+0xa3/0xb30 [ 1297.073061] btrfs_qgroup_inherit+0x59/0x6a0 [btrfs] [ 1297.073421] create_subvol+0x194/0x990 [btrfs] [ 1297.073780] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] [ 1297.074133] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] [ 1297.074498] btrfs_ioctl_snap_create+0x58/0x80 [btrfs] [ 1297.074872] btrfs_ioctl+0x1a90/0x36f0 [btrfs] [ 1297.075245] __x64_sys_ioctl+0x83/0xb0 [ 1297.075617] do_syscall_64+0x33/0x80 [ 1297.075993] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1297.076380] -> #0 (sb_internal#2){.+.+}-{0:0}: [ 1297.077166] check_prev_add+0x91/0xc60 [ 1297.077572] __lock_acquire+0x1740/0x3110 [ 1297.077984] lock_acquire+0xd8/0x490 [ 1297.078411] start_transaction+0x3c5/0x760 [btrfs] [ 1297.078853] btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.079323] btrfs_ioctl+0x2c60/0x36f0 [btrfs] [ 1297.079789] __x64_sys_ioctl+0x83/0xb0 [ 1297.080232] do_syscall_64+0x33/0x80 [ 1297.080680] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1297.081139] other info that might help us debug this: [ 1297.082536] Possible unsafe locking scenario: [ 1297.083510] CPU0 CPU1 [ 1297.084005] ---- ---- [ 1297.084500] lock(&fs_info->qgroup_ioctl_lock); [ 1297.084994] lock(sb_internal#2); [ 1297.085485] lock(&fs_info->qgroup_ioctl_lock); [ 1297.085974] lock(sb_internal#2); [ 1297.086454] *** DEADLOCK *** [ 1297.087880] 3 locks held by btrfs/189080: [ 1297.088324] #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs] [ 1297.088799] #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs] [ 1297.089284] #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs] [ 1297.089771] stack backtrace: [ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1 [ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 1297.092123] Call Trace: [ 1297.092629] dump_stack+0x8d/0xb5 [ 1297.093115] check_noncircular+0xff/0x110 [ 1297.093596] check_prev_add+0x91/0xc60 [ 1297.094076] ? kvm_clock_read+0x14/0x30 [ 1297.094553] ? kvm_sched_clock_read+0x5/0x10 [ 1297.095029] __lock_acquire+0x1740/0x3110 [ 1297.095510] lock_acquire+0xd8/0x490 [ 1297.095993] ? btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.096476] start_transaction+0x3c5/0x760 [btrfs] [ 1297.096962] ? btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.097451] btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.097941] ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs] [ 1297.098429] btrfs_ioctl+0x2c60/0x36f0 [btrfs] [ 1297.098904] ? do_user_addr_fault+0x20c/0x430 [ 1297.099382] ? kvm_clock_read+0x14/0x30 [ 1297.099854] ? kvm_sched_clock_read+0x5/0x10 [ 1297.100328] ? sched_clock+0x5/0x10 [ 1297.100801] ? sched_clock_cpu+0x12/0x180 [ 1297.101272] ? __x64_sys_ioctl+0x83/0xb0 [ 1297.101739] __x64_sys_ioctl+0x83/0xb0 [ 1297.102207] do_syscall_64+0x33/0x80 [ 1297.102673] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1297.103148] RIP: 0033:0x7f773ff65d87 This is because during the quota enable ioctl we lock first the mutex qgroup_ioctl_lock and then start a transaction, and starting a transaction acquires a fs freeze semaphore (at the VFS level). However, every other code path, except for the quota disable ioctl path, we do the opposite: we start a transaction and then lock the mutex. So fix this by making the quota enable and disable paths to start the transaction without having the mutex locked, and then, after starting the transaction, lock the mutex and check if some other task already enabled or disabled the quotas, bailing with success if that was the case. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-23btrfs: do nofs allocations when adding and removing qgroup relationsFilipe Manana1-0/+9
When adding or removing a qgroup relation we are doing a GFP_KERNEL allocation which is not safe because we are holding a transaction handle open and that can make us deadlock if the allocator needs to recurse into the filesystem. So just surround those calls with a nofs context. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-23btrfs: fix lockdep splat when reading qgroup config on mountFilipe Manana1-1/+1
Lockdep reported the following splat when running test btrfs/190 from fstests: [ 9482.126098] ====================================================== [ 9482.126184] WARNING: possible circular locking dependency detected [ 9482.126281] 5.10.0-rc4-btrfs-next-73 #1 Not tainted [ 9482.126365] ------------------------------------------------------ [ 9482.126456] mount/24187 is trying to acquire lock: [ 9482.126534] ffffa0c869a7dac0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}, at: qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.126647] but task is already holding lock: [ 9482.126777] ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.126886] which lock already depends on the new lock. [ 9482.127078] the existing dependency chain (in reverse order) is: [ 9482.127213] -> #1 (btrfs-quota-00){++++}-{3:3}: [ 9482.127366] lock_acquire+0xd8/0x490 [ 9482.127436] down_read_nested+0x45/0x220 [ 9482.127528] __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.127613] btrfs_read_lock_root_node+0x41/0x130 [btrfs] [ 9482.127702] btrfs_search_slot+0x514/0xc30 [btrfs] [ 9482.127788] update_qgroup_status_item+0x72/0x140 [btrfs] [ 9482.127877] btrfs_qgroup_rescan_worker+0xde/0x680 [btrfs] [ 9482.127964] btrfs_work_helper+0xf1/0x600 [btrfs] [ 9482.128039] process_one_work+0x24e/0x5e0 [ 9482.128110] worker_thread+0x50/0x3b0 [ 9482.128181] kthread+0x153/0x170 [ 9482.128256] ret_from_fork+0x22/0x30 [ 9482.128327] -> #0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}: [ 9482.128464] check_prev_add+0x91/0xc60 [ 9482.128551] __lock_acquire+0x1740/0x3110 [ 9482.128623] lock_acquire+0xd8/0x490 [ 9482.130029] __mutex_lock+0xa3/0xb30 [ 9482.130590] qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.131577] btrfs_read_qgroup_config+0x43a/0x550 [btrfs] [ 9482.132175] open_ctree+0x1228/0x18a0 [btrfs] [ 9482.132756] btrfs_mount_root.cold+0x13/0xed [btrfs] [ 9482.133325] legacy_get_tree+0x30/0x60 [ 9482.133866] vfs_get_tree+0x28/0xe0 [ 9482.134392] fc_mount+0xe/0x40 [ 9482.134908] vfs_kern_mount.part.0+0x71/0x90 [ 9482.135428] btrfs_mount+0x13b/0x3e0 [btrfs] [ 9482.135942] legacy_get_tree+0x30/0x60 [ 9482.136444] vfs_get_tree+0x28/0xe0 [ 9482.136949] path_mount+0x2d7/0xa70 [ 9482.137438] do_mount+0x75/0x90 [ 9482.137923] __x64_sys_mount+0x8e/0xd0 [ 9482.138400] do_syscall_64+0x33/0x80 [ 9482.138873] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 9482.139346] other info that might help us debug this: [ 9482.140735] Possible unsafe locking scenario: [ 9482.141594] CPU0 CPU1 [ 9482.142011] ---- ---- [ 9482.142411] lock(btrfs-quota-00); [ 9482.142806] lock(&fs_info->qgroup_rescan_lock); [ 9482.143216] lock(btrfs-quota-00); [ 9482.143629] lock(&fs_info->qgroup_rescan_lock); [ 9482.144056] *** DEADLOCK *** [ 9482.145242] 2 locks held by mount/24187: [ 9482.145637] #0: ffffa0c8411c40e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xb9/0x400 [ 9482.146061] #1: ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.146509] stack backtrace: [ 9482.147350] CPU: 1 PID: 24187 Comm: mount Not tainted 5.10.0-rc4-btrfs-next-73 #1 [ 9482.147788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 9482.148709] Call Trace: [ 9482.149169] dump_stack+0x8d/0xb5 [ 9482.149628] check_noncircular+0xff/0x110 [ 9482.150090] check_prev_add+0x91/0xc60 [ 9482.150561] ? kvm_clock_read+0x14/0x30 [ 9482.151017] ? kvm_sched_clock_read+0x5/0x10 [ 9482.151470] __lock_acquire+0x1740/0x3110 [ 9482.151941] ? __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.152402] lock_acquire+0xd8/0x490 [ 9482.152887] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.153354] __mutex_lock+0xa3/0xb30 [ 9482.153826] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.154301] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.154768] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.155226] qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.155690] btrfs_read_qgroup_config+0x43a/0x550 [btrfs] [ 9482.156160] open_ctree+0x1228/0x18a0 [btrfs] [ 9482.156643] btrfs_mount_root.cold+0x13/0xed [btrfs] [ 9482.157108] ? rcu_read_lock_sched_held+0x5d/0x90 [ 9482.157567] ? kfree+0x31f/0x3e0 [ 9482.158030] legacy_get_tree+0x30/0x60 [ 9482.158489] vfs_get_tree+0x28/0xe0 [ 9482.158947] fc_mount+0xe/0x40 [ 9482.159403] vfs_kern_mount.part.0+0x71/0x90 [ 9482.159875] btrfs_mount+0x13b/0x3e0 [btrfs] [ 9482.160335] ? rcu_read_lock_sched_held+0x5d/0x90 [ 9482.160805] ? kfree+0x31f/0x3e0 [ 9482.161260] ? legacy_get_tree+0x30/0x60 [ 9482.161714] legacy_get_tree+0x30/0x60 [ 9482.162166] vfs_get_tree+0x28/0xe0 [ 9482.162616] path_mount+0x2d7/0xa70 [ 9482.163070] do_mount+0x75/0x90 [ 9482.163525] __x64_sys_mount+0x8e/0xd0 [ 9482.163986] do_syscall_64+0x33/0x80 [ 9482.164437] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 9482.164902] RIP: 0033:0x7f51e907caaa This happens because at btrfs_read_qgroup_config() we can call qgroup_rescan_init() while holding a read lock on a quota btree leaf, acquired by the previous call to btrfs_search_slot_for_read(), and qgroup_rescan_init() acquires the mutex qgroup_rescan_lock. A qgroup rescan worker does the opposite: it acquires the mutex qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to update the qgroup status item in the quota btree through the call to update_qgroup_status_item(). This inversion of locking order between the qgroup_rescan_lock mutex and quota btree locks causes the splat. Fix this simply by releasing and freeing the path before calling qgroup_rescan_init() at btrfs_read_qgroup_config(). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-23btrfs: tree-checker: add missing returns after data_ref alignment checksDavid Sterba1-0/+2
There are sectorsize alignment checks that are reported but then check_extent_data_ref continues. This was not intended, wrong alignment is not a minor problem and we should return with error. CC: stable@vger.kernel.org # 5.4+ Fixes: 0785a9aacf9d ("btrfs: tree-checker: Add EXTENT_DATA_REF check") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-23btrfs: don't access possibly stale fs_info data for printing duplicate deviceJohannes Thumshirn1-1/+7
Syzbot reported a possible use-after-free when printing a duplicate device warning device_list_add(). At this point it can happen that a btrfs_device::fs_info is not correctly setup yet, so we're accessing stale data, when printing the warning message using the btrfs_printk() wrappers. ================================================================== BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245 Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068 CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1d6/0x29e lib/dump_stack.c:118 print_address_description+0x66/0x620 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report+0x132/0x1d0 mm/kasan/report.c:530 btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245 device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943 btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359 btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x44840a RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630 RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003 Allocated by task 6945: kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461 kmalloc_node include/linux/slab.h:577 [inline] kvmalloc_node+0x81/0x110 mm/util.c:574 kvmalloc include/linux/mm.h:757 [inline] kvzalloc include/linux/mm.h:765 [inline] btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 6945: kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track+0x3d/0x70 mm/kasan/common.c:56 kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422 __cache_free mm/slab.c:3418 [inline] kfree+0x113/0x200 mm/slab.c:3756 deactivate_locked_super+0xa7/0xf0 fs/super.c:335 btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff8880878e0000 which belongs to the cache kmalloc-16k of size 16384 The buggy address is located 1704 bytes inside of 16384-byte region [ffff8880878e0000, ffff8880878e4000) The buggy address belongs to the page: page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0 head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0 flags: 0xfffe0000010200(slab|head) raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00 raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== The syzkaller reproducer for this use-after-free crafts a filesystem image and loop mounts it twice in a loop. The mount will fail as the crafted image has an invalid chunk tree. When this happens btrfs_mount_root() will call deactivate_locked_super(), which then cleans up fs_info and fs_info::sb. If a second thread now adds the same block-device to the filesystem, it will get detected as a duplicate device and device_list_add() will reject the duplicate and print a warning. But as the fs_info pointer passed in is non-NULL this will result in a use-after-free. Instead of printing possibly uninitialized or already freed memory in btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the device name will be skipped altogether. There was a slightly different approach discussed in https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-23Merge tag 'efi-urgent-for-v5.10-rc3' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI fixes from Borislav Petkov: "Forwarded EFI fixes from Ard Biesheuvel: - fix memory leak in efivarfs driver - fix HYP mode issue in 32-bit ARM version of the EFI stub when built in Thumb2 mode - avoid leaking EFI pgd pages on allocation failure" * tag 'efi-urgent-for-v5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi/x86: Free efi_pgd with free_pages() efivarfs: fix memory leak in efivarfs_create() efi/arm: set HSCTLR Thumb2 bit correctly for HVC calls from HYP
2020-11-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-2/+4
Merge misc fixes from Andrew Morton: "8 patches. Subsystems affected by this patch series: mm (madvise, pagemap, readahead, memcg, userfaultfd), kbuild, and vfs" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: fix madvise WILLNEED performance problem libfs: fix error cast of negative value in simple_attr_write() mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault() mm: memcg/slab: fix root memcg vmstats mm: fix readahead_page_batch for retry entries mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exports compiler-clang: remove version check for BPF Tracing mm/madvise: fix memory leak from process_madvise
2020-11-22Merge tag 'ext4_for_linus_fixes2' of ↵Linus Torvalds4-36/+36
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "A final set of miscellaneous bug fixes for ext4" * tag 'ext4_for_linus_fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix bogus warning in ext4_update_dx_flag() jbd2: fix kernel-doc markups ext4: drop fast_commit from /proc/mounts
2020-11-22afs: Fix speculative status fetch going out of order wrt to modificationsDavid Howells3-0/+10
When doing a lookup in a directory, the afs filesystem uses a bulk status fetch to speculatively retrieve the statuses of up to 48 other vnodes found in the same directory and it will then either update extant inodes or create new ones - effectively doing 'lookup ahead'. To avoid the possibility of deadlocking itself, however, the filesystem doesn't lock all of those inodes; rather just the directory inode is locked (by the VFS). When the operation completes, afs_inode_init_from_status() or afs_apply_status() is called, depending on whether the inode already exists, to commit the new status. A case exists, however, where the speculative status fetch operation may straddle a modification operation on one of those vnodes. What can then happen is that the speculative bulk status RPC retrieves the old status, and whilst that is happening, the modification happens - which returns an updated status, then the modification status is committed, then we attempt to commit the speculative status. This results in something like the following being seen in dmesg: kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus say that the vnode had data version 8 when we'd already recorded version 9 due to a local modification. This was causing the cache to be invalidated for that vnode when it shouldn't have been. If it happens on a data file, this might lead to local changes being lost. Fix this by ignoring speculative status updates if the data version doesn't match the expected value. Note that it is possible to get a DV regression if a volume gets restored from a backup - but we should get a callback break in such a case that should trigger a recheck anyway. It might be worth checking the volume creation time in the volsync info and, if a change is observed in that (as would happen on a restore), invalidate all caches associated with the volume. Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-22libfs: fix error cast of negative value in simple_attr_write()Yicong Yang1-2/+4
The attr->set() receive a value of u64, but simple_strtoll() is used for doing the conversion. It will lead to the error cast if user inputs a negative value. Use kstrtoull() instead of simple_strtoll() to convert a string got from the user to an unsigned value. The former will return '-EINVAL' if it gets a negetive value, but the latter can't handle the situation correctly. Make 'val' unsigned long long as what kstrtoull() takes, this will eliminate the compile warning on no 64-bit architectures. Fixes: f7b88631a897 ("fs/libfs.c: fix simple_attr_write() on 32bit machines") Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisilicon.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-21Merge tag 'xfs-5.10-fixes-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds8-41/+124
Pull xfs fixes from Darrick Wong: "The critical fixes are for a crash that someone reported in the xattr code on 32-bit arm last week; and a revert of the rmap key comparison change from last week as it was totally wrong. I need a vacation. :( Summary: - Fix various deficiencies in online fsck's metadata checking code - Fix an integer casting bug in the xattr code on 32-bit systems - Fix a hang in an inode walk when the inode index is corrupt - Fix error codes being dropped when initializing per-AG structures - Fix nowait directio writes that partially succeed but return EAGAIN - Revert last week's rmap comparison patch because it was wrong" * tag 'xfs-5.10-fixes-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: revert "xfs: fix rmap key and record comparison functions" xfs: don't allow NOWAIT DIO across extent boundaries xfs: return corresponding errcode if xfs_initialize_perag() fail xfs: ensure inobt record walks always make forward progress xfs: fix forkoff miscalculation related to XFS_LITINO(mp) xfs: directory scrub should check the null bestfree entries too xfs: strengthen rmap record flags checking xfs: fix the minrecs logic when dealing with inode root child blocks
2020-11-21Merge tag 'fsnotify_for_v5.10-rc5' of ↵Linus Torvalds1-5/+7
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fanotify fix from Jan Kara: "A single fanotify fix from Amir" * tag 'fsnotify_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: fix logic of reporting name info with watched parent
2020-11-20Merge tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-blockLinus Torvalds2-15/+49
Pull io_uring fixes from Jens Axboe: "Mostly regression or stable fodder: - Disallow async path resolution of /proc/self - Tighten constraints for segmented async buffered reads - Fix double completion for a retry error case - Fix for fixed file life times (Pavel)" * tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block: io_uring: order refnode recycling io_uring: get an active ref_node from files_data io_uring: don't double complete failed reissue request mm: never attempt async page lock if we've transferred data already io_uring: handle -EOPNOTSUPP on path resolution proc: don't allow async path resolution of /proc/self components
2020-11-20ext4: fix bogus warning in ext4_update_dx_flag()Jan Kara1-1/+2
The idea of the warning in ext4_update_dx_flag() is that we should warn when we are clearing EXT4_INODE_INDEX on a filesystem with metadata checksums enabled since after clearing the flag, checksums for internal htree nodes will become invalid. So there's no need to warn (or actually do anything) when EXT4_INODE_INDEX is not set. Link: https://lore.kernel.org/r/20201118153032.17281-1-jack@suse.cz Fixes: 48a34311953d ("ext4: fix checksum errors with indexed dirs") Reported-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-11-20jbd2: fix kernel-doc markupsMauro Carvalho Chehab2-31/+34
Kernel-doc markup should use this format: identifier - description They should not have any type before that, as otherwise the parser won't do the right thing. Also, some identifiers have different names between their prototypes and the kernel-doc markup. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/72f5c6628f5f278d67625f60893ffbc2ca28d46e.1605521731.git.mchehab+huawei@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-20xfs: revert "xfs: fix rmap key and record comparison functions"Darrick J. Wong1-8/+8
This reverts commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f. Your maintainer committed a major braino in the rmap code by adding the attr fork, bmbt, and unwritten extent usage bits into rmap record key comparisons. While XFS uses the usage bits *in the rmap records* for cross-referencing metadata in xfs_scrub and xfs_repair, it only needs the owner and offset information to distinguish between reverse mappings of the same physical extent into the data fork of a file at multiple offsets. The other bits are not important for key comparisons for index lookups, and never have been. Eric Sandeen reports that this causes regressions in generic/299, so undo this patch before it does more damage. Reported-by: Eric Sandeen <sandeen@sandeen.net> Fixes: 6ff646b2ceb0 ("xfs: fix rmap key and record comparison functions") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2020-11-19ext4: drop fast_commit from /proc/mountsTheodore Ts'o1-4/+0
The options in /proc/mounts must be valid mount options --- and fast_commit is not a mount option. Otherwise, command sequences like this will fail: # mount /dev/vdc /vdc # mkdir -p /vdc/phoronix_test_suite /pts # mount --bind /vdc/phoronix_test_suite /pts # mount -o remount,nodioread_nolock /pts mount: /pts: mount point not mounted or bad option. And in the system logs, you'll find: EXT4-fs (vdc): Unrecognized mount option "fast_commit" or missing value Fixes: 995a3ed67fc8 ("ext4: add fast_commit feature and handling for extended mount options") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-19xfs: don't allow NOWAIT DIO across extent boundariesDave Chinner1-0/+29
Jens has reported a situation where partial direct IOs can be issued and completed yet still return -EAGAIN. We don't want this to report a short IO as we want XFS to complete user DIO entirely or not at all. This partial IO situation can occur on a write IO that is split across an allocated extent and a hole, and the second mapping is returning EAGAIN because allocation would be required. The trivial reproducer: $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo wrote 4096/4096 bytes at offset 0 4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec) pwrite: Resource temporarily unavailable $ The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done the first 4kB write: xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000 iomap_apply: dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin xfs_iomap_found: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1 iomap_apply_dstmap: dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY Here the first iomap loop has mapped the first 4kB of the file and issued the IO, and we enter the second iomap_apply loop: iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin And we exit with -EAGAIN out because we hit the allocate case trying to make the second 4kB block. Then IO completes on the first 4kB and the original IO context completes and unlocks the inode, returning -EAGAIN to userspace: xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096 xfs_iunlock: dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write There are other vectors to the same problem when we re-enter the mapping code if we have to make multiple mappinfs under NOWAIT conditions. e.g. failing trylocks, COW extents being found, allocation being required, and so on. Avoid all these potential problems by only allowing IOMAP_NOWAIT IO to go ahead if the mapping we retrieve for the IO spans an entire allocated extent. This avoids the possibility of subsequent mappings to complete the IO from triggering NOWAIT semantics by any means as NOWAIT IO will now only enter the mapping code once per NOWAIT IO. Reported-and-tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-11-18xfs: return corresponding errcode if xfs_initialize_perag() failYu Kuai1-3/+8
In xfs_initialize_perag(), if kmem_zalloc(), xfs_buf_hash_init(), or radix_tree_preload() failed, the returned value 'error' is not set accordingly. Reported-as-fixing: 8b26c5825e02 ("xfs: handle ENOMEM correctly during initialisation of perag structures") Fixes: 9b2471797942 ("xfs: cache unlinked pointers in an rhashtable") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-11-18xfs: ensure inobt record walks always make forward progressDarrick J. Wong1-3/+24
The aim of the inode btree record iterator function is to call a callback on every record in the btree. To avoid having to tear down and recreate the inode btree cursor around every callback, it caches a certain number of records in a memory buffer. After each batch of callback invocations, we have to perform a btree lookup to find the next record after where we left off. However, if the keys of the inode btree are corrupt, the lookup might put us in the wrong part of the inode btree, causing the walk function to loop forever. Therefore, we add extra cursor tracking to make sure that we never go backwards neither when performing the lookup nor when jumping to the next inobt record. This also fixes an off by one error where upon resume the lookup should have been for the inode /after/ the point at which we stopped. Found by fuzzing xfs/460 with keys[2].startino = ones causing bulkstat and quotacheck to hang. Fixes: a211432c27ff ("xfs: create simplified inode walk function") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-11-18xfs: fix forkoff miscalculation related to XFS_LITINO(mp)Gao Xiang1-1/+7
Currently, commit e9e2eae89ddb dropped a (int) decoration from XFS_LITINO(mp), and since sizeof() expression is also involved, the result of XFS_LITINO(mp) is simply as the size_t type (commonly unsigned long). Considering the expression in xfs_attr_shortform_bytesfit(): offset = (XFS_LITINO(mp) - bytes) >> 3; let "bytes" be (int)340, and "XFS_LITINO(mp)" be (unsigned long)336. on 64-bit platform, the expression is offset = ((unsigned long)336 - (int)340) >> 3 = (int)(0xfffffffffffffffcUL >> 3) = -1 but on 32-bit platform, the expression is offset = ((unsigned long)336 - (int)340) >> 3 = (int)(0xfffffffcUL >> 3) = 0x1fffffff instead. so offset becomes a large positive number on 32-bit platform, and cause xfs_attr_shortform_bytesfit() returns maxforkoff rather than 0. Therefore, one result is "ASSERT(new_size <= XFS_IFORK_SIZE(ip, whichfork));" assertion failure in xfs_idata_realloc(), which was also the root cause of the original bugreport from Dennis, see: https://bugzilla.redhat.com/show_bug.cgi?id=1894177 And it can also be manually triggered with the following commands: $ touch a; $ setfattr -n user.0 -v "`seq 0 80`" a; $ setfattr -n user.1 -v "`seq 0 80`" a on 32-bit platform. Fix the case in xfs_attr_shortform_bytesfit() by bailing out "XFS_LITINO(mp) < bytes" in advance suggested by Eric and a misleading comment together with this bugfix suggested by Darrick. It seems the other users of XFS_LITINO(mp) are not impacted. Fixes: e9e2eae89ddb ("xfs: only check the superblock version for dinode size calculation") Cc: <stable@vger.kernel.org> # 5.7+ Reported-and-tested-by: Dennis Gilmore <dgilmore@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-11-18xfs: directory scrub should check the null bestfree entries tooDarrick J. Wong1-4/+17
Teach the directory scrubber to check all the bestfree entries, including the null ones. We want to be able to detect the case where the entry is null but there actually /is/ a directory data block. Found by fuzzing lbests[0] = ones in xfs/391. Fixes: df481968f33b ("xfs: scrub directory freespace") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-18xfs: strengthen rmap record flags checkingDarrick J. Wong1-4/+4
We always know the correct state of the rmap record flags (attr, bmbt, unwritten) so check them by direct comparison. Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-18xfs: fix the minrecs logic when dealing with inode root child blocksDarrick J. Wong1-18/+27
The comment and logic in xchk_btree_check_minrecs for dealing with inode-rooted btrees isn't quite correct. While the direct children of the inode root are allowed to have fewer records than what would normally be allowed for a regular ondisk btree block, this is only true if there is only one child block and the number of records don't fit in the inode root. Fixes: 08a3a692ef58 ("xfs: btree scrub should check minrecs") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-18gfs2: Fix regression in freeze_go_syncBob Peterson1-1/+12
Patch 541656d3a513 ("gfs2: freeze should work on read-only mounts") changed the check for glock state in function freeze_go_sync() from "gl->gl_state == LM_ST_SHARED" to "gl->gl_req == LM_ST_EXCLUSIVE". That's wrong and it regressed gfs2's freeze/thaw mechanism because it caused only the freezing node (which requests the glock in EX) to queue freeze work. All nodes go through this go_sync code path during the freeze to drop their SHared hold on the freeze glock, allowing the freezing node to acquire it in EXclusive mode. But all the nodes must freeze access to the file system locally, so they ALL must queue freeze work. The freeze_work calls freeze_func, which makes a request to reacquire the freeze glock in SH, effectively blocking until the thaw from the EX holder. Once thawed, the freezing node drops its EX hold on the freeze glock, then the (blocked) freeze_func reacquires the freeze glock in SH again (on all nodes, including the freezer) so all nodes go back to a thawed state. This patch changes the check back to gl_state == LM_ST_SHARED like it was prior to 541656d3a513. Fixes: 541656d3a513 ("gfs2: freeze should work on read-only mounts") Cc: stable@vger.kernel.org # v5.8+ Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-11-18io_uring: order refnode recyclingPavel Begunkov1-10/+23
Don't recycle a refnode until we're done with all requests of nodes ejected before. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-18io_uring: get an active ref_node from files_dataPavel Begunkov1-3/+1
An active ref_node always can be found in ctx->files_data, it's much safer to get it this way instead of poking into files_data->ref_list. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-18io_uring: don't double complete failed reissue requestJens Axboe1-1/+0
Zorro reports that an xfstest test case is failing, and it turns out that for the reissue path we can potentially issue a double completion on the request for the failure path. There's an issue around the retry as well, but for now, at least just make sure that we handle the error path correctly. Cc: stable@vger.kernel.org Fixes: b63534c41e20 ("io_uring: re-issue block requests that failed because of resources") Reported-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16smb3: Handle error case during offload read pathRohith Surabattula1-1/+19
Mid callback needs to be called only when valid data is read into pages. These patches address a problem found during decryption offload: CIFS: VFS: trying to dequeue a deleted mid that could cause a refcount use after free: Workqueue: smb3decryptd smb2_decrypt_offload [cifs] Signed-off-by: Rohith Surabattula <rohiths@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org> #5.4+ Signed-off-by: Steve French <stfrench@microsoft.com>