summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2019-10-05btrfs: adjust dirty_metadata_bytes after writeback failure of extent bufferDennis Zhou1-0/+9
commit eb5b64f142504a597d67e2109d603055ff765e52 upstream. Before, if a eb failed to write out, we would end up triggering a BUG_ON(). As of f4340622e0226 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up"), we no longer BUG_ON(), so we should make life consistent and add back the unwritten bytes to dirty_metadata_bytes. Fixes: f4340622e022 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up") CC: stable@vger.kernel.org # 5.2+ Reviewed-by: Filipe Manana <fdmanana@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05btrfs: Relinquish CPUs in btrfs_compare_treesNikolay Borisov1-0/+1
commit 6af112b11a4bc1b560f60a618ac9c1dcefe9836e upstream. When doing any form of incremental send the parent and the child trees need to be compared via btrfs_compare_trees. This can result in long loop chains without ever relinquishing the CPU. This causes softlockup detector to trigger when comparing trees with a lot of items. Example report: watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153] CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased) Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 40000005 (nZcv daif -PAN -UAO) pc : __ll_sc_arch_atomic_sub_return+0x14/0x20 lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs] sp : ffff00001273b7e0 Call trace: __ll_sc_arch_atomic_sub_return+0x14/0x20 release_extent_buffer+0xdc/0x120 [btrfs] free_extent_buffer.part.0+0xb0/0x118 [btrfs] free_extent_buffer+0x24/0x30 [btrfs] btrfs_release_path+0x4c/0xa0 [btrfs] btrfs_free_path.part.0+0x20/0x40 [btrfs] btrfs_free_path+0x24/0x30 [btrfs] get_inode_info+0xa8/0xf8 [btrfs] finish_inode_if_needed+0xe0/0x6d8 [btrfs] changed_cb+0x9c/0x410 [btrfs] btrfs_compare_trees+0x284/0x648 [btrfs] send_subvol+0x33c/0x520 [btrfs] btrfs_ioctl_send+0x8a0/0xaf0 [btrfs] btrfs_ioctl+0x199c/0x2288 [btrfs] do_vfs_ioctl+0x4b0/0x820 ksys_ioctl+0x84/0xb8 __arm64_sys_ioctl+0x28/0x38 el0_svc_common.constprop.0+0x7c/0x188 el0_svc_handler+0x34/0x90 el0_svc+0x8/0xc Fix this by adding a call to cond_resched at the beginning of the main loop in btrfs_compare_trees. Fixes: 7069830a9e38 ("Btrfs: add btrfs_compare_trees function") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05Btrfs: fix use-after-free when using the tree modification logFilipe Manana1-1/+3
commit efad8a853ad2057f96664328a0d327a05ce39c76 upstream. At ctree.c:get_old_root(), we are accessing a root's header owner field after we have freed the respective extent buffer. This results in an use-after-free that can lead to crashes, and when CONFIG_DEBUG_PAGEALLOC is set, results in a stack trace like the following: [ 3876.799331] stack segment: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [ 3876.799363] CPU: 0 PID: 15436 Comm: pool Not tainted 5.3.0-rc3-btrfs-next-54 #1 [ 3876.799385] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [ 3876.799433] RIP: 0010:btrfs_search_old_slot+0x652/0xd80 [btrfs] (...) [ 3876.799502] RSP: 0018:ffff9f08c1a2f9f0 EFLAGS: 00010286 [ 3876.799518] RAX: ffff8dd300000000 RBX: ffff8dd85a7a9348 RCX: 000000038da26000 [ 3876.799538] RDX: 0000000000000000 RSI: ffffe522ce368980 RDI: 0000000000000246 [ 3876.799559] RBP: dae1922adadad000 R08: 0000000008020000 R09: ffffe522c0000000 [ 3876.799579] R10: ffff8dd57fd788c8 R11: 000000007511b030 R12: ffff8dd781ddc000 [ 3876.799599] R13: ffff8dd9e6240578 R14: ffff8dd6896f7a88 R15: ffff8dd688cf90b8 [ 3876.799620] FS: 00007f23ddd97700(0000) GS:ffff8dda20200000(0000) knlGS:0000000000000000 [ 3876.799643] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3876.799660] CR2: 00007f23d4024000 CR3: 0000000710bb0005 CR4: 00000000003606f0 [ 3876.799682] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3876.799703] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3876.799723] Call Trace: [ 3876.799735] ? do_raw_spin_unlock+0x49/0xc0 [ 3876.799749] ? _raw_spin_unlock+0x24/0x30 [ 3876.799779] resolve_indirect_refs+0x1eb/0xc80 [btrfs] [ 3876.799810] find_parent_nodes+0x38d/0x1180 [btrfs] [ 3876.799841] btrfs_check_shared+0x11a/0x1d0 [btrfs] [ 3876.799870] ? extent_fiemap+0x598/0x6e0 [btrfs] [ 3876.799895] extent_fiemap+0x598/0x6e0 [btrfs] [ 3876.799913] do_vfs_ioctl+0x45a/0x700 [ 3876.799926] ksys_ioctl+0x70/0x80 [ 3876.799938] ? trace_hardirqs_off_thunk+0x1a/0x20 [ 3876.799953] __x64_sys_ioctl+0x16/0x20 [ 3876.799965] do_syscall_64+0x62/0x220 [ 3876.799977] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 3876.799993] RIP: 0033:0x7f23e0013dd7 (...) [ 3876.800056] RSP: 002b:00007f23ddd96ca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 3876.800078] RAX: ffffffffffffffda RBX: 00007f23d80210f8 RCX: 00007f23e0013dd7 [ 3876.800099] RDX: 00007f23d80210f8 RSI: 00000000c020660b RDI: 0000000000000003 [ 3876.800626] RBP: 000055fa2a2a2440 R08: 0000000000000000 R09: 00007f23ddd96d7c [ 3876.801143] R10: 00007f23d8022000 R11: 0000000000000246 R12: 00007f23ddd96d80 [ 3876.801662] R13: 00007f23ddd96d78 R14: 00007f23d80210f0 R15: 00007f23ddd96d80 (...) [ 3876.805107] ---[ end trace e53161e179ef04f9 ]--- Fix that by saving the root's header owner field into a local variable before freeing the root's extent buffer, and then use that local variable when needed. Fixes: 30b0463a9394d9 ("Btrfs: fix accessing the root pointer in tree mod log functions") CC: stable@vger.kernel.org # 3.10+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05btrfs: fix allocation of free space cache v1 bitmap pagesChristophe Leroy3-7/+22
commit 3acd48507dc43eeeb0a1fe965b8bad91cab904a7 upstream. Various notifications of type "BUG kmalloc-4096 () : Redzone overwritten" have been observed recently in various parts of the kernel. After some time, it has been made a relation with the use of BTRFS filesystem and with SLUB_DEBUG turned on. [ 22.809700] BUG kmalloc-4096 (Tainted: G W ): Redzone overwritten [ 22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc [ 22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224 [ 22.811193] __slab_alloc.constprop.26+0x44/0x70 [ 22.811345] kmem_cache_alloc_trace+0xf0/0x2ec [ 22.811588] __load_free_space_cache+0x588/0x780 [btrfs] [ 22.811848] load_free_space_cache+0xf4/0x1b0 [btrfs] [ 22.812090] cache_block_group+0x1d0/0x3d0 [btrfs] [ 22.812321] find_free_extent+0x680/0x12a4 [btrfs] [ 22.812549] btrfs_reserve_extent+0xec/0x220 [btrfs] [ 22.812785] btrfs_alloc_tree_block+0x178/0x5f4 [btrfs] [ 22.813032] __btrfs_cow_block+0x150/0x5d4 [btrfs] [ 22.813262] btrfs_cow_block+0x194/0x298 [btrfs] [ 22.813484] commit_cowonly_roots+0x44/0x294 [btrfs] [ 22.813718] btrfs_commit_transaction+0x63c/0xc0c [btrfs] [ 22.813973] close_ctree+0xf8/0x2a4 [btrfs] [ 22.814107] generic_shutdown_super+0x80/0x110 [ 22.814250] kill_anon_super+0x18/0x30 [ 22.814437] btrfs_kill_super+0x18/0x90 [btrfs] [ 22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83 [ 22.814841] proc_cgroup_show+0xc0/0x248 [ 22.814967] proc_single_show+0x54/0x98 [ 22.815086] seq_read+0x278/0x45c [ 22.815190] __vfs_read+0x28/0x17c [ 22.815289] vfs_read+0xa8/0x14c [ 22.815381] ksys_read+0x50/0x94 [ 22.815475] ret_from_syscall+0x0/0x38 Commit 69d2480456d1 ("btrfs: use copy_page for copying pages instead of memcpy") changed the way bitmap blocks are copied. But allthough bitmaps have the size of a page, they were allocated with kzalloc(). Most of the time, kzalloc() allocates aligned blocks of memory, so copy_page() can be used. But when some debug options like SLAB_DEBUG are activated, kzalloc() may return unaligned pointer. On powerpc, memcpy(), copy_page() and other copying functions use 'dcbz' instruction which provides an entire zeroed cacheline to avoid memory read when the intention is to overwrite a full line. Functions like memcpy() are writen to care about partial cachelines at the start and end of the destination, but copy_page() assumes it gets pages. As pages are naturally cache aligned, copy_page() doesn't care about partial lines. This means that when copy_page() is called with a misaligned pointer, a few leading bytes are zeroed. To fix it, allocate bitmaps through kmem_cache instead of using kzalloc() The cache pool is created with PAGE_SIZE alignment constraint. Reported-by: Erhard F. <erhard_f@mailbox.org> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371 Fixes: 69d2480456d1 ("btrfs: use copy_page for copying pages instead of memcpy") Cc: stable@vger.kernel.org # 4.19+ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: David Sterba <dsterba@suse.com> [ rename to btrfs_free_space_bitmap ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05ovl: filter of trusted xattr results in auditMark Salyzyn1-1/+2
commit 5c2e9f346b815841f9bed6029ebcb06415caf640 upstream. When filtering xattr list for reading, presence of trusted xattr results in a security audit log. However, if there is other content no errno will be set, and if there isn't, the errno will be -ENODATA and not -EPERM as is usually associated with a lack of capability. The check does not block the request to list the xattrs present. Switch to ns_capable_noaudit to reflect a more appropriate check. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: linux-security-module@vger.kernel.org Cc: kernel-team@android.com Cc: stable@vger.kernel.org # v3.18+ Fixes: a082c6f680da ("ovl: filter trusted xattr for non-admin") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05ovl: Fix dereferencing possible ERR_PTR()Ding Xiang1-2/+1
commit 97f024b9171e74c4443bbe8a8dce31b917f97ac5 upstream. if ovl_encode_real_fh() fails, no memory was allocated and the error in the error-valued pointer should be returned. Fixes: 9b6faee07470 ("ovl: check ERR_PTR() return value from ovl_encode_fh()") Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Cc: <stable@vger.kernel.org> # v4.16+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05smb3: fix leak in "open on server" perf counterSteve French2-0/+6
commit d2f15428d6a0ebfc0edc364094d7c4a2de7037ed upstream. We were not bumping up the "open on server" (num_remote_opens) counter (in some cases) on opens of the share root so could end up showing as a negative value. CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05smb3: fix unmount hang in open_shrootSteve French1-10/+11
commit 96d9f7ed00b86104bf03adeffc8980897e9694ab upstream. An earlier patch "CIFS: fix deadlock in cached root handling" did not completely address the deadlock in open_shroot. This patch addresses the deadlock. In testing the recent patch: smb3: improve handling of share deleted (and share recreated) we were able to reproduce the open_shroot deadlock to one of the target servers in unmount in a delete share scenario. Fixes: 7e5a70ad88b1e ("CIFS: fix deadlock in cached root handling") This is version 2 of this patch. An earlier version of this patch "smb3: fix unmount hang in open_shroot" had a problem found by Dan. Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Suggested-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Aurelien Aptel <aaptel@suse.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05smb3: allow disabling requesting leasesSteve French4-2/+13
commit 3e7a02d47872081f4b6234a9f72500f1d10f060c upstream. In some cases to work around server bugs or performance problems it can be helpful to be able to disable requesting SMB2.1/SMB3 leases on a particular mount (not to all servers and all shares we are mounted to). Add new mount parm "nolease" which turns off requesting leases on directory or file opens. Currently the only way to disable leases is globally through a module load parameter. This is more granular. Suggested-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05gfs2: clear buf_in_tr when ending a transaction in sweep_bh_for_rgrpsBob Peterson1-0/+1
commit f0b444b349e33ae0d3dd93e25ca365482a5d17d4 upstream. In function sweep_bh_for_rgrps, which is a helper for punch_hole, it uses variable buf_in_tr to keep track of when it needs to commit pending block frees on a partial delete that overflows the transaction created for the delete. The problem is that the variable was initialized at the start of function sweep_bh_for_rgrps but it was never cleared, even when starting a new transaction. This patch reinitializes the variable when the transaction is ended, so the next transaction starts out with it cleared. Fixes: d552a2b9b33e ("GFS2: Non-recursive delete") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05binfmt_elf: Do not move brk for INTERP-less ET_EXECKees Cook1-1/+2
commit 7be3cb019db1cbd5fd5ffe6d64a23fefa4b6f229 upstream. When brk was moved for binaries without an interpreter, it should have been limited to ET_DYN only. In other words, the special case was an ET_DYN that lacks an INTERP, not just an executable that lacks INTERP. The bug manifested for giant static executables, where the brk would end up in the middle of the text area on 32-bit architectures. Reported-and-tested-by: Richard Kojedzinszky <richard@kojedz.in> Fixes: bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05fuse: fix beyond-end-of-page access in fuse_parse_cache()Tejun Heo1-1/+3
commit e5854b1cdf6cb48a20e01e3bdad0476a4c60a077 upstream. With DEBUG_PAGEALLOC on, the following triggers. BUG: unable to handle page fault for address: ffff88859367c000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 3001067 P4D 3001067 PUD 406d3a8067 PMD 406d30c067 PTE 800ffffa6c983060 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 38 PID: 3110657 Comm: python2.7 RIP: 0010:fuse_readdir+0x88f/0xe7a [fuse] Code: 49 8b 4d 08 49 39 4e 60 0f 84 44 04 00 00 48 8b 43 08 43 8d 1c 3c 4d 01 7e 68 49 89 dc 48 03 5c 24 38 49 89 46 60 8b 44 24 30 <8b> 4b 10 44 29 e0 48 89 ca 48 83 c1 1f 48 83 e1 f8 83 f8 17 49 89 RSP: 0018:ffffc90035edbde0 EFLAGS: 00010286 RAX: 0000000000001000 RBX: ffff88859367bff0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88859367bfed RDI: 0000000000920907 RBP: ffffc90035edbe90 R08: 000000000000014b R09: 0000000000000004 R10: ffff88859367b000 R11: 0000000000000000 R12: 0000000000000ff0 R13: ffffc90035edbee0 R14: ffff889fb8546180 R15: 0000000000000020 FS: 00007f80b5f4a740(0000) GS:ffff889fffa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88859367c000 CR3: 0000001c170c2001 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: iterate_dir+0x122/0x180 __x64_sys_getdents+0xa6/0x140 do_syscall_64+0x42/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 It's in fuse_parse_cache(). %rbx (ffff88859367bff0) is fuse_dirent pointer - addr + offset. FUSE_DIRENT_SIZE() is trying to dereference namelen off of it but that derefs into the next page which is disabled by pagealloc debug causing a PF. This is caused by dirent->namelen being accessed before ensuring that there's enough bytes in the page for the dirent. Fix it by pushing down reclen calculation. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 5d7bc7e8680c ("fuse: allow using readdir cache") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05fuse: fix missing unlock_page in fuse_writepage()Vasily Averin1-0/+1
commit d5880c7a8620290a6c90ced7a0e8bd0ad9419601 upstream. unlock_page() was missing in case of an already in-flight write against the same page. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Fixes: ff17be086477 ("fuse: writepage: skip already in flight") Cc: <stable@vger.kernel.org> # v3.13 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05fuse: fix deadlock with aio poll and fuse_iqueue::waitq.lockEric Biggers3-45/+52
commit 76e43c8ccaa35c30d5df853013561145a0f750a5 upstream. When IOCB_CMD_POLL is used on the FUSE device, aio_poll() disables IRQs and takes kioctx::ctx_lock, then fuse_iqueue::waitq.lock. This may have to wait for fuse_iqueue::waitq.lock to be released by one of many places that take it with IRQs enabled. Since the IRQ handler may take kioctx::ctx_lock, lockdep reports that a deadlock is possible. Fix it by protecting the state of struct fuse_iqueue with a separate spinlock, and only accessing fuse_iqueue::waitq using the versions of the waitqueue functions which do IRQ-safe locking internally. Reproducer: #include <fcntl.h> #include <stdio.h> #include <sys/mount.h> #include <sys/stat.h> #include <sys/syscall.h> #include <unistd.h> #include <linux/aio_abi.h> int main() { char opts[128]; int fd = open("/dev/fuse", O_RDWR); aio_context_t ctx = 0; struct iocb cb = { .aio_lio_opcode = IOCB_CMD_POLL, .aio_fildes = fd }; struct iocb *cbp = &cb; sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd); mkdir("mnt", 0700); mount("foo", "mnt", "fuse", 0, opts); syscall(__NR_io_setup, 1, &ctx); syscall(__NR_io_submit, ctx, 1, &cbp); } Beginning of lockdep output: ===================================================== WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected 5.3.0-rc5 #9 Not tainted ----------------------------------------------------- syz_fuse/135 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: 000000003590ceda (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline] 000000003590ceda (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1751 [inline] 000000003590ceda (&fiq->waitq){+.+.}, at: __io_submit_one.constprop.0+0x203/0x5b0 fs/aio.c:1825 and this task is already holding: 0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:363 [inline] 0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1749 [inline] 0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one.constprop.0+0x1f4/0x5b0 fs/aio.c:1825 which would create a new lock dependency: (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.} but this new dependency connects a SOFTIRQ-irq-safe lock: (&(&ctx->ctx_lock)->rlock){..-.} [...] Reported-by: syzbot+af05535bb79520f95431@syzkaller.appspotmail.com Reported-by: syzbot+d86c4426a01f60feddc7@syzkaller.appspotmail.com Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL") Cc: <stable@vger.kernel.org> # v4.19+ Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05btrfs: Detect unbalanced tree with empty leaf before crashing btree operationsQu Wenruo2-0/+16
[ Upstream commit 62fdaa52a3d00a875da771719b6dc537ca79fce1 ] [BUG] With crafted image, btrfs will panic at btree operations: kernel BUG at fs/btrfs/ctree.c:3894! invalid opcode: 0000 [#1] SMP PTI CPU: 0 PID: 1138 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9 RIP: 0010:__push_leaf_left+0x6b6/0x6e0 RSP: 0018:ffffc0bd4128b990 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffa0a4ab8f0e38 RCX: 0000000000000000 RDX: ffffa0a280000000 RSI: 0000000000000000 RDI: ffffa0a4b3814000 RBP: ffffc0bd4128ba38 R08: 0000000000001000 R09: ffffc0bd4128b948 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000240 R13: ffffa0a4b556fb60 R14: ffffa0a4ab8f0af0 R15: ffffa0a4ab8f0af0 FS: 0000000000000000(0000) GS:ffffa0a4b7a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f2461c80020 CR3: 000000022b32a006 CR4: 00000000000206f0 Call Trace: ? _cond_resched+0x1a/0x50 push_leaf_left+0x179/0x190 btrfs_del_items+0x316/0x470 btrfs_del_csums+0x215/0x3a0 __btrfs_free_extent.isra.72+0x5a7/0xbe0 __btrfs_run_delayed_refs+0x539/0x1120 btrfs_run_delayed_refs+0xdb/0x1b0 btrfs_commit_transaction+0x52/0x950 ? start_transaction+0x94/0x450 transaction_kthread+0x163/0x190 kthread+0x105/0x140 ? btrfs_cleanup_transaction+0x560/0x560 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x35/0x40 Modules linked in: ---[ end trace c2425e6e89b5558f ]--- [CAUSE] The offending csum tree looks like this: checksum tree key (CSUM_TREE ROOT_ITEM 0) node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE ... key (EXTENT_CSUM EXTENT_CSUM 85975040) block 29630464 gen 17 key (EXTENT_CSUM EXTENT_CSUM 89911296) block 29642752 gen 17 <<< key (EXTENT_CSUM EXTENT_CSUM 92274688) block 29646848 gen 17 ... leaf 29630464 items 6 free space 1 generation 17 owner CSUM_TREE item 0 key (EXTENT_CSUM EXTENT_CSUM 85975040) itemoff 3987 itemsize 8 range start 85975040 end 85983232 length 8192 ... leaf 29642752 items 0 free space 3995 generation 17 owner 0 ^ empty leaf invalid owner ^ leaf 29646848 items 1 free space 602 generation 17 owner CSUM_TREE item 0 key (EXTENT_CSUM EXTENT_CSUM 92274688) itemoff 627 itemsize 3368 range start 92274688 end 95723520 length 3448832 So we have a corrupted csum tree where one tree leaf is completely empty, causing unbalanced btree, thus leading to unexpected btree balance error. [FIX] For this particular case, we handle it in two directions to catch it: - Check if the tree block is empty through btrfs_verify_level_key() So that invalid tree blocks won't be read out through btrfs_search_slot() and its variants. - Check 0 tree owner in tree checker NO tree is using 0 as its tree owner, detect it and reject at tree block read time. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202821 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05btrfs: tree-checker: Add ROOT_ITEM checkQu Wenruo1-0/+92
[ Upstream commit 259ee7754b6793af8bdd77f9ca818bc41cfe9541 ] This patch will introduce ROOT_ITEM check, which includes: - Key->objectid and key->offset check Currently only some easy check, e.g. 0 as rootid is invalid. - Item size check Root item size is fixed. - Generation checks Generation, generation_v2 and last_snapshot should not be greater than super generation + 1 - Level and alignment check Level should be in [0, 7], and bytenr must be aligned to sector size. - Flags check Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203261 Reported-by: Jungyeon Yoon <jungyeon.yoon@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05btrfs: extent-tree: Make sure we only allocate extents from block groups ↵Qu Wenruo1-0/+8
with the same type [ Upstream commit 2a28468e525f3924efed7f29f2bc5a2926e7e19a ] [BUG] With fuzzed image and MIXED_GROUPS super flag, we can hit the following BUG_ON(): kernel BUG at fs/btrfs/delayed-ref.c:491! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 1849 Comm: sync Tainted: G O 5.2.0-custom #27 RIP: 0010:update_existing_head_ref.cold+0x44/0x46 [btrfs] Call Trace: add_delayed_ref_head+0x20c/0x2d0 [btrfs] btrfs_add_delayed_tree_ref+0x1fc/0x490 [btrfs] btrfs_free_tree_block+0x123/0x380 [btrfs] __btrfs_cow_block+0x435/0x500 [btrfs] btrfs_cow_block+0x110/0x240 [btrfs] btrfs_search_slot+0x230/0xa00 [btrfs] ? __lock_acquire+0x105e/0x1e20 btrfs_insert_empty_items+0x67/0xc0 [btrfs] alloc_reserved_file_extent+0x9e/0x340 [btrfs] __btrfs_run_delayed_refs+0x78e/0x1240 [btrfs] ? kvm_clock_read+0x18/0x30 ? __sched_clock_gtod_offset+0x21/0x50 btrfs_run_delayed_refs.part.0+0x4e/0x180 [btrfs] btrfs_run_delayed_refs+0x23/0x30 [btrfs] btrfs_commit_transaction+0x53/0x9f0 [btrfs] btrfs_sync_fs+0x7c/0x1c0 [btrfs] ? __ia32_sys_fdatasync+0x20/0x20 sync_fs_one_sb+0x23/0x30 iterate_supers+0x95/0x100 ksys_sync+0x62/0xb0 __ia32_sys_sync+0xe/0x20 do_syscall_64+0x65/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe [CAUSE] This situation is caused by several factors: - Fuzzed image The extent tree of this fs missed one backref for extent tree root. So we can allocated space from that slot. - MIXED_BG feature Super block has MIXED_BG flag. - No mixed block groups exists All block groups are just regular ones. This makes data space_info->block_groups[] contains metadata block groups. And when we reserve space for data, we can use space in metadata block group. Then we hit the following file operations: - fallocate We need to allocate data extents. find_free_extent() choose to use the metadata block to allocate space from, and choose the space of extent tree root, since its backref is missing. This generate one delayed ref head with is_data = 1. - extent tree update We need to update extent tree at run_delayed_ref time. This generate one delayed ref head with is_data = 0, for the same bytenr of old extent tree root. Then we trigger the BUG_ON(). [FIX] The quick fix here is to check block_group->flags before using it. The problem can only happen for MIXED_GROUPS fs. Regular filesystems won't have space_info with DATA|METADATA flag, and no way to hit the bug. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203255 Reported-by: Jungyeon Yoon <jungyeon.yoon@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05btrfs: delayed-inode: Kill the BUG_ON() in btrfs_delete_delayed_dir_index()Qu Wenruo1-2/+11
[ Upstream commit 933c22a7512c5c09b1fdc46b557384efe8d03233 ] There is one report of fuzzed image which leads to BUG_ON() in btrfs_delete_delayed_dir_index(). Although that fuzzed image can already be addressed by enhanced extent-tree error handler, it's still better to hunt down more BUG_ON(). This patch will hunt down two BUG_ON()s in btrfs_delete_delayed_dir_index(): - One for error from btrfs_delayed_item_reserve_metadata() Instead of BUG_ON(), we output an error message and free the item. And return the error. All callers of this function handles the error by aborting current trasaction. - One for possible EEXIST from __btrfs_add_delayed_deletion_item() That function can return -EEXIST. We already have a good enough error message for that, only need to clean up the reserved metadata space and allocated item. To help above cleanup, also modifiy __btrfs_remove_delayed_item() called in btrfs_release_delayed_item(), to skip unassociated item. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203253 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05io_uring: fix wrong sequence setting logicJackie Liu1-1/+3
[ Upstream commit 8776f3fa15a5cd213c4dfab7ddaf557983374ea6 ] Sqo_thread will get sqring in batches, which will cause ctx->cached_sq_head to be added in batches. if one of these sqes is set with the DRAIN flag, then he will never get a chance to process, and finally sqo_thread will not exit. Fixes: de0617e4671 ("io_uring: add support for marking commands as draining") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-21ovl: fix regression caused by overlapping layers detectionAmir Goldstein2-26/+48
commit 0be0bfd2de9dfdd2098a9c5b14bdd8f739c9165d upstream. Once upon a time, commit 2cac0c00a6cd ("ovl: get exclusive ownership on upper/work dirs") in v4.13 added some sanity checks on overlayfs layers. This change caused a docker regression. The root cause was mount leaks by docker, which as far as I know, still exist. To mitigate the regression, commit 85fdee1eef1a ("ovl: fix regression caused by exclusive upper/work dir protection") in v4.14 turned the mount errors into warnings for the default index=off configuration. Recently, commit 146d62e5a586 ("ovl: detect overlapping layers") in v5.2, re-introduced exclusive upper/work dir checks regardless of index=off configuration. This changes the status quo and mount leak related bug reports have started to re-surface. Restore the status quo to fix the regressions. To clarify, index=off does NOT relax overlapping layers check for this ovelayfs mount. index=off only relaxes exclusive upper/work dir checks with another overlayfs mount. To cover the part of overlapping layers detection that used the exclusive upper/work dir checks to detect overlap with self upper/work dir, add a trap also on the work base dir. Link: https://github.com/moby/moby/issues/34672 Link: https://lore.kernel.org/linux-fsdevel/20171006121405.GA32700@veci.piliscsaba.szeredi.hu/ Link: https://github.com/containers/libpod/issues/3540 Fixes: 146d62e5a586 ("ovl: detect overlapping layers") Cc: <stable@vger.kernel.org> # v4.19+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Tested-by: Colin Walters <walters@verbum.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-15Revert "ext4: make __ext4_get_inode_loc plug"Linus Torvalds1-3/+0
This reverts commit b03755ad6f33b7b8cd7312a3596a2dbf496de6e7. This is sad, and done for all the wrong reasons. Because that commit is good, and does exactly what it says: avoids a lot of small disk requests for the inode table read-ahead. However, it turns out that it causes an entirely unrelated problem: the getrandom() system call was introduced back in 2014 by commit c6e9d6f38894 ("random: introduce getrandom(2) system call"), and people use it as a convenient source of good random numbers. But part of the current semantics for getrandom() is that it waits for the entropy pool to fill at least partially (unlike /dev/urandom). And at least ArchLinux apparently has a systemd that uses getrandom() at boot time, and the improvements in IO patterns means that existing installations suddenly start hanging, waiting for entropy that will never happen. It seems to be an unlucky combination of not _quite_ enough entropy, together with a particular systemd version and configuration. Lennart says that the systemd-random-seed process (which is what does this early access) is supposed to not block any other boot activity, but sadly that doesn't actually seem to be the case (possibly due bogus dependencies on cryptsetup for encrypted swapspace). The correct fix is to fix getrandom() to not block when it's not appropriate, but that fix is going to take a lot more discussion. Do we just make it act like /dev/urandom by default, and add a new flag for "wait for entropy"? Do we add a boot-time option? Or do we just limit the amount of time it will wait for entropy? So in the meantime, we do the revert to give us time to discuss the eventual fix for the fundamental problem, at which point we can re-apply the ext4 inode table access optimization. Reported-by: Ahmed S. Darwish <darwish.07@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Alexander E. Patrakov <patrakov@gmail.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-13Merge tag 'for-5.3-rc8-tag' of ↵Linus Torvalds2-17/+34
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Here are two fixes, one of them urgent fixing a bug introduced in 5.2 and reported by many users. It took time to identify the root cause, catching the 5.3 release is higly desired also to push the fix to 5.2 stable tree. The bug is a mess up of return values after adding proper error handling and honestly the kind of bug that can cause sleeping disorders until it's caught. My appologies to everybody who was affected. Summary of what could happen: 1) either a hang when committing a transaction, if this happens there's no risk of corruption, still the hang is very inconvenient and can't be resolved without a reboot 2) writeback for some btree nodes may never be started and we end up committing a transaction without noticing that, this is really serious and that will lead to the "parent transid verify failed" messages" * tag 'for-5.3-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix unwritten extent buffers and hangs on future writeback attempts Btrfs: fix assertion failure during fsync and use of stale transaction
2019-09-12Btrfs: fix unwritten extent buffers and hangs on future writeback attemptsFilipe Manana1-9/+26
The lock_extent_buffer_io() returns 1 to the caller to tell it everything went fine and the callers needs to start writeback for the extent buffer (submit a bio, etc), 0 to tell the caller everything went fine but it does not need to start writeback for the extent buffer, and a negative value if some error happened. When it's about to return 1 it tries to lock all pages, and if a try lock on a page fails, and we didn't flush any existing bio in our "epd", it calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or an error. The page might have been locked elsewhere, not with the goal of starting writeback of the extent buffer, and even by some code other than btrfs, like page migration for example, so it does not mean the writeback of the extent buffer was already started by some other task, so returning a 0 tells the caller (btree_write_cache_pages()) to not start writeback for the extent buffer. Note that epd might currently have either no bio, so flush_write_bio() returns 0 (success) or it might have a bio for another extent buffer with a lower index (logical address). Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the extent buffer and writeback is never started for the extent buffer, future attempts to writeback the extent buffer will hang forever waiting on that bit to be cleared, since it can only be cleared after writeback completes. Such hang is reported with a trace like the following: [49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds. [49887.347059] Not tainted 5.2.13-gentoo #2 [49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [49887.347062] btrfs-transacti D 0 1752 2 0x80004000 [49887.347064] Call Trace: [49887.347069] ? __schedule+0x265/0x830 [49887.347071] ? bit_wait+0x50/0x50 [49887.347072] ? bit_wait+0x50/0x50 [49887.347074] schedule+0x24/0x90 [49887.347075] io_schedule+0x3c/0x60 [49887.347077] bit_wait_io+0x8/0x50 [49887.347079] __wait_on_bit+0x6c/0x80 [49887.347081] ? __lock_release.isra.29+0x155/0x2d0 [49887.347083] out_of_line_wait_on_bit+0x7b/0x80 [49887.347084] ? var_wake_function+0x20/0x20 [49887.347087] lock_extent_buffer_for_io+0x28c/0x390 [49887.347089] btree_write_cache_pages+0x18e/0x340 [49887.347091] do_writepages+0x29/0xb0 [49887.347093] ? kmem_cache_free+0x132/0x160 [49887.347095] ? convert_extent_bit+0x544/0x680 [49887.347097] filemap_fdatawrite_range+0x70/0x90 [49887.347099] btrfs_write_marked_extents+0x53/0x120 [49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0 [49887.347102] btrfs_commit_transaction+0x6bb/0x990 [49887.347103] ? start_transaction+0x33e/0x500 [49887.347105] transaction_kthread+0x139/0x15c So fix this by not overwriting the return value (ret) with the result from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK bit in case flush_write_bio() returns an error, otherwise it will hang any future attempts to writeback the extent buffer, and undo all work done before (set back EXTENT_BUFFER_DIRTY, etc). This is a regression introduced in the 5.2 kernel. Fixes: 2e3c25136adfb ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()") Fixes: f4340622e0226 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up") Reported-by: Zdenek Sojka <zsojka@seznam.cz> Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t Reported-by: Drazen Kacar <drazen.kacar@oradian.com> Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-12Btrfs: fix assertion failure during fsync and use of stale transactionFilipe Manana1-8/+8
Sometimes when fsync'ing a file we need to log that other inodes exist and when we need to do that we acquire a reference on the inodes and then drop that reference using iput() after logging them. That generally is not a problem except if we end up doing the final iput() (dropping the last reference) on the inode and that inode has a link count of 0, which can happen in a very short time window if the logging path gets a reference on the inode while it's being unlinked. In that case we end up getting the eviction callback, btrfs_evict_inode(), invoked through the iput() call chain which needs to drop all of the inode's items from its subvolume btree, and in order to do that, it needs to join a transaction at the helper function evict_refill_and_join(). However because the task previously started a transaction at the fsync handler, btrfs_sync_file(), it has current->journal_info already pointing to a transaction handle and therefore evict_refill_and_join() will get that transaction handle from btrfs_join_transaction(). From this point on, two different problems can happen: 1) evict_refill_and_join() will often change the transaction handle's block reserve (->block_rsv) and set its ->bytes_reserved field to a value greater than 0. If evict_refill_and_join() never commits the transaction, the eviction handler ends up decreasing the reference count (->use_count) of the transaction handle through the call to btrfs_end_transaction(), and after that point we have a transaction handle with a NULL ->block_rsv (which is the value prior to the transaction join from evict_refill_and_join()) and a ->bytes_reserved value greater than 0. If after the eviction/iput completes the inode logging path hits an error or it decides that it must fallback to a transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a non-zero value from btrfs_log_dentry_safe(), and because of that non-zero value it tries to commit the transaction using a handle with a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes the transaction commit hit an assertion failure at btrfs_trans_release_metadata() because ->bytes_reserved is not zero but the ->block_rsv is NULL. The produced stack trace for that is like the following: [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 [192922.917553] ------------[ cut here ]------------ [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] (...) [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [192922.925105] Call Trace: [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] [192922.926731] do_fsync+0x38/0x60 [192922.927138] __x64_sys_fdatasync+0x13/0x20 [192922.927543] do_syscall_64+0x60/0x1c0 [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [192922.934077] ---[ end trace f00808b12068168f ]--- 2) If evict_refill_and_join() decides to commit the transaction, it will be able to do it, since the nested transaction join only increments the transaction handle's ->use_count reference counter and it does not prevent the transaction from getting committed. This means that after eviction completes, the fsync logging path will be using a transaction handle that refers to an already committed transaction. What happens when using such a stale transaction can be unpredictable, we are at least having a use-after-free on the transaction handle itself, since the transaction commit will call kmem_cache_free() against the handle regardless of its ->use_count value, or we can end up silently losing all the updates to the log tree after that iput() in the logging path, or using a transaction handle that in the meanwhile was allocated to another task for a new transaction, etc, pretty much unpredictable what can happen. In order to fix both of them, instead of using iput() during logging, use btrfs_add_delayed_iput(), so that the logging path of fsync never drops the last reference on an inode, that step is offloaded to a safe context (usually the cleaner kthread). The assertion failure issue was sporadically triggered by the test case generic/475 from fstests, which loads the dm error target while fsstress is running, which lead to fsync failing while logging inodes with -EIO errors and then trying later to commit the transaction, triggering the assertion failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-06Merge tag 'configfs-for-5.3' of git://git.infradead.org/users/hch/configfsLinus Torvalds3-175/+257
Pull configfs fixes from Christoph Hellwig: "Late configfs fixes from Al that fix pretty nasty removal vs attribute access races" * tag 'configfs-for-5.3' of git://git.infradead.org/users/hch/configfs: configfs: provide exclusion between IO and removals configfs: new object reprsenting tree fragments configfs_register_group() shouldn't be (and isn't) called in rmdirable parts configfs: stash the data we need into configfs_buffer at open time
2019-09-04configfs: provide exclusion between IO and removalsAl Viro2-18/+80
Make sure that attribute methods are not called after the item has been removed from the tree. To do so, we * at the point of no return in removals, grab ->frag_sem exclusive and mark the fragment dead. * call the methods of attributes with ->frag_sem taken shared and only after having verified that the fragment is still alive. The main benefit is for method instances - they are guaranteed that the objects they are accessing *and* all ancestors are still there. Another win is that we don't need to bother with extra refcount on config_item when opening a file - the item will be alive for as long as it stays in the tree, and we won't touch it/attributes/any associated data after it's been removed from the tree. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-02configfs: new object reprsenting tree fragmentsAl Viro3-27/+97
Refcounted, hangs of configfs_dirent, created by operations that add fragments to configfs tree (mkdir and configfs_register_{subsystem,group}). Will be used in the next commit to provide exclusion between fragment removal and ->show/->store calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-02configfs_register_group() shouldn't be (and isn't) called in rmdirable partsAl Viro1-11/+0
revert cc57c07343bd "configfs: fix registered group removal" It was an attempt to handle something that fundamentally doesn't work - configfs_register_group() should never be done in a part of tree that can be rmdir'ed. And in mainline it never had been, so let's not borrow trouble; the fix was racy anyway, it would take a lot more to make that work and desired semantics is not clear. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-02configfs: stash the data we need into configfs_buffer at open timeAl Viro1-134/+95
simplifies the ->read()/->write()/->release() instances nicely Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-02NFS: Fix inode fileid checks in attribute revalidation codeTrond Myklebust1-8/+10
We want to throw out the attrbute if it refers to the mounted on fileid, and not the real fileid. However we do not want to block cache consistency updates from NFSv4 writes. Reported-by: Murphy Zhou <jencce.kernel@gmail.com> Fixes: 7e10cc25bfa0 ("NFS: Don't refresh attributes with mounted-on-file...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-30Merge tag '5.3-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds7-147/+135
Pull cifs fixes from Steve French: "A few small SMB3 fixes, and a larger one to fix various older string handling functions" * tag '5.3-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal module number cifs: replace various strncpy with strscpy and similar cifs: Use kzfree() to zero out the password cifs: set domainName when a domain-key is used in multiuser
2019-08-28cifs: update internal module numberSteve French1-1/+1
To 2.22 Signed-off-by: Steve French <stfrench@microsoft.com>
2019-08-28cifs: replace various strncpy with strscpy and similarRonnie Sahlberg6-146/+112
Using strscpy is cleaner, and avoids some problems with handling maximum length strings. Linus noticed the original problem and Aurelien pointed out some additional problems. Fortunately most of this is SMB1 code (and in particular the ASCII string handling older, which is less common). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-08-28cifs: Use kzfree() to zero out the passwordDan Carpenter1-1/+1
It's safer to zero out the password so that it can never be disclosed. Fixes: 0c219f5799c7 ("cifs: set domainName when a domain-key is used in multiuser") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-08-28cifs: set domainName when a domain-key is used in multiuserRonnie Sahlberg1-0/+22
RHBZ: 1710429 When we use a domain-key to authenticate using multiuser we must also set the domainnmame for the new volume as it will be used and passed to the server in the NTLMSSP Domain-name. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-08-27Merge tag 'nfs-for-5.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds11-95/+131
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable fixes: - Fix a page lock leak in nfs_pageio_resend() - Ensure O_DIRECT reports an error if the bytes read/written is 0 - Don't handle errors if the bind/connect succeeded - Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidat ed" Bugfixes: - Don't refresh attributes with mounted-on-file information - Fix return values for nfs4_file_open() and nfs_finish_open() - Fix pnfs layoutstats reporting of I/O errors - Don't use soft RPC calls for pNFS/flexfiles I/O, and don't abort for soft I/O errors when the user specifies a hard mount. - Various fixes to the error handling in sunrpc - Don't report writepage()/writepages() errors twice" * tag 'nfs-for-5.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: remove set but not used variable 'mapping' NFSv2: Fix write regression NFSv2: Fix eof handling NFS: Fix writepage(s) error handling to not report errors twice NFS: Fix spurious EIO read errors pNFS/flexfiles: Don't time out requests on hard mounts SUNRPC: Handle connection breakages correctly in call_status() Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidated" SUNRPC: Handle EADDRINUSE and ENOBUFS correctly pNFS/flexfiles: Turn off soft RPC calls SUNRPC: Don't handle errors if the bind/connect succeeded NFS: On fatal writeback errors, we need to call nfs_inode_remove_request() NFS: Fix initialisation of I/O result struct in nfs_pgio_rpcsetup NFS: Ensure O_DIRECT reports an error if the bytes read/written is 0 NFSv4/pnfs: Fix a page lock leak in nfs_pageio_resend() NFSv4: Fix return value in nfs_finish_open() NFSv4: Fix return values for nfs4_file_open() NFS: Don't refresh attributes with mounted-on-file information
2019-08-27NFS: remove set but not used variable 'mapping'YueHaibing1-2/+0
Fixes gcc '-Wunused-but-set-variable' warning: fs/nfs/write.c: In function nfs_page_async_flush: fs/nfs/write.c:609:24: warning: variable mapping set but not used [-Wunused-but-set-variable] It is not use since commit aefb623c422e ("NFS: Fix writepage(s) error handling to not report errors twice") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-27NFSv2: Fix write regressionTrond Myklebust1-1/+3
Ensure we update the write result count on success, since the RPC call itself does not do so. Reported-by: Jan Stancek <jstancek@redhat.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Tested-by: Jan Stancek <jstancek@redhat.com>
2019-08-27NFSv2: Fix eof handlingTrond Myklebust1-1/+2
If we received a reply from the server with a zero length read and no error, then that implies we are at eof. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-26NFS: Fix writepage(s) error handling to not report errors twiceTrond Myklebust1-8/+13
If writepage()/writepages() saw an error, but handled it without reporting it, we should not be re-reporting that error on exit. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-26NFS: Fix spurious EIO read errorsTrond Myklebust3-21/+36
If the client attempts to read a page, but the read fails due to some spurious error (e.g. an ACCESS error or a timeout, ...) then we need to allow other processes to retry. Also try to report errors correctly when doing a synchronous readpage. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-26pNFS/flexfiles: Don't time out requests on hard mountsTrond Myklebust1-2/+9
If the mount is hard, we should ignore the 'io_maxretrans' module parameter so that we always keep retrying. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-26Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidated"Trond Myklebust1-17/+0
This reverts commit a79f194aa4879e9baad118c3f8bb2ca24dbef765. The mechanism for aborting I/O is racy, since we are not guaranteed that the request is asleep while we're changing both task->tk_status and task->tk_action. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v5.1
2019-08-26pNFS/flexfiles: Turn off soft RPC callsTrond Myklebust1-5/+10
The pNFS/flexfiles I/O requests are sent with the SOFTCONN flag set, so they automatically time out if the connection breaks. It should therefore not be necessary to have the soft flag set in addition. Fixes: 5f01d9539496 ("nfs41: create NFSv3 DS connection if specified") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-25Merge tag 'for-linus-5.3-rc6' of ↵Linus Torvalds3-3/+5
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull UBIFS and JFFS2 fixes from Richard Weinberger: "UBIFS: - Don't block too long in writeback_inodes_sb() - Fix for a possible overrun of the log head - Fix double unlock in orphan_delete() JFFS2: - Remove C++ style from UAPI header and unbreak picky toolchains" * tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubifs: Limit the number of pages in shrink_liability ubifs: Correctly initialize c->min_log_bytes ubifs: Fix double unlock around orphan_delete() jffs2: Remove C++ style comments from uapi header
2019-08-25userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctxOleg Nesterov1-12/+13
userfaultfd_release() should clear vm_flags/vm_userfaultfd_ctx even if mm->core_state != NULL. Otherwise a page fault can see userfaultfd_missing() == T and use an already freed userfaultfd_ctx. Link: http://lkml.kernel.org/r/20190820160237.GB4983@redhat.com Fixes: 04f5866e41fb ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Peter Xu <peterx@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24Merge tag 'xfs-5.3-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-0/+1
Pull xfs fix from Darrick Wong: "A single patch that fixes a xfs lockup problem when a chown/chgrp operation fails due to running out of quota. It has survived the usual xfstests runs and merges cleanly with this morning's master: - Fix a forgotten inode unlock when chown/chgrp fail due to quota" * tag 'xfs-5.3-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOT
2019-08-24Merge tag 'for-linus-20190823' of git://git.kernel.dk/linux-blockLinus Torvalds1-19/+47
Pull block fixes from Jens Axboe: "Here's a set of fixes that should go into this release. This contains: - Three minor fixes for NVMe. - Three minor tweaks for the io_uring polling logic. - Officially mark Song as the MD maintainer, after he's been filling that role sucessfully for the last 6 months or so" * tag 'for-linus-20190823' of git://git.kernel.dk/linux-block: io_uring: add need_resched() check in inner poll loop md: update MAINTAINERS info io_uring: don't enter poll loop if we have CQEs pending nvme: Add quirk for LiteON CL1 devices running FW 22301111 nvme: Fix cntlid validation when not using NVMEoF nvme-multipath: fix possible I/O hang when paths are updated io_uring: fix potential hang with polled IO
2019-08-23Merge tag 'xfs-5.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds4-88/+82
Pull xfs fixes from Darrick Wong: "Here are a few more bug fixes that trickled in since the last pull. They've survived the usual xfstests runs and merge cleanly with this morning's master. I expect there to be one more pull request tomorrow for the fix to that quota related inode unlock bug that we were reviewing last night, but it will continue to soak in the testing machine for several more hours. - Fix missing compat ioctl handling for get/setlabel - Fix missing ioctl pointer sanitization on s390 - Fix a page locking deadlock in the dedupe comparison code - Fix inadequate locking in reflink code w.r.t. concurrent directio - Fix broken error detection when breaking layouts" * tag 'xfs-5.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: fs/xfs: Fix return code of xfs_break_leased_layouts() xfs: fix reflink source file racing with directio writes vfs: fix page locking deadlocks when deduping files xfs: compat_ioctl: use compat_ptr() xfs: fall back to native ioctls for unhandled compat ones
2019-08-23Merge tag 'ceph-for-5.3-rc6' of git://github.com/ceph/ceph-clientLinus Torvalds7-15/+30
Pull ceph fixes from Ilya Dryomov: "Three important fixes tagged for stable (an indefinite hang, a crash on an assert and a NULL pointer dereference) plus a small series from Luis fixing instances of vfree() under spinlock" * tag 'ceph-for-5.3-rc6' of git://github.com/ceph/ceph-client: libceph: fix PG split vs OSD (re)connect race ceph: don't try fill file_lock on unsuccessful GETFILELOCK reply ceph: clear page dirty before invalidate page ceph: fix buffer free while holding i_ceph_lock in fill_inode() ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob() ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr() libceph: allow ceph_buffer_put() to receive a NULL ceph_buffer