summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2020-10-12ceph: add a note explaining session reject error stringIlya Dryomov1-0/+4
error_string key in the metadata map of MClientSession message is intended for humans, but unfortunately became part of the on-wire format with the introduction of recover_session=clean mode in commit 131d7eb4faa1 ("ceph: auto reconnect after blacklisted"). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12libceph, rbd, ceph: "blacklist" -> "blocklist"Ilya Dryomov5-26/+26
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12ceph: have ceph_writepages_start call pagevec_lookup_range_tagJeff Layton1-3/+2
Currently it calls pagevec_lookup_range_nr_tag(), but that may be inefficient, as we might end up having to search several times as we get down to looking for fewer pages to fill the array. Thus spake Willy: "I think ceph is misusing pagevec_lookup_range_nr_tag(). Let's suppose you get a range which is AAAAbbbbAAAAbbbbAAAAbbbbbbbb(...)bbbbAAAA and you try to fetch max_pages=13. First loop will get AAAAbbbbAAAAb and have 8 locked_pages. The next call will get bbbAA and now locked_pages=10. Next call gets AAb ... and now you're iterating your way through all the 'b' one page at a time until you find that first A." 'A' here refers to pages that are eligible for writeback and 'b' represents ones that aren't (for whatever reason). Not capping the number of return pages may mean that we sometimes find more pages than are needed, but the extra references will just get put at the end. Ceph is also the only caller of pagevec_lookup_range_nr_tag(), so this change should allow us to eliminate that call as well. That will be done in a follow-on patch. Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12ceph: use kill_anon_super helperJeff Layton1-3/+1
ceph open-codes this around some other activity and the rationale for it isn't clear. There is no need to delay free_anon_bdev until the end of kill_sb. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12ceph: metrics for opened files, pinned caps and opened inodesXiubo Li5-3/+74
In client for each inode, it may have many opened files and may have been pinned in more than one MDS servers. And some inodes are idle, which have no any opened files. This patch will show these metrics in the debugfs, likes: item total ----------------------------------------- opened files / total inodes 14 / 5 pinned i_caps / total inodes 7 / 5 opened inodes / total inodes 3 / 5 Will send these metrics to ceph, which will be used by the `fs top`, later. [ jlayton: drop unrelated hunk, count hashed inodes instead of allocated ones ] URL: https://tracker.ceph.com/issues/47005 Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12ceph: add ceph_sb_to_mdsc helper support to parse the mdscXiubo Li8-30/+26
This will help simplify the code. [ jlayton: fix minor merge conflict in quota.c ] Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12ceph: drop special-casing for ITER_PIPE in ceph_sync_readJeff Layton1-47/+24
This special casing was added in 7ce469a53e71 (ceph: fix splice read for no Fc capability case). The confirm callback for ITER_PIPE expects that the page is Uptodate and returns an error otherwise. A simpler workaround is just to use the Uptodate bit, which has no meaning for anonymous pages. Rip out the special casing for ITER_PIPE and just SetPageUptodate before we copy to the iter. Cc: John Hubbard <jhubbard@nvidia.com> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12ceph: add column 'mds' to show caps in more user friendlyYanhu Cao1-3/+4
In multi-mds, the 'caps' debugfs file will have duplicate ino, add the 'mds' column to indicate which mds session the cap belongs to. Signed-off-by: Yanhu Cao <gmayyyha@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12ceph: remove unnecessary return in switch statementLuis Henriques1-2/+0
Since there's a return immediately after the 'break', there's no need for this extra 'return' in the S_IFDIR case. Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12ceph: encode inodes' parent/d_name in cap reconnect messageYan, Zheng1-28/+61
Since nautilus, MDS tracks dirfrags whose child inodes have caps in open file table. When MDS recovers, it prefetches all of these dirfrags. This avoids using backtrace to load inodes. But dirfrags prefetch may load lots of useless inodes into cache, and make MDS run out of memory. Recent MDS adds an option that disables dirfrags prefetch. When dirfrags prefetch is disabled. Recovering MDS only prefetches corresponding dir inodes. Including inodes' parent/d_name in cap reconnect message can help MDS to load inodes into its cache. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12fuse: connection remove fixMiklos Szeredi1-0/+7
Re-add lost removal of fc from fuse_conn_list and the control filesystem. Reported-by: kernel test robot <rong.a.chen@intel.com> Fixes: fcee216beb9c ("fuse: split fuse_mount off of fuse_conn") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-12cifs: compute full_path already in cifs_readdir()Ronnie Sahlberg1-14/+16
Cleanup patch for followon to cache additional information for the root directory when directory lease held. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-12cifs: return cached_fid from open_shrootRonnie Sahlberg3-13/+23
Cleanup patch for followon to cache additional information for the root directory when directory lease held. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-12update structure definitions from updated protocol documentationSteve French1-7/+57
MS-SMB2 was updated recently to include new protocol definitions for updated compression payload header and new RDMA transform capabilities Update structure definitions in smb2pdu.h to match Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-10-12smb3: add defines for new crypto algorithmsSteve French1-0/+2
In encryption capabilities negotiate context can now request AES256 GCM or CCM Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-10-12Convert trailing spaces and periods in path componentsBoris Protopopov1-1/+7
When converting trailing spaces and periods in paths, do so for every component of the path, not just the last component. If the conversion is not done for every path component, then subsequent operations in directories with trailing spaces or periods (e.g. create(), mkdir()) will fail with ENOENT. This is because on the server, the directory will have a special symbol in its name, and the client needs to provide the same. Signed-off-by: Boris Protopopov <pboris@amazon.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-11ubifs: mount_ubifs: Release authentication resource in error handling pathZhihao Cheng1-4/+6
Release the authentication related resource in some error handling branches in mount_ubifs(). Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Cc: <stable@vger.kernel.org> # 4.20+ Fixes: d8a22773a12c6d7 ("ubifs: Enable authentication support") Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-10-11ubifs: Don't parse authentication mount options in remount processZhihao Cheng1-6/+12
There is no need to dump authentication options while remounting, because authentication initialization can only be doing once in the first mount process. Dumping authentication mount options in remount process may cause memory leak if UBIFS has already been mounted with old authentication mount options. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Cc: <stable@vger.kernel.org> # 4.20+ Fixes: d8a22773a12c6d7 ("ubifs: Enable authentication support") Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-10-11ubifs: Fix a memleak after dumping authentication mount optionsZhihao Cheng1-2/+14
Fix a memory leak after dumping authentication mount options in error handling branch. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Cc: <stable@vger.kernel.org> # 4.20+ Fixes: d8a22773a12c6d7 ("ubifs: Enable authentication support") Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-10-11Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-6/+5
Pull vfs fix from Al Viro: "Fixes an obvious bug (memory leak introduced in 5.8)" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: pipe: Fix memory leaks in create_pipe_files()
2020-10-11cifs: Fix incomplete memory allocation on setxattr pathVladimir Zapolskiy1-1/+1
On setxattr() syscall path due to an apprent typo the size of a dynamically allocated memory chunk for storing struct smb2_file_full_ea_info object is computed incorrectly, to be more precise the first addend is the size of a pointer instead of the wanted object size. Coincidentally it makes no difference on 64-bit platforms, however on 32-bit targets the following memcpy() writes 4 bytes of data outside of the dynamically allocated memory. ============================================================================= BUG kmalloc-16 (Not tainted): Redzone overwritten ----------------------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: 0x79e69a6f-0x9e5cdecf @offset=368. First byte 0x73 instead of 0xcc INFO: Slab 0xd36d2454 objects=85 used=51 fp=0xf7d0fc7a flags=0x35000201 INFO: Object 0x6f171df3 @offset=352 fp=0x00000000 Redzone 5d4ff02d: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ................ Object 6f171df3: 00 00 00 00 00 05 06 00 73 6e 72 75 62 00 66 69 ........snrub.fi Redzone 79e69a6f: 73 68 32 0a sh2. Padding 56254d82: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ CPU: 0 PID: 8196 Comm: attr Tainted: G B 5.9.0-rc8+ #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 Call Trace: dump_stack+0x54/0x6e print_trailer+0x12c/0x134 check_bytes_and_report.cold+0x3e/0x69 check_object+0x18c/0x250 free_debug_processing+0xfe/0x230 __slab_free+0x1c0/0x300 kfree+0x1d3/0x220 smb2_set_ea+0x27d/0x540 cifs_xattr_set+0x57f/0x620 __vfs_setxattr+0x4e/0x60 __vfs_setxattr_noperm+0x4e/0x100 __vfs_setxattr_locked+0xae/0xd0 vfs_setxattr+0x4e/0xe0 setxattr+0x12c/0x1a0 path_setxattr+0xa4/0xc0 __ia32_sys_lsetxattr+0x1d/0x20 __do_fast_syscall_32+0x40/0x70 do_fast_syscall_32+0x29/0x60 do_SYSENTER_32+0x15/0x20 entry_SYSENTER_32+0x9f/0xf2 Fixes: 5517554e4313 ("cifs: Add support for writing attributes on SMB2+") Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-10io_uring: keep a pointer ref_node in file_dataPavel Begunkov1-8/+6
->cur_refs of struct fixed_file_data always points to percpu_ref embedded into struct fixed_file_ref_node. Don't overuse container_of() and offsetting, and point directly to fixed_file_ref_node. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: refactor *files_register()'s error pathsPavel Begunkov1-46/+32
Don't keep repeating cleaning sequences in error paths, write it once in the and use labels. It's less error prone and looks cleaner. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: clean file_data access in files_registerPavel Begunkov1-36/+33
Keep file_data in a local var and replace with it complex references such as ctx->file_data. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: don't delay io_init_req() error checkPavel Begunkov1-2/+1
Don't postpone io_init_req() error checks and do that right after calling it. There is no control-flow statements or dependencies with sqe/submitted accounting, so do those earlier, that makes the code flow a bit more natural. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: clean leftovers after splitting issuePavel Begunkov1-8/+6
Kill extra if in io_issue_sqe() and place send/recv[msg] calls appropriately under switch's cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: remove timeout.list after hrtimer cancelPavel Begunkov1-9/+2
Remove timeouts from ctx->timeout_list after hrtimer_try_to_cancel() successfully cancels it. With this we don't need to care whether there was a race and it was removed in io_timeout_fn(), and that will be handy for following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: use a separate struct for timeout_removePavel Begunkov1-9/+9
Don't use struct io_timeout for both IORING_OP_TIMEOUT and IORING_OP_TIMEOUT_REMOVE, they're quite different. Split them in two, that allows to remove an unused field in struct io_timeout, and btw kill ->flags not used by either. This also easier to follow, especially for timeout remove. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: improve submit_state.ios_left accountingPavel Begunkov1-5/+5
state->ios_left isn't decremented for requests that don't need a file, so it might be larger than number of SQEs left. That in some circumstances makes us to grab more files that is needed so imposing extra put. Deaccount one ios_left for each request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: simplify io_file_get()Pavel Begunkov1-16/+14
Keep ->needs_file_no_error check out of io_file_get(), and let callers handle it. It makes it more straightforward. Also, as the only error it can hand back -EBADF, make it return a file or NULL. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: kill extra check in fixed io_file_get()Pavel Begunkov1-2/+1
ctx->nr_user_files == 0 IFF ctx->file_data == NULL and there fixed files are not used. Hence, verifying fds only against ctx->nr_user_files is enough. Remove the other check from hot path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: clean up ->files grabbingPavel Begunkov1-39/+13
Move work.files grabbing into io_prep_async_work() to all other work resources initialisation. We don't need to keep it separately now, as ->ring_fd/file are gone. It also allows to not grab it when a request is not going to io-wq. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: don't io_prep_async_work() linked reqsPavel Begunkov1-3/+0
There is no real reason left for preparing io-wq work context for linked requests in advance, remove it as this might become a bottleneck in some cases. Reported-by: Roman Gershman <romger@amazon.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09f2fs: fix to set SBI_NEED_FSCK flag for inconsistent inodeChao Yu1-0/+3
If compressed inode has inconsistent fields on i_compress_algorithm, i_compr_blocks and i_log_cluster_size, we missed to set SBI_NEED_FSCK to notice fsck to repair the inode, fix it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-09io_uring: Convert advanced XArray uses to the normal APIMatthew Wilcox (Oracle)1-12/+2
There are no bugs here that I've spotted, it's just easier to use the normal API and there are no performance advantages to using the more verbose advanced API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09io_uring: Fix XArray usage in io_uring_add_task_fileMatthew Wilcox (Oracle)1-12/+9
The xas_store() wasn't paired with an xas_nomem() loop, so if it couldn't allocate memory using GFP_NOWAIT, it would leak the reference to the file descriptor. Also the node pointed to by the xas could be freed between the call to xas_load() under the rcu_read_lock() and the acquisition of the xa_lock. It's easier to just use the normal xa_load/xa_store interface here. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> [axboe: fix missing assign after alloc, cur_uring -> tctx rename] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09io_uring: Fix use of XArray in __io_uring_files_cancelMatthew Wilcox (Oracle)1-14/+5
We have to drop the lock during each iteration, so there's no advantage to using the advanced API. Convert this to a standard xa_for_each() loop. Reported-by: syzbot+27c12725d8ff0bfe1a13@syzkaller.appspotmail.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09fuse: implement crossmountsMax Reitz4-3/+105
FUSE servers can indicate crossmount points by setting FUSE_ATTR_SUBMOUNT in fuse_attr.flags. The inode will then be marked as S_AUTOMOUNT, and the .d_automount implementation creates a new submount at that location, so that the submount gets a distinct st_dev value. Note that all submounts get a distinct superblock and a distinct st_dev value, so for virtio-fs, even if the same filesystem is mounted more than once on the host, none of its mount points will have the same st_dev. We need distinct superblocks because the superblock points to the root node, but the different host mounts may show different trees (e.g. due to submounts in some of them, but not in others). Right now, this behavior is only enabled when fuse_conn.auto_submounts is set, which is the case only for virtio-fs. Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-09NFSv4: Use the net namespace uniquifier if it is setTrond Myklebust1-4/+17
If a container sets a net namespace specific uniquifier, then use that in the setclientid/exchangeid process. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-09NFSv4: Clean up initialisation of uniquified client id stringsTrond Myklebust1-41/+34
When the user sets a uniquifier, then ensure we copy the string so that calls to strlen() etc are atomic with calls to snprintf(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-09f2fs: reject CASEFOLD inode flag without casefold featureEric Biggers1-0/+7
syzbot reported: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 6860 Comm: syz-executor835 Not tainted 5.9.0-rc8-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:utf8_casefold+0x43/0x1b0 fs/unicode/utf8-core.c:107 [...] Call Trace: f2fs_init_casefolded_name fs/f2fs/dir.c:85 [inline] __f2fs_setup_filename fs/f2fs/dir.c:118 [inline] f2fs_prepare_lookup+0x3bf/0x640 fs/f2fs/dir.c:163 f2fs_lookup+0x10d/0x920 fs/f2fs/namei.c:494 __lookup_hash+0x115/0x240 fs/namei.c:1445 filename_create+0x14b/0x630 fs/namei.c:3467 user_path_create fs/namei.c:3524 [inline] do_mkdirat+0x56/0x310 fs/namei.c:3664 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 [...] The problem is that an inode has F2FS_CASEFOLD_FL set, but the filesystem doesn't have the casefold feature flag set, and therefore super_block::s_encoding is NULL. Fix this by making sanity_check_inode() reject inodes that have F2FS_CASEFOLD_FL when the filesystem doesn't have the casefold feature. Reported-by: syzbot+05139c4039d0679e19ff@syzkaller.appspotmail.com Fixes: 2c2eb7a300cd ("f2fs: Support case-insensitive file name lookups") Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-09f2fs: fix memory alignment to support 32bitJaegeuk Kim1-1/+1
In 32bit system, 64-bits key breaks memory alignment. This fixes the commit "f2fs: support 64-bits key in f2fs rb-tree node entry". Reported-by: Nicolas Chauvet <kwizart@gmail.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-09io_uring: fix break condition for __io_uring_register() waitingJens Axboe1-3/+3
Colin reports that there's unreachable code, since we only ever break if ret == 0. This is correct, and is due to a reversed logic condition in when to break or not. Break out of the loop if we don't process any task work, in that case we do want to return -EINTR. Fixes: af9c1a44f8de ("io_uring: process task work in io_uring_register()") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09erofs: remove unnecessary enum entriesChengguang Xu1-2/+0
Opt_nouser_xattr and Opt_noacl are useless, so just remove them. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Link: https://lore.kernel.org/r/20201005071550.66193-1-cgxu519@mykernel.net Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-10-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski9-31/+82
Small conflict around locking in rxrpc_process_event() - channel_lock moved to bundle in next, while state lock needs _bh() from net. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-08Merge tag 'exfat-for-5.9-rc9' of ↵Linus Torvalds5-22/+12
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fixes from Namjae Jeon: - Fix use of uninitialized spinlock on error path - Fix missing err assignment in exfat_build_inode() * tag 'exfat-for-5.9-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: fix use of uninitialized spinlock on error path exfat: fix pointer error checking
2020-10-08afs: Fix deadlock between writeback and truncateDavid Howells3-9/+50
The afs filesystem has a lock[*] that it uses to serialise I/O operations going to the server (vnode->io_lock), as the server will only perform one modification operation at a time on any given file or directory. This prevents the the filesystem from filling up all the call slots to a server with calls that aren't going to be executed in parallel anyway, thereby allowing operations on other files to obtain slots. [*] Note that is probably redundant for directories at least since i_rwsem is used to serialise directory modifications and lookup/reading vs modification. The server does allow parallel non-modification ops, however. When a file truncation op completes, we truncate the in-memory copy of the file to match - but we do it whilst still holding the io_lock, the idea being to prevent races with other operations. However, if writeback starts in a worker thread simultaneously with truncation (whilst notify_change() is called with i_rwsem locked, writeback pays it no heed), it may manage to set PG_writeback bits on the pages that will get truncated before afs_setattr_success() manages to call truncate_pagecache(). Truncate will then wait for those pages - whilst still inside io_lock: # cat /proc/8837/stack [<0>] wait_on_page_bit_common+0x184/0x1e7 [<0>] truncate_inode_pages_range+0x37f/0x3eb [<0>] truncate_pagecache+0x3c/0x53 [<0>] afs_setattr_success+0x4d/0x6e [<0>] afs_wait_for_operation+0xd8/0x169 [<0>] afs_do_sync_operation+0x16/0x1f [<0>] afs_setattr+0x1fb/0x25d [<0>] notify_change+0x2cf/0x3c4 [<0>] do_truncate+0x7f/0xb2 [<0>] do_sys_ftruncate+0xd1/0x104 [<0>] do_syscall_64+0x2d/0x3a [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The writeback operation, however, stalls indefinitely because it needs to get the io_lock to proceed: # cat /proc/5940/stack [<0>] afs_get_io_locks+0x58/0x1ae [<0>] afs_begin_vnode_operation+0xc7/0xd1 [<0>] afs_store_data+0x1b2/0x2a3 [<0>] afs_write_back_from_locked_page+0x418/0x57c [<0>] afs_writepages_region+0x196/0x224 [<0>] afs_writepages+0x74/0x156 [<0>] do_writepages+0x2d/0x56 [<0>] __writeback_single_inode+0x84/0x207 [<0>] writeback_sb_inodes+0x238/0x3cf [<0>] __writeback_inodes_wb+0x68/0x9f [<0>] wb_writeback+0x145/0x26c [<0>] wb_do_writeback+0x16a/0x194 [<0>] wb_workfn+0x74/0x177 [<0>] process_one_work+0x174/0x264 [<0>] worker_thread+0x117/0x1b9 [<0>] kthread+0xec/0xf1 [<0>] ret_from_fork+0x1f/0x30 and thus deadlock has occurred. Note that whilst afs_setattr() calls filemap_write_and_wait(), the fact that the caller is holding i_rwsem doesn't preclude more pages being dirtied through an mmap'd region. Fix this by: (1) Use the vnode validate_lock to mediate access between afs_setattr() and afs_writepages(): (a) Exclusively lock validate_lock in afs_setattr() around the whole RPC operation. (b) If WB_SYNC_ALL isn't set on entry to afs_writepages(), trying to shared-lock validate_lock and returning immediately if we couldn't get it. (c) If WB_SYNC_ALL is set, wait for the lock. The validate_lock is also used to validate a file and to zap its cache if the file was altered by a third party, so it's probably a good fit for this. (2) Move the truncation outside of the io_lock in setattr, using the same hook as is used for local directory editing. This requires the old i_size to be retained in the operation record as we commit the revised status to the inode members inside the io_lock still, but we still need to know if we reduced the file size. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-08direct-io: defer alignment check until after the EOF checkGabriel Krisman Bertazi1-8/+8
Prior to commit 9fe55eea7e4b ("Fix race when checking i_size on direct i/o read"), an unaligned direct read past end of file would trigger EOF, since generic_file_aio_read detected this read-at-EOF condition and skipped the direct IO read entirely, returning 0. After that change, the read now reaches dio_generic, which detects the misalignment and returns EINVAL. This consolidates the generic direct-io to follow the same behavior of filesystems. Apparently, this fix will only affect ocfs2 since other filesystems do this verification before calling do_blockdev_direct_IO, with the exception of f2fs, which has the same bug, but is fixed in the next patch. it can be verified by a read loop on a file that does a partial read before EOF (On file that doesn't end at an aligned address). The following code fails on an unaligned file on filesystems without prior validation without this patch, but not on btrfs, ext4, and xfs. while (done < total) { ssize_t delta = pread(fd, buf + done, total - done, off + done); if (!delta) break; ... } Fix this regression by moving the misalignment check to after the EOF check added by commit 74cedf9b6c60 ("direct-io: Fix negative return from dio read beyond eof"). Based on a patch by Jamie Liu. Link: https://lore.kernel.org/r/20201008062620.2928326-4-krisman@collabora.com Reported-by: Jamie Liu <jamieliu@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-10-08direct-io: don't force writeback for reads beyond EOFGabriel Krisman Bertazi1-13/+11
If a DIO read starts past EOF, the kernel won't attempt it, so we don't need to flush dirty pages before failing the syscall. Link: https://lore.kernel.org/r/20201008062620.2928326-3-krisman@collabora.com Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-10-08direct-io: clean up error paths of do_blockdev_direct_IOGabriel Krisman Bertazi1-21/+14
In preparation to resort DIO checks, reduce code duplication of error handling in do_blockdev_direct_IO. Link: https://lore.kernel.org/r/20201008062620.2928326-2-krisman@collabora.com Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Jan Kara <jack@suse.cz>