summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2021-02-01io_uring: generalize io_queue_rsrc_removalBijan Mottahedeh1-7/+15
Generalize io_queue_rsrc_removal to handle both files and buffers. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [remove io_mapped_ubuf from rsrc tables/etc. for now] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: rename file related variables to rsrcBijan Mottahedeh1-111/+117
This is a prep rename patch for subsequent patches to generalize file registration. [io_uring_rsrc_update:: rename fds -> data] Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [leave io_uring_files_update as struct] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: modularize io_sqe_buffers_registerBijan Mottahedeh1-17/+34
Move allocation of buffer management structures, and validation of buffers into separate routines. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: modularize io_sqe_buffer_registerBijan Mottahedeh1-103/+107
Split io_sqe_buffer_register into two routines: - io_sqe_buffer_register() registers a single buffer - io_sqe_buffers_register iterates over all user specified buffers Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: enable LOOKUP_CACHED path resolution for filename lookupsJens Axboe1-20/+27
Instead of being pessimistic and assume that path lookup will block, use LOOKUP_CACHED to attempt just a cached lookup. This ensures that the fast path is always done inline, and we only punt to async context if IO is needed to satisfy the lookup. For forced nonblock open attempts, mark the file O_NONBLOCK over the actual ->open() call as well. We can safely clear this again before doing fd_install(), so it'll never be user visible that we fiddled with it. This greatly improves the performance of file open where the dentry is already cached: ached 5.10-git 5.10-git+LOOKUP_CACHED Speedup --------------------------------------------------------------- 33% 1,014,975 900,474 1.1x 89% 545,466 292,937 1.9x 100% 435,636 151,475 2.9x The more cache hot we are, the faster the inline LOOKUP_CACHED optimization helps. This is unsurprising and expected, as a thread offload becomes a more dominant part of the total overhead. If we look at io_uring tracing, doing an IORING_OP_OPENAT on a file that isn't in the dentry cache will yield: 275.550481: io_uring_create: ring 00000000ddda6278, fd 3 sq size 8, cq size 16, flags 0 275.550491: io_uring_submit_sqe: ring 00000000ddda6278, op 18, data 0x0, non block 1, sq_thread 0 275.550498: io_uring_queue_async_work: ring 00000000ddda6278, request 00000000c0267d17, flags 69760, normal queue, work 000000003d683991 275.550502: io_uring_cqring_wait: ring 00000000ddda6278, min_events 1 275.550556: io_uring_complete: ring 00000000ddda6278, user_data 0x0, result 4 which shows a failed nonblock lookup, then punt to worker, and then we complete with fd == 4. This takes 65 usec in total. Re-running the same test case again: 281.253956: io_uring_create: ring 0000000008207252, fd 3 sq size 8, cq size 16, flags 0 281.253967: io_uring_submit_sqe: ring 0000000008207252, op 18, data 0x0, non block 1, sq_thread 0 281.253973: io_uring_complete: ring 0000000008207252, user_data 0x0, result 4 shows the same request completing inline, also returning fd == 4. This takes 6 usec. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01Merge branch 'work.namei' of ↵Jens Axboe2-45/+49
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-5.12/io_uring Merge RESOLVE_CACHED bits from Al, as the io_uring changes will build on top of that. * 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED fs: add support for LOOKUP_CACHED saner calling conventions for unlazy_child() fs: make unlazy_walk() error handling consistent fs/namei.c: Remove unlikely of status being -ECHILD in lookup_fast() do_tmpfile(): don't mess with finish_open()
2021-02-01fs/nfs: remove duplicate includeMenglong Dong1-4/+0
'nfs42.h' is already included above and can be removed here. Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-01-31Merge tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-25/+44
Pull NFS client fixes from Trond Myklebust: - SUNRPC: Handle 0 length opaque XDR object data properly - Fix a layout segment leak in pnfs_layout_process() - pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn - pNFS/NFSv4: Improve rejection of out-of-order layouts - pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process() * tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: Handle 0 length opaque XDR object data properly SUNRPC: Move simple_get_bytes and simple_get_netobj into private header pNFS/NFSv4: Improve rejection of out-of-order layouts pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process() pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
2021-01-31Merge tag '5.11-rc5-smb3' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds6-20/+81
Pull cifs fixes from Steve French: "Four cifs patches found in additional testing of the conversion to the new mount API: three small option processing ones, and one fixing domain based DFS referrals" * tag '5.11-rc5-smb3' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix dfs domain referrals cifs: returning mount parm processing errors correctly cifs: fix mounts to subdirectories of target cifs: ignore auto and noauto options if given
2021-01-30ecryptfs: use DEFINE_MUTEX() for mutex lockZheng Yongjun1-2/+1
mutex lock can be initialized automatically with DEFINE_MUTEX() rather than explicitly calling mutex_init(). Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: Tyler Hicks <code@tyhicks.com>
2021-01-30eCryptfs: add a semicolonTom Rix2-2/+2
Function like macros should have a semicolon. Signed-off-by: Tom Rix <trix@redhat.com> [tyhicks: Remove the trailing semicolin from the macro's definition, as suggested by Joe Perches] Signed-off-by: Tyler Hicks <code@tyhicks.com>
2021-01-30nfsd: skip some unnecessary stats in the v4 caseJ. Bruce Fields1-17/+27
In the typical case of v4 and an i_version-supporting filesystem, we can skip a stat which is only required to fake up a change attribute from ctime. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-30nfs: use change attribute for NFS re-exportsJ. Bruce Fields2-1/+22
When exporting NFS, we may as well use the real change attribute returned by the original server instead of faking up a change attribute from the ctime. Note we can't do that by setting I_VERSION--that would also turn on the logic in iversion.h which treats the lower bit specially, and that doesn't make sense for NFS. So instead we define a new export operation for filesystems like NFS that want to manage the change attribute themselves. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-30rxrpc: Fix deadlock around release of dst cached on udp tunnelDavid Howells1-3/+3
AF_RXRPC sockets use UDP ports in encap mode. This causes socket and dst from an incoming packet to get stolen and attached to the UDP socket from whence it is leaked when that socket is closed. When a network namespace is removed, the wait for dst records to be cleaned up happens before the cleanup of the rxrpc and UDP socket, meaning that the wait never finishes. Fix this by moving the rxrpc (and, by dependence, the afs) private per-network namespace registrations to the device group rather than subsys group. This allows cached rxrpc local endpoints to be cleared and their UDP sockets closed before we try waiting for the dst records. The symptom is that lines looking like the following: unregister_netdevice: waiting for lo to become free get emitted at regular intervals after running something like the referenced syzbot test. Thanks to Vadim for tracking this down and work out the fix. Reported-by: syzbot+df400f2f24a1677cd7e0@syzkaller.appspotmail.com Reported-by: Vadim Fedorenko <vfedorenko@novek.ru> Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/161196443016.3868642.5577440140646403533.stgit@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-30Merge tag 'for-5.11-rc5-tag' of ↵Linus Torvalds6-51/+46
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes for a late rc: - fix lockdep complaint on 32bit arches and also remove an unsafe memory use due to device vs filesystem lifetime - two fixes for free space tree: * race during log replay and cache rebuild, now more likely to happen due to changes in this dev cycle * possible free space tree corruption with online conversion during initial tree population" * tag 'for-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix log replay failure due to race with space cache rebuild btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch btrfs: fix possible free space tree corruption with online conversion
2021-01-30Merge tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+9
Pull block fixes from Jens Axboe: "All over the place fixes for this release: - blk-cgroup iteration teardown resched fix (Baolin) - NVMe pull request from Christoph: - add another Write Zeroes quirk (Chaitanya Kulkarni) - handle a no path available corner case (Daniel Wagner) - use the proper RCU aware list_add helper (Chao Leng) - bcache regression fix (Coly) - bdev->bd_size_lock IRQ fix. This will be fixed in drivers for 5.12, but for now, we'll make it IRQ safe (Damien) - null_blk zoned init fix (Damien) - add_partition() error handling fix (Dinghao) - s390 dasd kobject fix (Jan) - nbd fix for freezing queue while adding connections (Josef) - tag queueing regression fix (Ming) - revert of a patch that inadvertently meant that we regressed write performance on raid (Maxim)" * tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block: null_blk: cleanup zoned mode initialization nvme-core: use list_add_tail_rcu instead of list_add_tail for nvme_init_ns_head nvme-multipath: Early exit if no path is available nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a SPCC device bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES block: fix bd_size_lock use blk-cgroup: Use cond_resched() when destroy blkgs Revert "block: simplify set_init_blocksize" to regain lost performance nbd: freeze the queue while we're adding connections s390/dasd: Fix inconsistent kobject removal block: Fix an error handling in add_partition blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue
2021-01-30Merge tag 'io_uring-5.11-2021-01-29' of git://git.kernel.dk/linux-blockLinus Torvalds1-42/+53
Pull io_uring fixes from Jens Axboe: "We got the cancelation story sorted now, so for all intents and purposes, this should be it for 5.11 outside of any potential little fixes that may come in. This contains: - task_work task state fixes (Hao, Pavel) - Cancelation fixes (me, Pavel) - Fix for an inflight req patch in this release (Pavel) - Fix for a lock deadlock issue (Pavel)" * tag 'io_uring-5.11-2021-01-29' of git://git.kernel.dk/linux-block: io_uring: reinforce cancel on flush during exit io_uring: fix sqo ownership false positive warning io_uring: fix list corruption for splice file_get io_uring: fix flush cqring overflow list while TASK_INTERRUPTIBLE io_uring: fix wqe->lock/completion_lock deadlock io_uring: fix cancellation taking mutex while TASK_UNINTERRUPTIBLE io_uring: fix __io_uring_files_cancel() with TASK_UNINTERRUPTIBLE io_uring: only call io_cqring_ev_posted() if events were posted io_uring: if we see flush on exit, cancel related tasks
2021-01-29tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu()Will Deacon1-1/+1
The 'start' and 'end' arguments to tlb_gather_mmu() are no longer needed now that there is a separate function for 'fullmm' flushing. Remove the unused arguments and update all callers. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com
2021-01-29tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu()Will Deacon1-1/+1
Since commit 7a30df49f63a ("mm: mmu_gather: remove __tlb_reset_range() for force flush"), the 'start' and 'end' arguments to tlb_finish_mmu() are no longer used, since we flush the whole mm in case of a nested invalidation. Remove the unused arguments and update all callers. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org
2021-01-29mm: proc: Invalidate TLB after clearing soft-dirty page stateWill Deacon1-4/+5
Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"), TLB invalidation is elided in tlb_finish_mmu() if no entries were batched via the tlb_remove_*() functions. Consequently, the page-table modifications performed by clear_refs_write() in response to a write to /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is fine when simply aging the ptes, in the case of clearing the "soft-dirty" state we can end up with entries where pte_write() is false, yet a writable mapping remains in the TLB. Fix this by avoiding the mmu_gather API altogether: managing both the 'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB invalidation for the sort-dirty path, much like mprotect() does already. Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush”) Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20210127235347.1402-2-will@kernel.org
2021-01-29fs: Remove dcookies supportViresh Kumar2-357/+0
The dcookies stuff was only used by the kernel's old oprofile code. Now that oprofile's support is removed from the kernel, there is no need for dcookies as well. Remove it. Suggested-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Robert Richter <rric@kernel.org> Acked-by: William Cohen <wcohen@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Thomas Gleixner <tglx@linutronix.de>
2021-01-29cifs: fix dfs domain referralsRonnie Sahlberg6-16/+75
The new mount API requires additional changes to how DFS is handled. Additional testing of DFS uncovered problems with domain based DFS referrals (a follow on patch addresses DFS links) which this patch addresses. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-29Merge tag 'ecryptfs-5.11-rc6-setxattr-fix' of ↵Linus Torvalds1-3/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs Pull ecryptfs fix from Tyler Hicks: "Fix a regression that resulted in two rounds of UID translations when setting v3 namespaced file capabilities in some configurations" * tag 'ecryptfs-5.11-rc6-setxattr-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs: ecryptfs: fix uid translation for setxattr on security.capability
2021-01-29io_uring: reinforce cancel on flush during exitPavel Begunkov1-2/+1
What 84965ff8a84f0 ("io_uring: if we see flush on exit, cancel related tasks") really wants is to cancel all relevant REQ_F_INFLIGHT requests reliably. That can be achieved by io_uring_cancel_files(), but we'll miss it calling io_uring_cancel_task_requests(files=NULL) from io_uring_flush(), because it will go through __io_uring_cancel_task_requests(). Just always call io_uring_cancel_files() during cancel, it's good enough for now. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-29cifs: returning mount parm processing errors correctlySteve French1-4/+4
During additional testing of the updated cifs.ko with the new mount API support, we found a few additional cases where we were logging errors, but not returning them to the user. For example: a) invalid security mechanisms b) invalid cache options c) unsupported rdma d) invalid smb dialect requested Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api") Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-28io_uring: fix sqo ownership false positive warningPavel Begunkov1-2/+0
WARNING: CPU: 0 PID: 21359 at fs/io_uring.c:9042 io_uring_cancel_task_requests+0xe55/0x10c0 fs/io_uring.c:9042 Call Trace: io_uring_flush+0x47b/0x6e0 fs/io_uring.c:9227 filp_close+0xb4/0x170 fs/open.c:1295 close_files fs/file.c:403 [inline] put_files_struct fs/file.c:418 [inline] put_files_struct+0x1cc/0x350 fs/file.c:415 exit_files+0x7e/0xa0 fs/file.c:435 do_exit+0xc22/0x2ae0 kernel/exit.c:820 do_group_exit+0x125/0x310 kernel/exit.c:922 get_signal+0x427/0x20f0 kernel/signal.c:2773 arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Now io_uring_cancel_task_requests() can be called not through file notes but directly, remove a WARN_ONCE() there that give us false positives. That check is not very important and we catch it in other places. Fixes: 84965ff8a84f0 ("io_uring: if we see flush on exit, cancel related tasks") Cc: stable@vger.kernel.org # 5.9+ Reported-by: syzbot+3e3d9bd0c6ce9efbc3ef@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-28io_uring: fix list corruption for splice file_getPavel Begunkov1-1/+2
kernel BUG at lib/list_debug.c:29! Call Trace: __list_add include/linux/list.h:67 [inline] list_add include/linux/list.h:86 [inline] io_file_get+0x8cc/0xdb0 fs/io_uring.c:6466 __io_splice_prep+0x1bc/0x530 fs/io_uring.c:3866 io_splice_prep fs/io_uring.c:3920 [inline] io_req_prep+0x3546/0x4e80 fs/io_uring.c:6081 io_queue_sqe+0x609/0x10d0 fs/io_uring.c:6628 io_submit_sqe fs/io_uring.c:6705 [inline] io_submit_sqes+0x1495/0x2720 fs/io_uring.c:6953 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9353 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 io_file_get() may be called from splice, and so REQ_F_INFLIGHT may already be set. Fixes: 02a13674fa0e8 ("io_uring: account io_uring internal files as REQ_F_INFLIGHT") Cc: stable@vger.kernel.org # 5.9+ Reported-by: syzbot+6879187cf57845801267@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-28cifs: fix mounts to subdirectories of targetSteve French1-0/+1
The "prefixpath" mount option needs to be ignored which was missed in the recent conversion to the new mount API (prefixpath would be set by the mount helper if mounting a subdirectory of the root of a share e.g. //server/share/subdir) Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api") Suggested-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2021-01-28cifs: ignore auto and noauto options if givenAdam Harvey1-0/+1
In 24e0a1eff9e2, the noauto and auto options were missed when migrating to the new mount API. As a result, users with noauto in their fstab mount options are now unable to mount cifs filesystems, as they'll receive an "Unknown parameter" error. This restores the old behaviour of ignoring noauto and auto if they're given. Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api") Signed-off-by: Adam Harvey <adam@adamharvey.name> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-28NFSv4_2: SSC helper should use its own config.Dai Ngo6-3/+22
Currently NFSv4_2 SSC helper, nfs_ssc, incorrectly uses GRACE_PERIOD as its config. Fix by adding new config NFS_V4_2_SSC_HELPER which depends on NFS_V4_2 and is automatically selected when NFSD_V4 is enabled. Also removed the file name from a comment in nfs_ssc.c. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-28nfsd: cstate->session->se_client -> cstate->clpJ. Bruce Fields2-11/+10
I'm not sure why we're writing this out the hard way in so many places. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-28nfsd: simplify nfsd4_check_open_reclaimJ. Bruce Fields3-19/+5
The set_client() was already taken care of by process_open1(). The comments here are mostly redundant with the code. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-28nfsd: remove unused set_client argumentJ. Bruce Fields1-13/+10
Every caller is setting this argument to false, so we don't need it. Also cut this comment a bit and remove an unnecessary warning. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-28ovl: implement volatile-specific fsync error behaviourSargun Dhillon6-11/+63
Overlayfs's volatile option allows the user to bypass all forced sync calls to the upperdir filesystem. This comes at the cost of safety. We can never ensure that the user's data is intact, but we can make a best effort to expose whether or not the data is likely to be in a bad state. The best way to handle this in the time being is that if an overlayfs's upperdir experiences an error after a volatile mount occurs, that error will be returned on fsync, fdatasync, sync, and syncfs. This is contradictory to the traditional behaviour of VFS which fails the call once, and only raises an error if a subsequent fsync error has occurred, and been raised by the filesystem. One awkward aspect of the patch is that we have to manually set the superblock's errseq_t after the sync_fs callback as opposed to just returning an error from syncfs. This is because the call chain looks something like this: sys_syncfs -> sync_filesystem -> __sync_filesystem -> /* The return value is ignored here sb->s_op->sync_fs(sb) _sync_blockdev /* Where the VFS fetches the error to raise to userspace */ errseq_check_and_advance Because of this we call errseq_set every time the sync_fs callback occurs. Due to the nature of this seen / unseen dichotomy, if the upperdir is an inconsistent state at the initial mount time, overlayfs will refuse to mount, as overlayfs cannot get a snapshot of the upperdir's errseq that will increment on error until the user calls syncfs. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Fixes: c86243b090bc ("ovl: provide a mount option "volatile"") Cc: stable@vger.kernel.org Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-01-28ovl: skip getxattr of security labelsAmir Goldstein1-7/+8
When inode has no listxattr op of its own (e.g. squashfs) vfs_listxattr calls the LSM inode_listsecurity hooks to list the xattrs that LSMs will intercept in inode_getxattr hooks. When selinux LSM is installed but not initialized, it will list the security.selinux xattr in inode_listsecurity, but will not intercept it in inode_getxattr. This results in -ENODATA for a getxattr call for an xattr returned by listxattr. This situation was manifested as overlayfs failure to copy up lower files from squashfs when selinux is built-in but not initialized, because ovl_copy_xattr() iterates the lower inode xattrs by vfs_listxattr() and vfs_getxattr(). ovl_copy_xattr() skips copy up of security labels that are indentified by inode_copy_up_xattr LSM hooks, but it does that after vfs_getxattr(). Since we are not going to copy them, skip vfs_getxattr() of the security labels. Reported-by: Michael Labriola <michael.d.labriola@gmail.com> Tested-by: Michael Labriola <michael.d.labriola@gmail.com> Link: https://lore.kernel.org/linux-unionfs/2nv9d47zt7.fsf@aldarion.sourceruckus.org/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-01-28ovl: fix dentry leak in ovl_get_redirectLiangyan1-1/+1
We need to lock d_parent->d_lock before dget_dlock, or this may have d_lockref updated parallelly like calltrace below which will cause dentry->d_lockref leak and risk a crash. CPU 0 CPU 1 ovl_set_redirect lookup_fast ovl_get_redirect __d_lookup dget_dlock //no lock protection here spin_lock(&dentry->d_lock) dentry->d_lockref.count++ dentry->d_lockref.count++ [   49.799059] PGD 800000061fed7067 P4D 800000061fed7067 PUD 61fec5067 PMD 0 [   49.799689] Oops: 0002 [#1] SMP PTI [   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 #1 [   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014 [   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20 [   49.803470] RSP: 0018:ffffac6fc5417e98 EFLAGS: 00010246 [   49.803949] RAX: 0000000000000000 RBX: ffff93b8da3446c0 RCX: 0000000a00000000 [   49.804600] RDX: 0000000000000001 RSI: 000000000000000a RDI: 0000000000000088 [   49.805252] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff993cf040 [   49.805898] R10: ffff93b92292e580 R11: ffffd27f188a4b80 R12: 0000000000000000 [   49.806548] R13: 00000000ffffff9c R14: 00000000fffffffe R15: ffff93b8da3446c0 [   49.807200] FS:  00007ffbedffb700(0000) GS:ffff93b927880000(0000) knlGS:0000000000000000 [   49.807935] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [   49.808461] CR2: 0000000000000088 CR3: 00000005e3f74006 CR4: 00000000003606a0 [   49.809113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [   49.809758] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [   49.810410] Call Trace: [   49.810653]  d_delete+0x2c/0xb0 [   49.810951]  vfs_rmdir+0xfd/0x120 [   49.811264]  do_rmdir+0x14f/0x1a0 [   49.811573]  do_syscall_64+0x5b/0x190 [   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9 [   49.812385] RIP: 0033:0x7ffbf505ffd7 [   49.814404] RSP: 002b:00007ffbedffada8 EFLAGS: 00000297 ORIG_RAX: 0000000000000054 [   49.815098] RAX: ffffffffffffffda RBX: 00007ffbedffb640 RCX: 00007ffbf505ffd7 [   49.815744] RDX: 0000000004449700 RSI: 0000000000000000 RDI: 0000000006c8cd50 [   49.816394] RBP: 00007ffbedffaea0 R08: 0000000000000000 R09: 0000000000017d0b [   49.817038] R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000012 [   49.817687] R13: 00000000072823d8 R14: 00007ffbedffb700 R15: 00000000072823d8 [   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd libps2 i8042 [   49.819052] CR2: 0000000000000088 [   49.819368] ---[ end trace 4e652b8aa299aa2d ]--- [   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20 [   49.821880] RSP: 0018:ffffac6fc5417e98 EFLAGS: 00010246 [   49.822363] RAX: 0000000000000000 RBX: ffff93b8da3446c0 RCX: 0000000a00000000 [   49.823008] RDX: 0000000000000001 RSI: 000000000000000a RDI: 0000000000000088 [   49.823658] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff993cf040 [   49.825404] R10: ffff93b92292e580 R11: ffffd27f188a4b80 R12: 0000000000000000 [   49.827147] R13: 00000000ffffff9c R14: 00000000fffffffe R15: ffff93b8da3446c0 [   49.828890] FS:  00007ffbedffb700(0000) GS:ffff93b927880000(0000) knlGS:0000000000000000 [   49.830725] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [   49.832359] CR2: 0000000000000088 CR3: 00000005e3f74006 CR4: 00000000003606a0 [   49.834085] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [   49.835792] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Cc: <stable@vger.kernel.org> Fixes: a6c606551141 ("ovl: redirect on rename-dir") Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-01-28ovl: avoid deadlock on directory ioctlMiklos Szeredi1-16/+7
The function ovl_dir_real_file() currently uses the inode lock to serialize writes to the od->upperfile field. However, this function will get called by ovl_ioctl_set_flags(), which utilizes the inode lock too. In this case ovl_dir_real_file() will try to claim a lock that is owned by a function in its call stack, which won't get released before ovl_dir_real_file() returns. Fix by replacing the open coded compare and exchange by an explicit atomic op. Fixes: 61536bed2149 ("ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories") Cc: stable@vger.kernel.org # v5.10 Reported-by: Icenowy Zheng <icenowy@aosc.io> Tested-by: Icenowy Zheng <icenowy@aosc.io> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-01-28ovl: perform vfs_getxattr() with mounter credsMiklos Szeredi1-0/+2
The vfs_getxattr() in ovl_xattr_set() is used to check whether an xattr exist on a lower layer file that is to be removed. If the xattr does not exist, then no need to copy up the file. This call of vfs_getxattr() wasn't wrapped in credential override, and this is probably okay. But for consitency wrap this instance as well. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-01-28ovl: add warning on user_ns mismatchMiklos Szeredi1-0/+4
Currently there's no way to create an overlay filesystem outside of the current user namespace. Make sure that if this assumption changes it doesn't go unnoticed. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-01-28f2fs: deprecate f2fs_trace_ioJaegeuk Kim10-239/+0
This patch deprecates f2fs_trace_io, since f2fs uses page->private more broadly, resulting in more buggy cases. Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-28f2fs: Remove readahead collision detectionMatthew Wilcox (Oracle)3-28/+0
With the new ->readahead operation, locked pages are added to the page cache, preventing two threads from racing with each other to read the same chunk of file, so this is dead code. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-28f2fs: remove unused stat_{inc, dec}_atomic_writeJack Qiu1-2/+0
Just clean code, no logical change. Signed-off-by: Jack Qiu <jack.qiu@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-28f2fs: introduce sb_status sysfs nodeChao Yu1-0/+8
Introduce /sys/fs/f2fs/<devname>/stat/sb_status to show superblock status in real time as a hexadecimal value. value sb status macro description 0x1 SBI_IS_DIRTY, /* dirty flag for checkpoint */ 0x2 SBI_IS_CLOSE, /* specify unmounting */ 0x4 SBI_NEED_FSCK, /* need fsck.f2fs to fix */ 0x8 SBI_POR_DOING, /* recovery is doing or not */ 0x10 SBI_NEED_SB_WRITE, /* need to recover superblock */ 0x20 SBI_NEED_CP, /* need to checkpoint */ 0x40 SBI_IS_SHUTDOWN, /* shutdown by ioctl */ 0x80 SBI_IS_RECOVERED, /* recovered orphan/data */ 0x100 SBI_CP_DISABLED, /* CP was disabled last mount */ 0x200 SBI_CP_DISABLED_QUICK, /* CP was disabled quickly */ 0x400 SBI_QUOTA_NEED_FLUSH, /* need to flush quota info in CP */ 0x800 SBI_QUOTA_SKIP_FLUSH, /* skip flushing quota in current CP */ 0x1000 SBI_QUOTA_NEED_REPAIR, /* quota file may be corrupted */ 0x2000 SBI_IS_RESIZEFS, /* resizefs is in process */ Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-28f2fs: fix to use per-inode maxbytesChengguang Xu4-9/+16
F2FS inode may have different max size, e.g. compressed file have less blkaddr entries in all its direct-node blocks, result in being with less max filesize. So change to use per-inode maxbytes. Suggested-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-28f2fs: compress: fix potential deadlockChao Yu3-6/+11
generic/269 reports a hangtask issue, the root cause is ABBA deadlock described as below: Thread A Thread B - down_write(&sbi->gc_lock) -- A - f2fs_write_data_pages - lock all pages in cluster -- B - f2fs_write_multi_pages - f2fs_write_raw_pages - f2fs_write_single_data_page - f2fs_balance_fs - down_write(&sbi->gc_lock) -- A - f2fs_gc - do_garbage_collect - ra_data_block - pagecache_get_page -- B To fix this, it needs to avoid calling f2fs_balance_fs() if there is still cluster pages been locked in context of cluster writeback, so instead, let's call f2fs_balance_fs() in the end of f2fs_write_raw_pages() when all cluster pages were unlocked. Fixes: 4c8ff7095bef ("f2fs: support data compression") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-28libfs: unexport generic_ci_d_compare() and generic_ci_d_hash()Eric Biggers1-5/+3
Now that generic_set_encrypted_ci_d_ops() has been added and ext4 and f2fs are using it, it's no longer necessary to export generic_ci_d_compare() and generic_ci_d_hash() to filesystems. Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-28f2fs: fix to set/clear I_LINKABLE under i_lockChao Yu1-0/+8
fsstress + fault injection test case reports a warning message as below: WARNING: CPU: 13 PID: 6226 at fs/inode.c:361 inc_nlink+0x32/0x40 Call Trace: f2fs_init_inode_metadata+0x25c/0x4a0 [f2fs] f2fs_add_inline_entry+0x153/0x3b0 [f2fs] f2fs_add_dentry+0x75/0x80 [f2fs] f2fs_do_add_link+0x108/0x160 [f2fs] f2fs_rename2+0x6ab/0x14f0 [f2fs] vfs_rename+0x70c/0x940 do_renameat2+0x4d8/0x4f0 __x64_sys_renameat2+0x4b/0x60 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Following race case can cause this: Thread A Kworker - f2fs_rename - f2fs_create_whiteout - __f2fs_tmpfile - f2fs_i_links_write - f2fs_mark_inode_dirty_sync - mark_inode_dirty_sync - writeback_single_inode - __writeback_single_inode - spin_lock(&inode->i_lock) - inode->i_state |= I_LINKABLE - inode->i_state &= ~dirty - spin_unlock(&inode->i_lock) - f2fs_add_link - f2fs_do_add_link - f2fs_add_dentry - f2fs_add_inline_entry - f2fs_init_inode_metadata - f2fs_i_links_write - inc_nlink - WARN_ON(!(inode->i_state & I_LINKABLE)) Fix to add i_lock to avoid i_state update race condition. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-28f2fs: fix null page reference in redirty_blocksDaeho Jeong1-2/+4
By Colin's static analysis, we found out there is a null page reference under low memory situation in redirty_blocks. I've made the page finding loop stop immediately and return an error not to cause further memory pressure when we run into a failure to find a page under low memory condition. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reported-by: Colin Ian King <colin.king@canonical.com> Fixes: 5fdb322ff2c2 ("f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE") Reviewed-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-28f2fs: clean up post-read processingEric Biggers3-264/+297
Rework the post-read processing logic to be much easier to understand. At least one bug is fixed by this: if an I/O error occurred when reading from disk, decryption and verity would be performed on the uninitialized data, causing misleading messages in the kernel log. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-28f2fs: trival cleanup in move_data_block()Chao Yu1-5/+3
Trival cleanups: - relocate set_summary() before its use - relocate "allocate block address" to correct place - remove unneeded f2fs_wait_on_page_writeback() Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>