path: root/fs
Age | Commit message | Author | Files | Lines
2022-03-13 | NFS: remove IS_SWAPFILE hack | NeilBrown | 1 | -5/+0
This code is pointless as IS_SWAPFILE is always defined. So remove it. Suggested-by: Mark Hemment <markhemm@googlemail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 | NFS: Remove remaining dfprintks related to fscache and remove NFSDBG_FSCACHE | Dave Wysochanski | 1 | -10/+0
The fscache cookie APIs including fscache_acquire_cookie() and fscache_relinquish_cookie() now have very good tracing. Thus, there is no real need for dfprintks in the NFS fscache interface. The NFS fscache interface has removed all dfprintks so remove the NFSDBG_FSCACHE defines. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 | NFS: Replace dfprintks with tracepoints in fscache read and write page functions | Dave Wysochanski | 2 | -18/+102
Most of fscache and other NFS IO paths are now using tracepoints. Remove the dfprintks in the NFS fscache read/write page functions and replace with tracepoints at the begin and end of the functions. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 | NFS: Rename fscache read and write pages functions | Dave Wysochanski | 3 | -22/+15
Rename NFS fscache functions in a more consistent fashion to better reflect when we read from and write to fscache. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 | NFS: Cleanup usage of nfs_inode in fscache interface | Dave Wysochanski | 2 | -15/+13
A number of places in the fscache interface used nfs_inode when inode could be used, simplifying the code. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 | NFSv4.1 restrict GETATTR fs_location query to the main transport | Olga Kornievskaia | 1 | -2/+13
In the presence of trunking transports, it's helpful to make sure that during the migration event, the GETATTR for fs_location attribute happens on the main transport. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13 | NFS: remove unneeded check in decode_devicenotify_args() | Alexey Khoroshilov | 1 | -4/+0
The overflow check is not needed anymore after the switch to kmalloc_array(). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Fixes: a4f743a6bb20 ("NFSv4.1: Convert open-coded array allocation calls to kmalloc_array()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
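A hedged sketch of the pattern behind this cleanup (hypothetical names, not the actual NFS structures): kmalloc_array() itself returns NULL when the element-count multiplication would overflow, so an open-coded bound check before the allocation is dead code.

  #include <linux/slab.h>

  struct notify_entry { u32 dev_id; };	/* hypothetical element type */

  static struct notify_entry *alloc_notify_list(size_t n)
  {
  	/* NULL here covers both allocation failure and n * size overflow,
  	 * so no separate "n too large" check is needed beforehand. */
  	return kmalloc_array(n, sizeof(struct notify_entry), GFP_NOFS);
  }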
2022-03-13 | Kbuild: add -Wno-shift-negative-value where -Wextra is used | Arnd Bergmann | 1 | -0/+1
As a preparation for moving to -std=gnu11, turn off the -Wshift-negative-value option. This warning is enabled by gcc when building with -Wextra for c99 or higher, but not for c89. Since the kernel already relies on well-defined overflow behavior, the warning is not helpful and can simply be disabled in all locations that use -Wextra. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang v13.0.0 (x86-64) Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2022-03-13 | ext4: do not call FC trace event in ext4_fc_commit() if FS does not support FC | Ritesh Harjani | 1 | -3/+3
This just moves trace_ext4_fc_commit_start(sb) and the ktime_get() call used for measuring FC commit time after the check of whether the sb supports JOURNAL_FAST_COMMIT or not. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/d53cf3e535924ec0a1eb41a560e96561b0727e7a.1647057583.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-13 | ext4: remove unused enum EXT4_FC_COMMIT_FAILED | Ritesh Harjani | 1 | -1/+0
The commit below removed all references to EXT4_FC_COMMIT_FAILED: commit 0915e464cb274 ("ext4: simplify updating of fast commit stats"). Just remove the enum value since it is not used anymore. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/c941357e476be07a1138c7319ca5faab7fb80fc6.1647057583.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-13 | ext4: warn when dirtying page w/o buffers in data=journal mode | Jan Kara | 1 | -4/+6
Recently I've got a report of a BUG_ON triggering during transaction commit in ext4_journalled_writepage_callback() because we've spotted a dirty page without buffers. Add WARN_ON_ONCE to ext4_journalled_set_page_dirty() to catch the problematic condition earlier, where we have a better chance of understanding which code path is creating dirty data without preparing the page properly. Also update the comment with current information while we are at it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220310101832.5645-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
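A sketch of the check described, assuming the pre-folio shape of ext4_journalled_set_page_dirty() from this era (not the verbatim patch):

  static int ext4_journalled_set_page_dirty(struct page *page)
  {
  	/* In data=journal mode a page must have buffers attached before
  	 * being dirtied; warn here, where the offending call stack is
  	 * still visible, rather than BUG_ON() later at commit time. */
  	WARN_ON_ONCE(!page_has_buffers(page));
  	SetPageChecked(page);
  	return __set_page_dirty_nobuffers(page);
  }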
2022-03-13 | ext4: make mb_optimize_scan performance mount option work with extents | Ojaswin Mujoo | 1 | -1/+1
Currently the mb_optimize_scan feature, which heavily improves filesystem performance when the FS is fragmented, does not work for files with extents (ext4 by default has files with extents). This patch fixes that and makes mb_optimize_scan work for files with extents. Below are some performance numbers obtained when allocating a 10M and 100M file with and w/o this patch on a filesystem with no 1M contiguous block.

<perf numbers>
===============
Workload: dd if=/dev/urandom of=test conv=fsync bs=1M count=10/100

Time taken
=====================================================
no.  Size   without-patch   with-patch   Diff(%)
1    10M    0m8.401s        0m5.623s     33.06%
2    100M   1m40.465s       1m14.737s    25.6%

<debug stats>
=============
w/o patch:
  mballoc:
    reqs: 17056
    success: 11407
    groups_scanned: 13643
    cr0_stats:
      hits: 37
      groups_considered: 9472
      useless_loops: 36
      bad_suggestions: 0
    cr1_stats:
      hits: 11418
      groups_considered: 908560
      useless_loops: 1894
      bad_suggestions: 0
    cr2_stats:
      hits: 1873
      groups_considered: 6913
      useless_loops: 21
    cr3_stats:
      hits: 21
      groups_considered: 5040
      useless_loops: 21
    extents_scanned: 417364
      goal_hits: 3707
      2^n_hits: 37
      breaks: 1873
      lost: 0
    buddies_generated: 239/240
    buddies_time_used: 651080
    preallocated: 705
    discarded: 478

with patch:
  mballoc:
    reqs: 12768
    success: 11305
    groups_scanned: 12768
    cr0_stats:
      hits: 1
      groups_considered: 18
      useless_loops: 0
      bad_suggestions: 0
    cr1_stats:
      hits: 5829
      groups_considered: 50626
      useless_loops: 0
      bad_suggestions: 0
    cr2_stats:
      hits: 6938
      groups_considered: 580363
      useless_loops: 0
    cr3_stats:
      hits: 0
      groups_considered: 0
      useless_loops: 0
    extents_scanned: 309059
      goal_hits: 0
      2^n_hits: 1
      breaks: 1463
      lost: 0
    buddies_generated: 239/240
    buddies_time_used: 791392
    preallocated: 673
    discarded: 446

Fixes: 196e402 (ext4: improve cr 0 / cr 1 group scanning) Cc: stable@kernel.org Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Reported-by: Nageswara R Sastry <rnsastry@linux.ibm.com> Suggested-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/fc9a48f7f8dcfc83891a8b21f6dd8cdf056ed810.1646732698.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-13 | ext4: make mb_optimize_scan option work with set/unset mount cmd | Ojaswin Mujoo | 1 | -10/+14
After moving to the new mount API, mb_optimize_scan mount option handling was not working as expected due to the parsed value always being overwritten by the default. Refactor and fix this to the expected behavior described below:

* mb_optimize_scan=1 - On
* mb_optimize_scan=0 - Off
* mb_optimize_scan not passed - On if no. of BGs > threshold, else Off
* Remounts retain the previous value unless we explicitly pass the option with a new value

Fixes: cebe85d570cf ("ext4: switch to the new mount api") Cc: stable@kernel.org Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/c98970fe99f26718586d02e942f293300fb48ef3.1646732698.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
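One way to realize this tri-state behavior is a sentinel "unset" value that only an explicit mount option overwrites; a minimal sketch, with illustrative names rather than the exact ext4 code:

  #define MB_OPTIMIZE_SCAN_UNSET	(-1)	/* option not seen on cmdline */
  #define MB_BG_THRESHOLD		16	/* illustrative threshold */

  struct parse_ctx { int mb_optimize_scan; };	/* starts at UNSET */
  struct sb_info   { int mb_optimize_scan; };	/* live superblock state */

  /* Mount/remount time: resolve the tri-state. */
  static void apply_mb_optimize_scan(struct sb_info *sbi, struct parse_ctx *ctx,
  				   int remount, unsigned long groups)
  {
  	if (ctx->mb_optimize_scan != MB_OPTIMIZE_SCAN_UNSET)
  		sbi->mb_optimize_scan = ctx->mb_optimize_scan;	  /* explicit */
  	else if (!remount)
  		sbi->mb_optimize_scan = groups > MB_BG_THRESHOLD; /* default */
  	/* else: remount without the option keeps the previous value */
  }

The bug was equivalent to skipping the UNSET check, so the default clobbered a value the parser had already recorded.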
2022-03-12 | io_uring: remove duplicated member check for io_msg_ring_prep() | Jens Axboe | 1 | -3/+2
Julia and the kernel test robot report that the prep handling for this command inadvertently checks one field twice:

fs/io_uring.c:4338:42-56: duplicated argument to && or ||

Get rid of it. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Julia Lawall <julia.lawall@lip6.fr> Fixes: 4f57f06ce218 ("io_uring: add support for IORING_OP_MSG_RING command") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-11 | Merge branch 'davidh' (fixes from David Howells) | Linus Torvalds | 3 | -8/+35
Merge misc fixes from David Howells: "A set of patches for watch_queue filter issues noted by Jann. I've added in a cleanup patch from Christophe Jaillet to convert to using formal bitmap specifiers for the note allocation bitmap. Also two filesystem fixes (afs and cachefiles)"

* emailed patches from David Howells <dhowells@redhat.com>:
  cachefiles: Fix volume coherency attribute
  afs: Fix potential thrashing in afs writeback
  watch_queue: Make comment about setting ->defunct more accurate
  watch_queue: Fix lack of barrier/sync/lock between post and read
  watch_queue: Free the alloc bitmap when the watch_queue is torn down
  watch_queue: Fix the alloc bitmap size to reflect notes allocated
  watch_queue: Use the bitmap API when applicable
  watch_queue: Fix to always request a pow-of-2 pipe ring size
  watch_queue: Fix to release page in ->release()
  watch_queue, pipe: Free watchqueue state after clearing pipe ring
  watch_queue: Fix filter limit check
2022-03-11 | cachefiles: Fix volume coherency attribute | David Howells | 1 | -3/+20
A network filesystem may set coherency data on a volume cookie, and if given, cachefiles will store this in an xattr on the directory in the cache corresponding to the volume. The function that sets the xattr just stores the contents of the volume coherency buffer directly into the xattr, with nothing added; the checking function, on the other hand, has a cut'n'paste error whereby it tries to interpret the xattr contents as it would the xattr on an ordinary file (using the cachefiles_xattr struct). This results in a failure to match the coherency data because the buffer ends up being shifted by 18 bytes. Fix this by defining a structure specifically for the volume xattr and making both the setting and checking functions use it. Since the volume coherency doesn't work if used, take the opportunity to insert a reserved field for future use, set it to 0 and check that it is 0. Log any mismatch through the appropriate tracepoint. Note that this only affects cifs; 9p, afs, ceph and nfs don't use the volume coherency data at the moment. Fixes: 32e150037dce ("fscache, cachefiles: Store the volume coherency data") Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Steve French <smfrench@gmail.com> cc: linux-cifs@vger.kernel.org cc: linux-cachefs@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
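The dedicated volume xattr layout described above comes out roughly as follows (a sketch; the reserved field is written as 0 and verified on match):

  /* Volume-specific xattr buffer: unlike the per-file cachefiles_xattr,
   * nothing precedes the coherency data except a reserved field kept
   * for future use. */
  struct cachefiles_vol_xattr {
  	__be32	reserved;	/* Reserved, should be 0 */
  	__u8	data[];		/* Coherency data from the netfs */
  } __packed;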
2022-03-11 | afs: Fix potential thrashing in afs writeback | David Howells | 1 | -1/+8
In afs_writepages_region(), if the dirty page we find is undergoing writeback or write to cache, but the sync_mode is WB_SYNC_NONE, we go round the loop trying the same page again and again with no pausing or waiting unless and until another thread manages to clear the writeback and fscache flags. Fix this with three measures:

 (1) Advance start to after the page we found.
 (2) Break out of the loop and return if rescheduling is requested.
 (3) Arbitrarily give up after a maximum of 5 skips.

Fixes: 31143d5d515e ("AFS: implement basic file write support") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Marc Dionne <marc.dionne@auristor.com> Acked-by: Marc Dionne <marc.dionne@auristor.com> Link: https://lore.kernel.org/r/164692725757.2097000.2060513769492301854.stgit@warthog.procyon.org.uk/ # v1 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
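Sketched against a simplified version of the scan loop (an assumed shape, not the verbatim patch, with `skips` a counter initialized to 0 before the loop), the three measures combine like so:

  if (PageWriteback(page) || PageFsCache(page)) {
  	unlock_page(page);
  	if (wbc->sync_mode == WB_SYNC_NONE) {
  		start = page->index + 1;	/* (1) move past this page */
  		if (need_resched()) {		/* (2) yield when asked */
  			ret = 0;
  			break;
  		}
  		if (skips++ >= 5)		/* (3) cap the busy retries */
  			break;
  		continue;
  	}
  	wait_on_page_writeback(page);		/* WB_SYNC_ALL: just wait */
  }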
2022-03-11 | watch_queue: Fix lack of barrier/sync/lock between post and read | David Howells | 1 | -1/+2
There's nothing to synchronise post_one_notification() versus pipe_read(). Whilst posting is done under pipe->rd_wait.lock, the reader only takes pipe->mutex which cannot bar notification posting as that may need to be made from contexts that cannot sleep. Fix this by setting pipe->head with a barrier in post_one_notification() and reading pipe->head with a barrier in pipe_read(). If that's not sufficient, the rd_wait.lock will need to be taken, possibly in a ->confirm() op so that it only applies to notifications. The lock would, however, have to be dropped before copy_page_to_iter() is invoked. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
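A sketch of the pairing this describes: the poster publishes the new head with release semantics after filling the payload, and the reader loads it with acquire semantics so the payload is guaranteed visible:

  /* Writer, in post_one_notification(), under pipe->rd_wait.lock: */
  buf = &pipe->bufs[head & mask];
  /* ... fill in the notification buffer ... */
  smp_store_release(&pipe->head, head + 1);	/* publish after payload */

  /* Reader, in pipe_read(), holding only pipe->mutex: */
  unsigned int head = smp_load_acquire(&pipe->head); /* payload visible */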
2022-03-11 | watch_queue, pipe: Free watchqueue state after clearing pipe ring | David Howells | 1 | -3/+5
In free_pipe_info(), free the watchqueue state after clearing the pipe ring as each pipe ring descriptor has a release function, and in the case of a notification message, this is watch_queue_pipe_buf_release() which tries to mark the allocation bitmap that was previously released. Fix this by moving the put of the pipe's ref on the watch queue to after the ring has been cleared. We still need to call watch_queue_clear() before doing that to make sure that the pipe is disconnected from any notification sources first. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
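The resulting teardown order in free_pipe_info() looks roughly like this (a sketch of the ordering, omitting unrelated cleanup):

  void free_pipe_info(struct pipe_inode_info *pipe)
  {
  	unsigned int i;

  #ifdef CONFIG_WATCH_QUEUE
  	if (pipe->watch_queue)
  		watch_queue_clear(pipe->watch_queue);	/* detach sources */
  #endif
  	/* Release every ring slot; a notification buffer's ->release()
  	 * still needs the watch_queue's alloc bitmap to exist here. */
  	for (i = 0; i < pipe->ring_size; i++) {
  		struct pipe_buffer *buf = pipe->bufs + i;
  		if (buf->ops)
  			pipe_buf_release(pipe, buf);
  	}
  #ifdef CONFIG_WATCH_QUEUE
  	if (pipe->watch_queue)
  		put_watch_queue(pipe->watch_queue);	/* now safe to free */
  #endif
  	kfree(pipe->bufs);
  	kfree(pipe);
  }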
2022-03-11 | f2fs: don't get FREEZE lock in f2fs_evict_inode in frozen fs | Jaegeuk Kim | 4 | -2/+10
Let's purge the inode cache in order to avoid the below deadlock:

[freeze test]                      [shrinker]
freeze_super
 - percpu_down_write(SB_FREEZE_FS)
                                   - super_cache_scan
                                     - down_read(&sb->s_umount)
                                       - prune_icache_sb
                                         - dispose_list
                                           - evict
                                             - f2fs_evict_inode
thaw_super
 - down_write(&sb->s_umount)
                                               - __percpu_down_read(SB_FREEZE_FS)

Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-11 | NFSD: Fix nfsd_breaker_owns_lease() return values | Chuck Lever | 1 | -2/+10
These have been incorrect since the function was introduced. A proper kerneldoc comment is added since this function, though static, is part of an external interface. Reported-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-03-11 | NFSD: Clean up _lm_ operation names | Chuck Lever | 1 | -4/+4
The common practice is to name function instances the same as the method names, but with a uniquifying prefix. Commit aef9583b234a ("NFSD: Get reference of lockowner when coping file_lock") missed this -- the new function names should both have been of the form "nfsd4_lm_*". Before more lock manager operations are added in NFSD, rename these two functions for consistency. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-03-11 | NFSD: Remove CONFIG_NFSD_V3 | Chuck Lever | 8 | -51/+3
Eventually support for NFSv2 in the Linux NFS server is to be deprecated and then removed. However, NFSv2 is the "always supported" version that is available as soon as CONFIG_NFSD is set. Before NFSv2 support can be removed, we need to choose a different "always supported" version. This patch removes CONFIG_NFSD_V3 so that NFSv3 is always supported, as NFSv2 is today. When NFSv2 support is removed, NFSv3 will become the only "always supported" NFS version. The defconfigs still need to be updated to remove CONFIG_NFSD_V3=y. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-03-11 | tracehook: Remove tracehook.h | Eric W. Biederman | 6 | -6/+1
Now that all of the definitions have moved out of tracehook.h into ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in tracehook.h so remove it. Update the few files that were depending upon tracehook.h to bring in definitions to use the headers they need directly. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-11 | task_work: Decouple TIF_NOTIFY_SIGNAL and task_work | Eric W. Biederman | 2 | -2/+6
There are a small handful of reasons besides pending signals that the kernel might want to break out of interruptible sleeps. The flag TIF_NOTIFY_SIGNAL and the helpers that set and clear TIF_NOTIFY_SIGNAL provide the infrastructure for breaking out of interruptible sleeps and entering the return-to-user-space slow path for those cases. Expand tracehook_notify_signal inline in its callers and remove it, which makes clear that TIF_NOTIFY_SIGNAL and task_work are separate concepts. Update the comment on set_notify_signal to more accurately describe its purpose. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-9-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 | io_uring: allow submissions to continue on error | Jens Axboe | 1 | -3/+9
By default, io_uring will stop submitting a batch of requests if we run into an error submitting a request. This isn't strictly necessary, as the error result is passed out-of-band via a CQE anyway. And it can be a bit confusing for some applications. Provide a way to setup a ring that will continue submitting on error, when the error CQE has been posted. There's still one case that will break out of submission. If we fail allocating a request, then we'll still return -ENOMEM. We could in theory post a CQE for that condition too even if we never got a request. Leave that for a potential followup. Reported-by: Dylan Yudaken <dylany@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
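From userspace this behavior is opted into at ring setup; the flag this change adds to the uapi is IORING_SETUP_SUBMIT_ALL. A minimal liburing sketch:

  #include <liburing.h>

  int setup_submit_all_ring(struct io_uring *ring)
  {
  	struct io_uring_params p = { };

  	/* Keep submitting the rest of a batch even if one SQE fails;
  	 * the failure is still reported via that request's CQE. */
  	p.flags = IORING_SETUP_SUBMIT_ALL;
  	return io_uring_queue_init_params(8, ring, &p);
  }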
2022-03-10 | task_work: Introduce task_work_pending | Eric W. Biederman | 1 | -3/+3
Wrap the test of task->task_works in a helper function to make it clear what is being tested. All of the other readers of task->task_works use READ_ONCE, and this is even necessary on current as other processes can update task->task_works. So for consistency I have added READ_ONCE into task_work_pending. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-7-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
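Per the description, the helper is essentially:

  static inline bool task_work_pending(struct task_struct *task)
  {
  	return READ_ONCE(task->task_works);
  }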
2022-03-10 | io_uring: recycle provided buffers if request goes async | Jens Axboe | 1 | -0/+36
If we are using provided buffers, it's less than useful to have a buffer selected and pinned if a request needs to go async or arms poll to be notified of when we can process it. Recycle the buffer in those events, so we don't pin it for the duration of the request. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: ensure reads re-import for selected buffers | Jens Axboe | 1 | -0/+10
If we drop buffers between scheduling a retry, then we need to re-import when we start the request again. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: retry early for reads if we can poll | Jens Axboe | 1 | -0/+3
Most of the logic in io_read() deals with regular files, and in some ways it would make sense to split the handling into S_IFREG and others. But at least for retry, we don't need to bother setting up a bunch of state just to abort in the loop later. In particular, don't bother forcing setup of async data for a normal non-vectored read when we don't need it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io-uring: Make statx API stable | Stefan Roesch | 3 | -17/+58
One of the key architectural tenets is to keep the parameters for io-uring stable: after the call has been submitted, the application is free to change the memory its values point to. Unfortunately this is not the case for the current statx implementation.

IO-uring change: This change replaces the const char * filename pointer in the io_statx structure with a struct filename *. In addition, it also creates the filename object during the prepare phase. With this change, the opcode also needs to invoke cleanup, so the filename object gets freed after processing the request.

fs change: This replaces the const char __user * filename parameter in the two functions do_statx and vfs_statx with a struct filename *. In addition, to be able to correctly construct a filename object, a new helper function getname_statx_lookup_flags is introduced. The function makes sure that do_statx and vfs_statx are invoked with the correct lookup flags.

Signed-off-by: Stefan Roesch <shr@fb.com> Link: https://lore.kernel.org/r/20220225185326.1373304-2-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
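A hedged sketch of the prep-time capture this enables, with the surrounding io_uring field layout assumed: the user's path pointer is resolved into a struct filename while the SQE is still guaranteed valid, and cleanup is requested so the object is freed after the request completes:

  static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  {
  	struct io_statx *sx = &req->statx;	/* filename is now a struct filename * */
  	const char __user *path;

  	sx->dfd   = READ_ONCE(sqe->fd);
  	sx->mask  = READ_ONCE(sqe->len);
  	sx->flags = READ_ONCE(sqe->statx_flags);
  	path = u64_to_user_ptr(READ_ONCE(sqe->addr));

  	/* Resolve now; the app may reuse the path buffer after submit. */
  	sx->filename = getname_flags(path,
  				     getname_statx_lookup_flags(sx->flags),
  				     NULL);
  	if (IS_ERR(sx->filename))
  		return PTR_ERR(sx->filename);

  	req->flags |= REQ_F_NEED_CLEANUP;	/* free the filename later */
  	return 0;
  }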
2022-03-10 | io_uring: Add support for napi_busy_poll | Olivier Langlois | 1 | -1/+231
The sqpoll thread can be used for performing the napi busy poll in a similar way that it does io polling for file systems supporting direct access bypassing the page cache. The other way that io_uring can be used for napi busy poll is by calling io_uring_enter() to get events. If the user specifies a timeout value, it is distributed between polling and sleeping by using the systemwide setting /proc/sys/net/core/busy_poll. The changes have been tested with this program: https://github.com/lano1106/io_uring_udp_ping and the result is:

Without sqpoll:
NAPI busy loop disabled: rtt min/avg/max/mdev = 40.631/42.050/58.667/1.547 us
NAPI busy loop enabled:  rtt min/avg/max/mdev = 30.619/31.753/61.433/1.456 us

With sqpoll:
NAPI busy loop disabled: rtt min/avg/max/mdev = 42.087/44.438/59.508/1.533 us
NAPI busy loop enabled:  rtt min/avg/max/mdev = 35.779/37.347/52.201/0.924 us

Co-developed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/810bd9408ffc510ff08269e78dca9df4af0b9e4e.1646777484.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: minor io_cqring_wait() optimization | Olivier Langlois | 1 | -8/+8
Move the block manipulating the sig variable up, so that the code that may encounter an error and exit runs first, before the rest of the function executes; this avoids useless computation when the function bails out early. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/84513f7cc1b1fb31d8f4cb910aee033391d036b4.1646777484.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: add support for IORING_OP_MSG_RING command | Jens Axboe | 1 | -0/+55
This adds support for IORING_OP_MSG_RING, which allows an SQE to signal another ring. That allows either waking up someone waiting on the ring, or even passing a 64-bit value via the user_data field in the CQE. sqe->fd must contain the fd of a ring that should receive the CQE. sqe->off will be propagated to the cqe->user_data on the target ring, and sqe->len will be propagated to cqe->res. The resulting CQE will have IORING_CQE_F_MSG set in its flags, to indicate that this CQE was generated from a messaging request rather than a SQE issued locally on that ring. This effectively allows passing a 64-bit and a 32-bit quantity between the two rings.

This request type has the following request specific error cases:
- -EBADFD. Set if the sqe->fd doesn't point to a file descriptor that is of the io_uring type.
- -EOVERFLOW. Set if we were not able to deliver a request to the target ring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
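A minimal userspace sketch of issuing one of these by filling the raw SQE fields as described (liburing's dedicated prep helper came later and is not assumed here):

  #include <liburing.h>
  #include <string.h>

  /* Signal 'target_fd' (another io_uring) with a user_data/res of our choice. */
  void queue_msg_ring(struct io_uring *ring, int target_fd)
  {
  	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

  	memset(sqe, 0, sizeof(*sqe));
  	sqe->opcode = IORING_OP_MSG_RING;
  	sqe->fd  = target_fd;	/* must be an io_uring instance */
  	sqe->off = 0xcafef00d;	/* lands in cqe->user_data on the target */
  	sqe->len = 42;		/* lands in cqe->res on the target */
  	/* The target's CQE carries IORING_CQE_F_MSG in its flags. */
  }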
2022-03-10 | io_uring: speedup provided buffer handling | Jens Axboe | 1 | -22/+117
In testing high frequency workloads with provided buffers, we spend a lot of time in allocating and freeing the buffer units themselves. Rather than repeatedly free and alloc them, add a recycling cache instead. There are two caches:

- ctx->io_buffers_cache. This is the one we grab from in the submission path, and it's protected by ctx->uring_lock. For inline completions, we can recycle straight back to this cache and not need any extra locking.

- ctx->io_buffers_comp. If we're not under uring_lock, then we use this list to recycle buffers. It's protected by the completion_lock.

On adding a new buffer, check io_buffers_cache. If it's empty, check if we can splice entries from the io_buffers_comp_cache. This reduces about 5-10% of overhead from provided buffers, bringing it pretty close to the non-provided path. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: add support for registering ring file descriptors | Jens Axboe | 1 | -5/+177
Lots of workloads use multiple threads, in which case the file table is shared between them. This makes getting and putting the ring file descriptor for each io_uring_enter(2) system call more expensive, as it involves an atomic get and put for each call. Similarly to how we allow registering normal file descriptors to avoid this overhead, add support for an io_uring_register(2) API that allows registering the ring fds themselves:

1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update structs, and registers them with the task.
2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_rsrc_update structs, and unregisters them.

When a ring fd is registered, it is internally represented by an offset. This offset is returned to the application, and the application then uses this offset and sets IORING_ENTER_REGISTERED_RING for the io_uring_enter(2) system call. This works just like using a registered file descriptor, rather than a real one, in an SQE, where IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal offset/descriptor rather than a real file descriptor.

In initial testing, this provides a nice bump in performance for threaded applications in real world cases where the batch count (eg number of requests submitted per io_uring_enter(2) invocation) is low. In a microbenchmark, submitting NOP requests, we see the following increases in performance:

Requests per syscall   Baseline   Registered   Increase
----------------------------------------------------------------
1                      ~7030K     ~8080K       +15%
2                      ~13120K    ~14800K      +13%
4                      ~22740K    ~25300K      +11%

Co-developed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
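A hedged userspace sketch of the registration flow using raw syscalls (assuming offset -1U asks the kernel to pick a free slot, which it writes back into the struct):

  #include <linux/io_uring.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int register_ring_fd(int ring_fd, unsigned int *out_offset)
  {
  	struct io_uring_rsrc_update up = {
  		.offset = -1U,			/* any free slot */
  		.data	= (__u64)ring_fd,
  	};
  	int ret = syscall(__NR_io_uring_register, ring_fd,
  			  IORING_REGISTER_RING_FDS, &up, 1);
  	if (ret < 0)
  		return ret;
  	*out_offset = up.offset;	/* pass this instead of ring_fd */
  	return 0;
  }

  /* Later: io_uring_enter(offset, to_submit, min_complete,
   *        flags | IORING_ENTER_REGISTERED_RING, sig) */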
2022-03-10 | io_uring: documentation fixup | Dylan Yudaken | 1 | -1/+1
Fix incorrect name reference in comment. ki_filp does not exist in the struct, but file does. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220224105157.1332353-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: do not recalculate ppos unnecessarily | Dylan Yudaken | 1 | -6/+12
There is a slight optimisation to be had by calculating the correct pos pointer inside io_kiocb_update_pos and then using that later. It seems code size drops by a bit:

000000000000a1b0 0000000000000400 t io_read
000000000000a5b0 0000000000000319 t io_write

vs

000000000000a1b0 00000000000003f6 t io_read
000000000000a5b0 0000000000000310 t io_write

Signed-off-by: Dylan Yudaken <dylany@fb.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: update kiocb->ki_pos at execution time | Dylan Yudaken | 1 | -8/+18
Update kiocb->ki_pos at execution time rather than in io_prep_rw(). io_prep_rw() happens before the job is enqueued to a worker and so the offset might be read multiple times before being executed once. Ensures that the file position in a set of _linked_ SQEs will be only obtained after earlier SQEs have completed, and so will include their incremented file position. Signed-off-by: Dylan Yudaken <dylany@fb.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: remove duplicated calls to io_kiocb_ppos | Dylan Yudaken | 1 | -2/+5
io_kiocb_ppos is called in both branches, and it seems that the compiler does not fuse this. Fusing removes a few bytes from loop_rw_iter.

Before:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 0000000000000124 t loop_rw_iter

After:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 000000000000010d t loop_rw_iter

Signed-off-by: Dylan Yudaken <dylany@fb.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: Remove unneeded test in io_run_task_work_sig() | Olivier Langlois | 1 | -3/+3
Avoid testing TIF_NOTIFY_SIGNAL twice by calling task_sigpending() directly from io_run_task_work_sig() Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/bd7c0495f7656e803e5736708591bb665e6eaacd.1645041650.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io-uring: Make tracepoints consistent. | Stefan Roesch | 1 | -11/+13
This makes the io-uring tracepoints consistent. Where it makes sense, the tracepoints start with the following four fields:
- context (ring)
- request
- user_data
- opcode

Signed-off-by: Stefan Roesch <shr@fb.com> Link: https://lore.kernel.org/r/20220214180430.70572-3-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io-uring: add __fill_cqe function | Stefan Roesch | 1 | -9/+13
This introduces the __fill_cqe function. This is necessary to correctly issue the io_uring_complete tracepoint. Signed-off-by: Stefan Roesch <shr@fb.com> Link: https://lore.kernel.org/r/20220214180430.70572-2-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io-wq: use IO_WQ_ACCT_NR rather than hardcoded number | Hao Xu | 1 | -2/+2
It's better to use the defined enum value rather than a hardcoded number to size the array. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220206095241.121485-4-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
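The pattern in question, sketched with the enum from io-wq:

  enum {
  	IO_WQ_ACCT_BOUND,
  	IO_WQ_ACCT_UNBOUND,
  	IO_WQ_ACCT_NR,
  };

  struct io_wqe {
  	/* Sized by the enum, so adding an accounting type can't
  	 * silently leave the array too small. */
  	struct io_wqe_acct acct[IO_WQ_ACCT_NR];
  	/* ... */
  };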
2022-03-10 | io-wq: reduce acct->lock crossing functions lock/unlock | Hao Xu | 1 | -20/+12
Reduce the acct->lock lock and unlock calls that cross function boundaries, to make the code clearer. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220206095241.121485-3-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io-wq: decouple work_list protection from the big wqe->lock | Hao Xu | 1 | -44/+52
wqe->lock is abused: it now protects acct->work_list, hash stuff, nr_workers, wqe->free_list and so on. Let's first get the work_list out of the wqe->lock mess by introducing a specific lock for the work list. This is the first step toward solving the huge contention between work insertion and work consumption.

Good things:
- split locking for bound and unbound work lists
- reduced contention between work_list access and the (worker's) free_list

For the hash stuff, since there won't be a work item with the same file in both the bound and unbound work lists, they won't visit the same hash entry, so it works well to use the new lock to protect the hash stuff.

Results: set max_unbound_worker = 4, test with echo-server:
nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16
(-n connection, -l workload)

before this patch:
Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
Overhead  Command          Shared Object     Symbol
  28.59%  iou-wrk-10021    [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
   8.89%  io_uring_echo_s  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
   6.20%  iou-wrk-10021    [kernel.vmlinux]  [k] _raw_spin_lock
   2.45%  io_uring_echo_s  [kernel.vmlinux]  [k] io_prep_async_work
   2.36%  iou-wrk-10021    [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
   2.29%  iou-wrk-10021    [kernel.vmlinux]  [k] io_worker_handle_work
   1.29%  io_uring_echo_s  [kernel.vmlinux]  [k] io_wqe_enqueue
   1.06%  iou-wrk-10021    [kernel.vmlinux]  [k] io_wqe_worker
   1.06%  io_uring_echo_s  [kernel.vmlinux]  [k] _raw_spin_lock
   1.03%  iou-wrk-10021    [kernel.vmlinux]  [k] __schedule
   0.99%  iou-wrk-10021    [kernel.vmlinux]  [k] tcp_sendmsg_locked

with this patch:
Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
Overhead  Command          Shared Object     Symbol
  16.86%  iou-wrk-10893    [kernel.vmlinux]  [k] native_queued_spin_lock_slowpat
   9.10%  iou-wrk-10893    [kernel.vmlinux]  [k] _raw_spin_lock
   4.53%  io_uring_echo_s  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpat
   2.87%  iou-wrk-10893    [kernel.vmlinux]  [k] io_worker_handle_work
   2.57%  iou-wrk-10893    [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
   2.56%  io_uring_echo_s  [kernel.vmlinux]  [k] io_prep_async_work
   1.82%  io_uring_echo_s  [kernel.vmlinux]  [k] _raw_spin_lock
   1.33%  iou-wrk-10893    [kernel.vmlinux]  [k] io_wqe_worker
   1.26%  io_uring_echo_s  [kernel.vmlinux]  [k] try_to_wake_up

spin_lock failure from 25.59% + 8.89% = 34.48% to 16.86% + 4.53% = 21.39%
TPS is similar, while CPU usage drops from almost 400% to 350%.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220206095241.121485-2-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: Fix use of uninitialized ret in io_eventfd_register() | Nathan Chancellor | 1 | -3/+3
Clang warns:

fs/io_uring.c:9396:9: warning: variable 'ret' is uninitialized when used here [-Wuninitialized]
        return ret;
               ^~~
fs/io_uring.c:9373:13: note: initialize the variable 'ret' to silence this warning
        int fd, ret;
                  ^
                   = 0
1 warning generated.

Just return 0 directly and reduce the scope of ret to the if statement, as that is the only place that it is used, which is how the function was before the fixes commit. Fixes: 1a75fac9a0f9 ("io_uring: avoid ring quiesce while registering/unregistering eventfd") Link: https://github.com/ClangBuiltLinux/linux/issues/1579 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20220207162410.1013466-1-nathan@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: remove ring quiesce for io_uring_register | Usama Arif | 1 | -83/+0
None of the opcodes in io_uring_register use ring quiesce anymore. Hence io_register_op_must_quiesce always returns false and io_ctx_quiesce is never called. Signed-off-by: Usama Arif <usama.arif@bytedance.com> Link: https://lore.kernel.org/r/20220204145117.1186568-6-usama.arif@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: avoid ring quiesce while registering restrictions and enabling rings | Usama Arif | 1 | -0/+2
IORING_SETUP_R_DISABLED prevents submitting requests and so there will be no requests until IORING_REGISTER_ENABLE_RINGS is called. And IORING_REGISTER_RESTRICTIONS works only before IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not needed for these opcodes. Signed-off-by: Usama Arif <usama.arif@bytedance.com> Link: https://lore.kernel.org/r/20220204145117.1186568-5-usama.arif@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 | io_uring: avoid ring quiesce while registering async eventfd | Usama Arif | 1 | -10/+12
This is done using the RCU data structure (io_ev_fd). eventfd_async is moved from io_ring_ctx to io_ev_fd which is RCU protected hence avoiding ring quiesce which is much more expensive than an RCU lock. The place where eventfd_async is read is already under rcu_read_lock so there is no extra RCU read-side critical section needed. Signed-off-by: Usama Arif <usama.arif@bytedance.com> Link: https://lore.kernel.org/r/20220204145117.1186568-4-usama.arif@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
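A sketch of the RCU read side this enables (struct and field names taken from the description above, details assumed):

  static void io_eventfd_signal(struct io_ring_ctx *ctx)
  {
  	struct io_ev_fd *ev_fd;

  	rcu_read_lock();
  	/* Pairs with rcu_assign_pointer() at register time; reading
  	 * eventfd_async safely no longer requires a ring quiesce. */
  	ev_fd = rcu_dereference(ctx->io_ev_fd);
  	if (ev_fd && (!ev_fd->eventfd_async || io_wq_current_is_worker()))
  		eventfd_signal(ev_fd->cq_ev_fd, 1);
  	rcu_read_unlock();
  }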