summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2021-04-12io_uring: move rsrc_put callback into io_rsrc_dataPavel Begunkov1-17/+14
io_rsrc_node's callback operates only on a single io_rsrc_data and only with its resources, so rsrc_put() callback is actually a property of io_rsrc_data. Move it there, it makes code much nicecr. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9417c2fba3c09e8668f05747006a603d416d34b4.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: encapsulate rsrc node manipulationsPavel Begunkov1-26/+13
io_rsrc_node_get() and io_rsrc_node_set() are always used together, merge them into one so most users don't even see io_rsrc_node and don't need to care about it. It helped to catch io_sqe_files_register() inferring rsrc data argument for get and set differently, not a problem but a good sign. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0827b080b2e61b3dec795380f7e1a1995595d41f.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: use rsrc prealloc infra for files regPavel Begunkov1-15/+6
Keep it consistent with update and use io_rsrc_node_prealloc() + io_rsrc_node_get() in io_sqe_files_register() as well, that will be used in future patches, not as error prone and allows to deduplicate rsrc_node init. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cf87321e6be5e38f4dc7fe5079d2aa6945b1ace0.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: simplify io_rsrc_node_ref_zeroPavel Begunkov1-8/+4
Replace queue_delayed_work() with mod_delayed_work() in io_rsrc_node_ref_zero() as the later one can schedule a new work, and cleanup it further for better readability. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3b2b23e3a1ea4bbf789cd61815d33e05d9ff945e.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: name rsrc bits consistentlyPavel Begunkov1-79/+71
Keep resource related structs' and functions' naming consistent, in particular use "io_rsrc" prefix for everything. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/962f5acdf810f3a62831e65da3932cde24f6d9df.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io-wq: cancel task_work on exit only targeting the current 'wq'Jens Axboe1-1/+11
With using task_work_cancel(), we're potentially canceling task_work that isn't related to this specific io_wq. Use the newly added task_work_cancel_match() to ensure that we only remove and cancel work items that are specific to this io_wq. Fixes: 685fe7feedb9 ("io-wq: eliminate the need for a manager thread") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: fix race around poll update and poll triggeringJens Axboe1-8/+10
Joakim reports that in some conditions he sees a multishot poll request being canceled, and that it coincides with getting -EALREADY on modification. As part of the poll update procedure, there's a small window where the request is marked as canceled, and if this coincides with the event actually triggering, then we can get a spurious -ECANCELED and termination of the multishot request. Don't mark the poll request as being canceled for update. We also don't care if we race on removal unless it's a one-shot request, we can safely updated for either case. Fixes: b69de288e913 ("io_uring: allow events and user_data update of running poll requests") Reported-by: Joakim Hassila <joj@mac.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: reg buffer overflow checks hardeningPavel Begunkov1-0/+5
We are safe with overflows in io_sqe_buffer_register() because it will just yield alloc failure, but it's nicer to check explicitly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2b0625551be3d97b80a5fd21c8cd79dc1c91f0b5.1616624589.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICEJens Axboe1-4/+0
Now that we have any worker being attached to the original task as threads, accounting of CPU time is directly attributed to the original task as well. This means that we no longer have to restrict SQPOLL to needing elevated privileges, as it's really no different from just having the task spawn a busy looping thread in userspace. Reported-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io-wq: eliminate the need for a manager threadJens Axboe3-159/+120
io-wq relies on a manager thread to create/fork new workers, as needed. But there's really no strong need for it anymore. We have the following cases that fork a new worker: 1) Work queue. This is done from the task itself always, and it's trivial to create a worker off that path, if needed. 2) All workers have gone to sleep, and we have more work. This is called off the sched out path. For this case, use a task_work items to queue a fork-worker operation. 3) Hashed work completion. Don't think we need to do anything off this case. If need be, it could just use approach 2 as well. Part of this change is incrementing the running worker count before the fork, to avoid cases where we observe we need a worker and then queue creation of one. Then new work comes in, we fork a new one. That last queue operation should have waited for the previous worker to come up, it's quite possible we don't even need it. Hence move the worker running from before we fork it off to more efficiently handle that case. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: allow events and user_data update of running poll requestsJens Axboe1-8/+87
This adds two new POLL_ADD flags, IORING_POLL_UPDATE_EVENTS and IORING_POLL_UPDATE_USER_DATA. As with the other POLL_ADD flag, these are masked into sqe->len. If set, the POLL_ADD will have the following behavior: - sqe->addr must contain the the user_data of the poll request that needs to be modified. This field is otherwise invalid for a POLL_ADD command. - If IORING_POLL_UPDATE_EVENTS is set, sqe->poll_events must contain the new mask for the existing poll request. There are no checks for whether these are identical or not, if a matching poll request is found, then it is re-armed with the new mask. - If IORING_POLL_UPDATE_USER_DATA is set, sqe->off must contain the new user_data for the existing poll request. A POLL_ADD with any of these flags set may complete with any of the following results: 1) 0, which means that we successfully found the existing poll request specified, and performed the re-arm procedure. Any error from that re-arm will be exposed as a completion event for that original poll request, not for the update request. 2) -ENOENT, if no existing poll request was found with the given user_data. 3) -EALREADY, if the existing poll request was already in the process of being removed/canceled/completing. 4) -EACCES, if an attempt was made to modify an internal poll request (eg not one originally issued ass IORING_OP_POLL_ADD). The usual -EINVAL cases apply as well, if any invalid fields are set in the sqe for this command type. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: abstract out a io_poll_find_helper()Jens Axboe1-5/+16
We'll need this helper for another purpose, for now just abstract it out and have io_poll_cancel() use it for lookups. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: terminate multishot poll for CQ ring overflowJens Axboe1-8/+12
If we hit overflow and fail to allocate an overflow entry for the completion, terminate the multishot poll mode. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: abstract out helper for removing poll waitqs/hashesJens Axboe1-1/+9
No functional changes in this patch, just preparation for kill multishot poll on CQ overflow. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: add multishot mode for IORING_OP_POLL_ADDJens Axboe1-18/+46
The default io_uring poll mode is one-shot, where once the event triggers, the poll command is completed and won't trigger any further events. If we're doing repeated polling on the same file or socket, then it can be more efficient to do multishot, where we keep triggering whenever the event becomes true. This deviates from the usual norm of having one CQE per SQE submitted. Add a CQE flag, IORING_CQE_F_MORE, which tells the application to expect further completion events from the submitted SQE. Right now the only user of this is POLL_ADD in multishot mode. Since sqe->poll_events is using the space that we normally use for adding flags to commands, use sqe->len for the flag space for POLL_ADD. Multishot mode is selected by setting IORING_POLL_ADD_MULTI in sqe->len. An application should expect more CQEs for the specificed SQE if the CQE is flagged with IORING_CQE_F_MORE. In multishot mode, only cancelation or an error will terminate the poll request, in which case the flag will be cleared. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: include cflags in completion trace eventJens Axboe1-1/+1
We should be including the completion flags for better introspection on exactly what completion event was logged. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: allocate memory for overflowed CQEsPavel Begunkov1-55/+46
Instead of using a request itself for overflowed CQE stashing, allocate a separate entry. The disadvantage is that the allocation may fail and it will be accounted as lost (see rings->cq_overflow), so we lose reliability in case of memory pressure if the application is driving the CQ ring into overflow. However, it opens a way for for multiple CQEs per an SQE and even generating SQE-less CQEs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: use GFP_ATOMIC | __GFP_ACCOUNT] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: mask in error/nval/hangup consistently for pollJens Axboe1-3/+4
Instead of masking these in as part of regular POLL_ADD prep, do it in io_init_poll_iocb(), and include NVAL as that's generally unmaskable, and RDHUP alongside the HUP that is already set. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: optimise rw complete error handlingPavel Begunkov1-15/+18
Expect read/write to succeed and create a hot path for this case, in particular hide all error handling with resubmission under a single check with the desired result. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: hide iter revert in resubmit_prepPavel Begunkov1-9/+8
Move iov_iter_revert() resetting iterator in case of -EIOCBQUEUED into io_resubmit_prep(), so we don't do heavy revert in hot path, also saves a couple of checks. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: don't alter iopoll reissue fail ret codePavel Begunkov1-25/+16
When reissue_prep failed in io_complete_rw_iopoll(), we change return code to -EIO to prevent io_iopoll_complete() from doing resubmission. Mark requests with a new flag (i.e. REQ_F_DONT_REISSUE) instead and retain the original return value. It also removes io_rw_reissue() from io_iopoll_complete() that will be used later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: optimise kiocb_end_write for !ISREGPavel Begunkov1-3/+3
file_end_write() is only for regular files, so the function do a couple of dereferences to get inode and check for it. However, we already have REQ_F_ISREG at hand, just use it and inline file_end_write(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: kill unused REQ_F_NO_FILE_TABLEPavel Begunkov1-8/+1
current->files are always valid now even for io-wq threads, so kill not used anymore REQ_F_NO_FILE_TABLE. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: don't init req->work fully in advancePavel Begunkov1-11/+14
req->work is mostly unused unless it's punted, and io_init_req() is too hot for fully initialising it. Fortunately, we can skip init work.next as it's controlled by io-wq, and can not touch work.flags by moving everything related into io_prep_async_work(). The only field left is req->work.creds, but there is nothing can be done, keep maintaining it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io-wq: refactor *_get_acct()Pavel Begunkov1-10/+7
Extract a helper for io_work_get_acct() and io_wqe_get_acct() to avoid duplication. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: remove tctx->sqpollPavel Begunkov1-1/+0
struct io_uring_task::sqpoll is not used anymore, kill it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: don't do extra EXITING cancellationsPavel Begunkov1-5/+1
io_match_task() matches all requests with PF_EXITING task, even though those may be valid requests. It was necessary for SQPOLL cancellation, but now it kills all requests before exiting via io_uring_cancel_sqpoll(), so it's not needed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: don't clear REQ_F_LINK_TIMEOUTPavel Begunkov1-4/+2
REQ_F_LINK_TIMEOUT is a hint that to look for linked timeouts to cancel, we're leaving it even when it's already fired. Hence don't care to clear it in io_kill_linked_timeout(), it's safe and is called only once. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: optimise io_req_task_work_add()Pavel Begunkov1-32/+18
Inline io_task_work_add() into io_req_task_work_add(). They both work with a request, so keeping them separate doesn't make things much more clear, but merging allows optimise it. Apart from small wins like not reading req->ctx or not calculating @notify in the hot path, i.e. with tctx->task_state set, it avoids doing wake_up_process() for every single add, but only after actually done task_work_add(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: abolish old io_put_file()Pavel Begunkov1-9/+10
io_put_file() doesn't do a good job at generating a good code. Inline it, so we can check REQ_F_FIXED_FILE first, prioritising FIXED_FILE case over requests without files, and saving a memory load in that case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: optimise io_dismantle_req() fast pathPavel Begunkov1-15/+19
Reshuffle io_dismantle_req() checks to put most of slow path stuff under a single if. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: inline io_clean_op()'s fast pathPavel Begunkov1-12/+9
Inline io_clean_op(), leaving __io_clean_op() but renaming it. This will be used in following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: remove __io_req_task_cancel()Pavel Begunkov1-25/+3
Both io_req_complete_failed() and __io_req_task_cancel() do the same thing: set failure flag, put both req refs and emit an CQE. The former one is a bit more advance as it puts req back into a req cache, so make it to take over __io_req_task_cancel() and remove the last one. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: add helper flushing locked_free_listPavel Begunkov1-12/+12
Add a new helper io_flush_cached_locked_reqs() that splices locked_free_list to free_list, and does it right doing all sync and invariant reinit. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: refactor io_free_req_deferred()Pavel Begunkov1-4/+1
We don't care about ret value in io_free_req_deferred(), make the code a bit more concise. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: inline io_put_req and friendsPavel Begunkov1-2/+2
One big omission is that io_put_req() haven't been marked inline, and at least gcc 9 doesn't inline it, not to mention that it's really hot and extra function call is intolerable, especially when it doesn't put a final ref. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: refactor rsrc refnode allocationPavel Begunkov1-19/+39
There are two problems: 1) we always allocate refnodes in advance and free them if those haven't been used. It's expensive, takes two allocations, where one of them is percpu. And it may be pretty common not actually using them. 2) Current API with allocating a refnode and setting some of the fields is error prone, we don't ever want to have a file node runninng fixed buffer callback... Solve both with pre-init/get API. Pre-init just leaves the node for later if not used, and for get (i.e. io_rsrc_refnode_get()), you need to explicitly pass all arguments setting callbacks/etc., so it's more resilient. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: refactor io_flush_cached_reqs()Pavel Begunkov1-6/+10
Emphasize that return value of io_flush_cached_reqs() depends on number of requests in the cache. It looks nicer and might help tools from false-negative analyses. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: optimise success case of __io_queue_sqePavel Begunkov1-9/+9
Move the case of successfully issued request by doing that check first. It's not much of a difference, just generates slightly better code for me. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: inline __io_queue_linked_timeout()Pavel Begunkov1-11/+4
Inline __io_queue_linked_timeout(), we don't need it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: keep io_req_free_batch() call localityPavel Begunkov1-1/+1
Don't do a function call (io_dismantle_req()) in the middle and place it to near other function calls, otherwise may lead to excessive register spilling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: optimise tctx node checks/allocPavel Begunkov1-24/+29
First of all, w need to set tctx->sqpoll only when we add a new entry into ->xa, so move it from the hot path. Also extract a hot path for io_uring_add_task_file() as an inline helper. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: optimise io_uring_enter()Pavel Begunkov1-7/+7
Add unlikely annotations, because my compiler pretty much mispredicts every first check, and apart jumping around in the fast path, it also generates extra instructions, like in advance setting ret value. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: don't take ctx refs in task_work handlerPavel Begunkov1-5/+0
__tctx_task_work() guarantees that ctx won't be killed while running task_works, so we can remove now unnecessary ctx pinning for internally armed polling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: transform ret == 0 for poll cancelation completionsJens Axboe1-0/+3
We can set canceled == true and complete out-of-line, ensure that we catch that and correctly return -ECANCELED if the poll operation got canceled. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: correct comment on poll vs iopollJens Axboe1-1/+1
The correct function is io_iopoll_complete(), which deals with completions of IOPOLL requests, not io_poll_complete(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: cache async and regular file state for fixed filesJens Axboe1-13/+57
We have to dig quite deep to check for particularly whether or not a file supports a fast-path nonblock attempt. For fixed files, we can do this lookup once and cache the state instead. This adds two new bits to track whether we support async read/write attempt, and lines up the REQ_F_ISREG bit with those two. The file slot re-uses the last 3 (or 2, for 32-bit) of the file pointer to cache that state, and then we mask it in when we go and use a fixed file. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: don't check for io_uring_fops for fixed filesJens Axboe1-2/+4
We don't allow them at registration time, so limit the check for needing inflight tracking in io_file_get() to the non-fixed path. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: simplify io_sqd_update_thread_idle()Pavel Begunkov1-5/+2
Use a more comprehensible() max instead of hand coding it with ifs in io_sqd_update_thread_idle(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12io_uring: switch to atomic_t for io_kiocb reference countJens Axboe1-7/+17
io_uring manipulates references twice for each request, and hence is very sensitive to performance of the reference count. This commit borrows a trick from: commit f958d7b528b1b40c44cfda5eabe2d82760d868c3 Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu Apr 11 10:06:20 2019 -0700 mm: make page ref count overflow check tighter and more explicit and switches to atomic_t for references, while still retaining overflow and underflow checks. This is good for a 2-3% increase in peak IOPS on a single core. Before: IOPS=2970879, IOS/call=31/31, inflight=128 (128) IOPS=2952597, IOS/call=31/31, inflight=128 (128) IOPS=2943904, IOS/call=31/31, inflight=128 (128) IOPS=2930006, IOS/call=31/31, inflight=96 (96) and after: IOPS=3054354, IOS/call=31/31, inflight=128 (128) IOPS=3059038, IOS/call=31/31, inflight=128 (128) IOPS=3060320, IOS/call=31/31, inflight=128 (128) IOPS=3068256, IOS/call=31/31, inflight=96 (96) Signed-off-by: Jens Axboe <axboe@kernel.dk>