Age | Commit message (Collapse) | Author | Files | Lines |
|
Some file systems, ocfs2 in this case, will return -EOPNOTSUPP for
an IOCB_NOWAIT read/write attempt. While this can be argued to be
correct, the usual return value for something that requires blocking
issue is -EAGAIN.
A refactoring io_uring commit dropped calling kiocb_done() for
negative return values, which is otherwise where we already do that
transformation. To ensure we catch it in both spots, check it in
__io_read() itself as well.
Reported-by: Robert Sander <r.sander@heinlein-support.de>
Link: https://fosstodon.org/@gurubert@mastodon.gurubert.de/113112431889638440
Cc: stable@vger.kernel.org
Fixes: a08d195b586a ("io_uring/rw: split io_read() into a helper")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Port req_cachep to struct kmem_cache_args.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
The submit queue polling threads are userland threads that just never
exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF,
the affinity of the poller thread is set to the cpu specified in
sq_thread_cpu. However, this CPU can be outside of the cpuset defined
by the cgroup cpuset controller. This violates the rules defined by the
cpuset controller and is a potential issue for realtime applications.
In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in
case no explicit pinning is required by inheriting the one of the
creating task. In case of explicit pinning, the check is more
complicated, as also a cpu outside of the parent cpumask is allowed.
We implemented this by using cpuset_cpus_allowed (that has support for
cgroup cpusets) and testing if the requested cpu is in the set.
Fixes: 37d1e2e3642e ("io_uring: move SQPOLL thread io-wq forked worker")
Cc: stable@vger.kernel.org # 6.1+
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
Link: https://lore.kernel.org/r/20240909150036.55921-1-felix.moessbauer@siemens.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
atomic_t for the struct io_ev_fd references and there are no issues with
it. While the ref getting and putting for the eventfd code is somewhat
performance critical for cases where eventfd signaling is used (news
flash, you should not...), it probably doesn't warrant using an atomic_t
for this. Let's just move to it to refcount_t to get the added
protection of over/underflows.
Link: https://lore.kernel.org/lkml/202409082039.hnsaIJ3X-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409082039.hnsaIJ3X-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
rsrc_put_fn is declared but never used, remove it.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/20240902062134.136387-3-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fixes checkpatch warning
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/20240902062134.136387-2-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If GCOV is enabled and this option is set, it enables code coverage
profiling of the io_uring subsystem. Only use this for test purposes,
as it will impact the runtime performance.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_provided_buffers_select() returns 0 to indicate success, but it should
be returning 1 to indicate that 1 vec was mapped. This causes peeking
to fail with classic provided buffers, and while that's not a use case
that anyone should use, it should still work correctly.
The end result is that no buffer will be selected, and hence a completion
with '0' as the result will be posted, without a buffer attached.
Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For buffer registration (or updates), a userspace iovec is copied in
and updated. If the application is within a compat syscall, then the
iovec type is compat_iovec rather than iovec. However, the type used
in __io_sqe_buffers_update() and io_sqe_buffers_register() is always
struct iovec, and hence the source is incremented by the size of a
non-compat iovec in the loop. This misses every other iovec in the
source, and will run into garbage half way through the copies and
return -EFAULT to the application.
Maintain the source address separately and assign to our user vec
pointer, so that copies always happen from the right source address.
While in there, correct a bad placement of __user which triggered
the following sparse warning prior to this fix:
io_uring/rsrc.c:981:33: warning: cast removes address space '__user' of expression
io_uring/rsrc.c:981:30: warning: incorrect type in assignment (different address spaces)
io_uring/rsrc.c:981:30: expected struct iovec const [noderef] __user *uvec
io_uring/rsrc.c:981:30: got struct iovec *[noderef] __user
Fixes: f4eaf8eda89e ("io_uring/rsrc: Drop io_copy_iov in favor of iovec API")
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
By default, any recv/read operation that uses provided buffers will
consume at least 1 buffer fully (and maybe more, in case of bundles).
This adds support for incremental consumption, meaning that an
application may add large buffers, and each read/recv will just consume
the part of the buffer that it needs.
For example, let's say an application registers 1MB buffers in a
provided buffer ring, for streaming receives. If it gets a short recv,
then the full 1MB buffer will be consumed and passed back to the
application. With incremental consumption, only the part that was
actually used is consumed, and the buffer remains the current one.
This means that both the application and the kernel needs to keep track
of what the current receive point is. Each recv will still pass back a
buffer ID and the size consumed, the only difference is that before the
next receive would always be the next buffer in the ring. Now the same
buffer ID may return multiple receives, each at an offset into that
buffer from where the previous receive left off. Example:
Application registers a provided buffer ring, and adds two 32K buffers
to the ring.
Buffer1 address: 0x1000000 (buffer ID 0)
Buffer2 address: 0x2000000 (buffer ID 1)
A recv completion is received with the following values:
cqe->res 0x1000 (4k bytes received)
cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)
and the application now knows that 4096b of data is available at
0x1000000, the start of that buffer, and that more data from this buffer
will be coming. Now the next receive comes in:
cqe->res 0x2010 (8k bytes received)
cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)
which tells the application that 8k is available where the last
completion left off, at 0x1001000. Next completion is:
cqe->res 0x5000 (20k bytes received)
cqe->flags 0x1 (CQE_F_BUFFER set, buffer ID 0)
and the application now knows that 20k of data is available at
0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE
isn't set, as no more data is available in this buffer ID. The next
completion is then:
cqe->res 0x1000 (4k bytes received)
cqe->flags 0x10001 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 1)
which tells the application that buffer ID 1 is now the current one,
hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next
receive point for this buffer ID.
When a buffer will be reused by future CQE completions,
IORING_CQE_BUF_MORE will be set in cqe->flags. This tells the application
that the kernel isn't done with the buffer yet, and that it should expect
more completions for this buffer ID. Will only be set by provided buffer
rings setup with IOU_PBUF_RING INC, as that's the only type of buffer
that will see multiple consecutive completions for the same buffer ID.
For any other provided buffer type, any completion that passes back
a buffer to the application is final.
Once a buffer has been fully consumed, the buffer ring head is
incremented and the next receive will indicate the next buffer ID in the
CQE cflags.
On the send side, the application can manage how much data is sent from
an existing buffer by setting sqe->len to the desired send length.
An application can request incremental consumption by setting
IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of
that, any provided buffer ring setup and buffer additions is done like
before, no changes there. The only change is in how an application may
see multiple completions for the same buffer ID, hence needing to know
where the next receive will happen.
Note that like existing provided buffer rings, this should not be used
with IOSQE_ASYNC, as both really require the ring to remain locked over
the duration of the buffer selection and the operation completion. It
will consume a buffer otherwise regardless of the size of the IO done.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for needing the consumed length, pass in the length being
completed. Unused right now, but will be used when it is possible to
partially consume a buffer.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit 79996b45f7b28c0e3e08a95bab80119e95317e28.
Revert the change that restricts a send provided buffer to be zero, so
it will always consume the whole buffer. This is strictly needed for
partial consumption, as the send may very well be a subset of the
current buffer. In fact, that's the intended use case.
For non-incremental provided buffer rings, an application should set
sqe->len carefully to avoid the potential issue described in the
reverted commit. It is recommended that '0' still be set for len for
that case, if the application is set on maintaining more than 1 send
inflight for the same socket. This is somewhat of a nonsensical thing
to do.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for using this helper in kbuf.h as well, move it there and
turn it into a macro.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Committing the selected ring buffer is currently done in three different
spots, combine it into a helper and just call that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
nr_iovs is capped at 1024, and mode only has a few low values. We can
safely make them u16, in preparation for adding a few more members.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Expose min_wait_usec in io_uring_getevents_arg, replacing the pad member
that is currently in there. The value is in usecs, which is explained in
the name as well.
Note that if min_wait_usec and a normal timeout is used in conjunction,
the normal timeout is still relative to the base time. For example, if
min_wait_usec is set to 100 and the normal timeout is 1000, the max
total time waited is still 1000. This also means that if the normal
timeout is shorter than min_wait_usec, then only the min_wait_usec will
take effect.
See previous commit for an explanation of how this works.
IORING_FEAT_MIN_TIMEOUT is added as a feature flag for this, as
applications doing submit_and_wait_timeout() style operations will
generally not see the -EINVAL from the wait side as they return the
number of IOs submitted. Only if no IOs are submitted will the -EINVAL
bubble back up to the application.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Waiting for events with io_uring has two knobs that can be set:
1) The number of events to wake for
2) The timeout associated with the event
Waiting will abort when either of those conditions are met, as expected.
This adds support for a third event, which is associated with the number
of events to wait for. Applications generally like to handle batches of
completions, and right now they'd set a number of events to wait for and
the timeout for that. If no events have been received but the timeout
triggers, control is returned to the application and it can wait again.
However, if the application doesn't have anything to do until events are
reaped, then it's possible to make this waiting more efficient.
For example, the application may have a latency time of 50 usecs and
wanting to handle a batch of 8 requests at the time. If it uses 50 usecs
as the timeout, then it'll be doing 20K context switches per second even
if nothing is happening.
This introduces the notion of min batch wait time. If the min batch wait
time expires, then we'll return to userspace if we have any events at all.
If none are available, the general wait time is applied. Any request
arriving after the min batch wait time will cause waiting to stop and
return control to the application.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for having two distinct timeouts and avoid waking the
task if we don't need to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for expanding how we handle waits, move the actual
schedule and schedule_timeout() handling into a helper.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Rather than need to pass in 2 or 3 separate arguments, add a struct
to encapsulate the timeout and sigset_t parts of waiting. In preparation
for adding another argument for waiting.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a new registration opcode IORING_REGISTER_CLOCK, which allows the
user to select which clock id it wants to use with CQ waiting timeouts.
It only allows a subset of all posix clocks and currently supports
CLOCK_MONOTONIC and CLOCK_BOOTTIME.
Suggested-by: Lewis Baker <lewissbaker@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/98f2bc8a3c36cdf8f0e6a275245e81e903459703.1723039801.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In addition to current relative timeouts for the waiting loop, where the
timespec argument specifies the maximum time it can wait for, add
support for the absolute mode, with the value carrying a CLOCK_MONOTONIC
absolute time until which we should return control back to the user.
Suggested-by: Lewis Baker <lewissbaker@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d5b74d67ada882590b2e42aa3aa7117bbf6b55f.1723039801.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove io_napi_adjust_timeout() and move the adjustments out of the
common path into __io_napi_busy_loop(). Now the limit it's calculated
based on struct io_wait_queue::timeout, for which we query current time
another time. The overhead shouldn't be a problem, it's a polling path,
however that can be optimised later by additionally saving the delta
time value in io_cqring_wait().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88e14686e245b3b42ff90a3c4d70895d48676206.1723039801.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
we don't need to set ->napi_prefer_busy_poll if we're not going to poll,
do the checks first and all polling preparation after.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2ad7ede8cc7905328fc62e8c3805fdb11635ae0b.1723039801.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We could just move these two and save some space, but in preparation
for adding another flag, turn them into flags first.
This saves 8 bytes in struct io_buffer_list, making it exactly half
a cacheline on 64-bit archs now rather than 40 bytes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just like what is being done on the recv side, if we only map a single
segment, then use ITER_UBUF for mapping it. That's more efficient than
using an ITER_IOVEC.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
req->buf_list is assigned higher up and is safe to use as we remain
within a locked region, as is the 'bl' variable itself from which it
was assigned. To improve readability, use 'bl' directly rather than
get it from the io_kiocb, if we need to increment the head directly
in the buffer selection path. This makes it readily apparent that
it's the same io_buffer_list being used.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
reverse the order of the element evaluation in an if statement.
for many users that are not using iopoll, the iopoll_list will always
evaluate to false after having made a memory access whereas to_submit is
very likely already loaded in a register.
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/052ca60b5c49e7439e4b8bd33bfab4a09d36d3d6.1722374371.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add support for checking and coalescing multi-hugepage-backed fixed
buffers. The coalescing optimizes both time and space consumption caused
by mapping and storing multi-hugepage fixed buffers.
A coalescable multi-hugepage buffer should fully cover its folios
(except potentially the first and last one), and these folios should
have the same size. These requirements are for easier processing later,
also we need same size'd chunks in io_import_fixed for fast iov_iter
adjust.
Signed-off-by: Chenliang Li <cliang01.li@samsung.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20240731090133.4106-3-cliang01.li@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Store the folio shift and folio mask into imu struct and use it in
iov_iter adjust, as we will have non PAGE_SIZE'd chunks if a
multi-hugepage buffer get coalesced.
Signed-off-by: Chenliang Li <cliang01.li@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20240731090133.4106-2-cliang01.li@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This info may be useful when attempting to debug a problem involving a
ring using the NAPI feature.
Here is an example of the output:
ip-172-31-39-89 /proc/772/fdinfo # cat 14
pos: 0
flags: 02000002
mnt_id: 16
ino: 10243
SqMask: 0xff
SqHead: 633
SqTail: 633
CachedSqHead: 633
CqMask: 0x3fff
CqHead: 430250
CqTail: 430250
CachedCqTail: 430250
SQEs: 0
CQEs: 0
SqThread: 885
SqThreadCpu: 0
SqTotalTime: 52793826
SqWorkTime: 3590465
UserFiles: 0
UserBufs: 0
PollList:
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=6, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=6, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
op=10, task_works=0
CqOverflowList:
NAPI: enabled
napi_busy_poll_to: 1
napi_prefer_busy_poll: true
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bb184f8b62703ddd3e6e19eae7ab6c67b97e1e10.1722293317.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Harden the buffer peeking a bit, by adding a sanity check for it having
a valid size. Outside of that, arg->max_len is a size_t, though it's
only ever set to a 32-bit value (as it's governed by MAX_RW_COUNT).
Bump our needed check to a size_t so we know it fits. Finally, cap the
calculated needed iov value to the PEEK_MAX_IMPORT, which is the
maximum number of segments that should be peeked.
Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There's a debug check in io_sq_thread_park() checking if it's the SQPOLL
thread itself calling park. KCSAN warns about this, as we should not be
reading sqd->thread outside of sqd->lock.
Just silence this with data_race(). The pointer isn't used for anything
but this debug check.
Reported-by: syzbot+2b946a3fd80caf971b21@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
io_napi_entry() has 2 calling sites. One of them is unlikely to find an
entry and if it does, the timeout should arguable not be updated.
The other io_napi_entry() calling site is overwriting the update made
by io_napi_entry() so the io_napi_entry() timeout value update has no or
little value and therefore is removed.
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/145b54ff179f87609e20dffaf5563c07cdbcad1a.1723423275.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
doing so avoids the overhead of adding napi ids to all the rings that do
not enable napi.
if no id is added to napi_list because napi is disabled,
__io_napi_busy_loop() will not be called.
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Fixes: b4ccc4dd1330 ("io_uring/napi: enable even with a timeout of 0")
Link: https://lore.kernel.org/r/bd989ccef5fda14f5fd9888faf4fefcf66bd0369.1723400131.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If a send is issued marked with IOSQE_BUFFER_SELECT for selecting a
buffer, unless it's a bundle, it should not select multiple buffers.
Cc: stable@vger.kernel.org
Fixes: a05d1f625c7a ("io_uring/net: support bundles for send")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If the iovec inside the kmsg isn't already allocated AND one gets
expanded beyond the fixed size, then the request may not already have
been marked for cleanup. Ensure that it is.
Cc: stable@vger.kernel.org
Fixes: a05d1f625c7a ("io_uring/net: support bundles for send")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If the iovec inside the kmsg isn't already allocated AND one gets
expanded beyond the fixed size, then the request may not already have
been marked for cleanup. Ensure that it is.
Cc: stable@vger.kernel.org
Fixes: 2f9c9515bdfd ("io_uring/net: support bundles for recv")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
These lists are unused, remove them.
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0a0ae3e955aed0f3e3d29882fb3d3cb575e0009b.1722294947.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This refresh statement was originally present in the original patch:
https://lore.kernel.org/netdev/20221121191437.996297-2-shr@devkernel.io/
It has been removed with no explanation in v6:
https://lore.kernel.org/netdev/20230201222254.744422-2-shr@devkernel.io/
It is important to make the refresh for multishot requests, because if no
new requests using the same NAPI device are added to the ring, the entry
will become stale and be removed silently. The unsuspecting user will
not know that their ring had busy polling for only 60 seconds before
being pruned.
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: 8d0c12a80cdeb ("io-uring: add napi busy poll support")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/0fe61a019ec61e5708cd117cb42ed0dab95e1617.1722294646.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pass the waiting time for __io_napi_adjust_timeout as ktime and get rid
of all timespec64 conversions. It's especially simpler since the caller
already have a ktime.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4f5b8e8eed4f53a1879e031a6712b25381adc23d.1722003776.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It's more natural to use ktime/ns instead of keeping around usec,
especially since we're comparing it against user provided timers,
so convert napi busy poll internal handling to ktime. It's also nicer
since the type (ktime_t vs unsigned long) now tells the unit of measure.
Keep everything as ktime, which we convert to/from micro seconds for
IORING_[UN]REGISTER_NAPI. The net/ busy polling works seems to work with
usec, however it's not real usec as shift by 10 is used to get it from
nsecs, see busy_loop_current_time(), so it's easy to get truncated nsec
back and we get back better precision.
Note, we can further improve it later by removing the truncation and
maybe convincing net/ to use ktime/ns instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/95e7ec8d095069a3ed5d40a4bc6f8b586698bc7e.1722003776.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
syzbot reports that KMSAN complains that 'nr_tw' is an uninit-value
with the following report:
BUG: KMSAN: uninit-value in io_req_local_work_add io_uring/io_uring.c:1192 [inline]
BUG: KMSAN: uninit-value in io_req_task_work_add_remote+0x588/0x5d0 io_uring/io_uring.c:1240
io_req_local_work_add io_uring/io_uring.c:1192 [inline]
io_req_task_work_add_remote+0x588/0x5d0 io_uring/io_uring.c:1240
io_msg_remote_post io_uring/msg_ring.c:102 [inline]
io_msg_data_remote io_uring/msg_ring.c:133 [inline]
io_msg_ring_data io_uring/msg_ring.c:152 [inline]
io_msg_ring+0x1c38/0x1ef0 io_uring/msg_ring.c:305
io_issue_sqe+0x383/0x22c0 io_uring/io_uring.c:1710
io_queue_sqe io_uring/io_uring.c:1924 [inline]
io_submit_sqe io_uring/io_uring.c:2180 [inline]
io_submit_sqes+0x1259/0x2f20 io_uring/io_uring.c:2295
__do_sys_io_uring_enter io_uring/io_uring.c:3205 [inline]
__se_sys_io_uring_enter+0x40c/0x3ca0 io_uring/io_uring.c:3142
__x64_sys_io_uring_enter+0x11f/0x1a0 io_uring/io_uring.c:3142
x64_sys_call+0x2d82/0x3c10 arch/x86/include/generated/asm/syscalls_64.h:427
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
which is the following check:
if (nr_tw < nr_wait)
return;
in io_req_local_work_add(). While nr_tw itself cannot be uninitialized,
it does depend on req->flags, which off the msg ring issue path can
indeed be uninitialized.
Fix this by always clearing the allocated 'req' fully if we can't grab
one from the cache itself.
Fixes: 50cf5f3842af ("io_uring/msg_ring: add an alloc cache for io_kiocb entries")
Reported-by: syzbot+82609b8937a4458106ca@syzkaller.appspotmail.com
Link: https://lore.kernel.org/io-uring/000000000000fd3d8d061dfc0e4a@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is a difference in how io_queue_sqe and io_wq_submit_work treat
error codes they get from io_issue_sqe. The first one fails anything
unknown but latter only fails when the code is negative.
It doesn't make sense to have this discrepancy, align them to the
io_queue_sqe behaviour.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c550e152bf4a290187f91a4322ddcb5d6d1f2c73.1721819383.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We don't have to return error code from an op handler back to core
io_uring, so once io_uring_cmd() sets the results and handles errors we
can juts return IOU_OK and simplify the code.
Note, only valid with e0b23d9953b0c ("io_uring: optimise ltimeout for
inline execution"), there was a problem with iopoll before.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8eae2be5b2a49236cd5f1dadbd1aa5730e9e2d4f.1721819383.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The __must_hold annotation in io_match_task() uses a non existing
parameter "req", fix it.
Fixes: 6af3f48bf6156 ("io_uring: fix link traversal locking")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3e65ee7709e96507cef3d93291746f2c489f2307.1721819383.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
IORING_SETUP_IOPOLL rings don't have any netpoll handling, let's fail
attempts to register netpolling in this case, there might be people who
will mix up IOPOLL and netpoll.
Cc: stable@vger.kernel.org
Fixes: ef1186c1a875b ("io_uring: add register/unregister napi function")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e7553aee0a8ae4edec6742cd6dd0c1e6914fba8.1721819383.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_uring_cancel_generic() should retry if any state changes like a
request is completed, however in case of a task exit it only goes for
another loop and avoids schedule() if any tracked (i.e. REQ_F_INFLIGHT)
request got completed.
Let's assume we have a non-tracked request executing in iowq and a
tracked request linked to it. Let's also assume
io_uring_cancel_generic() fails to find and cancel the request, i.e.
via io_run_local_work(), which may happen as io-wq has gaps.
Next, the request logically completes, io-wq still hold a ref but queues
it for completion via tw, which happens in
io_uring_try_cancel_requests(). After, right before prepare_to_wait()
io-wq puts the request, grabs the linked one and tries executes it, e.g.
arms polling. Finally the cancellation loop calls prepare_to_wait(),
there are no tw to run, no tracked request was completed, so the
tctx_inflight() check passes and the task is put to indefinite sleep.
Cc: stable@vger.kernel.org
Fixes: 3f48cf18f886c ("io_uring: unify files and task cancel")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/acac7311f4e02ce3c43293f8f1fda9c705d158f1.1721819383.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Syz reports a problem, which boils down to NULL vs IS_ERR inconsistent
error handling in io_alloc_pbuf_ring().
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:__io_remove_buffers+0xac/0x700 io_uring/kbuf.c:341
Call Trace:
<TASK>
io_put_bl io_uring/kbuf.c:378 [inline]
io_destroy_buffers+0x14e/0x490 io_uring/kbuf.c:392
io_ring_ctx_free+0xa00/0x1070 io_uring/io_uring.c:2613
io_ring_exit_work+0x80f/0x8a0 io_uring/io_uring.c:2844
process_one_work kernel/workqueue.c:3231 [inline]
process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312
worker_thread+0x86d/0xd40 kernel/workqueue.c:3390
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
Cc: stable@vger.kernel.org
Reported-by: syzbot+2074b1a3d447915c6f1c@syzkaller.appspotmail.com
Fixes: 87585b05757dc ("io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c5f9df20560bd9830401e8e48abc029e7cfd9f5e.1721329239.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|