starfive-tech/linux.git - StarFive Tech Linux Kernel for VisionFive (JH7110) boards (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
2024-09-10	io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN	Jens Axboe	1	-0/+8
	Some file systems, ocfs2 in this case, will return -EOPNOTSUPP for an IOCB_NOWAIT read/write attempt. While this can be argued to be correct, the usual return value for something that requires blocking issue is -EAGAIN. A refactoring io_uring commit dropped calling kiocb_done() for negative return values, which is otherwise where we already do that transformation. To ensure we catch it in both spots, check it in __io_read() itself as well. Reported-by: Robert Sander <r.sander@heinlein-support.de> Link: https://fosstodon.org/@gurubert@mastodon.gurubert.de/113112431889638440 Cc: stable@vger.kernel.org Fixes: a08d195b586a ("io_uring/rw: split io_read() into a helper") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10	io_uring: port to struct kmem_cache_args	Christian Brauner	1	-6/+8
	Port req_cachep to struct kmem_cache_args. Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-09-09	io_uring/sqpoll: do not allow pinning outside of cpuset	Felix Moessbauer	1	-1/+4
	The submit queue polling threads are userland threads that just never exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF, the affinity of the poller thread is set to the cpu specified in sq_thread_cpu. However, this CPU can be outside of the cpuset defined by the cgroup cpuset controller. This violates the rules defined by the cpuset controller and is a potential issue for realtime applications. In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in case no explicit pinning is required by inheriting the one of the creating task. In case of explicit pinning, the check is more complicated, as also a cpu outside of the parent cpumask is allowed. We implemented this by using cpuset_cpus_allowed (that has support for cgroup cpusets) and testing if the requested cpu is in the set. Fixes: 37d1e2e3642e ("io_uring: move SQPOLL thread io-wq forked worker") Cc: stable@vger.kernel.org # 6.1+ Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com> Link: https://lore.kernel.org/r/20240909150036.55921-1-felix.moessbauer@siemens.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-09	io_uring/eventfd: move refs to refcount_t	Jens Axboe	1	-6/+6
	atomic_t for the struct io_ev_fd references and there are no issues with it. While the ref getting and putting for the eventfd code is somewhat performance critical for cases where eventfd signaling is used (news flash, you should not...), it probably doesn't warrant using an atomic_t for this. Let's just move to it to refcount_t to get the added protection of over/underflows. Link: https://lore.kernel.org/lkml/202409082039.hnsaIJ3X-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409082039.hnsaIJ3X-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-02	io_uring: remove unused rsrc_put_fn	Anuj Gupta	1	-2/+0
	rsrc_put_fn is declared but never used, remove it. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20240902062134.136387-3-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-02	io_uring: add new line after variable declaration	Anuj Gupta	1	-0/+1
	Fixes checkpatch warning Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20240902062134.136387-2-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-30	io_uring: add GCOV_PROFILE_URING Kconfig option	Jens Axboe	1	-0/+4
	If GCOV is enabled and this option is set, it enables code coverage profiling of the io_uring subsystem. Only use this for test purposes, as it will impact the runtime performance. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-30	io_uring/kbuf: return correct iovec count from classic buffer peek	Jens Axboe	1	-1/+1
	io_provided_buffers_select() returns 0 to indicate success, but it should be returning 1 to indicate that 1 vec was mapped. This causes peeking to fail with classic provided buffers, and while that's not a use case that anyone should use, it should still work correctly. The end result is that no buffer will be selected, and hence a completion with '0' as the result will be posted, without a buffer attached. Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-30	io_uring/rsrc: ensure compat iovecs are copied correctly	Jens Axboe	1	-4/+15
	For buffer registration (or updates), a userspace iovec is copied in and updated. If the application is within a compat syscall, then the iovec type is compat_iovec rather than iovec. However, the type used in __io_sqe_buffers_update() and io_sqe_buffers_register() is always struct iovec, and hence the source is incremented by the size of a non-compat iovec in the loop. This misses every other iovec in the source, and will run into garbage half way through the copies and return -EFAULT to the application. Maintain the source address separately and assign to our user vec pointer, so that copies always happen from the right source address. While in there, correct a bad placement of __user which triggered the following sparse warning prior to this fix: io_uring/rsrc.c:981:33: warning: cast removes address space '__user' of expression io_uring/rsrc.c:981:30: warning: incorrect type in assignment (different address spaces) io_uring/rsrc.c:981:30: expected struct iovec const [noderef] __user uvec io_uring/rsrc.c:981:30: got struct iovec [noderef] __user Fixes: f4eaf8eda89e ("io_uring/rsrc: Drop io_copy_iov in favor of iovec API") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29	io_uring/kbuf: add support for incremental buffer consumption	Jens Axboe	2	-20/+64
	By default, any recv/read operation that uses provided buffers will consume at least 1 buffer fully (and maybe more, in case of bundles). This adds support for incremental consumption, meaning that an application may add large buffers, and each read/recv will just consume the part of the buffer that it needs. For example, let's say an application registers 1MB buffers in a provided buffer ring, for streaming receives. If it gets a short recv, then the full 1MB buffer will be consumed and passed back to the application. With incremental consumption, only the part that was actually used is consumed, and the buffer remains the current one. This means that both the application and the kernel needs to keep track of what the current receive point is. Each recv will still pass back a buffer ID and the size consumed, the only difference is that before the next receive would always be the next buffer in the ring. Now the same buffer ID may return multiple receives, each at an offset into that buffer from where the previous receive left off. Example: Application registers a provided buffer ring, and adds two 32K buffers to the ring. Buffer1 address: 0x1000000 (buffer ID 0) Buffer2 address: 0x2000000 (buffer ID 1) A recv completion is received with the following values: cqe->res 0x1000 (4k bytes received) cqe->flags 0x11 (CQE_F_BUFFER\|CQE_F_BUF_MORE set, buffer ID 0) and the application now knows that 4096b of data is available at 0x1000000, the start of that buffer, and that more data from this buffer will be coming. Now the next receive comes in: cqe->res 0x2010 (8k bytes received) cqe->flags 0x11 (CQE_F_BUFFER\|CQE_F_BUF_MORE set, buffer ID 0) which tells the application that 8k is available where the last completion left off, at 0x1001000. Next completion is: cqe->res 0x5000 (20k bytes received) cqe->flags 0x1 (CQE_F_BUFFER set, buffer ID 0) and the application now knows that 20k of data is available at 0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE isn't set, as no more data is available in this buffer ID. The next completion is then: cqe->res 0x1000 (4k bytes received) cqe->flags 0x10001 (CQE_F_BUFFER\|CQE_F_BUF_MORE set, buffer ID 1) which tells the application that buffer ID 1 is now the current one, hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next receive point for this buffer ID. When a buffer will be reused by future CQE completions, IORING_CQE_BUF_MORE will be set in cqe->flags. This tells the application that the kernel isn't done with the buffer yet, and that it should expect more completions for this buffer ID. Will only be set by provided buffer rings setup with IOU_PBUF_RING INC, as that's the only type of buffer that will see multiple consecutive completions for the same buffer ID. For any other provided buffer type, any completion that passes back a buffer to the application is final. Once a buffer has been fully consumed, the buffer ring head is incremented and the next receive will indicate the next buffer ID in the CQE cflags. On the send side, the application can manage how much data is sent from an existing buffer by setting sqe->len to the desired send length. An application can request incremental consumption by setting IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of that, any provided buffer ring setup and buffer additions is done like before, no changes there. The only change is in how an application may see multiple completions for the same buffer ID, hence needing to know where the next receive will happen. Note that like existing provided buffer rings, this should not be used with IOSQE_ASYNC, as both really require the ring to remain locked over the duration of the buffer selection and the operation completion. It will consume a buffer otherwise regardless of the size of the IO done. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29	io_uring/kbuf: pass in 'len' argument for buffer commit	Jens Axboe	5	-30/+31
	In preparation for needing the consumed length, pass in the length being completed. Unused right now, but will be used when it is possible to partially consume a buffer. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29	Revert "io_uring: Require zeroed sqe->len on provided-buffers send"	Jens Axboe	1	-3/+1
	This reverts commit 79996b45f7b28c0e3e08a95bab80119e95317e28. Revert the change that restricts a send provided buffer to be zero, so it will always consume the whole buffer. This is strictly needed for partial consumption, as the send may very well be a subset of the current buffer. In fact, that's the intended use case. For non-incremental provided buffer rings, an application should set sqe->len carefully to avoid the potential issue described in the reverted commit. It is recommended that '0' still be set for len for that case, if the application is set on maintaining more than 1 send inflight for the same socket. This is somewhat of a nonsensical thing to do. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29	io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h	Jens Axboe	2	-6/+3
	In preparation for using this helper in kbuf.h as well, move it there and turn it into a macro. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29	io_uring/kbuf: add io_kbuf_commit() helper	Jens Axboe	2	-8/+13
	Committing the selected ring buffer is currently done in three different spots, combine it into a helper and just call that. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg	Jens Axboe	1	-2/+2
	nr_iovs is capped at 1024, and mode only has a few low values. We can safely make them u16, in preparation for adding a few more members. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring: wire up min batch wake timeout	Jens Axboe	1	-4/+4
	Expose min_wait_usec in io_uring_getevents_arg, replacing the pad member that is currently in there. The value is in usecs, which is explained in the name as well. Note that if min_wait_usec and a normal timeout is used in conjunction, the normal timeout is still relative to the base time. For example, if min_wait_usec is set to 100 and the normal timeout is 1000, the max total time waited is still 1000. This also means that if the normal timeout is shorter than min_wait_usec, then only the min_wait_usec will take effect. See previous commit for an explanation of how this works. IORING_FEAT_MIN_TIMEOUT is added as a feature flag for this, as applications doing submit_and_wait_timeout() style operations will generally not see the -EINVAL from the wait side as they return the number of IOs submitted. Only if no IOs are submitted will the -EINVAL bubble back up to the application. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring: add support for batch wait timeout	Jens Axboe	2	-13/+82
	Waiting for events with io_uring has two knobs that can be set: 1) The number of events to wake for 2) The timeout associated with the event Waiting will abort when either of those conditions are met, as expected. This adds support for a third event, which is associated with the number of events to wait for. Applications generally like to handle batches of completions, and right now they'd set a number of events to wait for and the timeout for that. If no events have been received but the timeout triggers, control is returned to the application and it can wait again. However, if the application doesn't have anything to do until events are reaped, then it's possible to make this waiting more efficient. For example, the application may have a latency time of 50 usecs and wanting to handle a batch of 8 requests at the time. If it uses 50 usecs as the timeout, then it'll be doing 20K context switches per second even if nothing is happening. This introduces the notion of min batch wait time. If the min batch wait time expires, then we'll return to userspace if we have any events at all. If none are available, the general wait time is applied. Any request arriving after the min batch wait time will cause waiting to stop and return control to the application. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring: implement our own schedule timeout handling	Jens Axboe	2	-4/+33
	In preparation for having two distinct timeouts and avoid waking the task if we don't need to. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring: move schedule wait logic into helper	Jens Axboe	1	-16/+21
	In preparation for expanding how we handle waits, move the actual schedule and schedule_timeout() handling into a helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring: encapsulate extraneous wait flags into a separate struct	Jens Axboe	1	-21/+24
	Rather than need to pass in 2 or 3 separate arguments, add a struct to encapsulate the timeout and sigset_t parts of waiting. In preparation for adding another argument for waiting. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring: user registered clockid for wait timeouts	Pavel Begunkov	4	-3/+46
	Add a new registration opcode IORING_REGISTER_CLOCK, which allows the user to select which clock id it wants to use with CQ waiting timeouts. It only allows a subset of all posix clocks and currently supports CLOCK_MONOTONIC and CLOCK_BOOTTIME. Suggested-by: Lewis Baker <lewissbaker@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/98f2bc8a3c36cdf8f0e6a275245e81e903459703.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring: add absolute mode wait timeouts	Pavel Begunkov	1	-7/+8
	In addition to current relative timeouts for the waiting loop, where the timespec argument specifies the maximum time it can wait for, add support for the absolute mode, with the value carrying a CLOCK_MONOTONIC absolute time until which we should return control back to the user. Suggested-by: Lewis Baker <lewissbaker@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d5b74d67ada882590b2e42aa3aa7117bbf6b55f.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring/napi: postpone napi timeout adjustment	Pavel Begunkov	3	-38/+7
	Remove io_napi_adjust_timeout() and move the adjustments out of the common path into __io_napi_busy_loop(). Now the limit it's calculated based on struct io_wait_queue::timeout, for which we query current time another time. The overhead shouldn't be a problem, it's a polling path, however that can be optimised later by additionally saving the delta time value in io_cqring_wait(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/88e14686e245b3b42ff90a3c4d70895d48676206.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring/napi: refactor __io_napi_busy_loop()	Pavel Begunkov	1	-3/+4
	we don't need to set ->napi_prefer_busy_poll if we're not going to poll, do the checks first and all polling preparation after. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2ad7ede8cc7905328fc62e8c3805fdb11635ae0b.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring/kbuf: turn io_buffer_list booleans into flags	Jens Axboe	2	-23/+26
	We could just move these two and save some space, but in preparation for adding another flag, turn them into flags first. This saves 8 bytes in struct io_buffer_list, making it exactly half a cacheline on 64-bit archs now rather than 40 bytes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring/net: use ITER_UBUF for single segment send maps	Jens Axboe	1	-3/+12
	Just like what is being done on the recv side, if we only map a single segment, then use ITER_UBUF for mapping it. That's more efficient than using an ITER_IOVEC. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring/kbuf: use 'bl' directly rather than req->buf_list	Jens Axboe	1	-1/+1
	req->buf_list is assigned higher up and is safe to use as we remain within a locked region, as is the 'bl' variable itself from which it was assigned. To improve readability, use 'bl' directly rather than get it from the io_kiocb, if we need to increment the head directly in the buffer selection path. This makes it readily apparent that it's the same io_buffer_list being used. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring: micro optimization of __io_sq_thread() condition	Olivier Langlois	1	-1/+1
	reverse the order of the element evaluation in an if statement. for many users that are not using iopoll, the iopoll_list will always evaluate to false after having made a memory access whereas to_submit is very likely already loaded in a register. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/052ca60b5c49e7439e4b8bd33bfab4a09d36d3d6.1722374371.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring/rsrc: enable multi-hugepage buffer coalescing	Chenliang Li	2	-32/+110
	Add support for checking and coalescing multi-hugepage-backed fixed buffers. The coalescing optimizes both time and space consumption caused by mapping and storing multi-hugepage fixed buffers. A coalescable multi-hugepage buffer should fully cover its folios (except potentially the first and last one), and these folios should have the same size. These requirements are for easier processing later, also we need same size'd chunks in io_import_fixed for fast iov_iter adjust. Signed-off-by: Chenliang Li <cliang01.li@samsung.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20240731090133.4106-3-cliang01.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring/rsrc: store folio shift and mask into imu	Chenliang Li	2	-9/+8
	Store the folio shift and folio mask into imu struct and use it in iov_iter adjust, as we will have non PAGE_SIZE'd chunks if a multi-hugepage buffer get coalesced. Signed-off-by: Chenliang Li <cliang01.li@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20240731090133.4106-2-cliang01.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25	io_uring: add napi busy settings to the fdinfo output	Olivier Langlois	1	-1/+13
	This info may be useful when attempting to debug a problem involving a ring using the NAPI feature. Here is an example of the output: ip-172-31-39-89 /proc/772/fdinfo # cat 14 pos: 0 flags: 02000002 mnt_id: 16 ino: 10243 SqMask: 0xff SqHead: 633 SqTail: 633 CachedSqHead: 633 CqMask: 0x3fff CqHead: 430250 CqTail: 430250 CachedCqTail: 430250 SQEs: 0 CQEs: 0 SqThread: 885 SqThreadCpu: 0 SqTotalTime: 52793826 SqWorkTime: 3590465 UserFiles: 0 UserBufs: 0 PollList: op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=6, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=6, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 CqOverflowList: NAPI: enabled napi_busy_poll_to: 1 napi_prefer_busy_poll: true Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bb184f8b62703ddd3e6e19eae7ab6c67b97e1e10.1722293317.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-21	io_uring/kbuf: sanitize peek buffer setup	Jens Axboe	1	-3/+6
	Harden the buffer peeking a bit, by adding a sanity check for it having a valid size. Outside of that, arg->max_len is a size_t, though it's only ever set to a 32-bit value (as it's governed by MAX_RW_COUNT). Bump our needed check to a size_t so we know it fits. Finally, cap the calculated needed iov value to the PEEK_MAX_IMPORT, which is the maximum number of segments that should be peeked. Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-13	io_uring/sqpoll: annotate debug task == current with data_race()	Jens Axboe	1	-1/+1
	There's a debug check in io_sq_thread_park() checking if it's the SQPOLL thread itself calling park. KCSAN warns about this, as we should not be reading sqd->thread outside of sqd->lock. Just silence this with data_race(). The pointer isn't used for anything but this debug check. Reported-by: syzbot+2b946a3fd80caf971b21@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-13	introduce fd_file(), convert all accessors to it.	Al Viro	1	-5/+5
	For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-12	io_uring/napi: remove duplicate io_napi_entry timeout assignation	Olivier Langlois	1	-1/+0
	io_napi_entry() has 2 calling sites. One of them is unlikely to find an entry and if it does, the timeout should arguable not be updated. The other io_napi_entry() calling site is overwriting the update made by io_napi_entry() so the io_napi_entry() timeout value update has no or little value and therefore is removed. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/145b54ff179f87609e20dffaf5563c07cdbcad1a.1723423275.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-12	io_uring/napi: check napi_enabled in io_napi_add() before proceeding	Olivier Langlois	2	-2/+2
	doing so avoids the overhead of adding napi ids to all the rings that do not enable napi. if no id is added to napi_list because napi is disabled, __io_napi_busy_loop() will not be called. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Fixes: b4ccc4dd1330 ("io_uring/napi: enable even with a timeout of 0") Link: https://lore.kernel.org/r/bd989ccef5fda14f5fd9888faf4fefcf66bd0369.1723400131.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-08	io_uring/net: don't pick multiple buffers for non-bundle send	Jens Axboe	1	-2/+3
	If a send is issued marked with IOSQE_BUFFER_SELECT for selecting a buffer, unless it's a bundle, it should not select multiple buffers. Cc: stable@vger.kernel.org Fixes: a05d1f625c7a ("io_uring/net: support bundles for send") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-08	io_uring/net: ensure expanded bundle send gets marked for cleanup	Jens Axboe	1	-0/+1
	If the iovec inside the kmsg isn't already allocated AND one gets expanded beyond the fixed size, then the request may not already have been marked for cleanup. Ensure that it is. Cc: stable@vger.kernel.org Fixes: a05d1f625c7a ("io_uring/net: support bundles for send") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-08	io_uring/net: ensure expanded bundle recv gets marked for cleanup	Jens Axboe	1	-0/+1
	If the iovec inside the kmsg isn't already allocated AND one gets expanded beyond the fixed size, then the request may not already have been marked for cleanup. Ensure that it is. Cc: stable@vger.kernel.org Fixes: 2f9c9515bdfd ("io_uring/net: support bundles for recv") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-30	io_uring: remove unused local list heads in NAPI functions	Olivier Langlois	1	-2/+0
	These lists are unused, remove them. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0a0ae3e955aed0f3e3d29882fb3d3cb575e0009b.1722294947.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-30	io_uring: keep multishot request NAPI timeout current	Olivier Langlois	1	-0/+1
	This refresh statement was originally present in the original patch: https://lore.kernel.org/netdev/20221121191437.996297-2-shr@devkernel.io/ It has been removed with no explanation in v6: https://lore.kernel.org/netdev/20230201222254.744422-2-shr@devkernel.io/ It is important to make the refresh for multishot requests, because if no new requests using the same NAPI device are added to the ring, the entry will become stale and be removed silently. The unsuspecting user will not know that their ring had busy polling for only 60 seconds before being pruned. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 8d0c12a80cdeb ("io-uring: add napi busy poll support") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/0fe61a019ec61e5708cd117cb42ed0dab95e1617.1722294646.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-26	io_uring/napi: pass ktime to io_napi_adjust_timeout	Pavel Begunkov	3	-17/+11
	Pass the waiting time for __io_napi_adjust_timeout as ktime and get rid of all timespec64 conversions. It's especially simpler since the caller already have a ktime. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4f5b8e8eed4f53a1879e031a6712b25381adc23d.1722003776.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-26	io_uring/napi: use ktime in busy polling	Pavel Begunkov	3	-23/+29
	It's more natural to use ktime/ns instead of keeping around usec, especially since we're comparing it against user provided timers, so convert napi busy poll internal handling to ktime. It's also nicer since the type (ktime_t vs unsigned long) now tells the unit of measure. Keep everything as ktime, which we convert to/from micro seconds for IORING_[UN]REGISTER_NAPI. The net/ busy polling works seems to work with usec, however it's not real usec as shift by 10 is used to get it from nsecs, see busy_loop_current_time(), so it's easy to get truncated nsec back and we get back better precision. Note, we can further improve it later by removing the truncation and maybe convincing net/ to use ktime/ns instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/95e7ec8d095069a3ed5d40a4bc6f8b586698bc7e.1722003776.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-25	io_uring/msg_ring: fix uninitialized use of target_req->flags	Jens Axboe	1	-3/+3
	syzbot reports that KMSAN complains that 'nr_tw' is an uninit-value with the following report: BUG: KMSAN: uninit-value in io_req_local_work_add io_uring/io_uring.c:1192 [inline] BUG: KMSAN: uninit-value in io_req_task_work_add_remote+0x588/0x5d0 io_uring/io_uring.c:1240 io_req_local_work_add io_uring/io_uring.c:1192 [inline] io_req_task_work_add_remote+0x588/0x5d0 io_uring/io_uring.c:1240 io_msg_remote_post io_uring/msg_ring.c:102 [inline] io_msg_data_remote io_uring/msg_ring.c:133 [inline] io_msg_ring_data io_uring/msg_ring.c:152 [inline] io_msg_ring+0x1c38/0x1ef0 io_uring/msg_ring.c:305 io_issue_sqe+0x383/0x22c0 io_uring/io_uring.c:1710 io_queue_sqe io_uring/io_uring.c:1924 [inline] io_submit_sqe io_uring/io_uring.c:2180 [inline] io_submit_sqes+0x1259/0x2f20 io_uring/io_uring.c:2295 __do_sys_io_uring_enter io_uring/io_uring.c:3205 [inline] __se_sys_io_uring_enter+0x40c/0x3ca0 io_uring/io_uring.c:3142 __x64_sys_io_uring_enter+0x11f/0x1a0 io_uring/io_uring.c:3142 x64_sys_call+0x2d82/0x3c10 arch/x86/include/generated/asm/syscalls_64.h:427 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f which is the following check: if (nr_tw < nr_wait) return; in io_req_local_work_add(). While nr_tw itself cannot be uninitialized, it does depend on req->flags, which off the msg ring issue path can indeed be uninitialized. Fix this by always clearing the allocated 'req' fully if we can't grab one from the cache itself. Fixes: 50cf5f3842af ("io_uring/msg_ring: add an alloc cache for io_kiocb entries") Reported-by: syzbot+82609b8937a4458106ca@syzkaller.appspotmail.com Link: https://lore.kernel.org/io-uring/000000000000fd3d8d061dfc0e4a@google.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24	io_uring: align iowq and task request error handling	Pavel Begunkov	1	-1/+1
	There is a difference in how io_queue_sqe and io_wq_submit_work treat error codes they get from io_issue_sqe. The first one fails anything unknown but latter only fails when the code is negative. It doesn't make sense to have this discrepancy, align them to the io_queue_sqe behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c550e152bf4a290187f91a4322ddcb5d6d1f2c73.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24	io_uring: simplify io_uring_cmd return	Pavel Begunkov	1	-1/+1
	We don't have to return error code from an op handler back to core io_uring, so once io_uring_cmd() sets the results and handles errors we can juts return IOU_OK and simplify the code. Note, only valid with e0b23d9953b0c ("io_uring: optimise ltimeout for inline execution"), there was a problem with iopoll before. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8eae2be5b2a49236cd5f1dadbd1aa5730e9e2d4f.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24	io_uring: fix io_match_task must_hold	Pavel Begunkov	1	-1/+1
	The __must_hold annotation in io_match_task() uses a non existing parameter "req", fix it. Fixes: 6af3f48bf6156 ("io_uring: fix link traversal locking") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3e65ee7709e96507cef3d93291746f2c489f2307.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24	io_uring: don't allow netpolling with SETUP_IOPOLL	Pavel Begunkov	1	-0/+2
	IORING_SETUP_IOPOLL rings don't have any netpoll handling, let's fail attempts to register netpolling in this case, there might be people who will mix up IOPOLL and netpoll. Cc: stable@vger.kernel.org Fixes: ef1186c1a875b ("io_uring: add register/unregister napi function") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1e7553aee0a8ae4edec6742cd6dd0c1e6914fba8.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24	io_uring: tighten task exit cancellations	Pavel Begunkov	1	-1/+4
	io_uring_cancel_generic() should retry if any state changes like a request is completed, however in case of a task exit it only goes for another loop and avoids schedule() if any tracked (i.e. REQ_F_INFLIGHT) request got completed. Let's assume we have a non-tracked request executing in iowq and a tracked request linked to it. Let's also assume io_uring_cancel_generic() fails to find and cancel the request, i.e. via io_run_local_work(), which may happen as io-wq has gaps. Next, the request logically completes, io-wq still hold a ref but queues it for completion via tw, which happens in io_uring_try_cancel_requests(). After, right before prepare_to_wait() io-wq puts the request, grabs the linked one and tries executes it, e.g. arms polling. Finally the cancellation loop calls prepare_to_wait(), there are no tw to run, no tracked request was completed, so the tctx_inflight() check passes and the task is put to indefinite sleep. Cc: stable@vger.kernel.org Fixes: 3f48cf18f886c ("io_uring: unify files and task cancel") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/acac7311f4e02ce3c43293f8f1fda9c705d158f1.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-20	io_uring: fix error pbuf checking	Pavel Begunkov	1	-1/+3
	Syz reports a problem, which boils down to NULL vs IS_ERR inconsistent error handling in io_alloc_pbuf_ring(). KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:__io_remove_buffers+0xac/0x700 io_uring/kbuf.c:341 Call Trace: <TASK> io_put_bl io_uring/kbuf.c:378 [inline] io_destroy_buffers+0x14e/0x490 io_uring/kbuf.c:392 io_ring_ctx_free+0xa00/0x1070 io_uring/io_uring.c:2613 io_ring_exit_work+0x80f/0x8a0 io_uring/io_uring.c:2844 process_one_work kernel/workqueue.c:3231 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312 worker_thread+0x86d/0xd40 kernel/workqueue.c:3390 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Cc: stable@vger.kernel.org Reported-by: syzbot+2074b1a3d447915c6f1c@syzkaller.appspotmail.com Fixes: 87585b05757dc ("io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c5f9df20560bd9830401e8e48abc029e7cfd9f5e.1721329239.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>