kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
4 days	io-wq: check that the predecessor is hashed in io_wq_remove_pending()	Nicholas Carlini	1	-1/+2
	commit d6a2d7b04b5a093021a7a0e2e69e9d5237dfa8cc upstream. io_wq_remove_pending() needs to fix up wq->hash_tail[] if the cancelled work was the tail of its hash bucket. When doing this, it checks whether the preceding entry in acct->work_list has the same hash value, but never checks that the predecessor is hashed at all. io_get_work_hash() is simply atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT, and the hash bits are never set for non-hashed work, so it returns 0. Thus, when a hashed bucket-0 work is cancelled while a non-hashed work is its list predecessor, the check spuriously passes and a pointer to the non-hashed io_kiocb is stored in wq->hash_tail[0]. Because non-hashed work is dequeued via the fast path in io_get_next_work(), which never touches hash_tail[], the stale pointer is never cleared. Therefore, after the non-hashed io_kiocb completes and is freed back to req_cachep, wq->hash_tail[0] is a dangling pointer. The io_wq is per-task (tctx->io_wq) and survives ring open/close, so the dangling pointer persists for the lifetime of the task; the next hashed bucket-0 enqueue dereferences it in io_wq_insert_work() and wq_list_add_after() writes through freed memory. Add the missing io_wq_is_hashed() check so a non-hashed predecessor never inherits a hash_tail[] slot. Cc: stable@vger.kernel.org Fixes: 204361a77f40 ("io-wq: fix hang after cancelling pending hashed work") Signed-off-by: Nicholas Carlini <nicholas@carlini.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 days	io_uring/napi: cap busy_poll_to 10 msec	Jens Axboe	1	-0/+2
	[ Upstream commit df8599ee18c0e5fe343ffe0b4c379636b8bb839a ] Currently there's no cap on the maximum amount of time that napi is allowed to poll if no events are found, which can lead to kernel complaints on a task being stuck as there's no conditional rescheduling done within that loop. Just cap it to 10 msec in total, that's already way above any kind of sane value that will reap any benefits, yet low enough that it's nowhere near being able to trigger preemption complaints. Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
10 days	io_uring/zcrx: warn on freelist violations	Pavel Begunkov	1	-0/+2
	commit 770594e78c3964cf23cf5287f849437cdde9b7d0 upstream. The freelist is appropriately sized to always be able to take a free niov, but let's be more defensive and check the invariant with a warning. That should help to catch any double-free issues. Suggested-by: Kai Aizen <kai@snailsploit.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/2f3cea363b04649755e3b6bb9ab66485a95936d5.1776760901.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 days	io_uring/zcrx: use guards for locking	Pavel Begunkov	1	-8/+7
	commit 898ad80d1207cbdb22b21bafb6de4adfd7627bd0 upstream. Convert last several places using manual locking to guards to simplify the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/eb4667cfaf88c559700f6399da9e434889f5b04a.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
13 days	io_uring/tw: serialize ctx->retry_llist with ->uring_lock	Jens Axboe	1	-1/+11
	Commit 17666e2d7592c3e85260cafd3950121524acc2c5 upstream. The DEFER_TASKRUN local task work paths all run under ctx->uring_lock, which serializes them with each other and with the rest of the ring's hot paths. io_move_task_work_from_local() is the exception - it's called from io_ring_exit_work() on a kworker without holding the lock and from the iopoll cancelation side right after dropping it. ->work_llist is fine with this, as it's only ever updated via the expected paths. But the ->retry_llist is updated while runing, and hence it could potentially race between normal task_work running and the task-has-exited shutdown path. Simply grab ->uring_lock while moving the local work to the fallback list for exit purposes, which nicely serializes it across both the normal additions and the exit prune path. Cc: stable@vger.kernel.org Fixes: f46b9cdb22f7 ("io_uring: limit local tw done") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
13 days	io_uring/kbuf: support min length left for incremental buffers	Martin Michaelis	2	-3/+16
	Commit 7deba791ad495ce1d7921683f4f7d1190fa210d1 upstream. Incrementally consumed buffer rings are generally fully consumed, but it's quite possible that the application has a minimum size it needs to meet to avoid truncation. Currently that minimum limit is 1 byte, but this should be a setting that is the hands of the application. For recvmsg multishot, a prime use case for incrementally consumed buffers, the application may get spurious -EFAULT returned at the end of an incrementally consumed buffer, as less space is available than the headers need. Grab a u32 field in struct io_uring_buf_reg, which the application can use to inform the kernel of the minimum size that should be available in an incrementally consumed buffer. If less than that is available, the current buffer is fully processed and the next one will be picked. Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1433 Signed-off-by: Martin Michaelis <code@mgjm.de> [axboe: write commit message, change io_buffer_list member name] Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07	io_uring/poll: ensure EPOLL_ONESHOT is propagated for EPOLL_URING_WAKE	Jens Axboe	1	-1/+3
	commit 1967f0b1cafdde37aa9e08e6021c14bcc484b7a5 upstream. Commit: aacf2f9f382c ("io_uring: fix req->apoll_events") fixed an issue where poll->events and req->apoll_events weren't synchronized, but then when the commit referenced in Fixes got added, it didn't ensure the same thing. If we mask in EPOLLONESHOT in the regular EPOLL_URING_WAKE path, then ensure it's done for both. Including a link to the original report below, even though it's mostly nonsense. But it includes a reproducer that does show that IORING_CQE_F_MORE is set in the previous CQE, while no more CQEs will be generated for this request. Just ignore anything that pretends this is security related in any way, it's just the typical AI nonsense. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/CAM0zi7yQzF3eKncgHo4iVM5yFLAjsiob_ucqyWKs=hyd_GqiMg@mail.gmail.com/ Reported-by: Azizcan Daştan <azizcan.d@mileniumsec.com> Fixes: 4464853277d0 ("io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeups") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07	io_uring/poll: fix signed comparison in io_poll_get_ownership()	Longxuan Yu	1	-1/+1
	commit 326941b22806cbf2df1fbfe902b7908b368cce42 upstream. io_poll_get_ownership() uses a signed comparison to check whether poll_refs has reached the threshold for the slowpath: if (unlikely(atomic_read(&req->poll_refs) >= IO_POLL_REF_BIAS)) atomic_read() returns int (signed). When IO_POLL_CANCEL_FLAG (BIT(31)) is set in poll_refs, the value becomes negative in signed arithmetic, so the >= 128 comparison always evaluates to false and the slowpath is never taken. Fix this by casting the atomic_read() result to unsigned int before the comparison, so that the cancel flag is treated as a large positive value and correctly triggers the slowpath. Fixes: a26a35e9019f ("io_uring: make poll refs more robust") Cc: stable@vger.kernel.org Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Longxuan Yu <ylong030@ucr.edu> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3a3508b08bcd7f1bc3beff848ae6e1d73d355043.1775965597.git.ylong030@ucr.edu Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07	io_uring/timeout: check unused sqe fields	Pavel Begunkov	1	-0/+4
	commit 484ae637a3e3d909718de7c07afd3bb34b6b8504 upstream. Zero check unused SQE fields addr3 and pad2 for timeout and timeout update requests. They're not needed now, but could be used sometime in the future. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07	io_uring/register: fix ring resizing with mixed/large SQEs/CQEs	Jens Axboe	1	-8/+28
	Commit 45cd95763e198d74d369ede43aef0b1955b8dea4 upstream. The ring resizing only properly handles "normal" sized SQEs or CQEs, if there are pending entries around a resize. This normally should not be the case, but the code is supposed to handle this regardless. For the mixed SQE/CQE cases, the current copying works fine as they are indexed in the same way. Each half is just copied separately. But for fixed large SQEs and CQEs, the iteration and copy need to take that into account. Cc: stable@kernel.org Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-11	io_uring: protect remaining lockless ctx->rings accesses with RCU	Jens Axboe	2	-27/+69
	Commit 61a11cf4812726aceaee17c96432e1c08f6ed6cb upstream. Commit 96189080265e addressed one case of ctx->rings being potentially accessed while a resize is happening on the ring, but there are still a few others that need handling. Add a helper for retrieving the rings associated with an io_uring context, and add some sanity checking to that to catch bad uses. ->rings_rcu is always valid, as long as it's used within RCU read lock. Any use of ->rings_rcu or ->rings inside either ->uring_lock or ->completion_lock is sane as well. Do the minimum fix for the current kernel, but set it up such that this basic infra can be extended for later kernels to make this harder to mess up in the future. Thanks to Junxi Qian for finding and debugging this issue. Cc: stable@vger.kernel.org Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Reviewed-by: Junxi Qian <qjx1298677004@gmail.com> Tested-by: Junxi Qian <qjx1298677004@gmail.com> Link: https://lore.kernel.org/io-uring/20260330172348.89416-1-qjx1298677004@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-11	io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs()	Junxi Qian	1	-0/+4
	commit b948f9d5d3057b01188e36664e7c7604d1c8ecb5 upstream. sqe->len is __u32 but gets stored into sr->len which is int. When userspace passes sqe->len values exceeding INT_MAX (e.g. 0xFFFFFFFF), sr->len overflows to a negative value. This negative value propagates through the bundle recv/send path: 1. io_recv(): sel.val = sr->len (ssize_t gets -1) 2. io_recv_buf_select(): arg.max_len = sel->val (size_t gets 0xFFFFFFFFFFFFFFFF) 3. io_ring_buffers_peek(): buf->len is not clamped because max_len is astronomically large 4. iov[].iov_len = 0xFFFFFFFF flows into io_bundle_nbufs() 5. io_bundle_nbufs(): min_t(int, 0xFFFFFFFF, ret) yields -1, causing ret to increase instead of decrease, creating an infinite loop that reads past the allocated iov[] array This results in a slab-out-of-bounds read in io_bundle_nbufs() from the kmalloc-64 slab, as nbufs increments past the allocated iovec entries. BUG: KASAN: slab-out-of-bounds in io_bundle_nbufs+0x128/0x160 Read of size 8 at addr ffff888100ae05c8 by task exp/145 Call Trace: io_bundle_nbufs+0x128/0x160 io_recv_finish+0x117/0xe20 io_recv+0x2db/0x1160 Fix this by rejecting negative sr->len values early in both io_sendmsg_prep() and io_recvmsg_prep(). Since sqe->len is __u32, any value > INT_MAX indicates overflow and is not a valid length. Fixes: a05d1f625c7a ("io_uring/net: support bundles for send") Cc: stable@vger.kernel.org Signed-off-by: Junxi Qian <qjx1298677004@gmail.com> Link: https://patch.msgid.link/20260329153909.279046-1-qjx1298677004@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-11	io_uring/rsrc: reject zero-length fixed buffer import	Qi Tang	1	-0/+4
	[ Upstream commit 111a12b422a8cfa93deabaef26fec48237163214 ] validate_fixed_range() admits buf_addr at the exact end of the registered region when len is zero, because the check uses strict greater-than (buf_end > imu->ubuf + imu->len). io_import_fixed() then computes offset == imu->len, which causes the bvec skip logic to advance past the last bio_vec entry and read bv_offset from out-of-bounds slab memory. Return early from io_import_fixed() when len is zero. A zero-length import has no data to transfer and should not walk the bvec array at all. BUG: KASAN: slab-out-of-bounds in io_import_reg_buf+0x697/0x7f0 Read of size 4 at addr ffff888002bcc254 by task poc/103 Call Trace: io_import_reg_buf+0x697/0x7f0 io_write_fixed+0xd9/0x250 __io_issue_sqe+0xad/0x710 io_issue_sqe+0x7d/0x1100 io_submit_sqes+0x86a/0x23c0 __do_sys_io_uring_enter+0xa98/0x1590 Allocated by task 103: The buggy address is located 12 bytes to the right of allocated 584-byte region [ffff888002bcc000, ffff888002bcc248) Fixes: 8622b20f23ed ("io_uring: add validate_fixed_range() for validate fixed buffer") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Link: https://patch.msgid.link/20260329164936.240871-1-tpluszz77@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-25	io_uring/kbuf: propagate BUF_MORE through early buffer commit path	Jens Axboe	1	-3/+7
	commit 418eab7a6f3c002d8e64d6e95ec27118017019af upstream. When io_should_commit() returns true (eg for non-pollable files), buffer commit happens at buffer selection time and sel->buf_list is set to NULL. When __io_put_kbufs() generates CQE flags at completion time, it calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence cannot determine whether the buffer was consumed or not. This means that IORING_CQE_F_BUF_MORE is never set for non-pollable input with incrementally consumed buffers. Likewise for io_buffers_select(), which always commits upfront and discards the return value of io_kbuf_commit(). Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early commit. Then __io_put_kbuf_ring() can check this flag and set IORING_F_BUF_MORE accordingy. Reported-by: Martin Michaelis <code@mgjm.de> Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1553 Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-25	io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF	Jens Axboe	1	-0/+4
	commit 3ecd3e03144b38a21a3b70254f1b9d2e16629b09 upstream. For a zero length transfer, io_kbuf_inc_commit() is called with !len. Since we never enter the while loop to consume the buffers, io_kbuf_inc_commit() ends up returning true, consuming the buffer. But if no data was consumed, by definition it cannot have consumed the buffer. Return false for that case. Reported-by: Martin Michaelis <code@mgjm.de> Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1553 Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-25	io_uring/poll: fix multishot recv missing EOF on wakeup race	Jens Axboe	1	-2/+7
	commit a68ed2df72131447d131531a08fe4dfcf4fa4653 upstream. When a socket send and shutdown() happen back-to-back, both fire wake-ups before the receiver's task_work has a chance to run. The first wake gets poll ownership (poll_refs=1), and the second bumps it to 2. When io_poll_check_events() runs, it calls io_poll_issue() which does a recv that reads the data and returns IOU_RETRY. The loop then drains all accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only the first event was consumed. Since the shutdown is a persistent state change, no further wakeups will happen, and the multishot recv can hang forever. Check specifically for HUP in the poll loop, and ensure that another loop is done to check for status if more than a single poll activation is pending. This ensures we don't lose the shutdown event. Cc: stable@vger.kernel.org Fixes: dbc2564cfe0f ("io_uring: let fast poll support multishot") Reported-by: Francis Brosseau <francis@malagauche.com> Link: https://github.com/axboe/liburing/issues/1549 Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19	io_uring/eventfd: use ctx->rings_rcu for flags checking	Jens Axboe	1	-3/+7
	Commit 177c69432161f6e4bab07ccacf8a1748a6898a6b upstream. Similarly to what commit e78f7b70e837 did for local task work additions, use ->rings_rcu under RCU rather than dereference ->rings directly. See that commit for more details. Cc: stable@vger.kernel.org Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19	io_uring: ensure ctx->rings is stable for task work flags manipulation	Jens Axboe	2	-3/+33
	Commit 96189080265e6bb5dde3a4afbaf947af493e3f82 upstream. If DEFER_TASKRUN \| SETUP_TASKRUN is used and task work is added while the ring is being resized, it's possible for the OR'ing of IORING_SQ_TASKRUN to happen in the small window of swapping into the new rings and the old rings being freed. Prevent this by adding a 2nd ->rings pointer, ->rings_rcu, which is protected by RCU. The task work flags manipulation is inside RCU already, and if the resize ring freeing is done post an RCU synchronize, then there's no need to add locking to the fast path of task work additions. Note: this is only done for DEFER_TASKRUN, as that's the only setup mode that supports ring resizing. If this ever changes, then they too need to use the io_ctx_mark_taskrun() helper. Link: https://lore.kernel.org/io-uring/20260309062759.482210-1-naup96721@gmail.com/ Cc: stable@vger.kernel.org Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Reported-by: Hao-Yu Yang <naup96721@gmail.com> Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19	io_uring/kbuf: check if target buffer list is still legacy on recycle	Jens Axboe	1	-2/+11
	commit c2c185be5c85d37215397c8e8781abf0a69bec1f upstream. There's a gap between when the buffer was grabbed and when it potentially gets recycled, where if the list is empty, someone could've upgraded it to a ring provided type. This can happen if the request is forced via io-wq. The legacy recycling is missing checking if the buffer_list still exists, and if it's of the correct type. Add those checks. Cc: stable@vger.kernel.org Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Reported-by: Keenan Dong <keenanat2000@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19	io_uring/net: reject SEND_VECTORIZED when unsupported	Pavel Begunkov	1	-0/+2
	commit c36e28becd0586ac98318fd335e5e91d19cd2623 upstream. IORING_SEND_VECTORIZED with registered buffers is not implemented but could be. Don't silently ignore the flag in this case but reject it with an error. It only affects sendzc as normal sends don't support registered buffers. Fixes: 6f02527729bd3 ("io_uring/net: Allow to do vectorized send") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19	io_uring/zcrx: use READ_ONCE with user shared RQEs	Pavel Begunkov	1	-2/+3
	commit 531bb98a030cc1073bd7ed9a502c0a3a781e92ee upstream. Refill queue entries are shared with the user space, use READ_ONCE when reading them. Fixes: 34a3e60821ab9 ("io_uring/zcrx: implement zerocopy receive pp memory provider"); Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04	io_uring/zcrx: fix user_ref race between scrub and refill paths	Kai Aizen	1	-3/+7
	[ Upstream commit 003049b1c4fb8aabb93febb7d1e49004f6ad653b ] The io_zcrx_put_niov_uref() function uses a non-atomic check-then-decrement pattern (atomic_read followed by separate atomic_dec) to manipulate user_refs. This is serialized against other callers by rq_lock, but io_zcrx_scrub() modifies the same counter with atomic_xchg() WITHOUT holding rq_lock. On SMP systems, the following race exists: CPU0 (refill, holds rq_lock) CPU1 (scrub, no rq_lock) put_niov_uref: atomic_read(uref) - 1 // window opens atomic_xchg(uref, 0) - 1 return_niov_freelist(niov) [PUSH #1] // window closes atomic_dec(uref) - wraps to -1 returns true return_niov(niov) return_niov_freelist(niov) [PUSH #2: DOUBLE-FREE] The same niov is pushed to the freelist twice, causing free_count to exceed nr_iovs. Subsequent freelist pushes then perform an out-of-bounds write (a u32 value) past the kvmalloc'd freelist array into the adjacent slab object. Fix this by replacing the non-atomic read-then-dec in io_zcrx_put_niov_uref() with an atomic_try_cmpxchg loop that atomically tests and decrements user_refs. This makes the operation safe against concurrent atomic_xchg from scrub without requiring scrub to acquire rq_lock. Fixes: 34a3e60821ab ("io_uring/zcrx: implement zerocopy receive pp memory provider") Cc: stable@vger.kernel.org Signed-off-by: Kai Aizen <kai@snailsploit.com> [pavel: removed a warning and a comment] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04	io_uring/cmd_net: fix too strict requirement on ioctl	Asbjørn Sloth Tønnesen	1	-3/+6
	[ Upstream commit 600b665b903733bd60334e86031b157cc823ee55 ] Attempting SOCKET_URING_OP_SETSOCKOPT on an AF_NETLINK socket resulted in an -EOPNOTSUPP, as AF_NETLINK doesn't have an ioctl in its struct proto, but only in struct proto_ops. Prior to the blamed commit, io_uring_cmd_sock() only had two cmd_op operations, both requiring ioctl, thus the check was warranted. Since then, 4 new cmd_op operations have been added, none of which depend on ioctl. This patch moves the ioctl check, so it only applies to the original operations. AFAICT, the ioctl requirement was unintentional, and it wasn't visible in the blamed patch within 3 lines of context. Cc: stable@vger.kernel.org Fixes: a5d2f99aff6b ("io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT") Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04	io_uring/zcrx: fix sgtable leak on mapping failures	Pavel Begunkov	1	-0/+3
	[ Upstream commit a983aae397767e9da931128ff2b5bf9066513ce3 ] In an unlikely case when io_populate_area_dma() fails, which could only happen on a PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA machine, io_zcrx_map_area() will have an initialised and not freed table. It was supposed to be cleaned up in the error path, but !is_mapped prevents that. Fixes: 439a98b972fbb ("io_uring/zcrx: deduplicate area mapping") Cc: stable@vger.kernel.org Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04	io_uring/openclose: fix io_pipe_fixed() slot tracking for specific slots	Jens Axboe	1	-3/+6
	[ Upstream commit f4d0668b38d8784f33a9a36c72ed5d0078247538 ] __io_fixed_fd_install() returns 0 on success for non-alloc mode (specific slot), not the slot index. io_pipe_fixed() used this return value directly as the slot index in fds[], which can cause the reported values returned via copy_to_user() to be incorrect, or the error path operating on the incorrect direct descriptor. Fix by computing the actual 0-based slot index (slot - 1) for specific slot mode, while preserving the existing behavior for auto-alloc mode where __io_fixed_fd_install() already returns the allocated index. Cc: stable@vger.kernel.org Fixes: 53db8a71ecb4 ("io_uring: add support for IORING_OP_PIPE") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04	io_uring/filetable: clamp alloc_hint to the configured alloc range	Jens Axboe	1	-0/+4
	[ Upstream commit a6bded921ed35f21b3f6bd8e629bf488499ca442 ] Explicit fixed file install/remove operations on slots outside the configured alloc range can corrupt alloc_hint via io_file_bitmap_set() and io_file_bitmap_clear(), which unconditionally update alloc_hint to the bit position. This causes subsequent auto-allocations to fall outside the configured range. For example, if the alloc range is [10, 20) and a file is removed at slot 2, alloc_hint gets set to 2. The next auto-alloc then starts searching from slot 2, potentially returning a slot below the range. Fix this by clamping alloc_hint to [file_alloc_start, file_alloc_end) at the top of io_file_bitmap_get() before starting the search. Cc: stable@vger.kernel.org Fixes: 6e73dffbb93c ("io_uring: let to set a range for file slot allocation") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04	io_uring/net: don't continue send bundle if poll was required for retry	Jens Axboe	1	-1/+5
	[ Upstream commit 806ae939c41e5da1d94a1e2b31f5702e96b6c3e3 ] If a send bundle has picked a bunch of buffers, then it needs to send all of those to be complete. This may require poll arming, if the send buffer ends up being full. Once a send bundle has been poll armed, no further bundles should be attempted. This allows a current bundle to complete even though it needs to go through polling to do so, but it will not allow another bundle to be started once that has happened. Ideally we would abort a bundle if it was only partially sent, but as some parts of it already went out on the wire, this obviously isn't feasible. Not continuing more bundle attempts post encountering a full socket buffer is the second best thing. Cc: stable@vger.kernel.org Fixes: a05d1f625c7a ("io_uring/net: support bundles for send") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04	io_uring/timeout: annotate data race in io_flush_timeouts()	Jens Axboe	1	-1/+1
	[ Upstream commit 42b12cb5fd4554679bac06bbdd05dc8b643bcc42 ] syzbot correctly reports this as a KCSAN race, as ctx->cached_cq_tail should be read under ->uring_lock. This isn't immediately feasible in io_flush_timeouts(), but as long as we read a stable value, that should be good enough. If two io-wq threads compete on this value, then they will both end up calling io_flush_timeouts() and at least one of them will see the correct value. Reported-by: syzbot+6c48db7d94402407301e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27	io_uring/rsrc: clean up buffer cloning arg validation	Joanne Koong	1	-21/+6
	commit b8201b50e403815f941d1c6581a27fdbfe7d0fd4 upstream. Get rid of some redundant checks and move the src arg validation to before the buffer table allocation, which simplifies error handling. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27	io_uring/cancel: de-unionize file and user_data in struct io_cancel_data	Jens Axboe	1	-4/+2
	[ Upstream commit 22dbb0987bd1e0ec3b1e4ad20756a98f99aa4a08 ] By having them share the same space in struct io_cancel_data, it ends up disallowing IORING_ASYNC_CANCEL_FD\|IORING_ASYNC_CANCEL_USERDATA from working. Eg you cannot match on both a file and user_data for cancelation purposes. This obviously isn't a common use case as nobody has reported this, but it does result in -ENOENT potentially being returned when trying to match on both, rather than actually doing what the API says it would. Fixes: 4bf94615b888 ("io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27	io_uring: delay sqarray static branch disablement	Pavel Begunkov	1	-4/+4
	[ Upstream commit 56112578c71213a10c995a56835bddb5e9ab1ed0 ] io_key_has_sqarray static branch can be easily switched on/off by the user every time patching the kernel. That can be very disruptive as it might require heavy synchronisation across all CPUs. Use deferred static keys, which can rate-limit it by deferring, batching and potentially effectively eliminating dec+inc pairs. Fixes: 9b296c625ac1d ("io_uring: static_key for !IORING_SETUP_NO_SQARRAY") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27	io_uring/kbuf: fix memory leak if io_buffer_add_list fails	Jens Axboe	1	-2/+3
	[ Upstream commit 442ae406603a94f1a263654494f425302ceb0445 ] io_register_pbuf_ring() ignores the return value of io_buffer_add_list(), which can fail if xa_store() returns an error (e.g., -ENOMEM). When this happens, the function returns 0 (success) to the caller, but the io_buffer_list structure is neither added to the xarray nor freed. In practice this requires failure injection to hit, hence not a real issue. But it should get fixed up none the less. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27	io_uring/sync: validate passed in offset	Jens Axboe	1	-0/+2
	[ Upstream commit 649dd18f559891bdafc5532d737c7dfb56060a6d ] Check if the passed in offset is negative once cast to sync->off. This ensures that -EINVAL is returned for that case, like it would be for sync_file_range(2). Fixes: c992fe2925d7 ("io_uring: add fsync support") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27	io_uring: use release-acquire ordering for IORING_SETUP_R_DISABLED	Caleb Sander Mateos	3	-4/+17
	[ Upstream commit 7a8737e1132ff07ca225aa7a4008f87319b5b1ca ] io_uring_enter(), __io_msg_ring_data(), and io_msg_send_fd() read ctx->flags and ctx->submitter_task without holding the ctx's uring_lock. This means they may race with the assignment to ctx->submitter_task and the clearing of IORING_SETUP_R_DISABLED from ctx->flags in io_register_enable_rings(). Ensure the correct ordering of the ctx->flags and ctx->submitter_task memory accesses by storing to ctx->flags using release ordering and loading it using acquire ordering. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 4add705e4eeb ("io_uring: remove io_register_submitter") Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-19	io_uring/fdinfo: be a bit nicer when looping a lot of SQEs/CQEs	Jens Axboe	1	-3/+8
	[ Upstream commit 38cfdd9dd279473a73814df9fd7e6e716951d361 ] Add cond_resched() in those dump loops, just in case a lot of entries are being dumped. And detect invalid CQ ring head/tail entries, to avoid iterating more than what is necessary. Generally not an issue, but can be if things like KASAN or other debugging metrics are enabled. Reported-by: 是参差 <shicenci@gmail.com> Link: https://lore.kernel.org/all/PS1PPF7E1D7501FE5631002D242DD89403FAB9BA@PS1PPF7E1D7501F.apcprd02.prod.outlook.com/ Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-16	io_uring: allow io-wq workers to exit when unused	Li Chen	1	-0/+11
	commit 91214661489467f8452d34edbf257488d85176e4 upstream. io_uring keeps a per-task io-wq around, even when the task no longer has any io_uring instances. If the task previously used io_uring for file I/O, this can leave an unrelated iou-wrk-* worker thread behind after the last io_uring instance is gone. When the last io_uring ctx is removed from the task context, mark the io-wq exit-on-idle so workers can go away. Clear the flag on subsequent io_uring usage. Signed-off-by: Li Chen <me@linux.beauty> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-16	io_uring/io-wq: add exit-on-idle state	Li Chen	2	-2/+26
	commit 38aa434ab9335ce2d178b7538cdf01d60b2014c3 upstream. io-wq uses an idle timeout to shrink the pool, but keeps the last worker around indefinitely to avoid churn. For tasks that used io_uring for file I/O and then stop using io_uring, this can leave an iou-wrk-* thread behind even after all io_uring instances are gone. This is unnecessary overhead and also gets in the way of process checkpoint/restore. Add an exit-on-idle state that makes all io-wq workers exit as soon as they become idle, and provide io_wq_set_exit_on_idle() to toggle it. Signed-off-by: Li Chen <me@linux.beauty> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-11	io_uring/zcrx: fix page array leak	Pavel Begunkov	1	-0/+1
	[ Upstream commit 0ae91d8ab70922fb74c22c20bedcb69459579b1c ] d9f595b9a65e ("io_uring/zcrx: fix leaking pages on sg init fail") fixed a page leakage but didn't free the page array, release it as well. Fixes: b84621d96ee02 ("io_uring/zcrx: allocate sgtable for umem areas") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11	io_uring/rw: free potentially allocated iovec on cache put failure	Jens Axboe	1	-4/+11
	[ Upstream commit 4b9748055457ac3a0710bf210c229d01ea1b01b9 ] If a read/write request goes through io_req_rw_cleanup() and has an allocated iovec attached and fails to put to the rw_cache, then it may end up with an unaccounted iovec pointer. Have io_rw_recycle() return whether it recycled the request or not, and use that to gauge whether to free a potential iovec or not. Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11	io_uring: use GFP_NOWAIT for overflow CQEs on legacy rings	Alexandre Negrel	1	-1/+1
	[ Upstream commit fc5ff2500976cd2710a7acecffd12d95ee4f98fc ] Allocate the overflowing CQE with GFP_NOWAIT instead of GFP_ATOMIC. This changes causes allocations to fail earlier in out-of-memory situations, rather than being deferred. Using GFP_ATOMIC allows a process to exceed memory limits. Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220794 Signed-off-by: Alexandre Negrel <alexandre@negrel.dev> Link: https://lore.kernel.org/io-uring/20251229201933.515797-1-alexandre@negrel.dev/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-30	io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop	Jens Axboe	1	-1/+1
	commit 10dc959398175736e495f71c771f8641e1ca1907 upstream. Currently this is checked before running the pending work. Normally this is quite fine, as work items either end up blocking (which will create a new worker for other items), or they complete fairly quickly. But syzbot reports an issue where io-wq takes seemingly forever to exit, and with a bit of debugging, this turns out to be because it queues a bunch of big (2GB - 4096b) reads with a /dev/msr* file. Since this file type doesn't support ->read_iter(), loop_rw_iter() ends up handling them. Each read returns 16MB of data read, which takes 20 (!!) seconds. With a bunch of these pending, processing the whole chain can take a long time. Easily longer than the syzbot uninterruptible sleep timeout of 140 seconds. This then triggers a complaint off the io-wq exit path: INFO: task syz.4.135:6326 blocked for more than 143 seconds. Not tainted syzkaller #0 Blocked by coredump. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz.4.135 state:D stack:26824 pid:6326 tgid:6324 ppid:5957 task_flags:0x400548 flags:0x00080000 Call Trace: <TASK> context_switch kernel/sched/core.c:5256 [inline] __schedule+0x1139/0x6150 kernel/sched/core.c:6863 __schedule_loop kernel/sched/core.c:6945 [inline] schedule+0xe7/0x3a0 kernel/sched/core.c:6960 schedule_timeout+0x257/0x290 kernel/time/sleep_timeout.c:75 do_wait_for_common kernel/sched/completion.c:100 [inline] __wait_for_common+0x2fc/0x4e0 kernel/sched/completion.c:121 io_wq_exit_workers io_uring/io-wq.c:1328 [inline] io_wq_put_and_exit+0x271/0x8a0 io_uring/io-wq.c:1356 io_uring_clean_tctx+0x10d/0x190 io_uring/tctx.c:203 io_uring_cancel_generic+0x69c/0x9a0 io_uring/cancel.c:651 io_uring_files_cancel include/linux/io_uring.h:19 [inline] do_exit+0x2ce/0x2bd0 kernel/exit.c:911 do_group_exit+0xd3/0x2a0 kernel/exit.c:1112 get_signal+0x2671/0x26d0 kernel/signal.c:3034 arch_do_signal_or_restart+0x8f/0x7e0 arch/x86/kernel/signal.c:337 __exit_to_user_mode_loop kernel/entry/common.c:41 [inline] exit_to_user_mode_loop+0x8c/0x540 kernel/entry/common.c:75 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline] syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:159 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:194 [inline] do_syscall_64+0x4ee/0xf80 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa02738f749 RSP: 002b:00007fa0281ae0e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: fffffffffffffe00 RBX: 00007fa0275e6098 RCX: 00007fa02738f749 RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fa0275e6098 RBP: 00007fa0275e6090 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fa0275e6128 R14: 00007fff14e4fcb0 R15: 00007fff14e4fd98 There's really nothing wrong here, outside of processing these reads will take a LONG time. However, we can speed up the exit by checking the IO_WQ_BIT_EXIT inside the io_worker_handle_work() loop, as syzbot will exit the ring after queueing up all of these reads. Then once the first item is processed, io-wq will simply cancel the rest. That should avoid syzbot running into this complaint again. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/68a2decc.050a0220.e29e5.0099.GAE@google.com/ Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23	io_uring: move local task_work in exit cancel loop	Ming Lei	1	-4/+4
	commit da579f05ef0faada3559e7faddf761c75cdf85e1 upstream. With IORING_SETUP_DEFER_TASKRUN, task work is queued to ctx->work_llist (local work) rather than the fallback list. During io_ring_exit_work(), io_move_task_work_from_local() was called once before the cancel loop, moving work from work_llist to fallback_llist. However, task work can be added to work_llist during the cancel loop itself. There are two cases: 1) io_kill_timeouts() is called from io_uring_try_cancel_requests() to cancel pending timeouts, and it adds task work via io_req_queue_tw_complete() for each cancelled timeout: 2) URING_CMD requests like ublk can be completed via io_uring_cmd_complete_in_task() from ublk_queue_rq() during canceling, given ublk request queue is only quiesced when canceling the 1st uring_cmd. Since io_allowed_defer_tw_run() returns false in io_ring_exit_work() (kworker != submitter_task), io_run_local_work() is never invoked, and the work_llist entries are never processed. This causes io_uring_try_cancel_requests() to loop indefinitely, resulting in 100% CPU usage in kworker threads. Fix this by moving io_move_task_work_from_local() inside the cancel loop, ensuring any work on work_llist is moved to fallback before each cancel attempt. Cc: stable@vger.kernel.org Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-17	io_uring/io-wq: fix incorrect io_wq_for_each_worker() termination logic	Jens Axboe	1	-3/+3
	commit e0392a10c9e80a3991855a81317da3039fcbe32c upstream. A previous commit added this helper, and had it terminate if false is returned from the handler. However, that is completely opposite, it should abort the loop if true is returned. Fix this up by having io_wq_for_each_worker() keep iterating as long as false is returned, and only abort if true is returned. Cc: stable@vger.kernel.org Fixes: 751eedc4b4b7 ("io_uring/io-wq: move worker lists to struct io_wq_acct") Reported-by: Lewis Campbell <info@lewiscampbell.tech> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-02	io_uring/rsrc: fix lost entries after cloned range	Joanne Koong	1	-1/+11
	commit 525916ce496615f531091855604eab9ca573b195 upstream. When cloning with node replacements (IORING_REGISTER_DST_REPLACE), destination entries after the cloned range are not copied over. Add logic to copy them over to the new destination table. Fixes: c1329532d5aa ("io_uring/rsrc: allow cloning with node replacements") Cc: stable@vger.kernel.org Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-02	io_uring: fix filename leak in __io_openat_prep()	Prithvi Tambewagh	1	-1/+1
	commit b14fad555302a2104948feaff70503b64c80ac01 upstream. __io_openat_prep() allocates a struct filename using getname(). However, for the condition of the file being installed in the fixed file table as well as having O_CLOEXEC flag set, the function returns early. At that point, the request doesn't have REQ_F_NEED_CLEANUP flag set. Due to this, the memory for the newly allocated struct filename is not cleaned up, causing a memory leak. Fix this by setting the REQ_F_NEED_CLEANUP for the request just after the successful getname() call, so that when the request is torn down, the filename will be cleaned up, along with other resources needing cleanup. Reported-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=00e61c43eb5e4740438f Tested-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com> Fixes: b9445598d8c6 ("io_uring: openat directly into fixed fd table") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-02	io_uring: fix min_wait wakeups for SQPOLL	Jens Axboe	1	-0/+3
	commit e15cb2200b934e507273510ba6bc747d5cde24a3 upstream. Using min_wait, two timeouts are given: 1) The min_wait timeout, within which up to 'wait_nr' events are waited for. 2) The overall long timeout, which is entered if no events are generated in the min_wait window. If the min_wait has expired, any event being posted must wake the task. For SQPOLL, that isn't the case, as it won't trigger the io_has_work() condition, as it will have already processed the task_work that happened when an event was posted. This causes any event to trigger post the min_wait to not always cause the waiting application to wakeup, and instead it will wait until the overall timeout has expired. This can be shown in a test case that has a 1 second min_wait, with a 5 second overall wait, even if an event triggers after 1.5 seconds: axboe@m2max-kvm /d/iouring-mre (master)> zig-out/bin/iouring info: MIN_TIMEOUT supported: true, features: 0x3ffff info: Testing: min_wait=1000ms, timeout=5s, wait_nr=4 info: 1 cqes in 5000.2ms where the expected result should be: axboe@m2max-kvm /d/iouring-mre (master)> zig-out/bin/iouring info: MIN_TIMEOUT supported: true, features: 0x3ffff info: Testing: min_wait=1000ms, timeout=5s, wait_nr=4 info: 1 cqes in 1500.3ms When the min_wait timeout triggers, reset the number of completions needed to wake the task. This should ensure that any future events will wake the task, regardless of how many events it originally wanted to wait for. Reported-by: Tip ten Brink <tip@tenbrinkmeijs.com> Cc: stable@vger.kernel.org Fixes: 1100c4a2656d ("io_uring: add support for batch wait timeout") Link: https://github.com/axboe/liburing/issues/1477 Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-02	io_uring/poll: correctly handle io_poll_add() return value on update	Jens Axboe	1	-2/+7
	commit 84230ad2d2afbf0c44c32967e525c0ad92e26b4e upstream. When the core of io_uring was updated to handle completions consistently and with fixed return codes, the POLL_REMOVE opcode with updates got slightly broken. If a POLL_ADD is pending and then POLL_REMOVE is used to update the events of that request, if that update causes the POLL_ADD to now trigger, then that completion is lost and a CQE is never posted. Additionally, ensure that if an update does cause an existing POLL_ADD to complete, that the completion value isn't always overwritten with -ECANCELED. For that case, whatever io_poll_add() set the value to should just be retained. Cc: stable@vger.kernel.org Fixes: 97b388d70b53 ("io_uring: handle completions in the core") Reported-by: syzbot+641eec6b7af1f62f2b99@syzkaller.appspotmail.com Tested-by: syzbot+641eec6b7af1f62f2b99@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-02	io_uring: fix nr_segs calculation in io_import_kbuf	huang-jl	1	-0/+1
	[ Upstream commit 114ea9bbaf7681c4d363e13b7916e6fef6a4963a ] io_import_kbuf() calculates nr_segs incorrectly when iov_offset is non-zero after iov_iter_advance(). It doesn't account for the partial consumption of the first bvec. The problem comes when meet the following conditions: 1. Use UBLK_F_AUTO_BUF_REG feature of ublk. 2. The kernel will help to register the buffer, into the io uring. 3. Later, the ublk server try to send IO request using the registered buffer in the io uring, to read/write to fuse-based filesystem, with O_DIRECT. >From a userspace perspective, the ublk server thread is blocked in the kernel, and will see "soft lockup" in the kernel dmesg. When ublk registers a buffer with mixed-size bvecs like [4K]6 + [12K] and a request partially consumes a bvec, the next request's nr_segs calculation uses bvec->bv_len instead of (bv_len - iov_offset). This causes fuse_get_user_pages() to loop forever because nr_segs indicates fewer pages than actually needed. Specifically, the infinite loop happens at: fuse_get_user_pages() -> iov_iter_extract_pages() -> iov_iter_extract_bvec_pages() Since the nr_segs is miscalculated, the iov_iter_extract_bvec_pages returns when finding that i->nr_segs is zero. Then iov_iter_extract_pages returns zero. However, fuse_get_user_pages does still not get enough data/pages, causing infinite loop. Example: - Bvecs: [4K, 4K, 4K, 4K, 4K, 4K, 12K, ...] - Request 1: 32K at offset 0, uses 64K + 8K of the 12K bvec - Request 2: 32K at offset 32K - iov_offset = 8K (8K already consumed from 12K bvec) - Bug: calculates using 12K, not (12K - 8K) = 4K - Result: nr_segs too small, infinite loop in fuse_get_user_pages. Fix by accounting for iov_offset when calculating the first segment's available length. Fixes: b419bed4f0a6 ("io_uring/rsrc: ensure segments counts are correct on kbuf buffers") Signed-off-by: huang-jl <huang-jl@deepseek.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18	io_uring/kbuf: use READ_ONCE() for userspace-mapped memory	Caleb Sander Mateos	1	-5/+5
	[ Upstream commit 78385c7299f7514697d196b3233a91bd5e485591 ] The struct io_uring_buf elements in a buffer ring are in a memory region accessible from userspace. A malicious/buggy userspace program could therefore write to them at any time, so they should be accessed with READ_ONCE() in the kernel. Commit 98b6fa62c84f ("io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths") already switched the reads of the len field to READ_ONCE(). Do the same for bid and addr. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Cc: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18	io_uring/zcrx: call netdev_queue_get_dma_dev() under instance lock	David Wei	1	-6/+10
	[ Upstream commit b6c5f9454ef34fd2753ba7843ef4d9a295c43eee ] netdev ops must be called under instance lock or rtnl_lock, but io_register_zcrx_ifq() isn't doing this for netdev_queue_get_dma_dev(). Fix this by taking the instance lock using netdev_get_by_index_lock(). Extended the instance lock section to include attaching a memory provider. Could not move io_zcrx_create_area() outside, since the dmabuf codepath IORING_ZCRX_AREA_DMABUF requires ifq->dev. Fixes: 59b8b32ac8d4 ("io_uring/zcrx: add support for custom DMA devices") Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>