kernel/linux.git/io_uring, branch v6.18.36

io_uring/wait: fix min_timeout behavior

2026-06-19T11:44:10+00:00

Commit 29fe1bd01b99714f3136f922230a643c2742cda9 upstream. The wakeup condition if a min timeout is present and has expired is that at least _one_ CQE was posted. Thus set the cq_tail target to ->cq_min_tail + 1. Without this commit a spurious wakeup can result in a premature wakeup because io_should_wake() will return true even if _no_ CQE was posted at all. Cc: Tip ten Brink Fixes: e15cb2200b93 ("io_uring: fix min_wait wakeups for SQPOLL") Cc: stable@vger.kernel.org Signed-off-by: Christian A. Ehrhardt Link: https://patch.msgid.link/20260606201120.1441447-1-lk@c--e.de Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

io_uring/kbuf: don't truncate end buffer for bundles

2026-06-19T11:44:10+00:00

Commit 70f4886bcbb929e88038c8807f1daf7fc587ae7c upstream. If buffers have been peeked for a bundle receive, the kernel will truncate the end buffer, if the available length is shorter than the buffer itself. This is unnecessary, as applications iterating bundle receives must always use the minimum size of the buffer length and the remaining number of bytes in the bundle. The examples in liburing do that as well, eg examples/proxy.c. If the kernel does truncate this buffer AND the current transfer fails, then the buffer will be left with a smaller size than what is otherwise available. Just remove the buffer truncation, as it's not necessary in the first place. Link: https://lore.kernel.org/io-uring/CAAEr8jbY60noGj1fw_k91UJRBkyiRVoS6=nLhZ7Svwidjn4CAA@mail.gmail.com/ Reported-by: Federico Brasili Cc: stable@vger.kernel.org Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries

2026-06-19T11:44:04+00:00

commit ed46f39c47eb5530a9c161481a2080d3a869cfaf upstream. When a bundle recv retries inside io_recv_finish(), the merge logic OR the saved cflags from the previous iteration with the cflags returned by the new iteration: cflags = req->cqe.flags | (cflags & CQE_F_MASK); Bits listed in CQE_F_MASK are inherited from the new iteration, and all other bits (notably IORING_CQE_F_BUFFER and the buffer ID) come from the saved cflags. Before this change CQE_F_MASK covered only IORING_CQE_F_SOCK_NONEMPTY and IORING_CQE_F_MORE. When using provided buffer rings (IOU_PBUF_RING_INC) with incremental mode, and bundle recv, io_kbuf_inc_commit() can leave the head ring entry partially consumed, __io_put_kbufs() then sets IORING_CQE_F_BUF_MORE on the returned cflags so userspace knows the buffer ID will be reused for subsequent completions. Because IORING_CQE_F_BUF_MORE was not in CQE_F_MASK, the merge above silently dropped it whenever the final retry iteration partially consumed the buffer, and the subsequent req->cqe.flags = cflags & ~CQE_F_MASK save would have left a stale IORING_CQE_F_BUF_MORE in the carried-over cflags had one been present. Userspace would then wrongfully advance it ring head past an entry the kernel still uses. Add IORING_CQE_F_BUF_MORE to CQE_F_MASK so it is both inherited from the new iteration into the user-visible CQE and stripped from the saved cflags between iterations. Cc: stable@vger.kernel.org Signed-off-by: Clément Léger Assisted-by: Claude:claude-opus-4.6 Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://patch.msgid.link/20260604160715.2482972-1-cleger@meta.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

io_uring/nop: pass all errors to userspace

2026-06-01T15:51:08+00:00

[ Upstream commit e97ff8b62d4690c69297f0f6de874f0564cc01a4 ] This fixes an inconsistency where io_nop() called req_set_fail() based on ret, but passed just nop->result to userspace. Originally, ret is a even copy of nop->result, but is set to an error when such happens subsequently. Now that's also passed to userspace. Fixes: a85f31052bce ("io_uring/nop: add support for testing registered files and buffers") Signed-off-by: Alexander A. Klimov Link: https://patch.msgid.link/20260520180045.538533-1-grandmaster@al2klimov.de Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

io_uring/net: punt IORING_OP_BIND async if it needs file create

2026-06-01T15:51:01+00:00

[ Upstream commit ccd25890f73c082fe2657ed227b497d6ac5fdc40 ] For two reasons: 1) An opcode cannot block inside io_uring_enter() doing submissions, as it'll stall the submission side pipeline. 2) Ending up in sb_start_write() -> __sb_start_write() -> percpu_down_read_freezable() introduces a new lockdep edge, which it correctly complains about. Check if the socket type is AF_UNIX and has a non-empty pathname. If it does, mark it REQ_F_FORCE_ASYNC to punt the submission to io-wq rather than attempt to do it inline. Fixes: 7481fd93fa0a ("io_uring: Introduce IORING_OP_BIND") Reviewed-by: Gabriel Krisman Bertazi Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

io_uring/waitid: clear waitid info before copying it to userspace

2026-06-01T15:50:41+00:00

commit 93d93f5f8da791e98159795c6ef683f45bd95d13 upstream. IORING_OP_WAITID stores its result fields in struct io_waitid::info and later copies them to userspace siginfo. The prep path initializes the request arguments, but it does not initialize info itself. If the wait operation completes without reporting a child event, the common wait code can return without writing wo_info. In that case io_waitid_finish() still copies iw->info to userspace, exposing stale bytes from the reused io_kiocb command storage. Clear the result storage during prep so the io_uring path matches the regular waitid syscall, which uses a zero-initialized struct waitid_info. Fixes: f31ecf671ddc ("io_uring: add IORING_OP_WAITID support") Cc: stable@vger.kernel.org # 6.7+ Signed-off-by: Heechan Kang Link: https://patch.msgid.link/20260516184709.852814-1-gganji11@naver.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

io-wq: check that the predecessor is hashed in io_wq_remove_pending()

2026-05-23T11:07:18+00:00

commit d6a2d7b04b5a093021a7a0e2e69e9d5237dfa8cc upstream. io_wq_remove_pending() needs to fix up wq->hash_tail[] if the cancelled work was the tail of its hash bucket. When doing this, it checks whether the preceding entry in acct->work_list has the same hash value, but never checks that the predecessor is hashed at all. io_get_work_hash() is simply atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT, and the hash bits are never set for non-hashed work, so it returns 0. Thus, when a hashed bucket-0 work is cancelled while a non-hashed work is its list predecessor, the check spuriously passes and a pointer to the non-hashed io_kiocb is stored in wq->hash_tail[0]. Because non-hashed work is dequeued via the fast path in io_get_next_work(), which never touches hash_tail[], the stale pointer is never cleared. Therefore, after the non-hashed io_kiocb completes and is freed back to req_cachep, wq->hash_tail[0] is a dangling pointer. The io_wq is per-task (tctx->io_wq) and survives ring open/close, so the dangling pointer persists for the lifetime of the task; the next hashed bucket-0 enqueue dereferences it in io_wq_insert_work() and wq_list_add_after() writes through freed memory. Add the missing io_wq_is_hashed() check so a non-hashed predecessor never inherits a hash_tail[] slot. Cc: stable@vger.kernel.org Fixes: 204361a77f40 ("io-wq: fix hang after cancelling pending hashed work") Signed-off-by: Nicholas Carlini Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

io_uring/napi: cap busy_poll_to 10 msec

2026-05-23T11:07:11+00:00

[ Upstream commit df8599ee18c0e5fe343ffe0b4c379636b8bb839a ] Currently there's no cap on the maximum amount of time that napi is allowed to poll if no events are found, which can lead to kernel complaints on a task being stuck as there's no conditional rescheduling done within that loop. Just cap it to 10 msec in total, that's already way above any kind of sane value that will reap any benefits, yet low enough that it's nowhere near being able to trigger preemption complaints. Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support") Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

io_uring/zcrx: warn on freelist violations

2026-05-17T15:15:35+00:00

commit 770594e78c3964cf23cf5287f849437cdde9b7d0 upstream. The freelist is appropriately sized to always be able to take a free niov, but let's be more defensive and check the invariant with a warning. That should help to catch any double-free issues. Suggested-by: Kai Aizen Signed-off-by: Pavel Begunkov Link: https://patch.msgid.link/2f3cea363b04649755e3b6bb9ab66485a95936d5.1776760901.git.asml.silence@gmail.com Signed-off-by: Jens Axboe Signed-off-by: Harshit Mogalapalli Signed-off-by: Greg Kroah-Hartman

io_uring/zcrx: use guards for locking

2026-05-17T15:15:34+00:00

commit 898ad80d1207cbdb22b21bafb6de4adfd7627bd0 upstream. Convert last several places using manual locking to guards to simplify the code. Signed-off-by: Pavel Begunkov Link: https://patch.msgid.link/eb4667cfaf88c559700f6399da9e434889f5b04a.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe Signed-off-by: Harshit Mogalapalli Signed-off-by: Greg Kroah-Hartman