summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)AuthorFilesLines
5 daysio-wq: check that the predecessor is hashed in io_wq_remove_pending()Nicholas Carlini1-1/+2
commit d6a2d7b04b5a093021a7a0e2e69e9d5237dfa8cc upstream. io_wq_remove_pending() needs to fix up wq->hash_tail[] if the cancelled work was the tail of its hash bucket. When doing this, it checks whether the preceding entry in acct->work_list has the same hash value, but never checks that the predecessor is hashed at all. io_get_work_hash() is simply atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT, and the hash bits are never set for non-hashed work, so it returns 0. Thus, when a hashed bucket-0 work is cancelled while a non-hashed work is its list predecessor, the check spuriously passes and a pointer to the non-hashed io_kiocb is stored in wq->hash_tail[0]. Because non-hashed work is dequeued via the fast path in io_get_next_work(), which never touches hash_tail[], the stale pointer is never cleared. Therefore, after the non-hashed io_kiocb completes and is freed back to req_cachep, wq->hash_tail[0] is a dangling pointer. The io_wq is per-task (tctx->io_wq) and survives ring open/close, so the dangling pointer persists for the lifetime of the task; the next hashed bucket-0 enqueue dereferences it in io_wq_insert_work() and wq_list_add_after() writes through freed memory. Add the missing io_wq_is_hashed() check so a non-hashed predecessor never inherits a hash_tail[] slot. Cc: stable@vger.kernel.org Fixes: 204361a77f40 ("io-wq: fix hang after cancelling pending hashed work") Signed-off-by: Nicholas Carlini <nicholas@carlini.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysio_uring/napi: cap busy_poll_to 10 msecJens Axboe1-0/+2
[ Upstream commit df8599ee18c0e5fe343ffe0b4c379636b8bb839a ] Currently there's no cap on the maximum amount of time that napi is allowed to poll if no events are found, which can lead to kernel complaints on a task being stuck as there's no conditional rescheduling done within that loop. Just cap it to 10 msec in total, that's already way above any kind of sane value that will reap any benefits, yet low enough that it's nowhere near being able to trigger preemption complaints. Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysio_uring/zcrx: warn on freelist violationsPavel Begunkov1-0/+2
commit 770594e78c3964cf23cf5287f849437cdde9b7d0 upstream. The freelist is appropriately sized to always be able to take a free niov, but let's be more defensive and check the invariant with a warning. That should help to catch any double-free issues. Suggested-by: Kai Aizen <kai@snailsploit.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/2f3cea363b04649755e3b6bb9ab66485a95936d5.1776760901.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 daysio_uring/zcrx: use guards for lockingPavel Begunkov1-8/+7
commit 898ad80d1207cbdb22b21bafb6de4adfd7627bd0 upstream. Convert last several places using manual locking to guards to simplify the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/eb4667cfaf88c559700f6399da9e434889f5b04a.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
14 daysio_uring/tw: serialize ctx->retry_llist with ->uring_lockJens Axboe1-1/+11
commit 17666e2d7592c3e85260cafd3950121524acc2c5 upstream. The DEFER_TASKRUN local task work paths all run under ctx->uring_lock, which serializes them with each other and with the rest of the ring's hot paths. io_move_task_work_from_local() is the exception - it's called from io_ring_exit_work() on a kworker without holding the lock and from the iopoll cancelation side right after dropping it. ->work_llist is fine with this, as it's only ever updated via the expected paths. But the ->retry_llist is updated while runing, and hence it could potentially race between normal task_work running and the task-has-exited shutdown path. Simply grab ->uring_lock while moving the local work to the fallback list for exit purposes, which nicely serializes it across both the normal additions and the exit prune path. Cc: stable@vger.kernel.org Fixes: f46b9cdb22f7 ("io_uring: limit local tw done") Reported-by: Robert Femmer <robert.femmer@x41-dsec.de> Reported-by: Christian Reitter <invd@inhq.net> Reported-by: Michael Rodler <michael.rodler@x41-dsec.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
14 daysio_uring/kbuf: support min length left for incremental buffersMartin Michaelis2-1/+14
commit 7deba791ad495ce1d7921683f4f7d1190fa210d1 upstream. Incrementally consumed buffer rings are generally fully consumed, but it's quite possible that the application has a minimum size it needs to meet to avoid truncation. Currently that minimum limit is 1 byte, but this should be a setting that is the hands of the application. For recvmsg multishot, a prime use case for incrementally consumed buffers, the application may get spurious -EFAULT returned at the end of an incrementally consumed buffer, as less space is available than the headers need. Grab a u32 field in struct io_uring_buf_reg, which the application can use to inform the kernel of the minimum size that should be available in an incrementally consumed buffer. If less than that is available, the current buffer is fully processed and the next one will be picked. Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1433 Signed-off-by: Martin Michaelis <code@mgjm.de> [axboe: write commit message, change io_buffer_list member name] Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07io_uring/poll: ensure EPOLL_ONESHOT is propagated for EPOLL_URING_WAKEJens Axboe1-1/+3
commit 1967f0b1cafdde37aa9e08e6021c14bcc484b7a5 upstream. Commit: aacf2f9f382c ("io_uring: fix req->apoll_events") fixed an issue where poll->events and req->apoll_events weren't synchronized, but then when the commit referenced in Fixes got added, it didn't ensure the same thing. If we mask in EPOLLONESHOT in the regular EPOLL_URING_WAKE path, then ensure it's done for both. Including a link to the original report below, even though it's mostly nonsense. But it includes a reproducer that does show that IORING_CQE_F_MORE is set in the previous CQE, while no more CQEs will be generated for this request. Just ignore anything that pretends this is security related in any way, it's just the typical AI nonsense. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/CAM0zi7yQzF3eKncgHo4iVM5yFLAjsiob_ucqyWKs=hyd_GqiMg@mail.gmail.com/ Reported-by: Azizcan Daştan <azizcan.d@mileniumsec.com> Fixes: 4464853277d0 ("io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeups") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07io_uring/poll: fix signed comparison in io_poll_get_ownership()Longxuan Yu1-1/+1
commit 326941b22806cbf2df1fbfe902b7908b368cce42 upstream. io_poll_get_ownership() uses a signed comparison to check whether poll_refs has reached the threshold for the slowpath: if (unlikely(atomic_read(&req->poll_refs) >= IO_POLL_REF_BIAS)) atomic_read() returns int (signed). When IO_POLL_CANCEL_FLAG (BIT(31)) is set in poll_refs, the value becomes negative in signed arithmetic, so the >= 128 comparison always evaluates to false and the slowpath is never taken. Fix this by casting the atomic_read() result to unsigned int before the comparison, so that the cancel flag is treated as a large positive value and correctly triggers the slowpath. Fixes: a26a35e9019f ("io_uring: make poll refs more robust") Cc: stable@vger.kernel.org Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Longxuan Yu <ylong030@ucr.edu> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3a3508b08bcd7f1bc3beff848ae6e1d73d355043.1775965597.git.ylong030@ucr.edu Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07io_uring/zcrx: fix user_struct uafPavel Begunkov1-1/+1
commit 0fcccfd87152f957fa8312b841f6efef42a05a20 upstream. io_free_rbuf_ring() usees a struct user_struct, which io_zcrx_ifq_free() puts it down before destroying the ring. Cc: stable@vger.kernel.org Fixes: 5c686456a4e83 ("io_uring/zcrx: add user_struct and mm_struct to io_zcrx_ifq") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/e560ae00960d27a810522a7efc0e201c82dff351.1776760917.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07io_uring/register: fix ring resizing with mixed/large SQEs/CQEsJens Axboe1-8/+28
commit 45cd95763e198d74d369ede43aef0b1955b8dea4 upstream. The ring resizing only properly handles "normal" sized SQEs or CQEs, if there are pending entries around a resize. This normally should not be the case, but the code is supposed to handle this regardless. For the mixed SQE/CQE cases, the current copying works fine as they are indexed in the same way. Each half is just copied separately. But for fixed large SQEs and CQEs, the iteration and copy need to take that into account. Cc: stable@kernel.org Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07io_uring/timeout: check unused sqe fieldsPavel Begunkov1-0/+4
commit 484ae637a3e3d909718de7c07afd3bb34b6b8504 upstream. Zero check unused SQE fields addr3 and pad2 for timeout and timeout update requests. They're not needed now, but could be used sometime in the future. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07io_uring/zcrx: return back two step unregistrationPavel Begunkov3-3/+51
commit e5361d25e241ac3a23177fa74ae91d049bad00d3 upstream. There are reports where io_uring instance removal takes too long and an ifq reallocation by another zcrx instance fails. Split zcrx destruction into two steps similarly how it was before, first close the queue early but maintain zcrx alive, and then when all inflight requests are completed, drop the main zcrx reference. For extra protection, mark terminated zcrx instances in xarray and warn if we double put them. Cc: stable@vger.kernel.org # 6.19+ Link: https://github.com/axboe/liburing/issues/1550 Reported-by: Youngmin Choi <youngminchoi94@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/0ce21f0565ab4358668922a28a8a36922dfebf76.1774261953.git.asml.silence@gmail.com [axboe: NULL ifq before break inside scoped guard] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-03Merge tag 'io_uring-7.0-20260403' of ↵Linus Torvalds7-29/+87
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - A previous fix in this release covered the case of the rings being RCU protected during resize, but it missed a few spots. This covers the rest - Fix the cBPF filters when COW'ed, introduced in this merge window - Fix for an attempt to import a zero sized buffer - Fix for a missing clamp in importing bundle buffers * tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/bpf_filters: retain COW'ed settings on parse failures io_uring: protect remaining lockless ctx->rings accesses with RCU io_uring/rsrc: reject zero-length fixed buffer import io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs()
2026-04-01io_uring/bpf_filters: retain COW'ed settings on parse failuresJens Axboe1-1/+9
If io_parse_restrictions() fails, it ends up clearing any restrictions currently set. The intent is only to clear whatever it already applied, but it ends up clearing everything, including whatever settings may have been applied in a copy-on-write fashion already. Ensure that those are retained. Link: https://lore.kernel.org/io-uring/CAK8a0jzF-zaO5ZmdOrmfuxrhXuKg5m5+RDuO7tNvtj=kUYbW7Q@mail.gmail.com/ Reported-by: antonius <bluedragonsec2023@gmail.com> Fixes: ed82f35b926b ("io_uring: allow registration of per-task restrictions") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01io_uring: protect remaining lockless ctx->rings accesses with RCUJens Axboe4-28/+70
Commit 96189080265e addressed one case of ctx->rings being potentially accessed while a resize is happening on the ring, but there are still a few others that need handling. Add a helper for retrieving the rings associated with an io_uring context, and add some sanity checking to that to catch bad uses. ->rings_rcu is always valid, as long as it's used within RCU read lock. Any use of ->rings_rcu or ->rings inside either ->uring_lock or ->completion_lock is sane as well. Do the minimum fix for the current kernel, but set it up such that this basic infra can be extended for later kernels to make this harder to mess up in the future. Thanks to Junxi Qian for finding and debugging this issue. Cc: stable@vger.kernel.org Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Reviewed-by: Junxi Qian <qjx1298677004@gmail.com> Tested-by: Junxi Qian <qjx1298677004@gmail.com> Link: https://lore.kernel.org/io-uring/20260330172348.89416-1-qjx1298677004@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-29io_uring/rsrc: reject zero-length fixed buffer importQi Tang1-0/+4
validate_fixed_range() admits buf_addr at the exact end of the registered region when len is zero, because the check uses strict greater-than (buf_end > imu->ubuf + imu->len). io_import_fixed() then computes offset == imu->len, which causes the bvec skip logic to advance past the last bio_vec entry and read bv_offset from out-of-bounds slab memory. Return early from io_import_fixed() when len is zero. A zero-length import has no data to transfer and should not walk the bvec array at all. BUG: KASAN: slab-out-of-bounds in io_import_reg_buf+0x697/0x7f0 Read of size 4 at addr ffff888002bcc254 by task poc/103 Call Trace: io_import_reg_buf+0x697/0x7f0 io_write_fixed+0xd9/0x250 __io_issue_sqe+0xad/0x710 io_issue_sqe+0x7d/0x1100 io_submit_sqes+0x86a/0x23c0 __do_sys_io_uring_enter+0xa98/0x1590 Allocated by task 103: The buggy address is located 12 bytes to the right of allocated 584-byte region [ffff888002bcc000, ffff888002bcc248) Fixes: 8622b20f23ed ("io_uring: add validate_fixed_range() for validate fixed buffer") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Link: https://patch.msgid.link/20260329164936.240871-1-tpluszz77@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-29io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs()Junxi Qian1-0/+4
sqe->len is __u32 but gets stored into sr->len which is int. When userspace passes sqe->len values exceeding INT_MAX (e.g. 0xFFFFFFFF), sr->len overflows to a negative value. This negative value propagates through the bundle recv/send path: 1. io_recv(): sel.val = sr->len (ssize_t gets -1) 2. io_recv_buf_select(): arg.max_len = sel->val (size_t gets 0xFFFFFFFFFFFFFFFF) 3. io_ring_buffers_peek(): buf->len is not clamped because max_len is astronomically large 4. iov[].iov_len = 0xFFFFFFFF flows into io_bundle_nbufs() 5. io_bundle_nbufs(): min_t(int, 0xFFFFFFFF, ret) yields -1, causing ret to increase instead of decrease, creating an infinite loop that reads past the allocated iov[] array This results in a slab-out-of-bounds read in io_bundle_nbufs() from the kmalloc-64 slab, as nbufs increments past the allocated iovec entries. BUG: KASAN: slab-out-of-bounds in io_bundle_nbufs+0x128/0x160 Read of size 8 at addr ffff888100ae05c8 by task exp/145 Call Trace: io_bundle_nbufs+0x128/0x160 io_recv_finish+0x117/0xe20 io_recv+0x2db/0x1160 Fix this by rejecting negative sr->len values early in both io_sendmsg_prep() and io_recvmsg_prep(). Since sqe->len is __u32, any value > INT_MAX indicates overflow and is not a valid length. Fixes: a05d1f625c7a ("io_uring/net: support bundles for send") Cc: stable@vger.kernel.org Signed-off-by: Junxi Qian <qjx1298677004@gmail.com> Link: https://patch.msgid.link/20260329153909.279046-1-qjx1298677004@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-28Merge tag 'io_uring-7.0-20260327' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: "Just two small fixes, both fixing regressions added in the fdinfo code in 6.19 with the SQE mixed size support" * tag 'io_uring-7.0-20260327' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/fdinfo: fix OOB read in SQE_MIXED wrap check io_uring/fdinfo: fix SQE_MIXED SQE displaying
2026-03-27io_uring/fdinfo: fix OOB read in SQE_MIXED wrap checkNicholas Carlini1-1/+2
__io_uring_show_fdinfo() iterates over pending SQEs and, for 128-byte SQEs on an IORING_SETUP_SQE_MIXED ring, needs to detect when the second half of the SQE would be past the end of the sq_sqes array. The current check tests (++sq_head & sq_mask) == 0, but sq_head is only incremented when a 128-byte SQE is encountered, not on every iteration. The actual array index is sq_idx = (i + sq_head) & sq_mask, which can be sq_mask (the last slot) while the wrap check passes. Fix by checking sq_idx directly. Keep the sq_head increment so the loop still skips the second half of the 128-byte SQE on the next iteration. Fixes: 1cba30bf9fdd ("io_uring: add support for IORING_SETUP_SQE_MIXED") Signed-off-by: Nicholas Carlini <nicholas@carlini.com> Link: https://patch.msgid.link/20260327021823.3138396-1-nicholas@carlini.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-26io_uring/fdinfo: fix SQE_MIXED SQE displayingJens Axboe1-0/+1
When displaying pending SQEs for a MIXED ring, each 128-byte SQE increments sq_head to skip the second slot, but the loop counter is not adjusted. This can cause the loop to read past sq_tail by one entry for each 128-byte SQE encountered, displaying SQEs that haven't been made consumable yet by the application. Match the kernel's own consumption logic in io_init_req() which decrements what's left when consuming the extra slot. Fixes: 1cba30bf9fdd ("io_uring: add support for IORING_SETUP_SQE_MIXED") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-20Merge tag 'io_uring-7.0-20260320' of ↵Linus Torvalds2-5/+18
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - A bit of a work-around for AF_UNIX recv multishot, as the in-kernel implementation doesn't properly signal EOF. We'll likely rework this one going forward, but the fix is sufficient for now - Two fixes for incrementally consumed buffers, for non-pollable files and for 0 byte reads * tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/kbuf: propagate BUF_MORE through early buffer commit path io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF io_uring/poll: fix multishot recv missing EOF on wakeup race
2026-03-20io_uring/kbuf: propagate BUF_MORE through early buffer commit pathJens Axboe1-3/+7
When io_should_commit() returns true (eg for non-pollable files), buffer commit happens at buffer selection time and sel->buf_list is set to NULL. When __io_put_kbufs() generates CQE flags at completion time, it calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence cannot determine whether the buffer was consumed or not. This means that IORING_CQE_F_BUF_MORE is never set for non-pollable input with incrementally consumed buffers. Likewise for io_buffers_select(), which always commits upfront and discards the return value of io_kbuf_commit(). Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early commit. Then __io_put_kbuf_ring() can check this flag and set IORING_F_BUF_MORE accordingy. Reported-by: Martin Michaelis <code@mgjm.de> Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1553 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-20io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOFJens Axboe1-0/+4
For a zero length transfer, io_kbuf_inc_commit() is called with !len. Since we never enter the while loop to consume the buffers, io_kbuf_inc_commit() ends up returning true, consuming the buffer. But if no data was consumed, by definition it cannot have consumed the buffer. Return false for that case. Reported-by: Martin Michaelis <code@mgjm.de> Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1553 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-17io_uring/poll: fix multishot recv missing EOF on wakeup raceJens Axboe1-2/+7
When a socket send and shutdown() happen back-to-back, both fire wake-ups before the receiver's task_work has a chance to run. The first wake gets poll ownership (poll_refs=1), and the second bumps it to 2. When io_poll_check_events() runs, it calls io_poll_issue() which does a recv that reads the data and returns IOU_RETRY. The loop then drains all accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only the first event was consumed. Since the shutdown is a persistent state change, no further wakeups will happen, and the multishot recv can hang forever. Check specifically for HUP in the poll loop, and ensure that another loop is done to check for status if more than a single poll activation is pending. This ensures we don't lose the shutdown event. Cc: stable@vger.kernel.org Fixes: dbc2564cfe0f ("io_uring: let fast poll support multishot") Reported-by: Francis Brosseau <francis@malagauche.com> Link: https://github.com/axboe/liburing/issues/1549 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-13Merge tag 'io_uring-7.0-20260312' of ↵Linus Torvalds6-11/+55
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Fix an inverted true/false comment on task_no_new_privs, from the BPF filtering changes merged in this release - Use the migration disabling way of running the BPF filters, as the io_uring side doesn't do that already - Fix an issue with ->rings stability under resize, both for local task_work additions and for eventfd signaling - Fix an issue with SQE mixed mode, where a bounds check wasn't correct for having a 128b SQE - Fix an issue where a legacy provided buffer group is changed to to ring mapped one while legacy buffers from that group are in flight * tag 'io_uring-7.0-20260312' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/kbuf: check if target buffer list is still legacy on recycle io_uring: fix physical SQE bounds check for SQE_MIXED 128-byte ops io_uring/eventfd: use ctx->rings_rcu for flags checking io_uring: ensure ctx->rings is stable for task work flags manipulation io_uring/bpf_filter: use bpf_prog_run_pin_on_cpu() to prevent migration io_uring/register: fix comment about task_no_new_privs
2026-03-12io_uring/kbuf: check if target buffer list is still legacy on recycleJens Axboe1-2/+11
There's a gap between when the buffer was grabbed and when it potentially gets recycled, where if the list is empty, someone could've upgraded it to a ring provided type. This can happen if the request is forced via io-wq. The legacy recycling is missing checking if the buffer_list still exists, and if it's of the correct type. Add those checks. Cc: stable@vger.kernel.org Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Reported-by: Keenan Dong <keenanat2000@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-11io_uring: fix physical SQE bounds check for SQE_MIXED 128-byte opsTom Ryan1-1/+1
When IORING_SETUP_SQE_MIXED is used without IORING_SETUP_NO_SQARRAY, the boundary check for 128-byte SQE operations in io_init_req() validated the logical SQ head position rather than the physical SQE index. The existing check: !(ctx->cached_sq_head & (ctx->sq_entries - 1)) ensures the logical position isn't at the end of the ring, which is correct for NO_SQARRAY rings where physical == logical. However, when sq_array is present, an unprivileged user can remap any logical position to an arbitrary physical index via sq_array. Setting sq_array[N] = sq_entries - 1 places a 128-byte operation at the last physical SQE slot, causing the 128-byte memcpy in io_uring_cmd_sqe_copy() to read 64 bytes past the end of the SQE array. Replace the cached_sq_head alignment check with a direct validation of the physical SQE index, which correctly handles both sq_array and NO_SQARRAY cases. Fixes: 1cba30bf9fdd ("io_uring: add support for IORING_SETUP_SQE_MIXED") Signed-off-by: Tom Ryan <ryan36005@gmail.com> Link: https://patch.msgid.link/20260310052003.72871-1-ryan36005@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-11io_uring/eventfd: use ctx->rings_rcu for flags checkingJens Axboe1-3/+7
Similarly to what commit e78f7b70e837 did for local task work additions, use ->rings_rcu under RCU rather than dereference ->rings directly. See that commit for more details. Cc: stable@vger.kernel.org Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-11io_uring: ensure ctx->rings is stable for task work flags manipulationJens Axboe3-2/+33
If DEFER_TASKRUN | SETUP_TASKRUN is used and task work is added while the ring is being resized, it's possible for the OR'ing of IORING_SQ_TASKRUN to happen in the small window of swapping into the new rings and the old rings being freed. Prevent this by adding a 2nd ->rings pointer, ->rings_rcu, which is protected by RCU. The task work flags manipulation is inside RCU already, and if the resize ring freeing is done post an RCU synchronize, then there's no need to add locking to the fast path of task work additions. Note: this is only done for DEFER_TASKRUN, as that's the only setup mode that supports ring resizing. If this ever changes, then they too need to use the io_ctx_mark_taskrun() helper. Link: https://lore.kernel.org/io-uring/20260309062759.482210-1-naup96721@gmail.com/ Cc: stable@vger.kernel.org Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Reported-by: Hao-Yu Yang <naup96721@gmail.com> Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-09io_uring/bpf_filter: use bpf_prog_run_pin_on_cpu() to prevent migrationJens Axboe1-1/+1
Since the caller, __io_uring_run_bpf_filters(), doesn't prevent migration, it should use the migration disabling variant for running the BPF program. Fixes: d42eb05e60fe ("io_uring: add support for BPF filtering for opcode restrictions") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-09io_uring/register: fix comment about task_no_new_privsJann Horn1-2/+2
The actual code is right, but the comment is the wrong way around. Fixes: ed82f35b926b ("io_uring: allow registration of per-task restrictions") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-06Merge tag 'io_uring-7.0-20260305' of ↵Linus Torvalds2-3/+7
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Fix a typo in the mock_file help text - Fix a comment regarding IORING_SETUP_TASKRUN_FLAG in the io_uring.h UAPI header - Use READ_ONCE() for reading refill queue entries - Reject SEND_VECTORIZED for fixed buffer sends, as it isn't implemented. Currently this flag is silently ignored This is in preparation for making these work, but first we need a fixup so that older kernels will correctly reject them - Ensure "0" means default for the rx page size * tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/zcrx: use READ_ONCE with user shared RQEs io_uring/mock: Fix typo in help text io_uring/net: reject SEND_VECTORIZED when unsupported io_uring: correct comment for IORING_SETUP_TASKRUN_FLAG io_uring/zcrx: don't set rx_page_size when not requested
2026-03-04io_uring/zcrx: use READ_ONCE with user shared RQEsPavel Begunkov1-2/+3
Refill queue entries are shared with the user space, use READ_ONCE when reading them. Fixes: 34a3e60821ab9 ("io_uring/zcrx: implement zerocopy receive pp memory provider"); Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-02io_uring/net: reject SEND_VECTORIZED when unsupportedPavel Begunkov1-0/+2
IORING_SEND_VECTORIZED with registered buffers is not implemented but could be. Don't silently ignore the flag in this case but reject it with an error. It only affects sendzc as normal sends don't support registered buffers. Fixes: 6f02527729bd3 ("io_uring/net: Allow to do vectorized send") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-27io_uring/zcrx: don't set rx_page_size when not requestedJakub Kicinski1-1/+2
The rx_buf_len parameter was recently added to the Rx zero-copy implementation. The expectation is that when not set system will maintain previous behavior and use the default buffer size (PAGE_SIZE). This works correctly at the iouring level, but we don't preserve the same "zero means default" semantics when registering the memory provider on the netdev. mp_param.rx_page_size is unconditionally set to PAGE_SIZE. This causes __net_mp_open_rxq() to check for QCFG_RX_PAGE_SIZE support in the driver, and return -EOPNOTSUPP for drivers that don't advertise it -- even though the user never asked for large buffers. Only set mp_param.rx_page_size when rx_buf_len was explicitly provided, so that the default page size path works on all zcrx-capable drivers. mlx5 and fbnic only support 4kB pages in the current release. Fixes: 795663b4d160 ("io_uring/zcrx: implement large rx buffer support") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-27Merge tag 'io_uring-7.0-20260227' of ↵Linus Torvalds2-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: "Just two minor patches in here, ensuring the use of READ_ONCE() for sqe field reading is consistent across the codebase. There were two missing cases, now they are covered too" * tag 'io_uring-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/timeout: READ_ONCE sqe->addr io_uring/cmd_net: use READ_ONCE() for ->addr3 read
2026-02-25io_uring/timeout: READ_ONCE sqe->addrPavel Begunkov1-2/+2
We should use READ_ONCE when reading from a SQE, make sure timeout gets a stable timespec address. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-24io_uring/cmd_net: use READ_ONCE() for ->addr3 readJens Axboe1-1/+1
Any SQE read should use READ_ONCE(), to ensure the result is read once and only once. Doesn't really matter for this case, but it's better to keep these 100% consistent and always use READ_ONCE() for the prep side of SQE handling. Fixes: 5d24321e4c15 ("io_uring: Introduce getsockname io_uring cmd") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-22Convert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds1-1/+1
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds10-15/+15
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Merge tag 'kmalloc_obj-treewide-v7.0-rc1' of ↵Linus Torvalds15-44/+42
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kmalloc_obj conversion from Kees Cook: "This does the tree-wide conversion to kmalloc_obj() and friends using coccinelle, with a subsequent small manual cleanup of whitespace alignment that coccinelle does not handle. This uncovered a clang bug in __builtin_counted_by_ref(), so the conversion is preceded by disabling that for current versions of clang. The imminent clang 22.1 release has the fix. I've done allmodconfig build tests for x86_64, arm64, i386, and arm. I did defconfig builds for alpha, m68k, mips, parisc, powerpc, riscv, s390, sparc, sh, arc, csky, xtensa, hexagon, and openrisc" * tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kmalloc_obj: Clean up after treewide replacements treewide: Replace kmalloc with kmalloc_obj for non-scalar types compiler_types: Disable __builtin_counted_by_ref for Clang
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook15-44/+42
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-19io_uring: add IORING_OP_URING_CMD128 to opcode checksCaleb Sander Mateos3-3/+9
io_should_commit(), io_uring_classic_poll(), and io_do_iopoll() compare struct io_kiocb's opcode against IORING_OP_URING_CMD to implement special treatment for uring_cmds. The recently added opcode IORING_OP_URING_CMD128 is meant to be equivalent to IORING_OP_URING_CMD, so treat it the same way in these functions. Fixes: 1cba30bf9fdd ("io_uring: add support for IORING_SETUP_SQE_MIXED") Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-18io_uring/zcrx: fix user_ref race between scrub and refill pathsKai Aizen1-3/+7
The io_zcrx_put_niov_uref() function uses a non-atomic check-then-decrement pattern (atomic_read followed by separate atomic_dec) to manipulate user_refs. This is serialized against other callers by rq_lock, but io_zcrx_scrub() modifies the same counter with atomic_xchg() WITHOUT holding rq_lock. On SMP systems, the following race exists: CPU0 (refill, holds rq_lock) CPU1 (scrub, no rq_lock) put_niov_uref: atomic_read(uref) - 1 // window opens atomic_xchg(uref, 0) - 1 return_niov_freelist(niov) [PUSH #1] // window closes atomic_dec(uref) - wraps to -1 returns true return_niov(niov) return_niov_freelist(niov) [PUSH #2: DOUBLE-FREE] The same niov is pushed to the freelist twice, causing free_count to exceed nr_iovs. Subsequent freelist pushes then perform an out-of-bounds write (a u32 value) past the kvmalloc'd freelist array into the adjacent slab object. Fix this by replacing the non-atomic read-then-dec in io_zcrx_put_niov_uref() with an atomic_try_cmpxchg loop that atomically tests and decrements user_refs. This makes the operation safe against concurrent atomic_xchg from scrub without requiring scrub to acquire rq_lock. Fixes: 34a3e60821ab ("io_uring/zcrx: implement zerocopy receive pp memory provider") Cc: stable@vger.kernel.org Signed-off-by: Kai Aizen <kai@snailsploit.com> [pavel: removed a warning and a comment] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-17Merge tag 'io_uring-7.0-20260216' of ↵Linus Torvalds15-88/+129
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull more io_uring updates from Jens Axboe: "This is a mix of cleanups and fixes. No major fixes in here, just a bunch of little fixes. Some of them marked for stable as it fixes behavioral issues - Fix an issue with SOCKET_URING_OP_SETSOCKOPT for netlink sockets, due to a too restrictive check on it having an ioctl handler - Remove a redundant SQPOLL check in ring creation - Kill dead accounting for zero-copy send, which doesn't use ->buf or ->len post the initial setup - Fix missing clamp of the allocation hint, which could cause allocations to fall outside of the range the application asked for. Still within the allowed limits. - Fix for IORING_OP_PIPE's handling of direct descriptors - Tweak to the API for the newly added BPF filters, making them more future proof in terms of how applications deal with them - A few fixes for zcrx, fixing a few error handling conditions - Fix for zcrx request flag checking - Add support for querying the zcrx page size - Improve the NO_SQARRAY static branch inc/dec, avoiding busy conditions causing too much traffic - Various little cleanups" * tag 'io_uring-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/bpf_filter: pass in expected filter payload size io_uring/bpf_filter: move filter size and populate helper into struct io_uring/cancel: de-unionize file and user_data in struct io_cancel_data io_uring/rsrc: improve regbuf iov validation io_uring: remove unneeded io_send_zc accounting io_uring/cmd_net: fix too strict requirement on ioctl io_uring: delay sqarray static branch disablement io_uring/query: add query.h copyright notice io_uring/query: return support for custom rx page size io_uring/zcrx: check unsupported flags on import io_uring/zcrx: fix post open error handling io_uring/zcrx: fix sgtable leak on mapping failures io_uring: use the right type for creds iteration io_uring/openclose: fix io_pipe_fixed() slot tracking for specific slots io_uring/filetable: clamp alloc_hint to the configured alloc range io_uring/rsrc: replace reg buffer bit field with flags io_uring/zcrx: improve types for size calculation io_uring/tctx: avoid modifying loop variable in io_ring_add_registered_file io_uring: simplify IORING_SETUP_DEFER_TASKRUN && !SQPOLL check
2026-02-17io_uring/bpf_filter: pass in expected filter payload sizeJens Axboe1-16/+49
It's quite possible that opcodes that have payloads attached to them, like IORING_OP_OPENAT/OPENAT2 or IORING_OP_SOCKET, that these paylods can change over time. For example, on the openat/openat2 side, the struct open_how argument is extensible, and could be extended in the future to allow further arguments to be passed in. Allow registration of a cBPF filter to give the size of the filter as seen by userspace. If that filter is for an opcode that takes extra payload data, allow it if the application payload expectation is the same size than the kernels. If that is the case, the kernel supports filtering on the payload that the application expects. If the size differs, the behavior depends on the IO_URING_BPF_FILTER_SZ_STRICT flag: 1) If IO_URING_BPF_FILTER_SZ_STRICT is set and the size expectation differs, fail the attempt to load the filter. 2) If IO_URING_BPF_FILTER_SZ_STRICT isn't set, allow the filter if the userspace pdu size is smaller than what the kernel offers. 3) Regardless if IO_URING_BPF_FILTER_SZ_STRICT, fail loading the filter if the userspace pdu size is bigger than what the kernel supports. An attempt to load a filter due to sizing will error with -EMSGSIZE. For that error, the registration struct will have filter->pdu_size populated with the pdu size that the kernel uses. Reported-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-17io_uring/bpf_filter: move filter size and populate helper into structJens Axboe3-11/+18
Rather than open-code this logic in io_uring_populate_bpf_ctx() with a switch, move it to the issue side definitions. Outside of making this easier to extend in the future, it's also a prep patch for using the pdu size for a given opcode filter elsewhere. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-17io_uring/cancel: de-unionize file and user_data in struct io_cancel_dataJens Axboe1-4/+2
By having them share the same space in struct io_cancel_data, it ends up disallowing IORING_ASYNC_CANCEL_FD|IORING_ASYNC_CANCEL_USERDATA from working. Eg you cannot match on both a file and user_data for cancelation purposes. This obviously isn't a common use case as nobody has reported this, but it does result in -ENOENT potentially being returned when trying to match on both, rather than actually doing what the API says it would. Fixes: 4bf94615b888 ("io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16io_uring/rsrc: improve regbuf iov validationPavel Begunkov1-21/+10
Deduplicate io_buffer_validate() calls by moving the checks into io_sqe_buffer_register(). Now we also don't need special handling in io_buffer_validate() passing through buffer removal requests. I also was using it as a cleanup before some other changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16io_uring: remove unneeded io_send_zc accountingDylan Yudaken1-2/+0
zc->len and zc->buf are not actually used once you get to the retry stage. The buffer remains in kmsg->msg.msg_iter, which is setup in io_send_setup. Note: it still seems needed in io_send due to io_send_select_buffer needing it (for the len parameter). Signed-off-by: Dylan Yudaken <dyudaken@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>