kernel/linux.git/io_uring/io_uring.c, branch v7.2-rc1

Merge tag 'for-7.2/io_uring-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

2026-06-16T07:23:59+00:00

Pull io_uring updates from Jens Axboe:

 - Rework the task_work infrastructure. Both the local (DEFER_TASKRUN) and the normal (tctx) task_work lists were llist based, which is LIFO ordered, and hence each run had to do an O(n) list reversal pass first to restore queue order. Additionally, to cap the amount of task_work run, each method needed a retry list as well.

   Add a lockless MPSC FIFO queue (based on Dmitry Vyukov's intrusive MPSC algorithm) and switch both task_work lists to it. It performs better than llists, and we can then also ditch the retry lists, as entries are popped one at a time.

   On top of those changes, run the tctx fallback task_work directly and remove the now-unused per-ctx fallback machinery entirely.

 - zcrx user notifications. Add a mechanism for zcrx to communicate conditions back to userspace via a dedicated CQE, with the initial users being notification on running out of buffers and on a frag copy fallback, plus shared-memory notification statistics.

   Alongside that, a series of zcrx reliability and cleanup fixes: more reliable scrubbing, poisoning pointers on unregistration, dropping an extra ifq close, adding a ctx back-pointer, reordering fd allocation in the export path, and killing a dead 'sock' member.

 - Allow using io_uring registered buffers for plain SEND and RECV, not just for the zero-copy send path. This enables targets like ublk's NBD backend to push/pull IO data directly to/from a registered buffer over a plain send/recv on a TCP socket.

 - Registered buffer improvements: account huge pages correctly, bump the io_mapped_ubuf length field to size_t, and raise the previous 1GB registered buffer size limit.

 - Restrict the ctx access exposed to io_uring BPF struct_ops programs by handing them an opaque type rather than the full io_ring_ctx, and add a separate MAINTAINERS entry for the bpf-ops code.

 - Allow opcode filtering on IORING_OP_CONNECT.

 - Validate ring-provided buffer addresses with access_ok(), and align the legacy buffer add limit with MAX_BIDS_PER_BGID.

 - Various other cleanups and minor fixes, including avoiding msghdr async data on connect/bind, dropping async_size for OP_LISTEN, making the POLL_FIRST receive side checks consistent, re-checking IO_WQ_BIT_EXIT for each linked work item, and using trace_call__##name() at guarded tracepoint call sites.

* tag 'for-7.2/io_uring-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (31 commits)
  io_uring/bpf-ops: add a separate maintainer entry
  io_uring/net: make POLL_FIRST receive side checks consistent
  io_uring: remove the per-ctx fallback task_work machinery
  io_uring: run the tctx task_work fallback directly
  io_uring: switch normal task_work to a mpscq
  io_uring: switch local task_work to a mpscq
  io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
  io_uring: grab RCU read lock marking task run
  io_uring/zcrx: kill dead 'sock' member in struct io_zcrx_args
  io_uring/kbuf: validate ring provided buffer addresses with access_ok()
  io_uring/net: support registered buffer for plain send and recv
  io_uring/nop: Drop a wrong comment in struct io_nop
  io_uring/net: Remove async_size for OP_LISTEN
  io_uring/net: Avoid msghdr on op_connect/op_bind async data
  io_uring/bpf-ops: restrict ctx access to BPF
  io_uring/io-wq: re-check IO_WQ_BIT_EXIT for each linked work item
  io_uring/kbuf: align legacy buffer add limit with MAX_BIDS_PER_BGID
  io_uring/zcrx: add shared-memory notification statistics
  io_uring/zcrx: notify user on frag copy fallback
  io_uring/zcrx: notify user when out of buffers
  ...
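To illustrate why an llist-based task_work list needs a reversal pass, here is a minimal userspace sketch of a lockless LIFO list using C11 atomics. It is not the kernel's llist code: the names lnode, lstack, lstack_add(), lstack_del_all() and lstack_reverse() are hypothetical stand-ins chosen for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type; the kernel's llist_node plays this role. */
struct lnode {
	struct lnode *next;
	int seq;			/* order in which the entry was queued */
};

struct lstack {
	_Atomic(struct lnode *) first;	/* most recently added entry */
};

static void lstack_init(struct lstack *s)
{
	atomic_init(&s->first, NULL);
}

/* Producer: push onto the front with a compare-and-swap retry loop. */
static void lstack_add(struct lstack *s, struct lnode *n)
{
	struct lnode *old;

	assert(n != NULL);
	old = atomic_load_explicit(&s->first, memory_order_relaxed);
	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&s->first, &old, n,
							memory_order_release,
							memory_order_relaxed));
}

/* Consumer: detach the whole list in one exchange, newest entry first. */
static struct lnode *lstack_del_all(struct lstack *s)
{
	return atomic_exchange_explicit(&s->first, NULL, memory_order_acquire);
}

/* The O(n) pass that turns newest-first order back into queue order. */
static struct lnode *lstack_reverse(struct lnode *head)
{
	struct lnode *prev = NULL;

	while (head) {
		struct lnode *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}
```

Every run has to walk all detached entries once in lstack_reverse() before it can process the first one, which is the cost the FIFO queue described above is meant to remove.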

Merge tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

2026-06-16T03:14:43+00:00

Pull slab updates from Vlastimil Babka:

 - Support for "allocation tokens" (currently available in Clang 22+) for smarter partitioning of kmalloc caches based on the allocated object type, which can be enabled instead of the "random" per-caller-address-hash partitioning. It should be able to deterministically separate types containing a pointer from those that do not (Marco Elver)

 - Improvements and simplification of the kmem_cache_alloc_bulk() and mempool_alloc_bulk() API. This includes adaptation of callers (Christoph Hellwig)

 - Performance improvements and cleanups related mostly to sheaves refill (Hao Li, Shengming Hu, Vlastimil Babka)

 - Several fixups for the slabinfo tool (Xuewen Wang)

* tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: do not limit zeroing to orig_size when only red zoning is enabled
  mm/slub: preserve original size in _kmalloc_nolock_noprof retry path
  mm: simplify the mempool_alloc_bulk API
  mm/slab: improve kmem_cache_alloc_bulk
  mm/slub: detach and reattach partial slabs in batch
  mm/slub: introduce helpers for node partial slab state
  mm/slub: use empty sheaf helpers for oversized sheaves
  tools/mm/slabinfo: remove redundant slab->partial assignment
  tools/mm/slabinfo: remove dead assignment in get_obj_and_str()
  tools/mm/slabinfo: Fix trace disable logic inversion
  MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR
  mm/slub: fix typo in sheaves comment
  mm, slab: simplify returning slab in __refill_objects_node()
  mm, slab: add an optimistic __slab_try_return_freelist()
  slab: fix kernel-docs for mm-api
  slab: improve KMALLOC_PARTITION_RANDOM randomness
  slab: support for compiler-assisted type-based slab cache partitioning
  mm/slub: defer freelist construction until after bulk allocation from a new slab

io_uring: remove the per-ctx fallback task_work machinery

2026-06-13T12:27:20+00:00

With the tctx fallback running its entries directly, the per-ctx fallback work has a single user left: moving local (DEFER_TASKRUN) task_work entries out of a ring that is going away. Both of its call sites are process context and don't hold ->uring_lock, the same conditions the deferred fallback work itself ran under - so run the entries in cancel mode right there instead, and rename the helper to io_cancel_local_task_work() to match what it now does.

With that, ->fallback_llist, ->fallback_work, io_fallback_req_func() and __io_fallback_tw() can all go away, along with the fallback work flushing in the ring exit and cancel paths. Requests that get orphaned by an exiting task now run via the tctx fallback work, which the ring exit side implicitly waits on through the ctx refs those requests hold.

Signed-off-by: Jens Axboe

io_uring: switch local task_work to a mpscq

2026-06-13T12:27:06+00:00

The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO ordered, and hence __io_run_local_work() has to restore the right running order with an O(n) llist_reverse_order() pass first. On top of that, a batch that gets capped by max_events needs the leftover entries parked on a separate ->retry_llist, as they can't be pushed back to the shared list.

Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg retry loop, entries are popped in queue order with no reversal pass, capping a run simply leaves the remainder on the queue, and ->retry_llist goes away entirely. The consumer cursor, ->work_head, lives with the rest of the ->uring_lock protected state rather than next to the queue, so that popping entries doesn't dirty the producer side cacheline.

For low amounts of task_work, this ends up being a bit more efficient than the existing scheme. As an example of that, doing multishot receives for 8 clients has the following task_work overhead:

     1.02%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.88%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.60%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.14%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.64%  at ~46Gb/sec

and after this change:

     1.08%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.03%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.11%  at ~53Gb/sec

which has less overhead even though that test run was faster. For a case of having 1024 clients on a single ring:

     2.22%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.84%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.02%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     3.50%  at ~24Gb/sec

we start to see the llist reversing taking a considerable amount of time, and the total add+run task_work overhead is around 3.5%. After the change:

     0.90%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.32%  at ~26Gb/sec

most of that overhead is gone, and performance is better as well.

Caleb Sander Mateos reports that it improves the performance of a ublk 4kb workload by 4% [1], while testing v1 of this patchset.

[1] https://lore.kernel.org/io-uring/CADUfDZr-MMYBaP-e+y9+xuRhuiunO2sBTUCmwZyd7AgT8sVtiQ@mail.gmail.com/

Signed-off-by: Jens Axboe
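The queue behaviour described above (exchange-based adds, in-order pops, a capped run leaving the remainder queued) can be sketched in userspace with C11 atomics. This follows Dmitry Vyukov's published intrusive MPSC algorithm and is not the io_uring/mpscq code itself; the names qnode, mpscq_init(), mpscq_push(), mpscq_pop(), mpscq_run() and struct work are assumptions made for this example, and the kernel's layout (for instance where ->work_head lives) differs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct qnode {
	_Atomic(struct qnode *) next;
};

struct mpscq {
	_Atomic(struct qnode *) tail;	/* producer side: last node pushed */
	struct qnode *head;		/* consumer cursor, one consumer only */
	struct qnode stub;		/* placeholder so the list is never empty */
};

static void mpscq_init(struct mpscq *q)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->tail, &q->stub);
	q->head = &q->stub;
}

/* Any number of producers: one exchange, no retry loop. */
static void mpscq_push(struct mpscq *q, struct qnode *n)
{
	struct qnode *prev;

	assert(n != NULL);
	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	prev = atomic_exchange_explicit(&q->tail, n, memory_order_acq_rel);
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Single consumer: returns the oldest entry, or NULL if none is ready. */
static struct qnode *mpscq_pop(struct mpscq *q)
{
	struct qnode *head = q->head;
	struct qnode *next = atomic_load_explicit(&head->next, memory_order_acquire);

	if (head == &q->stub) {
		if (!next)
			return NULL;		/* queue is empty */
		q->head = next;
		head = next;
		next = atomic_load_explicit(&head->next, memory_order_acquire);
	}
	if (next) {
		q->head = next;
		return head;
	}
	if (head != atomic_load_explicit(&q->tail, memory_order_acquire))
		return NULL;			/* a producer is mid-push */
	/* head is the last real entry: re-add the stub so it can be returned */
	mpscq_push(q, &q->stub);
	next = atomic_load_explicit(&head->next, memory_order_acquire);
	if (next) {
		q->head = next;
		return head;
	}
	return NULL;
}

/* Hypothetical work item; the node is the first member so a cast is valid. */
struct work {
	struct qnode node;
	int seq;
};

/* Run at most max_events entries; anything left simply stays queued. */
static int mpscq_run(struct mpscq *q, int max_events, int *out)
{
	int done = 0;

	while (done < max_events) {
		struct qnode *n = mpscq_pop(q);

		if (!n)
			break;
		out[done++] = ((struct work *)n)->seq;
	}
	return done;
}
```

Because the consumer never detaches the whole list, a capped run needs no side list for leftovers: the next call to mpscq_run() just continues from the cursor.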

mm/slab: improve kmem_cache_alloc_bulk

2026-06-03T16:20:43+00:00

The kmem_cache_alloc_bulk return value is weird. It returns the number of allocated objects, but based on the implementations and the handling in the callers, that number must always be 0 or the requested count. That assumption is not actually documented anywhere, which confuses automated review tools.

Fix this by returning a bool indicating whether the allocation succeeded and adding a kerneldoc comment explaining the API.

[rob.clark@oss.qualcomm.com: fixups in msm_iommu_pagetable_prealloc_allocate() ]
Signed-off-by: Christoph Hellwig
Reviewed-by: Alexander Lobakin # skbuff
Link: https://patch.msgid.link/20260528093437.2519248-2-hch@lst.de
Signed-off-by: Vlastimil Babka (SUSE)
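The all-or-nothing contract described above can be shown with a small userspace model. This is only a sketch of the semantics, not the kernel function or its signature: bulk_alloc(), bulk_free(), budget_alloc() and alloc_budget are hypothetical names, with the budget allocator existing purely to make a mid-batch failure reproducible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Demo allocator: fails once alloc_budget allocations have been handed out. */
static int alloc_budget = -1;		/* -1 means "never fail" */

static void *budget_alloc(size_t size)
{
	if (alloc_budget == 0)
		return NULL;
	if (alloc_budget > 0)
		alloc_budget--;
	return malloc(size);
}

/*
 * Fill all nr slots or none of them. On failure every object allocated so
 * far is released again, so the caller only has to check the bool.
 */
static bool bulk_alloc(size_t obj_size, size_t nr, void **objs)
{
	size_t i;

	assert(objs != NULL);
	for (i = 0; i < nr; i++) {
		objs[i] = budget_alloc(obj_size);
		if (!objs[i]) {
			while (i--) {
				free(objs[i]);
				objs[i] = NULL;
			}
			return false;
		}
	}
	return true;
}

static void bulk_free(size_t nr, void **objs)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		free(objs[i]);
		objs[i] = NULL;
	}
}
```

With a count as the return type, a caller would have to guess whether a partial result is possible; with a bool, the only two outcomes are visible in the type.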

io_uring/zcrx: notify user when out of buffers

2026-05-26T16:42:01+00:00

There is currently no easy way for the user to know if zcrx is out of buffers and page pool fails to allocate. Add uapi for zcrx to communicate it back. It's implemented as a separate CQE, which for now is posted to the creator ctx.

To use it, on registration the user space needs to pass an instance of struct zcrx_notification_desc, which tells the kernel the user_data for resulting CQEs and which event types are expected / allowed. When an allowed event happens, zcrx will post a CQE containing the specified user_data, and lower bits of cqe->res will be set to the event mask. Before the kernel can post another notification of the given type, the user needs to acknowledge that it processed the previous one by issuing IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION.

The only notification type the patch implements is ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future.

Co-developed-by: Vishwanath Seshagiri
Signed-off-by: Pavel Begunkov
Signed-off-by: Vishwanath Seshagiri
Link: https://patch.msgid.link/35cd307a03a43583838a2e151fc641c69abd786f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe

io_uring: propagate array_index_nospec opcode into req->opcode

2026-05-18T14:59:12+00:00

Commit 1e988c3fe126 ("io_uring: prevent opcode speculation") added array_index_nospec() to io_init_req(), but applied it only to a local opcode variable. req->opcode is initialized from sqe->opcode before the bounds check and remains the raw value.

Keep req->opcode as the canonical opcode in io_init_req(): reject out-of-range values architecturally, then write the array_index_nospec() result back to req->opcode before any table lookup. This keeps downstream users of req->opcode from observing the raw user byte on a mispredicted path.

No functional change: array_index_nospec() is a no-op for opcodes in [0, IORING_OP_LAST), and out-of-range opcodes are still rejected at the bounds check above the assignment.

Fixes: 1e988c3fe126 ("io_uring: prevent opcode speculation")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito
Link: https://patch.msgid.link/20260517213010.696135-1-michael.bommarito@gmail.com
Signed-off-by: Jens Axboe
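The ordering described above (bounds check first, then write the masked value back before any table lookup) can be sketched as follows. The mask arithmetic mirrors the kernel's generic array_index_mask_nospec() fallback, but everything else is a hypothetical userspace stand-in: OP_LAST, struct req, op_names and init_req() are invented for the example. Plain C like this only demonstrates the arithmetic; the kernel additionally hides the values from the optimizer, and the right shift of a negative value relies on arithmetic-shift behaviour as provided by GCC and Clang.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define OP_LAST	4UL		/* hypothetical number of opcodes */

/* All ones when index < size, zero otherwise, computed without a branch. */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (sizeof(long) * CHAR_BIT - 1));
}

struct req {
	unsigned char opcode;
};

static const char *const op_names[OP_LAST] = {
	"nop", "readv", "writev", "fsync",
};

/* Returns the table entry for the opcode, or NULL if it is out of range. */
static const char *init_req(struct req *req, unsigned char sqe_opcode)
{
	unsigned long opcode = sqe_opcode;

	assert(req != NULL);
	req->opcode = sqe_opcode;		/* raw user-controlled byte */
	if (opcode >= OP_LAST)
		return NULL;			/* architectural rejection */

	/* Clamp, then store the clamped value so later users see it too. */
	opcode &= index_mask_nospec(opcode, OP_LAST);
	req->opcode = (unsigned char)opcode;

	return op_names[req->opcode];
}
```

Masking only a local copy would leave req->opcode holding the raw byte, so any later lookup keyed on req->opcode would not benefit from the clamp.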

io_uring/rsrc: add huge page accounting for registered buffers

2026-05-14T14:11:53+00:00

Track huge page references in a per-ring xarray to prevent double accounting when the same huge page is used by multiple registered buffers, either within the same ring or across cloned rings.

When registering buffers backed by huge pages, we need to account for RLIMIT_MEMLOCK. But if multiple buffers share the same huge page (common with cloned buffers), we must not account for the same page multiple times. Similarly, we must only unaccount when the last reference to a huge page is released.

Maintain a per-ring xarray (hpage_acct) that tracks reference counts for each huge page. When registering a buffer, for each unique huge page, increment its accounting reference count, and only account pages that are newly added.

When unregistering a buffer, for each unique huge page, decrement its refcount. Once the refcount hits zero, the page is unaccounted.

Note: any accounting is done against the ctx->user that was assigned when the ring was set up. As before, if root is running the operation, no accounting is done.

With these changes, any use of imu->acct_pages is also dead, hence kill it from struct io_mapped_ubuf. This shrinks it from 56b to 48b on a 64-bit arch. Additionally, hpage_already_acct() is gone, which was an O(M*M) scan over current + previous registrations.

Signed-off-by: Jens Axboe
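The refcounted accounting described above can be sketched with a small fixed-size table standing in for the per-ring xarray. This is a userspace model under stated assumptions, not the kernel code: struct acct, struct hpage_ref, MAX_TRACKED, acct_get() and acct_put() are hypothetical names, and the RLIMIT_MEMLOCK charge is reduced to a plain counter.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED	16	/* hypothetical stand-in for the xarray */

struct hpage_ref {
	unsigned long hpage;	/* identifies one huge page */
	unsigned int refs;	/* registered buffers using it; 0 = slot free */
};

struct acct {
	struct hpage_ref slots[MAX_TRACKED];
	unsigned long locked_pages;	/* pages currently charged */
};

/* Find the slot tracking hpage, else a free slot, else NULL. */
static struct hpage_ref *acct_find(struct acct *a, unsigned long hpage)
{
	struct hpage_ref *free_slot = NULL;
	size_t i;

	for (i = 0; i < MAX_TRACKED; i++) {
		if (a->slots[i].refs && a->slots[i].hpage == hpage)
			return &a->slots[i];
		if (!a->slots[i].refs && !free_slot)
			free_slot = &a->slots[i];
	}
	return free_slot;
}

/* Register one use of hpage; charge nr_pages only on the first reference. */
static int acct_get(struct acct *a, unsigned long hpage, unsigned long nr_pages)
{
	struct hpage_ref *r = acct_find(a, hpage);

	if (!r)
		return -1;		/* table full */
	if (r->refs++ == 0) {
		r->hpage = hpage;
		a->locked_pages += nr_pages;
	}
	return 0;
}

/* Drop one use of hpage; uncharge only when the last reference goes. */
static void acct_put(struct acct *a, unsigned long hpage, unsigned long nr_pages)
{
	struct hpage_ref *r = acct_find(a, hpage);

	assert(r && r->refs && r->hpage == hpage);
	if (--r->refs == 0)
		a->locked_pages -= nr_pages;
}
```

Keying the count on the huge page rather than on the buffer is what lets two buffers, or a cloned buffer, share a page without charging it twice.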

io_uring: validate user-controlled cq.head in io_cqe_cache_refill()

2026-05-14T03:44:57+00:00

A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:

    [root@fedora io_uring_stress]# ps -ef | grep io_uring
    root  1240  1  99  13:36  ?  00:01:35  [io_uring_stress]

The task loops inside io_cqring_wait() and never returns to userspace, and SIGKILL has no effect.

This is caused by the CQ ring exposing rings->cq.head to userspace as writable, while the authoritative tail lives in kernel-private ctx->cached_cq_tail. io_cqe_cache_refill() computes free space as an unsigned subtraction:

    free = ctx->cq_entries - min(tail - head, ctx->cq_entries);

If userspace keeps head within [0, tail], the subtraction is well defined and min() just acts as a defensive clamp. But if userspace advances head past tail, (tail - head) wraps to a huge value, free becomes 0, and io_cqe_cache_refill() fails. The CQE is pushed onto the overflow list and IO_CHECK_CQ_OVERFLOW_BIT is set.

The wait loop in io_cqring_wait() relies on an invariant: refill() only fails when the CQ is *physically* full, in which case rings->cq.tail has been advanced to iowq->cq_tail and io_should_wake() returns true. The tampered head breaks this: refill() fails while the ring is not full, no OCQE is copied in, rings->cq.tail never catches up, io_should_wake() stays false, and io_cqring_wait_schedule() keeps returning early because IO_CHECK_CQ_OVERFLOW_BIT is still set. The result is a tight retry loop that never returns to userspace.

Introduce io_cqring_queued() as the single point that converts the (tail, head) pair into a trustworthy queued count. Since the real head/tail distance is bounded by cq_entries (far below 2^31), a signed comparison reliably detects userspace moving head past tail; in that case treat the queue as empty so callers see the full cache as free and forward progress is preserved.

Suggested-by: Jens Axboe
Signed-off-by: Zizhi Wo
Link: https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com
[axboe: fixup commit message, kill 'queued' var, and keep it all in io_uring.c]
Signed-off-by: Jens Axboe
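The wraparound and its fix can be checked with a few lines of standalone C. This is a sketch of the arithmetic only, not the kernel's io_cqring_queued(): cq_free_old(), cq_queued() and cq_free() are hypothetical names, and the "signed comparison" is written as a test against INT32_MAX so that it stays within well-defined unsigned arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* The original computation: a tampered head makes tail - head wrap. */
static uint32_t cq_free_old(uint32_t tail, uint32_t head, uint32_t cq_entries)
{
	uint32_t dist = tail - head;

	return cq_entries - (dist < cq_entries ? dist : cq_entries);
}

/*
 * Trustworthy queued count. A distance with the top bit set can only mean
 * head was moved past tail, since the real distance never exceeds
 * cq_entries; treat that as an empty queue.
 */
static uint32_t cq_queued(uint32_t tail, uint32_t head, uint32_t cq_entries)
{
	uint32_t dist = tail - head;	/* wraps modulo 2^32 */

	if (dist > (uint32_t)INT32_MAX)
		return 0;
	return dist < cq_entries ? dist : cq_entries;
}

static uint32_t cq_free(uint32_t tail, uint32_t head, uint32_t cq_entries)
{
	uint32_t queued = cq_queued(tail, head, cq_entries);

	assert(queued <= cq_entries);
	return cq_entries - queued;
}
```

For tail = 4, head = 10 and 8 entries, the old formula reports no free space while the checked version reports all 8 entries free, which is what lets the refill succeed and the wait loop make progress.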

io_uring: hold uring_lock when walking link chain in io_wq_free_work()

2026-05-11T17:14:29+00:00

io_wq_free_work() calls io_req_find_next() from io-wq worker context, which reads and clears req->link without holding any lock. This can potentially race with other paths that mutate the same chain under ctx->uring_lock.

Take ctx->uring_lock around the io_req_find_next() call. Only requests with IO_REQ_LINK_FLAGS reach this path, which is not the hot path.

Signed-off-by: Jens Axboe