<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/io_uring/io_uring.c, branch v7.2-rc1</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v7.2-rc1</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v7.2-rc1'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-06-16T07:23:59+00:00</updated>
<entry>
<title>Merge tag 'for-7.2/io_uring-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux</title>
<updated>2026-06-16T07:23:59+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-06-16T07:23:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9b40ba14edcdf70240af8114092a76f75f070774'/>
<id>urn:sha1:9b40ba14edcdf70240af8114092a76f75f070774</id>
<content type='text'>
Pull io_uring updates from Jens Axboe:

 - Rework the task_work infrastructure.

   Both the local (DEFER_TASKRUN) and the normal (tctx) task_work lists
   were llist based, which is LIFO ordered, and hence each run had to do
   an O(n) list reversal pass first to restore queue order.
   Additionally, to cap the amount of task_work run, each method needed
   a retry list as well.

   Add a lockless MPSC FIFO queue (based on Dmitry Vyukov's intrusive
   MPSC algorithm) and switch both task_work lists to it. It performs
   better than llists, and we can then also ditch the retry lists, as
   entries are popped one at a time.

   On top of those changes, run the tctx fallback task_work directly and
   remove the now-unused per-ctx fallback machinery entirely.

 - zcrx user notifications.

   Add a mechanism for zcrx to communicate conditions back to userspace
   via a dedicated CQE, with the initial users being notification on
   running out of buffers and on a frag copy fallback, plus
   shared-memory notification statistics.

   Alongside that, a series of zcrx reliability and cleanup fixes: more
   reliable scrubbing, poisoning pointers on unregistration, dropping an
   extra ifq close, adding a ctx back-pointer, reordering fd allocation
   in the export path, and killing a dead 'sock' member.

 - Allow using io_uring registered buffers for plain SEND and RECV, not
   just for the zero-copy send path.

   This enables targets like ublk's NBD backend to push/pull IO data
   directly to/from a registered buffer over a plain send/recv on a TCP
   socket.

 - Registered buffer improvements: account huge pages correctly, bump
   the io_mapped_ubuf length field to size_t, and raise the previous 1GB
   registered buffer size limit.

 - Restrict the ctx access exposed to io_uring BPF struct_ops programs
   by handing them an opaque type rather than the full io_ring_ctx, and
   add a separate MAINTAINERS entry for the bpf-ops code.

 - Allow opcode filtering on IORING_OP_CONNECT.

 - Validate ring-provided buffer addresses with access_ok(), and align
   the legacy buffer add limit with MAX_BIDS_PER_BGID.

 - Various other cleanups and minor fixes, including avoiding msghdr
   async data on connect/bind, dropping async_size for OP_LISTEN, making
   the POLL_FIRST receive side checks consistent, re-checking
   IO_WQ_BIT_EXIT for each linked work item, and using
   trace_call__##name() at guarded tracepoint call sites.

* tag 'for-7.2/io_uring-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (31 commits)
  io_uring/bpf-ops: add a separate maintainer entry
  io_uring/net: make POLL_FIRST receive side checks consistent
  io_uring: remove the per-ctx fallback task_work machinery
  io_uring: run the tctx task_work fallback directly
  io_uring: switch normal task_work to a mpscq
  io_uring: switch local task_work to a mpscq
  io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
  io_uring: grab RCU read lock marking task run
  io_uring/zcrx: kill dead 'sock' member in struct io_zcrx_args
  io_uring/kbuf: validate ring provided buffer addresses with access_ok()
  io_uring/net: support registered buffer for plain send and recv
  io_uring/nop: Drop a wrong comment in struct io_nop
  io_uring/net: Remove async_size for OP_LISTEN
  io_uring/net: Avoid msghdr on op_connect/op_bind async data
  io_uring/bpf-ops: restrict ctx access to BPF
  io_uring/io-wq: re-check IO_WQ_BIT_EXIT for each linked work item
  io_uring/kbuf: align legacy buffer add limit with MAX_BIDS_PER_BGID
  io_uring/zcrx: add shared-memory notification statistics
  io_uring/zcrx: notify user on frag copy fallback
  io_uring/zcrx: notify user when out of buffers
  ...
</content>
</entry>
<entry>
<title>Merge tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab</title>
<updated>2026-06-16T03:14:43+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-06-16T03:14:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f8115f0e8a0585ef1c03d07a68b989023097d16c'/>
<id>urn:sha1:f8115f0e8a0585ef1c03d07a68b989023097d16c</id>
<content type='text'>
Pull slab updates from Vlastimil Babka:

 - Support for "allocation tokens" (currently available in Clang 22+)
   for smarter partitioning of kmalloc caches based on the allocated
   object type, which can be enabled instead of the "random"
   per-caller-address-hash partitioning.

   It should be able to deterministically separate types containing a
   pointer from those that do not (Marco Elver)

 - Improvements and simplification of the kmem_cache_alloc_bulk() and
   mempool_alloc_bulk() API. This includes adaptation of callers
   (Christoph Hellwig)

 - Performance improvements and cleanups related mostly to sheaves
   refill (Hao Li, Shengming Hu, Vlastimil Babka)

 - Several fixups for the slabinfo tool (Xuewen Wang)

* tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: do not limit zeroing to orig_size when only red zoning is enabled
  mm/slub: preserve original size in _kmalloc_nolock_noprof retry path
  mm: simplify the mempool_alloc_bulk API
  mm/slab: improve kmem_cache_alloc_bulk
  mm/slub: detach and reattach partial slabs in batch
  mm/slub: introduce helpers for node partial slab state
  mm/slub: use empty sheaf helpers for oversized sheaves
  tools/mm/slabinfo: remove redundant slab-&gt;partial assignment
  tools/mm/slabinfo: remove dead assignment in get_obj_and_str()
  tools/mm/slabinfo: Fix trace disable logic inversion
  MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR
  mm/slub: fix typo in sheaves comment
  mm, slab: simplify returning slab in __refill_objects_node()
  mm, slab: add an optimistic __slab_try_return_freelist()
  slab: fix kernel-docs for mm-api
  slab: improve KMALLOC_PARTITION_RANDOM randomness
  slab: support for compiler-assisted type-based slab cache partitioning
  mm/slub: defer freelist construction until after bulk allocation from a new slab
</content>
</entry>
<entry>
<title>io_uring: remove the per-ctx fallback task_work machinery</title>
<updated>2026-06-13T12:27:20+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2026-06-11T17:44:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=576cce91480a949f5b83578300f37023b933e0a2'/>
<id>urn:sha1:576cce91480a949f5b83578300f37023b933e0a2</id>
<content type='text'>
With the tctx fallback running its entries directly, the per-ctx
fallback work has a single user left: moving local (DEFER_TASKRUN)
task_work entries out of a ring that is going away. Both of its call
sites are process context and don't hold -&gt;uring_lock, the same
conditions the deferred fallback work itself ran under - so run the
entries in cancel mode right there instead, and rename the helper to
io_cancel_local_task_work() to match what it now does.

With that, -&gt;fallback_llist, -&gt;fallback_work, io_fallback_req_func()
and __io_fallback_tw() can all go away, along with the fallback work
flushing in the ring exit and cancel paths. Requests that get
orphaned by an exiting task now run via the tctx fallback work, which
the ring exit side implicitly waits on through the ctx refs those
requests hold.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: switch local task_work to a mpscq</title>
<updated>2026-06-13T12:27:06+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2026-06-10T21:19:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d46ab2c98ababa19b41a5709b6921d7b1add7f74'/>
<id>urn:sha1:d46ab2c98ababa19b41a5709b6921d7b1add7f74</id>
<content type='text'>
The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
ordered, and hence __io_run_local_work() has to restore the right
running order with an O(n) llist_reverse_order() pass first. On top of
that, a batch that gets capped by max_events needs the leftover entries
parked on a separate -&gt;retry_llist, as they can't be pushed back to the
shared list.

Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
retry loop, entries are popped in queue order with no reversal pass,
capping a run simply leaves the remainder on the queue, and
-&gt;retry_llist goes away entirely. The consumer cursor, -&gt;work_head,
lives with the rest of the -&gt;uring_lock protected state rather than
next to the queue, so that popping entries doesn't dirty the producer
side cacheline.
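
As an illustration only, the queue behaviour described above can be
modelled in a few lines of single-threaded Python. This is a sketch of
Vyukov's intrusive MPSC algorithm, not the kernel code: the class and
function names (Node, MpscQueue, run_capped) are hypothetical, and the
atomic exchange a real producer would use is a plain assignment here.

```python
class Node:
    def __init__(self, value=None):
        self.value = value
        self.next = None


class MpscQueue:
    """Single-threaded model of Vyukov's intrusive MPSC queue."""

    def __init__(self):
        self.stub = Node()
        self.head = self.stub   # producer side: most recently pushed node
        self.tail = self.stub   # consumer cursor: next node to pop

    def push(self, node):
        node.next = None
        prev, self.head = self.head, node   # an atomic exchange in real code
        prev.next = node                    # publish the link to the consumer

    def pop(self):
        tail, nxt = self.tail, self.tail.next
        if tail is self.stub:
            if nxt is None:
                return None                 # queue is empty
            self.tail = tail = nxt          # skip over the stub
            nxt = tail.next
        if nxt is not None:
            self.tail = nxt
            return tail
        if tail is not self.head:
            return None                     # a producer is mid-push
        self.push(self.stub)                # re-insert stub behind last node
        nxt = tail.next
        if nxt is not None:
            self.tail = nxt
            return tail
        return None


def run_capped(queue, max_events):
    """Pop at most max_events entries; the rest simply stay queued."""
    done = []
    while len(done) != max_events:
        node = queue.pop()
        if node is None:
            break
        done.append(node.value)
    return done


queue = MpscQueue()
for value in range(1, 6):
    queue.push(Node(value))
print(run_capped(queue, 3))   # first capped batch, in queue order
print(run_capped(queue, 3))   # the remainder, still in queue order
```

Entries come out in the order they were pushed, so no reversal pass is
needed, and a capped run needs no side list for the leftovers.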

For low amounts of task_work, this ends up being a bit more efficient
than the existing scheme. As an example of that, doing multishot
receives for 8 clients has the following task_work overhead:

     1.02%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.88%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.60%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.14%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.64% at ~46Gb/sec

and after this change:

     1.08%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.03%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.11% at ~53Gb/sec

which has less overhead even though that test run was faster. For a case
of having 1024 clients on a single ring:

     2.22%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.84%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.02%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     3.50% at ~24Gb/sec

we start to see the llist reversing taking a considerable amount of
time, and the total add+run task_work overhead is around 3.5%. After
the change:

     0.90%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.32% at ~26Gb/sec

most of that overhead is gone, and performance is better as well.

Caleb Sander Mateos &lt;csander@purestorage.com&gt; reports that it improves
the performance of a ublk 4kb workload by 4% [1], while testing v1 of
this patchset.

[1] https://lore.kernel.org/io-uring/CADUfDZr-MMYBaP-e+y9+xuRhuiunO2sBTUCmwZyd7AgT8sVtiQ@mail.gmail.com/

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>mm/slab: improve kmem_cache_alloc_bulk</title>
<updated>2026-06-03T16:20:43+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2026-05-28T09:34:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6bb0009862c5f0e89a6e4afc09b499a02576c7da'/>
<id>urn:sha1:6bb0009862c5f0e89a6e4afc09b499a02576c7da</id>
<content type='text'>
The kmem_cache_alloc_bulk return value is weird.  It returns the number
of allocated objects, but based on the implementations and the handling
in the callers that must always be 0 or the requested number.  That
assumption is not actually documented anywhere, which confuses automated
review tools.

Fix this by returning a bool indicating whether the allocation succeeded
and adding a kerneldoc comment explaining the API.

[rob.clark@oss.qualcomm.com: fixups in
 msm_iommu_pagetable_prealloc_allocate() ]

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Alexander Lobakin &lt;aleksander.lobakin@intel.com&gt; # skbuff
Link: https://patch.msgid.link/20260528093437.2519248-2-hch@lst.de
Signed-off-by: Vlastimil Babka (SUSE) &lt;vbabka@kernel.org&gt;
</content>
</entry>
<entry>
<title>io_uring/zcrx: notify user when out of buffers</title>
<updated>2026-05-26T16:42:01+00:00</updated>
<author>
<name>Pavel Begunkov</name>
<email>asml.silence@gmail.com</email>
</author>
<published>2026-05-19T11:44:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0719e10d826aa0ba4840917d0261986eaead9a51'/>
<id>urn:sha1:0719e10d826aa0ba4840917d0261986eaead9a51</id>
<content type='text'>
There is currently no easy way for the user to know if zcrx is out of
buffers and the page pool fails to allocate. Add uapi for zcrx to
communicate it back.

It's implemented as a separate CQE, which for now is posted to the creator
ctx. To use it, on registration the user space needs to pass an instance
of struct zcrx_notification_desc, which tells the kernel the user_data
for resulting CQEs and which event types are expected / allowed.

When an allowed event happens, zcrx will post a CQE containing the
specified user_data, and the lower bits of cqe-&gt;res will be set to the
event mask. Before the kernel can post another notification of the given
type, the user needs to acknowledge that it processed the previous one
by issuing IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION.

The only notification type the patch implements is
ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future.

Co-developed-by: Vishwanath Seshagiri &lt;vishs@meta.com&gt;
Signed-off-by: Pavel Begunkov &lt;asml.silence@gmail.com&gt;
Signed-off-by: Vishwanath Seshagiri &lt;vishs@meta.com&gt;
Link: https://patch.msgid.link/35cd307a03a43583838a2e151fc641c69abd786f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: propagate array_index_nospec opcode into req-&gt;opcode</title>
<updated>2026-05-18T14:59:12+00:00</updated>
<author>
<name>Michael Bommarito</name>
<email>michael.bommarito@gmail.com</email>
</author>
<published>2026-05-17T21:30:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cf18e36455603d65d4745de83e2d1743c54ada47'/>
<id>urn:sha1:cf18e36455603d65d4745de83e2d1743c54ada47</id>
<content type='text'>
Commit 1e988c3fe126 ("io_uring: prevent opcode speculation") added
array_index_nospec() to io_init_req(), but applied it only to a local
opcode variable. req-&gt;opcode is initialized from sqe-&gt;opcode before the
bounds check and remains the raw value.

Keep req-&gt;opcode as the canonical opcode in io_init_req(): reject
out-of-range values architecturally, then write the array_index_nospec()
result back to req-&gt;opcode before any table lookup. This keeps downstream
users of req-&gt;opcode from observing the raw user byte on a mispredicted
path.

No functional change: array_index_nospec() is a no-op for opcodes in
[0, IORING_OP_LAST), and out-of-range opcodes are still rejected at the
bounds check above the assignment.

Fixes: 1e988c3fe126 ("io_uring: prevent opcode speculation")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito &lt;michael.bommarito@gmail.com&gt;
Link: https://patch.msgid.link/20260517213010.696135-1-michael.bommarito@gmail.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring/rsrc: add huge page accounting for registered buffers</title>
<updated>2026-05-14T14:11:53+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2026-01-24T17:02:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=df0a52537c0f95d5441eb0ba1bdbd2d864e6def9'/>
<id>urn:sha1:df0a52537c0f95d5441eb0ba1bdbd2d864e6def9</id>
<content type='text'>
Track huge page references in a per-ring xarray to prevent double
accounting when the same huge page is used by multiple registered
buffers, either within the same ring or across cloned rings.

When registering buffers backed by huge pages, we need to account for
RLIMIT_MEMLOCK. But if multiple buffers share the same huge page (common
with cloned buffers), we must not account for the same page multiple
times. Similarly, we must only unaccount when the last reference to a
huge page is released.

Maintain a per-ring xarray (hpage_acct) that tracks reference counts for
each huge page. When registering a buffer, for each unique huge page,
increment its accounting reference count, and only account pages that
are newly added.

When unregistering a buffer, for each unique huge page, decrement its
refcount. Once the refcount hits zero, the page is unaccounted.
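
The register/unregister rule above can be sketched as a toy Python
model. This is an illustration under stated assumptions, not the kernel
implementation: a dict stands in for the hpage_acct xarray, huge pages
are represented by plain integers, and the class and method names are
hypothetical.

```python
class HugePageAcct:
    """Toy model of a per-ring map from huge page to reference count."""

    def __init__(self):
        self.refs = {}        # stands in for the hpage_acct xarray
        self.accounted = 0    # huge pages currently charged

    def register(self, pages):
        for page in set(pages):          # visit each unique huge page once
            if page not in self.refs:
                self.accounted += 1      # newly added: charge it
            self.refs[page] = self.refs.get(page, 0) + 1

    def unregister(self, pages):
        for page in set(pages):
            self.refs[page] -= 1
            if self.refs[page] == 0:     # last reference dropped
                del self.refs[page]
                self.accounted -= 1      # uncharge it


acct = HugePageAcct()
buf_a = [1, 1, 2]     # buffer A spans huge pages 1 and 2
buf_b = [2, 3]        # buffer B shares huge page 2 with A
acct.register(buf_a)
acct.register(buf_b)
print(acct.accounted)   # page 2 is charged once, not twice
acct.unregister(buf_a)
print(acct.accounted)   # page 2 stays charged while B still uses it
```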

Note: any accounting is done against the ctx-&gt;user that was assigned
when the ring was set up. As before, if root is running the operation,
no accounting is done.

With these changes, any use of imu-&gt;acct_pages is also dead, hence kill
it from struct io_mapped_ubuf. This shrinks it from 56b to 48b on a
64-bit arch. Additionally, hpage_already_acct() is gone, which was an
O(M*M) scan over current + previous registrations.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: validate user-controlled cq.head in io_cqe_cache_refill()</title>
<updated>2026-05-14T03:44:57+00:00</updated>
<author>
<name>Zizhi Wo</name>
<email>wozizhi@huawei.com</email>
</author>
<published>2026-05-14T02:18:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f44d38a31f1802b7222adaea9ee69f9d280f698a'/>
<id>urn:sha1:f44d38a31f1802b7222adaea9ee69f9d280f698a</id>
<content type='text'>
A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:

[root@fedora io_uring_stress]# ps -ef | grep io_uring
root  1240  1  99 13:36 ?  00:01:35 [io_uring_stress] &lt;defunct&gt;

The task loops inside io_cqring_wait() and never returns to userspace,
and SIGKILL has no effect.

This is caused by the CQ ring exposing rings-&gt;cq.head to userspace as
writable, while the authoritative tail lives in kernel-private
ctx-&gt;cached_cq_tail. io_cqe_cache_refill() computes free space as an
unsigned subtraction:

    free = ctx-&gt;cq_entries - min(tail - head, ctx-&gt;cq_entries);

If userspace keeps head within [0, tail], the subtraction is well
defined and min() just acts as a defensive clamp. But if userspace
advances head past tail, (tail - head) wraps to a huge value, free
becomes 0, and io_cqe_cache_refill() fails. The CQE is pushed onto the
overflow list and IO_CHECK_CQ_OVERFLOW_BIT is set.

The wait loop in io_cqring_wait() relies on an invariant: refill() only
fails when the CQ is *physically* full, in which case rings-&gt;cq.tail has
been advanced to iowq-&gt;cq_tail and io_should_wake() returns true. The
tampered head breaks this: refill() fails while the ring is not full, no
OCQE is copied in, rings-&gt;cq.tail never catches up, io_should_wake()
stays false, and io_cqring_wait_schedule() keeps returning early because
IO_CHECK_CQ_OVERFLOW_BIT is still set. The result is a tight retry loop
that never returns to userspace.

Introduce io_cqring_queued() as the single point that converts the
(tail, head) pair into a trustworthy queued count. Since the real
head/tail distance is bounded by cq_entries (far below 2^31), a signed
comparison reliably detects userspace moving head past tail; in that
case treat the queue as empty so callers see the full cache as free and
forward progress is preserved.
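
To make the arithmetic concrete, here is a small Python model of the
two computations. It is a sketch only: 32-bit wraparound is emulated
with a modulo, the function names are hypothetical, and queued_checked()
is an assumption about what io_cqring_queued() does based on the
description above, not a copy of it.

```python
U32 = 2**32


def s32(value):
    """Reinterpret a value modulo 2**32 as a signed 32-bit integer."""
    value %= U32
    return value - U32 if value // 2**31 else value


def free_unchecked(tail, head, cq_entries):
    # Unsigned subtraction, as in the formula quoted earlier.
    return cq_entries - min((tail - head) % U32, cq_entries)


def queued_checked(tail, head, cq_entries):
    # A negative signed distance means head was moved past tail,
    # so report an empty queue instead of a huge wrapped count.
    return min(max(s32(tail - head), 0), cq_entries)


def free_checked(tail, head, cq_entries):
    return cq_entries - queued_checked(tail, head, cq_entries)


# Well-behaved userspace: 3 of 8 entries are queued.
print(free_unchecked(10, 7, 8), free_checked(10, 7, 8))
# Tampered head, 5 past tail: the unchecked version reports no space.
print(free_unchecked(10, 15, 8), free_checked(10, 15, 8))
```

With a sane head both versions agree; with head moved past tail the
unchecked version reports zero free entries while the checked one
reports the whole ring as free.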

Suggested-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Zizhi Wo &lt;wozizhi@huawei.com&gt;
Link: https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com
[axboe: fixup commit message, kill 'queued' var, and keep it all in
io_uring.c]
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: hold uring_lock when walking link chain in io_wq_free_work()</title>
<updated>2026-05-11T17:14:29+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2026-05-11T16:58:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=20c39819a27646573dfa0ac0d01c38895298a6f6'/>
<id>urn:sha1:20c39819a27646573dfa0ac0d01c38895298a6f6</id>
<content type='text'>
io_wq_free_work() calls io_req_find_next() from io-wq worker context,
which reads and clears req-&gt;link without holding any lock. This can
potentially race with other paths that mutate the same chain under
ctx-&gt;uring_lock.

Take ctx-&gt;uring_lock around the io_req_find_next() call. Only requests
with IO_REQ_LINK_FLAGS reach this path, which is not the hot path.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
</feed>
