kernel/linux.git/io_uring/kbuf.c, branch v7.2-rc1

Merge tag 'for-7.2/io_uring-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

2026-06-16T07:23:59+00:00

Pull io_uring updates from Jens Axboe: - Rework the task_work infrastructure. Both the local (DEFER_TASKRUN) and the normal (tctx) task_work lists were llist based, which is LIFO ordered, and hence each run had to do an O(n) list reversal pass first to restore queue order. Additionally, to cap the amount of task_work run, each method needed a retry list as well. Add a lockless MPCS FIFO queue (based on Dmitry Vyukov's intrusive MPSC algorithm) and switch both task_work lists to it. It performs better than llists and we can then also ditch the retry lists as well as entries are popped one-at-the-time. On top of those changes, run the tctx fallback task_work directly and remove the now-unused per-ctx fallback machinery entirely. - zcrx user notifications. Add a mechanism for zcrx to communicate conditions back to userspace via a dedicated CQE, with the initial users being notification on running out of buffers and on a frag copy fallback, plus shared-memory notification statistics. Alongside that, a series of zcrx reliability and cleanup fixes: more reliable scrubbing, poisoning pointers on unregistration, dropping an extra ifq close, adding a ctx back-pointer, reordering fd allocation in the export path, and killing a dead 'sock' member. - Allow using io_uring registered buffers for plain SEND and RECV, not just for the zero-copy send path. This enables targets like ublk's NBD backend to push/pull IO data directly to/from a registered buffer over a plain send/recv on a TCP socket. - Registered buffer improvements: account huge pages correctly, bump the io_mapped_ubuf length field to size_t, and raise the previous 1GB registered buffer size limit. - Restrict the ctx access exposed to io_uring BPF struct_ops programs by handing them an opaque type rather than the full io_ring_ctx, and add a separate MAINTAINERS entry for the bpf-ops code. - Allow opcode filtering on IORING_OP_CONNECT. - Validate ring-provided buffer addresses with access_ok(), and align the legacy buffer add limit with MAX_BIDS_PER_BGID. - Various other cleanups and minor fixes, including avoiding msghdr async data on connect/bind, dropping async_size for OP_LISTEN, making the POLL_FIRST receive side checks consistent, re-checking IO_WQ_BIT_EXIT for each linked work item, and using trace_call__##name() at guarded tracepoint call sites. * tag 'for-7.2/io_uring-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (31 commits) io_uring/bpf-ops: add a separate maintainer entry io_uring/net: make POLL_FIRST receive side checks consistent io_uring: remove the per-ctx fallback task_work machinery io_uring: run the tctx task_work fallback directly io_uring: switch normal task_work to a mpscq io_uring: switch local task_work to a mpscq io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue io_uring: grab RCU read lock marking task run io_uring/zcrx: kill dead 'sock' member in struct io_zcrx_args io_uring/kbuf: validate ring provided buffer addresses with access_ok() io_uring/net: support registered buffer for plain send and recv io_uring/nop: Drop a wrong comment in struct io_nop io_uring/net: Remove async_size for OP_LISTEN io_uring/net: Avoid msghdr on op_connect/op_bind async data io_uring/bpf-ops: restrict ctx access to BPF io_uring/io-wq: re-check IO_WQ_BIT_EXIT for each linked work item io_uring/kbuf: align legacy buffer add limit with MAX_BIDS_PER_BGID io_uring/zcrx: add shared-memory notification statistics io_uring/zcrx: notify user on frag copy fallback io_uring/zcrx: notify user when out of buffers ...

io_uring/kbuf: validate ring provided buffer addresses with access_ok()

2026-06-08T17:03:16+00:00

Commit: 809b997a5ce9 ("x86-64/arm64/powerpc: clean up and rename __copy_from_user_flushcache") sanitized that any provided copy helper should separately validate destination and source addresses, but we should also ensure that anything that is retrieved from a buffer is validated upfront. For ring provided buffers, always include an access_ok() when grabbing a new buffer. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Jens Axboe

io_uring/kbuf: don't truncate end buffer for bundles

2026-06-07T22:11:47+00:00

If buffers have been peeked for a bundle receive, the kernel will truncate the end buffer, if the available length is shorter than the buffer itself. This is unnecessary, as applications iterating bundle receives must always use the minimum size of the buffer length and the remaining number of bytes in the bundle. The examples in liburing do that as well, eg examples/proxy.c. If the kernel does truncate this buffer AND the current transfer fails, then the buffer will be left with a smaller size than what is otherwise available. Just remove the buffer truncation, as it's not necessary in the first place. Link: https://lore.kernel.org/io-uring/CAAEr8jbY60noGj1fw_k91UJRBkyiRVoS6=nLhZ7Svwidjn4CAA@mail.gmail.com/ Reported-by: Federico Brasili Cc: stable@vger.kernel.org Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe

io_uring/kbuf: align legacy buffer add limit with MAX_BIDS_PER_BGID

2026-05-28T14:02:57+00:00

io_provide_buffers_prep() accepts nbufs up to MAX_BIDS_PER_BGID, but io_add_buffers() stops when bl->nbufs reaches USHRT_MAX. This makes the effective add limit one lower than the validated limit. Use MAX_BIDS_PER_BGID in the add-side boundary check so validation and execution use the same limit, and update the comment to refer to the actual limit constant. Signed-off-by: liyouhong Link: https://patch.msgid.link/20260528024936.3672659-1-dayou5941@163.com Signed-off-by: Jens Axboe

io_uring: parenthesize io_ring_head_to_buf() expansion

2026-05-14T13:21:32+00:00

Wrap the io_ring_head_to_buf() macro value in an extra pair of parentheses so it is safe when composed into larger expressions, and to satisfy scripts/checkpatch.pl. Signed-off-by: Yi Xie Link: https://patch.msgid.link/20260514083443.203387-1-xieyi@kylinos.cn Signed-off-by: Jens Axboe

io_uring/kbuf: support min length left for incremental buffers

2026-04-28T22:08:56+00:00

Incrementally consumed buffer rings are generally fully consumed, but it's quite possible that the application has a minimum size it needs to meet to avoid truncation. Currently that minimum limit is 1 byte, but this should be a setting that is the hands of the application. For recvmsg multishot, a prime use case for incrementally consumed buffers, the application may get spurious -EFAULT returned at the end of an incrementally consumed buffer, as less space is available than the headers need. Grab a u32 field in struct io_uring_buf_reg, which the application can use to inform the kernel of the minimum size that should be available in an incrementally consumed buffer. If less than that is available, the current buffer is fully processed and the next one will be picked. Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1433 Signed-off-by: Martin Michaelis [axboe: write commit message, change io_buffer_list member name] Reviewed-by: Gabriel Krisman Bertazi Signed-off-by: Jens Axboe

io_uring/kbuf: kill dead struct io_buffer_list 'nr_entries' member

2026-04-28T22:08:44+00:00

This is only ever assigned, never used. The only used part is the calculated mask, which is used for indexing. Kill 'nr_entries'. Reviewed-by: Gabriel Krisman Bertazi Signed-off-by: Jens Axboe

Merge tag 'for-7.1/io_uring-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

2026-04-13T23:22:30+00:00

Pull io_uring updates from Jens Axboe: - Add a callback driven main loop for io_uring, and BPF struct_ops on top to allow implementing custom event loop logic - Decouple IOPOLL from being a ring-wide all-or-nothing setting, allowing IOPOLL use cases to also issue certain white listed non-polled opcodes - Timeout improvements. Migrate internal timeout storage from timespec64 to ktime_t for simpler arithmetic and avoid copying of timespec data - Zero-copy receive (zcrx) updates: - Add a device-less mode (ZCRX_REG_NODEV) for testing and experimentation where data flows through the copy fallback path - Fix two-step unregistration regression, DMA length calculations, xarray mark usage, and a potential 32-bit overflow in id shifting - Refactoring toward multi-area support: dedicated refill queue struct, consolidated DMA syncing, netmem array refilling format, and guard-based locking - Zero-copy transmit (zctx) cleanup: - Unify io_send_zc() and io_sendmsg_zc() into a single function - Add vectorized registered buffer send for IORING_OP_SEND_ZC - Add separate notification user_data via sqe->addr3 so notification and completion CQEs can be distinguished without extra reference counting - Switch struct io_ring_ctx internal bitfields to explicit flag bits with atomic-safe accessors, and annotate the known harmless races on those flags - Various optimizations caching ctx and other request fields in local variables to avoid repeated loads, and cleanups for tctx setup, ring fd registration, and read path early returns * tag 'for-7.1/io_uring-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (58 commits) io_uring: unify getting ctx from passed in file descriptor io_uring/register: don't get a reference to the registered ring fd io_uring/tctx: clean up __io_uring_add_tctx_node() error handling io_uring/tctx: have io_uring_alloc_task_context() return tctx io_uring/timeout: use 'ctx' consistently io_uring/rw: clean up __io_read() obsolete comment and early returns io_uring/zcrx: use correct mmap off constants io_uring/zcrx: use dma_len for chunk size calculation io_uring/zcrx: don't clear not allocated niovs io_uring/zcrx: don't use mark0 for allocating xarray io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring() io_uring/zcrx: reject REG_NODEV with large rx_buf_size io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP io_uring/rsrc: use io_cache_free() to free node io_uring/zcrx: rename zcrx [un]register functions io_uring/zcrx: check ctrl op payload struct sizes io_uring/zcrx: cache fallback availability in zcrx ctx io_uring/zcrx: warn on a repeated area append io_uring/zcrx: consolidate dma syncing io_uring/zcrx: netmem array as refiling format ...

Merge tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

2026-03-20T16:58:56+00:00

Pull io_uring fixes from Jens Axboe: - A bit of a work-around for AF_UNIX recv multishot, as the in-kernel implementation doesn't properly signal EOF. We'll likely rework this one going forward, but the fix is sufficient for now - Two fixes for incrementally consumed buffers, for non-pollable files and for 0 byte reads * tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/kbuf: propagate BUF_MORE through early buffer commit path io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF io_uring/poll: fix multishot recv missing EOF on wakeup race

io_uring/kbuf: propagate BUF_MORE through early buffer commit path

2026-03-19T21:09:48+00:00

When io_should_commit() returns true (eg for non-pollable files), buffer commit happens at buffer selection time and sel->buf_list is set to NULL. When __io_put_kbufs() generates CQE flags at completion time, it calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence cannot determine whether the buffer was consumed or not. This means that IORING_CQE_F_BUF_MORE is never set for non-pollable input with incrementally consumed buffers. Likewise for io_buffers_select(), which always commits upfront and discards the return value of io_kbuf_commit(). Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early commit. Then __io_put_kbuf_ring() can check this flag and set IORING_F_BUF_MORE accordingy. Reported-by: Martin Michaelis Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1553 Signed-off-by: Jens Axboe