<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/io_uring/rw.c, branch v7.0-rc7</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v7.0-rc7</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v7.0-rc7'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-02-19T14:25:39+00:00</updated>
<entry>
<title>io_uring: add IORING_OP_URING_CMD128 to opcode checks</title>
<updated>2026-02-19T14:25:39+00:00</updated>
<author>
<name>Caleb Sander Mateos</name>
<email>csander@purestorage.com</email>
</author>
<published>2026-02-19T01:35:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=42a6bd57ee9f930a72c26f863c72f666d6ed9ea5'/>
<id>urn:sha1:42a6bd57ee9f930a72c26f863c72f666d6ed9ea5</id>
<content type='text'>
io_should_commit(), io_uring_classic_poll(), and io_do_iopoll() compare
struct io_kiocb's opcode against IORING_OP_URING_CMD to implement
special treatment for uring_cmds. The recently added opcode
IORING_OP_URING_CMD128 is meant to be equivalent to IORING_OP_URING_CMD,
so treat it the same way in these functions.
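
The change amounts to widening each opcode comparison; a minimal
sketch of the pattern (the helper name is illustrative, not from this
commit):

    static inline bool io_op_is_uring_cmd(u8 opcode)
    {
            return opcode == IORING_OP_URING_CMD ||
                   opcode == IORING_OP_URING_CMD128;
    }

    /* e.g. in io_do_iopoll() and the other two call sites */
    if (io_op_is_uring_cmd(req-&gt;opcode))
            ...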

Fixes: 1cba30bf9fdd ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Caleb Sander Mateos &lt;csander@purestorage.com&gt;
Reviewed-by: Anuj Gupta &lt;anuj20.g@samsung.com&gt;
Reviewed-by: Kanchan Joshi &lt;joshi.k@samsung.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring/rsrc: replace reg buffer bit field with flags</title>
<updated>2026-02-10T12:26:15+00:00</updated>
<author>
<name>Pavel Begunkov</name>
<email>asml.silence@gmail.com</email>
</author>
<published>2026-02-09T14:31:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0efc331d78b043b9d8477c64e279058062d36a0b'/>
<id>urn:sha1:0efc331d78b043b9d8477c64e279058062d36a0b</id>
<content type='text'>
I'll need a flag in the registered buffer struct for dmabuf work, and
it'll be more convenient to have a flags field rather than bit fields,
especially for io_mapped_ubuf initialisation.

We might want to add more flags in the future as well. For example, for
debugging and potential optimisations, it might be useful to split out
a flag indicating the shape of the buffer, to gate iov_iter_advance()
walks vs bit/mask arithmetic. It could also be combined with the
direction mask field.
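
A minimal before/after sketch (the field and flag names here are
illustrative, not the actual ones):

    /* before: individual bit fields */
    struct io_mapped_ubuf {
            unsigned int    is_kbuf:1;
            unsigned int    dir:2;
    };

    /* after: one flags word, initialisable in a single store */
    enum {
            IO_IMU_F_KBUF   = 1U &lt;&lt; 0,
    };

    struct io_mapped_ubuf {
            unsigned int    flags;
    };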

Signed-off-by: Pavel Begunkov &lt;asml.silence@gmail.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux</title>
<updated>2026-02-10T01:57:21+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-02-10T01:57:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0c00ed308d0559fc216be0442a3df124e9e13533'/>
<id>urn:sha1:0c00ed308d0559fc216be0442a3df124e9e13533</id>
<content type='text'>
Pull block updates from Jens Axboe:

 - Support for batch request processing for ublk, improving the
   efficiency of the kernel/ublk server communication. This can yield
   nice 7-12% performance improvements

 - Support for integrity data for ublk

 - Various other ublk improvements and additions, including a ton of
   selftest additions and updates

 - Move the handling of blk-crypto software fallback from below the
   block layer to above it. This reduces the complexity of dealing with
   bio splitting

 - Series fixing a number of potential deadlocks in blk-mq related to
   the queue usage counter and writeback throttling and rq-qos debugfs
   handling

 - Add an async_depth queue attribute, to resolve a performance
   regression that's been around for a while, related to the scheduler
   depth handling

 - Only use task_work for IOPOLL completions on NVMe if it is necessary
   to do so. An earlier fix for an issue resulted in all these
   completions being punted to task_work, to guarantee that completions
   were only run for a given io_uring ring when it was local to that
   ring. With the new changes, we can detect if it's necessary to use
   task_work or not, and avoid it if possible.

 - rnbd fixes:
      - Fix refcount underflow in device unmap path
      - Handle PREFLUSH and NOUNMAP flags properly in protocol
      - Fix server-side bi_size for special IOs
      - Zero response buffer before use
      - Fix trace format for flags
      - Add .release to rnbd_dev_ktype

 - MD pull requests via Yu Kuai
      - Fix raid5_run() to return error when log_init() fails
      - Fix IO hang with degraded array with llbitmap
      - Fix percpu_ref not resurrected on suspend timeout in llbitmap
      - Fix GPF in write_page caused by resize race
      - Fix NULL pointer dereference in process_metadata_update
      - Fix hang when stopping arrays with metadata through dm-raid
      - Fix any_working flag handling in raid10_sync_request
      - Refactor sync/recovery code path, improve error handling for
        badblocks, and remove unused recovery_disabled field
      - Consolidate mddev boolean fields into mddev_flags
      - Use mempool to allocate stripe_request_ctx and make sure
        max_sectors is not less than io_opt in raid5
      - Fix return value of mddev_trylock
      - Fix memory leak in raid1_run()
      - Add Li Nan as mdraid reviewer

 - Move phys_vec definitions to the kernel types, mostly in preparation
   for some VFIO and RDMA changes

 - Improve the speed for secure erase for some devices

 - Various little rust updates

 - Various other minor fixes, improvements, and cleanups

* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  blk-mq: ABI/sysfs-block: fix docs build warnings
  selftests: ublk: organize test directories by test ID
  block: decouple secure erase size limit from discard size limit
  block: remove redundant kill_bdev() call in set_blocksize()
  blk-mq: add documentation for new queue attribute async_depth
  block, bfq: convert to use request_queue-&gt;async_depth
  mq-deadline: convert to use request_queue-&gt;async_depth
  kyber: convert to use request_queue-&gt;async_depth
  blk-mq: add a new queue sysfs attribute async_depth
  blk-mq: factor out a helper blk_mq_limit_depth()
  blk-mq-sched: unify elevators checking for async requests
  block: convert nr_requests to unsigned int
  block: don't use strcpy to copy blockdev name
  blk-mq-debugfs: warn about possible deadlock
  blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
  blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
  blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
  blk-rq-qos: fix possible debugfs_mutex deadlock
  blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
  blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
  ...
</content>
</entry>
<entry>
<title>Merge tag 'for-7.0/io_uring-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux</title>
<updated>2026-02-10T01:22:00+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-02-10T01:22:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f5d4feed174ce9fb3c42886a3c36038fd5a43e25'/>
<id>urn:sha1:f5d4feed174ce9fb3c42886a3c36038fd5a43e25</id>
<content type='text'>
Pull io_uring updates from Jens Axboe:

 - Clean up the IORING_SETUP_R_DISABLED and submitter task checking,
   mostly just in preparation for relaxing the locking for SINGLE_ISSUER
   in the future.

 - Improve IOPOLL by using a doubly linked list to manage completions.

   Previously it was a singly linked list, which meant that request N
   could only be completed once requests 0..N-1 in the chain had
   completed. With a doubly linked list, requests can be completed in
   whatever order they finish, rather than needing to wait for a
   consecutive range to become available. This reduces latencies.

 - Improve the restriction setup and checking. Mostly in preparation for
   adding further features on top of that. Coming in a separate pull
   request.

 - Split out task_work and wait handling into separate files. These are
   mostly nicely abstracted already, but still remained in the
   io_uring.c file which is on the larger side.

 - Use GFP_KERNEL_ACCOUNT in a few more spots, where appropriate.

 - Ensure even the idle io-wq worker exits if a task no longer has any
   rings open.

 - Add support for a non-circular submission queue.

   By default, the SQ ring keeps moving around, even if only a few
   entries are used for each submission. This can be wasteful in terms
   of cachelines.

   If IORING_SETUP_SQ_REWIND is set for the ring when created, each
   submission will start at offset 0 instead of where we last left off
   doing submissions (see the setup sketch after this list).

 - Various little cleanups
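
As referenced above, a minimal setup sketch for the rewinding SQ (only
the IORING_SETUP_SQ_REWIND flag is from this series; the rest is
ordinary io_uring_setup() boilerplate, and QUEUE_DEPTH stands in for
whatever ring size the application wants):

    struct io_uring_params p = { };
    int ring_fd;

    p.flags = IORING_SETUP_SQ_REWIND;
    ring_fd = syscall(__NR_io_uring_setup, QUEUE_DEPTH, &amp;p);
    /* each submission batch now fills SQEs starting at index 0,
       keeping the hot SQ cachelines stable across submissions */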

* tag 'for-7.0/io_uring-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (30 commits)
  io_uring/kbuf: fix memory leak if io_buffer_add_list fails
  io_uring: Add SPDX id lines to remaining source files
  io_uring: allow io-wq workers to exit when unused
  io_uring/io-wq: add exit-on-idle state
  io_uring/net: don't continue send bundle if poll was required for retry
  io_uring/rsrc: use GFP_KERNEL_ACCOUNT consistently
  io_uring/futex: use GFP_KERNEL_ACCOUNT for futex data allocation
  io_uring/io-wq: handle !sysctl_hung_task_timeout_secs
  io_uring: fix bad indentation for setup flags if statement
  io_uring/rsrc: take unsigned index in io_rsrc_node_lookup()
  io_uring: introduce non-circular SQ
  io_uring: split out CQ waiting code into wait.c
  io_uring: split out task work code into tw.c
  io_uring/io-wq: don't trigger hung task for syzbot craziness
  io_uring: add IO_URING_EXIT_WAIT_MAX definition
  io_uring/sync: validate passed in offset
  io_uring/eventfd: remove unused ctx-&gt;evfd_last_cq_tail member
  io_uring/timeout: annotate data race in io_flush_timeouts()
  io_uring/uring_cmd: explicitly disallow cancelations for IOPOLL
  io_uring: fix IOPOLL with passthrough I/O
  ...
</content>
</entry>
<entry>
<title>Merge tag 'io_uring-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux</title>
<updated>2026-01-23T20:51:00+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-01-23T20:51:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7907f673d0ea569b23274ce2fc75f479b905e547'/>
<id>urn:sha1:7907f673d0ea569b23274ce2fc75f479b905e547</id>
<content type='text'>
Pull io_uring fixes from Jens Axboe:

 - Fix for a potential leak of an iovec, if a specific cleanup path is
   used and the rw_cache is full at the time of the call

 - Fix for a regression added in this cycle, where waitid should be
   using proper release/acquire semantics for updating the wait queue
   head

 - Check for the cancelation bit being set for every work item processed
   by io-wq, not just at the start of the loop. Has no real practical
   implications other than to shut up syzbot doing crazy things that
   grossly overload a system, hence slowing down ring exit

 - A few selftest additions, updating the mini_liburing that selftests
   use

* tag 'io_uring-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  selftests/io_uring: support NO_SQARRAY in miniliburing
  selftests/io_uring: add io_uring_queue_init_params
  io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop
  io_uring/waitid: fix KCSAN warning on io_waitid-&gt;head
  io_uring/rw: free potentially allocated iovec on cache put failure
</content>
</entry>
<entry>
<title>nvme/io_uring: optimize IOPOLL completions for local ring context</title>
<updated>2026-01-20T17:18:01+00:00</updated>
<author>
<name>Ming Lei</name>
<email>ming.lei@redhat.com</email>
</author>
<published>2026-01-16T07:46:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f7bc22ca0d55bdcb59e3a4a028fb811d23e53959'/>
<id>urn:sha1:f7bc22ca0d55bdcb59e3a4a028fb811d23e53959</id>
<content type='text'>
When multiple io_uring rings poll on the same NVMe queue, one ring can
find completions belonging to another ring. The current code always
uses task_work to handle this, but this adds overhead for the common
single-ring case.

This patch passes the polling io_ring_ctx through io_comp_batch's new
poll_ctx field. In io_do_iopoll(), the polling ring's context is stored
in iob.poll_ctx before calling the iopoll callbacks.

In nvme_uring_cmd_end_io(), we now compare iob-&gt;poll_ctx with the
request's owning io_ring_ctx (via io_uring_cmd_ctx_handle()). If they
match (local context), we complete inline with io_uring_cmd_done32().
If they differ (remote context) or iob is NULL (non-iopoll path), we
use task_work as before.
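
A sketch of the resulting dispatch in nvme_uring_cmd_end_io()
(argument lists abbreviated and the task_work helper names are
approximate; only poll_ctx, io_uring_cmd_ctx_handle() and
io_uring_cmd_done32() come from this description):

    if (iob &amp;&amp; iob-&gt;poll_ctx == io_uring_cmd_ctx_handle(ioucmd)) {
            /* local ring polled its own completion: finish inline */
            io_uring_cmd_done32(ioucmd, status, result, issue_flags);
    } else {
            /* remote ring or non-iopoll path: punt to task_work */
            io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
    }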

This optimization eliminates task_work scheduling overhead for the
common case where a ring polls and finds its own completions.

~10% IOPS improvement is observed in the following benchmark:

fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -O0 -P1 -u1 -n1 /dev/ng0n1

Signed-off-by: Ming Lei &lt;ming.lei@redhat.com&gt;
Reviewed-by: Kanchan Joshi &lt;joshi.k@samsung.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring/rw: free potentially allocated iovec on cache put failure</title>
<updated>2026-01-19T13:59:06+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2026-01-19T02:48:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4b9748055457ac3a0710bf210c229d01ea1b01b9'/>
<id>urn:sha1:4b9748055457ac3a0710bf210c229d01ea1b01b9</id>
<content type='text'>
If a read/write request goes through io_req_rw_cleanup() with an
allocated iovec attached, and the request then fails to be put back
into the rw_cache, the iovec may end up leaked. Have io_rw_recycle()
return whether it recycled the request or not, and use that to gauge
whether a potentially allocated iovec needs freeing.
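
A sketch of the resulting cleanup path (the iovec member name is
illustrative, not the actual one):

    /* io_req_rw_cleanup(), sketch */
    if (!io_rw_recycle(req, issue_flags)) {
            /* cache put failed: we still own the iovec, free it */
            kfree(rw-&gt;free_iovec);
            rw-&gt;free_iovec = NULL;
    }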

Reviewed-by: Nitesh Shetty &lt;nj.shetty@samsung.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: fix IOPOLL with passthrough I/O</title>
<updated>2026-01-15T05:03:49+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2026-01-14T14:59:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=697a5284ad9697609324739e38e341612cd342a6'/>
<id>urn:sha1:697a5284ad9697609324739e38e341612cd342a6</id>
<content type='text'>
A previous commit improving IOPOLL made an incorrect assumption that
task_work isn't used with IOPOLL. This can cause crashes when doing
passthrough I/O on nvme, where queueing the completion task_work will
trample on the same memory that holds the completed list of requests.

Fix it up by shuffling the members around, so we're not sharing any
parts that end up getting used in this path.
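
Conceptually the layout change looks like this (member names are
illustrative, not the actual ones):

    struct io_kiocb {
            /*
             * Before, these two shared storage, so queueing the
             * completion task_work trampled the iopoll list linkage.
             * After the shuffle they live in separate members.
             */
            struct llist_node       io_task_work_node;
            struct list_head        iopoll_node;
    };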

Fixes: 3c7d76d6128a ("io_uring: IOPOLL polling improvements")
Reported-by: Yi Zhang &lt;yi.zhang@redhat.com&gt;
Link: https://lore.kernel.org/linux-block/CAHj4cs_SLPj9v9w5MgfzHKy+983enPx3ZQY2kMuMJ1202DBefw@mail.gmail.com/
Tested-by: Yi Zhang &lt;yi.zhang@redhat.com&gt;
Cc: Ming Lei &lt;ming.lei@redhat.com&gt;
Reviewed-by: Ming Lei &lt;ming.lei@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: IOPOLL polling improvements</title>
<updated>2025-12-28T22:54:45+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2025-12-11T10:25:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3c7d76d6128a0fef68e6540754bf85a44a29bb59'/>
<id>urn:sha1:3c7d76d6128a0fef68e6540754bf85a44a29bb59</id>
<content type='text'>
io_uring manages issued and pending IOPOLL read/write requests in a
singly linked list. One downside of that is that individual items
cannot easily be removed from that list, and as a result, io_uring
will only complete a completed request N in that list if 0..N-1 are
also complete. For homogeneous IO this isn't necessarily an issue,
but if different devices are involved in polling in the same ring, or
if disparate IO from the same device is being polled for, this can
defer completion of some requests unnecessarily.

Move to a doubly linked list for iopoll completions instead, making it
possible to easily complete whichever requests have successfully
finished polling, regardless of their position in the list.
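
With a list_head based list any finished entry can be unlinked in
O(1); a sketch of the completion scan (member and helper names are
illustrative):

    struct io_kiocb *req, *tmp;

    list_for_each_entry_safe(req, tmp, &amp;ctx-&gt;iopoll_list,
                             iopoll_node) {
            if (!READ_ONCE(req-&gt;iopoll_completed))
                    continue;
            /* doubly linked: drop this entry without waiting for
               its still-inflight predecessors */
            list_del(&amp;req-&gt;iopoll_node);
            io_req_complete(req);
    }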

Co-developed-by: Fengnan Chang &lt;fengnanchang@gmail.com&gt;
Link: https://lore.kernel.org/io-uring/20251210085501.84261-1-changfengnan@bytedance.com/
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>block: enable per-cpu bio cache by default</title>
<updated>2025-12-04T14:19:24+00:00</updated>
<author>
<name>Fengnan Chang</name>
<email>changfengnan@bytedance.com</email>
</author>
<published>2025-11-14T09:21:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=48f22f80938d94c34319f90674de6102ca37eabc'/>
<id>urn:sha1:48f22f80938d94c34319f90674de6102ca37eabc</id>
<content type='text'>
Since commit 12e4e8c7ab59 ("io_uring/rw: enable bio caches for IRQ
rw"), bio_put() is safe in both task and IRQ context, and
bio_alloc_bioset() is safe in task context (nothing calls it from IRQ
context), so we can enable the per-cpu bio cache by default.
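
For reference, biosets have so far opted into the cache at init time;
a sketch of the existing opt-in that this patch turns into the default
behaviour:

    ret = bioset_init(&amp;bs, BIO_POOL_SIZE, 0,
                      BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE);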

Benchmarked with t/io_uring and ext4+nvme:
taskset -c 6 /root/fio/t/io_uring  -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1
-X1 -n1 -P1  /mnt/testfile
base IOPS is 562K, patch IOPS is 574K. The CPU usage of
bio_alloc_bioset decreased from 1.42% to 1.22%.

The worst case is allocating a bio on CPU A but freeing it on CPU B;
still using t/io_uring and ext4+nvme:
base IOPS is 648K, patch IOPS is 647K.

fio tests of ext4/xfs with libaio/sync/io_uring on null_blk and nvme
also show no obvious performance regression.

Signed-off-by: Fengnan Chang &lt;changfengnan@bytedance.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
</feed>
