path: root/io_uring
Age | Commit message | Author | Files | Lines
3 daysMerge tag 'io_uring-7.1-20260611' of ↵Linus Torvalds2-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Tweak for an off-by-one in the CQ ring accounting for the min wait support. - Don't truncate end buffer length for a bundle, as the transfer might not happen. It's not required in the first place, as the completion side handles this condition already. * tag 'io_uring-7.1-20260611' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/wait: fix min_timeout behavior io_uring/kbuf: don't truncate end buffer for bundles
8 daysio_uring/wait: fix min_timeout behaviorChristian A. Ehrhardt1-1/+1
The wakeup condition if a min timeout is present and has expired is that at least _one_ CQE was posted. Thus set the cq_tail target to ->cq_min_tail + 1. Without this commit a spurious wakeup can result in a premature wakeup because io_should_wake() will return true even if _no_ CQE was posted at all. Cc: Tip ten Brink <tip@tenbrinkmeijs.com> Fixes: e15cb2200b93 ("io_uring: fix min_wait wakeups for SQPOLL") Cc: stable@vger.kernel.org Signed-off-by: Christian A. Ehrhardt <lk@c--e.de> Link: https://patch.msgid.link/20260606201120.1441447-1-lk@c--e.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
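The off-by-one described above can be illustrated with a small userspace model. This is a sketch only: should_wake() and min_wait_target() are hypothetical stand-ins, not the kernel's io_should_wake() or its wait-queue fields, and the model assumes the wake test is a wraparound-safe "tail has reached target" comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the wake check: wake once the CQ tail has reached
 * the target. The signed distance keeps the comparison valid when the
 * 32-bit counters wrap (relies on the usual two's complement conversion).
 */
static bool should_wake(uint32_t cq_tail, uint32_t target_tail)
{
	return (int32_t)(cq_tail - target_tail) >= 0;
}

/*
 * Target to use once the min timeout has expired: one past the tail that
 * was sampled when the wait started, i.e. at least one new CQE.
 */
static uint32_t min_wait_target(uint32_t cq_min_tail)
{
	return cq_min_tail + 1;
}
```

With the target left at cq_min_tail itself, should_wake() is already true before anything has been posted, which is the premature wakeup the commit describes.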
8 daysio_uring/kbuf: don't truncate end buffer for bundlesJens Axboe1-1/+0
If buffers have been peeked for a bundle receive, the kernel will truncate the end buffer, if the available length is shorter than the buffer itself. This is unnecessary, as applications iterating bundle receives must always use the minimum size of the buffer length and the remaining number of bytes in the bundle. The examples in liburing do that as well, eg examples/proxy.c. If the kernel does truncate this buffer AND the current transfer fails, then the buffer will be left with a smaller size than what is otherwise available. Just remove the buffer truncation, as it's not necessary in the first place. Link: https://lore.kernel.org/io-uring/CAAEr8jbY60noGj1fw_k91UJRBkyiRVoS6=nLhZ7Svwidjn4CAA@mail.gmail.com/ Reported-by: Federico Brasili <federico.brasili@gmail.com> Cc: stable@vger.kernel.org Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>
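The rule that applications must use the minimum of the buffer length and the remaining bundle bytes can be sketched as follows. This is an illustrative model rather than code taken from liburing's examples/proxy.c; struct buf_desc and walk_bundle() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical description of one provided buffer; not a liburing type. */
struct buf_desc {
	size_t len;	/* full size of the buffer as it was provided */
};

/*
 * Walk the buffers covered by one bundle completion. `total` is the byte
 * count reported by the completion. Stores the bytes used from each buffer
 * in used[] and returns how many buffers were touched.
 */
static size_t walk_bundle(const struct buf_desc *bufs, size_t nbufs,
			  size_t total, size_t *used)
{
	size_t i = 0;

	while (total > 0 && i < nbufs) {
		/* Never trust the buffer length alone: clamp to what is left. */
		size_t this_len = bufs[i].len < total ? bufs[i].len : total;

		used[i] = this_len;
		total -= this_len;
		i++;
	}
	return i;
}
```

Because the consumer clamps on its own, the kernel does not need to shrink the last buffer in advance.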
10 daysMerge tag 'io_uring-7.1-20260605' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fix from Jens Axboe: "A single fix for a missing flag mask when multishot is used with an incrementally consumed buffer ring, potentially leading to application confusion because of lack of IORING_CQE_F_BUF_MORE consistency" * tag 'io_uring-7.1-20260605' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries
10 daysio_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retriesClément Léger1-1/+2
When a bundle recv retries inside io_recv_finish(), the merge logic ORs the saved cflags from the previous iteration with the cflags returned by the new iteration:

    cflags = req->cqe.flags | (cflags & CQE_F_MASK);

Bits listed in CQE_F_MASK are inherited from the new iteration, and all other bits (notably IORING_CQE_F_BUFFER and the buffer ID) come from the saved cflags. Before this change CQE_F_MASK covered only IORING_CQE_F_SOCK_NONEMPTY and IORING_CQE_F_MORE.

When using provided buffer rings in incremental mode (IOU_PBUF_RING_INC) together with bundle recv, io_kbuf_inc_commit() can leave the head ring entry partially consumed. __io_put_kbufs() then sets IORING_CQE_F_BUF_MORE on the returned cflags so userspace knows the buffer ID will be reused for subsequent completions. Because IORING_CQE_F_BUF_MORE was not in CQE_F_MASK, the merge above silently dropped it whenever the final retry iteration partially consumed the buffer, and the subsequent req->cqe.flags = cflags & ~CQE_F_MASK save would have left a stale IORING_CQE_F_BUF_MORE in the carried-over cflags had one been present. Userspace would then wrongfully advance its ring head past an entry the kernel still uses.

Add IORING_CQE_F_BUF_MORE to CQE_F_MASK so it is both inherited from the new iteration into the user-visible CQE and stripped from the saved cflags between iterations.

Cc: stable@vger.kernel.org
Signed-off-by: Clément Léger <cleger@meta.com>
Assisted-by: Claude:claude-opus-4.6
Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://patch.msgid.link/20260604160715.2482972-1-cleger@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
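The effect of the mask change can be reproduced in a small userspace model. The DEMO_* bit positions and the helper names below are assumptions made for illustration; only the two expressions (merge and save) come from the commit message.

```c
#include <stdint.h>

/* Illustrative flag bits; the positions are assumptions for this sketch. */
#define DEMO_F_BUFFER		(1U << 0)
#define DEMO_F_MORE		(1U << 1)
#define DEMO_F_SOCK_NONEMPTY	(1U << 2)
#define DEMO_F_BUF_MORE		(1U << 4)

/* Mask before and after the fix. */
#define DEMO_MASK_OLD	(DEMO_F_SOCK_NONEMPTY | DEMO_F_MORE)
#define DEMO_MASK_NEW	(DEMO_MASK_OLD | DEMO_F_BUF_MORE)

/* User-visible flags: saved bits plus the masked bits of the latest retry. */
static uint32_t merge_cflags(uint32_t saved, uint32_t latest, uint32_t mask)
{
	return saved | (latest & mask);
}

/* Flags carried into the next retry: everything except the masked bits. */
static uint32_t save_cflags(uint32_t cflags, uint32_t mask)
{
	return cflags & ~mask;
}
```

With DEMO_MASK_OLD, a BUF_MORE bit set only by the latest iteration is lost in merge_cflags(), and one set by an earlier iteration survives save_cflags(); with DEMO_MASK_NEW both cases behave as the commit intends.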
2026-05-29Merge tag 'io_uring-7.1-20260529' of ↵Linus Torvalds1-4/+8
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fix from Jens Axboe: "Just a single fix for a regression introduced in this cycle, where we should ensure the node is visible before the entry is added to the tctx list" * tag 'io_uring-7.1-20260529' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/tctx: set ->io_uring before publishing the tctx node
2026-05-24io_uring/tctx: set ->io_uring before publishing the tctx nodeLim HyeonJun1-4/+8
io_register_iowq_max_workers() walks ctx->tctx_list under ctx->tctx_lock and dereferences each node's task->io_uring without a NULL check:

    list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
        tctx = node->task->io_uring;
        if (WARN_ON_ONCE(!tctx->io_wq))
            continue;
        ...
    }

__io_uring_add_tctx_node() installs the node into ctx->tctx_list (via io_tctx_install_node(), which does the list_add() under tctx_lock) and only assigns current->io_uring = tctx afterwards. A task doing its first io_uring operation on a shared ring therefore has a window in which its node is already visible on ctx->tctx_list while node->task->io_uring is still NULL. A concurrent IORING_REGISTER_IOWQ_MAX_WORKERS on the same ring reads that NULL and dereferences tctx->io_wq:

    KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
    RIP: io_register_iowq_max_workers io_uring/register.c:423

Publish current->io_uring = tctx before installing the node, so any node visible on ctx->tctx_list always has a valid task->io_uring.

Fixes: 7880174e1e5e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling")
Signed-off-by: Lim HyeonJun <shja0831@gmail.com>
Link: https://patch.msgid.link/20260524110853.115634-1-shja0831@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-22Merge tag 'io_uring-7.1-20260522' of ↵Linus Torvalds5-9/+35
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Fix for an issue with IORING_OP_NOP and using injection results - Fix for an issue in IORING_OP_WAITID, where the info state was assumed cleared by the lower level syscall handler, but for some cases it is not. Just clear the data upfront, so that non-initialized data isn't copied back to userspace - Fix for a lockdep reported issue, where IORING_OP_BIND enters file create and hence hits mnt_want_write(), which creates a three part lockdep cycle between the super lock, io_uring's uring_lock, and the cred mutex - Fix a regression introduced in this cycle with how linked timeouts are deleted - Ensure that the ->opcode nospec indexing on the opcode issue side covers all the cases * tag 'io_uring-7.1-20260522' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/nop: pass all errors to userspace io_uring/timeout: splice timed out link in timeout handler io_uring: propagate array_index_nospec opcode into req->opcode io_uring/waitid: clear waitid info before copying it to userspace io_uring/net: punt IORING_OP_BIND async if it needs file create
2026-05-21io_uring/nop: pass all errors to userspaceAlexander A. Klimov1-2/+2
This fixes an inconsistency where io_nop() called req_set_fail() based on ret, but passed just nop->result to userspace. Initially, ret is an exact copy of nop->result, but it is overwritten with an error code if one occurs later on. That error is now passed to userspace as well. Fixes: a85f31052bce ("io_uring/nop: add support for testing registered files and buffers") Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Link: https://patch.msgid.link/20260520180045.538533-1-grandmaster@al2klimov.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-20io_uring/timeout: splice timed out link in timeout handlerJens Axboe1-1/+3
A previous commit deferred this to the task_work part of it, so it could be protected by ->uring_lock. But that's actually not necessary here, and in fact the head clearing is not enough to make that safe. For those two reasons, just re-instate the local splicing. Fixes: 49ae66eb8c27 ("io_uring: defer linked-timeout chain splice out of hrtimer context") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-18io_uring: propagate array_index_nospec opcode into req->opcodeMichael Bommarito1-5/+4
Commit 1e988c3fe126 ("io_uring: prevent opcode speculation") added array_index_nospec() to io_init_req(), but applied it only to a local opcode variable. req->opcode is initialized from sqe->opcode before the bounds check and remains the raw value. Keep req->opcode as the canonical opcode in io_init_req(): reject out-of-range values architecturally, then write the array_index_nospec() result back to req->opcode before any table lookup. This keeps downstream users of req->opcode from observing the raw user byte on a mispredicted path. No functional change: array_index_nospec() is a no-op for opcodes in [0, IORING_OP_LAST), and out-of-range opcodes are still rejected at the bounds check above the assignment. Fixes: 1e988c3fe126 ("io_uring: prevent opcode speculation") Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Link: https://patch.msgid.link/20260517213010.696135-1-michael.bommarito@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
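The "check, then store the sanitized value" pattern can be sketched in userspace as below. index_nospec_model() only imitates the arithmetic result of array_index_nospec() and gives no real protection against speculation (a compiler may still emit a branch); struct demo_req, DEMO_OP_LAST and demo_init_req() are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_OP_LAST 64	/* hypothetical number of opcodes */

struct demo_req {
	uint8_t opcode;
};

/*
 * Model of the clamp: returns index when index < size and 0 otherwise,
 * by ANDing with an all-ones or all-zeroes mask.
 */
static size_t index_nospec_model(size_t index, size_t size)
{
	size_t mask = (size_t)0 - (size_t)(index < size);

	return index & mask;
}

static int demo_init_req(struct demo_req *req, uint8_t sqe_opcode)
{
	/* Architectural rejection of out-of-range opcodes. */
	if (sqe_opcode >= DEMO_OP_LAST)
		return -EINVAL;

	/* Store the sanitized value so later users never see the raw byte. */
	req->opcode = (uint8_t)index_nospec_model(sqe_opcode, DEMO_OP_LAST);
	return 0;
}
```

For in-range opcodes the clamp is an identity, which matches the commit's "no functional change" statement.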
2026-05-16io_uring/waitid: clear waitid info before copying it to userspaceHeechan Kang1-0/+1
IORING_OP_WAITID stores its result fields in struct io_waitid::info and later copies them to userspace siginfo. The prep path initializes the request arguments, but it does not initialize info itself. If the wait operation completes without reporting a child event, the common wait code can return without writing wo_info. In that case io_waitid_finish() still copies iw->info to userspace, exposing stale bytes from the reused io_kiocb command storage. Clear the result storage during prep so the io_uring path matches the regular waitid syscall, which uses a zero-initialized struct waitid_info. Fixes: f31ecf671ddc ("io_uring: add IORING_OP_WAITID support") Cc: stable@vger.kernel.org # 6.7+ Signed-off-by: Heechan Kang <gganji11@naver.com> Link: https://patch.msgid.link/20260516184709.852814-1-gganji11@naver.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-15Merge tag 'io_uring-7.1-20260515' of ↵Linus Torvalds6-13/+44
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Small series sanitizing the locking done for either modifying or reading a chain of requests - If the application has a pid namespace, ensure that the sqthread pid is correctly printed in fdinfo - Fix for a hashing issue in the io-wq thread pool, which could lead to a use-after-free - Kill dead argument from io_prep_rw_pi() - Fix for a missed validation of the CQ ring head, affecting CQE refill * tag 'io_uring-7.1-20260515' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring: validate user-controlled cq.head in io_cqe_cache_refill() io-wq: check that the predecessor is hashed in io_wq_remove_pending() io_uring/rw: drop unused attr_type_mask from io_prep_rw_pi() io_uring: hold uring_lock across io_kill_timeouts() in cancel path io_uring: defer linked-timeout chain splice out of hrtimer context io_uring: hold uring_lock when walking link chain in io_wq_free_work() io_uring/fdinfo: translate SqThread PID through caller's pid_ns
2026-05-15io_uring/net: punt IORING_OP_BIND async if it needs file createJens Axboe1-1/+25
For two reasons: 1) An opcode cannot block inside io_uring_enter() doing submissions, as it'll stall the submission side pipeline. 2) Ending up in sb_start_write() -> __sb_start_write() -> percpu_down_read_freezable() introduces a new lockdep edge, which it correctly complains about. Check if the socket type is AF_UNIX and has a non-empty pathname. If it does, mark it REQ_F_FORCE_ASYNC to punt the submission to io-wq rather than attempt to do it inline. Fixes: 7481fd93fa0a ("io_uring: Introduce IORING_OP_BIND") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-14io_uring: validate user-controlled cq.head in io_cqe_cache_refill()Zizhi Wo1-5/+17
A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:

    [root@fedora io_uring_stress]# ps -ef | grep io_uring
    root 1240 1 99 13:36 ? 00:01:35 [io_uring_stress] <defunct>

The task loops inside io_cqring_wait() and never returns to userspace, and SIGKILL has no effect.

This is caused by the CQ ring exposing rings->cq.head to userspace as writable, while the authoritative tail lives in kernel-private ctx->cached_cq_tail. io_cqe_cache_refill() computes free space as an unsigned subtraction:

    free = ctx->cq_entries - min(tail - head, ctx->cq_entries);

If userspace keeps head within [0, tail], the subtraction is well defined and min() just acts as a defensive clamp. But if userspace advances head past tail, (tail - head) wraps to a huge value, free becomes 0, and io_cqe_cache_refill() fails. The CQE is pushed onto the overflow list and IO_CHECK_CQ_OVERFLOW_BIT is set.

The wait loop in io_cqring_wait() relies on an invariant: refill() only fails when the CQ is *physically* full, in which case rings->cq.tail has been advanced to iowq->cq_tail and io_should_wake() returns true. The tampered head breaks this: refill() fails while the ring is not full, no OCQE is copied in, rings->cq.tail never catches up, io_should_wake() stays false, and io_cqring_wait_schedule() keeps returning early because IO_CHECK_CQ_OVERFLOW_BIT is still set. The result is a tight retry loop that never returns to userspace.

Introduce io_cqring_queued() as the single point that converts the (tail, head) pair into a trustworthy queued count. Since the real head/tail distance is bounded by cq_entries (far below 2^31), a signed comparison reliably detects userspace moving head past tail; in that case treat the queue as empty so callers see the full cache as free and forward progress is preserved.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com
[axboe: fixup commit message, kill 'queued' var, and keep it all in io_uring.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
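The difference between the unsigned subtraction and the signed check can be shown with a userspace model. This is a sketch under stated assumptions: the function names are hypothetical and the body is not the kernel's io_cqring_queued(), only an implementation of the behaviour the message describes (a head past the tail is treated as an empty queue).

```c
#include <stdint.h>

/* Pre-fix arithmetic, as quoted in the commit message. */
static uint32_t cq_free_unsigned(uint32_t tail, uint32_t head,
				 uint32_t cq_entries)
{
	uint32_t queued = tail - head;	/* wraps to a huge value if head > tail */

	if (queued > cq_entries)
		queued = cq_entries;
	return cq_entries - queued;
}

/* Model of a trustworthy queued count based on a signed comparison. */
static uint32_t cqring_queued_model(uint32_t tail, uint32_t head,
				    uint32_t cq_entries)
{
	int32_t diff = (int32_t)(tail - head);

	if (diff < 0)
		return 0;	/* head moved past tail: treat queue as empty */
	return (uint32_t)diff < cq_entries ? (uint32_t)diff : cq_entries;
}

static uint32_t cq_free_model(uint32_t tail, uint32_t head,
			      uint32_t cq_entries)
{
	return cq_entries - cqring_queued_model(tail, head, cq_entries);
}
```

For tail = 10, head = 11 and 16 entries, the unsigned version reports 0 free slots while the signed version reports 16, so posting a CQE can make progress.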
2026-05-13io-wq: check that the predecessor is hashed in io_wq_remove_pending()Nicholas Carlini1-1/+2
io_wq_remove_pending() needs to fix up wq->hash_tail[] if the cancelled work was the tail of its hash bucket. When doing this, it checks whether the preceding entry in acct->work_list has the same hash value, but never checks that the predecessor is hashed at all. io_get_work_hash() is simply atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT, and the hash bits are never set for non-hashed work, so it returns 0. Thus, when a hashed bucket-0 work is cancelled while a non-hashed work is its list predecessor, the check spuriously passes and a pointer to the non-hashed io_kiocb is stored in wq->hash_tail[0]. Because non-hashed work is dequeued via the fast path in io_get_next_work(), which never touches hash_tail[], the stale pointer is never cleared. Therefore, after the non-hashed io_kiocb completes and is freed back to req_cachep, wq->hash_tail[0] is a dangling pointer. The io_wq is per-task (tctx->io_wq) and survives ring open/close, so the dangling pointer persists for the lifetime of the task; the next hashed bucket-0 enqueue dereferences it in io_wq_insert_work() and wq_list_add_after() writes through freed memory. Add the missing io_wq_is_hashed() check so a non-hashed predecessor never inherits a hash_tail[] slot. Cc: stable@vger.kernel.org Fixes: 204361a77f40 ("io-wq: fix hang after cancelling pending hashed work") Signed-off-by: Nicholas Carlini <nicholas@carlini.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
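The spurious match is easy to reproduce in a userspace model of the flag layout. The DEMO_* constants and helper names are assumptions chosen for this sketch; only the "flags >> shift" hash extraction and the added is-hashed test come from the commit message.

```c
#include <stdbool.h>

#define DEMO_HASH_SHIFT		24		/* hypothetical shift */
#define DEMO_WORK_HASHED	(1U << 2)	/* hypothetical "hashed" bit */

static unsigned int work_hash(unsigned int flags)
{
	return flags >> DEMO_HASH_SHIFT;
}

static bool work_is_hashed(unsigned int flags)
{
	return (flags & DEMO_WORK_HASHED) != 0;
}

/* Pre-fix test: compares hash values only. */
static bool prev_takes_tail_old(unsigned int prev_flags, unsigned int hash)
{
	return work_hash(prev_flags) == hash;
}

/* Fixed test: the predecessor must actually be hashed work. */
static bool prev_takes_tail_fixed(unsigned int prev_flags, unsigned int hash)
{
	return work_is_hashed(prev_flags) && work_hash(prev_flags) == hash;
}
```

Non-hashed work has no hash bits set, so its extracted hash is 0 and the old test matches bucket 0 by accident.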
2026-05-13io_uring/rw: drop unused attr_type_mask from io_prep_rw_pi()Yang Xiuwei1-2/+2
io_prep_rw_pi() never used the attr_type_mask argument. Callers already validate sqe->attr_type_mask before invoking the helper (only IORING_RW_ATTR_FLAG_PI is supported today). Remove the dead parameter to avoid implying further interpretation happens here. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Link: https://patch.msgid.link/20260513094303.866533-1-yangxiuwei@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-11io_uring: hold uring_lock across io_kill_timeouts() in cancel pathJens Axboe1-1/+1
io_uring_try_cancel_requests() dropped ctx->uring_lock before calling io_kill_timeouts(), which walks each timeout's link chain via io_match_task() to test REQ_F_INFLIGHT. With chain mutation now serialized by ctx->uring_lock, that walk needs the lock too. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-11io_uring: defer linked-timeout chain splice out of hrtimer contextJens Axboe1-2/+14
io_link_timeout_fn() is the hrtimer callback that fires when a linked timeout expires. It currently calls io_remove_next_linked(prev) under ctx->timeout_lock to splice the timeout request out of the link chain. This is the only chain-mutation site that runs without ctx->uring_lock, because hrtimer callbacks cannot take a mutex. Defer the splicing until the task_work callback. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-11io_uring: hold uring_lock when walking link chain in io_wq_free_work()Jens Axboe1-1/+6
io_wq_free_work() calls io_req_find_next() from io-wq worker context, which reads and clears req->link without holding any lock. This can potentially race with other paths that mutate the same chain under ctx->uring_lock. Take ctx->uring_lock around the io_req_find_next() call. Only requests with IO_REQ_LINK_FLAGS reach this path, which is not the hot path. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-11io_uring/fdinfo: translate SqThread PID through caller's pid_nsMaoyi Xie1-1/+2
SQPOLL stores current->pid (init_pid_ns view) in sqd->task_pid at thread creation. fdinfo prints it raw via seq_printf("SqThread:\t%d\n", sq_pid). A reader inside a non-initial pid_ns sees the host PID, not the kthread's PID in the reader's own pid_ns. The SQPOLL kthread is created with CLONE_THREAD and no CLONE_NEW*, so it lives in the submitter's pid_ns. An unprivileged user_ns + pid_ns submitter can read fdinfo and learn the host PID of a kthread whose in-namespace PID is different. Reproducer (mainline 7.0, KASAN): unshare CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS, mount a private /proc, then have a grandchild that is pid 1 in the new pid_ns open an io_uring ring with IORING_SETUP_SQPOLL. /proc/self/task lists {1, 2}; the SQPOLL kthread is pid 2. Before: fdinfo prints SqThread = <host pid>. After: SqThread = 2. Use task_pid_nr_ns() against the proc inode's pid_ns to compute sq_pid, instead of reading the stored sq->task_pid (which holds the init_pid_ns view). pidfd_show_fdinfo() in kernel/pid.c follows the same pattern. Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260510084119.457578-1-maoyi.xie@ntu.edu.sg Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-08Merge tag 'io_uring-7.1-20260508' of ↵Linus Torvalds5-24/+53
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Ensure that the absolute timeouts for both the command side and the waiting side honor the callers time namespace - Ensure tracked NAPI entries are cleared at unregistration time, as the NAPI polling loop checks the list state rather than the general NAPI state. This can lead to NAPI polling even after unregistration has been done. If unregistered, all NAPI polling should be disabled - Fix for eventfd recursive invocation handling * tag 'io_uring-7.1-20260508' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER io_uring/timeout: honour caller's time namespace for IORING_TIMEOUT_ABS io_uring/eventfd: reset deferred signal state io_uring/napi: clear tracked NAPI entries on unregister
2026-05-06io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMERMaoyi Xie1-1/+5
io_uring_enter() with IORING_ENTER_ABS_TIMER takes an absolute timespec from the caller via ext_arg->ts. It arms an ABS mode hrtimer in __io_cqring_wait_schedule(). The conversion path in io_uring/wait.c parses ext_arg->ts inline rather than going through io_parse_user_time(). It therefore does not pick up the time namespace conversion added by the previous patch. Apply timens_ktime_to_host() to the parsed time on the IORING_ENTER_ABS_TIMER branch. This mirrors the IORING_TIMEOUT_ABS fix in io_parse_user_time(). Use ctx->clockid as the clock id. ctx->clockid is set either at ring creation or via IORING_REGISTER_CLOCK. timens_ktime_to_host() is a no-op for clocks not affected by time namespaces. It is also a no-op for callers in the initial time namespace. The fast path is unchanged. Reproducer: in unshare --user --time, with a -10s monotonic offset, call io_uring_enter with min_complete=1, IORING_ENTER_ABS_TIMER, and ts = now + 1s. The call returns -ETIME after <1ms instead of after the expected ~1s. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260504153755.1293932-3-maoyi.xie@ntu.edu.sg Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-06io_uring/timeout: honour caller's time namespace for IORING_TIMEOUT_ABSMaoyi Xie1-13/+22
io_uring's IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT accept a timespec from the caller via io_parse_user_time(). With IORING_TIMEOUT_ABS, the timestamp is an absolute deadline on the selected clock. The clock is CLOCK_MONOTONIC by default. CLOCK_BOOTTIME and CLOCK_REALTIME are also selectable. A submitter inside a CLONE_NEWTIME time namespace observes CLOCK_MONOTONIC and CLOCK_BOOTTIME shifted by the namespace's offsets relative to the host. Every other ABS timer interface in the kernel converts the caller's absolute time to host view via timens_ktime_to_host() before arming an hrtimer: kernel/time/posix-timers.c -- timer_settime(TIMER_ABSTIME) kernel/time/posix-stubs.c -- clock_nanosleep(TIMER_ABSTIME) kernel/time/alarmtimer.c -- alarm_timer_nsleep(TIMER_ABSTIME) fs/timerfd.c -- timerfd_settime(TFD_TIMER_ABSTIME) io_parse_user_time() does not. As a result, an absolute timeout submitted from within a time namespace is interpreted in host view. That is generally a different point in time. It may already be in the past, causing the timer to fire immediately, or far in the future, causing the timer not to fire when expected. Reproducer: in unshare --user --time, with a -10s monotonic offset, submit IORING_OP_TIMEOUT with IORING_TIMEOUT_ABS and deadline = now + 1s. The CQE is delivered after <1ms instead of the expected ~1s. Apply timens_ktime_to_host() to the parsed time when IORING_TIMEOUT_ABS is set. Split the existing clock id resolver in io_timeout_get_clock() into a flags only helper io_flags_to_clock(), so io_parse_user_time() can resolve the clock without a struct io_timeout_data. timens_ktime_to_host() is a no-op for clocks not affected by time namespaces, e.g. CLOCK_REALTIME. It is also a no-op for callers in the initial time namespace. The fast path is unchanged. SQPOLL is also covered. The SQPOLL kernel thread is created via create_io_thread() with CLONE_THREAD and no CLONE_NEW* flag. 
copy_namespaces() therefore shares the submitter's nsproxy by reference. Inside the SQPOLL kthread, current->nsproxy->time_ns is the submitter's time_ns. timens_ktime_to_host() resolves correctly. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260504153755.1293932-2-maoyi.xie@ntu.edu.sg Signed-off-by: Jens Axboe <axboe@kernel.dk>
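The reproducer's arithmetic can be worked through in a few lines. This sketch assumes the conversion is a plain subtraction of the namespace offset (namespace time = host time + offset); timens_to_host_model() is a hypothetical stand-in and omits any clamping the kernel's timens_ktime_to_host() may perform.

```c
#include <stdint.h>

typedef int64_t ns_t;	/* nanoseconds */

#define DEMO_NSEC_PER_SEC 1000000000LL

/* Model: convert an absolute namespace-view time to host view. */
static ns_t timens_to_host_model(ns_t ns_time, ns_t ns_offset)
{
	return ns_time - ns_offset;
}

/* Deadline "now + 1s" as seen by a submitter inside the namespace. */
static ns_t ns_deadline_in_1s(ns_t host_now, ns_t ns_offset)
{
	ns_t ns_now = host_now + ns_offset;

	return ns_now + DEMO_NSEC_PER_SEC;
}
```

With a -10s offset, the namespace deadline lies 9s before the host's current time, so an unconverted timer expires immediately; after conversion it lands 1s in the host's future.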
2026-05-04io_uring/eventfd: reset deferred signal stateYufan Chen1-0/+1
Recursive eventfd wakeups must defer io_uring eventfd signaling because eventfd_signal_mask() rejects reentry from eventfd wakeup handlers. The io_ev_fd ops bit tracks an outstanding deferred signal so that the same rcu_head is not queued twice. That bit is only set today. Once the first deferred callback runs, later recursive notifications still see the bit set and skip queueing another deferred signal. This can leave new completions without a matching eventfd wake after the first recursive deferral. Clear the pending bit before issuing the deferred signal. If the wakeup path recurses while the callback runs, a new signal can be queued for the next RCU grace period while the current callback keeps its reference until it returns. Signed-off-by: Yufan Chen <ericterminal@gmail.com> Fixes: 60b6c075e8eb ("io_uring/eventfd: move to more idiomatic RCU free usage") Link: https://patch.msgid.link/20260503175710.37209-1-yufan.chen@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-04io_uring/napi: clear tracked NAPI entries on unregisterYufan Chen2-10/+25
IORING_UNREGISTER_NAPI disables NAPI busy polling, but it currently leaves any previously tracked NAPI IDs on the ring context. The normal wait path only checks whether the list is empty before entering the busy poll helper, so an unregistered ring can still observe stale entries and run an unexpected busy poll pass. Make unregister switch the context to inactive and free the tracked entries. Do the same inactive transition while changing the tracking strategy, and recheck the expected tracking mode under napi_lock before inserting a newly learned NAPI ID. This prevents a racing poll path from repopulating the list after unregister or reconfiguration. Also make the busy poll dispatcher ignore inactive mode explicitly. Signed-off-by: Yufan Chen <ericterminal@gmail.com> Fixes: 6bf90bd8c58a ("io_uring/napi: add static napi tracking strategy") Link: https://patch.msgid.link/20260503175610.35521-1-yufan.chen@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-01Merge tag 'io_uring-7.1-20260430' of ↵Linus Torvalds4-4/+27
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Remove dead struct io_buffer_list member - Fix for incrementally consumed buffers with recvmsg multishot, which requires a minimum value left in a buffer for any receive for the headers. If there's still a bit of buffer left but it's smaller than that value, then userspace will see a spurious -EFAULT returned in the CQE - Locking fix for the DEFER_TASKRUN retry list, which otherwise could race with fallback cancelations. If the task is exiting with task_work left in both the normal and retry list AND the exit cleanup races with the task running task work, then entries could either be doubly completed or lost - Cap NAPI busy poll timeout to something sane, to avoid syzbot running into excessive polling and triggering warnings around that * tag 'io_uring-7.1-20260430' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/tw: serialize ctx->retry_llist with ->uring_lock io_uring/napi: cap busy_poll_to 10 msec io_uring/kbuf: support min length left for incremental buffers io_uring/kbuf: kill dead struct io_buffer_list 'nr_entries' member
2026-04-30io_uring/tw: serialize ctx->retry_llist with ->uring_lockJens Axboe1-1/+11
The DEFER_TASKRUN local task work paths all run under ctx->uring_lock, which serializes them with each other and with the rest of the ring's hot paths. io_move_task_work_from_local() is the exception - it's called from io_ring_exit_work() on a kworker without holding the lock and from the iopoll cancelation side right after dropping it. ->work_llist is fine with this, as it's only ever updated via the expected paths. But the ->retry_llist is updated while running, and hence it could potentially race between normal task_work running and the task-has-exited shutdown path. Simply grab ->uring_lock while moving the local work to the fallback list for exit purposes, which nicely serializes it across both the normal additions and the exit prune path. Cc: stable@vger.kernel.org Fixes: f46b9cdb22f7 ("io_uring: limit local tw done") Reported-by: Robert Femmer <robert.femmer@x41-dsec.de> Reported-by: Christian Reitter <invd@inhq.net> Reported-by: Michael Rodler <michael.rodler@x41-dsec.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-30net: add net_iov_init() and use it to initialize ->page_typeJakub Kicinski1-2/+1
Commit db359fccf212 ("mm: introduce a new page type for page pool in page type") added a page_type field to struct net_iov at the same offset as struct page::page_type, so that page_pool_set_pp_info() can call __SetPageNetpp() uniformly on both pages and net_iovs. The page-type API requires the field to hold the UINT_MAX "no type" sentinel before a type can be set; for real struct page that invariant is established by the page allocator on free. struct net_iov is not allocated through the page allocator, so the field is left as zero (io_uring zcrx, which uses __GFP_ZERO) or as slab garbage (devmem, which uses kvmalloc_objs() without zeroing). When the page pool then calls page_pool_set_pp_info() on a freshly-bound niov, __SetPageNetpp()'s VM_BUG_ON_PAGE(page->page_type != UINT_MAX) fires and the kernel BUGs. Triggered in selftests by io_uring zcrx setup through the fbnic queue restart path: kernel BUG at ./include/linux/page-flags.h:1062! RIP: 0010:page_pool_set_pp_info (./include/linux/page-flags.h:1062 net/core/page_pool.c:716) Call Trace: <TASK> net_mp_niov_set_page_pool (net/core/page_pool.c:1360) io_pp_zc_alloc_netmems (io_uring/zcrx.c:1089 io_uring/zcrx.c:1110) fbnic_fill_bdq (./include/net/page_pool/helpers.h:160 drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:906) __fbnic_nv_restart (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2470 drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2874) fbnic_queue_start (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2903) netdev_rx_queue_reconfig (net/core/netdev_rx_queue.c:137) __netif_mp_open_rxq (net/core/netdev_rx_queue.c:234) io_register_zcrx (io_uring/zcrx.c:818 io_uring/zcrx.c:903) __io_uring_register (io_uring/register.c:931) __do_sys_io_uring_register (io_uring/register.c:1029) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) </TASK> The same path is reachable through devmem dmabuf binding via netdev_nl_bind_rx_doit() -> net_devmem_bind_dmabuf_to_queue(). 
Add a net_iov_init() helper that stamps ->owner, ->type and the ->page_type sentinel, and use it from both the devmem and io_uring zcrx niov init loops. Fixes: db359fccf212 ("mm: introduce a new page type for page pool in page type") Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/20260428025320.853452-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-29io_uring/napi: cap busy_poll_to 10 msecJens Axboe1-0/+2
Currently there's no cap on the maximum amount of time that napi is allowed to poll if no events are found, which can lead to kernel complaints on a task being stuck as there's no conditional rescheduling done within that loop. Just cap it to 10 msec in total, that's already way above any kind of sane value that will reap any benefits, yet low enough that it's nowhere near being able to trigger preemption complaints. Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
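The cap amounts to a simple clamp on the requested timeout. The sketch below assumes the value is expressed in microseconds and that oversized requests are clamped rather than rejected; both points, along with the names, are assumptions not confirmed by the commit message.

```c
/* 10 msec expressed in microseconds (unit is an assumption of this sketch). */
#define DEMO_BUSY_POLL_CAP_USEC (10U * 1000U)

/* Clamp a requested busy poll timeout to the cap. */
static unsigned int cap_busy_poll_to(unsigned int requested_usec)
{
	return requested_usec < DEMO_BUSY_POLL_CAP_USEC ?
	       requested_usec : DEMO_BUSY_POLL_CAP_USEC;
}
```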
2026-04-29io_uring/kbuf: support min length left for incremental buffersMartin Michaelis2-1/+14
Incrementally consumed buffer rings are generally fully consumed, but it's quite possible that the application has a minimum size it needs to meet to avoid truncation. Currently that minimum limit is 1 byte, but this should be a setting that is in the hands of the application. For recvmsg multishot, a prime use case for incrementally consumed buffers, the application may get spurious -EFAULT returned at the end of an incrementally consumed buffer, as less space is available than the headers need. Grab a u32 field in struct io_uring_buf_reg, which the application can use to inform the kernel of the minimum size that should be available in an incrementally consumed buffer. If less than that is available, the current buffer is fully processed and the next one will be picked. Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1433 Signed-off-by: Martin Michaelis <code@mgjm.de> [axboe: write commit message, change io_buffer_list member name] Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
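The "minimum left" rule from this commit can be illustrated with a small userspace model. Every name here (`struct demo_inc_buf`, `demo_inc_commit()`, `min_left`) is hypothetical, and treating a value of 0 as the historical 1-byte minimum is an assumption made for the sketch, not something the log confirms.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one incrementally consumed buffer. */
struct demo_inc_buf {
	uint32_t len; /* bytes still available in this buffer */
};

/*
 * Account for 'used' bytes and report whether the buffer should be
 * retired so that the next buffer in the ring gets picked.
 * Assumption: min_left == 0 means the old behaviour (1 byte minimum).
 */
bool demo_inc_commit(struct demo_inc_buf *buf, uint32_t used, uint32_t min_left)
{
	if (used > buf->len)
		used = buf->len;
	buf->len -= used;

	if (min_left == 0)
		min_left = 1;

	/* Less than the requested minimum left: treat as fully consumed. */
	return buf->len < min_left;
}
```

With a 4096-byte buffer and 4000 bytes used, a minimum of 128 retires the buffer (96 bytes remain), while the default keeps it current. That leftover tail is the kind of space that may be too small for recvmsg multishot headers.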
2026-04-29io_uring/kbuf: kill dead struct io_buffer_list 'nr_entries' memberJens Axboe2-2/+0
This is only ever assigned, never used. The only used part is the calculated mask, which is used for indexing. Kill 'nr_entries'. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-22io_uring: take page references for NOMMU pbuf_ring mmapsGreg Kroah-Hartman1-1/+45
Under !CONFIG_MMU, io_uring_get_unmapped_area() returns the kernel virtual address of the io_mapped_region's backing pages directly; the user's VMA aliases the kernel allocation. io_uring_mmap() then just returns 0 -- it takes no page references. The CONFIG_MMU path uses vm_insert_pages(), which takes a reference on each inserted page. Those references are released when the VMA is torn down (zap_pte_range -> put_page). io_free_region() -> release_pages() drops the io_uring-side references, but the pages survive until munmap drops the VMA-side references. Under NOMMU there are no VMA-side references. io_unregister_pbuf_ring -> io_put_bl -> io_free_region -> release_pages drops the only references and the pages return to the buddy allocator while the user's VMA still has vm_start pointing into them. The user can then write into whatever the allocator hands out next. Mirror the MMU lifetime: take get_page references in io_uring_mmap() and release them via vm_ops->close. NOMMU's delete_vma() calls vma_close() which runs ->close on munmap. This also incidentally addresses the duplicate-vm_start case: two mmaps of SQ_RING and CQ_RING resolve to the same ctx->ring_region pointer. With page refs taken per mmap, the second mmap takes its own refs and the pages survive until both mmaps are closed. The nommu rb-tree BUG_ON on duplicate vm_start is a separate mm/nommu.c concern (it should share the existing region rather than BUG), but the page lifetime is now correct. Cc: Jens Axboe <axboe@kernel.dk> Reported-by: Anthropic Assisted-by: gkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/2026042115-body-attention-d15b@gregkh [axboe: get rid of region lookup, just iterate pages in vma] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-22io_uring/poll: ensure EPOLL_ONESHOT is propagated for EPOLL_URING_WAKEJens Axboe1-1/+3
Commit: aacf2f9f382c ("io_uring: fix req->apoll_events") fixed an issue where poll->events and req->apoll_events weren't synchronized, but then when the commit referenced in Fixes got added, it didn't ensure the same thing. If we mask in EPOLLONESHOT in the regular EPOLL_URING_WAKE path, then ensure it's done for both. Including a link to the original report below, even though it's mostly nonsense. But it includes a reproducer that does show that IORING_CQE_F_MORE is set in the previous CQE, while no more CQEs will be generated for this request. Just ignore anything that pretends this is security related in any way, it's just the typical AI nonsense. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/CAM0zi7yQzF3eKncgHo4iVM5yFLAjsiob_ucqyWKs=hyd_GqiMg@mail.gmail.com/ Reported-by: Azizcan Daştan <azizcan.d@mileniumsec.com> Fixes: 4464853277d0 ("io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeups") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21io_uring/zcrx: warn on freelist violationsPavel Begunkov1-0/+2
The freelist is appropriately sized to always be able to take a free niov, but let's be more defensive and check the invariant with a warning. That should help to catch any double-free issues. Suggested-by: Kai Aizen <kai@snailsploit.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/2f3cea363b04649755e3b6bb9ab66485a95936d5.1776760901.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21io_uring/zcrx: clear RQ headers on initPavel Begunkov1-0/+1
It might be unexpected to users if the RQ head/tail after a ring creation are not zeroed, fix that. Cc: stable@vger.kernel.org Fixes: 6f377873cb239 ("io_uring/zcrx: add interface queue and refill queue") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/331f94663c3e8f021ffa3cb770ca2844a07d4855.1776760911.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21io_uring/zcrx: fix user_struct uafPavel Begunkov1-1/+1
io_free_rbuf_ring() uses a struct user_struct, but io_zcrx_ifq_free() drops its reference before destroying the ring. Cc: stable@vger.kernel.org Fixes: 5c686456a4e83 ("io_uring/zcrx: add user_struct and mm_struct to io_zcrx_ifq") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/e560ae00960d27a810522a7efc0e201c82dff351.1776760917.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21io_uring/register: fix ring resizing with mixed/large SQEs/CQEsJens Axboe1-8/+28
The ring resizing only properly handles "normal" sized SQEs or CQEs, if there are pending entries around a resize. This normally should not be the case, but the code is supposed to handle this regardless. For the mixed SQE/CQE cases, the current copying works fine as they are indexed in the same way. Each half is just copied separately. But for fixed large SQEs and CQEs, the iteration and copy need to take that into account. Cc: stable@kernel.org Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21io_uring/futex: ensure partial wakes are appropriately dequeuedJens Axboe1-1/+3
If a FUTEX_WAITV vectored operation is only partially woken, we should call __futex_wake_mark() on the queue to account for that. If not, then a later wakeup will wake the same entry, rather than the next one in line. Fixes: 8f350194d5cfd ("io_uring: add support for vectored futex waits") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21io_uring/rw: add defensive hardening for negative kbuf lengthsJens Axboe1-2/+2
No real bug here, just being a bit defensive in ensuring that whatever gets passed into io_put_kbuf() is always >= 0 and not some random error value. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21io_uring/rsrc: use kvfree() for the imu cacheJens Axboe2-2/+2
Currently anything that requires kvmalloc_flex() for allocations will not get re-cached, and hence the cache freeing path is correct in that it always uses kfree() to free the allocated memory. But this seems a bit fragile as it's something that could get mixed up should that situation change, so switch io_free_imu() and io_alloc_cache_free() to use kvfree() as the destructor. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21io_uring/rsrc: unify nospec indexing for direct descriptorsJens Axboe2-2/+10
For file updates, the node reset isn't capping the value via array_index_nospec() like the other paths do. Ensure it's all sane and have the update path do the proper capping as well. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-21io_uring: fix spurious fput in registered ring pathJens Axboe1-1/+2
Fix an issue with io_uring_ctx_get_file() not gating fput() on whether or not the file descriptor is a registered/direct one or not. Fixes: c5e9f6a96bf7 ("io_uring: unify getting ctx from passed in file descriptor") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-20io_uring: fix iowq_limits data race in tctx node additionJens Axboe1-3/+7
__io_uring_add_tctx_node() reads ctx->int_flags and ctx->iowq_limits[0..1] without holding ctx->uring_lock, while io_register_iowq_max_workers() writes these same fields under the lock. Mostly an application problem if you try and make these race, but let's silence KCSAN by just grabbing the ->uring_lock around the operation. This is a slow path operation anyway, and ->uring_lock will be grabbed by submission right after anyway. Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-20io_uring/tctx: mark io_wq as exiting before error path teardownJens Axboe1-1/+3
syzbot reports that it's hitting the below condition for exiting an io_wq context: WARN_ON_ONCE(!test_bit(IO_WQ_BIT_EXIT, &wq->state)) in io_wq_put_and_exit(), which can be triggered with memory allocation fault injection. Ensure that the io_wq is marked as exiting to silence this warning trigger. Reported-by: syzbot+79a4cc863a8db58cd92b@syzkaller.appspotmail.com Fixes: 7880174e1e5e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling") Reviewed-by: Clément Léger <cleger@meta.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-20io_uring/tctx: check for setup tctx->io_wq before teardownJens Axboe1-1/+2
As with the idling code before it, the error exit path should check for a NULL tctx->io_wq before calling io_wq_put_and_exit(). Fixes: 7880174e1e5e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling") Reported-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Clément Léger <cleger@meta.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-15io_uring/poll: fix signed comparison in io_poll_get_ownership()Longxuan Yu1-1/+1
io_poll_get_ownership() uses a signed comparison to check whether poll_refs has reached the threshold for the slowpath: if (unlikely(atomic_read(&req->poll_refs) >= IO_POLL_REF_BIAS)) atomic_read() returns int (signed). When IO_POLL_CANCEL_FLAG (BIT(31)) is set in poll_refs, the value becomes negative in signed arithmetic, so the >= 128 comparison always evaluates to false and the slowpath is never taken. Fix this by casting the atomic_read() result to unsigned int before the comparison, so that the cancel flag is treated as a large positive value and correctly triggers the slowpath. Fixes: a26a35e9019f ("io_uring: make poll refs more robust") Cc: stable@vger.kernel.org Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Longxuan Yu <ylong030@ucr.edu> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3a3508b08bcd7f1bc3beff848ae6e1d73d355043.1775965597.git.ylong030@ucr.edu Signed-off-by: Jens Axboe <axboe@kernel.dk>
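The signedness issue can be reproduced outside the kernel. The sketch below uses hypothetical names (`demo_slowpath_signed()`, `demo_slowpath_unsigned()`, `DEMO_POLL_REF_BIAS`) and a plain int in place of the atomic counter. The threshold of 128 and the bit-31 flag come from the commit message.

```c
#include <limits.h>
#include <stdbool.h>

/* Threshold from the commit message; the name is made up for the sketch. */
#define DEMO_POLL_REF_BIAS 128

/* Buggy form: a value with bit 31 set is negative and never reaches 128. */
bool demo_slowpath_signed(int refs)
{
	return refs >= DEMO_POLL_REF_BIAS;
}

/* Fixed form: the cast makes bit 31 count as a large positive value. */
bool demo_slowpath_unsigned(int refs)
{
	return (unsigned int)refs >= DEMO_POLL_REF_BIAS;
}
```

Assuming a 32-bit two's complement int, `INT_MIN | 1` models a counter with the cancel flag (bit 31) and one reference. The signed check returns false for it, while the unsigned check returns true. Both variants agree for small non-negative values.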
2026-04-15Merge tag 'net-next-7.1' of ↵Linus Torvalds1-4/+7
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Support HW queue leasing, allowing containers to be granted access to HW queues for zero-copy operations and AF_XDP - Number of code moves to help the compiler with inlining. Avoid output arguments for returning drop reason where possible - Rework drop handling within qdiscs to include more metadata about the reason and dropping qdisc in the tracepoints - Remove the rtnl_lock use from IP Multicast Routing - Pack size information into the Rx Flow Steering table pointer itself. This allows making the table itself a flat array of u32s, thus making the table allocation size a power of two - Report TCP delayed ack timer information via socket diag - Add ip_local_port_step_width sysctl to allow distributing the randomly selected ports more evenly throughout the allowed space - Add support for per-route tunsrc in IPv6 segment routing - Start work of switching sockopt handling to iov_iter - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid buffer size drifting up - Support MSG_EOR in MPTCP - Add stp_mode attribute to the bridge driver for STP mode selection. This addresses concerns about call_usermodehelper() usage - Remove UDP-Lite support (as announced in 2023) - Remove support for building IPv6 as a module. Remove the now unnecessary function calling indirection Cross-tree stuff: - Move Michael MIC code from generic crypto into wireless, it's considered insecure but some WiFi networks still need it Netfilter: - Switch nft_fib_ipv6 module to no longer need temporary dst_entry object allocations by using fib6_lookup() + RCU. Florian W reports this gets us ~13% higher packet rate - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and switch the service tables to be per-net. 
Convert some code that walks the service lists to use RCU instead of the service_mutex - Add more opinionated input validation to lower security exposure - Make IPVS hash tables to be per-netns and resizable Wireless: - Finished assoc frame encryption/EPPKE/802.1X-over-auth - Radar detection improvements - Add 6 GHz incumbent signal detection APIs - Multi-link support for FILS, probe response templates and client probing - New APIs and mac80211 support for NAN (Neighbor Aware Networking, aka Wi-Fi Aware) so less work must be in firmware Driver API: - Add numerical ID for devlink instances (to avoid having to create fake bus/device pairs just to have an ID). Support shared devlink instances which span multiple PFs - Add standard counters for reporting pause storm events (implement in mlx5 and fbnic) - Add configuration API for completion writeback buffering (implement in mana) - Support driver-initiated change of RSS context sizes - Support DPLL monitoring input frequency (implement in zl3073x) - Support per-port resources in devlink (implement in mlx5) Misc: - Expand the YAML spec for Netfilter Drivers - Software: - macvlan: support multicast rx for bridge ports with shared source MAC address - team: decouple receive and transmit enablement for IEEE 802.3ad LACP "independent control" - Ethernet high-speed NICs: - nVidia/Mellanox: - support high order pages in zero-copy mode (for payload coalescing) - support multiple packets in a page (for systems with 64kB pages) - Broadcom 25-400GE (bnxt): - implement XDP RSS hash metadata extraction - add software fallback for UDP GSO, lowering the IOMMU cost - Broadcom 800GE (bnge): - add link status and configuration handling - add various HW and SW statistics - Marvell/Cavium: - NPC HW block support for cn20k - Huawei (hinic3): - add mailbox / control queue - add rx VLAN offload - add driver info and link management - Ethernet NICs: - Marvell/Aquantia: - support reading SFP module info on some AQC100 cards - Realtek PCI 
(r8169): - add support for RTL8125cp - Realtek USB (r8152): - support for the RTL8157 5Gbit chip - add 2500baseT EEE status/configuration support - Ethernet NICs embedded and off-the-shelf IP: - Synopsys (stmmac): - cleanup and reorganize SerDes handling and PCS support - cleanup descriptor handling and per-platform data - cleanup and consolidate MDIO defines and handling - shrink driver memory use for internal structures - improve Tx IRQ coalescing - improve TCP segmentation handling - add support for Spacemit K3 - Cadence (macb): - support PHYs that have inband autoneg disabled with GEM - support IEEE 802.3az EEE - rework usrio capabilities and handling - AMD (xgbe): - improve power management for S0i3 - improve TX resilience for link-down handling - Virtual: - Google cloud vNIC: - support larger ring sizes in DQO-QPL mode - improve HW-GRO handling - support UDP GSO for DQO format - PCIe NTB: - support queue count configuration - Ethernet PHYs: - automatically disable PHY autonomous EEE if MAC is in charge - Broadcom: - add BCM84891/BCM84892 support - Micrel: - support for LAN9645X internal PHY - Realtek: - add RTL8224 pair order support - support PHY LEDs on RTL8211F-VD - support spread spectrum clocking (SSC) - Maxlinear: - add PHY-level statistics via ethtool - Ethernet switches: - Maxlinear (mxl862xx): - support for bridge offloading - support for VLANs - support driver statistics - Bluetooth: - large number of fixes and new device IDs - Mediatek: - support MT6639 (MT7927) - support MT7902 SDIO - WiFi: - Intel (iwlwifi): - UNII-9 and continuing UHR work - MediaTek (mt76): - mt7996/mt7925 MLO fixes/improvements - mt7996 NPU support (HW eth/wifi traffic offload) - Qualcomm (ath12k): - monitor mode support on IPQ5332 - basic hwmon temperature reporting - support IPQ5424 - Realtek: - add USB RX aggregation to improve performance - add USB TX flow control by tracking in-flight URBs - Cellular: - IPA v5.2 support" * tag 'net-next-7.1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits) net: pse-pd: fix kernel-doc function name for pse_control_find_by_id() wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit wireguard: allowedips: remove redundant space tools: ynl: add sample for wireguard wireguard: allowedips: Use kfree_rcu() instead of call_rcu() MAINTAINERS: Add netkit selftest files selftests/net: Add additional test coverage in nk_qlease selftests/net: Split netdevsim tests from HW tests in nk_qlease tools/ynl: Make YnlFamily closeable as a context manager net: airoha: Add missing PPE configurations in airoha_ppe_hw_init() net: airoha: Fix VIP configuration for AN7583 SoC net: caif: clear client service pointer on teardown net: strparser: fix skb_head leak in strp_abort_strp() net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete() selftests/bpf: add test for xdp_master_redirect with bond not up net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration sctp: disable BH before calling udp_tunnel_xmit_skb() sctp: fix missing encap_port propagation for GSO fragments net: airoha: Rely on net_device pointer in ETS callbacks ...
2026-04-14Merge tag 'for-7.1/io_uring-20260411' of ↵Linus Torvalds33-506/+1030
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Add a callback driven main loop for io_uring, and BPF struct_ops on top to allow implementing custom event loop logic - Decouple IOPOLL from being a ring-wide all-or-nothing setting, allowing IOPOLL use cases to also issue certain white listed non-polled opcodes - Timeout improvements. Migrate internal timeout storage from timespec64 to ktime_t for simpler arithmetic and avoid copying of timespec data - Zero-copy receive (zcrx) updates: - Add a device-less mode (ZCRX_REG_NODEV) for testing and experimentation where data flows through the copy fallback path - Fix two-step unregistration regression, DMA length calculations, xarray mark usage, and a potential 32-bit overflow in id shifting - Refactoring toward multi-area support: dedicated refill queue struct, consolidated DMA syncing, netmem array refilling format, and guard-based locking - Zero-copy transmit (zctx) cleanup: - Unify io_send_zc() and io_sendmsg_zc() into a single function - Add vectorized registered buffer send for IORING_OP_SEND_ZC - Add separate notification user_data via sqe->addr3 so notification and completion CQEs can be distinguished without extra reference counting - Switch struct io_ring_ctx internal bitfields to explicit flag bits with atomic-safe accessors, and annotate the known harmless races on those flags - Various optimizations caching ctx and other request fields in local variables to avoid repeated loads, and cleanups for tctx setup, ring fd registration, and read path early returns * tag 'for-7.1/io_uring-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (58 commits) io_uring: unify getting ctx from passed in file descriptor io_uring/register: don't get a reference to the registered ring fd io_uring/tctx: clean up __io_uring_add_tctx_node() error handling io_uring/tctx: have io_uring_alloc_task_context() return tctx io_uring/timeout: use 'ctx' consistently 
io_uring/rw: clean up __io_read() obsolete comment and early returns io_uring/zcrx: use correct mmap off constants io_uring/zcrx: use dma_len for chunk size calculation io_uring/zcrx: don't clear not allocated niovs io_uring/zcrx: don't use mark0 for allocating xarray io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring() io_uring/zcrx: reject REG_NODEV with large rx_buf_size io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP io_uring/rsrc: use io_cache_free() to free node io_uring/zcrx: rename zcrx [un]register functions io_uring/zcrx: check ctrl op payload struct sizes io_uring/zcrx: cache fallback availability in zcrx ctx io_uring/zcrx: warn on a repeated area append io_uring/zcrx: consolidate dma syncing io_uring/zcrx: netmem array as refiling format ...
2026-04-14Merge tag 'vfs-7.1-rc1.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - coredump: add tracepoint for coredump events - fs: hide file and bfile caches behind runtime const machinery Fixes: - fix architecture-specific compat_ftruncate64 implementations - dcache: Limit the minimal number of bucket to two - fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START - fs/mbcache: cancel shrink work before destroying the cache - dcache: permit dynamic_dname()s up to NAME_MAX Cleanups: - remove or unexport unused fs_context infrastructure - trivial ->setattr cleanups - selftests/filesystems: Assume that TIOCGPTPEER is defined - writeback: fix kernel-doc function name mismatch for wb_put_many() - autofs: replace manual symlink buffer allocation in autofs_dir_symlink - init/initramfs.c: trivial fix: FSM -> Finite-state machine - fs: remove stale and duplicate forward declarations - readdir: Introduce dirent_size() - fs: Replace user_access_{begin/end} by scoped user access - kernel: acct: fix duplicate word in comment - fs: write a better comment in step_into() concerning .mnt assignment - fs: attr: fix comment formatting and spelling issues" * tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) dcache: permit dynamic_dname()s up to NAME_MAX fs: attr: fix comment formatting and spelling issues fs: hide file and bfile caches behind runtime const machinery fs: write a better comment in step_into() concerning .mnt assignment proc: rename proc_notify_change to proc_setattr proc: rename proc_setattr to proc_nochmod_setattr affs: rename affs_notify_change to affs_setattr adfs: rename adfs_notify_change to adfs_setattr hfs: update comments on hfs_inode_setattr kernel: acct: fix duplicate word in comment fs: Replace user_access_{begin/end} by scoped user access readdir: Introduce dirent_size() coredump: add tracepoint for coredump events fs: remove do_sys_truncate fs: pass on FTRUNCATE_* flags to 
do_truncate fs: fix archiecture-specific compat_ftruncate64 fs: remove stale and duplicate forward declarations init/initramfs.c: trivial fix: FSM -> Finite-state machine autofs: replace manual symlink buffer allocation in autofs_dir_symlink fs/mbcache: cancel shrink work before destroying the cache ...