kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
5 days	io_uring: take page references for NOMMU pbuf_ring mmaps	Greg Kroah-Hartman	1	-1/+45
	Under !CONFIG_MMU, io_uring_get_unmapped_area() returns the kernel virtual address of the io_mapped_region's backing pages directly; the user's VMA aliases the kernel allocation. io_uring_mmap() then just returns 0 -- it takes no page references. The CONFIG_MMU path uses vm_insert_pages(), which takes a reference on each inserted page. Those references are released when the VMA is torn down (zap_pte_range -> put_page). io_free_region() -> release_pages() drops the io_uring-side references, but the pages survive until munmap drops the VMA-side references. Under NOMMU there are no VMA-side references. io_unregister_pbuf_ring -> io_put_bl -> io_free_region -> release_pages drops the only references and the pages return to the buddy allocator while the user's VMA still has vm_start pointing into them. The user can then write into whatever the allocator hands out next. Mirror the MMU lifetime: take get_page references in io_uring_mmap() and release them via vm_ops->close. NOMMU's delete_vma() calls vma_close() which runs ->close on munmap. This also incidentally addresses the duplicate-vm_start case: two mmaps of SQ_RING and CQ_RING resolve to the same ctx->ring_region pointer. With page refs taken per mmap, the second mmap takes its own refs and the pages survive until both mmaps are closed. The nommu rb-tree BUG_ON on duplicate vm_start is a separate mm/nommu.c concern (it should share the existing region rather than BUG), but the page lifetime is now correct. Cc: Jens Axboe <axboe@kernel.dk> Reported-by: Anthropic Assisted-by: gkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/2026042115-body-attention-d15b@gregkh [axboe: get rid of region lookup, just iterate pages in vma] Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days	io_uring/poll: ensure EPOLL_ONESHOT is propagated for EPOLL_URING_WAKE	Jens Axboe	1	-1/+3
	Commit: aacf2f9f382c ("io_uring: fix req->apoll_events") fixed an issue where poll->events and req->apoll_events weren't synchronized, but then when the commit referenced in Fixes got added, it didn't ensure the same thing. If we mask in EPOLLONESHOT in the regular EPOLL_URING_WAKE path, then ensure it's done for both. Including a link to the original report below, even though it's mostly nonsense. But it includes a reproducer that does show that IORING_CQE_F_MORE is set in the previous CQE, while no more CQEs will be generated for this request. Just ignore anything that pretends this is security related in any way, it's just the typical AI nonsense. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/CAM0zi7yQzF3eKncgHo4iVM5yFLAjsiob_ucqyWKs=hyd_GqiMg@mail.gmail.com/ Reported-by: Azizcan Daştan <azizcan.d@mileniumsec.com> Fixes: 4464853277d0 ("io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeups") Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days	io_uring/zcrx: warn on freelist violations	Pavel Begunkov	1	-0/+2
	The freelist is appropriately sized to always be able to take a free niov, but let's be more defensive and check the invariant with a warning. That should help to catch any double-free issues. Suggested-by: Kai Aizen <kai@snailsploit.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/2f3cea363b04649755e3b6bb9ab66485a95936d5.1776760901.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days	io_uring/zcrx: clear RQ headers on init	Pavel Begunkov	1	-0/+1
	It might be unexpected to users if the RQ head/tail after a ring creation are not zeroed, fix that. Cc: stable@vger.kernel.org Fixes: 6f377873cb239 ("io_uring/zcrx: add interface queue and refill queue") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/331f94663c3e8f021ffa3cb770ca2844a07d4855.1776760911.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days	io_uring/zcrx: fix user_struct uaf	Pavel Begunkov	1	-1/+1
	io_free_rbuf_ring() usees a struct user_struct, which io_zcrx_ifq_free() puts it down before destroying the ring. Cc: stable@vger.kernel.org Fixes: 5c686456a4e83 ("io_uring/zcrx: add user_struct and mm_struct to io_zcrx_ifq") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/e560ae00960d27a810522a7efc0e201c82dff351.1776760917.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days	io_uring/register: fix ring resizing with mixed/large SQEs/CQEs	Jens Axboe	1	-8/+28
	The ring resizing only properly handles "normal" sized SQEs or CQEs, if there are pending entries around a resize. This normally should not be the case, but the code is supposed to handle this regardless. For the mixed SQE/CQE cases, the current copying works fine as they are indexed in the same way. Each half is just copied separately. But for fixed large SQEs and CQEs, the iteration and copy need to take that into account. Cc: stable@kernel.org Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days	io_uring/futex: ensure partial wakes are appropriately dequeued	Jens Axboe	1	-1/+3
	If a FUTEX_WAITV vectored operation is only partially woken, we should call __futex_wake_mark() on the queue to account for that. If not, then a later wakeup will wake the same entry, rather than the next one in line. Fixes: 8f350194d5cfd ("io_uring: add support for vectored futex waits") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days	io_uring/rw: add defensive hardening for negative kbuf lengths	Jens Axboe	1	-2/+2
	No real bug here, just being a bit defensive in ensuring that whatever gets passed into io_put_kbuf() is always >= 0 and not some random error value. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days	io_uring/rsrc: use kvfree() for the imu cache	Jens Axboe	2	-2/+2
	Currently anything that requires kvmalloc_flex() for allocations will not get re-cached, and hence the cache freeing path is correct in that it always uses kfree() to free the allocated memory. But this seems a bit fragile as it's something that could get mix should that situation change, so switch io_free_imu() and io_alloc_cache_free() to use kvfree as the desctructor. Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days	io_uring/rsrc: unify nospec indexing for direct descriptors	Jens Axboe	2	-2/+10
	For file updates, the node reset isn't capping the value via array_index_nospec() like the other paths do. Ensure it's all sane and have the update path do the proper capping as well. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days	io_uring: fix spurious fput in registered ring path	Jens Axboe	1	-1/+2
	Fix an issue with io_uring_ctx_get_file() not gating fput() on whether or not the file descriptor is a registered/direct one or not. Fixes: c5e9f6a96bf7 ("io_uring: unify getting ctx from passed in file descriptor") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days	io_uring: fix iowq_limits data race in tctx node addition	Jens Axboe	1	-3/+7
	__io_uring_add_tctx_node() reads ctx->int_flags and ctx->iowq_limits[0..1] without holding ctx->uring_lock, while io_register_iowq_max_workers() writes these same fields under the lock. Mostly an application problem if you try and make these race, but let's silence KCSAN by just grabbing the ->uring_lock around the operation. This is a slow path operation anyway, and ->uring_lock will be grabbed by submission right after anyway. Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days	io_uring/tctx: mark io_wq as exiting before error path teardown	Jens Axboe	1	-1/+3
	syzbot reports that it's hitting the below condition for exiting an io_wq context: WARN_ON_ONCE(!test_bit(IO_WQ_BIT_EXIT, &wq->state)) in io_wq_put_and_exit(), which can be triggered with memory allocation fault injection. Ensure that the io_wq is marked as exiting to silence this warning trigger. Reported-by: syzbot+79a4cc863a8db58cd92b@syzkaller.appspotmail.com Fixes: 7880174e1e5e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling") Reviewed-by: Clément Léger <cleger@meta.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days	io_uring/tctx: check for setup tctx->io_wq before teardown	Jens Axboe	1	-1/+2
	As with the idling code before it, the error exit path should check for a NULL tctx->io_wq before calling io_wq_put_and_exit(). Fixes: 7880174e1e5e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling") Reported-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Clément Léger <cleger@meta.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 days	io_uring/poll: fix signed comparison in io_poll_get_ownership()	Longxuan Yu	1	-1/+1
	io_poll_get_ownership() uses a signed comparison to check whether poll_refs has reached the threshold for the slowpath: if (unlikely(atomic_read(&req->poll_refs) >= IO_POLL_REF_BIAS)) atomic_read() returns int (signed). When IO_POLL_CANCEL_FLAG (BIT(31)) is set in poll_refs, the value becomes negative in signed arithmetic, so the >= 128 comparison always evaluates to false and the slowpath is never taken. Fix this by casting the atomic_read() result to unsigned int before the comparison, so that the cancel flag is treated as a large positive value and correctly triggers the slowpath. Fixes: a26a35e9019f ("io_uring: make poll refs more robust") Cc: stable@vger.kernel.org Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Longxuan Yu <ylong030@ucr.edu> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3a3508b08bcd7f1bc3beff848ae6e1d73d355043.1775965597.git.ylong030@ucr.edu Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days	Merge tag 'net-next-7.1' of ↵	Linus Torvalds	1	-4/+7
	git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Support HW queue leasing, allowing containers to be granted access to HW queues for zero-copy operations and AF_XDP - Number of code moves to help the compiler with inlining. Avoid output arguments for returning drop reason where possible - Rework drop handling within qdiscs to include more metadata about the reason and dropping qdisc in the tracepoints - Remove the rtnl_lock use from IP Multicast Routing - Pack size information into the Rx Flow Steering table pointer itself. This allows making the table itself a flat array of u32s, thus making the table allocation size a power of two - Report TCP delayed ack timer information via socket diag - Add ip_local_port_step_width sysctl to allow distributing the randomly selected ports more evenly throughout the allowed space - Add support for per-route tunsrc in IPv6 segment routing - Start work of switching sockopt handling to iov_iter - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid buffer size drifting up - Support MSG_EOR in MPTCP - Add stp_mode attribute to the bridge driver for STP mode selection. This addresses concerns about call_usermodehelper() usage - Remove UDP-Lite support (as announced in 2023) - Remove support for building IPv6 as a module. Remove the now unnecessary function calling indirection Cross-tree stuff: - Move Michael MIC code from generic crypto into wireless, it's considered insecure but some WiFi networks still need it Netfilter: - Switch nft_fib_ipv6 module to no longer need temporary dst_entry object allocations by using fib6_lookup() + RCU. Florian W reports this gets us ~13% higher packet rate - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and switch the service tables to be per-net. Convert some code that walks the service lists to use RCU instead of the service_mutex - Add more opinionated input validation to lower security exposure - Make IPVS hash tables to be per-netns and resizable Wireless: - Finished assoc frame encryption/EPPKE/802.1X-over-auth - Radar detection improvements - Add 6 GHz incumbent signal detection APIs - Multi-link support for FILS, probe response templates and client probing - New APIs and mac80211 support for NAN (Neighbor Aware Networking, aka Wi-Fi Aware) so less work must be in firmware Driver API: - Add numerical ID for devlink instances (to avoid having to create fake bus/device pairs just to have an ID). Support shared devlink instances which span multiple PFs - Add standard counters for reporting pause storm events (implement in mlx5 and fbnic) - Add configuration API for completion writeback buffering (implement in mana) - Support driver-initiated change of RSS context sizes - Support DPLL monitoring input frequency (implement in zl3073x) - Support per-port resources in devlink (implement in mlx5) Misc: - Expand the YAML spec for Netfilter Drivers - Software: - macvlan: support multicast rx for bridge ports with shared source MAC address - team: decouple receive and transmit enablement for IEEE 802.3ad LACP "independent control" - Ethernet high-speed NICs: - nVidia/Mellanox: - support high order pages in zero-copy mode (for payload coalescing) - support multiple packets in a page (for systems with 64kB pages) - Broadcom 25-400GE (bnxt): - implement XDP RSS hash metadata extraction - add software fallback for UDP GSO, lowering the IOMMU cost - Broadcom 800GE (bnge): - add link status and configuration handling - add various HW and SW statistics - Marvell/Cavium: - NPC HW block support for cn20k - Huawei (hinic3): - add mailbox / control queue - add rx VLAN offload - add driver info and link management - Ethernet NICs: - Marvell/Aquantia: - support reading SFP module info on some AQC100 cards - Realtek PCI (r8169): - add support for RTL8125cp - Realtek USB (r8152): - support for the RTL8157 5Gbit chip - add 2500baseT EEE status/configuration support - Ethernet NICs embedded and off-the-shelf IP: - Synopsys (stmmac): - cleanup and reorganize SerDes handling and PCS support - cleanup descriptor handling and per-platform data - cleanup and consolidate MDIO defines and handling - shrink driver memory use for internal structures - improve Tx IRQ coalescing - improve TCP segmentation handling - add support for Spacemit K3 - Cadence (macb): - support PHYs that have inband autoneg disabled with GEM - support IEEE 802.3az EEE - rework usrio capabilities and handling - AMD (xgbe): - improve power management for S0i3 - improve TX resilience for link-down handling - Virtual: - Google cloud vNIC: - support larger ring sizes in DQO-QPL mode - improve HW-GRO handling - support UDP GSO for DQO format - PCIe NTB: - support queue count configuration - Ethernet PHYs: - automatically disable PHY autonomous EEE if MAC is in charge - Broadcom: - add BCM84891/BCM84892 support - Micrel: - support for LAN9645X internal PHY - Realtek: - add RTL8224 pair order support - support PHY LEDs on RTL8211F-VD - support spread spectrum clocking (SSC) - Maxlinear: - add PHY-level statistics via ethtool - Ethernet switches: - Maxlinear (mxl862xx): - support for bridge offloading - support for VLANs - support driver statistics - Bluetooth: - large number of fixes and new device IDs - Mediatek: - support MT6639 (MT7927) - support MT7902 SDIO - WiFi: - Intel (iwlwifi): - UNII-9 and continuing UHR work - MediaTek (mt76): - mt7996/mt7925 MLO fixes/improvements - mt7996 NPU support (HW eth/wifi traffic offload) - Qualcomm (ath12k): - monitor mode support on IPQ5332 - basic hwmon temperature reporting - support IPQ5424 - Realtek: - add USB RX aggregation to improve performance - add USB TX flow control by tracking in-flight URBs - Cellular: - IPA v5.2 support" * tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits) net: pse-pd: fix kernel-doc function name for pse_control_find_by_id() wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit wireguard: allowedips: remove redundant space tools: ynl: add sample for wireguard wireguard: allowedips: Use kfree_rcu() instead of call_rcu() MAINTAINERS: Add netkit selftest files selftests/net: Add additional test coverage in nk_qlease selftests/net: Split netdevsim tests from HW tests in nk_qlease tools/ynl: Make YnlFamily closeable as a context manager net: airoha: Add missing PPE configurations in airoha_ppe_hw_init() net: airoha: Fix VIP configuration for AN7583 SoC net: caif: clear client service pointer on teardown net: strparser: fix skb_head leak in strp_abort_strp() net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete() selftests/bpf: add test for xdp_master_redirect with bond not up net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration sctp: disable BH before calling udp_tunnel_xmit_skb() sctp: fix missing encap_port propagation for GSO fragments net: airoha: Rely on net_device pointer in ETS callbacks ...
13 days	Merge tag 'for-7.1/io_uring-20260411' of ↵	Linus Torvalds	33	-506/+1030
	git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Add a callback driven main loop for io_uring, and BPF struct_ops on top to allow implementing custom event loop logic - Decouple IOPOLL from being a ring-wide all-or-nothing setting, allowing IOPOLL use cases to also issue certain white listed non-polled opcodes - Timeout improvements. Migrate internal timeout storage from timespec64 to ktime_t for simpler arithmetic and avoid copying of timespec data - Zero-copy receive (zcrx) updates: - Add a device-less mode (ZCRX_REG_NODEV) for testing and experimentation where data flows through the copy fallback path - Fix two-step unregistration regression, DMA length calculations, xarray mark usage, and a potential 32-bit overflow in id shifting - Refactoring toward multi-area support: dedicated refill queue struct, consolidated DMA syncing, netmem array refilling format, and guard-based locking - Zero-copy transmit (zctx) cleanup: - Unify io_send_zc() and io_sendmsg_zc() into a single function - Add vectorized registered buffer send for IORING_OP_SEND_ZC - Add separate notification user_data via sqe->addr3 so notification and completion CQEs can be distinguished without extra reference counting - Switch struct io_ring_ctx internal bitfields to explicit flag bits with atomic-safe accessors, and annotate the known harmless races on those flags - Various optimizations caching ctx and other request fields in local variables to avoid repeated loads, and cleanups for tctx setup, ring fd registration, and read path early returns * tag 'for-7.1/io_uring-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (58 commits) io_uring: unify getting ctx from passed in file descriptor io_uring/register: don't get a reference to the registered ring fd io_uring/tctx: clean up __io_uring_add_tctx_node() error handling io_uring/tctx: have io_uring_alloc_task_context() return tctx io_uring/timeout: use 'ctx' consistently io_uring/rw: clean up __io_read() obsolete comment and early returns io_uring/zcrx: use correct mmap off constants io_uring/zcrx: use dma_len for chunk size calculation io_uring/zcrx: don't clear not allocated niovs io_uring/zcrx: don't use mark0 for allocating xarray io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring() io_uring/zcrx: reject REG_NODEV with large rx_buf_size io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP io_uring/rsrc: use io_cache_free() to free node io_uring/zcrx: rename zcrx [un]register functions io_uring/zcrx: check ctrl op payload struct sizes io_uring/zcrx: cache fallback availability in zcrx ctx io_uring/zcrx: warn on a repeated area append io_uring/zcrx: consolidate dma syncing io_uring/zcrx: netmem array as refiling format ...
13 days	Merge tag 'vfs-7.1-rc1.misc' of ↵	Linus Torvalds	1	-1/+1
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - coredump: add tracepoint for coredump events - fs: hide file and bfile caches behind runtime const machinery Fixes: - fix architecture-specific compat_ftruncate64 implementations - dcache: Limit the minimal number of bucket to two - fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START - fs/mbcache: cancel shrink work before destroying the cache - dcache: permit dynamic_dname()s up to NAME_MAX Cleanups: - remove or unexport unused fs_context infrastructure - trivial ->setattr cleanups - selftests/filesystems: Assume that TIOCGPTPEER is defined - writeback: fix kernel-doc function name mismatch for wb_put_many() - autofs: replace manual symlink buffer allocation in autofs_dir_symlink - init/initramfs.c: trivial fix: FSM -> Finite-state machine - fs: remove stale and duplicate forward declarations - readdir: Introduce dirent_size() - fs: Replace user_access_{begin/end} by scoped user access - kernel: acct: fix duplicate word in comment - fs: write a better comment in step_into() concerning .mnt assignment - fs: attr: fix comment formatting and spelling issues" * tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) dcache: permit dynamic_dname()s up to NAME_MAX fs: attr: fix comment formatting and spelling issues fs: hide file and bfile caches behind runtime const machinery fs: write a better comment in step_into() concerning .mnt assignment proc: rename proc_notify_change to proc_setattr proc: rename proc_setattr to proc_nochmod_setattr affs: rename affs_notify_change to affs_setattr adfs: rename adfs_notify_change to adfs_setattr hfs: update comments on hfs_inode_setattr kernel: acct: fix duplicate word in comment fs: Replace user_access_{begin/end} by scoped user access readdir: Introduce dirent_size() coredump: add tracepoint for coredump events fs: remove do_sys_truncate fs: pass on FTRUNCATE_* flags to do_truncate fs: fix archiecture-specific compat_ftruncate64 fs: remove stale and duplicate forward declarations init/initramfs.c: trivial fix: FSM -> Finite-state machine autofs: replace manual symlink buffer allocation in autofs_dir_symlink fs/mbcache: cancel shrink work before destroying the cache ...
2026-04-10	Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'	Jakub Kicinski	1	-4/+8
	Daniel Borkmann says: ==================== netkit: Support for io_uring zero-copy and AF_XDP Containers use virtual netdevs to route traffic from a physical netdev in the host namespace. They do not have access to the physical netdev in the host and thus can't use memory providers or AF_XDP that require reconfiguring/restarting queues in the physical netdev. This patchset adds the concept of queue leasing to virtual netdevs that allow containers to use memory providers and AF_XDP at native speed. Leased queues are bound to a real queue in a physical netdev and act as a proxy. Memory providers and AF_XDP operations take an ifindex and queue id, so containers would pass in an ifindex for a virtual netdev and a queue id of a leased queue, which then gets proxied to the underlying real queue. We have implemented support for this concept in netkit and tested the latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504 (bnxt_en) 100G NICs. For more details see the individual patches. ==================== Link: https://patch.msgid.link/20260402231031.447597-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-10	net: Proxy netdev_queue_get_dma_dev for leased queues	David Wei	1	-1/+2
	Extend netdev_queue_get_dma_dev to return the physical device of the real rxq for DMA in case the queue was leased. This allows memory providers like io_uring zero-copy or devmem to bind to the physically leased rxq via virtual devices such as netkit. Signed-off-by: David Wei <dw@davidwei.uk> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-8-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-10	net: Slightly simplify net_mp_{open,close}_rxq	Daniel Borkmann	1	-3/+6
	net_mp_open_rxq is currently not used in the tree as all callers are using __net_mp_open_rxq directly, and net_mp_close_rxq is only used once while all other locations use __net_mp_close_rxq. Consolidate into a single API, netif_mp_{open,close}_rxq, using the netif_ prefix to indicate that the caller is responsible for locking. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-6-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-08	io_uring: unify getting ctx from passed in file descriptor	Jens Axboe	6	-58/+40
	io_uring_enter() and io_uring_register() end up having duplicated code for getting a ctx from a passed in file descriptor, for either a registered ring descriptor or a normal file descriptor. Move the io_uring_register_get_file() into io_uring.c and name it a bit more generically, and use it from both callsites rather than have that logic and handling duplicated. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-08	io_uring/register: don't get a reference to the registered ring fd	Jens Axboe	2	-5/+6
	This isn't necessary and was only done because the register path isn't a hot path and hence the extra ref/put doesn't matter, and to have the exit path be able to unconditionally put whatever file was gotten regardless of the type. In preparation for sharing this code with the main io_uring_enter(2) syscall, drop the reference and have the caller conditionally put the file if it was a normal file descriptor. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-08	io_uring/tctx: clean up __io_uring_add_tctx_node() error handling	Jens Axboe	1	-20/+40
	Refactor __io_uring_add_tctx_node() so that on error it never leaves current->io_uring pointing at a half-setup tctx. This moves the assignment of current->io_uring to the end of the function post any failure points. Separate out the node installation into io_tctx_install_node() to further clean this up. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-08	io_uring/tctx: have io_uring_alloc_task_context() return tctx	Jens Axboe	3	-14/+19
	Instead of having io_uring_alloc_task_context() return an int and assign tsk->io_uring, just have it return the task context directly. This enables cleaner error handling in callers, which may have failure points post calling io_uring_alloc_task_context(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-03	Merge tag 'io_uring-7.0-20260403' of ↵	Linus Torvalds	7	-29/+87
	git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - A previous fix in this release covered the case of the rings being RCU protected during resize, but it missed a few spots. This covers the rest - Fix the cBPF filters when COW'ed, introduced in this merge window - Fix for an attempt to import a zero sized buffer - Fix for a missing clamp in importing bundle buffers * tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/bpf_filters: retain COW'ed settings on parse failures io_uring: protect remaining lockless ctx->rings accesses with RCU io_uring/rsrc: reject zero-length fixed buffer import io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs()
2026-04-02	io_uring/timeout: use 'ctx' consistently	Yang Xiuwei	1	-2/+2
	There's already a local ctx variable, yet cq_timeouts accounting uses req->ctx. Use ctx consistently. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Link: https://patch.msgid.link/20260402014952.260414-1-yangxiuwei@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02	io_uring/rw: clean up __io_read() obsolete comment and early returns	Joanne Koong	1	-6/+5
	After commit a9165b83c193 ("io_uring/rw: always setup io_async_rw for read/write requests") which moved the iovec allocation into the prep path and stores it in req->async_data where it now gets freed as part of the request lifecycle, this comment is now outdated. Remove it and clean up the goto as well. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20260401173511.4052303-1-joannelkoong@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02	io_uring/zcrx: use correct mmap off constants	Pavel Begunkov	1	-1/+1
	zcrx was using IORING_OFF_PBUF_SHIFT during first iterations, but there is now a separate constant it should use. Both are 16 so it doesn't change anything, but improve it for the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/fe16ebe9ba4048a7e12f9b3b50880bd175b1ce03.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02	io_uring/zcrx: use dma_len for chunk size calculation	Pavel Begunkov	1	-1/+1
	Buffers are now dma-mapped earlier and we can sg_dma_len(), otherwise, since it's walking with for_each_sgtable_dma_sg(), it might wrongfully reject some configurations. As a bonus, it'd now be able to use larger chunks if dma addresses are coalesced e.g by iommu. Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/03b219af3f6cfdd1cf64679b8bab7461e47cc123.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02	io_uring/zcrx: don't clear not allocated niovs	Pavel Begunkov	1	-2/+4
	Now that area->is_mapped is set earlier before niovs array is allocated, io_zcrx_free_area -> io_zcrx_unmap_area in an error path can try to clear dma addresses for unallocated niovs, fix it. Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/cbcb7749b5a001ecd4d1c303515ce9403215640c.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: don't use mark0 for allocating xarray	Pavel Begunkov	1	-2/+2
	XA_MARK_0 is not compatible with xarray allocating entries, use XA_MARK_1. Fixes: fda90d43f4fac ("io_uring/zcrx: return back two step unregistration") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/f232cfd3c466047d333b474dd2bddd246b6ebb82.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring()	Anas Iqbal	1	-1/+1
	Smatch warns: io_uring/zcrx.c:393 io_allocate_rbuf_ring() warn: should 'id << 16' be a 64 bit type? The expression 'id << IORING_OFF_PBUF_SHIFT' is evaluated using 32-bit arithmetic because id is a u32. This may overflow before being promoted to the 64-bit mmap_offset. Cast id to u64 before shifting to ensure the shift is performed in 64-bit arithmetic. Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/52400e1b343691416bef3ed3ae287fb1a88d407f.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: reject REG_NODEV with large rx_buf_size	Pavel Begunkov	1	-1/+3
	The copy fallback path doesn't care about the actual niov size and only uses first PAGE_SIZE bytes, and any additional space will be wasted. Since ZCRX_REG_NODEV solely relies on the copy path, it doesn't make sense to support non-standard rx_buf_len. Reject it for now, and re-enable once improved. Fixes: c11728021d5cd ("io_uring/zcrx: implement device-less mode for zcrx") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3e7652d9c27f8ac5d2b141e3af47971f2771fb05.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP	Amir Mohammad Jahangirzad	1	-1/+8
	io_async_cancel_prep() reads the opcode selector from sqe->len and stores it in cancel->opcode, which is an 8-bit field. Since sqe->len is a 32-bit value, values larger than U8_MAX are implicitly truncated. This can cause unintended opcode matches when the truncated value corresponds to a valid io_uring opcode. For example, submitting a value such as 0x10b will be truncated to 0x0b (IORING_OP_TIMEOUT), allowing a cancel request to match operations it did not intend to target. Validate the opcode value before assigning it to the 8-bit field and reject values outside the valid io_uring opcode range. Signed-off-by: Amir Mohammad Jahangirzad <a.jahangirzad@gmail.com> Link: https://patch.msgid.link/20260331232113.615972-1-a.jahangirzad@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/rsrc: use io_cache_free() to free node	Jackie Liu	1	-1/+1
	Replace kfree(node) with io_cache_free() in io_buffer_register_bvec() to match all other error paths that free nodes allocated via io_rsrc_node_alloc(). The node is allocated through io_cache_alloc() internally, so it should be returned to the cache via io_cache_free() for proper object reuse. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Link: https://patch.msgid.link/20260331104509.7055-1-liu.yun@linux.dev [axboe: remove fixes tag, it's not a fix, it's a cleanup] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: rename zcrx [un]register functions	Pavel Begunkov	4	-10/+10
	Drop "ifqs" from function names, as it refers to an interface queue and there might be none once a device-less mode is introduced. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/657874acd117ec30fa6f45d9d844471c753b5a0f.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: check ctrl op payload struct sizes	Pavel Begunkov	1	-0/+2
	Add a build check that ctrl payloads are of the same size and don't grow struct zcrx_ctrl. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/af66caf9776d18e9ff880ab828eb159a6a03caf5.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: cache fallback availability in zcrx ctx	Pavel Begunkov	2	-1/+9
	Store a flag in struct io_zcrx_ifq telling if the backing memory is normal page or dmabuf based. It was looking it up from the area, however it logically allocates from the zcrx ctx and not a particular area, and once we add more than one area it'll become a mess. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/65e75408a7758fe7e60fae89b7a8d5ae4857f515.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: warn on a repeated area append	Pavel Begunkov	1	-1/+1
	We only support a single area, no path should be able to call io_zcrx_append_area() twice. Warn if that happens instead of just returning an error. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/28eb67fb8c48445584d7c247a36e1ad8800f0c8b.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: consolidate dma syncing	Pavel Begunkov	1	-11/+12
	Split refilling into two steps, first allocate niovs, and then do DMA sync for them. This way dma synchronisation code can be better optimised. E.g. we don't need to call dma_dev_need_sync() for each every niov, and maybe we can coalesce sync for adjacent netmems in the future as well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/19f2d50baa62ff2e0c6cd56dd7c394cab728c567.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: netmem array as refiling format	Pavel Begunkov	1	-15/+25
	Instead of peeking into page pool allocation cache directly or via net_mp_netmem_place_in_cache(), pass a netmem array around. It's a better intermediate format, e.g. you can have it on stack and reuse the refilling code and decouples it from page pools a bit more. It still points into the page pool directly, there will be no additional copies. As the next step, we can change the callback prototype to take the netmem array from page pool. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/9d8549adb7ef6672daf2d8a52858ce5926279a82.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: warn on alloc with non-empty pp cache	Pavel Begunkov	1	-2/+2
	Page pool ensures the cache is empty before asking to refill it. Warn if the assumption is violated. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/9c9792d6e65f3780d57ff83b6334d341ed9a5f29.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: move count check into zcrx_get_free_niov	Pavel Begunkov	1	-17/+21
	Instead of relying on the caller of __io_zcrx_get_free_niov() to check that there are free niovs available (i.e. free_count > 0), move the check into the function and return NULL if can't allocate. It consolidates the free count checks, and it'll be easier to extend the niov free list allocator in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/6df04a6b3a6170f86d4345da9864f238311163f9.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: use guards for locking	Pavel Begunkov	1	-8/+7
	Convert last several places using manual locking to guards to simplify the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/eb4667cfaf88c559700f6399da9e434889f5b04a.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: add a struct for refill queue	Pavel Begunkov	2	-31/+37
	Add a new structure that keeps the refill queue state. It's cleaner and will be useful once we introduce multiple refill queues. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/4ce200da1ff0309c377293b949200f95f80be9ae.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: use better name for RQ region	Pavel Begunkov	2	-5/+5
	Rename "region" to "rq_region" to highlight that it's a refill queue region. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/ac815790d2477a15826aecaa3d94f2a94ef507e6.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: implement device-less mode for zcrx	Pavel Begunkov	2	-15/+28
	Allow creating a zcrx instance without attaching it to a net device. All data will be copied through the fallback path. The user is also expected to use ZCRX_CTRL_FLUSH_RQ to handle overflows as it normally should even with a netdev, but it becomes even more relevant as there will likely be no one to automatically pick up buffers. Apart from that, it follows the zcrx uapi for the I/O path, and is useful for testing, experimentation, and potentially for the copy receive path in the future if improved. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/674f8ad679c5a0bc79d538352b3042cf0999596e.1774261953.git.asml.silence@gmail.com [axboe: fix spelling error in uapi header and commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: extract netdev+area init into a helper	Pavel Begunkov	1	-29/+43
	In preparation to following patches, add a function that is responsibly for looking up a netdev, creating an area, DMA mapping it and opening a queue. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/88cb6f746ecb496a9030756125419df273d0b003.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01	io_uring/zcrx: always dma map in advance	Pavel Begunkov	1	-29/+15
	zcrx was originally establisihing dma mappings at a late stage when it was being bound to a page pool. Dma-buf couldn't work this way, so it's initialised during area creation. It's messy having them do it at different spots, just move everything to the area creation time. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/334092a2cbdd4aabd7c025050aa99f05ace89bb5.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>