<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/net/xdp, branch v6.18.21</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.21</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.21'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-03-12T11:09:51+00:00</updated>
<entry>
<title>xsk: Fix zero-copy AF_XDP fragment drop</title>
<updated>2026-03-12T11:09:51+00:00</updated>
<author>
<name>Nikhil P. Rao</name>
<email>nikhil.rao@amd.com</email>
</author>
<published>2026-02-25T00:00:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6a674e0b19b0e508f2683d443a68f090837c593b'/>
<id>urn:sha1:6a674e0b19b0e508f2683d443a68f090837c593b</id>
<content type='text'>
[ Upstream commit f7387d6579d65efd490a864254101cb665f2e7a7 ]

AF_XDP should ensure that only a complete packet is sent to the
application. In the zero-copy case, if the Rx queue gets full as
fragments are being enqueued, the remaining fragments are dropped.

For the multi-buffer case, add a check to ensure that the Rx queue has
enough space for all fragments of a packet before starting to enqueue
them.
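
As a rough illustration of the all-or-nothing idea (a sketch, not the
actual diff; rx_ring_free_slots() and enqueue_frag() are hypothetical
stand-ins for the real xskq_prod_* machinery):

  static int rx_enqueue_packet(struct rx_ring *ring,
                               struct frag *frags, u32 nr_frags)
  {
          u32 i;

          /* Check space for the whole packet before producing
           * anything, so userspace never sees a partial frame.
           */
          if (rx_ring_free_slots(ring) &lt; nr_frags)
                  return -ENOBUFS;

          for (i = 0; i &lt; nr_frags; i++)
                  enqueue_frag(ring, &amp;frags[i]);

          return 0;
  }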

Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Signed-off-by: Nikhil P. Rao &lt;nikhil.rao@amd.com&gt;
Link: https://patch.msgid.link/20260225000456.107806-3-nikhil.rao@amd.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: Fix fragment node deletion to prevent buffer leak</title>
<updated>2026-03-12T11:09:50+00:00</updated>
<author>
<name>Nikhil P. Rao</name>
<email>nikhil.rao@amd.com</email>
</author>
<published>2026-02-25T00:00:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=645c6d8376ad4913cbffe0e0c2cca0c4febbe596'/>
<id>urn:sha1:645c6d8376ad4913cbffe0e0c2cca0c4febbe596</id>
<content type='text'>
[ Upstream commit 60abb0ac11dccd6b98fd9182bc5f85b621688861 ]

After commit b692bf9a7543 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node"),
the list_node field is reused for both the xskb pool list and the buffer
free list; this causes a buffer leak, as described below.

xp_free() checks if a buffer is already on the free list using
list_empty(&amp;xskb-&gt;list_node). When list_del() is used to remove a node
from the xskb pool list, it doesn't reinitialize the node pointers.
This means list_empty() will return false even after the node has been
removed, causing xp_free() to incorrectly skip adding the buffer to the
free list.

Fix this by using list_del_init() instead of list_del() in all fragment
handling paths; this ensures the list node is reinitialized after
removal, allowing the list_empty() check to work correctly.
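
The failure mode can be reproduced in a self-contained userspace model
(simplified list helpers; the kernel's list_del() additionally poisons
the pointers rather than leaving them untouched):

  #include &lt;stdbool.h&gt;
  #include &lt;stdio.h&gt;

  struct list_head { struct list_head *next, *prev; };

  static void INIT_LIST_HEAD(struct list_head *h) { h-&gt;next = h-&gt;prev = h; }
  static bool list_empty(const struct list_head *h) { return h-&gt;next == h; }

  static void list_add(struct list_head *e, struct list_head *head)
  {
          e-&gt;next = head-&gt;next;
          e-&gt;prev = head;
          head-&gt;next-&gt;prev = e;
          head-&gt;next = e;
  }

  static void list_del(struct list_head *e)
  {
          /* unlink only; e itself keeps pointing at its old neighbours */
          e-&gt;prev-&gt;next = e-&gt;next;
          e-&gt;next-&gt;prev = e-&gt;prev;
  }

  static void list_del_init(struct list_head *e)
  {
          list_del(e);
          INIT_LIST_HEAD(e);      /* node points at itself again */
  }

  int main(void)
  {
          struct list_head pool, node;

          INIT_LIST_HEAD(&amp;pool);

          list_add(&amp;node, &amp;pool);
          list_del(&amp;node);
          /* prints 0: xp_free() would wrongly skip the free list */
          printf("list_del:      empty=%d\n", list_empty(&amp;node));

          list_add(&amp;node, &amp;pool);
          list_del_init(&amp;node);
          /* prints 1: the buffer gets freed as expected */
          printf("list_del_init: empty=%d\n", list_empty(&amp;node));
          return 0;
  }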

Fixes: b692bf9a7543 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node")
Acked-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Signed-off-by: Nikhil P. Rao &lt;nikhil.rao@amd.com&gt;
Link: https://patch.msgid.link/20260225000456.107806-2-nikhil.rao@amd.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: avoid data corruption on cq descriptor number</title>
<updated>2025-11-26T03:51:50+00:00</updated>
<author>
<name>Fernando Fernandez Mancera</name>
<email>fmancera@suse.de</email>
</author>
<published>2025-11-24T17:14:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0ebc27a4c67d44e5ce88d21cdad8201862b78837'/>
<id>urn:sha1:0ebc27a4c67d44e5ce88d21cdad8201862b78837</id>
<content type='text'>
Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
production"), the descriptor number is stored in the skb control block,
and xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
the pool's completion queue.

The skb control block shouldn't be used for this purpose: after
transmit, xsk no longer has control over it, and other subsystems may
use it. This leads to the following kernel panic due to a NULL pointer
dereference.

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0000 [#1] SMP NOPTI
 CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy)  Debian 6.16.12-1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
 RIP: 0010:xsk_destruct_skb+0xd0/0x180
 [...]
 Call Trace:
  &lt;IRQ&gt;
  ? napi_complete_done+0x7a/0x1a0
  ip_rcv_core+0x1bb/0x340
  ip_rcv+0x30/0x1f0
  __netif_receive_skb_one_core+0x85/0xa0
  process_backlog+0x87/0x130
  __napi_poll+0x28/0x180
  net_rx_action+0x339/0x420
  handle_softirqs+0xdc/0x320
  ? handle_edge_irq+0x90/0x1e0
  do_softirq.part.0+0x3b/0x60
  &lt;/IRQ&gt;
  &lt;TASK&gt;
  __local_bh_enable_ip+0x60/0x70
  __dev_direct_xmit+0x14e/0x1f0
  __xsk_generic_xmit+0x482/0xb70
  ? __remove_hrtimer+0x41/0xa0
  ? __xsk_generic_xmit+0x51/0xb70
  ? _raw_spin_unlock_irqrestore+0xe/0x40
  xsk_sendmsg+0xda/0x1c0
  __sys_sendto+0x1ee/0x200
  __x64_sys_sendto+0x24/0x30
  do_syscall_64+0x84/0x2f0
  ? __pfx_pollwake+0x10/0x10
  ? __rseq_handle_notify_resume+0xad/0x4c0
  ? restore_fpregs_from_fpstate+0x3c/0x90
  ? switch_fpu_return+0x5b/0xe0
  ? do_syscall_64+0x204/0x2f0
  ? do_syscall_64+0x204/0x2f0
  ? do_syscall_64+0x204/0x2f0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  &lt;/TASK&gt;
 [...]
 Kernel panic - not syncing: Fatal exception in interrupt
 Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Instead, use the skb destructor_arg pointer along with pointer tagging.
As pointers are always aligned to 8B, use the bottom bit to indicate
whether this is a single address or an allocated struct containing
several addresses.
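
A minimal userspace model of the tagging scheme (illustrative names,
not the upstream definitions; malloc() results are at least 8B-aligned,
so bit 0 is free to carry the tag):

  #include &lt;stdint.h&gt;
  #include &lt;stdio.h&gt;
  #include &lt;stdlib.h&gt;

  #define XSK_MULTI_TAG 0x1UL     /* bit 0 set: arg is an allocated struct */

  struct xsk_addrs {              /* hypothetical multi-address container */
          uint32_t num;
          uint64_t addr[8];
  };

  static void *tag_addrs(struct xsk_addrs *a)
  {
          return (void *)((uintptr_t)a | XSK_MULTI_TAG);
  }

  static int is_multi(const void *arg)
  {
          return (uintptr_t)arg &amp; XSK_MULTI_TAG;
  }

  static struct xsk_addrs *untag_addrs(void *arg)
  {
          return (struct xsk_addrs *)((uintptr_t)arg &amp; ~XSK_MULTI_TAG);
  }

  int main(void)
  {
          struct xsk_addrs *a = malloc(sizeof(*a));
          /* what would be stored in the skb's destructor_arg */
          void *arg = tag_addrs(a);

          printf("tagged=%d untag-ok=%d\n",
                 is_multi(arg), untag_addrs(arg) == a);
          free(a);
          return 0;
  }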

Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
Suggested-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Fernando Fernandez Mancera &lt;fmancera@suse.de&gt;
Reviewed-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Reviewed-by: Jason Xing &lt;kerneljasonxing@gmail.com&gt;
Link: https://patch.msgid.link/20251124171409.3845-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: Harden userspace-supplied xdp_desc validation</title>
<updated>2025-10-10T17:07:48+00:00</updated>
<author>
<name>Alexander Lobakin</name>
<email>aleksander.lobakin@intel.com</email>
</author>
<published>2025-10-08T16:56:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=07ca98f906a403637fc5e513a872a50ef1247f3b'/>
<id>urn:sha1:07ca98f906a403637fc5e513a872a50ef1247f3b</id>
<content type='text'>
It turned out that certain clearly invalid values passed in xdp_desc
from userspace can pass xp_{,un}aligned_validate_desc() and then lead
to undefined behavior or to invalid frames being queued for xmit.

A desc-&gt;len close to ``U32_MAX`` with a non-zero pool-&gt;tx_metadata_len
can cause a positive integer overflow and wraparound; in the same way,
a low enough desc-&gt;addr with a non-zero pool-&gt;tx_metadata_len can
cause a negative integer overflow. Both scenarios can then pass the
validation successfully.
This doesn't happen with valid XSk applications, but can be used
to perform attacks.

Always promote desc-&gt;len to ``u64`` first to exclude positive
overflows of it. Use explicit check_{add,sub}_overflow() when
validating desc-&gt;addr (which is ``u64`` already).
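
A hedged userspace model of the hardened check (not the exact upstream
code; the kernel's check_{add,sub}_overflow() wrap the same compiler
builtins used below):

  #include &lt;stdbool.h&gt;
  #include &lt;stdint.h&gt;
  #include &lt;stdio.h&gt;

  static bool validate_desc(uint64_t addr, uint32_t len32,
                            uint32_t tx_metadata_len, uint64_t pool_size)
  {
          /* promote len to u64 first: len + tx_metadata_len cannot wrap */
          uint64_t len = (uint64_t)len32 + tx_metadata_len;
          uint64_t end, meta;

          /* addr + len must neither wrap around u64 nor leave the pool */
          if (__builtin_add_overflow(addr, len, &amp;end) || end &gt; pool_size)
                  return false;

          /* metadata sits below addr; a tiny addr must not wrap negative */
          if (__builtin_sub_overflow(addr, (uint64_t)tx_metadata_len, &amp;meta))
                  return false;

          return true;
  }

  int main(void)
  {
          /* a len close to U32_MAX used to wrap once metadata was added */
          printf("%d\n", validate_desc(4096, UINT32_MAX, 8, 1048576)); /* 0 */
          printf("%d\n", validate_desc(4096, 1024, 8, 1048576));       /* 1 */
          return 0;
  }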

bloat-o-meter reports a little growth of the code size:

add/remove: 0/0 grow/shrink: 2/1 up/down: 60/-16 (44)
Function                                     old     new   delta
xskq_cons_peek_desc                          299     330     +31
xsk_tx_peek_release_desc_batch               973    1002     +29
xsk_generic_xmit                            3148    3132     -16

but hopefully this doesn't hurt the performance much.

Fixes: 341ac980eab9 ("xsk: Support tx_metadata_len")
Cc: stable@vger.kernel.org # 6.8+
Signed-off-by: Alexander Lobakin &lt;aleksander.lobakin@intel.com&gt;
Reviewed-by: Jason Xing &lt;kerneljasonxing@gmail.com&gt;
Reviewed-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Link: https://lore.kernel.org/r/20251008165659.4141318-1-aleksander.lobakin@intel.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: wrap generic metadata handling onto separate function</title>
<updated>2025-09-26T20:51:45+00:00</updated>
<author>
<name>Maciej Fijalkowski</name>
<email>maciej.fijalkowski@intel.com</email>
</author>
<published>2025-09-25T16:00:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=30c3055f9c0d84a67b8fd723bdec9b1b52b3c695'/>
<id>urn:sha1:30c3055f9c0d84a67b8fd723bdec9b1b52b3c695</id>
<content type='text'>
xsk_build_skb() has gone wild with its size, and one of the things we
can do about it is pull out the branch that takes care of metadata
handling and make it a separate function.

While at it, let us add metadata SW support for devices with the
IFF_TX_SKB_NO_LINEAR flag, which happen to have separate logic for
building the skb in xsk's generic xmit path.
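
Roughly, the extracted helper ends up shaped like this (a sketch with
an illustrative name and signature; the real helper may differ):

  static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
                              struct xsk_buff_pool *pool)
  {
          struct xsk_tx_metadata *meta;

          if (!pool-&gt;tx_metadata_len)
                  return 0;

          meta = buffer - pool-&gt;tx_metadata_len;
          if (unlikely(!xsk_buff_valid_tx_metadata(meta)))
                  return -EINVAL;

          /* checksum / timestamp requests handled here, exactly as the
           * inlined branch did; both the linear path and the
           * IFF_TX_SKB_NO_LINEAR path can now call this.
           */
          return 0;
  }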

Acked-by: Stanislav Fomichev &lt;sdf@fomichev.me&gt;
Reviewed-by: Jason Xing &lt;kerneljasonxing@gmail.com&gt;
Signed-off-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Acked-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
Link: https://patch.msgid.link/20250925160009.2474816-4-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: remove @first_frag from xsk_build_skb()</title>
<updated>2025-09-26T20:51:45+00:00</updated>
<author>
<name>Maciej Fijalkowski</name>
<email>maciej.fijalkowski@intel.com</email>
</author>
<published>2025-09-25T16:00:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6b9c129c2f93df545248e26434720928f249ff2e'/>
<id>urn:sha1:6b9c129c2f93df545248e26434720928f249ff2e</id>
<content type='text'>
Instead of using an auxiliary boolean that tracks whether we are at the
first frag when gathering all elements of the skb, the same
functionality can be achieved by checking whether
skb_shared_info::nr_frags is 0.

Remove @first_frag, but be careful around xsk_build_skb_zerocopy() and
NULL the skb pointer when it fails, so that the common error path does
not misinterpret it when deciding whether to call kfree_skb().
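
In other words, inside the descriptor loop the old boolean collapses to
a check like this (sketch):

  /* nr_frags stays 0 until the first page fragment is attached,
   * so it can stand in for the old @first_frag boolean.
   */
  if (!skb_shinfo(skb)-&gt;nr_frags) {
          /* first descriptor: build/fill the linear part of the skb */
  } else {
          /* later descriptor: attach it as an additional page frag */
  }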

Signed-off-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Acked-by: Stanislav Fomichev &lt;sdf@fomichev.me&gt;
Reviewed-by: Jason Xing &lt;kerneljasonxing@gmail.com&gt;
Acked-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
Link: https://patch.msgid.link/20250925160009.2474816-3-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: avoid overwriting skb fields for multi-buffer traffic</title>
<updated>2025-09-26T20:51:45+00:00</updated>
<author>
<name>Maciej Fijalkowski</name>
<email>maciej.fijalkowski@intel.com</email>
</author>
<published>2025-09-25T16:00:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c30d084960cf316c95fbf145d39974ce1ff7889c'/>
<id>urn:sha1:c30d084960cf316c95fbf145d39974ce1ff7889c</id>
<content type='text'>
We are unnecessarily setting a bunch of skb fields for each processed
descriptor, which is redundant for fragmented frames.

Let us set these members for the first fragment only. To address both
paths that we have within xsk_build_skb(), move the assignments into
xsk_set_destructor_arg() and rename it to xsk_skb_init_misc().
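
A sketch of the reshaped helper (the field list is illustrative; the
point is that it now runs once per frame instead of once per
descriptor):

  static void xsk_skb_init_misc(struct sk_buff *skb, struct xdp_sock *xs)
  {
          skb-&gt;dev = xs-&gt;dev;
          skb-&gt;priority = READ_ONCE(xs-&gt;sk.sk_priority);
          skb-&gt;mark = READ_ONCE(xs-&gt;sk.sk_mark);
          skb-&gt;destructor = xsk_destruct_skb;
          /* destructor_arg is still set here, as the old name implied */
  }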

Signed-off-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Acked-by: Stanislav Fomichev &lt;sdf@fomichev.me&gt;
Reviewed-by: Jason Xing &lt;kerneljasonxing@gmail.com&gt;
Acked-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
Link: https://patch.msgid.link/20250925160009.2474816-2-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: Fix immature cq descriptor production</title>
<updated>2025-09-09T22:09:55+00:00</updated>
<author>
<name>Maciej Fijalkowski</name>
<email>maciej.fijalkowski@intel.com</email>
</author>
<published>2025-09-04T19:49:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=30f241fcf52aaaef7ac16e66530faa11be78a865'/>
<id>urn:sha1:30f241fcf52aaaef7ac16e66530faa11be78a865</id>
<content type='text'>
Eryk reported an issue, referenced in the Closes: tag, related to umem
addrs being prematurely produced onto the pool's completion queue. Let
us make the skb's destructor responsible for producing all addrs that
a given skb used.

The commit in the Fixes tag introduced the buggy behavior; it was not
broken from day 1, but rather when xsk multi-buffer support got
introduced.

In order to mitigate the performance impact as much as possible, mimic
the linear and frag parts within the skb by storing the first address
from the XSK descriptor at sk_buff::destructor_arg. For fragments,
store them at ::cb via a list. The nodes that go onto the list are
allocated via kmem_cache. xsk_destruct_skb() consumes the address
stored at ::destructor_arg and optionally walks the list from ::cb if
the count of descriptors associated with this particular skb is bigger
than 1.
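
The resulting layout looks roughly like this (a sketch; the names
follow the description above but may not match the code exactly):

  struct xsk_addr_node {          /* kmem_cache-backed, one per
                                   * fragment beyond the first
                                   */
          u64 addr;               /* umem addr from the XSK descriptor */
          struct list_head addr_node;
  };

  struct xsk_addr_head {          /* lives in sk_buff::cb */
          u32 num_descs;
          struct list_head addrs_list;    /* walked when num_descs &gt; 1 */
  };

  /* the first descriptor's addr is stashed directly at
   * destructor_arg, so the common single-desc case allocates nothing
   */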

The previous approach, where a whole array for storing UMEM addresses
from XSK descriptors was pre-allocated during first fragment
processing, yielded too big a performance regression for 64b traffic.
With the current approach the impact is much reduced in my tests, and
for jumbo frames I observed traffic being slower by at most 9%.

Magnus suggested special-casing this processing for XDP_SHARED_UMEM, so
that we would identify it during bind and set different hooks for the
'backpressure mechanism' on the CQ and for the skb destructor; but
given that the results looked promising on my side, I decided to keep a
single data path for XSK generic Tx. I suppose other auxiliary work
would have to land as well in order to make that approach viable.

Fixes: b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
Reported-by: Eryk Kubanski &lt;e.kubanski@partner.samsung.com&gt;
Closes: https://lore.kernel.org/netdev/20250530103456.53564-1-e.kubanski@partner.samsung.com/
Acked-by: Stanislav Fomichev &lt;sdf@fomichev.me&gt;
Signed-off-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Tested-by: Jason Xing &lt;kerneljasonxing@gmail.com&gt;
Reviewed-by: Jason Xing &lt;kerneljasonxing@gmail.com&gt;
Link: https://lore.kernel.org/r/20250904194907.2342177-1-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockopt</title>
<updated>2025-07-10T12:48:29+00:00</updated>
<author>
<name>Jason Xing</name>
<email>kernelxing@tencent.com</email>
</author>
<published>2025-07-04T16:01:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=45e359be1ce88fb22e61fa3aa23b2e450a6cae03'/>
<id>urn:sha1:45e359be1ce88fb22e61fa3aa23b2e450a6cae03</id>
<content type='text'>
This patch provides a setsockopt so that applications can adjust how
many descs are handled at most in one send syscall. It mitigates the
situation where the default value (32) is too small and leads to a
higher frequency of send syscalls.

Given the diversity and complexity of applications, there is no single
ideal value fitting all cases, so keep 32 as the default value, as
before.

The patch does the following things:
- Add XDP_MAX_TX_SKB_BUDGET socket option.
- Set max_tx_budget to 32 by default in the initialization phase as a
  per-socket granular control.
- Set the range of max_tx_budget as [32, xs-&gt;tx-&gt;nentries].
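
From the application side, usage is a plain setsockopt on the XSK
socket (a minimal sketch; per the range above, the tx ring should
already be configured, since the upper clamp is xs-&gt;tx-&gt;nentries):

  #include &lt;stdio.h&gt;
  #include &lt;sys/socket.h&gt;
  #include &lt;linux/if_xdp.h&gt;      /* XDP_MAX_TX_SKB_BUDGET */

  int main(void)
  {
          int fd = socket(AF_XDP, SOCK_RAW, 0);
          int budget = 128;       /* clamped to [32, tx nentries] */

          if (setsockopt(fd, SOL_XDP, XDP_MAX_TX_SKB_BUDGET,
                         &amp;budget, sizeof(budget)))
                  perror("setsockopt(XDP_MAX_TX_SKB_BUDGET)");
          return 0;
  }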

The idea behind this comes out of real workloads in production. We use
a user-level stack with xsk support to accelerate sending packets and
minimize syscalls. When packets are aggregated, it's not hard to hit
the upper bound (namely, 32). The moment the user-space stack sees the
-EAGAIN error returned from sendto(), it loops to try again until all
the expected descs from the tx ring have been sent out to the driver.
Enlarging the XDP_MAX_TX_SKB_BUDGET value leads to fewer sendto()
calls and higher throughput/PPS.

Here is what I did in production, along with some numbers:
For one application I saw lately, I suggested using 128 as
max_tx_budget because I saw two limitations without changing any
default configuration: 1) XDP_MAX_TX_SKB_BUDGET, 2) the socket sndbuf,
which is 212992 as decided by net.core.wmem_default. As to
XDP_MAX_TX_SKB_BUDGET, I counted how many descs are transmitted to the
driver per sendto() call based on the [1] patch, and then calculated
the probability of hitting the upper bound. I finally chose 128 as a
suitable value because 1) it covers most of the cases, and 2) a higher
number would not bring evident gains. After tuning the parameters, a
stable improvement of around 4% in both PPS and throughput, along with
lower resource consumption, was observed via strace -c -p xxx:
1) %time was decreased by 7.8%
2) the error counter was decreased from 18367 to 572

[1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/

Signed-off-by: Jason Xing &lt;kernelxing@tencent.com&gt;
Acked-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;

</content>
</entry>
<entry>
<title>net: xsk: update tx queue consumer immediately after transmission</title>
<updated>2025-07-09T01:28:07+00:00</updated>
<author>
<name>Jason Xing</name>
<email>kernelxing@tencent.com</email>
</author>
<published>2025-07-03T14:17:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1eb8b0dac1899c4f806b00253ac03b1a039663c0'/>
<id>urn:sha1:1eb8b0dac1899c4f806b00253ac03b1a039663c0</id>
<content type='text'>
For afxdp, the return value of the sendto() syscall doesn't reflect how
many descs were handled in the kernel. One use case: a user-space
application wants to know the number of transmitted skbs and then
decide whether to continue sending, say, because it was stopped due to
hitting the max tx budget.

The following formula can be used after sending to learn how many
skbs/descs the kernel has taken care of:

  tx_queue.consumers_after - tx_queue.consumers_before
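
For example (a hedged userspace sketch; cons is assumed to point at
the tx ring's consumer counter inside the region mmap()ed at
XDP_PGOFF_TX_RING, located via the XDP_MMAP_OFFSETS getsockopt):

  #include &lt;stdint.h&gt;
  #include &lt;sys/socket.h&gt;

  static uint32_t tx_descs_consumed(int xsk_fd, volatile uint32_t *cons)
  {
          uint32_t before = *cons;

          /* kick the kernel to transmit from the tx ring */
          sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);

          return *cons - before;  /* wrap-safe unsigned difference */
  }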

Prior to this patch, in non-zc mode, the consumer of the tx queue was
not immediately updated at the end of each sendto syscall when an
error occurred, which left the consumer value out of date from the
perspective of user space. So this patch adds a store operation that
publishes the cached consumer value to the shared ring state.

Beyond the explicit errors handled in the while() loop in
__xsk_generic_xmit(), there are a few possible error cases that might
be missed in the following call chain:

  __xsk_generic_xmit()
    xskq_cons_peek_desc()
      xskq_cons_read_desc()
        xskq_cons_is_valid_desc()

These can also cause a premature exit from the while() loop even
though not all the descs have been consumed.

Based on the above analysis, keying off @sent_frame covers all the
possible cases where the globally visible consumer state could be left
out of date after __xsk_generic_xmit() finishes.

The patch also adds a common helper, __xsk_tx_release(), to keep
aligned with the zc-mode usage in xsk_tx_release().

Signed-off-by: Jason Xing &lt;kernelxing@tencent.com&gt;
Acked-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Acked-by: Stanislav Fomichev &lt;sdf@fomichev.me&gt;
Link: https://patch.msgid.link/20250703141712.33190-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
</feed>
