kernel/linux.git/drivers/net/veth.c, branch v6.12.80

veth: fix data race in veth_get_ethtool_stats

2026-01-30T09:28:39+00:00

[ Upstream commit b47adaab8b3d443868096bac08fdbb3d403194ba ] In veth_get_ethtool_stats(), some statistics protected by u64_stats_sync, are read and accumulated in ignorance of possible u64_stats_fetch_retry() events. These statistics, peer_tq_xdp_xmit and peer_tq_xdp_xmit_err, are already accumulated by veth_xdp_xmit(). Fix this by reading them into a temporary buffer first. Fixes: 5fe6e56776ba ("veth: rely on peer veth_rq for ndo_xdp_xmit accounting") Signed-off-by: David Yang Link: https://patch.msgid.link/20260114122450.227982-1-mmyangfl@gmail.com Signed-off-by: Jakub Kicinski Signed-off-by: Sasha Levin

veth: reduce XDP no_direct return section to fix race

2025-12-06T21:24:54+00:00

[ Upstream commit a14602fcae17a3f1cb8a8521bedf31728f9e7e39 ] As explain in commit fa349e396e48 ("veth: Fix race with AF_XDP exposing old or uninitialized descriptors") for veth there is a chance after napi_complete_done() that another CPU can manage start another NAPI instance running veth_pool(). For NAPI this is correctly handled as the napi_schedule_prep() check will prevent multiple instances from getting scheduled, but for the remaining code in veth_pool() this can run concurrent with the newly started NAPI instance. The problem/race is that xdp_clear_return_frame_no_direct() isn't designed to be nested. Prior to commit 401cb7dae813 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.") the temporary BPF net context bpf_redirect_info was stored per CPU, where this wasn't an issue. Since this commit the BPF context is stored in 'current' task_struct. When running veth in threaded-NAPI mode, then the kthread becomes the storage area. Now a race exists between two concurrent veth_pool() function calls one exiting NAPI and one running new NAPI, both using the same BPF net context. Race is when another CPU gets within the xdp_set_return_frame_no_direct() section before exiting veth_pool() calls the clear-function xdp_clear_return_frame_no_direct(). Fixes: 401cb7dae8130 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.") Signed-off-by: Jesper Dangaard Brouer Link: https://patch.msgid.link/176356963888.337072.4805242001928705046.stgit@firesoul Signed-off-by: Jakub Kicinski Signed-off-by: Sasha Levin

veth: more robust handing of race to avoid txq getting stuck

2025-12-06T21:24:54+00:00

[ Upstream commit 5442a9da69789741bfda39f34ee7f69552bf0c56 ] Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops") introduced a race condition that can lead to a permanently stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra Max). The race occurs in veth_xmit(). The producer observes a full ptr_ring and stops the queue (netif_tx_stop_queue()). The subsequent conditional logic, intended to re-wake the queue if the consumer had just emptied it (if (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and traffic halts. This failure is caused by an incorrect use of the __ptr_ring_empty() API from the producer side. As noted in kernel comments, this check is not guaranteed to be correct if a consumer is operating on another CPU. The empty test is based on ptr_ring->consumer_head, making it reliable only for the consumer. Using this check from the producer side is fundamentally racy. This patch fixes the race by adopting the more robust logic from an earlier version V4 of the patchset, which always flushed the peer: (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier are removed. Instead, after stopping the queue, we unconditionally call __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled, making it solely responsible for re-waking the TXQ. This handles the race where veth_poll() consumes all packets and completes NAPI *before* veth_xmit() on the producer side has called netif_tx_stop_queue. The __veth_xdp_flush(rq) will observe rx_notify_masked is false and schedule NAPI. (2) On the consumer side, the logic for waking the peer TXQ is moved out of veth_xdp_rcv() and placed at the end of the veth_poll() function. This placement is part of fixing the race, as the netif_tx_queue_stopped() check must occur after rx_notify_masked is potentially set to false during NAPI completion. This handles the race where veth_poll() consumes all packets, but haven't finished (rx_notify_masked is still true). The producer veth_xmit() stops the TXQ and __veth_xdp_flush(rq) will observe rx_notify_masked is true, meaning not starting NAPI. Then veth_poll() change rx_notify_masked to false and stops NAPI. Before exiting veth_poll() will observe TXQ is stopped and wake it up. Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops") Reviewed-by: Toshiaki Makita Signed-off-by: Jesper Dangaard Brouer Link: https://patch.msgid.link/176295323282.307447.14790015927673763094.stgit@firesoul Signed-off-by: Jakub Kicinski Stable-dep-of: a14602fcae17 ("veth: reduce XDP no_direct return section to fix race") Signed-off-by: Sasha Levin

veth: prevent NULL pointer dereference in veth_xdp_rcv

2025-12-06T21:24:54+00:00

[ Upstream commit 9337c54401a5bb6ac3c9f6c71dd2a9130cfba82e ] The veth peer device is RCU protected, but when the peer device gets deleted (veth_dellink) then the pointer is assigned NULL (via RCU_INIT_POINTER). This patch adds a necessary NULL check in veth_xdp_rcv when accessing the veth peer net_device. This fixes a bug introduced in commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops"). The bug is a race and only triggers when having inflight packets on a veth that is being deleted. Reported-by: Ihor Solodrai Closes: https://lore.kernel.org/all/fecfcad0-7a16-42b8-bff2-66ee83a6e5c4@linux.dev/ Reported-by: syzbot+c4c7bf27f6b0c4bd97fe@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/683da55e.a00a0220.d8eae.0052.GAE@google.com/ Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops") Signed-off-by: Jesper Dangaard Brouer Acked-by: Ihor Solodrai Link: https://patch.msgid.link/174964557873.519608.10855046105237280978.stgit@firesoul Signed-off-by: Jakub Kicinski Stable-dep-of: a14602fcae17 ("veth: reduce XDP no_direct return section to fix race") Signed-off-by: Sasha Levin

veth: apply qdisc backpressure on full ptr_ring to reduce TX drops

2025-12-06T21:24:54+00:00

[ Upstream commit dc82a33297fc2c58cb0b2b008d728668d45c0f6a ] In production, we're seeing TX drops on veth devices when the ptr_ring fills up. This can occur when NAPI mode is enabled, though it's relatively rare. However, with threaded NAPI - which we use in production - the drops become significantly more frequent. The underlying issue is that with threaded NAPI, the consumer often runs on a different CPU than the producer. This increases the likelihood of the ring filling up before the consumer gets scheduled, especially under load, leading to drops in veth_xmit() (ndo_start_xmit()). This patch introduces backpressure by returning NETDEV_TX_BUSY when the ring is full, signaling the qdisc layer to requeue the packet. The txq (netdev queue) is stopped in this condition and restarted once veth_poll() drains entries from the ring, ensuring coordination between NAPI and qdisc. Backpressure is only enabled when a qdisc is attached. Without a qdisc, the driver retains its original behavior - dropping packets immediately when the ring is full. This avoids unexpected behavior changes in setups without a configured qdisc. With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management (AQM) to fairly schedule packets across flows and reduce collateral damage from elephant flows. A known limitation of this approach is that the full ring sits in front of the qdisc layer, effectively forming a FIFO buffer that introduces base latency. While AQM still improves fairness and mitigates flow dominance, the latency impact is measurable. In hardware drivers, this issue is typically addressed using BQL (Byte Queue Limits), which tracks in-flight bytes needed based on physical link rate. However, for virtual drivers like veth, there is no fixed bandwidth constraint - the bottleneck is CPU availability and the scheduler's ability to run the NAPI thread. It is unclear how effective BQL would be in this context. This patch serves as a first step toward addressing TX drops. Future work may explore adapting a BQL-like mechanism to better suit virtual devices like veth. Reported-by: Yan Zhai Signed-off-by: Jesper Dangaard Brouer Reviewed-by: Toshiaki Makita Link: https://patch.msgid.link/174559294022.827981.1282809941662942189.stgit@firesoul Signed-off-by: Jakub Kicinski Stable-dep-of: a14602fcae17 ("veth: reduce XDP no_direct return section to fix race") Signed-off-by: Sasha Levin

netdev_features: convert NETIF_F_LLTX to dev->lltx

2024-09-03T09:36:43+00:00

NETIF_F_LLTX can't be changed via Ethtool and is not a feature, rather an attribute, very similar to IFF_NO_QUEUE (and hot). Free one netdev_features_t bit and make it a "hot" private flag. Signed-off-by: Alexander Lobakin Signed-off-by: Paolo Abeni

net: veth: Disable netpoll support

2024-08-06T19:18:30+00:00

The current implementation of netpoll in veth devices leads to suboptimal behavior, as it triggers warnings due to the invocation of __netif_rx() within a softirq context. This is not compliant with expected practices, as __netif_rx() has the following statement: lockdep_assert_once(hardirq_count() | softirq_count()); Given that veth devices typically do not benefit from the functionalities provided by netpoll, Disable netpoll for veth interfaces. Signed-off-by: Breno Leitao Link: https://patch.msgid.link/20240805094012.1843247-1-leitao@debian.org Signed-off-by: Jakub Kicinski

Revert "net: mirror skb frag ref/unref helpers"

2024-05-03T23:05:53+00:00

This reverts commit a580ea994fd37f4105028f5a85c38ff6508a2b25. This revert is to resolve Dragos's report of page_pool leak here: https://lore.kernel.org/lkml/20240424165646.1625690-2-dtatulea@nvidia.com/ The reverted patch interacts very badly with commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets"). The reverted commit hopes that the pp_recycle + is_pp_page variables do not change between the skb_frag_ref and skb_frag_unref operation. If such a change occurs, the skb_frag_ref/unref will not operate on the same reference type. In the case of Dragos's report, the grabbed ref was a pp ref, but the unref was a page ref, because the pp_recycle setting on the skb was changed. Attempting to fix this issue on the fly is risky. Lets revert and I hope to reland this with better understanding and testing to ensure we don't regress some edge case while streamlining skb reffing. Fixes: a580ea994fd3 ("net: mirror skb frag ref/unref helpers") Reported-by: Dragos Tatulea Signed-off-by: Mina Almasry Link: https://lore.kernel.org/r/20240502175423.2456544-1-almasrymina@google.com Signed-off-by: Jakub Kicinski

net: mirror skb frag ref/unref helpers

2024-04-12T02:29:23+00:00

Refactor some of the skb frag ref/unref helpers for improved clarity. Implement napi_pp_get_page() to be the mirror counterpart of napi_pp_put_page(). Implement skb_page_ref() to be the mirror of skb_page_unref(). Improve __skb_frag_ref() to become a mirror counterpart of __skb_frag_unref(). Previously unref could handle pp & non-pp pages, while the ref could only handle non-pp pages. Now both the ref & unref helpers can correctly handle both pp & non-pp pages. Now that __skb_frag_ref() can handle both pp & non-pp pages, remove skb_pp_frag_ref(), and use __skb_frag_ref() instead. This lets us remove pp specific handling from skb_try_coalesce. Additionally, since __skb_frag_ref() can now handle both pp & non-pp pages, a latent issue in skb_shift() should now be fixed. Previously this function would do a non-pp ref & pp unref on potential pp frags (fragfrom). After this patch, skb_shift() should correctly do a pp ref/unref on pp frags. Signed-off-by: Mina Almasry Reviewed-by: Dragos Tatulea Reviewed-by: Jacob Keller Link: https://lore.kernel.org/r/20240410190505.1225848-3-almasrymina@google.com Signed-off-by: Jakub Kicinski

net: move skb ref helpers to new header

2024-04-12T02:29:22+00:00

Add a new header, linux/skbuff_ref.h, which contains all the skb_*_ref() helpers. Many of the consumers of skbuff.h do not actually use any of the skb ref helpers, and we can speed up compilation a bit by minimizing this header file. Additionally in the later patch in the series we add page_pool support to skb_frag_ref(), which requires some page_pool dependencies. We can now add these dependencies to skbuff_ref.h instead of a very ubiquitous skbuff.h Signed-off-by: Mina Almasry Link: https://lore.kernel.org/r/20240410190505.1225848-2-almasrymina@google.com Signed-off-by: Jakub Kicinski