kernel/linux.git/drivers/net/wireguard/device.c, branch v6.6.133

Revert "wireguard: device: enable threaded NAPI"

2026-02-19T15:28:26+00:00

This reverts commit 8c9e9cd398777fd60ba202211da1110614cb5bc5 which is commit db9ae3b6b43c79b1ba87eea849fd65efa05b4b2e upstream. We have had three independent production user reports in combination with Cilium utilizing WireGuard as encryption underneath that k8s Pod E/W traffic to certain peer nodes fully stalled. The situation appears as follows: - Occurs very rarely but at random times under heavy networking load. - Once the issue triggers the decryption side stops working completely for that WireGuard peer, other peers keep working fine. The stall happens also for newly initiated connections towards that particular WireGuard peer. - Only the decryption side is affected, never the encryption side. - Once it triggers, it never recovers and remains in this state, the CPU/mem on that node looks normal, no leak, busy loop or crash. - bpftrace on the affected system shows that wg_prev_queue_enqueue fails, thus the MAX_QUEUED_PACKETS (1024 skbs!) for the peer's rx_queue is reached. - Also, bpftrace shows that wg_packet_rx_poll for that peer is never called again after reaching this state for that peer. For other peers wg_packet_rx_poll does get called normally. - Commit db9ae3b ("wireguard: device: enable threaded NAPI") switched WireGuard to threaded NAPI by default. The default has not been changed for triggering the issue, neither did CPU hotplugging occur (i.e. 5bd8de2 ("wireguard: queueing: always return valid online CPU in wg_cpumask_choose_online()")). - The issue has been observed with stable kernels of v5.15 as well as v6.1. It was reported to us that v5.10 stable is working fine, and no report on v6.6 stable either (somewhat related discussion in [0] though). - In the WireGuard driver the only material difference between v5.10 stable and v5.15 stable is the switch to threaded NAPI by default. [0] https://lore.kernel.org/netdev/CA+wXwBTT74RErDGAnj98PqS=wvdh8eM1pi4q6tTdExtjnokKqA@mail.gmail.com/ Breakdown of the problem: 1) skbs arriving for decryption are enqueued to the peer->rx_queue in wg_packet_consume_data via wg_queue_enqueue_per_device_and_peer. 2) The latter only moves the skb into the MPSC peer queue if it does not surpass MAX_QUEUED_PACKETS (1024) which is kept track in an atomic counter via wg_prev_queue_enqueue. 3) In case enqueueing was successful, the skb is also queued up in the device queue, round-robin picks a next online CPU, and schedules the decryption worker. 4) The wg_packet_decrypt_worker, once scheduled, picks these up from the queue, decrypts the packets and once done calls into wg_queue_enqueue_per_peer_rx. 5) The latter updates the state to PACKET_STATE_CRYPTED on success and calls napi_schedule on the per peer->napi instance. 6) NAPI then polls via wg_packet_rx_poll. wg_prev_queue_peek checks on the peer->rx_queue. It will wg_prev_queue_dequeue if the queue->peeked skb was not cached yet, or just return the latter otherwise. (wg_prev_queue_drop_peeked later clears the cache.) 7) From an ordering perspective, the peer->rx_queue has skbs in order while the device queue with the per-CPU worker threads from a global ordering PoV can finish the decryption and signal the skb PACKET_STATE_CRYPTED out of order. 8) A situation can be observed that the first packet coming in will be stuck waiting for the decryption worker to be scheduled for a longer time when the system is under pressure. 9) While this is the case, the other CPUs in the meantime finish decryption and call into napi_schedule. 10) Now in wg_packet_rx_poll it picks up the first in-order skb from the peer->rx_queue and sees that its state is still PACKET_STATE_UNCRYPTED. The NAPI poll routine then exits early with work_done = 0 and calls napi_complete_done, signalling it "finished" processing. 11) The assumption in wg_packet_decrypt_worker is that when the decryption finished the subsequent napi_schedule will always lead to a later invocation of wg_packet_rx_poll to pick up the finished packet. 12) However, it appears that a later napi_schedule does /not/ schedule a later poll and thus no wg_packet_rx_poll. 13) If this situation happens exactly for the corner case where the decryption worker of the first packet is stuck and waiting to be scheduled, and the network load for WireGuard is very high then the queue can build up to MAX_QUEUED_PACKETS. 14) If this situation occurs, then no new decryption worker will be scheduled and also no new napi_schedule to make forward progress. 15) This means the peer->rx_queue stops processing packets completely and they are indefinitely stuck waiting for a new NAPI poll on that peer which never happens. New packets for that peer are then dropped due to full queue, as it has been observed on the production machines. Technically, the backport of commit db9ae3b6b43c ("wireguard: device: enable threaded NAPI") to stable should not have happened since it is more of an optimization rather than a pure fix and addresses a NAPI situation with utilizing many WireGuard tunnel devices in parallel. Revert it from stable given the backport triggers a regression for mentioned kernels. Signed-off-by: Daniel Borkmann Acked-by: Jason A. Donenfeld Signed-off-by: Greg Kroah-Hartman

wireguard: device: enable threaded NAPI

2025-06-19T13:28:35+00:00

[ Upstream commit db9ae3b6b43c79b1ba87eea849fd65efa05b4b2e ] Enable threaded NAPI by default for WireGuard devices in response to low performance behavior that we observed when multiple tunnels (and thus multiple wg devices) are deployed on a single host. This affects any kind of multi-tunnel deployment, regardless of whether the tunnels share the same endpoints or not (i.e., a VPN concentrator type of gateway would also be affected). The problem is caused by the fact that, in case of a traffic surge that involves multiple tunnels at the same time, the polling of the NAPI instance of all these wg devices tends to converge onto the same core, causing underutilization of the CPU and bottlenecking performance. This happens because NAPI polling is hosted by default in softirq context, but the WireGuard driver only raises this softirq after the rx peer queue has been drained, which doesn't happen during high traffic. In this case, the softirq already active on a core is reused instead of raising a new one. As a result, once two or more tunnel softirqs have been scheduled on the same core, they remain pinned there until the surge ends. In our experiments, this almost always leads to all tunnel NAPIs being handled on a single core shortly after a surge begins, limiting scalability to less than 3× the performance of a single tunnel, despite plenty of unused CPU cores being available. The proposed mitigation is to enable threaded NAPI for all WireGuard devices. This moves the NAPI polling context to a dedicated per-device kernel thread, allowing the scheduler to balance the load across all available cores. On our 32-core gateways, enabling threaded NAPI yields a ~4× performance improvement with 16 tunnels, increasing throughput from ~13 Gbps to ~48 Gbps. Meanwhile, CPU usage on the receiver (which is the bottleneck) jumps from 20% to 100%. We have found no performance regressions in any scenario we tested. Single-tunnel throughput remains unchanged. More details are available in our Netdev paper. Link: https://netdevconf.info/0x18/docs/netdev-0x18-paper23-talk-paper.pdf Signed-off-by: Mirco Barone Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld Link: https://patch.msgid.link/20250605120616.2808744-1-Jason@zx2c4.com Signed-off-by: Jakub Kicinski Signed-off-by: Sasha Levin

wireguard: use DEV_STATS_INC()

2023-12-03T06:33:03+00:00

[ Upstream commit 93da8d75a66568ba4bb5b14ad2833acd7304cd02 ] wg_xmit() can be called concurrently, KCSAN reported [1] some device stats updates can be lost. Use DEV_STATS_INC() for this unlikely case. [1] BUG: KCSAN: data-race in wg_xmit / wg_xmit read-write to 0xffff888104239160 of 8 bytes by task 1375 on cpu 0: wg_xmit+0x60f/0x680 drivers/net/wireguard/device.c:231 __netdev_start_xmit include/linux/netdevice.h:4918 [inline] netdev_start_xmit include/linux/netdevice.h:4932 [inline] xmit_one net/core/dev.c:3543 [inline] dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3559 ... read-write to 0xffff888104239160 of 8 bytes by task 1378 on cpu 1: wg_xmit+0x60f/0x680 drivers/net/wireguard/device.c:231 __netdev_start_xmit include/linux/netdevice.h:4918 [inline] netdev_start_xmit include/linux/netdevice.h:4932 [inline] xmit_one net/core/dev.c:3543 [inline] dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3559 ... v2: also change wg_packet_consume_data_done() (Hangbin Liu) and wg_packet_purge_staged_packets() Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Reported-by: syzbot Signed-off-by: Eric Dumazet Cc: Jason A. Donenfeld Cc: Hangbin Liu Signed-off-by: Jason A. Donenfeld Reviewed-by: Hangbin Liu Signed-off-by: David S. Miller Signed-off-by: Sasha Levin

net: move gso declarations and functions to their own files

2023-06-10T07:11:41+00:00

Move declarations into include/net/gso.h and code into net/core/gso.c Signed-off-by: Eric Dumazet Cc: Stanislav Fomichev Reviewed-by: Simon Horman Reviewed-by: David Ahern Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com Signed-off-by: Jakub Kicinski

pm/sleep: Add PM_USERSPACE_AUTOSLEEP Kconfig

2022-07-01T08:39:20+00:00

Systems that initiate frequent suspend/resume from userspace can make the kernel aware by enabling PM_USERSPACE_AUTOSLEEP config. This allows for certain sleep-sensitive code (wireguard/rng) to decide on what preparatory work should be performed (or not) in their pm_notification callbacks. This patch was prompted by the discussion at [1] which attempts to remove CONFIG_ANDROID that currently guards these code paths. [1] https://lore.kernel.org/r/20220629150102.1582425-1-hch@lst.de/ Suggested-by: Jason A. Donenfeld Acked-by: Jason A. Donenfeld Signed-off-by: Kalesh Singh Link: https://lore.kernel.org/r/20220630191230.235306-1-kaleshsingh@google.com Signed-off-by: Greg Kroah-Hartman

wireguard: device: check for metadata_dst with skb_valid_dst()

2022-04-22T22:59:05+00:00

When we try to transmit an skb with md_dst attached through wireguard we hit a null pointer dereference in wg_xmit() due to the use of dst_mtu() which calls into dst_blackhole_mtu() which in turn tries to dereference dst->dev. Since wireguard doesn't use md_dsts we should use skb_valid_dst(), which checks for DST_METADATA flag, and if it's set, then falls back to wireguard's device mtu. That gives us the best chance of transmitting the packet; otherwise if the blackhole netdev is used we'd get ETH_MIN_MTU. [ 263.693506] BUG: kernel NULL pointer dereference, address: 00000000000000e0 [ 263.693908] #PF: supervisor read access in kernel mode [ 263.694174] #PF: error_code(0x0000) - not-present page [ 263.694424] PGD 0 P4D 0 [ 263.694653] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 263.694876] CPU: 5 PID: 951 Comm: mausezahn Kdump: loaded Not tainted 5.18.0-rc1+ #522 [ 263.695190] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014 [ 263.695529] RIP: 0010:dst_blackhole_mtu+0x17/0x20 [ 263.695770] Code: 00 00 00 0f 1f 44 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 47 10 48 83 e0 fc 8b 40 04 85 c0 75 09 48 8b 07 <8b> 80 e0 00 00 00 c3 66 90 0f 1f 44 00 00 48 89 d7 be 01 00 00 00 [ 263.696339] RSP: 0018:ffffa4a4422fbb28 EFLAGS: 00010246 [ 263.696600] RAX: 0000000000000000 RBX: ffff8ac9c3553000 RCX: 0000000000000000 [ 263.696891] RDX: 0000000000000401 RSI: 00000000fffffe01 RDI: ffffc4a43fb48900 [ 263.697178] RBP: ffffa4a4422fbb90 R08: ffffffff9622635e R09: 0000000000000002 [ 263.697469] R10: ffffffff9b69a6c0 R11: ffffa4a4422fbd0c R12: ffff8ac9d18b1a00 [ 263.697766] R13: ffff8ac9d0ce1840 R14: ffff8ac9d18b1a00 R15: ffff8ac9c3553000 [ 263.698054] FS: 00007f3704c337c0(0000) GS:ffff8acaebf40000(0000) knlGS:0000000000000000 [ 263.698470] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 263.698826] CR2: 00000000000000e0 CR3: 0000000117a5c000 CR4: 00000000000006e0 [ 263.699214] Call Trace: [ 263.699505] [ 263.699759] wg_xmit+0x411/0x450 [ 263.700059] ? bpf_skb_set_tunnel_key+0x46/0x2d0 [ 263.700382] ? dev_queue_xmit_nit+0x31/0x2b0 [ 263.700719] dev_hard_start_xmit+0xd9/0x220 [ 263.701047] __dev_queue_xmit+0x8b9/0xd30 [ 263.701344] __bpf_redirect+0x1a4/0x380 [ 263.701664] __dev_queue_xmit+0x83b/0xd30 [ 263.701961] ? packet_parse_headers+0xb4/0xf0 [ 263.702275] packet_sendmsg+0x9a8/0x16a0 [ 263.702596] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 263.702933] sock_sendmsg+0x5e/0x60 [ 263.703239] __sys_sendto+0xf0/0x160 [ 263.703549] __x64_sys_sendto+0x20/0x30 [ 263.703853] do_syscall_64+0x3b/0x90 [ 263.704162] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 263.704494] RIP: 0033:0x7f3704d50506 [ 263.704789] Code: 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 c3 90 55 48 83 ec 30 44 89 4c 24 2c 4c 89 [ 263.705652] RSP: 002b:00007ffe954b0b88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 263.706141] RAX: ffffffffffffffda RBX: 0000558bb259b490 RCX: 00007f3704d50506 [ 263.706544] RDX: 000000000000004a RSI: 0000558bb259b7b2 RDI: 0000000000000003 [ 263.706952] RBP: 0000000000000000 R08: 00007ffe954b0b90 R09: 0000000000000014 [ 263.707339] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe954b0b90 [ 263.707735] R13: 000000000000004a R14: 0000558bb259b7b2 R15: 0000000000000001 [ 263.708132] [ 263.708398] Modules linked in: bridge netconsole bonding [last unloaded: bridge] [ 263.708942] CR2: 00000000000000e0 Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Link: https://github.com/cilium/cilium/issues/19428 Reported-by: Martynas Pumputis Signed-off-by: Nikolay Aleksandrov Acked-by: Daniel Borkmann Signed-off-by: Jason A. Donenfeld Signed-off-by: Jakub Kicinski

wireguard: device: clear keys on VM fork

2022-03-13T01:00:56+00:00

When a virtual machine forks, it's important that WireGuard clear existing sessions so that different plaintexts are not transmitted using the same key+nonce, which can result in catastrophic cryptographic failure. To accomplish this, we simply hook into the newly added vmfork notifier. As a bonus, it turns out that, like the vmfork registration function, the PM registration function is stubbed out when CONFIG_PM_SLEEP is not set, so we can actually just remove the maze of ifdefs, which makes it really quite clean to support both notifiers at once. Cc: Dominik Brodowski Cc: Greg Kroah-Hartman Cc: Theodore Ts'o Acked-by: Jakub Kicinski Signed-off-by: Jason A. Donenfeld

wireguard: receive: use ring buffer for incoming handshakes

2021-11-30T03:50:50+00:00

Apparently the spinlock on incoming_handshake's skb_queue is highly contended, and a torrent of handshake or cookie packets can bring the data plane to its knees, simply by virtue of enqueueing the handshake packets to be processed asynchronously. So, we try switching this to a ring buffer to hopefully have less lock contention. This alleviates the problem somewhat, though it still isn't perfect, so future patches will have to improve this further. However, it at least doesn't completely diminish the data plane. Reported-by: Streun Fabio Reported-by: Joel Wanner Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld Signed-off-by: Jakub Kicinski

wireguard: device: reset peer src endpoint when netns exits

2021-11-30T03:50:45+00:00

Each peer's endpoint contains a dst_cache entry that takes a reference to another netdev. When the containing namespace exits, we take down the socket and prevent future sockets from being created (by setting creating_net to NULL), which removes that potential reference on the netns. However, it doesn't release references to the netns that a netdev cached in dst_cache might be taking, so the netns still might fail to exit. Since the socket is gimped anyway, we can simply clear all the dst_caches (by way of clearing the endpoint src), which will release all references. However, the current dst_cache_reset function only releases those references lazily. But it turns out that all of our usages of wg_socket_clear_peer_endpoint_src are called from contexts that are not exactly high-speed or bottle-necked. For example, when there's connection difficulty, or when userspace is reconfiguring the interface. And in particular for this patch, when the netns is exiting. So for those cases, it makes more sense to call dst_release immediately. For that, we add a small helper function to dst_cache. This patch also adds a test to netns.sh from Hangbin Liu to ensure this doesn't regress. Tested-by: Hangbin Liu Reported-by: Xiumei Mu Cc: Toke Høiland-Jørgensen Cc: Paolo Abeni Fixes: 900575aa33a3 ("wireguard: device: avoid circular netns references") Signed-off-by: Jason A. Donenfeld Signed-off-by: Jakub Kicinski

wireguard: queueing: get rid of per-peer ring buffers

2021-02-23T23:59:34+00:00

Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is an 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory. In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm. The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1]. Performance-wise, ordinarily list-based queues aren't preferable to ringbuffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all and all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_ dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check if it was full by head==tail, which the list-based approach cannot do. But all and all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit. [1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue Reviewed-by: Dmitry Vyukov Reviewed-by: Toke Høiland-Jørgensen Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld Signed-off-by: Jakub Kicinski