<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/net/wireguard/device.c, branch v6.6.133</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.133</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.133'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-02-19T15:28:26+00:00</updated>
<entry>
<title>Revert "wireguard: device: enable threaded NAPI"</title>
<updated>2026-02-19T15:28:26+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2026-02-16T21:31:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=168ee1549fa24fff2ab924a2ca7b1d85f30e062d'/>
<id>urn:sha1:168ee1549fa24fff2ab924a2ca7b1d85f30e062d</id>
<content type='text'>
This reverts commit 8c9e9cd398777fd60ba202211da1110614cb5bc5 which is
commit db9ae3b6b43c79b1ba87eea849fd65efa05b4b2e upstream.

We have had three independent production user reports in combination
with Cilium utilizing WireGuard as encryption underneath that k8s Pod
E/W traffic to certain peer nodes fully stalled. The situation appears
as follows:

  - Occurs very rarely but at random times under heavy networking load.
  - Once the issue triggers the decryption side stops working completely
    for that WireGuard peer, other peers keep working fine. The stall
    happens also for newly initiated connections towards that particular
    WireGuard peer.
  - Only the decryption side is affected, never the encryption side.
  - Once it triggers, it never recovers and remains in this state,
    the CPU/mem on that node looks normal, no leak, busy loop or crash.
  - bpftrace on the affected system shows that wg_prev_queue_enqueue
    fails, thus the MAX_QUEUED_PACKETS (1024 skbs!) for the peer's
    rx_queue is reached.
  - Also, bpftrace shows that wg_packet_rx_poll for that peer is never
    called again after reaching this state for that peer. For other
    peers wg_packet_rx_poll does get called normally.
  - Commit db9ae3b ("wireguard: device: enable threaded NAPI")
    switched WireGuard to threaded NAPI by default. The default has
    not been changed for triggering the issue, neither did CPU
    hotplugging occur (i.e. 5bd8de2 ("wireguard: queueing: always
    return valid online CPU in wg_cpumask_choose_online()")).
  - The issue has been observed with stable kernels of v5.15 as well as
    v6.1. It was reported to us that v5.10 stable is working fine, and
    no report on v6.6 stable either (somewhat related discussion in [0]
    though).
  - In the WireGuard driver the only material difference between v5.10
    stable and v5.15 stable is the switch to threaded NAPI by default.

    [0] https://lore.kernel.org/netdev/CA+wXwBTT74RErDGAnj98PqS=wvdh8eM1pi4q6tTdExtjnokKqA@mail.gmail.com/

Breakdown of the problem:

  1) skbs arriving for decryption are enqueued to the peer-&gt;rx_queue in
     wg_packet_consume_data via wg_queue_enqueue_per_device_and_peer.
  2) The latter only moves the skb into the MPSC peer queue if it does
     not surpass MAX_QUEUED_PACKETS (1024) which is kept track in an
     atomic counter via wg_prev_queue_enqueue.
  3) In case enqueueing was successful, the skb is also queued up
     in the device queue, round-robin picks a next online CPU, and
     schedules the decryption worker.
  4) The wg_packet_decrypt_worker, once scheduled, picks these up
     from the queue, decrypts the packets and once done calls into
     wg_queue_enqueue_per_peer_rx.
  5) The latter updates the state to PACKET_STATE_CRYPTED on success
     and calls napi_schedule on the per peer-&gt;napi instance.
  6) NAPI then polls via wg_packet_rx_poll. wg_prev_queue_peek checks
     on the peer-&gt;rx_queue. It will wg_prev_queue_dequeue if the
     queue-&gt;peeked skb was not cached yet, or just return the latter
     otherwise. (wg_prev_queue_drop_peeked later clears the cache.)
  7) From an ordering perspective, the peer-&gt;rx_queue has skbs in order
     while the device queue with the per-CPU worker threads from a
     global ordering PoV can finish the decryption and signal the skb
     PACKET_STATE_CRYPTED out of order.
  8) A situation can be observed that the first packet coming in will
     be stuck waiting for the decryption worker to be scheduled for
     a longer time when the system is under pressure.
  9) While this is the case, the other CPUs in the meantime finish
     decryption and call into napi_schedule.
 10) Now in wg_packet_rx_poll it picks up the first in-order skb
     from the peer-&gt;rx_queue and sees that its state is still
     PACKET_STATE_UNCRYPTED. The NAPI poll routine then exits early
     with work_done = 0 and calls napi_complete_done, signalling
     it "finished" processing.
 11) The assumption in wg_packet_decrypt_worker is that when the
     decryption finished the subsequent napi_schedule will always
     lead to a later invocation of wg_packet_rx_poll to pick up
     the finished packet.
 12) However, it appears that a later napi_schedule does /not/
     schedule a later poll and thus no wg_packet_rx_poll.
 13) If this situation happens exactly for the corner case where
     the decryption worker of the first packet is stuck and waiting
     to be scheduled, and the network load for WireGuard is very
     high then the queue can build up to MAX_QUEUED_PACKETS.
 14) If this situation occurs, then no new decryption worker will
     be scheduled and also no new napi_schedule to make forward
     progress.
 15) This means the peer-&gt;rx_queue stops processing packets completely
     and they are indefinitely stuck waiting for a new NAPI poll on
     that peer which never happens. New packets for that peer are
     then dropped due to full queue, as it has been observed on the
     production machines.

Technically, the backport of commit db9ae3b6b43c ("wireguard: device:
enable threaded NAPI") to stable should not have happened since it is
more of an optimization rather than a pure fix and addresses a NAPI
situation with utilizing many WireGuard tunnel devices in parallel.
Revert it from stable given the backport triggers a regression for
mentioned kernels.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>wireguard: device: enable threaded NAPI</title>
<updated>2025-06-19T13:28:35+00:00</updated>
<author>
<name>Mirco Barone</name>
<email>mirco.barone@polito.it</email>
</author>
<published>2025-06-05T12:06:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8c9e9cd398777fd60ba202211da1110614cb5bc5'/>
<id>urn:sha1:8c9e9cd398777fd60ba202211da1110614cb5bc5</id>
<content type='text'>
[ Upstream commit db9ae3b6b43c79b1ba87eea849fd65efa05b4b2e ]

Enable threaded NAPI by default for WireGuard devices in response to low
performance behavior that we observed when multiple tunnels (and thus
multiple wg devices) are deployed on a single host.  This affects any
kind of multi-tunnel deployment, regardless of whether the tunnels share
the same endpoints or not (i.e., a VPN concentrator type of gateway
would also be affected).

The problem is caused by the fact that, in case of a traffic surge that
involves multiple tunnels at the same time, the polling of the NAPI
instance of all these wg devices tends to converge onto the same core,
causing underutilization of the CPU and bottlenecking performance.

This happens because NAPI polling is hosted by default in softirq
context, but the WireGuard driver only raises this softirq after the rx
peer queue has been drained, which doesn't happen during high traffic.
In this case, the softirq already active on a core is reused instead of
raising a new one.

As a result, once two or more tunnel softirqs have been scheduled on
the same core, they remain pinned there until the surge ends.

In our experiments, this almost always leads to all tunnel NAPIs being
handled on a single core shortly after a surge begins, limiting
scalability to less than 3× the performance of a single tunnel, despite
plenty of unused CPU cores being available.

The proposed mitigation is to enable threaded NAPI for all WireGuard
devices. This moves the NAPI polling context to a dedicated per-device
kernel thread, allowing the scheduler to balance the load across all
available cores.

On our 32-core gateways, enabling threaded NAPI yields a ~4× performance
improvement with 16 tunnels, increasing throughput from ~13 Gbps to
~48 Gbps. Meanwhile, CPU usage on the receiver (which is the bottleneck)
jumps from 20% to 100%.

We have found no performance regressions in any scenario we tested.
Single-tunnel throughput remains unchanged.

More details are available in our Netdev paper.

Link: https://netdevconf.info/0x18/docs/netdev-0x18-paper23-talk-paper.pdf
Signed-off-by: Mirco Barone &lt;mirco.barone@polito.it&gt;
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Link: https://patch.msgid.link/20250605120616.2808744-1-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>wireguard: use DEV_STATS_INC()</title>
<updated>2023-12-03T06:33:03+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-11-17T14:17:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b605d23cb162bf454e621e234c127ee98bb60c8d'/>
<id>urn:sha1:b605d23cb162bf454e621e234c127ee98bb60c8d</id>
<content type='text'>
[ Upstream commit 93da8d75a66568ba4bb5b14ad2833acd7304cd02 ]

wg_xmit() can be called concurrently, KCSAN reported [1]
some device stats updates can be lost.

Use DEV_STATS_INC() for this unlikely case.

[1]
BUG: KCSAN: data-race in wg_xmit / wg_xmit

read-write to 0xffff888104239160 of 8 bytes by task 1375 on cpu 0:
wg_xmit+0x60f/0x680 drivers/net/wireguard/device.c:231
__netdev_start_xmit include/linux/netdevice.h:4918 [inline]
netdev_start_xmit include/linux/netdevice.h:4932 [inline]
xmit_one net/core/dev.c:3543 [inline]
dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3559
...

read-write to 0xffff888104239160 of 8 bytes by task 1378 on cpu 1:
wg_xmit+0x60f/0x680 drivers/net/wireguard/device.c:231
__netdev_start_xmit include/linux/netdevice.h:4918 [inline]
netdev_start_xmit include/linux/netdevice.h:4932 [inline]
xmit_one net/core/dev.c:3543 [inline]
dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3559
...

v2: also change wg_packet_consume_data_done() (Hangbin Liu)
    and wg_packet_purge_staged_packets()

Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Reported-by: syzbot &lt;syzkaller@googlegroups.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Cc: Hangbin Liu &lt;liuhangbin@gmail.com&gt;
Signed-off-by: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Reviewed-by: Hangbin Liu &lt;liuhangbin@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: move gso declarations and functions to their own files</title>
<updated>2023-06-10T07:11:41+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-06-08T19:17:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08'/>
<id>urn:sha1:d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08</id>
<content type='text'>
Move declarations into include/net/gso.h and code into net/core/gso.c

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Stanislav Fomichev &lt;sdf@google.com&gt;
Reviewed-by: Simon Horman &lt;simon.horman@corigine.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>pm/sleep: Add PM_USERSPACE_AUTOSLEEP Kconfig</title>
<updated>2022-07-01T08:39:20+00:00</updated>
<author>
<name>Kalesh Singh</name>
<email>kaleshsingh@google.com</email>
</author>
<published>2022-06-30T19:12:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=261e224d6a5c43e2bb8a07b7662f9b4ec425cfec'/>
<id>urn:sha1:261e224d6a5c43e2bb8a07b7662f9b4ec425cfec</id>
<content type='text'>
Systems that initiate frequent suspend/resume from userspace
can make the kernel aware by enabling PM_USERSPACE_AUTOSLEEP
config.

This allows for certain sleep-sensitive code (wireguard/rng) to
decide on what preparatory work should be performed (or not) in
their pm_notification callbacks.

This patch was prompted by the discussion at [1] which attempts
to remove CONFIG_ANDROID that currently guards these code paths.

[1] https://lore.kernel.org/r/20220629150102.1582425-1-hch@lst.de/

Suggested-by: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Acked-by: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Signed-off-by: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Link: https://lore.kernel.org/r/20220630191230.235306-1-kaleshsingh@google.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>wireguard: device: check for metadata_dst with skb_valid_dst()</title>
<updated>2022-04-22T22:59:05+00:00</updated>
<author>
<name>Nikolay Aleksandrov</name>
<email>razor@blackwall.org</email>
</author>
<published>2022-04-21T13:48:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=45ac774c33d834fe9d4de06ab5f1022fe8cd2071'/>
<id>urn:sha1:45ac774c33d834fe9d4de06ab5f1022fe8cd2071</id>
<content type='text'>
When we try to transmit an skb with md_dst attached through wireguard
we hit a null pointer dereference in wg_xmit() due to the use of
dst_mtu() which calls into dst_blackhole_mtu() which in turn tries to
dereference dst-&gt;dev.

Since wireguard doesn't use md_dsts we should use skb_valid_dst(), which
checks for DST_METADATA flag, and if it's set, then falls back to
wireguard's device mtu. That gives us the best chance of transmitting
the packet; otherwise if the blackhole netdev is used we'd get
ETH_MIN_MTU.

 [  263.693506] BUG: kernel NULL pointer dereference, address: 00000000000000e0
 [  263.693908] #PF: supervisor read access in kernel mode
 [  263.694174] #PF: error_code(0x0000) - not-present page
 [  263.694424] PGD 0 P4D 0
 [  263.694653] Oops: 0000 [#1] PREEMPT SMP NOPTI
 [  263.694876] CPU: 5 PID: 951 Comm: mausezahn Kdump: loaded Not tainted 5.18.0-rc1+ #522
 [  263.695190] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
 [  263.695529] RIP: 0010:dst_blackhole_mtu+0x17/0x20
 [  263.695770] Code: 00 00 00 0f 1f 44 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 47 10 48 83 e0 fc 8b 40 04 85 c0 75 09 48 8b 07 &lt;8b&gt; 80 e0 00 00 00 c3 66 90 0f 1f 44 00 00 48 89 d7 be 01 00 00 00
 [  263.696339] RSP: 0018:ffffa4a4422fbb28 EFLAGS: 00010246
 [  263.696600] RAX: 0000000000000000 RBX: ffff8ac9c3553000 RCX: 0000000000000000
 [  263.696891] RDX: 0000000000000401 RSI: 00000000fffffe01 RDI: ffffc4a43fb48900
 [  263.697178] RBP: ffffa4a4422fbb90 R08: ffffffff9622635e R09: 0000000000000002
 [  263.697469] R10: ffffffff9b69a6c0 R11: ffffa4a4422fbd0c R12: ffff8ac9d18b1a00
 [  263.697766] R13: ffff8ac9d0ce1840 R14: ffff8ac9d18b1a00 R15: ffff8ac9c3553000
 [  263.698054] FS:  00007f3704c337c0(0000) GS:ffff8acaebf40000(0000) knlGS:0000000000000000
 [  263.698470] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [  263.698826] CR2: 00000000000000e0 CR3: 0000000117a5c000 CR4: 00000000000006e0
 [  263.699214] Call Trace:
 [  263.699505]  &lt;TASK&gt;
 [  263.699759]  wg_xmit+0x411/0x450
 [  263.700059]  ? bpf_skb_set_tunnel_key+0x46/0x2d0
 [   263.700382]  ? dev_queue_xmit_nit+0x31/0x2b0
 [  263.700719]  dev_hard_start_xmit+0xd9/0x220
 [  263.701047]  __dev_queue_xmit+0x8b9/0xd30
 [  263.701344]  __bpf_redirect+0x1a4/0x380
 [  263.701664]  __dev_queue_xmit+0x83b/0xd30
 [  263.701961]  ? packet_parse_headers+0xb4/0xf0
 [  263.702275]  packet_sendmsg+0x9a8/0x16a0
 [  263.702596]  ? _raw_spin_unlock_irqrestore+0x23/0x40
 [  263.702933]  sock_sendmsg+0x5e/0x60
 [  263.703239]  __sys_sendto+0xf0/0x160
 [  263.703549]  __x64_sys_sendto+0x20/0x30
 [  263.703853]  do_syscall_64+0x3b/0x90
 [  263.704162]  entry_SYSCALL_64_after_hwframe+0x44/0xae
 [  263.704494] RIP: 0033:0x7f3704d50506
 [  263.704789] Code: 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 &lt;48&gt; 3d 00 f0 ff ff 77 72 c3 90 55 48 83 ec 30 44 89 4c 24 2c 4c 89
 [  263.705652] RSP: 002b:00007ffe954b0b88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
 [  263.706141] RAX: ffffffffffffffda RBX: 0000558bb259b490 RCX: 00007f3704d50506
 [  263.706544] RDX: 000000000000004a RSI: 0000558bb259b7b2 RDI: 0000000000000003
 [  263.706952] RBP: 0000000000000000 R08: 00007ffe954b0b90 R09: 0000000000000014
 [  263.707339] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe954b0b90
 [  263.707735] R13: 000000000000004a R14: 0000558bb259b7b2 R15: 0000000000000001
 [  263.708132]  &lt;/TASK&gt;
 [  263.708398] Modules linked in: bridge netconsole bonding [last unloaded: bridge]
 [  263.708942] CR2: 00000000000000e0

Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Link: https://github.com/cilium/cilium/issues/19428
Reported-by: Martynas Pumputis &lt;m@lambda.lt&gt;
Signed-off-by: Nikolay Aleksandrov &lt;razor@blackwall.org&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>wireguard: device: clear keys on VM fork</title>
<updated>2022-03-13T01:00:56+00:00</updated>
<author>
<name>Jason A. Donenfeld</name>
<email>Jason@zx2c4.com</email>
</author>
<published>2022-03-01T22:26:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2d6919c3205b141ba85fb733b2a67937ff85dc7f'/>
<id>urn:sha1:2d6919c3205b141ba85fb733b2a67937ff85dc7f</id>
<content type='text'>
When a virtual machine forks, it's important that WireGuard clear
existing sessions so that different plaintexts are not transmitted using
the same key+nonce, which can result in catastrophic cryptographic
failure. To accomplish this, we simply hook into the newly added vmfork
notifier.

As a bonus, it turns out that, like the vmfork registration function,
the PM registration function is stubbed out when CONFIG_PM_SLEEP is not
set, so we can actually just remove the maze of ifdefs, which makes it
really quite clean to support both notifiers at once.

Cc: Dominik Brodowski &lt;linux@dominikbrodowski.net&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Theodore Ts'o &lt;tytso@mit.edu&gt;
Acked-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
</content>
</entry>
<entry>
<title>wireguard: receive: use ring buffer for incoming handshakes</title>
<updated>2021-11-30T03:50:50+00:00</updated>
<author>
<name>Jason A. Donenfeld</name>
<email>Jason@zx2c4.com</email>
</author>
<published>2021-11-29T15:39:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=886fcee939adb5e2af92741b90643a59f2b54f97'/>
<id>urn:sha1:886fcee939adb5e2af92741b90643a59f2b54f97</id>
<content type='text'>
Apparently the spinlock on incoming_handshake's skb_queue is highly
contended, and a torrent of handshake or cookie packets can bring the
data plane to its knees, simply by virtue of enqueueing the handshake
packets to be processed asynchronously. So, we try switching this to a
ring buffer to hopefully have less lock contention. This alleviates the
problem somewhat, though it still isn't perfect, so future patches will
have to improve this further. However, it at least doesn't completely
diminish the data plane.

Reported-by: Streun Fabio &lt;fstreun@student.ethz.ch&gt;
Reported-by: Joel Wanner &lt;joel.wanner@inf.ethz.ch&gt;
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>wireguard: device: reset peer src endpoint when netns exits</title>
<updated>2021-11-30T03:50:45+00:00</updated>
<author>
<name>Jason A. Donenfeld</name>
<email>Jason@zx2c4.com</email>
</author>
<published>2021-11-29T15:39:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=20ae1d6aa159eb91a9bf09ff92ccaa94dbea92c2'/>
<id>urn:sha1:20ae1d6aa159eb91a9bf09ff92ccaa94dbea92c2</id>
<content type='text'>
Each peer's endpoint contains a dst_cache entry that takes a reference
to another netdev. When the containing namespace exits, we take down the
socket and prevent future sockets from being created (by setting
creating_net to NULL), which removes that potential reference on the
netns. However, it doesn't release references to the netns that a netdev
cached in dst_cache might be taking, so the netns still might fail to
exit. Since the socket is gimped anyway, we can simply clear all the
dst_caches (by way of clearing the endpoint src), which will release all
references.

However, the current dst_cache_reset function only releases those
references lazily. But it turns out that all of our usages of
wg_socket_clear_peer_endpoint_src are called from contexts that are not
exactly high-speed or bottle-necked. For example, when there's
connection difficulty, or when userspace is reconfiguring the interface.
And in particular for this patch, when the netns is exiting. So for
those cases, it makes more sense to call dst_release immediately. For
that, we add a small helper function to dst_cache.

This patch also adds a test to netns.sh from Hangbin Liu to ensure this
doesn't regress.

Tested-by: Hangbin Liu &lt;liuhangbin@gmail.com&gt;
Reported-by: Xiumei Mu &lt;xmu@redhat.com&gt;
Cc: Toke Høiland-Jørgensen &lt;toke@redhat.com&gt;
Cc: Paolo Abeni &lt;pabeni@redhat.com&gt;
Fixes: 900575aa33a3 ("wireguard: device: avoid circular netns references")
Signed-off-by: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>wireguard: queueing: get rid of per-peer ring buffers</title>
<updated>2021-02-23T23:59:34+00:00</updated>
<author>
<name>Jason A. Donenfeld</name>
<email>Jason@zx2c4.com</email>
</author>
<published>2021-02-22T16:25:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8b5553ace83cced775eefd0f3f18b5c6214ccf7a'/>
<id>urn:sha1:8b5553ace83cced775eefd0f3f18b5c6214ccf7a</id>
<content type='text'>
Having two ring buffers per-peer means that every peer results in two
massive ring allocations. On an 8-core x86_64 machine, this commit
reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which
is an 90% reduction. Ninety percent! With some single-machine
deployments approaching 500,000 peers, we're talking about a reduction
from 7 gigs of memory down to 700 megs of memory.

In order to get rid of these per-peer allocations, this commit switches
to using a list-based queueing approach. Currently GSO fragments are
chained together using the skb-&gt;next pointer (the skb_list_* singly
linked list approach), so we form the per-peer queue around the unused
skb-&gt;prev pointer (which sort of makes sense because the links are
pointing backwards). Use of skb_queue_* is not possible here, because
that is based on doubly linked lists and spinlocks. Multiple cores can
write into the queue at any given time, because its writes occur in the
start_xmit path or in the udp_recv path. But reads happen in a single
workqueue item per-peer, amounting to a multi-producer, single-consumer
paradigm.

The MPSC queue is implemented locklessly and never blocks. However, it
is not linearizable (though it is serializable), with a very tight and
unlikely race on writes, which, when hit (some tiny fraction of the
0.15% of partial adds on a fully loaded 16-core x86_64 system), causes
the queue reader to terminate early. However, because every packet sent
queues up the same workqueue item after it is fully added, the worker
resumes again, and stopping early isn't actually a problem, since at
that point the packet wouldn't have yet been added to the encryption
queue. These properties allow us to avoid disabling interrupts or
spinning. The design is based on Dmitry Vyukov's algorithm [1].

Performance-wise, ordinarily list-based queues aren't preferable to
ringbuffers, because of cache misses when following pointers around.
However, we *already* have to follow the adjacent pointers when working
through fragments, so there shouldn't actually be any change there. A
potential downside is that dequeueing is a bit more complicated, but the
ptr_ring structure used prior had a spinlock when dequeueing, so all and
all the difference appears to be a wash.

Actually, from profiling, the biggest performance hit, by far, of this
commit winds up being atomic_add_unless(count, 1, max) and atomic_
dec(count), which account for the majority of CPU time, according to
perf. In that sense, the previous ring buffer was superior in that it
could check if it was full by head==tail, which the list-based approach
cannot do.

But all and all, this enables us to get massive memory savings, allowing
WireGuard to scale for real world deployments, without taking much of a
performance hit.

[1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue

Reviewed-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Reviewed-by: Toke Høiland-Jørgensen &lt;toke@redhat.com&gt;
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
</feed>
