<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/net/xsk_buff_pool.h, branch v6.1.174</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.174</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.174'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-02-19T15:25:19+00:00</updated>
<entry>
<title>xsk: Fix race condition in AF_XDP generic RX path</title>
<updated>2026-02-19T15:25:19+00:00</updated>
<author>
<name>e.kubanski</name>
<email>e.kubanski@partner.samsung.com</email>
</author>
<published>2026-02-10T09:12:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=975b372313dc018b9bd6cc0d85d188787054b19e'/>
<id>urn:sha1:975b372313dc018b9bd6cc0d85d188787054b19e</id>
<content type='text'>
[ Upstream commit a1356ac7749cafc4e27aa62c0c4604b5dca4983e ]

Move rx_lock from xsk_socket to xsk_buff_pool.
Fix synchronization for shared umem mode in
generic RX path where multiple sockets share
single xsk_buff_pool.

RX queue is exclusive to xsk_socket, while FILL
queue can be shared between multiple sockets.
This could result in race condition where two
CPU cores access RX path of two different sockets
sharing the same umem.

Protect both queues by acquiring spinlock in shared
xsk_buff_pool.

Lock contention may be minimized in the future by some
per-thread FQ buffering.

It's safe and necessary to move spin_lock_bh(rx_lock)
after xsk_rcv_check():
* xs-&gt;pool and spinlock_init is synchronized by
  xsk_bind() -&gt; xsk_is_bound() memory barriers.
* xsk_rcv_check() may return true at the moment
  of xsk_release() or xsk_unbind_dev(),
  however this will not cause any data races or
  race conditions. xsk_unbind_dev() removes xdp
  socket from all maps and waits for completion
  of all outstanding rx operations. Packets in
  RX path will either complete safely or drop.

Signed-off-by: Eryk Kubanski &lt;e.kubanski@partner.samsung.com&gt;
Fixes: bf0bdd1343efb ("xdp: fix race on generic receive path")
Acked-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Link: https://patch.msgid.link/20250416101908.10919-1-e.kubanski@partner.samsung.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
[ Conflict is resolved when backporting this fix. ]
Signed-off-by: Jianqiang kang &lt;jianqkang@sina.cn&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>xsk: Fix unaligned descriptor validation</title>
<updated>2023-05-11T14:03:21+00:00</updated>
<author>
<name>Kal Conley</name>
<email>kal.conley@dectris.com</email>
</author>
<published>2023-04-05T23:59:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3ee343914c1f9300b18685e4ee697458e948ac1d'/>
<id>urn:sha1:3ee343914c1f9300b18685e4ee697458e948ac1d</id>
<content type='text'>
[ Upstream commit d769ccaf957fe7391f357c0a923de71f594b8a2b ]

Make sure unaligned descriptors that straddle the end of the UMEM are
considered invalid. Currently, descriptor validation is broken for
zero-copy mode which only checks descriptors at page granularity.
For example, descriptors in zero-copy mode that overrun the end of the
UMEM but not a page boundary are (incorrectly) considered valid. The
UMEM boundary check needs to happen before the page boundary and
contiguity checks in xp_desc_crosses_non_contig_pg(). Do this check in
xp_unaligned_validate_desc() instead like xp_check_unaligned() already
does.

Fixes: 2b43470add8c ("xsk: Introduce AF_XDP buffer allocation API")
Signed-off-by: Kal Conley &lt;kal.conley@dectris.com&gt;
Acked-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Link: https://lore.kernel.org/r/20230405235920.7305-2-kal.conley@dectris.com
Signed-off-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: Inherit need_wakeup flag for shared sockets</title>
<updated>2022-09-22T15:16:22+00:00</updated>
<author>
<name>Jalal Mostafa</name>
<email>jalal.a.mostapha@gmail.com</email>
</author>
<published>2022-09-21T13:57:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=60240bc26114543fcbfcd8a28466e67e77b20388'/>
<id>urn:sha1:60240bc26114543fcbfcd8a28466e67e77b20388</id>
<content type='text'>
The flag for need_wakeup is not set for xsks with `XDP_SHARED_UMEM`
flag and of different queue ids and/or devices. They should inherit
the flag from the first socket buffer pool since no flags can be
specified once `XDP_SHARED_UMEM` is specified.

Fixes: b5aea28dca134 ("xsk: Add shared umem support between queue ids")
Signed-off-by: Jalal Mostafa &lt;jalal.a.mostapha@gmail.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Link: https://lore.kernel.org/bpf/20220921135701.10199-1-jalal.a.mostapha@gmail.com
</content>
</entry>
<entry>
<title>xsk: Fix possible crash when multiple sockets are created</title>
<updated>2022-04-26T14:19:54+00:00</updated>
<author>
<name>Maciej Fijalkowski</name>
<email>maciej.fijalkowski@intel.com</email>
</author>
<published>2022-04-25T15:37:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ba3beec2ec1d3b4fd8672ca6e781dac4b3267f6e'/>
<id>urn:sha1:ba3beec2ec1d3b4fd8672ca6e781dac4b3267f6e</id>
<content type='text'>
Fix a crash that happens if an Rx only socket is created first, then a
second socket is created that is Tx only and bound to the same umem as
the first socket and also the same netdev and queue_id together with the
XDP_SHARED_UMEM flag. In this specific case, the tx_descs array page
pool was not created by the first socket as it was an Rx only socket.
When the second socket is bound it needs this tx_descs array of this
shared page pool as it has a Tx component, but unfortunately it was
never allocated, leading to a crash. Note that this array is only used
for zero-copy drivers using the batched Tx APIs, currently only ice and
i40e.

[ 5511.150360] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 5511.158419] #PF: supervisor write access in kernel mode
[ 5511.164472] #PF: error_code(0x0002) - not-present page
[ 5511.170416] PGD 0 P4D 0
[ 5511.173347] Oops: 0002 [#1] PREEMPT SMP PTI
[ 5511.178186] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E     5.18.0-rc1+ #97
[ 5511.187245] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
[ 5511.198418] RIP: 0010:xsk_tx_peek_release_desc_batch+0x198/0x310
[ 5511.205375] Code: c0 83 c6 01 84 c2 74 6d 8d 46 ff 23 07 44 89 e1 48 83 c0 14 48 c1 e1 04 48 c1 e0 04 48 03 47 10 4c 01 c1 48 8b 50 08 48 8b 00 &lt;48&gt; 89 51 08 48 89 01 41 80 bd d7 00 00 00 00 75 82 48 8b 19 49 8b
[ 5511.227091] RSP: 0018:ffffc90000003dd0 EFLAGS: 00010246
[ 5511.233135] RAX: 0000000000000000 RBX: ffff88810c8da600 RCX: 0000000000000000
[ 5511.241384] RDX: 000000000000003c RSI: 0000000000000001 RDI: ffff888115f555c0
[ 5511.249634] RBP: ffffc90000003e08 R08: 0000000000000000 R09: ffff889092296b48
[ 5511.257886] R10: 0000ffffffffffff R11: ffff889092296800 R12: 0000000000000000
[ 5511.266138] R13: ffff88810c8db500 R14: 0000000000000040 R15: 0000000000000100
[ 5511.274387] FS:  0000000000000000(0000) GS:ffff88903f800000(0000) knlGS:0000000000000000
[ 5511.283746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5511.290389] CR2: 0000000000000008 CR3: 00000001046e2001 CR4: 00000000003706f0
[ 5511.298640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5511.306892] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5511.315142] Call Trace:
[ 5511.317972]  &lt;IRQ&gt;
[ 5511.320301]  ice_xmit_zc+0x68/0x2f0 [ice]
[ 5511.324977]  ? ktime_get+0x38/0xa0
[ 5511.328913]  ice_napi_poll+0x7a/0x6a0 [ice]
[ 5511.333784]  __napi_poll+0x2c/0x160
[ 5511.337821]  net_rx_action+0xdd/0x200
[ 5511.342058]  __do_softirq+0xe6/0x2dd
[ 5511.346198]  irq_exit_rcu+0xb5/0x100
[ 5511.350339]  common_interrupt+0xa4/0xc0
[ 5511.354777]  &lt;/IRQ&gt;
[ 5511.357201]  &lt;TASK&gt;
[ 5511.359625]  asm_common_interrupt+0x1e/0x40
[ 5511.364466] RIP: 0010:cpuidle_enter_state+0xd2/0x360
[ 5511.370211] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 e9 00 7b ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 72 02 00 00 31 ff e8 02 0c 80 ff fb 45 85 f6 &lt;0f&gt; 88 11 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
[ 5511.391921] RSP: 0018:ffffffff82a03e60 EFLAGS: 00000202
[ 5511.397962] RAX: ffff88903f800000 RBX: 0000000000000001 RCX: 000000000000001f
[ 5511.406214] RDX: 0000000000000000 RSI: ffffffff823400b9 RDI: ffffffff8234c046
[ 5511.424646] RBP: ffff88810a384800 R08: 000005032a28c046 R09: 0000000000000008
[ 5511.443233] R10: 000000000000000b R11: 0000000000000006 R12: ffffffff82bcf700
[ 5511.461922] R13: 000005032a28c046 R14: 0000000000000001 R15: 0000000000000000
[ 5511.480300]  cpuidle_enter+0x29/0x40
[ 5511.494329]  do_idle+0x1c7/0x250
[ 5511.507610]  cpu_startup_entry+0x19/0x20
[ 5511.521394]  start_kernel+0x649/0x66e
[ 5511.534626]  secondary_startup_64_no_verify+0xc3/0xcb
[ 5511.549230]  &lt;/TASK&gt;

Detect such case during bind() and allocate this memory region via newly
introduced xp_alloc_tx_descs(). Also, use kvcalloc instead of kcalloc as
for other buffer pool allocations, so that it matches the kvfree() from
xp_destroy().

Fixes: d1bc532e99be ("i40e: xsk: Move tmp desc array from driver to pool")
Signed-off-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Link: https://lore.kernel.org/bpf/20220425153745.481322-1-maciej.fijalkowski@intel.com
</content>
</entry>
<entry>
<title>i40e: xsk: Move tmp desc array from driver to pool</title>
<updated>2022-01-27T16:25:32+00:00</updated>
<author>
<name>Magnus Karlsson</name>
<email>magnus.karlsson@intel.com</email>
</author>
<published>2022-01-25T16:04:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d1bc532e99becf104635ed4da6fefa306f452321'/>
<id>urn:sha1:d1bc532e99becf104635ed4da6fefa306f452321</id>
<content type='text'>
Move desc_array from the driver to the pool. The reason behind this is
that we can then reuse this array as a temporary storage for descriptors
in all zero-copy drivers that use the batched interface. This will make
it easier to add batching to more drivers.

i40e is the only driver that has a batched Tx zero-copy
implementation, so no need to touch any other driver.

Signed-off-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Reviewed-by: Alexander Lobakin &lt;alexandr.lobakin@intel.com&gt;
Link: https://lore.kernel.org/bpf/20220125160446.78976-6-maciej.fijalkowski@intel.com
</content>
</entry>
<entry>
<title>xsk: Optimize for aligned case</title>
<updated>2021-09-27T22:18:35+00:00</updated>
<author>
<name>Magnus Karlsson</name>
<email>magnus.karlsson@intel.com</email>
</author>
<published>2021-09-22T07:56:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=94033cd8e73b8632bab7c8b7bb54caa4f5616db7'/>
<id>urn:sha1:94033cd8e73b8632bab7c8b7bb54caa4f5616db7</id>
<content type='text'>
Optimize for the aligned case by precomputing the parameter values of
the xdp_buff_xsk and xdp_buff structures in the heads array. We can do
this as the heads array size is equal to the number of chunks in the
umem for the aligned case. Then every entry in this array will reflect
a certain chunk/frame and can therefore be prepopulated with the
correct values and we can drop the use of the free_heads stack. Note
that it is not possible to allocate more buffers than what has been
allocated in the aligned case since each chunk can only contain a
single buffer.

We can unfortunately not do this in the unaligned case as one chunk
might contain multiple buffers. In this case, we keep the old scheme
of populating a heads entry every time it is used and using
the free_heads stack.

Also move xp_release() and xp_get_handle() to xsk_buff_pool.h. They
were for some reason in xsk.c even though they are buffer pool
operations.

Signed-off-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Link: https://lore.kernel.org/bpf/20210922075613.12186-7-magnus.karlsson@gmail.com
</content>
</entry>
<entry>
<title>xsk: Batched buffer allocation for the pool</title>
<updated>2021-09-27T22:18:34+00:00</updated>
<author>
<name>Magnus Karlsson</name>
<email>magnus.karlsson@intel.com</email>
</author>
<published>2021-09-22T07:56:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=47e4075df300050a920b99299c4db3dad9adaba9'/>
<id>urn:sha1:47e4075df300050a920b99299c4db3dad9adaba9</id>
<content type='text'>
Add a new driver interface xsk_buff_alloc_batch() offering batched
buffer allocations to improve performance. The new interface takes
three arguments: the buffer pool to allocated from, a pointer to an
array of struct xdp_buff pointers which will contain pointers to the
allocated xdp_buffs, and an unsigned integer specifying the max number
of buffers to allocate. The return value is the actual number of
buffers that the allocator managed to allocate and it will be in the
range 0 &lt;= N &lt;= max, where max is the third parameter to the function.

u32 xsk_buff_alloc_batch(struct xsk_buff_pool *pool, struct xdp_buff **xdp,
                         u32 max);

A second driver interface is also introduced that need to be used in
conjunction with xsk_buff_alloc_batch(). It is a helper that sets the
size of struct xdp_buff and is used by the NIC Rx irq routine when
receiving a packet. This helper sets the three struct members data,
data_meta, and data_end. The two first ones is in the xsk_buff_alloc()
case set in the allocation routine and data_end is set when a packet
is received in the receive irq function. This unfortunately leads to
worse performance since the xdp_buff is touched twice with a long time
period in between leading to an extra cache miss. Instead, we fill out
the xdp_buff with all 3 fields at one single point in time in the
driver, when the size of the packet is known. Hence this helper. Note
that the driver has to use this helper (or set all three fields
itself) when using xsk_buff_alloc_batch(). xsk_buff_alloc() works as
before and does not require this.

void xsk_buff_set_size(struct xdp_buff *xdp, u32 size);

Signed-off-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Link: https://lore.kernel.org/bpf/20210922075613.12186-3-magnus.karlsson@gmail.com
</content>
</entry>
<entry>
<title>xsk: Get rid of unused entry in struct xdp_buff_xsk</title>
<updated>2021-09-27T22:18:34+00:00</updated>
<author>
<name>Magnus Karlsson</name>
<email>magnus.karlsson@intel.com</email>
</author>
<published>2021-09-22T07:56:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=10a5e009b93a812956e232cee8804ed99f5b93bb'/>
<id>urn:sha1:10a5e009b93a812956e232cee8804ed99f5b93bb</id>
<content type='text'>
Get rid of the unused entry "unaligned" in struct xdp_buff_xsk.

Signed-off-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Link: https://lore.kernel.org/bpf/20210922075613.12186-2-magnus.karlsson@gmail.com
</content>
</entry>
<entry>
<title>xsk: Fix missing validation for skb and unaligned mode</title>
<updated>2021-06-18T14:57:19+00:00</updated>
<author>
<name>Magnus Karlsson</name>
<email>magnus.karlsson@intel.com</email>
</author>
<published>2021-06-17T09:22:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2f99619820c2269534eb2c0cde44870313c6d353'/>
<id>urn:sha1:2f99619820c2269534eb2c0cde44870313c6d353</id>
<content type='text'>
Fix a missing validation of a Tx descriptor when executing in skb mode
and the umem is in unaligned mode. A descriptor could point to a
buffer straddling the end of the umem, thus effectively tricking the
kernel to read outside the allowed umem region. This could lead to a
kernel crash if that part of memory is not mapped.

In zero-copy mode, the descriptor validation code rejects such
descriptors by checking a bit in the DMA address that tells us if the
next page is physically contiguous or not. For the last page in the
umem, this bit is not set, therefore any descriptor pointing to a
packet straddling this last page boundary will be rejected. However,
the skb path does not use this bit since it copies out data and can do
so to two different pages. (It also does not have the array of DMA
address, so it cannot even store this bit.) The code just returned
that the packet is always physically contiguous. But this is
unfortunately also returned for the last page in the umem, which means
that packets that cross the end of the umem are being allowed, which
they should not be.

Fix this by introducing a check for this in the SKB path only, not
penalizing the zero-copy path.

Fixes: 2b43470add8c ("xsk: Introduce AF_XDP buffer allocation API")
Signed-off-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Björn Töpel &lt;bjorn@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20210617092255.3487-1-magnus.karlsson@gmail.com
</content>
</entry>
<entry>
<title>xsk: Fix race in SKB mode transmit with shared cq</title>
<updated>2020-12-18T15:10:21+00:00</updated>
<author>
<name>Magnus Karlsson</name>
<email>magnus.karlsson@intel.com</email>
</author>
<published>2020-12-18T13:45:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f09ced4053bc0a2094a12b60b646114c966ef4c6'/>
<id>urn:sha1:f09ced4053bc0a2094a12b60b646114c966ef4c6</id>
<content type='text'>
Fix a race when multiple sockets are simultaneously calling sendto()
when the completion ring is shared in the SKB case. This is the case
when you share the same netdev and queue id through the
XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
be in xsk_generic_xmit() and call the backpressure mechanism in
xskq_prod_reserve(xs-&gt;pool-&gt;cq). As this is a shared resource in this
specific scenario, a race might occur since the rings are
single-producer single-consumer.

Fix this by moving the tx_completion_lock from the socket to the pool
as the pool is shared between the sockets that share the completion
ring. (The pool is not shared when this is not the case.) And then
protect the accesses to xskq_prod_reserve() with this lock. The
tx_completion_lock is renamed cq_lock to better reflect that it
protects accesses to the potentially shared completion ring.

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Reported-by: Xuan Zhuo &lt;xuanzhuo@linux.alibaba.com&gt;
Signed-off-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Björn Töpel &lt;bjorn.topel@intel.com&gt;
Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com
</content>
</entry>
</feed>
