<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/net/xdp_sock.h, branch linux-7.0.y</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=linux-7.0.y</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=linux-7.0.y'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-04-07T01:43:51+00:00</updated>
<entry>
<title>xsk: fix XDP_UMEM_SG_FLAG issues</title>
<updated>2026-04-07T01:43:51+00:00</updated>
<author>
<name>Maciej Fijalkowski</name>
<email>maciej.fijalkowski@intel.com</email>
</author>
<published>2026-04-02T15:49:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=93e84fe45b752d17a5a46b306ed78f0133bbc719'/>
<id>urn:sha1:93e84fe45b752d17a5a46b306ed78f0133bbc719</id>
<content type='text'>
Currently xp_assign_dev_shared() is missing XDP_USE_SG being propagated
to flags so set it in order to preserve mtu check that is supposed to be
done only when no multi-buffer setup is in picture.

Also, this flag has the same value as XDP_UMEM_TX_SW_CSUM so we could
get unexpected SG setups for software Tx checksums. Since csum flag is
UAPI, modify value of XDP_UMEM_SG_FLAG.

Fixes: d609f3d228a8 ("xsk: add multi-buffer support for sockets sharing umem")
Reviewed-by: Björn Töpel &lt;bjorn@kernel.org&gt;
Signed-off-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Link: https://patch.msgid.link/20260402154958.562179-4-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: add indirect call for xsk_destruct_skb</title>
<updated>2025-11-11T09:21:08+00:00</updated>
<author>
<name>Jason Xing</name>
<email>kernelxing@tencent.com</email>
</author>
<published>2025-10-31T10:33:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8da7bea7db692e786165b71729fb68b7ff65ee56'/>
<id>urn:sha1:8da7bea7db692e786165b71729fb68b7ff65ee56</id>
<content type='text'>
Since Eric proposed an idea about adding indirect call wrappers for
UDP and managed to see a huge improvement[1], the same situation can
also be applied in xsk scenario.

This patch adds an indirect call for xsk and helps current copy mode
improve the performance by around 1% stably which was observed with
IXGBE at 10Gb/sec loaded. If the throughput grows, the positive effect
will be magnified. I applied this patch on top of batch xmit series[2],
and was able to see &lt;5% improvement from our internal application
which is a little bit unstable though.

Use INDIRECT wrappers to keep xsk_destruct_skb static as it used to
be when the mitigation config is off.

Be aware of the freeing path that can be very hot since the frequency
can reach around 2,000,000 times per second with the xdpsock test.

[1]: https://lore.kernel.org/netdev/20251006193103.2684156-2-edumazet@google.com/
[2]: https://lore.kernel.org/all/20251021131209.41491-1-kerneljasonxing@gmail.com/

Suggested-by: Alexander Lobakin &lt;aleksander.lobakin@intel.com&gt;
Signed-off-by: Jason Xing &lt;kernelxing@tencent.com&gt;
Reviewed-by: Alexander Lobakin &lt;aleksander.lobakin@intel.com&gt;
Link: https://patch.msgid.link/20251031103328.95468-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;

</content>
</entry>
<entry>
<title>net: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockopt</title>
<updated>2025-07-10T12:48:29+00:00</updated>
<author>
<name>Jason Xing</name>
<email>kernelxing@tencent.com</email>
</author>
<published>2025-07-04T16:01:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=45e359be1ce88fb22e61fa3aa23b2e450a6cae03'/>
<id>urn:sha1:45e359be1ce88fb22e61fa3aa23b2e450a6cae03</id>
<content type='text'>
This patch provides a setsockopt method to let applications leverage to
adjust how many descs to be handled at most in one send syscall. It
mitigates the situation where the default value (32) that is too small
leads to higher frequency of triggering send syscall.

Considering the prosperity/complexity the applications have, there is no
absolutely ideal suggestion fitting all cases. So keep 32 as its default
value like before.

The patch does the following things:
- Add XDP_MAX_TX_SKB_BUDGET socket option.
- Set max_tx_budget to 32 by default in the initialization phase as a
  per-socket granular control.
- Set the range of max_tx_budget as [32, xs-&gt;tx-&gt;nentries].

The idea behind this comes out of real workloads in production. We use a
user-level stack with xsk support to accelerate sending packets and
minimize triggering syscalls. When the packets are aggregated, it's not
hard to hit the upper bound (namely, 32). The moment user-space stack
fetches the -EAGAIN error number passed from sendto(), it will loop to try
again until all the expected descs from tx ring are sent out to the driver.
Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of
sendto() and higher throughput/PPS.

Here is what I did in production, along with some numbers as follows:
For one application I saw lately, I suggested using 128 as max_tx_budget
because I saw two limitations without changing any default configuration:
1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by
net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind
this was I counted how many descs are transmitted to the driver at one
time of sendto() based on [1] patch and then I calculated the
possibility of hitting the upper bound. Finally I chose 128 as a
suitable value because 1) it covers most of the cases, 2) a higher
number would not bring evident results. After twisting the parameters,
a stable improvement of around 4% for both PPS and throughput and less
resources consumption were found to be observed by strace -c -p xxx:
1) %time was decreased by 7.8%
2) error counter was decreased from 18367 to 572

[1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/

Signed-off-by: Jason Xing &lt;kernelxing@tencent.com&gt;
Acked-by: Maciej Fijalkowski &lt;maciej.fijalkowski@intel.com&gt;
Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;

</content>
</entry>
<entry>
<title>xsk: Fix race condition in AF_XDP generic RX path</title>
<updated>2025-04-25T00:11:33+00:00</updated>
<author>
<name>e.kubanski</name>
<email>e.kubanski@partner.samsung.com</email>
</author>
<published>2025-04-16T10:19:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a1356ac7749cafc4e27aa62c0c4604b5dca4983e'/>
<id>urn:sha1:a1356ac7749cafc4e27aa62c0c4604b5dca4983e</id>
<content type='text'>
Move rx_lock from xsk_socket to xsk_buff_pool.
Fix synchronization for shared umem mode in
generic RX path where multiple sockets share
single xsk_buff_pool.

RX queue is exclusive to xsk_socket, while FILL
queue can be shared between multiple sockets.
This could result in race condition where two
CPU cores access RX path of two different sockets
sharing the same umem.

Protect both queues by acquiring spinlock in shared
xsk_buff_pool.

Lock contention may be minimized in the future by some
per-thread FQ buffering.

It's safe and necessary to move spin_lock_bh(rx_lock)
after xsk_rcv_check():
* xs-&gt;pool and spinlock_init is synchronized by
  xsk_bind() -&gt; xsk_is_bound() memory barriers.
* xsk_rcv_check() may return true at the moment
  of xsk_release() or xsk_unbind_dev(),
  however this will not cause any data races or
  race conditions. xsk_unbind_dev() removes xdp
  socket from all maps and waits for completion
  of all outstanding rx operations. Packets in
  RX path will either complete safely or drop.

Signed-off-by: Eryk Kubanski &lt;e.kubanski@partner.samsung.com&gt;
Fixes: bf0bdd1343efb ("xdp: fix race on generic receive path")
Acked-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Link: https://patch.msgid.link/20250416101908.10919-1-e.kubanski@partner.samsung.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: Add launch time hardware offload support to XDP Tx metadata</title>
<updated>2025-02-20T23:13:45+00:00</updated>
<author>
<name>Song Yoong Siang</name>
<email>yoong.siang.song@intel.com</email>
</author>
<published>2025-02-16T09:34:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ca4419f15abd19ba8be1e109661b60f9f5b6c9f0'/>
<id>urn:sha1:ca4419f15abd19ba8be1e109661b60f9f5b6c9f0</id>
<content type='text'>
Extend the XDP Tx metadata framework so that user can requests launch time
hardware offload, where the Ethernet device will schedule the packet for
transmission at a pre-determined time called launch time. The value of
launch time is communicated from user space to Ethernet driver via
launch_time field of struct xsk_tx_metadata.

Suggested-by: Stanislav Fomichev &lt;sdf@fomichev.me&gt;
Signed-off-by: Song Yoong Siang &lt;yoong.siang.song@intel.com&gt;
Signed-off-by: Martin KaFai Lau &lt;martin.lau@kernel.org&gt;
Acked-by: Stanislav Fomichev &lt;sdf@fomichev.me&gt;
Acked-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Link: https://patch.msgid.link/20250216093430.957880-2-yoong.siang.song@intel.com
</content>
</entry>
<entry>
<title>net: Optimize xdp_do_flush() with bpf_net_context infos.</title>
<updated>2024-07-02T13:26:57+00:00</updated>
<author>
<name>Sebastian Andrzej Siewior</name>
<email>bigeasy@linutronix.de</email>
</author>
<published>2024-06-28T10:18:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d839a73179ae91c07f5f2f97ccb9c69b2b7c3306'/>
<id>urn:sha1:d839a73179ae91c07f5f2f97ccb9c69b2b7c3306</id>
<content type='text'>
Every NIC driver utilizing XDP should invoke xdp_do_flush() after
processing all packages. With the introduction of the bpf_net_context
logic the flush lists (for dev, CPU-map and xsk) are lazy initialized
only if used. However xdp_do_flush() tries to flush all three of them so
all three lists are always initialized and the likely empty lists are
"iterated".
Without the usage of XDP but with CONFIG_DEBUG_NET the lists are also
initialized due to xdp_do_check_flushed().

Jakub suggest to utilize the hints in bpf_net_context and avoid invoking
the flush function. This will also avoiding initializing the lists which
are otherwise unused.

Introduce bpf_net_ctx_get_all_used_flush_lists() to return the
individual list if not-empty. Use the logic in xdp_do_flush() and
xdp_do_check_flushed(). Remove the not needed .*_check_flush().

Suggested-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Reviewed-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
</content>
</entry>
<entry>
<title>xsk: Don't assume metadata is always requested in TX completion</title>
<updated>2024-03-19T12:47:29+00:00</updated>
<author>
<name>Stanislav Fomichev</name>
<email>sdf@google.com</email>
</author>
<published>2024-03-18T16:54:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f6e922365faf4cd576bd1cf3e64b58c8a32e1856'/>
<id>urn:sha1:f6e922365faf4cd576bd1cf3e64b58c8a32e1856</id>
<content type='text'>
`compl-&gt;tx_timestam != NULL` means that the user has explicitly
requested the metadata via XDP_TX_METADATA+XDP_TX_METADATA_TIMESTAMP.

Fixes: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
Reported-by: Daniele Salvatore Albano &lt;d.albano@gmail.com&gt;
Signed-off-by: Stanislav Fomichev &lt;sdf@google.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Tested-by: Daniele Salvatore Albano &lt;d.albano@gmail.com&gt;
Link: https://lore.kernel.org/bpf/20240318165427.1403313-1-sdf@google.com
</content>
</entry>
<entry>
<title>xsk: Add TX timestamp and TX checksum offload support</title>
<updated>2023-11-29T22:59:40+00:00</updated>
<author>
<name>Stanislav Fomichev</name>
<email>sdf@google.com</email>
</author>
<published>2023-11-27T19:03:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=48eb03dd26304c24f03bdbb9382e89c8564e71df'/>
<id>urn:sha1:48eb03dd26304c24f03bdbb9382e89c8564e71df</id>
<content type='text'>
This change actually defines the (initial) metadata layout
that should be used by AF_XDP userspace (xsk_tx_metadata).
The first field is flags which requests appropriate offloads,
followed by the offload-specific fields. The supported per-device
offloads are exported via netlink (new xsk-flags).

The offloads themselves are still implemented in a bit of a
framework-y fashion that's left from my initial kfunc attempt.
I'm introducing new xsk_tx_metadata_ops which drivers are
supposed to implement. The drivers are also supposed
to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
the right places. Since xsk_tx_metadata_{request,_complete}
are static inline, we don't incur any extra overhead doing
indirect calls.

The benefit of this scheme is as follows:
- keeps all metadata layout parsing away from driver code
- makes it easy to grep and see which drivers implement what
- don't need any extra flags to maintain to keep track of what
  offloads are implemented; if the callback is implemented - the offload
  is supported (used by netlink reporting code)

Two offloads are defined right now:
1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset
2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata
   area upon completion (tx_timestamp field)

XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes
SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass
metadata pointer).

The struct is forward-compatible and can be extended in the future
by appending more fields.

Reviewed-by: Song Yoong Siang &lt;yoong.siang.song@intel.com&gt;
Signed-off-by: Stanislav Fomichev &lt;sdf@google.com&gt;
Acked-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Link: https://lore.kernel.org/r/20231127190319.1190813-3-sdf@google.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: Support tx_metadata_len</title>
<updated>2023-11-29T22:59:40+00:00</updated>
<author>
<name>Stanislav Fomichev</name>
<email>sdf@google.com</email>
</author>
<published>2023-11-27T19:03:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=341ac980eab90ac1f6c22ee9f9da83ed9604d899'/>
<id>urn:sha1:341ac980eab90ac1f6c22ee9f9da83ed9604d899</id>
<content type='text'>
For zerocopy mode, tx_desc-&gt;addr can point to an arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is no way currently to populate skb metadata.

Introduce new tx_metadata_len umem config option that indicates how many
bytes to treat as metadata. Metadata bytes come prior to tx_desc address
(same as in RX case).

The size of the metadata has mostly the same constraints as XDP:
- less than 256 bytes
- 8-byte aligned (compared to 4-byte alignment on xdp, due to 8-byte
  timestamp in the completion)
- non-zero

This data is not interpreted in any way right now.

Reviewed-by: Song Yoong Siang &lt;yoong.siang.song@intel.com&gt;
Signed-off-by: Stanislav Fomichev &lt;sdf@google.com&gt;
Reviewed-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Link: https://lore.kernel.org/r/20231127190319.1190813-2-sdf@google.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>xsk: Avoid starving the xsk further down the list</title>
<updated>2023-10-24T09:55:36+00:00</updated>
<author>
<name>Albert Huang</name>
<email>huangjie.albert@bytedance.com</email>
</author>
<published>2023-10-23T12:57:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=99b29a499b5fdfb7ab274835b8e4d4c11df2f6d7'/>
<id>urn:sha1:99b29a499b5fdfb7ab274835b8e4d4c11df2f6d7</id>
<content type='text'>
In the previous implementation, when multiple xsk sockets were
associated with a single xsk_buff_pool, a situation could arise
where the xsk_tx_list maintained data at the front for one xsk
socket while starving the xsk sockets at the back of the list.
This could result in issues such as the inability to transmit packets,
increased latency, and jitter. To address this problem, we introduce
a new variable called tx_budget_spent, which limits each xsk to transmit
a maximum of MAX_PER_SOCKET_BUDGET tx descriptors. This allocation ensures
equitable opportunities for subsequent xsk sockets to send tx descriptors.
The value of MAX_PER_SOCKET_BUDGET is set to 32.

Signed-off-by: Albert Huang &lt;huangjie.albert@bytedance.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Magnus Karlsson &lt;magnus.karlsson@intel.com&gt;
Link: https://lore.kernel.org/bpf/20231023125732.82261-1-huangjie.albert@bytedance.com
</content>
</entry>
</feed>
