<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/net/netns/ipv4.h, branch v6.12.80</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-03-04T12:20:39+00:00</updated>
<entry>
<title>inet: move icmp_global_{credit,stamp} to a separate cache line</title>
<updated>2026-03-04T12:20:39+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2026-02-16T14:28:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c480680be3776263198f3ae117e6a1c68a7d60de'/>
<id>urn:sha1:c480680be3776263198f3ae117e6a1c68a7d60de</id>
<content type='text'>
[ Upstream commit 87b08913a9ae82082e276d237ece08fc8ee24380 ]

icmp_global_credit was meant to be changed ~1000 times per second,
but if an admin sets net.ipv4.icmp_msgs_per_sec to a very high value,
icmp_global_credit changes can inflict false sharing to surrounding
fields that are read mostly.

Move icmp_global_credit and icmp_global_stamp to a separate
cacheline aligned group.

Fixes: b056b4cd9178 ("icmp: move icmp_global.credit and icmp_global.stamp to per netns storage")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@google.com&gt;
Link: https://patch.msgid.link/20260216142832.3834174-3-edumazet@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>icmp: icmp_msgs_per_sec and icmp_msgs_burst sysctls become per netns</title>
<updated>2024-08-30T18:14:06+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2024-08-29T14:46:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f17bf505ff89595df5147755e51441632a5dc563'/>
<id>urn:sha1:f17bf505ff89595df5147755e51441632a5dc563</id>
<content type='text'>
Previous patch made ICMP rate limits per netns, it makes sense
to allow each netns to change the associated sysctl.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Link: https://patch.msgid.link/20240829144641.3880376-4-edumazet@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>icmp: move icmp_global.credit and icmp_global.stamp to per netns storage</title>
<updated>2024-08-30T18:14:06+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2024-08-29T14:46:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b056b4cd9178f7a1d5d57f7b48b073c29729ddaa'/>
<id>urn:sha1:b056b4cd9178f7a1d5d57f7b48b073c29729ddaa</id>
<content type='text'>
Host wide ICMP ratelimiter should be per netns, to provide better isolation.

Following patch in this series makes the sysctl per netns.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Link: https://patch.msgid.link/20240829144641.3880376-3-edumazet@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: ipv4: Add a sysctl to set multipath hash seed</title>
<updated>2024-06-12T23:42:11+00:00</updated>
<author>
<name>Petr Machata</name>
<email>petrm@nvidia.com</email>
</author>
<published>2024-06-07T15:13:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4ee2a8cace3fb9a34aea6a56426f89d26dd514f3'/>
<id>urn:sha1:4ee2a8cace3fb9a34aea6a56426f89d26dd514f3</id>
<content type='text'>
When calculating hashes for the purpose of multipath forwarding, both IPv4
and IPv6 code currently fall back on flow_hash_from_keys(). That uses a
randomly-generated seed. That's a fine choice by default, but unfortunately
some deployments may need a tighter control over the seed used.

In this patch, make the seed configurable by adding a new sysctl key,
net.ipv4.fib_multipath_hash_seed to control the seed. This seed is used
specifically for multipath forwarding and not for the other concerns that
flow_hash_from_keys() is used for, such as queue selection. Expose the knob
as sysctl because other such settings, such as headers to hash, are also
handled that way. Like those, the multipath hash seed is a per-netns
variable.

Despite being placed in the net.ipv4 namespace, the multipath seed sysctl
is used for both IPv4 and IPv6, similarly to e.g. a number of TCP
variables.

The seed used by flow_hash_from_keys() is a 128-bit quantity. However it
seems that usually the seed is a much more modest value. 32 bits seem
typical (Cisco, Cumulus), some systems go even lower. For that reason, and
to decouple the user interface from implementation details, go with a
32-bit quantity, which is then quadruplicated to form the siphash key.

Signed-off-by: Petr Machata &lt;petrm@nvidia.com&gt;
Reviewed-by: Ido Schimmel &lt;idosch@nvidia.com&gt;
Reviewed-by: Nikolay Aleksandrov &lt;razor@blackwall.org&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Link: https://lore.kernel.org/r/20240607151357.421181-3-petrm@nvidia.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: add sysctl_tcp_rto_min_us</title>
<updated>2024-06-05T12:42:54+00:00</updated>
<author>
<name>Kevin Yang</name>
<email>yyd@google.com</email>
</author>
<published>2024-06-03T21:30:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f086edef71be7174a16c1ed67ac65a085cda28b1'/>
<id>urn:sha1:f086edef71be7174a16c1ed67ac65a085cda28b1</id>
<content type='text'>
Adding a sysctl knob to allow user to specify a default
rto_min at socket init time, other than using the hard
coded 200ms default rto_min.

Note that the rto_min route option has the highest precedence
for configuring this setting, followed by the TCP_BPF_RTO_MIN
socket option, followed by the tcp_rto_min_us sysctl.

Signed-off-by: Kevin Yang &lt;yyd@google.com&gt;
Reviewed-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Reviewed-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Tony Lu &lt;tonylu@linux.alibaba.com&gt;
Reviewed-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Use READ/WRITE_ONCE() for IP local_port_range.</title>
<updated>2023-12-08T18:44:42+00:00</updated>
<author>
<name>David Laight</name>
<email>David.Laight@ACULAB.COM</email>
</author>
<published>2023-12-06T13:44:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d9f28735af8781d9c8c6c406c2a102090644133d'/>
<id>urn:sha1:d9f28735af8781d9c8c6c406c2a102090644133d</id>
<content type='text'>
Commit 227b60f5102cd added a seqlock to ensure that the low and high
port numbers were always updated together.
This is overkill because the two 16bit port numbers can be held in
a u32 and read/written in a single instruction.

More recently 91d0b78c5177f added support for finer per-socket limits.
The user-supplied value is 'high &lt;&lt; 16 | low' but they are held
separately and the socket options protected by the socket lock.

Use a u32 containing 'high &lt;&lt; 16 | low' for both the 'net' and 'sk'
fields and use READ_ONCE()/WRITE_ONCE() to ensure both values are
always updated together.

Change (the now trival) inet_get_local_port_range() to a static inline
to optimise the calling code.
(In particular avoiding returning integers by reference.)

Signed-off-by: David Laight &lt;david.laight@aculab.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Acked-by: Mat Martineau &lt;martineau@kernel.org&gt;
Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Link: https://lore.kernel.org/r/4e505d4198e946a8be03fb1b4c3072b0@AcuMS.aculab.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>netns-ipv4: reorganize netns_ipv4 fast path variables</title>
<updated>2023-12-02T22:24:36+00:00</updated>
<author>
<name>Coco Li</name>
<email>lixiaoyan@google.com</email>
</author>
<published>2023-11-29T07:27:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=18fd64d2542292713b0322e6815be059bdee440c'/>
<id>urn:sha1:18fd64d2542292713b0322e6815be059bdee440c</id>
<content type='text'>
Reorganize fast path variables on tx-txrx-rx order.
Fastpath cacheline ends after sysctl_tcp_rmem.
There are only read-only variables here. (write is on the control path
and not considered in this case)

Below data generated with pahole on x86 architecture.
Fast path variables span cache lines before change: 4
Fast path variables span cache lines after change: 2

Suggested-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Wei Wang &lt;weiwan@google.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Signed-off-by: Coco Li &lt;lixiaoyan@google.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: Set pingpong threshold via sysctl</title>
<updated>2023-10-16T21:55:32+00:00</updated>
<author>
<name>Haiyang Zhang</name>
<email>haiyangz@microsoft.com</email>
</author>
<published>2023-10-11T20:30:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=562b1fdf061bff9394ccd884456ed1173c224fdc'/>
<id>urn:sha1:562b1fdf061bff9394ccd884456ed1173c224fdc</id>
<content type='text'>
TCP pingpong threshold is 1 by default. But some applications, like SQL DB
may prefer a higher pingpong threshold to activate delayed acks in quick
ack mode for better performance.

The pingpong threshold and related code were changed to 3 in the year
2019 in:
  commit 4a41f453bedf ("tcp: change pingpong threshold to 3")
And reverted to 1 in the year 2022 in:
  commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"")

There is no single value that fits all applications.
Add net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for
optimal performance based on the application needs.

Signed-off-by: Haiyang Zhang &lt;haiyangz@microsoft.com&gt;
Reviewed-by: Simon Horman &lt;horms@kernel.org&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Link: https://lore.kernel.org/r/1697056244-21888-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: defer regular ACK while processing socket backlog</title>
<updated>2023-09-12T17:10:01+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-09-11T17:05:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=133c4c0d37175f510a10fa9bed51e223936073fc'/>
<id>urn:sha1:133c4c0d37175f510a10fa9bed51e223936073fc</id>
<content type='text'>
This idea came after a particular workload requested
the quickack attribute set on routes, and a performance
drop was noticed for large bulk transfers.

For high throughput flows, it is best to use one cpu
running the user thread issuing socket system calls,
and a separate cpu to process incoming packets from BH context.
(With TSO/GRO, bottleneck is usually the 'user' cpu)

Problem is the user thread can spend a lot of time while holding
the socket lock, forcing BH handler to queue most of incoming
packets in the socket backlog.

Whenever the user thread releases the socket lock, it must first
process all accumulated packets in the backlog, potentially
adding latency spikes. Due to flood mitigation, having too many
packets in the backlog increases chance of unexpected drops.

Backlog processing unfortunately shifts a fair amount of cpu cycles
from the BH cpu to the 'user' cpu, thus reducing max throughput.

This patch takes advantage of the backlog processing,
and the fact that ACK are mostly cumulative.

The idea is to detect we are in the backlog processing
and defer all eligible ACK into a single one,
sent from tcp_release_cb().

This saves cpu cycles on both sides, and network resources.

Performance of a single TCP flow on a 200Gbit NIC:

- Throughput is increased by 20% (100Gbit -&gt; 120Gbit).
- Number of generated ACK per second shrinks from 240,000 to 40,000.
- Number of backlog drops per second shrinks from 230 to 0.

Benchmark context:
 - Regular netperf TCP_STREAM (no zerocopy)
 - Intel(R) Xeon(R) Platinum 8481C (Saphire Rapids)
 - MAX_SKB_FRAGS = 17 (~60KB per GRO packet)

This feature is guarded by a new sysctl, and enabled by default:
 /proc/sys/net/ipv4/tcp_backlog_ack_defer

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Acked-by: Dave Taht &lt;dave.taht@gmail.com&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: get rid of sysctl_tcp_adv_win_scale</title>
<updated>2023-07-19T01:41:18+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-07-17T15:29:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=dfa2f0483360d4d6f2324405464c9f281156bd87'/>
<id>urn:sha1:dfa2f0483360d4d6f2324405464c9f281156bd87</id>
<content type='text'>
With modern NIC drivers shifting to full page allocations per
received frame, we face the following issue:

TCP has one per-netns sysctl used to tweak how to translate
a memory use into an expected payload (RWIN), in RX path.

tcp_win_from_space() implementation is limited to few cases.

For hosts dealing with various MSS, we either under estimate
or over estimate the RWIN we send to the remote peers.

For instance with the default sysctl_tcp_adv_win_scale value,
we expect to store 50% of payload per allocated chunk of memory.

For the typical use of MTU=1500 traffic, and order-0 pages allocations
by NIC drivers, we are sending too big RWIN, leading to potential
tcp collapse operations, which are extremely expensive and source
of latency spikes.

This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
uses a per socket scaling factor, so that we can precisely
adjust the RWIN based on effective skb-&gt;len/skb-&gt;truesize ratio.

This patch alone can double TCP receive performance when receivers
are too slow to drain their receive queue, or by allowing
a bigger RWIN when MSS is close to PAGE_SIZE.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
</feed>
