<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/tcp.h, branch v6.6.132</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.132</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.132'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-03-04T12:20:22+00:00</updated>
<entry>
<title>tcp: defer regular ACK while processing socket backlog</title>
<updated>2026-03-04T12:20:22+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-09-11T17:05:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b4d5e97679bc786687879710dc2de280c70ca889'/>
<id>urn:sha1:b4d5e97679bc786687879710dc2de280c70ca889</id>
<content type='text'>
[ Upstream commit 133c4c0d37175f510a10fa9bed51e223936073fc ]

This idea came after a particular workload requested
the quickack attribute set on routes, and a performance
drop was noticed for large bulk transfers.

For high throughput flows, it is best to use one cpu
running the user thread issuing socket system calls,
and a separate cpu to process incoming packets from BH context.
(With TSO/GRO, bottleneck is usually the 'user' cpu)

Problem is the user thread can spend a lot of time while holding
the socket lock, forcing BH handler to queue most of incoming
packets in the socket backlog.

Whenever the user thread releases the socket lock, it must first
process all accumulated packets in the backlog, potentially
adding latency spikes. Due to flood mitigation, having too many
packets in the backlog increases chance of unexpected drops.

Backlog processing unfortunately shifts a fair amount of cpu cycles
from the BH cpu to the 'user' cpu, thus reducing max throughput.

This patch takes advantage of the backlog processing,
and the fact that ACK are mostly cumulative.

The idea is to detect we are in the backlog processing
and defer all eligible ACK into a single one,
sent from tcp_release_cb().

This saves cpu cycles on both sides, and network resources.

Performance of a single TCP flow on a 200Gbit NIC:

- Throughput is increased by 20% (100Gbit -&gt; 120Gbit).
- Number of generated ACK per second shrinks from 240,000 to 40,000.
- Number of backlog drops per second shrinks from 230 to 0.

Benchmark context:
 - Regular netperf TCP_STREAM (no zerocopy)
 - Intel(R) Xeon(R) Platinum 8481C (Saphire Rapids)
 - MAX_SKB_FRAGS = 17 (~60KB per GRO packet)

This feature is guarded by a new sysctl, and enabled by default:
 /proc/sys/net/ipv4/tcp_backlog_ack_defer

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Acked-by: Dave Taht &lt;dave.taht@gmail.com&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
Stable-dep-of: 87b08913a9ae ("inet: move icmp_global_{credit,stamp} to a separate cache line")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: new TCP_INFO stats for RTO events</title>
<updated>2024-10-17T13:24:23+00:00</updated>
<author>
<name>Aananth V</name>
<email>aananthv@google.com</email>
</author>
<published>2023-09-14T14:36:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=718c49f840ef4e451bf44a8a62aae89ebdd5a687'/>
<id>urn:sha1:718c49f840ef4e451bf44a8a62aae89ebdd5a687</id>
<content type='text'>
[ Upstream commit 3868ab0f192581eff978501a05f3dc2e01541d77 ]

The 2023 SIGCOMM paper "Improving Network Availability with Protective
ReRoute" has indicated Linux TCP's RTO-triggered txhash rehashing can
effectively reduce application disruption during outages. To better
measure the efficacy of this feature, this patch adds three more
detailed stats during RTO recovery and exports via TCP_INFO.
Applications and monitoring systems can leverage this data to measure
the network path diversity and end-to-end repair latency during network
outages to improve their network infrastructure.

The following counters are added to tcp_sock in order to track RTO
events over the lifetime of a TCP socket.

1. u16 total_rto - Counts the total number of RTO timeouts.
2. u16 total_rto_recoveries - Counts the total number of RTO recoveries.
3. u32 total_rto_time - Counts the total time spent (ms) in RTO
                        recoveries. (time spent in CA_Loss and
                        CA_Recovery states)

To compute total_rto_time, we add a new u32 rto_stamp field to
tcp_sock. rto_stamp records the start timestamp (ms) of the last RTO
recovery (CA_Loss).

Corresponding fields are also added to the tcp_info struct.

Signed-off-by: Aananth V &lt;aananthv@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Stable-dep-of: 27c80efcc204 ("tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits out")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: set TCP_USER_TIMEOUT locklessly</title>
<updated>2023-08-06T07:24:55+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-08-04T14:46:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d58f2e15aa0c07f6f03ec71f64d7697ca43d04a1'/>
<id>urn:sha1:d58f2e15aa0c07f6f03ec71f64d7697ca43d04a1</id>
<content type='text'>
icsk-&gt;icsk_user_timeout can be set locklessly,
if all read sides use READ_ONCE().

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2023-07-20T22:52:55+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2023-07-20T22:05:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=59be3baa8dff271d48500e009622318badfc7140'/>
<id>urn:sha1:59be3baa8dff271d48500e009622318badfc7140</id>
<content type='text'>
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: annotate data-races around fastopenq.max_qlen</title>
<updated>2023-07-20T19:34:18+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-07-19T21:28:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=70f360dd7042cb843635ece9d28335a4addff9eb'/>
<id>urn:sha1:70f360dd7042cb843635ece9d28335a4addff9eb</id>
<content type='text'>
This field can be read locklessly.

Fixes: 1536e2857bd3 ("tcp: Add a TCP_FASTOPEN socket option to get a max backlog on its listner")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Link: https://lore.kernel.org/r/20230719212857.3943972-12-edumazet@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: get rid of sysctl_tcp_adv_win_scale</title>
<updated>2023-07-19T01:41:18+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-07-17T15:29:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=dfa2f0483360d4d6f2324405464c9f281156bd87'/>
<id>urn:sha1:dfa2f0483360d4d6f2324405464c9f281156bd87</id>
<content type='text'>
With modern NIC drivers shifting to full page allocations per
received frame, we face the following issue:

TCP has one per-netns sysctl used to tweak how to translate
a memory use into an expected payload (RWIN), in RX path.

tcp_win_from_space() implementation is limited to few cases.

For hosts dealing with various MSS, we either under estimate
or over estimate the RWIN we send to the remote peers.

For instance with the default sysctl_tcp_adv_win_scale value,
we expect to store 50% of payload per allocated chunk of memory.

For the typical use of MTU=1500 traffic, and order-0 pages allocations
by NIC drivers, we are sending too big RWIN, leading to potential
tcp collapse operations, which are extremely expensive and source
of latency spikes.

This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
uses a per socket scaling factor, so that we can precisely
adjust the RWIN based on effective skb-&gt;len/skb-&gt;truesize ratio.

This patch alone can double TCP receive performance when receivers
are too slow to drain their receive queue, or by allowing
a bigger RWIN when MSS is close to PAGE_SIZE.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: preserve const qualifier in tcp_sk()</title>
<updated>2023-03-18T12:23:34+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-03-17T15:55:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e9d9da91548b21e189fcd0259a0f2d26d1afc509'/>
<id>urn:sha1:e9d9da91548b21e189fcd0259a0f2d26d1afc509</id>
<content type='text'>
We can change tcp_sk() to propagate its argument const qualifier,
thanks to container_of_const().

We have two places where a const sock pointer has to be upgraded
to a write one. We have been using const qualifier for lockless
listeners to clearly identify points where writes could happen.

Add tcp_sk_rw() helper to better document these.

tcp_inbound_md5_hash(), __tcp_grow_window(), tcp_reset_check()
and tcp_rack_reo_wnd() get an additional const qualififer
for their @tp local variables.

smc_check_reset_syn_req() also needs a similar change.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Simon Horman &lt;simon.horman@corigine.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: add u32 counter in tcp_sock and an SNMP counter for PLB</title>
<updated>2022-10-28T09:47:42+00:00</updated>
<author>
<name>Mubashir Adnan Qureshi</name>
<email>mubashirq@google.com</email>
</author>
<published>2022-10-26T13:51:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=29c1c44646aec5d5134f2365259a84becc1ee7d3'/>
<id>urn:sha1:29c1c44646aec5d5134f2365259a84becc1ee7d3</id>
<content type='text'>
A u32 counter is added to tcp_sock for counting the number of PLB
triggered rehashes for a TCP connection. An SNMP counter is also
added to count overall PLB triggered rehash events for a host. These
counters are hooked up to PLB implementation for DCTCP.

TCP_NLA_REHASH is added to SCM_TIMESTAMPING_OPT_STATS that reports
the rehash attempts triggered due to PLB or timeouts. This gives
a historical view of sustained congestion or timeouts experienced
by the TCP connection.

Signed-off-by: Mubashir Adnan Qureshi &lt;mubashirq@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2022-10-04T00:44:18+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-10-04T00:44:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e52f7c1ddf3e47243c330923ea764e7ccfbe99f7'/>
<id>urn:sha1:e52f7c1ddf3e47243c330923ea764e7ccfbe99f7</id>
<content type='text'>
Merge in the left-over fixes before the net-next pull-request.

Conflicts:

drivers/net/ethernet/mediatek/mtk_ppe.c
  ae3ed15da588 ("net: ethernet: mtk_eth_soc: fix state in __mtk_foe_entry_clear")
  9d8cb4c096ab ("net: ethernet: mtk_eth_soc: add foe_entry_size to mtk_eth_soc")
https://lore.kernel.org/all/6cb6893b-4921-a068-4c30-1109795110bb@tessares.net/

kernel/bpf/helpers.c
  8addbfc7b308 ("bpf: Gate dynptr API behind CAP_BPF")
  5679ff2f138f ("bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF")
  8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types")
https://lore.kernel.org/all/20221003201957.13149-1-daniel@iogearbox.net/

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited</title>
<updated>2022-09-30T11:37:45+00:00</updated>
<author>
<name>Neal Cardwell</name>
<email>ncardwell@google.com</email>
</author>
<published>2022-09-28T20:03:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f4ce91ce12a7c6ead19b128ffa8cff6e3ded2a14'/>
<id>urn:sha1:f4ce91ce12a7c6ead19b128ffa8cff6e3ded2a14</id>
<content type='text'>
This commit fixes a bug in the tracking of max_packets_out and
is_cwnd_limited. This bug can cause the connection to fail to remember
that is_cwnd_limited is true, causing the connection to fail to grow
cwnd when it should, causing throughput to be lower than it should be.

The following event sequence is an example that triggers the bug:

 (a) The connection is cwnd_limited, but packets_out is not at its
     peak due to TSO deferral deciding not to send another skb yet.
     In such cases the connection can advance max_packets_seq and set
     tp-&gt;is_cwnd_limited to true and max_packets_out to a small
     number.

(b) Then later in the round trip the connection is pacing-limited (not
     cwnd-limited), and packets_out is larger. In such cases the
     connection would raise max_packets_out to a bigger number but
     (unexpectedly) flip tp-&gt;is_cwnd_limited from true to false.

This commit fixes that bug.

One straightforward fix would be to separately track (a) the next
window after max_packets_out reaches a maximum, and (b) the next
window after tp-&gt;is_cwnd_limited is set to true. But this would
require consuming an extra u32 sequence number.

Instead, to save space we track only the most important
information. Specifically, we track the strongest available signal of
the degree to which the cwnd is fully utilized:

(1) If the connection is cwnd-limited then we remember that fact for
the current window.

(2) If the connection not cwnd-limited then we track the maximum
number of outstanding packets in the current window.

In particular, note that the new logic cannot trigger the buggy
(a)/(b) sequence above because with the new logic a condition where
tp-&gt;packets_out &gt; tp-&gt;max_packets_out can only trigger an update of
tp-&gt;is_cwnd_limited if tp-&gt;is_cwnd_limited is false.

This first showed up in a testing of a BBRv2 dev branch, but this
buggy behavior highlighted a general issue with the
tcp_cwnd_validate() logic that can cause cwnd to fail to increase at
the proper rate for any TCP congestion control, including Reno or
CUBIC.

Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Kevin(Yudong) Yang &lt;yyd@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
