<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/net/tcp.h, branch v4.15.6</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v4.15.6</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v4.15.6'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2017-12-08T15:07:02+00:00</updated>
<entry>
<title>tcp: invalidate rate samples during SACK reneging</title>
<updated>2017-12-08T15:07:02+00:00</updated>
<author>
<name>Yousuk Seung</name>
<email>ysseung@google.com</email>
</author>
<published>2017-12-07T21:41:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d4761754b4fb2ef8d9a1e9d121c4bec84e1fe292'/>
<id>urn:sha1:d4761754b4fb2ef8d9a1e9d121c4bec84e1fe292</id>
<content type='text'>
Mark tcp_sock during a SACK reneging event and invalidate rate samples
while marked. Such rate samples may overestimate bw by including packets
that were SACKed before reneging.

&lt; ack 6001 win 10000 sack 7001:38001
&lt; ack 7001 win 0 sack 8001:38001 // Reneg detected
&gt; seq 7001:8001 // RTO, SACK cleared.
&lt; ack 38001 win 10000

In the above example the rate sample taken after the last ack will count
7001-38001 as delivered while the actual delivery rate likely could
be much lower, i.e. 7001-8001.

This patch adds a new field tcp_sock.is_sack_reneg and marks it when we
declare SACK reneging and enter TCP_CA_Loss, and unmarks it after the
last rate sample has been taken before moving back to TCP_CA_Open. This
patch also invalidates rate samples taken while tcp_sock.is_sack_reneg
is set.
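
A rough sketch of the invalidation described above (names taken from
this description; not necessarily the exact upstream diff):

    /* tcp_rate_gen(): return an invalid sample while recovering from
     * loss with SACK reneging, since such a sample may count
     * previously-SACKed data as newly delivered and overestimate bw.
     */
    if (tp-&gt;is_sack_reneg) {
            rs-&gt;delivered = -1;
            rs-&gt;interval_us = -1;
            return;
    }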

Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")
Signed-off-by: Yousuk Seung &lt;ysseung@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Priyaranjan Jha &lt;priyarjha@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match()</title>
<updated>2017-12-03T17:39:15+00:00</updated>
<author>
<name>David Ahern</name>
<email>dsahern@gmail.com</email>
</author>
<published>2017-12-03T17:33:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b4d1605a8ea608fd7dc45b926a05d75d340bde4b'/>
<id>urn:sha1:b4d1605a8ea608fd7dc45b926a05d75d340bde4b</id>
<content type='text'>
After this fix: ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()"),
socket lookups happen while skb-&gt;cb[] has not been mangled yet by TCP,
so inet_exact_dif_match() should read the l3mdev flag from IPCB(skb)
rather than from TCP_SKB_CB(skb).
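
A sketch of the resulting check in inet_exact_dif_match() (assuming the
l3mdev flag is carried in IPCB(skb)-&gt;flags; not necessarily the exact
diff):

    /* skb-&gt;cb[] still holds the IP control block at lookup time,
     * so read the l3mdev flag through IPCB() instead of TCP_SKB_CB().
     */
    if (!net-&gt;ipv4.sysctl_tcp_l3mdev_accept &amp;&amp;
        skb &amp;&amp; ipv4_l3mdev_skb(IPCB(skb)-&gt;flags))
            return true;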

Fixes: a04a480d4392 ("net: Require exact match for TCP socket lookups if dif is l3mdev")
Signed-off-by: David Ahern &lt;dsahern@gmail.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: when scheduling TLP, time of RTO should account for current ACK</title>
<updated>2017-11-19T03:25:26+00:00</updated>
<author>
<name>Neal Cardwell</name>
<email>ncardwell@google.com</email>
</author>
<published>2017-11-18T02:06:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ed66dfaf236c04d414de1d218441296e57fb2bd2'/>
<id>urn:sha1:ed66dfaf236c04d414de1d218441296e57fb2bd2</id>
<content type='text'>
Fix the TLP scheduling logic so that when scheduling a TLP probe, we
ensure that the estimated time at which an RTO would fire accounts for
the fact that ACKs indicating forward progress should push back RTO
times.

After the following fix:

df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")

we had an unintentional behavior change in the following kind of
scenario: suppose the RTT variance has been very low recently. Then
suppose we send out a flight of N packets and our RTT is 100ms:

t=0: send a flight of N packets
t=100ms: receive an ACK for N-1 packets

The response before df92c8394e6e was:
  -&gt; schedule a TLP for now + RTO_interval

The response after df92c8394e6e is:
  -&gt; schedule a TLP for t=0 + RTO_interval

Since RTO_interval = srtt + RTT_variance, this means that we have
scheduled a TLP timer at a point in the future that only accounts for
RTT_variance. If the RTT_variance term is small, this means that the
timer fires soon.

Before df92c8394e6e this would not happen, because in that code, when
we receive an ACK for a prefix of the flight, we did:

    1) Near the top of tcp_ack(), switch from TLP timer to RTO
       at write_queue_head-&gt;packet_tx_time + RTO_interval:
            if (icsk-&gt;icsk_pending == ICSK_TIME_LOSS_PROBE)
                   tcp_rearm_rto(sk);

    2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
            if (flag &amp; FLAG_ACKED) {
                   tcp_rearm_rto(sk);

    3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
       to TLP at now + RTO_interval:
            if (icsk-&gt;icsk_pending == ICSK_TIME_RETRANS)
                   tcp_schedule_loss_probe(sk);

In df92c8394e6e we removed that 3-phase dance, and instead directly
set the TLP timer once: we set the TLP timer in cases like this to
write_queue_head-&gt;packet_tx_time + RTO_interval. So if the RTT
variance is small, the TLP timer is set to fire quite soon, and if
the ACK for the tail of the flight takes longer than an RTT to arrive
(often due to delayed ACKs), the TLP timer fires too quickly.
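
A sketch of the fix this implies (the parameter name is an assumption
for illustration, not the literal diff): when the ACK being processed
advances forward progress, the RTO will be re-armed at now +
RTO_interval, so the remaining RTO time should be measured from "now"
rather than from the head skb's transmit time:

    /* in tcp_schedule_loss_probe() */
    if (advancing_rto)
            rto_delta_us = jiffies_to_usecs(inet_csk(sk)-&gt;icsk_rto);
    else
            rto_delta_us = tcp_rto_delta_us(sk);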

Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: highest_sack fix</title>
<updated>2017-11-15T10:48:42+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2017-11-15T05:02:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=50895b9de1d3e0258e015e8e55128d835d9a9f19'/>
<id>urn:sha1:50895b9de1d3e0258e015e8e55128d835d9a9f19</id>
<content type='text'>
syzbot easily found a regression added in our latest patches [1]

No longer set tp-&gt;highest_sack to the head of the send queue since
this is not logical and is error-prone.

Only sack processing should maintain the pointer to an skb from the rtx queue.

We might in the future only remember the sequence instead of a pointer to
the skb, since the rb-tree should allow a fast lookup.
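
For context, the read that trips KASAN below is the dereference of the
stale pointer in tcp_highest_sack_seq(), roughly (sketched, may differ
slightly from this tree):

    static inline u32 tcp_highest_sack_seq(struct tcp_sock *tp)
    {
            if (!tp-&gt;sacked_out)
                    return tp-&gt;snd_una;
            if (!tp-&gt;highest_sack)
                    return tp-&gt;snd_nxt;
            /* use-after-free if highest_sack points to a freed skb */
            return TCP_SKB_CB(tp-&gt;highest_sack)-&gt;seq;
    }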

[1]
BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1706 [inline]
BUG: KASAN: use-after-free in tcp_ack+0x42bb/0x4fd0 net/ipv4/tcp_input.c:3537
Read of size 4 at addr ffff8801c154faa8 by task syz-executor4/12860

CPU: 0 PID: 12860 Comm: syz-executor4 Not tainted 4.14.0-next-20171113+ #41
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x25b/0x340 mm/kasan/report.c:409
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
 tcp_highest_sack_seq include/net/tcp.h:1706 [inline]
 tcp_ack+0x42bb/0x4fd0 net/ipv4/tcp_input.c:3537
 tcp_rcv_established+0x672/0x18a0 net/ipv4/tcp_input.c:5439
 tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1468
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __release_sock+0x124/0x360 net/core/sock.c:2264
 release_sock+0xa4/0x2a0 net/core/sock.c:2778
 tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
 sock_sendmsg_nosec net/socket.c:632 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:642
 ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2048
 __sys_sendmsg+0xe5/0x210 net/socket.c:2082
 SYSC_sendmsg net/socket.c:2093 [inline]
 SyS_sendmsg+0x2d/0x50 net/socket.c:2089
 entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x452879
RSP: 002b:00007fc9761bfbe8 EFLAGS: 00000212 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000758020 RCX: 0000000000452879
RDX: 0000000000000000 RSI: 0000000020917fc8 RDI: 0000000000000015
RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000212 R12: 00000000006ee3a0
R13: 00000000ffffffff R14: 00007fc9761c06d4 R15: 0000000000000000

Allocated by task 12860:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
 kmem_cache_alloc_node+0x144/0x760 mm/slab.c:3638
 __alloc_skb+0xf1/0x780 net/core/skbuff.c:193
 alloc_skb_fclone include/linux/skbuff.h:1023 [inline]
 sk_stream_alloc_skb+0x11d/0x900 net/ipv4/tcp.c:870
 tcp_sendmsg_locked+0x1341/0x3b80 net/ipv4/tcp.c:1299
 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1461
 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
 sock_sendmsg_nosec net/socket.c:632 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:642
 SYSC_sendto+0x358/0x5a0 net/socket.c:1749
 SyS_sendto+0x40/0x50 net/socket.c:1717
 entry_SYSCALL_64_fastpath+0x1f/0x96

Freed by task 12860:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
 __cache_free mm/slab.c:3492 [inline]
 kmem_cache_free+0x77/0x280 mm/slab.c:3750
 kfree_skbmem+0xdd/0x1d0 net/core/skbuff.c:603
 __kfree_skb+0x1d/0x20 net/core/skbuff.c:642
 sk_wmem_free_skb include/net/sock.h:1419 [inline]
 tcp_rtx_queue_unlink_and_free include/net/tcp.h:1682 [inline]
 tcp_clean_rtx_queue net/ipv4/tcp_input.c:3111 [inline]
 tcp_ack+0x1b17/0x4fd0 net/ipv4/tcp_input.c:3593
 tcp_rcv_established+0x672/0x18a0 net/ipv4/tcp_input.c:5439
 tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1468
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __release_sock+0x124/0x360 net/core/sock.c:2264
 release_sock+0xa4/0x2a0 net/core/sock.c:2778
 tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
 sock_sendmsg_nosec net/socket.c:632 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:642
 ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2048
 __sys_sendmsg+0xe5/0x210 net/socket.c:2082
 SYSC_sendmsg net/socket.c:2093 [inline]
 SyS_sendmsg+0x2d/0x50 net/socket.c:2089
 entry_SYSCALL_64_fastpath+0x1f/0x96

The buggy address belongs to the object at ffff8801c154fa80
 which belongs to the cache skbuff_fclone_cache of size 456
The buggy address is located 40 bytes inside of
 456-byte region [ffff8801c154fa80, ffff8801c154fc48)
The buggy address belongs to the page:
page:ffffea00070553c0 count:1 mapcount:0 mapping:ffff8801c154f080 index:0x0
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffff8801c154f080 0000000000000000 0000000100000006
raw: ffffea00070a5a20 ffffea0006a18360 ffff8801d9ca0500 0000000000000000
page dumped because: kasan: bad access detected

Fixes: 737ff314563c ("tcp: use sequence distance to detect reordering")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Yuchung Cheng &lt;ycheng@google.com&gt;
Reported-by: syzbot &lt;syzkaller@googlegroups.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: Namespace-ify sysctl_tcp_default_congestion_control</title>
<updated>2017-11-15T05:09:52+00:00</updated>
<author>
<name>Stephen Hemminger</name>
<email>stephen@networkplumber.org</email>
</author>
<published>2017-11-14T16:25:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6670e152447732ba90626f36dfc015a13fbf150e'/>
<id>urn:sha1:6670e152447732ba90626f36dfc015a13fbf150e</id>
<content type='text'>
Make the default TCP congestion control a per-namespace value. This
changes the default congestion control to a pointer to congestion ops
(rather than implicitly the first element of the available list).

The congestion control setting of new namespaces is inherited
from the current setting of the root namespace.
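
A rough sketch of what the per-namespace value looks like (the field
name is an assumption for illustration; not the exact diff):

    /* struct netns_ipv4: per-namespace default congestion control,
     * a pointer to the ops rather than "first entry of the list".
     */
    struct tcp_congestion_ops *tcp_congestion_control;

    /* namespace setup: new namespaces inherit the root's setting */
    net-&gt;ipv4.tcp_congestion_control = init_net.ipv4.tcp_congestion_control;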

Signed-off-by: Stephen Hemminger &lt;sthemmin@microsoft.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: retire FACK loss detection</title>
<updated>2017-11-11T09:53:16+00:00</updated>
<author>
<name>Yuchung Cheng</name>
<email>ycheng@google.com</email>
</author>
<published>2017-11-08T21:01:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=713bafea92920103cd3d361657406cf04d0e22dd'/>
<id>urn:sha1:713bafea92920103cd3d361657406cf04d0e22dd</id>
<content type='text'>
FACK loss detection has been disabled by default, and its successor
RACK subsumes FACK and can handle reordering better.
This patch removes FACK to simplify TCP loss recovery.

Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Reviewed-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Reviewed-by: Priyaranjan Jha &lt;priyarjha@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: Namespace-ify sysctl_tcp_rmem and sysctl_tcp_wmem</title>
<updated>2017-11-10T05:34:58+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2017-11-07T08:29:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=356d1833b638bd465672aefeb71def3ab93fc17d'/>
<id>urn:sha1:356d1833b638bd465672aefeb71def3ab93fc17d</id>
<content type='text'>
Note that when a new netns is created, it inherits its
sysctl_tcp_rmem and sysctl_tcp_wmem from the initial netns.

This change is needed so that we can refine TCP rcvbuf autotuning,
to take RTT into consideration.
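
A sketch of the inheritance described above (placement in the netns
init path is assumed; not the literal diff):

    /* copy the 3-element min/default/max arrays from the initial
     * namespace so new namespaces start with the same buffer sizing
     */
    if (!net_eq(net, &amp;init_net)) {
            memcpy(net-&gt;ipv4.sysctl_tcp_rmem,
                   init_net.ipv4.sysctl_tcp_rmem,
                   sizeof(init_net.ipv4.sysctl_tcp_rmem));
            memcpy(net-&gt;ipv4.sysctl_tcp_wmem,
                   init_net.ipv4.sysctl_tcp_wmem,
                   sizeof(init_net.ipv4.sysctl_tcp_wmem));
    }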

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Wei Wang &lt;weiwan@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: higher throughput under reordering with adaptive RACK reordering wnd</title>
<updated>2017-11-05T14:15:42+00:00</updated>
<author>
<name>Priyaranjan Jha</name>
<email>priyarjha@google.com</email>
</author>
<published>2017-11-03T23:38:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1f2556916d974cfb62b6af51660186b5f58bd869'/>
<id>urn:sha1:1f2556916d974cfb62b6af51660186b5f58bd869</id>
<content type='text'>
Currently TCP RACK loss detection does not work well if packets are
being reordered beyond its static reordering window (min_rtt/4). Under
such reordering it may falsely trigger loss recoveries and reduce TCP
throughput significantly.

This patch improves that by increasing and reducing the reordering
window based on DSACK, which is now supported in major TCP implementations.
It makes RACK's reo_wnd adaptive based on DSACK and no. of recoveries.

- If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
  by srtt), since there is possibility that spurious retransmission was
  due to reordering delay longer than reo_wnd.

- Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
  no. of successful recoveries (accounts for full DSACK-based loss
  recovery undo). After that, reset it to default (min_rtt/4).

- reo_wnd is incremented at most once per RTT, so that the new DSACK
  we are reacting to is (approximately) due to a spurious retransmission
  sent after reo_wnd was last updated.

- reo_wnd is tracked in terms of steps (of min_rtt/4), rather than as an
  absolute value, to account for changes in rtt.

In our internal testing, we observed a significant increase in throughput
in scenarios where reordering exceeds min_rtt/4 (the previous static value).
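
A sketch of the resulting reordering-window computation (field names
such as rack.reo_wnd_steps are assumptions for illustration):

    /* reo_wnd grows in steps of min_rtt/4 (at most one step per RTT
     * on DSACK), is capped at srtt, and falls back to one step after
     * 16 successful recoveries.
     */
    u32 reo_wnd = (tcp_min_rtt(tp) &gt;&gt; 2) * tp-&gt;rack.reo_wnd_steps;

    return min_t(u32, reo_wnd, tp-&gt;srtt_us &gt;&gt; 3);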

Signed-off-by: Priyaranjan Jha &lt;priyarjha@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2017-11-02T06:23:39+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2017-11-02T05:59:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ed29668d1aa2c6f01e61dd616df13b5241cee7e0'/>
<id>urn:sha1:ed29668d1aa2c6f01e61dd616df13b5241cee7e0</id>
<content type='text'>
Smooth Cong Wang's bug fix into 'net-next'.  Basically put
the bulk of the tcf_block_put() logic from 'net' into
tcf_block_put_ext(), but after the offload unbind.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: fix tcp_mtu_probe() vs highest_sack</title>
<updated>2017-11-01T12:18:34+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2017-10-31T06:08:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2b7cda9c35d3b940eb9ce74b30bbd5eb30db493d'/>
<id>urn:sha1:2b7cda9c35d3b940eb9ce74b30bbd5eb30db493d</id>
<content type='text'>
Based on SNMP values provided by Roman, Yuchung made the observation
that some crashes in tcp_sacktag_walk() might be caused by MTU probing.

Looking at tcp_mtu_probe(), I found that when a new skb was placed
in front of the write queue, we were not updating tcp highest sack.

If one skb is freed because all its content was copied to the new skb
(for MTU probing), then tp-&gt;highest_sack could point to a now freed skb.

Bad things would then happen, including infinite loops.

This patch renames tcp_highest_sack_combine() and uses it
from tcp_mtu_probe() to fix the bug.

Note that I also removed one test against tp-&gt;sacked_out,
since we want to replace tp-&gt;highest_sack unconditionally:
keeping a stale pointer to a freed skb is a recipe
for disaster.
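
A sketch of the renamed helper and its new call site (the new name is
assumed here; may differ slightly from the exact diff):

    /* formerly tcp_highest_sack_combine(); the tp-&gt;sacked_out test
     * is gone so the pointer is replaced unconditionally.
     */
    static inline void tcp_highest_sack_replace(struct sock *sk,
                                                struct sk_buff *old,
                                                struct sk_buff *new)
    {
            if (old == tcp_sk(sk)-&gt;highest_sack)
                    tcp_sk(sk)-&gt;highest_sack = new;
    }

    /* tcp_mtu_probe(): keep highest_sack valid when the probe skb
     * absorbs and frees skbs at the head of the write queue.
     */
    tcp_highest_sack_replace(sk, skb, nskb);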

Fixes: a47e5a988a57 ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Alexei Starovoitov &lt;alexei.starovoitov@gmail.com&gt;
Reported-by: Roman Gushchin &lt;guro@fb.com&gt;
Reported-by: Oleksandr Natalenko &lt;oleksandr@natalenko.name&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Acked-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
