<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/net/ipv4, branch linux-5.9.y</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=linux-5.9.y</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=linux-5.9.y'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2020-12-21T12:28:17+00:00</updated>
<entry>
<title>tcp: fix cwnd-limited bug for TSO deferral where we send nothing</title>
<updated>2020-12-21T12:28:17+00:00</updated>
<author>
<name>Neal Cardwell</name>
<email>ncardwell@google.com</email>
</author>
<published>2020-12-09T03:57:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f9c84d022929be7c15747ab1fc8696e5271ccc60'/>
<id>urn:sha1:f9c84d022929be7c15747ab1fc8696e5271ccc60</id>
<content type='text'>
[ Upstream commit 299bcb55ecd1412f6df606e9dc0912d55610029e ]

When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
into persistent scenarios where we have the following sequence:

(1) ACK for full-sized skb of N*MSS arrives
  -&gt; tcp_write_xmit() transmit full-sized skb with N*MSS
  -&gt; move pacing release time forward
  -&gt; exit tcp_write_xmit() because pacing time is in the future

(2) TSQ callback or TCP internal pacing timer fires
  -&gt; try to transmit next skb, but TSO deferral finds remainder of
     available cwnd is not big enough to trigger an immediate send
     now, so we defer sending until the next ACK.

(3) repeat...

So we can get into a case where we never mark ourselves as
cwnd-limited for many seconds at a time, even with
bulk/infinite-backlog senders, because:

o In case (1) above, every time in tcp_write_xmit() we have enough
cwnd to send a full-sized skb, we are not fully using the cwnd
(because cwnd is not a multiple of the TSO skb size). So every time we
send data, we are not cwnd limited, and so in the cwnd-limited
tracking code in tcp_cwnd_validate() we mark ourselves as not
cwnd-limited.

o In case (2) above, every time in tcp_write_xmit() that we try to
transmit the "remainder" of the cwnd but defer, we set the local
variable is_cwnd_limited to true, but we do not send any packets, so
sent_pkts is zero, so we don't call the cwnd-limited logic to update
tp-&gt;is_cwnd_limited.

Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Reported-by: Ingemar Johansson &lt;ingemar.s.johansson@ericsson.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>tcp: select sane initial rcvq_space.space for big MSS</title>
<updated>2020-12-21T12:28:17+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2020-12-08T16:21:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=312238bb10bf0544fe92a945e7c5600c6ce3ba1b'/>
<id>urn:sha1:312238bb10bf0544fe92a945e7c5600c6ce3ba1b</id>
<content type='text'>
[ Upstream commit 72d05c00d7ecda85df29abd046da7e41cc071c17 ]

Before commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS.

This is no longer the case, and Hazem Mohamed Abuelfotoh reported
that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames.

Root cause is that tcp_init_buffer_space() uses tp-&gt;rcv_wnd for upper limit
of rcvq_space.space computation, while it can select later a smaller
value for tp-&gt;rcv_ssthresh and tp-&gt;window_clamp.

ss -temoi on receiver would show :

skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596

This means that TCP can not increase its window in tcp_grow_window(),
and that DRS can never kick.

Fix this by making sure that rcvq_space.space is not bigger than number of bytes
that can be held in TCP receive queue.

People unable/unwilling to change their kernel can work around this issue by
selecting a bigger tcp_rmem[1] value as in :

echo "4096 196608 6291456" &gt;/proc/sys/net/ipv4/tcp_rmem

Based on an initial report and patch from Hazem Mohamed Abuelfotoh
 https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/

Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
Reported-by: Hazem Mohamed Abuelfotoh &lt;abuehaze@amazon.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>udp: fix the proto value passed to ip_protocol_deliver_rcu for the segments</title>
<updated>2020-12-21T12:28:15+00:00</updated>
<author>
<name>Xin Long</name>
<email>lucien.xin@gmail.com</email>
</author>
<published>2020-12-07T07:55:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=41097c9c282156ca6fe5d514d2b20913648c988e'/>
<id>urn:sha1:41097c9c282156ca6fe5d514d2b20913648c988e</id>
<content type='text'>
[ Upstream commit 10c678bd0a035ac2c64a9b26b222f20556227a53 ]

Guillaume noticed that: for segments udp_queue_rcv_one_skb() returns the
proto, and it should pass "ret" unmodified to ip_protocol_deliver_rcu().
Otherwize, with a negtive value passed, it will underflow inet_protos.

This can be reproduced with IPIP FOU:

  # ip fou add port 5555 ipproto 4
  # ethtool -K eth1 rx-gro-list on

Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
Reported-by: Guillaume Nault &lt;gnault@redhat.com&gt;
Signed-off-by: Xin Long &lt;lucien.xin@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ipv4: fix error return code in rtm_to_fib_config()</title>
<updated>2020-12-21T12:28:15+00:00</updated>
<author>
<name>Zhang Changzhong</name>
<email>zhangchangzhong@huawei.com</email>
</author>
<published>2020-12-04T08:48:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fa5bef34ce8a49be43847a611bf241b03c371bc4'/>
<id>urn:sha1:fa5bef34ce8a49be43847a611bf241b03c371bc4</id>
<content type='text'>
[ Upstream commit b410f04eb5b482b5efc4eee90de81ad35d3d923b ]

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: d15662682db2 ("ipv4: Allow ipv6 gateway with ipv4 routes")
Reported-by: Hulk Robot &lt;hulkci@huawei.com&gt;
Signed-off-by: Zhang Changzhong &lt;zhangchangzhong@huawei.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Link: https://lore.kernel.org/r/1607071695-33740-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ipv4: Fix tos mask in inet_rtm_getroute()</title>
<updated>2020-12-08T09:42:01+00:00</updated>
<author>
<name>Guillaume Nault</name>
<email>gnault@redhat.com</email>
</author>
<published>2020-11-26T18:09:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f9309d52c9e2d5d8c8bf218d846be7869abccf5d'/>
<id>urn:sha1:f9309d52c9e2d5d8c8bf218d846be7869abccf5d</id>
<content type='text'>
[ Upstream commit 1ebf179037cb46c19da3a9c1e2ca16e7a754b75e ]

When inet_rtm_getroute() was converted to use the RCU variants of
ip_route_input() and ip_route_output_key(), the TOS parameters
stopped being masked with IPTOS_RT_MASK before doing the route lookup.

As a result, "ip route get" can return a different route than what
would be used when sending real packets.

For example:

    $ ip route add 192.0.2.11/32 dev eth0
    $ ip route add unreachable 192.0.2.11/32 tos 2
    $ ip route get 192.0.2.11 tos 2
    RTNETLINK answers: No route to host

But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would
actually be routed using the first route:

    $ ping -c 1 -Q 2 192.0.2.11
    PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
    64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms

    --- 192.0.2.11 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms

This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to
return results consistent with real route lookups.

Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
Signed-off-by: Guillaume Nault &lt;gnault@redhat.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control</title>
<updated>2020-12-08T09:42:00+00:00</updated>
<author>
<name>Alexander Duyck</name>
<email>alexanderduyck@fb.com</email>
</author>
<published>2020-11-19T21:23:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=bd2585b7c821bfc81f7badbf04634d6cb988c722'/>
<id>urn:sha1:bd2585b7c821bfc81f7badbf04634d6cb988c722</id>
<content type='text'>
[ Upstream commit 55472017a4219ca965a957584affdb17549ae4a4 ]

When setting congestion control via a BPF program it is seen that the
SYN/ACK for packets within a given flow will not include the ECT0 flag. A
bit of simple printk debugging shows that when this is configured without
BPF we will see the value INET_ECN_xmit value initialized in
tcp_assign_congestion_control however when we configure this via BPF the
socket is in the closed state and as such it isn't configured, and I do not
see it being initialized when we transition the socket into the listen
state. The result of this is that the ECT0 bit is configured based on
whatever the default state is for the socket.

Any easy way to reproduce this is to monitor the following with tcpdump:
tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca

Without this patch the SYN/ACK will follow whatever the default is. If dctcp
all SYN/ACK packets will have the ECT0 bit set, and if it is not then ECT0
will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK
bit matches the value seen on the other packets in the given stream.

Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck &lt;alexanderduyck@fb.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ipv4: use IS_ENABLED instead of ifdef</title>
<updated>2020-12-02T07:51:43+00:00</updated>
<author>
<name>Florian Klink</name>
<email>flokli@flokli.de</email>
</author>
<published>2020-11-15T22:45:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f36abf80e9a78f37480bdc4880307f2b76f64ebc'/>
<id>urn:sha1:f36abf80e9a78f37480bdc4880307f2b76f64ebc</id>
<content type='text'>
commit c09c8a27b9baa417864b9adc3228b10ae5eeec93 upstream.

Checking for ifdef CONFIG_x fails if CONFIG_x=m.

Use IS_ENABLED instead, which is true for both built-ins and modules.

Otherwise, a
&gt; ip -4 route add 1.2.3.4/32 via inet6 fe80::2 dev eth1
fails with the message "Error: IPv6 support not enabled in kernel." if
CONFIG_IPV6 is `m`.

In the spirit of b8127113d01e53adba15b41aefd37b90ed83d631.

Fixes: d15662682db2 ("ipv4: Allow ipv6 gateway with ipv4 routes")
Cc: Kim Phillips &lt;kim.phillips@arm.com&gt;
Signed-off-by: Florian Klink &lt;flokli@flokli.de&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Link: https://lore.kernel.org/r/20201115224509.2020651-1-flokli@flokli.de
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect</title>
<updated>2020-11-24T12:39:03+00:00</updated>
<author>
<name>John Fastabend</name>
<email>john.fastabend@gmail.com</email>
</author>
<published>2020-11-16T22:28:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b7372ebc805f4cbf83aac47f822353e8b662e11e'/>
<id>urn:sha1:b7372ebc805f4cbf83aac47f822353e8b662e11e</id>
<content type='text'>
[ Upstream commit 36cd0e696a832a00247fca522034703566ac8885 ]

Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This
allows users to tune SO_RCVBUF and sockmap will honor them.

We can refactor the if(charge) case out in later patches. But, keep this
fix to the point.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Suggested-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Signed-off-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Reviewed-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made</title>
<updated>2020-11-24T12:39:03+00:00</updated>
<author>
<name>John Fastabend</name>
<email>john.fastabend@gmail.com</email>
</author>
<published>2020-11-16T22:27:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d7403309e38c79dca6fa3df30a0a14d43fa2fa0b'/>
<id>urn:sha1:d7403309e38c79dca6fa3df30a0a14d43fa2fa0b</id>
<content type='text'>
[ Upstream commit c9c89dcd872ea33327673fcb97398993a1f22736 ]

If copy_page_to_iter() fails or even partially completes, but with fewer
bytes copied than expected we currently reset sg.start and return EFAULT.
This proves problematic if we already copied data into the user buffer
before we return an error. Because we leave the copied data in the user
buffer and fail to unwind the scatterlist so kernel side believes data
has been copied and user side believes data has _not_ been received.

Expected behavior should be to return number of bytes copied and then
on the next read we need to return the error assuming its still there. This
can happen if we have a copy length spanning multiple scatterlist elements
and one or more complete before the error is hit.

The error is rare enough though that my normal testing with server side
programs, such as nginx, httpd, envoy, etc., I have never seen this. The
only reliable way to reproduce that I've found is to stream movies over
my browser for a day or so and wait for it to hang. Not very scientific,
but with a few extra WARN_ON()s in the code the bug was obvious.

When we review the errors from copy_page_to_iter() it seems we are hitting
a page fault from copy_page_to_iter_iovec() where the code checks
fault_in_pages_writeable(buf, copy) where buf is the user buffer. It
also seems typical server applications don't hit this case.

The other way to try and reproduce this is run the sockmap selftest tool
test_sockmap with data verification enabled, but it doesn't reproduce the
fault. Perhaps we can trigger this case artificially somehow from the
test tools. I haven't sorted out a way to do that yet though.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Reviewed-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Link: https://lore.kernel.org/bpf/160556566659.73229.15694973114605301063.stgit@john-XPS-13-9370
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: only postpone PROBE_RTT if RTT is &lt; current min_rtt estimate</title>
<updated>2020-11-24T12:38:50+00:00</updated>
<author>
<name>Ryan Sharpelletti</name>
<email>sharpelletti@google.com</email>
</author>
<published>2020-11-16T17:44:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=354eaccbeb62336adde2a15fd177b02fed99d39f'/>
<id>urn:sha1:354eaccbeb62336adde2a15fd177b02fed99d39f</id>
<content type='text'>
[ Upstream commit 1b9e2a8c99a5c021041bfb2d512dc3ed92a94ffd ]

During loss recovery, retransmitted packets are forced to use TCP
timestamps to calculate the RTT samples, which have a millisecond
granularity. BBR is designed using a microsecond granularity. As a
result, multiple RTT samples could be truncated to the same RTT value
during loss recovery. This is problematic, as BBR will not enter
PROBE_RTT if the RTT sample is &lt;= the current min_rtt sample, meaning
that if there are persistent losses, PROBE_RTT will constantly be
pushed off and potentially never re-entered. This patch makes sure
that BBR enters PROBE_RTT by checking if RTT sample is &lt; the current
min_rtt sample, rather than &lt;=.

The Netflix transport/TCP team discovered this bug in the Linux TCP
BBR code during lab tests.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Ryan Sharpelletti &lt;sharpelletti@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Link: https://lore.kernel.org/r/20201116174412.1433277-1-sharpelletti.kdev@gmail.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
</feed>
