<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/uapi/linux/tcp.h, branch linux-5.11.y</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=linux-5.11.y</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=linux-5.11.y'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2020-12-04T21:40:53+00:00</updated>
<entry>
<title>net-zerocopy: Defer vm zap unless actually needed.</title>
<updated>2020-12-04T21:40:53+00:00</updated>
<author>
<name>Arjun Roy</name>
<email>arjunroy@google.com</email>
</author>
<published>2020-12-02T22:53:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=94ab9eb9b234ddf23af04a4bc7e8db68e67b8778'/>
<id>urn:sha1:94ab9eb9b234ddf23af04a4bc7e8db68e67b8778</id>
<content type='text'>
Zapping pages is required only if we are calling vm_insert_page into a
region where pages had previously been mapped. Receive zerocopy allows
reusing such regions, and hitherto called zap_page_range() before
calling vm_insert_page() in that range.

zap_page_range() can also be triggered from userspace with
madvise(MADV_DONTNEED). If userspace is configured to call this before
reusing a segment, or if there was nothing mapped at this virtual
address to begin with, we can avoid calling zap_page_range() under the
socket lock. That said, if userspace does not do that, then we are
still responsible for calling zap_page_range().

This patch adds a flag that the user can use to hint to the kernel
that a zap is not required. If the flag is not set, or if an older
user application does not have a flags field at all, then the kernel
calls zap_page_range as before. Also, if the flag is set but a zap is
still required, the kernel performs that zap as necessary. Thus
incorrectly indicating that a zap can be avoided does not change the
correctness of operation. It also increases the batch size for
vm_insert_pages and prefetches the page struct for the batch since
we're about to bump the refcount.

An alternative mechanism could be to not have a flag, assume by
default a zap is not needed, and fall back to zapping if needed.
However, this would harm performance for older applications for which
a zap is necessary, and thus we implement it with an explicit flag
so newer applications can opt in.

When using RPC-style traffic with medium sized (tens of KB) RPCs, this
change yields an efficiency improvement of about 30% for QPS/CPU usage.
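
As a rough illustration of the decision described above, the following
Python sketch models when a zap is still performed. It is not kernel code:
must_zap and both of its parameters are hypothetical names chosen for this
example only.

```python
def must_zap(hint_no_zap, range_has_mappings):
    """Model the zap decision for one receive-zerocopy call.

    hint_no_zap: userspace set the flag hinting that no zap is required.
    range_has_mappings: pages are still mapped in the target address range.
    """
    if not hint_no_zap:
        # No flag, or an older application without a flags field:
        # behave as before and always zap.
        return True
    # Flag set: zap only if something is in fact still mapped, so an
    # incorrect hint cannot affect correctness.
    return range_has_mappings
```

Only the combination "hint set, nothing mapped" skips the zap; every other
combination keeps the previous behaviour.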

Signed-off-by: Arjun Roy &lt;arjunroy@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.</title>
<updated>2020-12-04T21:40:52+00:00</updated>
<author>
<name>Arjun Roy</name>
<email>arjunroy@google.com</email>
</author>
<published>2020-12-02T22:53:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=18fb76ed53865c1b5d5f0157b1b825704590beb5'/>
<id>urn:sha1:18fb76ed53865c1b5d5f0157b1b825704590beb5</id>
<content type='text'>
When TCP receive zerocopy does not successfully map the entire
requested space, it outputs a 'hint' that the caller should recvmsg().

Augment zerocopy to accept a user buffer that it tries to copy this
hint into - if it is possible to copy the entire hint, it will do so.
This elides a recvmsg() call for received traffic that isn't exactly
page-aligned in size.

This was tested with RPC-style traffic of arbitrary sizes. Normally,
each received message required at least one getsockopt() call, and one
recvmsg() call for the remaining unaligned data.

With this change, almost all of the recvmsg() calls are eliminated,
leading to a savings of about 25%-50% in number of system calls
for RPC-style workloads.
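
The arithmetic behind the saving can be sketched in Python. This is a
simplified model, not the kernel logic: it assumes a 4096-byte page size and
a message that starts on a page boundary, and plan_receive is a hypothetical
helper name.

```python
PAGE_SIZE = 4096  # assumed page size; real systems may differ


def plan_receive(msg_len, copybuf_len):
    """Return (mapped_bytes, copied_bytes, needs_recvmsg) for one message."""
    full_pages, straggler = divmod(msg_len, PAGE_SIZE)
    mapped = full_pages * PAGE_SIZE
    # The straggler is copied only if all of it fits in the user buffer.
    fits = min(straggler, copybuf_len) == straggler
    copied = straggler if fits else 0
    # A follow-up recvmsg() is needed only when unaligned data was left over.
    needs_recvmsg = straggler != 0 and not fits
    return mapped, copied, needs_recvmsg
```

In this model a 10000-byte message maps two pages and leaves 1808 bytes,
which a large enough copy buffer absorbs without a second system call.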

Signed-off-by: Arjun Roy &lt;arjunroy@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: add earliest departure time to SCM_TIMESTAMPING_OPT_STATS</title>
<updated>2020-08-01T00:00:44+00:00</updated>
<author>
<name>Yousuk Seung</name>
<email>ysseung@google.com</email>
</author>
<published>2020-07-30T22:44:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=48040793fa6003d211f021c6ad273477bcd90d91'/>
<id>urn:sha1:48040793fa6003d211f021c6ad273477bcd90d91</id>
<content type='text'>
This change adds TCP_NLA_EDT to SCM_TIMESTAMPING_OPT_STATS that reports
the earliest departure time (EDT) of the timestamped skb. By tracking EDT
values of the skb from different timestamps, we can observe when and how
much the value changed. This allows measuring the precise delay
injected on the sender host, e.g. by a BPF-based throttler.

Signed-off-by: Yousuk Seung &lt;ysseung@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Acked-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: add bytes not sent to SCM_TIMESTAMPING_OPT_STATS</title>
<updated>2020-03-10T00:56:33+00:00</updated>
<author>
<name>Yousuk Seung</name>
<email>ysseung@google.com</email>
</author>
<published>2020-03-09T20:16:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e08ab0b377a1489760533424437c5f4be7f484a4'/>
<id>urn:sha1:e08ab0b377a1489760533424437c5f4be7f484a4</id>
<content type='text'>
Add TCP_NLA_BYTES_NOTSENT to SCM_TIMESTAMPING_OPT_STATS that reports
bytes in the write queue but not sent. This is the same metric as
what is exported with tcp_info.tcpi_notsent_bytes.

Signed-off-by: Yousuk Seung &lt;ysseung@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Acked-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp-zerocopy: Return sk_err (if set) along with tcp receive zerocopy.</title>
<updated>2020-02-17T03:25:02+00:00</updated>
<author>
<name>Arjun Roy</name>
<email>arjunroy@google.com</email>
</author>
<published>2020-02-14T23:30:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=33946518d493cdf10aedb4a483f1aa41948a3dab'/>
<id>urn:sha1:33946518d493cdf10aedb4a483f1aa41948a3dab</id>
<content type='text'>
This patchset is intended to reduce the number of extra system calls
imposed by TCP receive zerocopy. For ping-pong RPC style workloads,
this patchset has demonstrated a system call reduction of about 30%
when coupled with userspace changes.

For applications using epoll, returning sk_err along with the result
of tcp receive zerocopy could remove the need for a recvmsg() call that
returns -EAGAIN after a spurious wakeup.

Consider a multi-threaded application using epoll. A thread may awaken
with EPOLLIN but another thread may already be reading. The
spuriously-awoken thread does not necessarily know that another thread
'won'; rather, it may be possible that it was woken up due to the
presence of an error if there is no data. A zerocopy read receiving 0
bytes thus would need to be followed up by recvmsg to be sure.

Instead, we return sk_err directly with zerocopy, so the application
can avoid this extra system call.
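
A minimal Python sketch of how an application might interpret the result,
assuming it receives the number of bytes read together with the socket error
value. The function name and the returned labels are hypothetical.

```python
def classify_zerocopy_result(bytes_read, sk_err):
    """Classify the outcome of one zerocopy read that also reports sk_err."""
    if bytes_read != 0:
        return "data"
    if sk_err != 0:
        # The thread was woken because of a socket error.
        return "error"
    # No data and no error: another thread already consumed the data.
    return "spurious"
```

In the "spurious" case no follow-up recvmsg() is required.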

Signed-off-by: Arjun Roy &lt;arjunroy@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp-zerocopy: Return inq along with tcp receive zerocopy.</title>
<updated>2020-02-17T03:25:02+00:00</updated>
<author>
<name>Arjun Roy</name>
<email>arjunroy@google.com</email>
</author>
<published>2020-02-14T23:30:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c8856c051454909e5059df4e81c77b9c366c5515'/>
<id>urn:sha1:c8856c051454909e5059df4e81c77b9c366c5515</id>
<content type='text'>
This patchset is intended to reduce the number of extra system calls
imposed by TCP receive zerocopy. For ping-pong RPC style workloads,
this patchset has demonstrated a system call reduction of about 30%
when coupled with userspace changes.

For applications using edge-triggered epoll, returning inq along with
the result of tcp receive zerocopy could remove the need for a recvmsg()
call that returns -EAGAIN after a successful zerocopy. Generally speaking,
since normally we would need to perform a recvmsg() call for every
successful small RPC read via TCP receive zerocopy, returning inq can
reduce the number of system calls performed by approximately half.
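
The "approximately half" estimate can be illustrated with a small Python
model of a single small RPC under edge-triggered epoll. It is a sketch under
stated assumptions: the function name is hypothetical, and None stands for a
kernel that does not report inq.

```python
def syscalls_for_small_rpc(inq_after_read=None):
    """Count system calls needed to read one small RPC."""
    calls = 1  # the zerocopy getsockopt() that reads the RPC
    if inq_after_read is None or inq_after_read != 0:
        # Without inq, or with data still queued, another receive call
        # is needed before going back to epoll.
        calls += 1
    return calls
```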

Signed-off-by: Arjun Roy &lt;arjunroy@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: export count for rehash attempts</title>
<updated>2020-01-26T14:28:47+00:00</updated>
<author>
<name>Abdul Kabbani</name>
<email>akabbani@google.com</email>
</author>
<published>2020-01-24T21:34:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=32efcc06d2a15fa87585614d12d6c2308cc2d3f3'/>
<id>urn:sha1:32efcc06d2a15fa87585614d12d6c2308cc2d3f3</id>
<content type='text'>
Using IPv6 flow-label to swiftly route around congested or
disconnected network paths can greatly improve TCP reliability.

This patch adds SNMP counters and an OPT_STATS counter to track both
host-level and connection-level statistics. Network administrators
can use these counters to better evaluate the impact of this new ability.

Export count for rehash attempts to
1) two SNMP counters: TcpTimeoutRehash (rehash due to timeouts),
   and TcpDuplicateDataRehash (rehash due to receiving duplicate
   packets)
2) Timestamping API SOF_TIMESTAMPING_OPT_STATS.

Signed-off-by: Abdul Kabbani &lt;akabbani@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Kevin(Yudong) Yang &lt;yyd@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: Add device index to tcp_md5sig</title>
<updated>2020-01-02T23:51:22+00:00</updated>
<author>
<name>David Ahern</name>
<email>dsahern@gmail.com</email>
</author>
<published>2019-12-30T22:14:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6b102db50cdde3ba2f78631ed21222edf3a5fb51'/>
<id>urn:sha1:6b102db50cdde3ba2f78631ed21222edf3a5fb51</id>
<content type='text'>
Add support for userspace to specify a device index to limit the scope
of an entry via the TCP_MD5SIG_EXT setsockopt. The existing __tcpm_pad
is renamed to tcpm_ifindex and the new field is only checked if the new
TCP_MD5SIG_FLAG_IFINDEX is set in tcpm_flags. For now, the device index
must point to an L3 master device (e.g., VRF). The API and error
handling are set up to allow the constraint to be relaxed in the future
to any device index.
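
For illustration, a Python ctypes sketch of a scoped entry. Only
tcpm_ifindex, tcpm_flags and TCP_MD5SIG_FLAG_IFINDEX are named in this
message; the remaining field names, the 128-byte address storage, the flag
value and the key length are assumptions about the structure layout, and
scoped_entry is a hypothetical helper.

```python
import ctypes

TCP_MD5SIG_MAXKEYLEN = 80      # assumed maximum key length
TCP_MD5SIG_FLAG_IFINDEX = 0x2  # assumed flag value


class TcpMd5Sig(ctypes.Structure):
    """Assumed layout of struct tcp_md5sig after this change."""
    _fields_ = [
        ("tcpm_addr", ctypes.c_uint8 * 128),  # address storage, opaque here
        ("tcpm_flags", ctypes.c_uint8),
        ("tcpm_prefixlen", ctypes.c_uint8),
        ("tcpm_keylen", ctypes.c_uint16),
        ("tcpm_ifindex", ctypes.c_int),       # formerly __tcpm_pad
        ("tcpm_key", ctypes.c_uint8 * TCP_MD5SIG_MAXKEYLEN),
    ]


def scoped_entry(key, ifindex):
    """Build an entry whose scope is limited to the device ifindex."""
    entry = TcpMd5Sig()
    # The kernel checks tcpm_ifindex only when this flag is set.
    entry.tcpm_flags |= TCP_MD5SIG_FLAG_IFINDEX
    entry.tcpm_ifindex = ifindex
    entry.tcpm_keylen = len(key)
    for i, byte in enumerate(key):
        entry.tcpm_key[i] = byte
    return entry
```

Because the new field reuses the former padding, the structure size is
unchanged in this model.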

Signed-off-by: David Ahern &lt;dsahern@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: add TCP_INFO status for failed client TFO</title>
<updated>2019-10-26T02:25:37+00:00</updated>
<author>
<name>Jason Baron</name>
<email>jbaron@akamai.com</email>
</author>
<published>2019-10-23T15:09:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=480274787d7e3458bc5a7cfbbbe07033984ad711'/>
<id>urn:sha1:480274787d7e3458bc5a7cfbbbe07033984ad711</id>
<content type='text'>
The TCPI_OPT_SYN_DATA bit as part of tcpi_options currently reports whether
or not data-in-SYN was ack'd on both the client and server side. We'd like
to gather more information on the client-side in the failure case in order
to indicate the reason for the failure. This can be useful for not only
debugging TFO, but also for creating TFO socket policies. For example, if
a middle box removes the TFO option or drops a data-in-SYN, we can
detect this case and turn off TFO for these connections, saving the
extra retransmits.

The newly added tcpi_fastopen_client_fail status is 2 bits and has the
following 4 states:

1) TFO_STATUS_UNSPEC

Catch-all state which includes when TFO is disabled via black hole
detection, which is indicated via LINUX_MIB_TCPFASTOPENBLACKHOLE.

2) TFO_COOKIE_UNAVAILABLE

If TFO_CLIENT_NO_COOKIE mode is off, this state indicates that no cookie
is available in the cache.

3) TFO_DATA_NOT_ACKED

Data was sent with SYN, we received a SYN/ACK but it did not cover the data
portion. Cookie is not accepted by server because the cookie may be invalid
or the server may be overloaded.

4) TFO_SYN_RETRANSMITTED

Data was sent with SYN, we received a SYN/ACK which did not cover the data
after at least 1 additional SYN was sent (without data). It may be the case
that a middle-box is dropping data-in-SYN packets. Thus, it would be more
efficient to not use TFO on this connection to avoid extra retransmits
during connection establishment.

These new fields do not cover all the cases where TFO may fail, but other
failures, such as SYN/ACK + data being dropped, will result in the
connection not becoming established. And a connection blackhole after
session establishment shows up as a stalled connection.
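
The four states and one possible socket policy can be sketched in Python.
The numeric values are assumed to follow the order of the list above, and
should_disable_tfo is a hypothetical policy, not part of this change.

```python
from enum import IntEnum


class TfoClientFail(IntEnum):
    # Values assumed to follow the order in which the states are listed.
    TFO_STATUS_UNSPEC = 0
    TFO_COOKIE_UNAVAILABLE = 1
    TFO_DATA_NOT_ACKED = 2
    TFO_SYN_RETRANSMITTED = 3


def should_disable_tfo(status):
    """Hypothetical policy: stop using TFO when a middle box is suspected."""
    return TfoClientFail(status) == TfoClientFail.TFO_SYN_RETRANSMITTED
```

All four values fit in the 2-bit field, and only the retransmitted-SYN state
triggers the policy in this sketch.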

Signed-off-by: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Neal Cardwell &lt;ncardwell@google.com&gt;
Cc: Christoph Paasch &lt;cpaasch@apple.com&gt;
Cc: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: Add snd_wnd to TCP_INFO</title>
<updated>2019-09-16T14:26:11+00:00</updated>
<author>
<name>Thomas Higdon</name>
<email>tph@fb.com</email>
</author>
<published>2019-09-13T23:23:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8f7baad7f03543451af27f5380fc816b008aa1f2'/>
<id>urn:sha1:8f7baad7f03543451af27f5380fc816b008aa1f2</id>
<content type='text'>
Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
performance problems --
&gt; (1) Usually when we're diagnosing TCP performance problems, we do so
&gt; from the sender, since the sender makes most of the
&gt; performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
&gt; From the sender-side the thing that would be most useful is to see
&gt; tp-&gt;snd_wnd, the receive window that the receiver has advertised to
&gt; the sender.

This serves the purpose of adding an additional __u32 to avoid the
would-be hole caused by the addition of the tcpi_rcv_ooopack field.
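
The padding argument can be checked with a small Python ctypes model. The
two structures are stand-ins rather than the real struct tcp_info, and the
field names are illustrative; the result assumes an ABI that aligns 64-bit
integers to 8 bytes.

```python
import ctypes


class WithoutSndWnd(ctypes.Structure):
    _fields_ = [
        ("some_u64", ctypes.c_uint64),     # stand-in for earlier 64-bit fields
        ("rcv_ooopack", ctypes.c_uint32),  # lone 32-bit field leaves a hole
    ]


class WithSndWnd(ctypes.Structure):
    _fields_ = [
        ("some_u64", ctypes.c_uint64),
        ("rcv_ooopack", ctypes.c_uint32),
        ("snd_wnd", ctypes.c_uint32),      # occupies the would-be padding
    ]


def sizes():
    """Return the sizes of both layouts for comparison."""
    return ctypes.sizeof(WithoutSndWnd), ctypes.sizeof(WithSndWnd)
```

On such an ABI both layouts have the same size, so the extra field costs no
additional space.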

Signed-off-by: Thomas Higdon &lt;tph@fb.com&gt;
Acked-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
