<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/net/inet_connection_sock.h, branch linux-6.5.y</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=linux-6.5.y</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=linux-6.5.y'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2022-08-25T02:30:07+00:00</updated>
<entry>
<title>net: Add a bhash2 table hashed by port and address</title>
<updated>2022-08-25T02:30:07+00:00</updated>
<author>
<name>Joanne Koong</name>
<email>joannelkoong@gmail.com</email>
</author>
<published>2022-08-22T18:10:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=28044fc1d4953b07acec0da4d2fc4784c57ea6fb'/>
<id>urn:sha1:28044fc1d4953b07acec0da4d2fc4784c57ea6fb</id>
<content type='text'>
The current bind hashtable (bhash) is hashed by port only.
In the socket bind path, we have to check for bind conflicts by
traversing the specified port's inet_bind_bucket while holding the
hashbucket's spinlock (see inet_csk_get_port() and
inet_csk_bind_conflict()). In instances where there are tons of
sockets hashed to the same port at different addresses, the bind
conflict check is time-intensive and can cause softirq cpu lockups,
as well as stops new tcp connections since __inet_inherit_port()
also contests for the spinlock.

This patch adds a second bind table, bhash2, that hashes by
port and sk-&gt;sk_rcv_saddr (ipv4) and sk-&gt;sk_v6_rcv_saddr (ipv6).
Searching the bhash2 table leads to significantly faster conflict
resolution and less time holding the hashbucket spinlock.

Please note a few things:
* There can be the case where the a socket's address changes after it
has been bound. There are two cases where this happens:

  1) The case where there is a bind() call on INADDR_ANY (ipv4) or
  IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
  assign the socket an address when it handles the connect()

  2) In inet_sk_reselect_saddr(), which is called when rebuilding the
  sk header and a few pre-conditions are met (eg rerouting fails).

In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and add a new entry reflecting the updated
address.

* The bhash2 table must have its own lock, even though concurrent
accesses on the same port are protected by the bhash lock. Bhash2 must
have its own lock to protect against cases where sockets on different
ports hash to different bhash hashbuckets but to the same bhash2
hashbucket.

This brings up a few stipulations:
  1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
  will always be acquired after the bhash lock and released before the
  bhash lock is released.

  2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
  acquired+released before another bhash2 lock is acquired+released.

* The bhash table cannot be superseded by the bhash2 table because for
bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
bound to that port must be checked for a potential conflict. The bhash
table is the only source of port-&gt;socket associations.

Signed-off-by: Joanne Koong &lt;joannelkoong@gmail.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>Revert "tcp: change pingpong threshold to 3"</title>
<updated>2022-07-22T22:09:10+00:00</updated>
<author>
<name>Wei Wang</name>
<email>weiwan@google.com</email>
</author>
<published>2022-07-21T20:44:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4d8f24eeedc58d5f87b650ddda73c16e8ba56559'/>
<id>urn:sha1:4d8f24eeedc58d5f87b650ddda73c16e8ba56559</id>
<content type='text'>
This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.

This to-be-reverted commit was meant to apply a stricter rule for the
stack to enter pingpong mode. However, the condition used to check for
interactive session "before(tp-&gt;lsndtime, icsk-&gt;icsk_ack.lrcvtime)" is
jiffy based and might be too coarse, which delays the stack entering
pingpong mode.
We revert this patch so that we no longer use the above condition to
determine interactive session, and also reduce pingpong threshold to 1.

Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
Reported-by: LemmyHuang &lt;hlm3280@163.com&gt;
Suggested-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Wei Wang &lt;weiwan@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Link: https://lore.kernel.org/r/20220721204404.388396-1-weiwan@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>Revert "net: Add a second bind table hashed by port and address"</title>
<updated>2022-06-16T18:07:59+00:00</updated>
<author>
<name>Joanne Koong</name>
<email>joannelkoong@gmail.com</email>
</author>
<published>2022-06-15T19:32:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=593d1ebe00a45af5cb7bda1235c0790987c2a2b2'/>
<id>urn:sha1:593d1ebe00a45af5cb7bda1235c0790987c2a2b2</id>
<content type='text'>
This reverts:

commit d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
commit 538aaf9b2383 ("selftests: Add test for timing a bind request to a port with a populated bhash entry")
Link: https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/

There are a few things that need to be fixed here:
* Updating bhash2 in cases where the socket's rcv saddr changes
* Adding bhash2 hashbucket locks

Links to syzbot reports:
https://lore.kernel.org/netdev/00000000000022208805e0df247a@google.com/
https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/

Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com
Reported-by: syzbot+98fd2d1422063b0f8c44@syzkaller.appspotmail.com
Reported-by: syzbot+0a847a982613c6438fba@syzkaller.appspotmail.com
Signed-off-by: Joanne Koong &lt;joannelkoong@gmail.com&gt;
Link: https://lore.kernel.org/r/20220615193213.2419568-1-joannelkoong@gmail.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: Add a second bind table hashed by port and address</title>
<updated>2022-05-21T01:16:24+00:00</updated>
<author>
<name>Joanne Koong</name>
<email>joannelkoong@gmail.com</email>
</author>
<published>2022-05-20T00:18:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d5a42de8bdbe25081f07b801d8b35f4d75a791f4'/>
<id>urn:sha1:d5a42de8bdbe25081f07b801d8b35f4d75a791f4</id>
<content type='text'>
We currently have one tcp bind table (bhash) which hashes by port
number only. In the socket bind path, we check for bind conflicts by
traversing the specified port's inet_bind2_bucket while holding the
bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).

In instances where there are tons of sockets hashed to the same port
at different addresses, checking for a bind conflict is time-intensive
and can cause softirq cpu lockups, as well as stops new tcp connections
since __inet_inherit_port() also contests for the spinlock.

This patch proposes adding a second bind table, bhash2, that hashes by
port and ip address. Searching the bhash2 table leads to significantly
faster conflict resolution and less time holding the spinlock.

Signed-off-by: Joanne Koong &lt;joannelkoong@gmail.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Kuniyuki Iwashima &lt;kuniyu@amazon.co.jp&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: inet: Retire port only listening_hash</title>
<updated>2022-05-12T23:52:18+00:00</updated>
<author>
<name>Martin KaFai Lau</name>
<email>kafai@fb.com</email>
</author>
<published>2022-05-12T00:06:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cae3873c5b3a4fcd9706fb461ff4e91bdf1f0120'/>
<id>urn:sha1:cae3873c5b3a4fcd9706fb461ff4e91bdf1f0120</id>
<content type='text'>
The listen sk is currently stored in two hash tables,
listening_hash (hashed by port) and lhash2 (hashed by port and address).

After commit 0ee58dad5b06 ("net: tcp6: prefer listeners bound to an address")
and commit d9fbc7f6431f ("net: tcp: prefer listeners bound to an address"),
the TCP-SYN lookup fast path does not use listening_hash.

The commit 05c0b35709c5 ("tcp: seq_file: Replace listening_hash with lhash2")
also moved the seq_file (/proc/net/tcp) iteration usage from
listening_hash to lhash2.

There are still a few listening_hash usages left.
One of them is inet_reuseport_add_sock() which uses the listening_hash
to search a listen sk during the listen() system call.  This turns
out to be very slow on use cases that listen on many different
VIPs at a popular port (e.g. 443).  [ On top of the slowness in
adding to the tail in the IPv6 case ].  The latter patch has a
selftest to demonstrate this case.

This patch takes this chance to move all remaining listening_hash
usages to lhash2 and then retire listening_hash.

Since most changes need to be done together, it is hard to cut
the listening_hash to lhash2 switch into small patches.  The
changes in this patch is highlighted here for the review
purpose.

1. Because of the listening_hash removal, lhash2 can use the
   sk-&gt;sk_nulls_node instead of the icsk-&gt;icsk_listen_portaddr_node.
   This will also keep the sk_unhashed() check to work as is
   after stop adding sk to listening_hash.

   The union is removed from inet_listen_hashbucket because
   only nulls_head is needed.

2. icsk-&gt;icsk_listen_portaddr_node and its helpers are removed.

3. The current lhash2 users needs to iterate with sk_nulls_node
   instead of icsk_listen_portaddr_node.

   One case is in the inet[6]_lhash2_lookup().

   Another case is the seq_file iterator in tcp_ipv4.c.
   One thing to note is sk_nulls_next() is needed
   because the old inet_lhash2_for_each_icsk_continue()
   does a "next" first before iterating.

4. Move the remaining listening_hash usage to lhash2

   inet_reuseport_add_sock() which this series is
   trying to improve.

   inet_diag.c and mptcp_diag.c are the final two
   remaining use cases and is moved to lhash2 now also.

Signed-off-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: Use BPF timeout setting for SYN ACK RTO</title>
<updated>2022-02-02T14:45:18+00:00</updated>
<author>
<name>Akhmat Karakotov</name>
<email>hmukos@yandex-team.ru</email>
</author>
<published>2022-01-28T19:26:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5903123f662ed18483f05cac3f9e800a074c29ff'/>
<id>urn:sha1:5903123f662ed18483f05cac3f9e800a074c29ff</id>
<content type='text'>
When setting RTO through BPF program, some SYN ACK packets were unaffected
and continued to use TCP_TIMEOUT_INIT constant. This patch adds timeout
option to struct request_sock. Option is initialized with TCP_TIMEOUT_INIT
and is reassigned through BPF using tcp_timeout_init call. SYN ACK
retransmits now use newly added timeout option.

Signed-off-by: Akhmat Karakotov &lt;hmukos@yandex-team.ru&gt;
Acked-by: Martin KaFai Lau &lt;kafai@fb.com&gt;

v2:
	- Add timeout option to struct request_sock. Do not call
	  tcp_timeout_init on every syn ack retransmit.

v3:
	- Use unsigned long for min. Bound tcp_timeout_init to TCP_RTO_MAX.

v4:
	- Refactor duplicate code by adding reqsk_timeout function.
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>dccp/tcp: Remove an unused argument in inet_csk_listen_start().</title>
<updated>2021-11-24T04:16:18+00:00</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.co.jp</email>
</author>
<published>2021-11-22T10:16:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e7049395b1c3085d12b5ba16d058c65598368853'/>
<id>urn:sha1:e7049395b1c3085d12b5ba16d058c65598368853</id>
<content type='text'>
The commit 1295e2cf3065 ("inet: minor optimization for backlog setting in
listen(2)") added change so that sk_max_ack_backlog is initialised earlier
in inet_dccp_listen() and inet_listen().  Since then, we no longer use
backlog in inet_csk_listen_start(), so let's remove it.

Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.co.jp&gt;
Acked-by: Yafang Shao &lt;laoar.shao@gmail.com&gt;
Reviewed-by: Richard Sailer &lt;richard_siegfried@systemli.org&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: switch orphan_count to bare per-cpu counters</title>
<updated>2021-10-15T10:28:34+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2021-10-14T13:41:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=19757cebf0c5016a1f36f7fe9810a9f0b33c0832'/>
<id>urn:sha1:19757cebf0c5016a1f36f7fe9810a9f0b33c0832</id>
<content type='text'>
Use of percpu_counter structure to track count of orphaned
sockets is causing problems on modern hosts with 256 cpus
or more.

Stefan Bach reported a serious spinlock contention in real workloads,
that I was able to reproduce with a netfilter rule dropping
incoming FIN packets.

    53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
            |
            ---queued_spin_lock_slowpath
               |
                --53.51%--_raw_spin_lock_irqsave
                          |
                           --53.51%--__percpu_counter_sum
                                     tcp_check_oom
                                     |
                                     |--39.03%--__tcp_close
                                     |          tcp_close
                                     |          inet_release
                                     |          inet6_release
                                     |          sock_close
                                     |          __fput
                                     |          ____fput
                                     |          task_work_run
                                     |          exit_to_usermode_loop
                                     |          do_syscall_64
                                     |          entry_SYSCALL_64_after_hwframe
                                     |          __GI___libc_close
                                     |
                                      --14.48%--tcp_out_of_resources
                                                tcp_write_timeout
                                                tcp_retransmit_timer
                                                tcp_write_timer_handler
                                                tcp_write_timer
                                                call_timer_fn
                                                expire_timers
                                                __run_timers
                                                run_timer_softirq
                                                __softirqentry_text_start

As explained in commit cf86a086a180 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting"), default batch size is too big
for the default value of tcp_max_orphans (262144).

But even if we reduce batch sizes, there would still be cases
where the estimated count of orphans is beyond the limit,
and where tcp_too_many_orphans() has to call the expensive
percpu_counter_sum_positive().

One solution is to use plain per-cpu counters, and have
a timer to periodically refresh this cache.

Updating this cache every 100ms seems about right, tcp pressure
state is not radically changing over shorter periods.

percpu_counter was nice 15 years ago while hosts had less
than 16 cpus, not anymore by current standards.

v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
    reported by kernel test robot &lt;lkp@intel.com&gt;
    Remove unused socket argument from tcp_too_many_orphans()

Fixes: dd24c00191d5 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Stefan Bach &lt;sfb@google.com&gt;
Cc: Neal Cardwell &lt;ncardwell@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: change ICSK_CA_PRIV_SIZE definition</title>
<updated>2021-06-29T18:54:36+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2021-06-29T13:53:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3f8ad50a9e43b6a59070e6c9c5eec79626f81095'/>
<id>urn:sha1:3f8ad50a9e43b6a59070e6c9c5eec79626f81095</id>
<content type='text'>
Instead of a magic number (13 currently) and having
to change it every other year, use sizeof_field() macro.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: relookup sock for RST+ACK packets handled by obsolete req sock</title>
<updated>2021-03-15T21:34:29+00:00</updated>
<author>
<name>Alexander Ovechkin</name>
<email>ovov@yandex-team.ru</email>
</author>
<published>2021-03-15T11:05:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7233da86697efef41288f8b713c10c2499cffe85'/>
<id>urn:sha1:7233da86697efef41288f8b713c10c2499cffe85</id>
<content type='text'>
Currently tcp_check_req can be called with obsolete req socket for which big
socket have been already created (because of CPU race or early demux
assigning req socket to multiple packets in gro batch).

Commit e0f9759f530bf789e984 ("tcp: try to keep packet if SYN_RCV race
is lost") added retry in case when tcp_check_req is called for PSH|ACK packet.
But if client sends RST+ACK immediatly after connection being
established (it is performing healthcheck, for example) retry does not
occur. In that case tcp_check_req tries to close req socket,
leaving big socket active.

Fixes: e0f9759f530 ("tcp: try to keep packet if SYN_RCV race is lost")
Signed-off-by: Alexander Ovechkin &lt;ovov@yandex-team.ru&gt;
Reported-by: Oleg Senin &lt;olegsenin@yandex-team.ru&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
