<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/net/core/sock.c, branch linux-6.0.y</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=linux-6.0.y</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=linux-6.0.y'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2022-12-31T12:25:51+00:00</updated>
<entry>
<title>soreuseport: Fix socket selection for SO_INCOMING_CPU.</title>
<updated>2022-12-31T12:25:51+00:00</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-10-21T20:44:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a4531635a34ad85bd42518313f7c6b0c5820d3fc'/>
<id>urn:sha1:a4531635a34ad85bd42518313f7c6b0c5820d3fc</id>
<content type='text'>
[ Upstream commit b261eda84ec136240a9ca753389853a3a1bccca2 ]

Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
with setsockopt(SO_REUSEPORT) since v4.6.

With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
build a highly efficient server application.

setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
or UDP socket, and then incoming packets processed on the CPU will
likely be distributed to the socket.  Technically, a socket could
even receive packets handled on another CPU if no sockets in the
reuseport group have the same CPU receiving the flow.

The logic exists in compute_score() so that a socket will get a higher
score if it has the same CPU as the flow.  However, the score is
ignored after the two blamed commits, which introduced a faster socket
selection algorithm for SO_REUSEPORT.

This patch introduces a counter of sockets with SO_INCOMING_CPU in
a reuseport group to check if we should iterate all sockets to find
a proper one.  We increment the counter when

  * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT

  * enabling SO_INCOMING_CPU if the socket is in a reuseport group

Also, we decrement it when

  * detaching a socket out of the group to apply SO_INCOMING_CPU to
    migrated TCP requests

  * disabling SO_INCOMING_CPU if the socket is in a reuseport group

When the counter reaches 0, we can get back to the O(1) selection
algorithm.
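
A minimal sketch of the counter maintenance and the selection check,
using helper names from the diagram below; the exact signatures and the
body of the O(n) path are assumptions based on this description, not
the verbatim diff:

  static void __reuseport_get_incoming_cpu(struct sock_reuseport *reuse)
  {
          /* Paired with READ_ONCE() in reuseport_select_sock_by_hash(). */
          WRITE_ONCE(reuse-&gt;incoming_cpu, reuse-&gt;incoming_cpu + 1);
  }

  static void __reuseport_put_incoming_cpu(struct sock_reuseport *reuse)
  {
          /* Paired with READ_ONCE() in reuseport_select_sock_by_hash(). */
          WRITE_ONCE(reuse-&gt;incoming_cpu, reuse-&gt;incoming_cpu - 1);
  }

  static struct sock *reuseport_select_sock_by_hash(struct sock_reuseport *reuse,
                                                    u32 hash, u16 num_socks)
  {
          /* Stay O(1) while no socket in the group has SO_INCOMING_CPU. */
          if (!READ_ONCE(reuse-&gt;incoming_cpu))
                  return reuse-&gt;socks[reciprocal_scale(hash, num_socks)];

          /* O(n): iterate the group for a socket whose
           * sk-&gt;sk_incoming_cpu matches raw_smp_processor_id().
           */
          ...
  }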

The overall changes are negligible for the non-SO_INCOMING_CPU case,
and the only notable thing is that we have to update sk_incoming_cpu
under reuseport_lock.  Otherwise, the race prevents transitioning to
the O(n) algorithm and results in the wrong socket selection.

 cpu1 (setsockopt)               cpu2 (listen)
+-----------------+             +-------------+

lock_sock(sk1)                  lock_sock(sk2)

reuseport_update_incoming_cpu(sk1, val)
.
|  /* set CPU as 0 */
|- WRITE_ONCE(sk1-&gt;incoming_cpu, val)
|
|                               spin_lock_bh(&amp;reuseport_lock)
|                               reuseport_grow(sk2, reuse)
|                               .
|                               |- more_socks_size = reuse-&gt;max_socks * 2U;
|                               |- if (more_socks_size &gt; U16_MAX &amp;&amp;
|                               |       reuse-&gt;num_closed_socks)
|                               |  .
|                               |  |- RCU_INIT_POINTER(sk1-&gt;sk_reuseport_cb, NULL);
|                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
|                               |     .
|                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
|                               |        .
|                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
|                               |        |   * without lock_sock().
|                               |        |   */
|                               |        `- if (sk1-&gt;sk_incoming_cpu &gt;= 0)
|                               |           .
|                               |           |  /* decrement not-yet-incremented
|                               |           |   * count, which is never incremented.
|                               |           |   */
|                               |           `- __reuseport_put_incoming_cpu(reuse);
|                               |
|                               `- spin_unlock_bh(&amp;reuseport_lock)
|
|- spin_lock_bh(&amp;reuseport_lock)
|
|- reuse = rcu_dereference_protected(sk1-&gt;sk_reuseport_cb, ...)
|- if (!reuse)
|  .
|  |  /* Cannot increment reuse-&gt;incoming_cpu. */
|  `- goto out;
|
`- spin_unlock_bh(&amp;reuseport_lock)

Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Kazuho Oku &lt;kazuhooku@gmail.com&gt;
Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: Fix a data-race around sysctl_net_busy_read.</title>
<updated>2022-08-24T12:46:58+00:00</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-08-23T17:46:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e59ef36f0795696ab229569c153936bfd068d21c'/>
<id>urn:sha1:e59ef36f0795696ab229569c153936bfd068d21c</id>
<content type='text'>
While reading sysctl_net_busy_read, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
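
For example, the reader in sock_init_data() changes along these lines
(a sketch of the pattern, not the verbatim hunk):

  -       sk-&gt;sk_ll_usec = sysctl_net_busy_read;
  +       sk-&gt;sk_ll_usec = READ_ONCE(sysctl_net_busy_read);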

Fixes: 2d48d67fa8cd ("net: poll/select low latency socket support")
Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: Fix data-races around sysctl_optmem_max.</title>
<updated>2022-08-24T12:46:57+00:00</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-08-23T17:46:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7de6d09f51917c829af2b835aba8bb5040f8e86a'/>
<id>urn:sha1:7de6d09f51917c829af2b835aba8bb5040f8e86a</id>
<content type='text'>
While reading sysctl_optmem_max, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
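
For example, the size check in sock_kmalloc() follows this pattern
(a sketch, not the verbatim hunk):

  -       if ((unsigned int)size &lt;= sysctl_optmem_max &amp;&amp;
  +       if ((unsigned int)size &lt;= READ_ONCE(sysctl_optmem_max) &amp;&amp;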

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: Fix data-races around sysctl_[rw]mem_(max|default).</title>
<updated>2022-08-24T12:46:57+00:00</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-08-23T17:46:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1227c1771dd2ad44318aa3ab9e3a293b3f34ff2a'/>
<id>urn:sha1:1227c1771dd2ad44318aa3ab9e3a293b3f34ff2a</id>
<content type='text'>
While reading sysctl_[rw]mem_(max|default), they can be changed
concurrently.  Thus, we need to add READ_ONCE() to their readers.
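
For example, the SO_SNDBUF clamp in sock_setsockopt() follows this
pattern (a sketch, not the verbatim hunk):

  -               val = min_t(u32, val, sysctl_wmem_max);
  +               val = min_t(u32, val, READ_ONCE(sysctl_wmem_max));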

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tls: rx: periodically flush socket backlog</title>
<updated>2022-07-06T11:56:35+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-07-05T23:59:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c46b01839f7aad5889e23505bbfbeb5f4d7fde8e'/>
<id>urn:sha1:c46b01839f7aad5889e23505bbfbeb5f4d7fde8e</id>
<content type='text'>
We continuously hold the socket lock during large reads and writes.
This may inflate RTT and negatively impact TCP performance.
Flush the backlog periodically.  I tried to pick a flush period (128kB)
that gives a significant benefit without yet visibly impacting the max
Bps rate.
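
The shape of the fix, as a rough sketch (the bookkeeping variable and
the placement inside tls_sw_recvmsg() are illustrative, not the
verbatim patch):

  /* inside the receive loop, socket lock held */
  copied += chunk;
  if (copied - flushed &gt;= 128 * 1024) {
          /* process packets the BHs queued to the backlog meanwhile */
          sk_flush_backlog(sk);
          flushed = copied;
  }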

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: sk_forced_mem_schedule() optimization</title>
<updated>2022-06-13T12:35:25+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2022-06-11T03:30:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=219160be496f7f9cd105c5708e37cf22ab4ce0c7'/>
<id>urn:sha1:219160be496f7f9cd105c5708e37cf22ab4ce0c7</id>
<content type='text'>
sk_memory_allocated_add() has three callers and returns
@memory_allocated to them.

sk_forced_mem_schedule() is one of them, and ignores
the returned value.

Change sk_memory_allocated_add() to return void.

Change sock_reserve_memory() and __sk_mem_raise_allocated()
to call sk_memory_allocated().

This removes one cache line miss [1] for RPC workloads,
as first skbs in TCP write queue and receive queue go through
sk_forced_mem_schedule().

[1] Cache line holding tcp_memory_allocated.
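
In sketch form (the local variable name is illustrative):

  /* before: all three callers got the running total back */
  allocated = sk_memory_allocated_add(sk, amt);

  /* after: sk_memory_allocated_add() returns void; the two callers
   * that need the total read it explicitly, and
   * sk_forced_mem_schedule() avoids touching that cache line.
   */
  sk_memory_allocated_add(sk, amt);
  allocated = sk_memory_allocated(sk);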

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: unexport __sk_mem_{raise|reduce}_allocated</title>
<updated>2022-06-10T23:21:27+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2022-06-09T06:34:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0f2c2693988aeeb4c83a581fe58a28d526eecd39'/>
<id>urn:sha1:0f2c2693988aeeb4c83a581fe58a28d526eecd39</id>
<content type='text'>
These two helpers are only used from core networking.
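
Concretely, unexporting amounts to dropping the two EXPORT_SYMBOL()
lines in net/core/sock.c (sketch):

  -EXPORT_SYMBOL(__sk_mem_raise_allocated);
  -EXPORT_SYMBOL(__sk_mem_reduce_allocated);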

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: add per_cpu_fw_alloc field to struct proto</title>
<updated>2022-06-10T23:21:26+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2022-06-09T06:34:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0defbb0af775ef037913786048d099bbe8b9a2c2'/>
<id>urn:sha1:0defbb0af775ef037913786048d099bbe8b9a2c2</id>
<content type='text'>
Each protocol having a -&gt;memory_allocated pointer gets a corresponding
per-cpu reserve, which following patches will use.

Instead of having reserved bytes per socket,
we want to have per-cpu reserves.
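
Roughly, the new field sits next to the existing pointer in struct
proto, with one per-cpu variable defined per protocol (a sketch; the
exact placement is an assumption):

   struct proto {
           ...
           long                    *memory_allocated;
  +        int __percpu            *per_cpu_fw_alloc;
           ...
   };

  +DEFINE_PER_CPU(int, tcp_memory_per_cpu_fw_alloc);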

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFT</title>
<updated>2022-06-10T23:21:26+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2022-06-09T06:34:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=100fdd1faf50557558e2911af4be32e515cb8036'/>
<id>urn:sha1:100fdd1faf50557558e2911af4be32e515cb8036</id>
<content type='text'>
Due to the memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE.

This might change in the future, but it seems better to avoid the
confusion.
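
For reference, the removed macros were defined as (sketch of the old
include/net/sock.h):

  -#define SK_MEM_QUANTUM ((int)PAGE_SIZE)
  -#define SK_MEM_QUANTUM_SHIFT ilog2(SK_MEM_QUANTUM)

  /* callers now use PAGE_SIZE and PAGE_SHIFT directly */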

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: use DEBUG_NET_WARN_ON_ONCE() in __release_sock()</title>
<updated>2022-06-10T04:53:54+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2022-06-08T16:04:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=63fbdd3c77ec216a91a87d22821d2ecd96669a5d'/>
<id>urn:sha1:63fbdd3c77ec216a91a87d22821d2ecd96669a5d</id>
<content type='text'>
The check against the skb dst in the socket backlog has never
triggered in past years.

Keep the check only for CONFIG_DEBUG_NET=y builds.
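
The change in __release_sock(), sketched:

  -               WARN_ON_ONCE(skb_dst_is_noref(skb));
  +               DEBUG_NET_WARN_ON_ONCE(skb_dst_is_noref(skb));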

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
</feed>
