kernel/linux.git/net/core, branch linux-5.9.y

net: flow_offload: Fix memory leak for indirect flow block

2020-12-21T12:28:17+00:00

[ Upstream commit 5137d303659d8c324e67814b1cc2e1bc0c0d9836 ] The offending commit introduces a cleanup callback that is invoked when the driver module is removed to clean up the tunnel device flow block. But it returns on the first iteration of the for loop. The remaining indirect flow blocks will never be freed. Fixes: 1fac52da5942 ("net: flow_offload: consolidate indirect flow_block infrastructure") CC: Pablo Neira Ayuso Signed-off-by: Chris Mi Reviewed-by: Roi Dayan Signed-off-by: Greg Kroah-Hartman

net, xsk: Avoid taking multiple skbuff references

2020-12-16T09:58:25+00:00

[ Upstream commit 36ccdf85829a7dd6936dba5d02fa50138471f0d3 ] Commit 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY") addressed the problem that packets were discarded from the Tx AF_XDP ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was bumping the skbuff reference count, so that the buffer would not be freed by dev_direct_xmit(). A reference count larger than one means that the skbuff is "shared", which is not the case. If the "shared" skbuff is sent to the generic XDP receive path, netif_receive_generic_xdp(), and pskb_expand_head() is entered the BUG_ON(skb_shared(skb)) will trigger. This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(), where a user can select the skbuff free policy. This allows AF_XDP to avoid bumping the reference count, but still keep the NETDEV_TX_BUSY behavior. Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY") Reported-by: Yonghong Song Signed-off-by: Björn Töpel Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com Signed-off-by: Sasha Levin

net: skbuff: ensure LSE is pullable before decrementing the MPLS ttl

2020-12-08T09:42:03+00:00

[ Upstream commit 13de4ed9e3a9ccbe54d05f7d5c773f69ecaf6c64 ] skb_mpls_dec_ttl() reads the LSE without ensuring that it is contained in the skb "linear" area. Fix this calling pskb_may_pull() before reading the current ttl. Found by code inspection. Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC") Reported-by: Marcelo Ricardo Leitner Signed-off-by: Davide Caratti Link: https://lore.kernel.org/r/53659f28be8bc336c113b5254dc637cc76bbae91.1606987074.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski Signed-off-by: Greg Kroah-Hartman

sock: set sk_err to ee_errno on dequeue from errq

2020-12-08T09:42:00+00:00

[ Upstream commit 985f7337421a811cb354ca93882f943c8335a6f5 ] When setting sk_err, set it to ee_errno, not ee_origin. Commit f5f99309fa74 ("sock: do not set sk_err in sock_dequeue_err_skb") disabled updating sk_err on errq dequeue, which is correct for most error types (origins): - sk->sk_err = err; Commit 38b257938ac6 ("sock: reset sk_err when the error queue is empty") reenabled the behavior for IMCP origins, which do require it: + if (icmp_next) + sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_origin; But read from ee_errno. Fixes: 38b257938ac6 ("sock: reset sk_err when the error queue is empty") Reported-by: Ayush Ranjan Signed-off-by: Willem de Bruijn Acked-by: Soheil Hassas Yeganeh Link: https://lore.kernel.org/r/20201126151220.2819322-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski Signed-off-by: Greg Kroah-Hartman

devlink: Make sure devlink instance and port are in same net namespace

2020-12-08T09:41:59+00:00

[ Upstream commit a7b43649507dae4e55ff0087cad4e4dd1c6d5b99 ] When devlink reload operation is not used, netdev of an Ethernet port may be present in different net namespace than the net namespace of the devlink instance. Ensure that both the devlink instance and devlink port netdev are located in same net namespace. Fixes: 070c63f20f6c ("net: devlink: allow to change namespaces during reload") Signed-off-by: Parav Pandit Signed-off-by: Jakub Kicinski Signed-off-by: Greg Kroah-Hartman

devlink: Hold rtnl lock while reading netdev attributes

2020-12-08T09:41:59+00:00

[ Upstream commit b187c9b4178b87954dbc94e78a7094715794714f ] A netdevice of a devlink port can be moved to different net namespace than its parent devlink instance. This scenario occurs when devlink reload is not used. When netdevice is undergoing migration to net namespace, its ifindex and name may change. In such use case, devlink port query may read stale netdev attributes. Fix it by reading them under rtnl lock. Fixes: bfcd3a466172 ("Introduce devlink infrastructure") Signed-off-by: Parav Pandit Signed-off-by: Jakub Kicinski Signed-off-by: Greg Kroah-Hartman

bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self

2020-11-24T12:39:06+00:00

[ Upstream commit 6fa9201a898983da731fca068bb4b5c941537588 ] If a socket redirects to itself and it is under memory pressure it is possible to get a socket stuck so that recv() returns EAGAIN and the socket can not advance for some time. This happens because when redirecting a skb to the same socket we received the skb on we first check if it is OK to enqueue the skb on the receiving socket by checking memory limits. But, if the skb is itself the object holding the memory needed to enqueue the skb we will keep retrying from kernel side and always fail with EAGAIN. Then userspace will get a recv() EAGAIN error if there are no skbs in the psock ingress queue. This will continue until either some skbs get kfree'd causing the memory pressure to reduce far enough that we can enqueue the pending packet or the socket is destroyed. In some cases its possible to get a socket stuck for a noticeable amount of time if the socket is only receiving skbs from sk_skb verdict programs. To reproduce I make the socket memory limits ridiculously low so sockets are always under memory pressure. More often though if under memory pressure it looks like a spurious EAGAIN error on user space side causing userspace to retry and typically enough has moved on the memory side that it works. To fix skip memory checks and skb_orphan if receiving on the same sock as already assigned. For SK_PASS cases this is easy, its always the same socket so we can just omit the orphan/set_owner pair. For backlog cases we need to check skb->sk and decide if the orphan and set_owner pair are needed. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend Signed-off-by: Daniel Borkmann Reviewed-by: Jakub Sitnicki Link: https://lore.kernel.org/bpf/160556572660.73229.12566203819812939627.stgit@john-XPS-13-9370 Signed-off-by: Sasha Levin

bpf, sockmap: Use truesize with sk_rmem_schedule()

2020-11-24T12:39:06+00:00

[ Upstream commit 70796fb751f1d34cc650e640572a174faf009cd4 ] We use skb->size with sk_rmem_scheduled() which is not correct. Instead use truesize to align with socket and tcp stack usage of sk_rmem_schedule. Suggested-by: Daniel Borkman Signed-off-by: John Fastabend Signed-off-by: Daniel Borkmann Reviewed-by: Jakub Sitnicki Link: https://lore.kernel.org/bpf/160556570616.73229.17003722112077507863.stgit@john-XPS-13-9370 Signed-off-by: Sasha Levin

bpf, sockmap: On receive programs try to fast track SK_PASS ingress

2020-11-24T12:39:06+00:00

[ Upstream commit 9ecbfb06a078c4911fb444203e8e41d93d22f886 ] When we receive an skb and the ingress skb verdict program returns SK_PASS we currently set the ingress flag and put it on the workqueue so it can be turned into a sk_msg and put on the sk_msg ingress queue. Then finally telling userspace with data_ready hook. Here we observe that if the workqueue is empty then we can try to convert into a sk_msg type and call data_ready directly without bouncing through a workqueue. Its a common pattern to have a recv verdict program for visibility that always returns SK_PASS. In this case unless there is an ENOMEM error or we overrun the socket we can avoid the workqueue completely only using it when we fall back to error cases caused by memory pressure. By doing this we eliminate another case where data may be dropped if errors occur on memory limits in workqueue. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/160226859704.5692.12929678876744977669.stgit@john-Precision-5820-Tower Signed-off-by: Sasha Levin

bpf, sockmap: Skb verdict SK_PASS to self already checked rmem limits

2020-11-24T12:39:06+00:00

[ Upstream commit cfea28f890cf292d5fe90680db64b68086ef25ba ] For sk_skb case where skb_verdict program returns SK_PASS to continue to pass packet up the stack, the memory limits were already checked before enqueuing in skb_queue_tail from TCP side. So, lets remove the extra checks here. The theory is if the TCP stack believes we have memory to receive the packet then lets trust the stack and not double check the limits. In fact the accounting here can cause a drop if sk_rmem_alloc has increased after the stack accepted this packet, but before the duplicate check here. And worse if this happens because TCP stack already believes the data has been received there is no retransmit. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/160226857664.5692.668205469388498375.stgit@john-Precision-5820-Tower Signed-off-by: Sasha Levin