summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)AuthorFilesLines
2019-07-04Merge tag 'v5.1.16' into dev-5.1dev-5.1Joel Stanley2-2/+10
This is the 5.1.16 stable release Signed-off-by: Joel Stanley <joel@jms.id.au>
2019-07-03bpf: udp: Avoid calling reuseport's bpf_prog from udp_groMartin KaFai Lau1-1/+5
commit 257a525fe2e49584842c504a92c27097407f778f upstream. When the commit a6024562ffd7 ("udp: Add GRO functions to UDP socket") added udp[46]_lib_lookup_skb to the udp_gro code path, it broke the reuseport_select_sock() assumption that skb->data is pointing to the transport header. This patch follows an earlier __udp6_lib_err() fix by passing a NULL skb to avoid calling the reuseport's bpf_prog. Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03bpf: fix unconnected udp hooksDaniel Borkmann1-0/+4
commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream. Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently to applications as also stated in original motivation in 7828f20e3779 ("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter two hooks into Cilium to enable host based load-balancing with Kubernetes, I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes typically sets up DNS as a service and is thus subject to load-balancing. Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API is currently insufficient and thus not usable as-is for standard applications shipped with most distros. To break down the issue we ran into with a simple example: # cat /etc/resolv.conf nameserver 147.75.207.207 nameserver 147.75.207.208 For the purpose of a simple test, we set up above IPs as service IPs and transparently redirect traffic to a different DNS backend server for that node: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 The attached BPF program is basically selecting one of the backends if the service IP/port matches on the cgroup hook. DNS breaks here, because the hooks are not transparent enough to applications which have built-in msg_name address checks: # nslookup 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ;; connection timed out; no servers could be reached # dig 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; connection timed out; no servers could be reached For comparison, if none of the service IPs is used, and we tell nslookup to use 8.8.8.8 directly it works just fine, of course: # nslookup 1.1.1.1 8.8.8.8 1.1.1.1.in-addr.arpa name = one.one.one.one. In order to fix this and thus act more transparent to the application, this needs reverse translation on recvmsg() side. A minimal fix for this API is to add similar recvmsg() hooks behind the BPF cgroups static key such that the program can track state and replace the current sockaddr_in{,6} with the original service IP. From BPF side, this basically tracks the service tuple plus socket cookie in an LRU map where the reverse NAT can then be retrieved via map value as one example. Side-note: the BPF cgroups static key should be converted to a per-hook static key in future. Same example after this fix: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 Lookups work fine now: # nslookup 1.1.1.1 1.1.1.1.in-addr.arpa name = one.one.one.one. Authoritative answers can be found from: # dig 1.1.1.1 ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550 ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;1.1.1.1. IN A ;; AUTHORITY SECTION: . 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400 ;; Query time: 17 msec ;; SERVER: 147.75.207.207#53(147.75.207.207) ;; WHEN: Tue May 21 12:59:38 UTC 2019 ;; MSG SIZE rcvd: 111 And from an actual packet level it shows that we're using the back end server when talking via 147.75.207.20{7,8} front end: # tcpdump -i any udp [...] 12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) [...] In order to be flexible and to have same semantics as in sendmsg BPF programs, we only allow return codes in [1,1] range. In the sendmsg case the program is called if msg->msg_name is present which can be the case in both, connected and unconnected UDP. The former only relies on the sockaddr_in{,6} passed via connect(2) if passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar way to call into the BPF program whenever a non-NULL msg->msg_name was passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note that for TCP case, the msg->msg_name is ignored in the regular recvmsg path and therefore not relevant. For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE, the hook is not called. This is intentional as it aligns with the same semantics as in case of TCP cgroup BPF hooks right now. This might be better addressed in future through a different bpf_attach_type such that this case can be distinguished from the regular recvmsg paths, for example. Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loopStephen Suryaputra1-1/+1
[ Upstream commit 38c73529de13e1e10914de7030b659a2f8b01c3b ] In commit 19e4e768064a8 ("ipv4: Fix raw socket lookup for local traffic"), the dif argument to __raw_v4_lookup() is coming from the returned value of inet_iif() but the change was done only for the first lookup. Subsequent lookups in the while loop still use skb->dev->ifIndex. Fixes: 19e4e768064a8 ("ipv4: Fix raw socket lookup for local traffic") Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-02treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner39-214/+39
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit 2874c5fd284268364ece81a7bd936f3c8168e567) Signed-off-by: Joel Stanley <joel@jms.id.au>
2019-06-22tcp: refine memory limit test in tcp_fragment()Eric Dumazet1-1/+2
commit b6653b3629e5b88202be3c9abc44713973f5c4b4 upstream. tcp_fragment() might be called for skbs in the write queue. Memory limits might have been exceeded because tcp_sendmsg() only checks limits at full skb (64KB) boundaries. Therefore, we need to make sure tcp_fragment() wont punish applications that might have setup very low SO_SNDBUF values. Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Christoph Paasch <cpaasch@apple.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-22netfilter: nft_fib: Fix existence check supportPhil Sutter1-20/+3
[ Upstream commit e633508a95289489d28faacb68b32c3e7e68ef6f ] NFTA_FIB_F_PRESENT flag was not always honored since eval functions did not call nft_fib_store_result in all cases. Given that in all callsites there is a struct net_device pointer available which holds the interface data to be stored in destination register, simplify nft_fib_store_result() to just accept that pointer instead of the nft_pktinfo pointer and interface index. This also allows to drop the index to interface lookup previously needed to get the name associated with given index. Fixes: 055c4b34b94f6 ("netfilter: nft_fib: Support existence check") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-22net: correct udp zerocopy refcnt also when zerocopy only on appendWillem de Bruijn1-1/+1
[ Upstream commit 522924b583082f51b8a2406624a2f27c22119b20 ] The below patch fixes an incorrect zerocopy refcnt increment when appending with MSG_MORE to an existing zerocopy udp skb. send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt 1 send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt still 1 (bar frags) But it missed that zerocopy need not be passed at the first send. The right test whether the uarg is newly allocated and thus has extra refcnt 1 is not !skb, but !skb_zcopy. send(.., MSG_MORE); // <no uarg> send(.., MSG_ZEROCOPY); // refcnt 1 Fixes: 100f6d8e09905 ("net: correct zerocopy refcnt with udp MSG_MORE") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19bpf, tcp: correctly handle DONT_WAIT flags and timeo == 0John Fastabend1-1/+4
[ Upstream commit 5fa2ca7c4a3fc176f31b495e1a704862d8188b53 ] The tcp_bpf_wait_data() routine needs to check timeo != 0 before calling sk_wait_event() otherwise we may see unexpected stalls on receiver. Arika did all the leg work here I just formatted, posted and ran a few tests. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Reported-by: Arika Chen <eaglesora@gmail.com> Suggested-by: Arika Chen <eaglesora@gmail.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-19bpf: sockmap remove duplicate queue freeJohn Fastabend1-2/+0
[ Upstream commit c42253cc88206fd0e9868c8b2fd7f9e79f9e0e03 ] In tcp bpf remove we free the cork list and purge the ingress msg list. However we do this before the ref count reaches zero so it could be possible some other access is in progress. In this case (tcp close and/or tcp_unhash) we happen to also hold the sock lock so no path exists but lets fix it otherwise it is extremely fragile and breaks the reference counting rules. Also we already check the cork list and ingress msg queue and free them once the ref count reaches zero so its wasteful to check twice. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-17tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()Eric Dumazet1-0/+1
commit 967c05aee439e6e5d7d805e195b3a20ef5c433d6 upstream. If mtu probing is enabled tcp_mtu_probing() could very well end up with a too small MSS. Use the new sysctl tcp_min_snd_mss to make sure MSS search is performed in an acceptable range. CVE-2019-11479 -- tcp mss hardcoded to 48 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17tcp: add tcp_min_snd_mss sysctlEric Dumazet3-2/+13
commit 5f3e2bf008c2221478101ee72f5cb4654b9fc363 upstream. Some TCP peers announce a very small MSS option in their SYN and/or SYN/ACK messages. This forces the stack to send packets with a very high network/cpu overhead. Linux has enforced a minimal value of 48. Since this value includes the size of TCP options, and that the options can consume up to 40 bytes, this means that each segment can include only 8 bytes of payload. In some cases, it can be useful to increase the minimal value to a saner value. We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility reasons. Note that TCP_MAXSEG socket option enforces a minimal value of (TCP_MIN_MSS). David Miller increased this minimal value in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.") from 64 to 88. We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS. CVE-2019-11479 -- tcp mss hardcoded to 48 Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17tcp: tcp_fragment() should apply sane memory limitsEric Dumazet2-0/+6
commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream. Jonathan Looney reported that a malicious peer can force a sender to fragment its retransmit queue into tiny skbs, inflating memory usage and/or overflow 32bit counters. TCP allows an application to queue up to sk_sndbuf bytes, so we need to give some allowance for non malicious splitting of retransmit queue. A new SNMP counter is added to monitor how many times TCP did not allow to split an skb if the allowance was exceeded. Note that this counter might increase in the case applications use SO_SNDBUF socket option to lower sk_sndbuf. CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the socket is already using more than half the allowed space Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17tcp: limit payload size of sacked skbsEric Dumazet3-9/+24
commit 3b4929f65b0d8249f19a50245cd88ed1a2f78cff upstream. Jonathan Looney reported that TCP can trigger the following crash in tcp_shifted_skb() : BUG_ON(tcp_skb_pcount(skb) < pcount); This can happen if the remote peer has advertized the smallest MSS that linux TCP accepts : 48 An skb can hold 17 fragments, and each fragment can hold 32KB on x86, or 64KB on PowerPC. This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs can overflow. Note that tcp_sendmsg() builds skbs with less than 64KB of payload, so this problem needs SACK to be enabled. SACK blocks allow TCP to coalesce multiple skbs in the retransmit queue, thus filling the 17 fragments to maximal capacity. CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11ipmr_base: Do not reset index in mr_table_dumpDavid Ahern1-2/+1
[ Upstream commit 7fcd1e033dacedd520abebc943c960dcf5add3ae ] e is the counter used to save the location of a dump when an skb is filled. Once the walk of the table is complete, mr_table_dump needs to return without resetting that index to 0. Dump of a specific table is looping because of the reset because there is no way to indicate the walk of the table is done. Move the reset to the caller so the dump of each table starts at 0, but the loop counter is maintained if a dump fills an skb. Fixes: e1cedae1ba6b0 ("ipmr: Refactor mr_rtm_dumproute") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11udp: only choose unbound UDP socket for multicast when not in a VRFTim Beale1-2/+1
[ Upstream commit 82ba25c6de200d7a9e9c970c998cdd6dfa8637ae ] By default, packets received in another VRF should not be passed to an unbound socket in the default VRF. This patch updates the IPv4 UDP multicast logic to match the unicast VRF logic (in compute_score()), as well as the IPv6 mcast logic (in __udp_v6_is_mcast_sock()). The particular case I noticed was DHCP discover packets going to the 255.255.255.255 address, which are handled by __udp4_lib_mcast_deliver(). The previous code meant that running multiple different DHCP server or relay agent instances across VRFs did not work correctly - any server/relay agent in the default VRF received DHCP discover packets for all other VRFs. Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF") Signed-off-by: Tim Beale <timbeale@catalyst.net.nz> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11ipv4: not do cache for local delivery if bc_forwarding is enabledXin Long1-11/+11
[ Upstream commit 0a90478b93a46bdcd56ba33c37566a993e455d54 ] With the topo: h1 ---| rp1 | | route rp3 |--- h3 (192.168.200.1) h2 ---| rp2 | If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on h2, and the packets can still be forwared. This issue was caused by the input route cache. It should only do the cache for either bc forwarding or local delivery. Otherwise, local delivery can use the route cache for bc forwarding of other interfaces. This patch is to fix it by not doing cache for local delivery if all.bc_forwarding is enabled. Note that we don't fix it by checking route cache local flag after rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the common route code shouldn't be touched for bc_forwarding. Fixes: 5cbf777cfdf6 ("route: add support for directed broadcast forwarding") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-04net: correct zerocopy refcnt with udp MSG_MOREWillem de Bruijn1-2/+2
[ Upstream commit 100f6d8e09905c59be45b6316f8f369c0be1b2d8 ] TCP zerocopy takes a uarg reference for every skb, plus one for the tcp_sendmsg_locked datapath temporarily, to avoid reaching refcnt zero as it builds, sends and frees skbs inside its inner loop. UDP and RAW zerocopy do not send inside the inner loop so do not need the extra sock_zerocopy_get + sock_zerocopy_put pair. Commit 52900d22288ed ("udp: elide zerocopy operation in hot path") introduced extra_uref to pass the initial reference taken in sock_zerocopy_alloc to the first generated skb. But, sock_zerocopy_realloc takes this extra reference at the start of every call. With MSG_MORE, no new skb may be generated to attach the extra_uref to, so refcnt is incorrectly 2 with only one skb. Do not take the extra ref if uarg && !tcp, which implies MSG_MORE. Update extra_uref accordingly. This conditional assignment triggers a false positive may be used uninitialized warning, so have to initialize extra_uref at define. Changes v1->v2: fix typo in Fixes SHA1 Fixes: 52900d22288e7 ("udp: elide zerocopy operation in hot path") Reported-by: syzbot <syzkaller@googlegroups.com> Diagnosed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-04ipv4/igmp: fix build error if !CONFIG_IP_MULTICASTEric Dumazet1-11/+11
[ Upstream commit 903869bd10e6719b9df6718e785be7ec725df59f ] ip_sf_list_clear_all() needs to be defined even if !CONFIG_IP_MULTICAST Fixes: 3580d04aa674 ("ipv4/igmp: fix another memory leak in igmpv3_del_delrec()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-04ipv4/igmp: fix another memory leak in igmpv3_del_delrec()Eric Dumazet1-17/+30
[ Upstream commit 3580d04aa674383c42de7b635d28e52a1e5bc72c ] syzbot reported memory leaks [1] that I have back tracked to a missing cleanup from igmpv3_del_delrec() when (im->sfmode != MCAST_INCLUDE) Add ip_sf_list_clear_all() and kfree_pmc() helpers to explicitely handle the cleanups before freeing. [1] BUG: memory leak unreferenced object 0xffff888123e32b00 (size 64): comm "softirq", pid 0, jiffies 4294942968 (age 8.010s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 e0 00 00 01 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<000000006105011b>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<000000006105011b>] slab_post_alloc_hook mm/slab.h:439 [inline] [<000000006105011b>] slab_alloc mm/slab.c:3326 [inline] [<000000006105011b>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553 [<000000004bba8073>] kmalloc include/linux/slab.h:547 [inline] [<000000004bba8073>] kzalloc include/linux/slab.h:742 [inline] [<000000004bba8073>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline] [<000000004bba8073>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085 [<00000000a46a65a0>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475 [<000000005956ca89>] do_ip_setsockopt.isra.0+0x1795/0x1930 net/ipv4/ip_sockglue.c:957 [<00000000848e2d2f>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246 [<00000000b9db185c>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616 [<000000003028e438>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130 [<0000000015b65589>] __sys_setsockopt+0x98/0x120 net/socket.c:2078 [<00000000ac198ef0>] __do_sys_setsockopt net/socket.c:2089 [inline] [<00000000ac198ef0>] __se_sys_setsockopt net/socket.c:2086 [inline] [<00000000ac198ef0>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086 [<000000000a770437>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301 [<00000000d3adb93b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 9c8bb163ae78 ("igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hangbin Liu <liuhangbin@gmail.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-04inet: switch IP ID generator to siphashEric Dumazet1-5/+7
[ Upstream commit df453700e8d81b1bdafdf684365ee2b9431fb702 ] According to Amit Klein and Benny Pinkas, IP ID generation is too weak and might be used by attackers. Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix()) having 64bit key and Jenkins hash is risky. It is time to switch to siphash and its 128bit keys. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-16ipv4: Fix raw socket lookup for local trafficDavid Ahern1-2/+2
[ Upstream commit 19e4e768064a87b073a4b4c138b55db70e0cfb9f ] inet_iif should be used for the raw socket lookup. inet_iif considers rt_iif which handles the case of local traffic. As it stands, ping to a local address with the '-I <dev>' option fails ever since ping was changed to use SO_BINDTODEVICE instead of cmsg + IP_PKTINFO. IPv6 works fine. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-02udp: fix GRO packet of deathEric Dumazet1-3/+10
syzbot was able to crash host by sending UDP packets with a 0 payload. TCP does not have this issue since we do not aggregate packets without payload. Since dev_gro_receive() sets gso_size based on skb_gro_len(skb) it seems not worth trying to cope with padded packets. BUG: KASAN: slab-out-of-bounds in skb_gro_receive+0xf5f/0x10e0 net/core/skbuff.c:3826 Read of size 16 at addr ffff88808893fff0 by task syz-executor612/7889 CPU: 0 PID: 7889 Comm: syz-executor612 Not tainted 5.1.0-rc7+ #96 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 __asan_report_load16_noabort+0x14/0x20 mm/kasan/generic_report.c:133 skb_gro_receive+0xf5f/0x10e0 net/core/skbuff.c:3826 udp_gro_receive_segment net/ipv4/udp_offload.c:382 [inline] call_gro_receive include/linux/netdevice.h:2349 [inline] udp_gro_receive+0xb61/0xfd0 net/ipv4/udp_offload.c:414 udp4_gro_receive+0x763/0xeb0 net/ipv4/udp_offload.c:478 inet_gro_receive+0xe72/0x1110 net/ipv4/af_inet.c:1510 dev_gro_receive+0x1cd0/0x23c0 net/core/dev.c:5581 napi_gro_frags+0x36b/0xd10 net/core/dev.c:5843 tun_get_user+0x2f24/0x3fb0 drivers/net/tun.c:1981 tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2027 call_write_iter include/linux/fs.h:1866 [inline] do_iter_readv_writev+0x5e1/0x8e0 fs/read_write.c:681 do_iter_write fs/read_write.c:957 [inline] do_iter_write+0x184/0x610 fs/read_write.c:938 vfs_writev+0x1b3/0x2f0 fs/read_write.c:1002 do_writev+0x15e/0x370 fs/read_write.c:1037 __do_sys_writev fs/read_write.c:1110 [inline] __se_sys_writev fs/read_write.c:1107 [inline] __x64_sys_writev+0x75/0xb0 fs/read_write.c:1107 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x441cc0 Code: 05 48 3d 01 f0 ff ff 0f 83 9d 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 83 3d 51 93 29 00 00 75 14 b8 14 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 74 09 fc ff c3 48 83 ec 08 e8 ba 2b 00 00 RSP: 002b:00007ffe8c716118 EFLAGS: 00000246 ORIG_RAX: 0000000000000014 RAX: ffffffffffffffda RBX: 00007ffe8c716150 RCX: 0000000000441cc0 RDX: 0000000000000001 RSI: 00007ffe8c716170 RDI: 00000000000000f0 RBP: 0000000000000000 R08: 000000000000ffff R09: 0000000000a64668 R10: 0000000020000040 R11: 0000000000000246 R12: 000000000000c2d9 R13: 0000000000402b50 R14: 0000000000000000 R15: 0000000000000000 Allocated by task 5143: save_stack+0x45/0xd0 mm/kasan/common.c:75 set_track mm/kasan/common.c:87 [inline] __kasan_kmalloc mm/kasan/common.c:497 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:505 slab_post_alloc_hook mm/slab.h:437 [inline] slab_alloc mm/slab.c:3393 [inline] kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3555 mm_alloc+0x1d/0xd0 kernel/fork.c:1030 bprm_mm_init fs/exec.c:363 [inline] __do_execve_file.isra.0+0xaa3/0x23f0 fs/exec.c:1791 do_execveat_common fs/exec.c:1865 [inline] do_execve fs/exec.c:1882 [inline] __do_sys_execve fs/exec.c:1958 [inline] __se_sys_execve fs/exec.c:1953 [inline] __x64_sys_execve+0x8f/0xc0 fs/exec.c:1953 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 5351: save_stack+0x45/0xd0 mm/kasan/common.c:75 set_track mm/kasan/common.c:87 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459 kasan_slab_free+0xe/0x10 mm/kasan/common.c:467 __cache_free mm/slab.c:3499 [inline] kmem_cache_free+0x86/0x260 mm/slab.c:3765 __mmdrop+0x238/0x320 kernel/fork.c:677 mmdrop include/linux/sched/mm.h:49 [inline] finish_task_switch+0x47b/0x780 kernel/sched/core.c:2746 context_switch kernel/sched/core.c:2880 [inline] __schedule+0x81b/0x1cc0 kernel/sched/core.c:3518 preempt_schedule_irq+0xb5/0x140 kernel/sched/core.c:3745 retint_kernel+0x1b/0x2d arch_local_irq_restore arch/x86/include/asm/paravirt.h:767 [inline] kmem_cache_free+0xab/0x260 mm/slab.c:3766 anon_vma_chain_free mm/rmap.c:134 [inline] unlink_anon_vmas+0x2ba/0x870 mm/rmap.c:401 free_pgtables+0x1af/0x2f0 mm/memory.c:394 exit_mmap+0x2d1/0x530 mm/mmap.c:3144 __mmput kernel/fork.c:1046 [inline] mmput+0x15f/0x4c0 kernel/fork.c:1067 exec_mmap fs/exec.c:1046 [inline] flush_old_exec+0x8d9/0x1c20 fs/exec.c:1279 load_elf_binary+0x9bc/0x53f0 fs/binfmt_elf.c:864 search_binary_handler fs/exec.c:1656 [inline] search_binary_handler+0x17f/0x570 fs/exec.c:1634 exec_binprm fs/exec.c:1698 [inline] __do_execve_file.isra.0+0x1394/0x23f0 fs/exec.c:1818 do_execveat_common fs/exec.c:1865 [inline] do_execve fs/exec.c:1882 [inline] __do_sys_execve fs/exec.c:1958 [inline] __se_sys_execve fs/exec.c:1953 [inline] __x64_sys_execve+0x8f/0xc0 fs/exec.c:1953 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff88808893f7c0 which belongs to the cache mm_struct of size 1496 The buggy address is located 600 bytes to the right of 1496-byte region [ffff88808893f7c0, ffff88808893fd98) The buggy address belongs to the page: page:ffffea0002224f80 count:1 mapcount:0 mapping:ffff88821bc40ac0 index:0xffff88808893f7c0 compound_mapcount: 0 flags: 0x1fffc0000010200(slab|head) raw: 01fffc0000010200 ffffea00025b4f08 ffffea00027b9d08 ffff88821bc40ac0 raw: ffff88808893f7c0 ffff88808893e440 0000000100000001 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88808893fe80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88808893ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88808893ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff888088940000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888088940080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01ipv4: ip_do_fragment: Preserve skb_iif during fragmentationShmulik Ladkani1-0/+1
Previously, during fragmentation after forwarding, skb->skb_iif isn't preserved, i.e. 'ip_copy_metadata' does not copy skb_iif from given 'from' skb. As a result, ip_do_fragment's creates fragments with zero skb_iif, leading to inconsistent behavior. Assume for example an eBPF program attached at tc egress (post forwarding) that examines __sk_buff->ingress_ifindex: - the correct iif is observed if forwarding path does not involve fragmentation/refragmentation - a bogus iif is observed if forwarding path involves fragmentation/refragmentatiom Fix, by preserving skb_iif during 'ip_copy_metadata'. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30Merge branch 'master' of ↵David S. Miller4-24/+37
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2019-04-30 1) Fix an out-of-bound array accesses in __xfrm_policy_unlink. From YueHaibing. 2) Reset the secpath on failure in the ESP GRO handlers to avoid dereferencing an invalid pointer on error. From Myungho Jung. 3) Add and revert a patch that tried to add rcu annotations to netns_xfrm. From Su Yanjun. 4) Wait for rcu callbacks before freeing xfrm6_tunnel_spi_kmem. From Su Yanjun. 5) Fix forgotten vti4 ipip tunnel deregistration. From Jeremy Sowden: 6) Remove some duplicated log messages in vti4. From Jeremy Sowden. 7) Don't use IPSEC_PROTO_ANY when flushing states because this will flush only IPsec portocol speciffic states. IPPROTO_ROUTING states may remain in the lists when doing net exit. Fix this by replacing IPSEC_PROTO_ANY with zero. From Cong Wang. 8) Add length check for UDP encapsulation to fix "Oversized IP packet" warnings on receive side. From Sabrina Dubroca. 9) Fix xfrm interface lookup when the interface is associated to a vrf layer 3 master device. From Martin Willi. 10) Reload header pointers after pskb_may_pull() in _decode_session4(), otherwise we may read from uninitialized memory. 11) Update the documentation about xfrm[46]_gc_thresh, it is not used anymore after the flowcache removal. From Nicolas Dichtel. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30tcp: add sanity tests in tcp_add_backlog()Eric Dumazet1-1/+12
Richard and Bruno both reported that my commit added a bug, and Bruno was able to determine the problem came when a segment wih a FIN packet was coalesced to a prior one in tcp backlog queue. It turns out the header prediction in tcp_rcv_established() looks back to TCP headers in the packet, not in the metadata (aka TCP_SKB_CB(skb)->tcp_flags) The fast path in tcp_rcv_established() is not supposed to handle a FIN flag (it does not call tcp_fin()) Therefore we need to make sure to propagate the FIN flag, so that the coalesced packet does not go through the fast path, the same than a GRO packet carrying a FIN flag. While we are at it, make sure we do not coalesce packets with RST or SYN, or if they do not have ACK set. Many thanks to Richard and Bruno for pinpointing the bad commit, and to Richard for providing a first version of the fix. Fixes: 4f693b55c3d2 ("tcp: implement coalescing on backlog queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org> Reported-by: Bruno Prémont <bonbons@sysophe.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28udp: fix GRO reception in case of length mismatchPaolo Abeni1-4/+5
Currently, the UDP GRO code path does bad things on some edge conditions - Aggregation can happen even on packet with different lengths. Fix the above by rewriting the 'complete' condition for GRO packets. While at it, note explicitly that we allow merging the first packet per burst below gso_size. Reported-by: Sean Tong <seantong114@gmail.com> Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-25ipv4: add sanity checks in ipv4_link_failure()Eric Dumazet1-9/+23
Before calling __ip_options_compile(), we need to ensure the network header is a an IPv4 one, and that it is already pulled in skb->head. RAW sockets going through a tunnel can end up calling ipv4_link_failure() with total garbage in the skb, or arbitrary lengthes. syzbot report : BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:355 [inline] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123 Write of size 69 at addr ffff888096abf068 by task syz-executor.4/9204 CPU: 0 PID: 9204 Comm: syz-executor.4 Not tainted 5.1.0-rc5+ #77 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 check_memory_region_inline mm/kasan/generic.c:185 [inline] check_memory_region+0x123/0x190 mm/kasan/generic.c:191 memcpy+0x38/0x50 mm/kasan/common.c:133 memcpy include/linux/string.h:355 [inline] __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123 __icmp_send+0x725/0x1400 net/ipv4/icmp.c:695 ipv4_link_failure+0x29f/0x550 net/ipv4/route.c:1204 dst_link_failure include/net/dst.h:427 [inline] vti6_xmit net/ipv6/ip6_vti.c:514 [inline] vti6_tnl_xmit+0x10d4/0x1c0c net/ipv6/ip6_vti.c:553 __netdev_start_xmit include/linux/netdevice.h:4414 [inline] netdev_start_xmit include/linux/netdevice.h:4423 [inline] xmit_one net/core/dev.c:3292 [inline] dev_hard_start_xmit+0x1b2/0x980 net/core/dev.c:3308 __dev_queue_xmit+0x271d/0x3060 net/core/dev.c:3878 dev_queue_xmit+0x18/0x20 net/core/dev.c:3911 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1527 neigh_output include/net/neighbour.h:508 [inline] ip_finish_output2+0x949/0x1740 net/ipv4/ip_output.c:229 ip_finish_output+0x73c/0xd50 net/ipv4/ip_output.c:317 NF_HOOK_COND include/linux/netfilter.h:278 [inline] ip_output+0x21f/0x670 net/ipv4/ip_output.c:405 dst_output include/net/dst.h:444 [inline] NF_HOOK include/linux/netfilter.h:289 [inline] raw_send_hdrinc net/ipv4/raw.c:432 [inline] raw_sendmsg+0x1d2b/0x2f20 net/ipv4/raw.c:663 inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg+0xdd/0x130 net/socket.c:661 sock_write_iter+0x27c/0x3e0 net/socket.c:988 call_write_iter include/linux/fs.h:1866 [inline] new_sync_write+0x4c7/0x760 fs/read_write.c:474 __vfs_write+0xe4/0x110 fs/read_write.c:487 vfs_write+0x20c/0x580 fs/read_write.c:549 ksys_write+0x14f/0x2d0 fs/read_write.c:599 __do_sys_write fs/read_write.c:611 [inline] __se_sys_write fs/read_write.c:608 [inline] __x64_sys_write+0x73/0xb0 fs/read_write.c:608 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x458c29 Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f293b44bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458c29 RDX: 0000000000000014 RSI: 00000000200002c0 RDI: 0000000000000003 RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f293b44c6d4 R13: 00000000004c8623 R14: 00000000004ded68 R15: 00000000ffffffff The buggy address belongs to the page: page:ffffea00025aafc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x1fffc0000000000() raw: 01fffc0000000000 0000000000000000 ffffffff025a0101 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888096abef80: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2 ffff888096abf000: f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 >ffff888096abf080: 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ^ ffff888096abf100: 00 00 00 00 f1 f1 f1 f1 00 00 f3 f3 00 00 00 00 ffff888096abf180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17ipv4: set the tcp_min_rtt_wlen range from 0 to one dayZhangXiaoxu1-1/+4
There is a UBSAN report as below: UBSAN: Undefined behaviour in net/ipv4/tcp_input.c:2877:56 signed integer overflow: 2147483647 * 1000 cannot be represented in type 'int' CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.1.0-rc4-00058-g582549e #1 Call Trace: <IRQ> dump_stack+0x8c/0xba ubsan_epilogue+0x11/0x60 handle_overflow+0x12d/0x170 ? ttwu_do_wakeup+0x21/0x320 __ubsan_handle_mul_overflow+0x12/0x20 tcp_ack_update_rtt+0x76c/0x780 tcp_clean_rtx_queue+0x499/0x14d0 tcp_ack+0x69e/0x1240 ? __wake_up_sync_key+0x2c/0x50 ? update_group_capacity+0x50/0x680 tcp_rcv_established+0x4e2/0xe10 tcp_v4_do_rcv+0x22b/0x420 tcp_v4_rcv+0xfe8/0x1190 ip_protocol_deliver_rcu+0x36/0x180 ip_local_deliver+0x15b/0x1a0 ip_rcv+0xac/0xd0 __netif_receive_skb_one_core+0x7f/0xb0 __netif_receive_skb+0x33/0xc0 netif_receive_skb_internal+0x84/0x1c0 napi_gro_receive+0x2a0/0x300 receive_buf+0x3d4/0x2350 ? detach_buf_split+0x159/0x390 virtnet_poll+0x198/0x840 ? reweight_entity+0x243/0x4b0 net_rx_action+0x25c/0x770 __do_softirq+0x19b/0x66d irq_exit+0x1eb/0x230 do_IRQ+0x7a/0x150 common_interrupt+0xf/0xf </IRQ> It can be reproduced by: echo 2147483647 > /proc/sys/net/ipv4/tcp_min_rtt_wlen Fixes: f672258391b42 ("tcp: track min RTT using windowed min-filter") Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17tcp: tcp_grow_window() needs to respect tcp_space()Eric Dumazet1-5/+5
For some reason, tcp_grow_window() correctly tests if enough room is present before attempting to increase tp->rcv_ssthresh, but does not prevent it to grow past tcp_space() This is causing hard to debug issues, like failing the (__tcp_select_window(sk) >= tp->rcv_wnd) test in __tcp_ack_snd_check(), causing ACK delays and possibly slow flows. Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio, we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000" after about 60 round trips, when the active side no longer sends immediate acks. This bug predates git history. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-14ipv4: ensure rcu_read_lock() in ipv4_link_failure()Eric Dumazet1-2/+8
fib_compute_spec_dst() needs to be called under rcu protection. syzbot reported : WARNING: suspicious RCU usage 5.1.0-rc4+ #165 Not tainted include/linux/inetdevice.h:220 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by swapper/0/0: #0: 0000000051b67925 ((&n->timer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:170 [inline] #0: 0000000051b67925 ((&n->timer)){+.-.}, at: call_timer_fn+0xda/0x720 kernel/time/timer.c:1315 stack backtrace: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.1.0-rc4+ #165 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5162 __in_dev_get_rcu include/linux/inetdevice.h:220 [inline] fib_compute_spec_dst+0xbbd/0x1030 net/ipv4/fib_frontend.c:294 spec_dst_fill net/ipv4/ip_options.c:245 [inline] __ip_options_compile+0x15a7/0x1a10 net/ipv4/ip_options.c:343 ipv4_link_failure+0x172/0x400 net/ipv4/route.c:1195 dst_link_failure include/net/dst.h:427 [inline] arp_error_report+0xd1/0x1c0 net/ipv4/arp.c:297 neigh_invalidate+0x24b/0x570 net/core/neighbour.c:995 neigh_timer_handler+0xc35/0xf30 net/core/neighbour.c:1081 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325 expire_timers kernel/time/timer.c:1362 [inline] __run_timers kernel/time/timer.c:1681 [inline] __run_timers kernel/time/timer.c:1649 [inline] run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694 __do_softirq+0x266/0x95a kernel/softirq.c:293 invoke_softirq kernel/softirq.c:374 [inline] irq_exit+0x180/0x1d0 kernel/softirq.c:414 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807 Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-13ipv4: recompile ip options in ipv4_link_failureStephen Suryaputra1-1/+9
Recompile IP options since IPCB may not be valid anymore when ipv4_link_failure is called from arp_error_report. Refer to the commit 3da1ed7ac398 ("net: avoid use IPCB in cipso_v4_error") and the commit before that (9ef6b42ad6fd) for a similar issue. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12dctcp: more accurate tracking of packets deliveryEric Dumazet1-28/+17
After commit e21db6f69a95 ("tcp: track total bytes delivered with ECN CE marks") core TCP stack does a very good job tracking ECN signals. The "sender's best estimate of CE information" Yuchung mentioned in his patch is indeed the best we can do. DCTCP can use tp->delivered_ce and tp->delivered to not duplicate the logic, and use the existing best estimate. This solves some problems, since current DCTCP logic does not deal with losses and/or GRO or ack aggregation very well. This also removes a dubious use of inet_csk(sk)->icsk_ack.rcv_mss (this should have been tp->mss_cache), and a 64 bit divide. Finally, we can see that the DCTCP logic, calling dctcp_update_alpha() for every ACK could be done differently, calling it only once per RTT. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Lawrence Brakmo <brakmo@fb.com> Cc: Abdul Kabbani <akabbani@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11net: fou: do not use guehdr after iptunnel_pull_offloads in gue_udp_recvLorenzo Bianconi1-1/+3
gue tunnels run iptunnel_pull_offloads on received skbs. This can determine a possible use-after-free accessing guehdr pointer since the packet will be 'uncloned' running pskb_expand_head if it is a cloned gso skb (e.g if the packet has been sent though a veth device) Fixes: a09a4c8dd1ec ("tunnels: Remove encapsulation offloads on decap") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-09net: ip_gre: fix possible use-after-free in erspan_rcvLorenzo Bianconi1-5/+10
erspan tunnels run __iptunnel_pull_header on received skbs to remove gre and erspan headers. This can determine a possible use-after-free accessing pkt_md pointer in erspan_rcv since the packet will be 'uncloned' running pskb_expand_head if it is a cloned gso skb (e.g if the packet has been sent though a veth device). Fix it resetting pkt_md pointer after __iptunnel_pull_header Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04tcp: Ensure DCTCP reacts to lossesKoen De Schepper1-18/+18
RFC8257 §3.5 explicitly states that "A DCTCP sender MUST react to loss episodes in the same way as conventional TCP". Currently, Linux DCTCP performs no cwnd reduction when losses are encountered. Optionally, the dctcp_clamp_alpha_on_loss resets alpha to its maximal value if a RTO happens. This behavior is sub-optimal for at least two reasons: i) it ignores losses triggering fast retransmissions; and ii) it causes unnecessary large cwnd reduction in the future if the loss was isolated as it resets the historical term of DCTCP's alpha EWMA to its maximal value (i.e., denoting a total congestion). The second reason has an especially noticeable effect when using DCTCP in high BDP environments, where alpha normally stays at low values. This patch replace the clamping of alpha by setting ssthresh to half of cwnd for both fast retransmissions and RTOs, at most once per RTT. Consequently, the dctcp_clamp_alpha_on_loss module parameter has been removed. The table below shows experimental results where we measured the drop probability of a PIE AQM (not applying ECN marks) at a bottleneck in the presence of a single TCP flow with either the alpha-clamping option enabled or the cwnd halving proposed by this patch. Results using reno or cubic are given for comparison. | Link | RTT | Drop TCP CC | speed | base+AQM | probability ==================|=========|==========|============ CUBIC | 40Mbps | 7+20ms | 0.21% RENO | | | 0.19% DCTCP-CLAMP-ALPHA | | | 25.80% DCTCP-HALVE-CWND | | | 0.22% ------------------|---------|----------|------------ CUBIC | 100Mbps | 7+20ms | 0.03% RENO | | | 0.02% DCTCP-CLAMP-ALPHA | | | 23.30% DCTCP-HALVE-CWND | | | 0.04% ------------------|---------|----------|------------ CUBIC | 800Mbps | 1+1ms | 0.04% RENO | | | 0.05% DCTCP-CLAMP-ALPHA | | | 18.70% DCTCP-HALVE-CWND | | | 0.06% We see that, without halving its cwnd for all source of losses, DCTCP drives the AQM to large drop probabilities in order to keep the queue length under control (i.e., it repeatedly faces RTOs). Instead, if DCTCP reacts to all source of losses, it can then be controlled by the AQM using similar drop levels than cubic or reno. Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> Cc: Bob Briscoe <research@bobbriscoe.net> Cc: Lawrence Brakmo <brakmo@fb.com> Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <borkmann@iogearbox.net> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-02xfrm4: Fix uninitialized memory read in _decode_session4Steffen Klassert1-11/+13
We currently don't reload pointers pointing into skb header after doing pskb_may_pull() in _decode_session4(). So in case pskb_may_pull() changed the pointers, we read from random memory. Fix this by putting all the needed infos on the stack, so that we don't need to access the header pointers after doing pskb_may_pull(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-01vrf: check accept_source_route on the original netdeviceStephen Suryaputra2-6/+5
Configuration check to accept source route IP options should be made on the incoming netdevice when the skb->dev is an l3mdev master. The route lookup for the source route next hop also needs the incoming netdev. v2->v3: - Simplify by passing the original netdevice down the stack (per David Ahern). Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01tcp: fix a potential NULL pointer dereference in tcp_sk_exitDust Li1-1/+2
When tcp_sk_init() failed in inet_ctl_sock_create(), 'net->ipv4.tcp_congestion_control' will be left uninitialized, but tcp_sk_exit() hasn't check for that. This patch add checking on 'net->ipv4.tcp_congestion_control' in tcp_sk_exit() to prevent NULL-ptr dereference. Fixes: 6670e1524477 ("tcp: Namespace-ify sysctl_tcp_default_congestion_control") Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-26esp4: add length check for UDP encapsulationSabrina Dubroca1-5/+15
esp_output_udp_encap can produce a length that doesn't fit in the 16 bits of a UDP header's length field. In that case, we'll send a fragmented packet whose length is larger than IP_MAX_MTU (resulting in "Oversized IP packet" warnings on receive) and with a bogus UDP length. To prevent this, add a length check to esp_output_udp_encap and return -EMSGSIZE on failure. This seems to be older than git history. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-03-21vti4: removed duplicate log message.Jeremy Sowden1-3/+1
Removed info log-message if ipip tunnel registration fails during module-initialization: it adds nothing to the error message that is written on all failures. Fixes: dd9ee3444014e ("vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel") Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-03-21vti4: ipip tunnel deregistration fixes.Jeremy Sowden1-2/+3
If tunnel registration failed during module initialization, the module would fail to deregister the IPPROTO_COMP protocol and would attempt to deregister the tunnel. The tunnel was not deregistered during module-exit. Fixes: dd9ee3444014e ("vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel") Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-03-12tcp: Don't access TCP_SKB_CB before initializing itChristoph Paasch1-8/+1
Since commit eeea10b83a13 ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()"), tcp_vX_fill_cb is only called after tcp_filter(). That means, TCP_SKB_CB(skb)->end_seq still points to the IP-part of the cb. We thus should not mock with it, as this can trigger bugs (thanks syzkaller): [ 12.349396] ================================================================== [ 12.350188] BUG: KASAN: slab-out-of-bounds in ip6_datagram_recv_specific_ctl+0x19b3/0x1a20 [ 12.351035] Read of size 1 at addr ffff88006adbc208 by task test_ip6_datagr/1799 Setting end_seq is actually no more necessary in tcp_filter as it gets initialized later on in tcp_vX_fill_cb. Cc: Eric Dumazet <edumazet@google.com> Fixes: eeea10b83a13 ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()") Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-09tcp: handle inet_csk_reqsk_queue_add() failuresGuillaume Nault2-2/+13
Commit 7716682cc58e ("tcp/dccp: fix another race at listener dismantle") let inet_csk_reqsk_queue_add() fail, and adjusted {tcp,dccp}_check_req() accordingly. However, TFO and syncookies weren't modified, thus leaking allocated resources on error. Contrary to tcp_check_req(), in both syncookies and TFO cases, we need to drop the request socket. Also, since the child socket is created with inet_csk_clone_lock(), we have to unlock it and drop an extra reference (->sk_refcount is initially set to 2 and inet_csk_reqsk_queue_add() drops only one ref). For TFO, we also need to revert the work done by tcp_try_fastopen() (with reqsk_fastopen_remove()). Fixes: 7716682cc58e ("tcp/dccp: fix another race at listener dismantle") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-09fou, fou6: avoid uninit-value in gue_err() and gue6_err()Eric Dumazet1-2/+2
My prior commit missed the fact that these functions were using udp_hdr() (aka skb_transport_header()) to get access to GUE header. Since pskb_transport_may_pull() does not exist yet, we have to add transport_offset to our pskb_may_pull() calls. BUG: KMSAN: uninit-value in gue_err+0x514/0xfa0 net/ipv4/fou.c:1032 CPU: 1 PID: 10648 Comm: syz-executor.1 Not tainted 5.0.0+ #11 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x173/0x1d0 lib/dump_stack.c:113 kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:600 __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313 gue_err+0x514/0xfa0 net/ipv4/fou.c:1032 __udp4_lib_err_encap_no_sk net/ipv4/udp.c:571 [inline] __udp4_lib_err_encap net/ipv4/udp.c:626 [inline] __udp4_lib_err+0x12e6/0x1d40 net/ipv4/udp.c:665 udp_err+0x74/0x90 net/ipv4/udp.c:737 icmp_socket_deliver net/ipv4/icmp.c:767 [inline] icmp_unreach+0xb65/0x1070 net/ipv4/icmp.c:884 icmp_rcv+0x11a1/0x1950 net/ipv4/icmp.c:1066 ip_protocol_deliver_rcu+0x584/0xbb0 net/ipv4/ip_input.c:208 ip_local_deliver_finish net/ipv4/ip_input.c:234 [inline] NF_HOOK include/linux/netfilter.h:289 [inline] ip_local_deliver+0x624/0x7b0 net/ipv4/ip_input.c:255 dst_input include/net/dst.h:450 [inline] ip_rcv_finish net/ipv4/ip_input.c:414 [inline] NF_HOOK include/linux/netfilter.h:289 [inline] ip_rcv+0x6bd/0x740 net/ipv4/ip_input.c:524 __netif_receive_skb_one_core net/core/dev.c:4973 [inline] __netif_receive_skb net/core/dev.c:5083 [inline] process_backlog+0x756/0x10e0 net/core/dev.c:5923 napi_poll net/core/dev.c:6346 [inline] net_rx_action+0x78b/0x1a60 net/core/dev.c:6412 __do_softirq+0x53f/0x93a kernel/softirq.c:293 invoke_softirq kernel/softirq.c:375 [inline] irq_exit+0x214/0x250 kernel/softirq.c:416 exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:536 smp_apic_timer_interrupt+0x48/0x70 arch/x86/kernel/apic/apic.c:1064 apic_timer_interrupt+0x2e/0x40 arch/x86/entry/entry_64.S:814 </IRQ> RIP: 0010:finish_lock_switch+0x2b/0x40 kernel/sched/core.c:2597 Code: 48 89 e5 53 48 89 fb e8 63 e7 95 00 8b b8 88 0c 00 00 48 8b 00 48 85 c0 75 12 48 89 df e8 dd db 95 00 c6 00 00 c6 03 00 fb 5b <5d> c3 e8 4e e6 95 00 eb e7 66 90 66 2e 0f 1f 84 00 00 00 00 00 55 RSP: 0018:ffff888081a0fc80 EFLAGS: 00000296 ORIG_RAX: ffffffffffffff13 RAX: ffff88821fd6bd80 RBX: ffff888027898000 RCX: ccccccccccccd000 RDX: ffff88821fca8d80 RSI: ffff888000000000 RDI: 00000000000004a0 RBP: ffff888081a0fc80 R08: 0000000000000002 R09: ffff888081a0fb08 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001 R13: ffff88811130e388 R14: ffff88811130da00 R15: ffff88812fdb7d80 finish_task_switch+0xfc/0x2d0 kernel/sched/core.c:2698 context_switch kernel/sched/core.c:2851 [inline] __schedule+0x6cc/0x800 kernel/sched/core.c:3491 schedule+0x15b/0x240 kernel/sched/core.c:3535 freezable_schedule include/linux/freezer.h:172 [inline] do_nanosleep+0x2ba/0x980 kernel/time/hrtimer.c:1679 hrtimer_nanosleep kernel/time/hrtimer.c:1733 [inline] __do_sys_nanosleep kernel/time/hrtimer.c:1767 [inline] __se_sys_nanosleep+0x746/0x960 kernel/time/hrtimer.c:1754 __x64_sys_nanosleep+0x3e/0x60 kernel/time/hrtimer.c:1754 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291 entry_SYSCALL_64_after_hwframe+0x63/0xe7 RIP: 0033:0x4855a0 Code: 00 00 48 c7 c0 d4 ff ff ff 64 c7 00 16 00 00 00 31 c0 eb be 66 0f 1f 44 00 00 83 3d b1 11 5d 00 00 75 14 b8 23 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 04 e2 f8 ff c3 48 83 ec 08 e8 3a 55 fd ff RSP: 002b:0000000000a4fd58 EFLAGS: 00000246 ORIG_RAX: 0000000000000023 RAX: ffffffffffffffda RBX: 0000000000085780 RCX: 00000000004855a0 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000a4fd60 RBP: 00000000000007ec R08: 0000000000000001 R09: 0000000000ceb940 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000008 R13: 0000000000a4fdb0 R14: 0000000000085711 R15: 0000000000a4fdc0 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:205 [inline] kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:159 kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176 kmsan_slab_alloc+0xe/0x10 mm/kmsan/kmsan_hooks.c:185 slab_post_alloc_hook mm/slab.h:445 [inline] slab_alloc_node mm/slub.c:2773 [inline] __kmalloc_node_track_caller+0xe9e/0xff0 mm/slub.c:4398 __kmalloc_reserve net/core/skbuff.c:140 [inline] __alloc_skb+0x309/0xa20 net/core/skbuff.c:208 alloc_skb include/linux/skbuff.h:1012 [inline] alloc_skb_with_frags+0x186/0xa60 net/core/skbuff.c:5287 sock_alloc_send_pskb+0xafd/0x10a0 net/core/sock.c:2091 sock_alloc_send_skb+0xca/0xe0 net/core/sock.c:2108 __ip_append_data+0x34cd/0x5000 net/ipv4/ip_output.c:998 ip_append_data+0x324/0x480 net/ipv4/ip_output.c:1220 icmp_push_reply+0x23d/0x7e0 net/ipv4/icmp.c:375 __icmp_send+0x2ea3/0x30f0 net/ipv4/icmp.c:737 icmp_send include/net/icmp.h:47 [inline] ipv4_link_failure+0x6d/0x230 net/ipv4/route.c:1190 dst_link_failure include/net/dst.h:427 [inline] arp_error_report+0x106/0x1a0 net/ipv4/arp.c:297 neigh_invalidate+0x359/0x8e0 net/core/neighbour.c:992 neigh_timer_handler+0xdf2/0x1280 net/core/neighbour.c:1078 call_timer_fn+0x285/0x600 kernel/time/timer.c:1325 expire_timers kernel/time/timer.c:1362 [inline] __run_timers+0xdb4/0x11d0 kernel/time/timer.c:1681 run_timer_softirq+0x2e/0x50 kernel/time/timer.c:1694 __do_softirq+0x53f/0x93a kernel/softirq.c:293 Fixes: 26fc181e6cac ("fou, fou6: do not assume linear skbs") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Stefano Brivio <sbrivio@redhat.com> Cc: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-08route: set the deleted fnhe fnhe_daddr to 0 in ip_del_fnhe to fix a raceXin Long1-0/+4
The race occurs in __mkroute_output() when 2 threads lookup a dst: CPU A CPU B find_exception() find_exception() [fnhe expires] ip_del_fnhe() [fnhe is deleted] rt_bind_exception() In rt_bind_exception() it will bind a deleted fnhe with the new dst, and this dst will get no chance to be freed. It causes a dev defcnt leak and consecutive dmesg warnings: unregister_netdevice: waiting for ethX to become free. Usage count = 1 Especially thanks Jon to identify the issue. This patch fixes it by setting fnhe_daddr to 0 in ip_del_fnhe() to stop binding the deleted fnhe with a new dst when checking fnhe's fnhe_daddr and daddr in rt_bind_exception(). It works as both ip_del_fnhe() and rt_bind_exception() are protected by fnhe_lock and the fhne is freed by kfree_rcu(). Fixes: deed49df7390 ("route: check and remove route cache when we get route") Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-08xfrm: Reset secpath in xfrm failureMyungho Jung1-3/+5
In esp4_gro_receive() and esp6_gro_receive(), secpath can be allocated without adding xfrm state to xvec. Then, sp->xvec[sp->len - 1] would fail and result in dereferencing invalid pointer in esp4_gso_segment() and esp6_gso_segment(). Reset secpath if xfrm function returns error. Fixes: 7785bba299a8 ("esp: Add a software GRO codepath") Reported-by: syzbot+b69368fd933c6c592f4c@syzkaller.appspotmail.com Signed-off-by: Myungho Jung <mhjungk@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-03-06tcp: do not report TCP_CM_INQ of 0 for closed connectionsSoheil Hassas Yeganeh1-0/+5
Returning 0 as inq to userspace indicates there is no more data to read, and the application needs to wait for EPOLLIN. For a connection that has received FIN from the remote peer, however, the application must continue reading until getting EOF (return value of 0 from tcp_recvmsg) or an error, if edge-triggered epoll (EPOLLET) is being used. Otherwise, the application will never receive a new EPOLLIN, since there is no epoll edge after the FIN. Return 1 when there is no data left on the queue but the connection has received FIN, so that the applications continue reading. Fixes: b75eba76d3d72 (tcp: send in-queue bytes in cmsg upon read) Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-06tcp: detecting the misuse of .sendpage for Slab objectsVasily Averin1-0/+4
sendpage was not designed for processing of the Slab pages, in some situations it can trigger BUG_ON on receiving side. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-06iptunnel: NULL pointer deref for ip_md_tunnel_xmitAlan Maguire1-3/+6
Naresh Kamboju noted the following oops during execution of selftest tools/testing/selftests/bpf/test_tunnel.sh on x86_64: [ 274.120445] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [ 274.128285] #PF error: [INSTR] [ 274.131351] PGD 8000000414a0e067 P4D 8000000414a0e067 PUD 3b6334067 PMD 0 [ 274.138241] Oops: 0010 [#1] SMP PTI [ 274.141734] CPU: 1 PID: 11464 Comm: ping Not tainted 5.0.0-rc4-next-20190129 #1 [ 274.149046] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.0b 07/27/2017 [ 274.156526] RIP: 0010: (null) [ 274.160280] Code: Bad RIP value. [ 274.163509] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286 [ 274.168726] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000 [ 274.175851] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0 [ 274.182974] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c [ 274.190098] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000 [ 274.197222] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400 [ 274.204346] FS: 00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000) knlGS:0000000000000000 [ 274.212424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 274.218162] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0 [ 274.225292] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 274.232416] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 274.239541] Call Trace: [ 274.241988] ? tnl_update_pmtu+0x296/0x3b0 [ 274.246085] ip_md_tunnel_xmit+0x1bc/0x520 [ 274.250176] gre_fb_xmit+0x330/0x390 [ 274.253754] gre_tap_xmit+0x128/0x180 [ 274.257414] dev_hard_start_xmit+0xb7/0x300 [ 274.261598] sch_direct_xmit+0xf6/0x290 [ 274.265430] __qdisc_run+0x15d/0x5e0 [ 274.269007] __dev_queue_xmit+0x2c5/0xc00 [ 274.273011] ? dev_queue_xmit+0x10/0x20 [ 274.276842] ? eth_header+0x2b/0xc0 [ 274.280326] dev_queue_xmit+0x10/0x20 [ 274.283984] ? dev_queue_xmit+0x10/0x20 [ 274.287813] arp_xmit+0x1a/0xf0 [ 274.290952] arp_send_dst.part.19+0x46/0x60 [ 274.295138] arp_solicit+0x177/0x6b0 [ 274.298708] ? mod_timer+0x18e/0x440 [ 274.302281] neigh_probe+0x57/0x70 [ 274.305684] __neigh_event_send+0x197/0x2d0 [ 274.309862] neigh_resolve_output+0x18c/0x210 [ 274.314212] ip_finish_output2+0x257/0x690 [ 274.318304] ip_finish_output+0x219/0x340 [ 274.322314] ? ip_finish_output+0x219/0x340 [ 274.326493] ip_output+0x76/0x240 [ 274.329805] ? ip_fragment.constprop.53+0x80/0x80 [ 274.334510] ip_local_out+0x3f/0x70 [ 274.337992] ip_send_skb+0x19/0x40 [ 274.341391] ip_push_pending_frames+0x33/0x40 [ 274.345740] raw_sendmsg+0xc15/0x11d0 [ 274.349403] ? __might_fault+0x85/0x90 [ 274.353151] ? _copy_from_user+0x6b/0xa0 [ 274.357070] ? rw_copy_check_uvector+0x54/0x130 [ 274.361604] inet_sendmsg+0x42/0x1c0 [ 274.365179] ? inet_sendmsg+0x42/0x1c0 [ 274.368937] sock_sendmsg+0x3e/0x50 [ 274.372460] ___sys_sendmsg+0x26f/0x2d0 [ 274.376293] ? lock_acquire+0x95/0x190 [ 274.380043] ? __handle_mm_fault+0x7ce/0xb70 [ 274.384307] ? lock_acquire+0x95/0x190 [ 274.388053] ? __audit_syscall_entry+0xdd/0x130 [ 274.392586] ? ktime_get_coarse_real_ts64+0x64/0xc0 [ 274.397461] ? __audit_syscall_entry+0xdd/0x130 [ 274.401989] ? trace_hardirqs_on+0x4c/0x100 [ 274.406173] __sys_sendmsg+0x63/0xa0 [ 274.409744] ? __sys_sendmsg+0x63/0xa0 [ 274.413488] __x64_sys_sendmsg+0x1f/0x30 [ 274.417405] do_syscall_64+0x55/0x190 [ 274.421064] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 274.426113] RIP: 0033:0x7ff4ae0e6e87 [ 274.429686] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 00 00 8b 05 ca d9 2b 00 48 63 d2 48 63 ff 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 53 48 89 f3 48 83 ec 10 48 89 7c 24 08 [ 274.448422] RSP: 002b:00007ffcd9b76db8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 274.455978] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007ff4ae0e6e87 [ 274.463104] RDX: 0000000000000000 RSI: 00000000006092e0 RDI: 0000000000000003 [ 274.470228] RBP: 0000000000000000 R08: 00007ffcd9bc40a0 R09: 00007ffcd9bc4080 [ 274.477349] R10: 000000000000060a R11: 0000000000000246 R12: 0000000000000003 [ 274.484475] R13: 0000000000000016 R14: 00007ffcd9b77fa0 R15: 00007ffcd9b78da4 [ 274.491602] Modules linked in: cls_bpf sch_ingress iptable_filter ip_tables algif_hash af_alg x86_pkg_temp_thermal fuse [last unloaded: test_bpf] [ 274.504634] CR2: 0000000000000000 [ 274.507976] ---[ end trace 196d18386545eae1 ]--- [ 274.512588] RIP: 0010: (null) [ 274.516334] Code: Bad RIP value. [ 274.519557] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286 [ 274.524775] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000 [ 274.531921] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0 [ 274.539082] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c [ 274.546205] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000 [ 274.553329] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400 [ 274.560456] FS: 00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000) knlGS:0000000000000000 [ 274.568541] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 274.574277] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0 [ 274.581403] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 274.588535] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 274.595658] Kernel panic - not syncing: Fatal exception in interrupt [ 274.602046] Kernel Offset: 0x14400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 274.612827] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- [ 274.620387] ------------[ cut here ]------------ I'm also seeing the same failure on x86_64, and it reproduces consistently. >From poking around it looks like the skb's dst entry is being used to calculate the mtu in: mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu; ...but because that dst_entry has an "ops" value set to md_dst_ops, the various ops (including mtu) are not set: crash> struct sk_buff._skb_refdst ffff928f87447700 -x _skb_refdst = 0xffffcd6fbf5ea590 crash> struct dst_entry.ops 0xffffcd6fbf5ea590 ops = 0xffffffffa0193800 crash> struct dst_ops.mtu 0xffffffffa0193800 mtu = 0x0 crash> I confirmed that the dst entry also has dst->input set to dst_md_discard, so it looks like it's an entry that's been initialized via __metadata_dst_init alright. I think the fix here is to use skb_valid_dst(skb) - it checks for DST_METADATA also, and with that fix in place, the problem - which was previously 100% reproducible - disappears. The below patch resolves the panic and all bpf tunnel tests pass without incident. Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Anders Roxell <anders.roxell@linaro.org> Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>