summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2019-04-30netfilter: use macros to create module aliases.Flavio Leitner7-7/+7
Each NAT helper creates a module alias which follows a pattern. Use macros for consistency. Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30netfilter: conntrack: limit sysctl setting for boolean optionsTonghao Zhang1-15/+33
We use the zero and one to limit the boolean options setting. After this patch we only set 0 or 1 to boolean options for nf conntrack sysctl. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30netfilter: nf_flow_table: check ttl value in flow offload data pathTaehee Yoo1-0/+6
nf_flow_offload_ip_hook() and nf_flow_offload_ipv6_hook() do not check ttl value. So, ttl value overflow may occur. Fixes: 97add9f0d66d ("netfilter: flow table support for IPv4") Fixes: 0995210753a2 ("netfilter: flow table support for IPv6") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30netfilter: nf_flow_table: fix netdev refcnt leakTaehee Yoo1-0/+1
flow_offload_alloc() calls nf_route() to get a dst_entry. Internally, nf_route() calls ip_route_output_key() that allocates a dst_entry and holds it. So, a dst_entry should be released by dst_release() if nf_route() is successful. Otherwise, netns exit routine cannot be finished and the following message is printed: [ 257.490952] unregister_netdevice: waiting for lo to become free. Usage count = 1 Fixes: ac2a66665e23 ("netfilter: add generic flow table infrastructure") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30netfilter: nft_flow_offload: add entry to flowtable after confirmationPablo Neira Ayuso1-2/+1
This is fixing flow offload for UDP traffic where packets only follow one single direction. The flow_offload_fixup_tcp() mechanism works fine in case that the offloaded entry remains in SYN_RECV state, given sequence tracking is reset and that conntrack handles syn+ack packets as a retransmission, ie. sES + synack => sIG for reply traffic. Fixes: a3c90f7a2323 ("netfilter: nf_tables: flow offload expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30netfilter: nf_tables: delay chain policy update until transaction is completeFlorian Westphal1-11/+39
When we process a long ruleset of the form chain input { type filter hook input priority filter; policy drop; ... } Then the base chain gets registered early on, we then continue to process/validate the next messages coming in the same transaction. Problem is that if the base chain policy is 'drop', it will take effect immediately, which causes all traffic to get blocked until the transaction completes or is aborted. Fix this by deferring the policy until the transaction has been processed and all of the rules have been flagged as active. Reported-by: Jann Haber <jann.haber@selfnet.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30netfilter: nf_tables: fix implicit include of module.hPaul Gortmaker1-0/+1
This file clearly uses modular infrastructure but does not call out the inclusion of <linux/module.h> explicitly. We add that include explicitly here, so we can tidy up some header usage elsewhere w/o causing build breakage. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30netfilter: nf_tables: relocate header content to consumerPaul Gortmaker1-0/+17
The nf_tables.h header is used in a lot of files, but it turns out that there is only one actual user of nft_expr_clone(). Hence we relocate that function to be with the one consumer of it and avoid having to process it with CPP for all the other files. This will also enable a reduction in the other headers that the nf_tables.h itself has to include just to be stand-alone, hence a pending further significant reduction in the CPP content that needs to get processed for each netfilter file. Note that the explicit "inline" has been dropped as part of this relocation. In similar changes to this, I believe Dave has asked this be done, so we free up gcc to make the choice of whether to inline or not. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30ipv6/flowlabel: wait rcu grace period before put_pid()Eric Dumazet1-6/+12
syzbot was able to catch a use-after-free read in pid_nr_ns() [1] ip6fl_seq_show() seems to use RCU protection, dereferencing fl->owner.pid but fl_free() releases fl->owner.pid before rcu grace period is started. [1] BUG: KASAN: use-after-free in pid_nr_ns+0x128/0x140 kernel/pid.c:407 Read of size 4 at addr ffff888094012a04 by task syz-executor.0/18087 CPU: 0 PID: 18087 Comm: syz-executor.0 Not tainted 5.1.0-rc6+ #89 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131 pid_nr_ns+0x128/0x140 kernel/pid.c:407 ip6fl_seq_show+0x2f8/0x4f0 net/ipv6/ip6_flowlabel.c:794 seq_read+0xad3/0x1130 fs/seq_file.c:268 proc_reg_read+0x1fe/0x2c0 fs/proc/inode.c:227 do_loop_readv_writev fs/read_write.c:701 [inline] do_loop_readv_writev fs/read_write.c:688 [inline] do_iter_read+0x4a9/0x660 fs/read_write.c:922 vfs_readv+0xf0/0x160 fs/read_write.c:984 kernel_readv fs/splice.c:358 [inline] default_file_splice_read+0x475/0x890 fs/splice.c:413 do_splice_to+0x12a/0x190 fs/splice.c:876 splice_direct_to_actor+0x2d2/0x970 fs/splice.c:953 do_splice_direct+0x1da/0x2a0 fs/splice.c:1062 do_sendfile+0x597/0xd00 fs/read_write.c:1443 __do_sys_sendfile64 fs/read_write.c:1498 [inline] __se_sys_sendfile64 fs/read_write.c:1490 [inline] __x64_sys_sendfile64+0x15a/0x220 fs/read_write.c:1490 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x458da9 Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f300d24bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 0000000000458da9 RDX: 00000000200000c0 RSI: 0000000000000008 RDI: 0000000000000007 RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 000000000000005a R11: 0000000000000246 R12: 00007f300d24c6d4 R13: 00000000004c5fa3 R14: 00000000004da748 R15: 00000000ffffffff Allocated by task 17543: save_stack+0x45/0xd0 mm/kasan/common.c:75 set_track mm/kasan/common.c:87 [inline] __kasan_kmalloc mm/kasan/common.c:497 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:505 slab_post_alloc_hook mm/slab.h:437 [inline] slab_alloc mm/slab.c:3393 [inline] kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3555 alloc_pid+0x55/0x8f0 kernel/pid.c:168 copy_process.part.0+0x3b08/0x7980 kernel/fork.c:1932 copy_process kernel/fork.c:1709 [inline] _do_fork+0x257/0xfd0 kernel/fork.c:2226 __do_sys_clone kernel/fork.c:2333 [inline] __se_sys_clone kernel/fork.c:2327 [inline] __x64_sys_clone+0xbf/0x150 kernel/fork.c:2327 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 7789: save_stack+0x45/0xd0 mm/kasan/common.c:75 set_track mm/kasan/common.c:87 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459 kasan_slab_free+0xe/0x10 mm/kasan/common.c:467 __cache_free mm/slab.c:3499 [inline] kmem_cache_free+0x86/0x260 mm/slab.c:3765 put_pid.part.0+0x111/0x150 kernel/pid.c:111 put_pid+0x20/0x30 kernel/pid.c:105 fl_free+0xbe/0xe0 net/ipv6/ip6_flowlabel.c:102 ip6_fl_gc+0x295/0x3e0 net/ipv6/ip6_flowlabel.c:152 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325 expire_timers kernel/time/timer.c:1362 [inline] __run_timers kernel/time/timer.c:1681 [inline] __run_timers kernel/time/timer.c:1649 [inline] run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694 __do_softirq+0x266/0x95a kernel/softirq.c:293 The buggy address belongs to the object at ffff888094012a00 which belongs to the cache pid_2 of size 88 The buggy address is located 4 bytes inside of 88-byte region [ffff888094012a00, ffff888094012a58) The buggy address belongs to the page: page:ffffea0002500480 count:1 mapcount:0 mapping:ffff88809a483080 index:0xffff888094012980 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea00018a3508 ffffea0002524a88 ffff88809a483080 raw: ffff888094012980 ffff888094012000 000000010000001b 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888094012900: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc ffff888094012980: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc >ffff888094012a00: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc ^ ffff888094012a80: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc ffff888094012b00: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc Fixes: 4f82f45730c6 ("net ip6 flowlabel: Make owner a union of struct pid * and kuid_t") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30vrf: Use orig netdev to count Ip6InNoRoutes and a fresh route lookup when ↵Stephen Suryaputra1-6/+17
sending dest unreach When there is no route to an IPv6 dest addr, skb_dst(skb) points to loopback dev in the case of that the IP6CB(skb)->iif is enslaved to a vrf. This causes Ip6InNoRoutes to be incremented on the loopback dev. This also causes the lookup to fail on icmpv6_send() and the dest unreachable to not sent and Ip6OutNoRoutes gets incremented on the loopback dev. To reproduce: * Gateway configuration: ip link add dev vrf_258 type vrf table 258 ip link set dev enp0s9 master vrf_258 ip addr add 66:1/64 dev enp0s9 ip -6 route add unreachable default metric 8192 table 258 sysctl -w net.ipv6.conf.all.forwarding=1 sysctl -w net.ipv6.conf.enp0s9.forwarding=1 * Sender configuration: ip addr add 66::2/64 dev enp0s9 ip -6 route add default via 66::1 and ping 67::1 for example from the sender. Fix this by counting on the original netdev and reset the skb dst to force a fresh lookup. v2: Fix typo of destination address in the repro steps. v3: Simplify the loopback check (per David Ahern) and use reverse Christmas tree format (per David Miller). Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Tested-by: David Ahern <dsahern@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30tcp: add sanity tests in tcp_add_backlog()Eric Dumazet1-1/+12
Richard and Bruno both reported that my commit added a bug, and Bruno was able to determine the problem came when a segment wih a FIN packet was coalesced to a prior one in tcp backlog queue. It turns out the header prediction in tcp_rcv_established() looks back to TCP headers in the packet, not in the metadata (aka TCP_SKB_CB(skb)->tcp_flags) The fast path in tcp_rcv_established() is not supposed to handle a FIN flag (it does not call tcp_fin()) Therefore we need to make sure to propagate the FIN flag, so that the coalesced packet does not go through the fast path, the same than a GRO packet carrying a FIN flag. While we are at it, make sure we do not coalesce packets with RST or SYN, or if they do not have ACK set. Many thanks to Richard and Bruno for pinpointing the bad commit, and to Richard for providing a first version of the fix. Fixes: 4f693b55c3d2 ("tcp: implement coalescing on backlog queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org> Reported-by: Bruno Prémont <bonbons@sysophe.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30ipv6: invert flowlabel sharing check in process and user modeWillem de Bruijn1-2/+2
A request for a flowlabel fails in process or user exclusive mode must fail if the caller pid or uid does not match. Invert the test. Previously, the test was unsafe wrt PID recycling, but indeed tested for inequality: fl1->owner != fl->owner Fixes: 4f82f45730c68 ("net ip6 flowlabel: Make owner a union of struct pid* and kuid_t") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29Merge tag 'mac80211-for-davem-2019-04-26' of ↵David S. Miller4-6/+9
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== * fix use-after-free in mac80211 TXQs * fix RX STBC byte order * fix debugfs rename crashing due to ERR_PTR() * fix missing regulatory notification ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Allow tag drivers to be built as modulesAndrew Lunn2-29/+73
Make the CONFIG symbols tristate and add help text. The broadcom and Microchip KSZ tag drivers support two different tagging protocols in one driver. Add a configuration option for the drivers, and then options to select the protocol. Create a submenu for the tagging drivers. Signed-off-by: Andrew Lunn <andrew@lunn.ch> v2: tab/space cleanup Help text wording NET_DSA_TAG_BRCM_COMMON and NET_DSA_TAG_KZS_COMMON hidden v3: More tabification Punctuation v4: trailler->trailer Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: tag_brcm: Avoid unused symbolsAndrew Lunn1-2/+6
It is possible that the driver is compiled with both CONFIG_NET_DSA_TAG_BRCM and CONFIG_NET_DSA_TAG_BRCM_PREPEND disabled. This results in warnings about unused symbols. Add some conditional compilation to avoid this. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> v2 Reorder patch to before tag drivers can be modules Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Cleanup unneeded table and make tag structures staticAndrew Lunn11-76/+11
Now that tag drivers dynamically register, we don't need the static table. Remove it. This also means the tag driver structures can be made static. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Make use of the list of tag driversAndrew Lunn1-5/+34
Implement the _get and _put functions to make use of the list of tag drivers. Also, trigger the loading of the module, based on the alias information. The _get function takes a reference on the tag driver, so it cannot be unloaded, and the _put function releases the reference. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> v2: Make tag_driver_register void Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Add stub tag driver put methodAndrew Lunn4-0/+9
When a DSA switch driver is unloaded, the lock on the tag driver should be released so the module can be unloaded. Add the needed calls, but leave the actual release code as a stub. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> v2 Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Rename dsa_resolve_tag_protocol() to _get ready for lockingAndrew Lunn4-4/+5
dsa_resolve_tag_protocol() is used to find the tagging driver needed by a switch driver. When the tagging drivers become modules, it will be necassary to take a reference on the module to prevent it being unloaded. So rename this function to _get() to indicate it has some locking properties. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Register the none tagger opsAndrew Lunn1-0/+7
The none tagger is special in that it does not live in a tag_*.c file, but is within the core. Register/unregister when DSA is loaded/unloaded. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Keep link list of tag driversAndrew Lunn1-0/+28
Let the tag drivers register themselves with the DSA core, keeping them in a linked list. Signed-off-by: Andrew Lunn <andrew@lunn.ch> v2 Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Add boilerplate helper to register DSA tag driver modulesAndrew Lunn10-2/+52
A DSA tag driver module will need to register the tag protocols it implements with the DSA core. Add macros containing this boiler plate. The registration/unregistration code is currently just a stub. A Later patch will add the real implementation. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> v2 Fix indent of #endif Rewrite to move list pointer into a new structure v3 Move kdoc next to macro Fix THIS_MODULE indentation Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Add TAG protocol to tag opsAndrew Lunn10-0/+12
In order that we can match the tagging protocol a switch driver request to the tagger, we need to know what protocol the tagger supports. Add this information to the ops structure. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> v2 More tag protocol to end of structure to keep hot members at the beginning. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Add MODULE_LICENSE to tag driversAndrew Lunn9-0/+9
All the tag drivers are some variant of GPL. Add a MODULE_LICENSE() indicating this, so the drivers can later be compiled as modules. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Add MODULE_ALIAS to taggers in preparation to become modulesAndrew Lunn9-0/+22
When the tag drivers become modules, we will need to dynamically load them based on what the switch drivers need. Add aliases to map between the TAG protocol and the driver. In order to do this, we need the tag protocol number as something which the C pre-processor can stringinfy. Only the compiler knows the value of an enum, CPP cannot use them. So add #defines. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Move tagger name into its ops structureAndrew Lunn10-43/+13
Rather than keep a list to map a tagger ops to a name, place the name into the ops structure. This removes the hard coded list, a step towards making the taggers more dynamic. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> v2: Move name to end of structure, keeping the hot entries at the beginning. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29dsa: Add SPDX header to tag drivers.Andrew Lunn8-52/+8
Add an SPDX header, and remove the license boilerplate text. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller7-105/+1021
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-04-28 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Introduce BPF socket local storage map so that BPF programs can store private data they associate with a socket (instead of e.g. separate hash table), from Martin. 2) Add support for bpftool to dump BTF types. This is done through a new `bpftool btf dump` sub-command, from Andrii. 3) Enable BPF-based flow dissector for skb-less eth_get_headlen() calls which was currently not supported since skb was used to lookup netns, from Stanislav. 4) Add an opt-in interface for tracepoints to expose a writable context for attached BPF programs, used here for NBD sockets, from Matt. 5) BPF xadd related arm64 JIT fixes and scalability improvements, from Daniel. 6) Change the skb->protocol for bpf_skb_adjust_room() helper in order to support tunnels such as sit. Add selftests as well, from Willem. 7) Various smaller misc fixes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28udp: fix GRO reception in case of length mismatchPaolo Abeni1-4/+5
Currently, the UDP GRO code path does bad things on some edge conditions - Aggregation can happen even on packet with different lengths. Fix the above by rewriting the 'complete' condition for GRO packets. While at it, note explicitly that we allow merging the first packet per burst below gso_size. Reported-by: Sean Tong <seantong114@gmail.com> Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28net/tls: fix copy to fragments in reencryptJakub Kicinski1-7/+22
Fragments may contain data from other records so we have to account for that when we calculate the destination and max length of copy we can perform. Note that 'offset' is the offset within the message, so it can't be passed as offset within the frag.. Here skb_store_bits() would have realised the call is wrong and simply not copy data. Fixes: 4799ac81e52a ("tls: Add rx inline crypto offload") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28net/tls: don't copy negative amounts of data in reencryptJakub Kicinski1-6/+8
There is no guarantee the record starts before the skb frags. If we don't check for this condition copy amount will get negative, leading to reads and writes to random memory locations. Familiar hilarity ensues. Fixes: 4799ac81e52a ("tls: Add rx inline crypto offload") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28genetlink: optionally validate strictly/dumpsJohannes Berg27-3/+358
Add options to strictly validate messages and dump messages, sometimes perhaps validating dump messages non-strictly may be required, so add an option for that as well. Since none of this can really be applied to existing commands, set the options everwhere using the following spatch: @@ identifier ops; expression X; @@ struct genl_ops ops[] = { ..., { .cmd = X, + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, ... }, ... }; For new commands one should just not copy the .validate 'opt-out' flags and thus get strict validation. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28netlink: make validation more configurable for future strictnessJohannes Berg122-709/+860
We currently have two levels of strict validation: 1) liberal (default) - undefined (type >= max) & NLA_UNSPEC attributes accepted - attribute length >= expected accepted - garbage at end of message accepted 2) strict (opt-in) - NLA_UNSPEC attributes accepted - attribute length >= expected accepted Split out parsing strictness into four different options: * TRAILING - check that there's no trailing data after parsing attributes (in message or nested) * MAXTYPE - reject attrs > max known type * UNSPEC - reject attributes with NLA_UNSPEC policy entries * STRICT_ATTRS - strictly validate attribute size The default for future things should be *everything*. The current *_strict() is a combination of TRAILING and MAXTYPE, and is renamed to _deprecated_strict(). The current regular parsing has none of this, and is renamed to *_parse_deprecated(). Additionally it allows us to selectively set one of the new flags even on old policies. Notably, the UNSPEC flag could be useful in this case, since it can be arranged (by filling in the policy) to not be an incompatible userspace ABI change, but would then going forward prevent forgetting attribute entries. Similar can apply to the POLICY flag. We end up with the following renames: * nla_parse -> nla_parse_deprecated * nla_parse_strict -> nla_parse_deprecated_strict * nlmsg_parse -> nlmsg_parse_deprecated * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict * nla_parse_nested -> nla_parse_nested_deprecated * nla_validate_nested -> nla_validate_nested_deprecated Using spatch, of course: @@ expression TB, MAX, HEAD, LEN, POL, EXT; @@ -nla_parse(TB, MAX, HEAD, LEN, POL, EXT) +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression TB, MAX, NLA, POL, EXT; @@ -nla_parse_nested(TB, MAX, NLA, POL, EXT) +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT) @@ expression START, MAX, POL, EXT; @@ -nla_validate_nested(START, MAX, POL, EXT) +nla_validate_nested_deprecated(START, MAX, POL, EXT) @@ expression NLH, HDRLEN, MAX, POL, EXT; @@ -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT) +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT) For this patch, don't actually add the strict, non-renamed versions yet so that it breaks compile if I get it wrong. Also, while at it, make nla_validate and nla_parse go down to a common __nla_validate_parse() function to avoid code duplication. Ultimately, this allows us to have very strict validation for every new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the next patch, while existing things will continue to work as is. In effect then, this adds fully strict validation for any new command. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28net: fix two coding style issuesMichal Kubecek2-3/+4
This is a simple cleanup addressing two coding style issues found by checkpatch.pl in an earlier patch. It's submitted as a separate patch to keep the original patch as it was generated by spatch. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28ipset: drop ipset_nest_start() and ipset_nest_end()Michal Kubecek3-21/+21
After the previous commit, both ipset_nest_start() and ipset_nest_end() are just aliases for nla_nest_start() and nla_nest_end() so that there is no need to keep them. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28netlink: make nla_nest_start() add NLA_F_NESTED flagMichal Kubecek98-422/+469
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most netlink based interfaces (including recently added ones) are still not setting it in kernel generated messages. Without the flag, message parsers not aware of attribute semantics (e.g. wireshark dissector or libmnl's mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display the structure of their contents. Unfortunately we cannot just add the flag everywhere as there may be userspace applications which check nlattr::nla_type directly rather than through a helper masking out the flags. Therefore the patch renames nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start() as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually are rewritten to use nla_nest_start(). Except for changes in include/net/netlink.h, the patch was generated using this semantic patch: @@ expression E1, E2; @@ -nla_nest_start(E1, E2) +nla_nest_start_noflag(E1, E2) @@ expression E1, E2; @@ -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED) +nla_nest_start(E1, E2) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net/tls: byte swap device req TCP seq no upon settingJakub Kicinski1-1/+1
To avoid a sparse warning byteswap the be32 sequence number before it's stored in the atomic value. While at it drop unnecessary brackets and use kernel's u64 type. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net/tls: remove old exports of sk_destruct functionsJakub Kicinski1-18/+17
tls_device_sk_destruct being set on a socket used to indicate that socket is a kTLS device one. That is no longer true - now we use sk_validate_xmit_skb pointer for that purpose. Remove the export. tls_device_attach() needs to be moved. While at it, remove the dead declaration of tls_sk_destruct(). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net/tls: don't log errors every time offload can't proceedJakub Kicinski1-6/+1
Currently when CONFIG_TLS_DEVICE is set each time kTLS connection is opened and the offload is not successful (either because the underlying device doesn't support it or e.g. it's tables are full) a rate limited error will be printed to the logs. There is nothing wrong with failing TLS offload. SW path will process the packets just fine, drop the noisy messages. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27bpf: Introduce bpf sk local storageMartin KaFai Lau5-0/+824
After allowing a bpf prog to - directly read the skb->sk ptr - get the fullsock bpf_sock by "bpf_sk_fullsock()" - get the bpf_tcp_sock by "bpf_tcp_sock()" - get the listener sock by "bpf_get_listener_sock()" - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock" into different bpf running context. this patch is another effort to make bpf's network programming more intuitive to do (together with memory and performance benefit). When bpf prog needs to store data for a sk, the current practice is to define a map with the usual 4-tuples (src/dst ip/port) as the key. If multiple bpf progs require to store different sk data, multiple maps have to be defined. Hence, wasting memory to store the duplicated keys (i.e. 4 tuples here) in each of the bpf map. [ The smallest key could be the sk pointer itself which requires some enhancement in the verifier and it is a separate topic. ] Also, the bpf prog needs to clean up the elem when sk is freed. Otherwise, the bpf map will become full and un-usable quickly. The sk-free tracking currently could be done during sk state transition (e.g. BPF_SOCK_OPS_STATE_CB). The size of the map needs to be predefined which then usually ended-up with an over-provisioned map in production. Even the map was re-sizable, while the sk naturally come and go away already, this potential re-size operation is arguably redundant if the data can be directly connected to the sk itself instead of proxy-ing through a bpf map. This patch introduces sk->sk_bpf_storage to provide local storage space at sk for bpf prog to use. The space will be allocated when the first bpf prog has created data for this particular sk. The design optimizes the bpf prog's lookup (and then optionally followed by an inline update). bpf_spin_lock should be used if the inline update needs to be protected. BPF_MAP_TYPE_SK_STORAGE: ----------------------- To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can be created to fit different bpf progs' needs. The map enforces BTF to allow printing the sk-local-storage during a system-wise sk dump (e.g. "ss -ta") in the future. The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete a "sk-local-storage" data from a particular sk. Think of the map as a meta-data (or "type") of a "sk-local-storage". This particular "type" of "sk-local-storage" data can then be stored in any sk. The main purposes of this map are mostly: 1. Define the size of a "sk-local-storage" type. 2. Provide a similar syscall userspace API as the map (e.g. lookup/update, map-id, map-btf...etc.) 3. Keep track of all sk's storages of this "type" and clean them up when the map is freed. sk->sk_bpf_storage: ------------------ The main lookup/update/delete is done on sk->sk_bpf_storage (which is a "struct bpf_sk_storage"). When doing a lookup, the "map" pointer is now used as the "key" to search on the sk_storage->list. The "map" pointer is actually serving as the "type" of the "sk-local-storage" that is being requested. To allow very fast lookup, it should be as fast as looking up an array at a stable-offset. At the same time, it is not ideal to set a hard limit on the number of sk-local-storage "type" that the system can have. Hence, this patch takes a cache approach. The last search result from sk_storage->list is cached in sk_storage->cache[] which is a stable sized array. Each "sk-local-storage" type has a stable offset to the cache[] array. In the future, a map's flag could be introduced to do cache opt-out/enforcement if it became necessary. The cache size is 16 (i.e. 16 types of "sk-local-storage"). Programs can share map. On the program side, having a few bpf_progs running in the networking hotpath is already a lot. The bpf_prog should have already consolidated the existing sock-key-ed map usage to minimize the map lookup penalty. 16 has enough runway to grow. All sk-local-storage data will be removed from sk->sk_bpf_storage during sk destruction. bpf_sk_storage_get() and bpf_sk_storage_delete(): ------------------------------------------------ Instead of using bpf_map_(lookup|update|delete)_elem(), the bpf prog needs to use the new helper bpf_sk_storage_get() and bpf_sk_storage_delete(). The verifier can then enforce the ARG_PTR_TO_SOCKET argument. The bpf_sk_storage_get() also allows to "create" new elem if one does not exist in the sk. It is done by the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE. The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together, it has eliminated the potential use cases for an equivalent bpf_map_update_elem() API (for bpf_prog) in this patch. Misc notes: ---------- 1. map_get_next_key is not supported. From the userspace syscall perspective, the map has the socket fd as the key while the map can be shared by pinned-file or map-id. Since btf is enforced, the existing "ss" could be enhanced to pretty print the local-storage. Supporting a kernel defined btf with 4 tuples as the return key could be explored later also. 2. The sk->sk_lock cannot be acquired. Atomic operations is used instead. e.g. cmpxchg is done on the sk->sk_bpf_storage ptr. Please refer to the source code comments for the details in synchronization cases and considerations. 3. The mem is charged to the sk->sk_omem_alloc as the sk filter does. Benchmark: --------- Here is the benchmark data collected by turning on the "kernel.bpf_stats_enabled" sysctl. Two bpf progs are tested: One bpf prog with the usual bpf hashmap (max_entries = 8192) with the sk ptr as the key. (verifier is modified to support sk ptr as the key That should have shortened the key lookup time.) Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE. Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for each egress skb and then bump the cnt. netperf is used to drive data with 4096 connected UDP sockets. BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run) 27: cgroup_skb name egress_sk_map tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633 loaded_at 2019-04-15T13:46:39-0700 uid 0 xlated 344B jited 258B memlock 4096B map_ids 16 btf_id 5 BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run) 30: cgroup_skb name egress_sk_stora tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739 loaded_at 2019-04-15T13:47:54-0700 uid 0 xlated 168B jited 156B memlock 4096B map_ids 17 btf_id 6 Here is a high-level picture on how are the objects organized: sk ┌──────┐ │ │ │ │ │ │ │*sk_bpf_storage─────▶ bpf_sk_storage └──────┘ ┌───────┐ ┌───────────┤ list │ │ │ │ │ │ │ │ │ │ │ └───────┘ │ │ elem │ ┌────────┐ ├─▶│ snode │ │ ├────────┤ │ │ data │ bpf_map │ ├────────┤ ┌─────────┐ │ │map_node│◀─┬─────┤ list │ │ └────────┘ │ │ │ │ │ │ │ │ elem │ │ │ │ ┌────────┐ │ └─────────┘ └─▶│ snode │ │ ├────────┤ │ bpf_map │ data │ │ ┌─────────┐ ├────────┤ │ │ list ├───────▶│map_node│ │ │ │ └────────┘ │ │ │ │ │ │ elem │ └─────────┘ ┌────────┐ │ ┌─▶│ snode │ │ │ ├────────┤ │ │ │ data │ │ │ ├────────┤ │ │ │map_node│◀─┘ │ └────────┘ │ │ │ ┌───────┐ sk └──────────│ list │ ┌──────┐ │ │ │ │ │ │ │ │ │ │ │ │ └───────┘ │*sk_bpf_storage───────▶bpf_sk_storage └──────┘ Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27selftests: bpf: test writable buffers in raw tpsMatt Mullins1-0/+4
This tests that: * a BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE cannot be attached if it uses either: * a variable offset to the tracepoint buffer, or * an offset beyond the size of the tracepoint buffer * a tracer can modify the buffer provided when attached to a writable tracepoint in bpf_prog_test_run Signed-off-by: Matt Mullins <mmullins@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26SUNRPC: Use namespace of listening daemon in the client AUTH_GSS upcallTrond Myklebust1-16/+44
When the client needs to talk to rpc.gssd, we should ensure that the uid argument is encoded to match the user namespace of the daemon. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26SUNRPC: Use the client user namespace when encoding credsTrond Myklebust2-7/+13
When encoding AUTH_UNIX creds and AUTH_GSS upcalls, use the user namespace of the process that created the rpc client. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26Merge tag 'mac80211-next-for-davem-2019-04-26' of ↵David S. Miller30-345/+1220
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Various updates, notably: * extended key ID support (from 802.11-2016) * per-STA TX power control support * mac80211 TX performance improvements * HE (802.11ax) updates * mesh link probing support * enhancements of multi-BSSID support (also related to HE) * OWE userspace processing support ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26SUNRPC: Cache cred of process creating the rpc_clientTrond Myklebust2-2/+14
When converting kuids to AUTH_UNIX creds, etc we will want to use the same user namespace as the process that created the rpc client. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26tipc: remove rcu_read_unlock() left in tipc_udp_recv()Eric Dumazet1-1/+0
I forgot to remove one rcu_read_unlock() before a return statement. Joy of mixing goto and return styles in a function :) Fixes: 4109a2c3b91e ("tipc: tipc_udp_recv() cleanup vs rcu verbs") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26genetlink: use idr_alloc_cyclic for family->id assignmentMarcel Holtmann1-2/+2
When allocating the next family->id it makes more sense to use idr_alloc_cyclic to avoid re-using a previously used family->id as much as possible. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26ipv6: Initialize fib6_result in bpf_ipv6_fib_lookupDavid Ahern1-1/+1
fib6_result is not initialized in bpf_ipv6_fib_lookup and potentially passses garbage to the fib lookup which triggers a KASAN warning: [ 262.055450] ================================================================== [ 262.057640] BUG: KASAN: user-memory-access in fib6_rule_suppress+0x4b/0xce [ 262.059488] Read of size 8 at addr 00000a20000000b0 by task swapper/1/0 [ 262.061238] [ 262.061673] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.1.0-rc5+ #56 [ 262.063493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014 [ 262.065593] Call Trace: [ 262.066277] <IRQ> [ 262.066848] dump_stack+0x7e/0xbb [ 262.067764] kasan_report+0x18b/0x1b5 [ 262.069921] __asan_load8+0x7f/0x81 [ 262.070879] fib6_rule_suppress+0x4b/0xce [ 262.071980] fib_rules_lookup+0x275/0x2cd [ 262.073090] fib6_lookup+0x119/0x218 [ 262.076457] bpf_ipv6_fib_lookup+0x39d/0x664 ... Initialize fib6_result to 0. Fixes: b1d40991506aa ("ipv6: Rename fib6_multipath_select and pass fib6_result") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26net: socket: Fix missing break in switch statementGustavo A. R. Silva1-0/+1
Add missing break statement in order to prevent the code from falling through to cases SIOCGSTAMP_NEW and SIOCGSTAMPNS_NEW. This bug was found thanks to the ongoing efforts to enable -Wimplicit-fallthrough. Fixes: 0768e17073dc ("net: socket: implement 64-bit timestamps") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26l2tp: use rcu_dereference_sk_user_data() in l2tp_udp_encap_recv()Eric Dumazet1-1/+1
Canonical way to fetch sk_user_data from an encap_rcv() handler called from UDP stack in rcu protected section is to use rcu_dereference_sk_user_data(), otherwise compiler might read it multiple times. Fixes: d00fa9adc528 ("il2tp: fix races with tunnel socket close") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>