summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2020-05-08netpoll: remove dev argument from netpoll_send_skb_on_dev()Eric Dumazet1-4/+6
netpoll_send_skb_on_dev() can get the device pointer directly from np->dev Rename it to __netpoll_send_skb() Following patch will move netpoll_send_skb() out-of-line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller4-10/+20
Conflicts were all overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06bpf, sockmap: msg_pop_data can incorrecty set an sge lengthJohn Fastabend1-1/+1
When sk_msg_pop() is called where the pop operation is working on the end of a sge element and there is no additional trailing data and there _is_ data in front of pop, like the following case, |____________a_____________|__pop__| We have out of order operations where we incorrectly set the pop variable so that instead of zero'ing pop we incorrectly leave it untouched, effectively. This can cause later logic to shift the buffers around believing it should pop extra space. The result is we have 'popped' more data then we expected potentially breaking program logic. It took us a while to hit this case because typically we pop headers which seem to rarely be at the end of a scatterlist elements but we can't rely on this. Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/158861288359.14306.7654891716919968144.stgit@john-Precision-5820-Tower
2020-05-05neigh: send protocol value in neighbor create notificationRoman Mashak1-3/+3
When a new neighbor entry has been added, event is generated but it does not include protocol, because its value is assigned after the event notification routine has run, so move protocol assignment code earlier. Fixes: df9b0e30d44c ("neighbor: Add protocol attribute") Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net: partially revert dynamic lockdep key changesCong Wang1-19/+71
This patch reverts the folowing commits: commit 064ff66e2bef84f1153087612032b5b9eab005bd "bonding: add missing netdev_update_lockdep_key()" commit 53d374979ef147ab51f5d632dfe20b14aebeccd0 "net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()" commit 1f26c0d3d24125992ab0026b0dab16c08df947c7 "net: fix kernel-doc warning in <linux/netdevice.h>" commit ab92d68fc22f9afab480153bd82a20f6e2533769 "net: core: add generic lockdep keys" but keeps the addr_list_lock_key because we still lock addr_list_lock nestedly on stack devices, unlikely xmit_lock this is safe because we don't take addr_list_lock on any fast path. Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04devlink: let kernel allocate region snapshot idJakub Kicinski1-13/+44
Currently users have to choose a free snapshot id before calling DEVLINK_CMD_REGION_NEW. This is potentially racy and inconvenient. Make the DEVLINK_ATTR_REGION_SNAPSHOT_ID optional and try to allocate id automatically. Send a message back to the caller with the snapshot info. Example use: $ devlink region new netdevsim/netdevsim1/dummy netdevsim/netdevsim1/dummy: snapshot 1 $ id=$(devlink -j region new netdevsim/netdevsim1/dummy | \ jq '.[][][][]') $ devlink region dump netdevsim/netdevsim1/dummy snapshot $id [...] $ devlink region del netdevsim/netdevsim1/dummy snapshot $id v4: - inline the notification code v3: - send the notification only once snapshot creation completed. v2: - don't wrap the line containing extack; - add a few sentences to the docs. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04devlink: factor out building a snapshot notificationJakub Kicinski1-11/+28
We'll need to send snapshot info back on the socket which requested a snapshot to be created. Factor out constructing a snapshot description from the broadcast notification code. v3: new patch Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04devlink: Fix reporter's recovery conditionAya Levin1-2/+5
Devlink health core conditions the reporter's recovery with the expiration of the grace period. This is not relevant for the first recovery. Explicitly demand that the grace period will only apply to recoveries other than the first. Fixes: c8e1da0bf923 ("devlink: Add health report functionality") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller4-133/+140
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-05-01 (v2) The following pull-request contains BPF updates for your *net-next* tree. We've added 61 non-merge commits during the last 6 day(s) which contain a total of 153 files changed, 6739 insertions(+), 3367 deletions(-). The main changes are: 1) pulled work.sysctl from vfs tree with sysctl bpf changes. 2) bpf_link observability, from Andrii. 3) BTF-defined map in map, from Andrii. 4) asan fixes for selftests, from Andrii. 5) Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASH, from Jakub. 6) production cloudflare classifier as a selftes, from Lorenz. 7) bpf_ktime_get_*_ns() helper improvements, from Maciej. 8) unprivileged bpftool feature probe, from Quentin. 9) BPF_ENABLE_STATS command, from Song. 10) enable bpf_[gs]etsockopt() helpers for sock_ops progs, from Stanislav. 11) enable a bunch of common helpers for cg-device, sysctl, sockopt progs, from Stanislav. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-02drop_monitor: work around gcc-10 stringop-overflow warningArnd Bergmann1-4/+7
The current gcc-10 snapshot produces a false-positive warning: net/core/drop_monitor.c: In function 'trace_drop_common.constprop': cc1: error: writing 8 bytes into a region of size 0 [-Werror=stringop-overflow=] In file included from net/core/drop_monitor.c:23: include/uapi/linux/net_dropmon.h:36:8: note: at offset 0 to object 'entries' with size 4 declared here 36 | __u32 entries; | ^~~~~~~ I reported this in the gcc bugzilla, but in case it does not get fixed in the release, work around it by using a temporary variable. Fixes: 9a8afc8d3962 ("Network Drop Monitor: Adding drop monitor implementation & Netlink protocol") Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94881 Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-02devlink: fix return value after hitting end in region readJakub Kicinski1-0/+5
Commit d5b90e99e1d5 ("devlink: report 0 after hitting end in region read") fixed region dump, but region read still returns a spurious error: $ devlink region read netdevsim/netdevsim1/dummy snapshot 0 addr 0 len 128 0000000000000000 a6 f4 c4 1c 21 35 95 a6 9d 34 c3 5b 87 5b 35 79 0000000000000010 f3 a0 d7 ee 4f 2f 82 7f c6 dd c4 f6 a5 c3 1b ae 0000000000000020 a4 fd c8 62 07 59 48 03 70 3b c7 09 86 88 7f 68 0000000000000030 6f 45 5d 6d 7d 0e 16 38 a9 d0 7a 4b 1e 1e 2e a6 0000000000000040 e6 1d ae 06 d6 18 00 85 ca 62 e8 7e 11 7e f6 0f 0000000000000050 79 7e f7 0f f3 94 68 bd e6 40 22 85 b6 be 6f b1 0000000000000060 af db ef 5e 34 f0 98 4b 62 9a e3 1b 8b 93 fc 17 devlink answers: Invalid argument 0000000000000070 61 e8 11 11 66 10 a5 f7 b1 ea 8d 40 60 53 ed 12 This is a minimal fix, I'll follow up with a restructuring so we don't have two checks for the same condition. Fixes: fdd41ec21e15 ("devlink: Return right error code in case of errors for region read") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-02net: fix skb_panic to output real addressJesper Dangaard Brouer1-1/+1
In skb_panic() the real pointer values are really needed to diagnose issues, e.g. data and head are related (to calculate headroom). The hashed versions of the addresses doesn't make much sense here. The patch use the printk specifier %px to print the actual address. The printk documentation on %px: https://www.kernel.org/doc/html/latest/core-api/printk-formats.html#unmodified-addresses Fixes: ad67b74d2469 ("printk: hash addresses printed with %p") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01bpf: Bpf_{g,s}etsockopt for struct bpf_sock_addrStanislav Fomichev1-23/+95
Currently, bpf_getsockopt and bpf_setsockopt helpers operate on the 'struct bpf_sock_ops' context in BPF_PROG_TYPE_SOCK_OPS program. Let's generalize them and make them available for 'struct bpf_sock_addr'. That way, in the future, we can allow those helpers in more places. As an example, let's expose those 'struct bpf_sock_addr' based helpers to BPF_CGROUP_INET{4,6}_CONNECT hooks. That way we can override CC before the connection is made. v3: * Expose custom helpers for bpf_sock_addr context instead of doing generic bpf_sock argument (as suggested by Daniel). Even with try_socket_lock that doesn't sleep we have a problem where context sk is already locked and socket lock is non-nestable. v2: * s/BPF_PROG_TYPE_CGROUP_SOCKOPT/BPF_PROG_TYPE_SOCK_OPS/ Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200430233152.199403-1-sdf@google.com
2020-05-01net/core: Introduce netdev_get_xmit_slaveMaor Gottlieb1-0/+22
Add new ndo to get the xmit slave of master device. The reference counters are not incremented so the caller must be careful with locks. User can ask to get the xmit slave assume all the slaves can transmit by set all_slaves arg to true. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-04-30docs: networking: convert pktgen.txt to ReSTMauro Carvalho Chehab1-1/+1
- add SPDX header; - adjust title markup; - use bold markups on a few places; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30bpf: Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASHJakub Sitnicki2-2/+20
White-list map lookup for SOCKMAP/SOCKHASH from BPF. Lookup returns a pointer to a full socket and acquires a reference if necessary. To support it we need to extend the verifier to know that: (1) register storing the lookup result holds a pointer to socket, if lookup was done on SOCKMAP/SOCKHASH, and that (2) map lookup on SOCKMAP/SOCKHASH is a reference acquiring operation, which needs a corresponding reference release with bpf_sk_release. On sock_map side, lookup handlers exposed via bpf_map_ops now bump sk_refcnt if socket is reference counted. In turn, bpf_sk_select_reuseport, the only in-kernel user of SOCKMAP/SOCKHASH ops->map_lookup_elem, was updated to release the reference. Sockets fetched from a map can be used in the same way as ones returned by BPF socket lookup helpers, such as bpf_sk_lookup_tcp. In particular, they can be used with bpf_sk_assign to direct packets toward a socket on TC ingress path. Suggested-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200429181154.479310-2-jakub@cloudflare.com
2020-04-29netpoll: Fix use correct return type for ndo_start_xmit()Yunjian Wang1-4/+5
The method ndo_start_xmit() returns a value of type netdev_tx_t. Fix the ndo function to use the correct type. Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29docs: networking: convert gen_stats.txt to ReSTMauro Carvalho Chehab1-1/+1
- add SPDX header; - mark code blocks and literals as such; - mark tables as such; - mark lists as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28Merge branch 'work.sysctl' of ↵Daniel Borkmann2-32/+23
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler methods to take kernel pointers instead. It gets rid of the set_fs address space overrides used by BPF. As per discussion, pull in the feature branch into bpf-next as it relates to BPF sysctl progs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
2020-04-27net: rtnetlink: remove redundant assignment to variable errColin Ian King1-1/+1
The variable err is being initializeed with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27sysctl: pass kernel pointers to ->proc_handlerChristoph Hellwig2-32/+23
Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-26net: bpf: Allow TC programs to call BPF_FUNC_skb_change_headLorenzo Colitti1-0/+2
This allows TC eBPF programs to modify and forward (redirect) packets from interfaces without ethernet headers (for example cellular) to interfaces with (for example ethernet/wifi). The lack of this appears to simply be an oversight. Tested: in active use in Android R on 4.14+ devices for ipv6 cellular to wifi tethering offload. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-04-26bpf: Fix missing bpf_base_func_proto in cgroup_base_func_proto for CGROUP_NET=nStanislav Fomichev1-77/+1
linux-next build bot reported compile issue [1] with one of its configs. It looks like when we have CONFIG_NET=n and CONFIG_BPF{,_SYSCALL}=y, we are missing the bpf_base_func_proto definition (from net/core/filter.c) in cgroup_base_func_proto. I'm reshuffling the code a bit to make it work. The common helpers are moved into kernel/bpf/helpers.c and the bpf_base_func_proto is exported from there. Also, bpf_get_raw_cpu_id goes into kernel/bpf/core.c akin to existing bpf_user_rnd_u32. [1] https://lore.kernel.org/linux-next/CAKH8qBsBvKHswiX1nx40LgO+BGeTmb1NX8tiTttt_0uu6T3dCA@mail.gmail.com/T/#mff8b0c083314c68c2e2ef0211cb11bc20dc13c72 Fixes: 0456ea170cd6 ("bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200424235941.58382-1-sdf@google.com
2020-04-26bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}Stanislav Fomichev1-1/+1
Currently the following prog types don't fall back to bpf_base_func_proto() (instead they have cgroup_base_func_proto which has a limited set of helpers from bpf_base_func_proto): * BPF_PROG_TYPE_CGROUP_DEVICE * BPF_PROG_TYPE_CGROUP_SYSCTL * BPF_PROG_TYPE_CGROUP_SOCKOPT I don't see any specific reason why we shouldn't use bpf_base_func_proto(), every other type of program (except bpf-lirc and, understandably, tracing) use it, so let's fall back to bpf_base_func_proto for those prog types as well. This basically boils down to adding access to the following helpers: * BPF_FUNC_get_prandom_u32 * BPF_FUNC_get_smp_processor_id * BPF_FUNC_get_numa_node_id * BPF_FUNC_tail_call * BPF_FUNC_ktime_get_ns * BPF_FUNC_spin_lock (CAP_SYS_ADMIN) * BPF_FUNC_spin_unlock (CAP_SYS_ADMIN) * BPF_FUNC_jiffies64 (CAP_SYS_ADMIN) I've also added bpf_perf_event_output() because it's really handy for logging and debugging. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200420174610.77494-1-sdf@google.com
2020-04-26net: remove obsolete commentEric Dumazet1-1/+0
Commit b656722906ef ("net: Increase the size of skb_frag_t") removed the 16bit limitation of a frag on some 32bit arches. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-3/+1
Simple overlapping changes to linux/vermagic.h Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23net: napi: use READ_ONCE()/WRITE_ONCE()Eric Dumazet2-5/+5
gro_flush_timeout and napi_defer_hard_irqs can be read from napi_complete_done() while other cpus write the value, whithout explicit synchronization. Use READ_ONCE()/WRITE_ONCE() to annotate the races. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23net: napi: add hard irqs deferral featureEric Dumazet2-11/+36
Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer") we added the ability to arm one high resolution timer, that we used to keep not-complete packets in GRO engine a bit longer, hoping that further frames might be added to them. Since then, we added the napi_complete_done() interface, and commit 364b6055738b ("net: busy-poll: return busypolling status to drivers") allowed drivers to avoid re-arming NIC interrupts if we made a promise that their NAPI poll() handler would be called in the near future. This infrastructure can be leveraged, thanks to a new device parameter, which allows to arm the napi hrtimer, instead of re-arming the device hard IRQ. We have noticed that on some servers with 32 RX queues or more, the chit-chat between the NIC and the host caused by IRQ delivery and re-arming could hurt throughput by ~20% on 100Gbit NIC. In contrast, hrtimers are using local (percpu) resources and might have lower cost. The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy than gro_flush_timeout (/sys/class/net/ethX/) By default, both gro_flush_timeout and napi_defer_hard_irqs are zero. This patch does not change the prior behavior of gro_flush_timeout if used alone : NIC hard irqs should be rearmed as before. One concrete usage can be : echo 20000 >/sys/class/net/eth1/gro_flush_timeout echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs If at least one packet is retired, then we will reset napi counter to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans of the queue. On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ avoidance was only possible if napi->poll() was exhausting its budget and not call napi_complete_done(). This feature also can be used to work around some non-optimal NIC irq coalescing strategies. Having the ability to insert XX usec delays between each napi->poll() can increase cache efficiency, since we increase batch sizes. It also keeps serving cpus not idle too long, reducing tail latencies. Co-developed-by: Luigi Rizzo <lrizzo@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22cgroup, netclassid: remove double cond_reschedJiri Slaby1-3/+1
Commit 018d26fcd12a ("cgroup, netclassid: periodically release file_lock on classid") added a second cond_resched to write_classid indirectly by update_classid_task. Remove the one in write_classid. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20net: Add testing sysfs attributeAndrew Lunn1-1/+14
Similar to speed, duplex and dorment, report the testing status in sysfs. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20net: Add IF_OPER_TESTINGAndrew Lunn3-3/+23
RFC 2863 defines the operational state testing. Add support for this state, both as a IF_LINK_MODE_ and __LINK_STATE_. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-14xdp: Reset prog in dev_change_xdp_fd when fd is negativeDavid Ahern1-1/+2
The commit mentioned in the Fixes tag reuses the local prog variable when looking up an expected_fd. The variable is not reset when fd < 0 causing a detach with the expected_fd set to actually call dev_xdp_install for the existing program. The end result is that the detach does not happen. Fixes: 92234c8f15c8 ("xdp: Support specifying expected existing program when attaching XDP") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20200412133204.43847-1-dsahern@kernel.org
2020-04-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2-2/+2
Daniel Borkmann says: ==================== pull-request: bpf 2020-04-10 The following pull-request contains BPF updates for your *net* tree. We've added 13 non-merge commits during the last 7 day(s) which contain a total of 13 files changed, 137 insertions(+), 43 deletions(-). The main changes are: 1) JIT code emission fixes for riscv and arm32, from Luke Nelson and Xi Wang. 2) Disable vmlinux BTF info if GCC_PLUGIN_RANDSTRUCT is used, from Slava Bacherikov. 3) Fix oob write in AF_XDP when meta data is used, from Li RongQing. 4) Fix bpf_get_link_xdp_id() handling on single prog when flags are specified, from Andrey Ignatov. 5) Fix sk_assign() BPF helper for request sockets that can have sk_reuseport field uninitialized, from Joe Stringer. 6) Fix mprotect() test case for the BPF LSM, from KP Singh. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09net-sysfs: remove redundant assignment to variable retColin Ian King1-1/+1
The variable ret is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09bpf: Fix use of sk->sk_reuseport from sk_assignJoe Stringer1-1/+1
In testing, we found that for request sockets the sk->sk_reuseport field may yet be uninitialized, which caused bpf_sk_assign() to randomly succeed or return -ESOCKTNOSUPPORT when handling the forward ACK in a three-way handshake. Fix it by only applying the reuseport check for full sockets. Fixes: cf7fbe660f2d ("bpf: Add socket assign support") Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200408033540.10339-1-joe@wand.net.nz
2020-04-08net: revert default NAPI poll timeout to 2 jiffiesKonstantin Khlebnikov1-1/+2
For HZ < 1000 timeout 2000us rounds up to 1 jiffy but expires randomly because next timer interrupt could come shortly after starting softirq. For commonly used CONFIG_HZ=1000 nothing changes. Fixes: 7acf8a1e8a28 ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning") Reported-by: Dmitry Yakunin <zeil@yandex-team.ru> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-03neigh: support smaller retrans_time setttingHangbin Liu1-4/+6
Currently, we limited the retrans_time to be greater than HZ/2. i.e. setting retrans_time less than 500ms will not work. This makes the user unable to achieve a more accurate control for bonding arp fast failover. Update the sanity check to HZ/100, which is 10ms, to let users have more ability on the retrans_time control. v3: sync the behavior with IPv6 and update all the timer handler v2: use HZ instead of hard code number Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-03net: core: enable SO_BINDTODEVICE for non-root usersVincent Bernat1-1/+1
Currently, SO_BINDTODEVICE requires CAP_NET_RAW. This change allows a non-root user to bind a socket to an interface if it is not already bound. This is useful to allow an application to bind itself to a specific VRF for outgoing or incoming connections. Currently, an application wanting to manage connections through several VRF need to be privileged. Previously, IP_UNICAST_IF and IPV6_UNICAST_IF were added for Wine (76e21053b5bf3 and c4062dfc425e9) specifically for use by non-root processes. However, they are restricted to sendmsg() and not usable with TCP. Allowing SO_BINDTODEVICE would allow TCP clients to get the same privilege. As for TCP servers, outside the VRF use case, SO_BINDTODEVICE would only further restrict connections a server could accept. When an application is restricted to a VRF (with `ip vrf exec`), the socket is bound to an interface at creation and therefore, a non-privileged call to SO_BINDTODEVICE to escape the VRF fails. When an application bound a socket to SO_BINDTODEVICE and transmit it to a non-privileged process through a Unix socket, a tentative to change the bound device also fails. Before: >>> import socket >>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM) >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0") Traceback (most recent call last): File "<stdin>", line 1, in <module> PermissionError: [Errno 1] Operation not permitted After: >>> import socket >>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM) >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0") >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0") Traceback (most recent call last): File "<stdin>", line 1, in <module> PermissionError: [Errno 1] Operation not permitted Signed-off-by: Vincent Bernat <vincent@bernat.ch> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-03net, sk_msg: Don't use RCU_INIT_POINTER on sk_user_dataJakub Sitnicki1-1/+1
sparse reports an error due to use of RCU_INIT_POINTER helper to assign to sk_user_data pointer, which is not tagged with __rcu: net/core/sock.c:1875:25: error: incompatible types in comparison expression (different address spaces): net/core/sock.c:1875:25: void [noderef] <asn:4> * net/core/sock.c:1875:25: void * ... and rightfully so. sk_user_data is not always treated as a pointer to an RCU-protected data. When it is used to point at an RCU-protected object, we access it with __sk_user_data to inform sparse about it. In this case, when the child socket does not inherit sk_user_data from the parent, there is no reason to treat it as an RCU-protected pointer. Use a regular assignment to clear the pointer value. Fixes: f1ff5ce2cd5e ("net, sk_msg: Clear sk_user_data pointer on clone if tagged") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200402125524.851439-1-jakub@cloudflare.com
2020-03-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-0/+1
2020-03-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller5-12/+196
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller1-3/+3
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next: 1) Add support to specify a stateful expression in set definitions, this allows users to specify e.g. counters per set elements. 2) Flowtable software counter support. 3) Flowtable hardware offload counter support, from wenxu. 3) Parallelize flowtable hardware offload requests, from Paul Blakey. This includes a patch to add one work entry per offload command. 4) Several patches to rework nf_queue refcount handling, from Florian Westphal. 4) A few fixes for the flowtable tunnel offload: Fix crash if tunneling information is missing and set up indirect flow block as TC_SETUP_FT, patch from wenxu. 5) Stricter netlink attribute sanity check on filters, from Romain Bellan and Florent Fourcot. 5) Annotations to make sparse happy, from Jules Irenge. 6) Improve icmp errors in debugging information, from Haishuang Yan. 7) Fix warning in IPVS icmp error debugging, from Haishuang Yan. 8) Fix endianess issue in tcp extension header, from Sergey Marinkevich. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-31devlink: Allow setting of packet trap group parametersIdo Schimmel1-2/+54
The previous patch allowed device drivers to publish their default binding between packet trap policers and packet trap groups. However, some users might not be content with this binding and would like to change it. In case user space passed a packet trap policer identifier when setting a packet trap group, invoke the appropriate device driver callback and pass the new policer identifier. v2: * Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in devlink_trap_group_set() and bail if not present * Add extack error message in case trap group was partially modified Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-31devlink: Add packet trap group parameters supportIdo Schimmel1-0/+31
Packet trap groups are used to aggregate logically related packet traps. Currently, these groups allow user space to batch operations such as setting the trap action of all member traps. In order to prevent the CPU from being overwhelmed by too many trapped packets, it is desirable to bind a packet trap policer to these groups. For example, to limit all the packets that encountered an exception during routing to 10Kpps. Allow device drivers to bind default packet trap policers to packet trap groups when the latter are registered with devlink. The next patch will enable user space to change this default binding. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-31devlink: Add packet trap policers supportIdo Schimmel1-0/+428
Devices capable of offloading the kernel's datapath and perform functions such as bridging and routing must also be able to send (trap) specific packets to the kernel (i.e., the CPU) for processing. For example, a device acting as a multicast-aware bridge must be able to trap IGMP membership reports to the kernel for processing by the bridge module. In most cases, the underlying device is capable of handling packet rates that are several orders of magnitude higher compared to those that can be handled by the CPU. Therefore, in order to prevent the underlying device from overwhelming the CPU, devices usually include packet trap policers that are able to police the trapped packets to rates that can be handled by the CPU. This patch allows capable device drivers to register their supported packet trap policers with devlink. User space can then tune the parameters of these policer (currently, rate and burst size) and read from the device the number of packets that were dropped by the policer, if supported. Subsequent patches in the series will allow device drivers to create default binding between these policers and packet trap groups and allow user space to change the binding. v2: * Add 'strict_start_type' in devlink policy * Have device drivers provide max/min rate/burst size for each policer. Use them to check validity of user provided parameters Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30bpf: Don't refcount LISTEN sockets in sk_assign()Joe Stringer2-4/+5
Avoid taking a reference on listen sockets by checking the socket type in the sk_assign and in the corresponding skb_steal_sock() code in the the transport layer, and by ensuring that the prefetch free (sock_pfree) function uses the same logic to check whether the socket is refcounted. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz
2020-03-30bpf: Add socket assign supportJoe Stringer2-0/+42
Add support for TPROXY via a new bpf helper, bpf_sk_assign(). This helper requires the BPF program to discover the socket via a call to bpf_sk*_lookup_*(), then pass this socket to the new helper. The helper takes its own reference to the socket in addition to any existing reference that may or may not currently be obtained for the duration of BPF processing. For the destination socket to receive the traffic, the traffic must be routed towards that socket via local route. The simplest example route is below, but in practice you may want to route traffic more narrowly (eg by CIDR): $ ip route add local default dev lo This patch avoids trying to introduce an extra bit into the skb->sk, as that would require more invasive changes to all code interacting with the socket to ensure that the bit is handled correctly, such as all error-handling cases along the path from the helper in BPF through to the orphan path in the input. Instead, we opt to use the destructor variable to switch on the prefetch of the socket. Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
2020-03-30devlink: Add auto dump flag to health reporterEran Ben Elisha1-4/+22
On low memory system, run time dumps can consume too much memory. Add administrator ability to disable auto dumps per reporter as part of the error flow handle routine. This attribute is not relevant while executing DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET. By default, auto dump is activated for any reporter that has a dump method, as part of the reporter registration to devlink. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30devlink: Implicitly set auto recover flag when registering health reporterEran Ben Elisha1-6/+3
When health reporter is registered to devlink, devlink will implicitly set auto recover if and only if the reporter has a recover method. No reason to explicitly get the auto recover flag from the driver. Remove this flag from all drivers that called devlink_health_reporter_create. All existing health reporters set auto recovery to true if they have a recover method. Yet, administrator can unset auto recover via netlink command as prior to this patch. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net: devlink: use NL_SET_ERR_MSG_MOD instead of NL_SET_ERR_MSGJiri Pirko1-2/+2
The rest of the devlink code sets the extack message using NL_SET_ERR_MSG_MOD. Change the existing appearances of NL_SET_ERR_MSG to NL_SET_ERR_MSG_MOD. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>