summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2021-12-07net: dst: add net device refcount tracking to dst_entryEric Dumazet1-4/+4
We want to track all dev_hold()/dev_put() to ease leak hunting. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07drop_monitor: add net device refcount trackerEric Dumazet1-3/+3
We want to track all dev_hold()/dev_put() to ease leak hunting. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07net: add net device refcount tracker to dev_ifsioc()Eric Dumazet1-2/+3
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07net: add net device refcount tracker to struct netdev_queueEric Dumazet1-2/+2
This will help debugging pesky netdev reference leaks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07net: add net device refcount tracker to struct netdev_rx_queueEric Dumazet1-2/+2
This helps debugging net device refcount leaks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07net: add net device refcount tracker infrastructureEric Dumazet1-0/+3
net device are refcounted. Over the years we had numerous bugs caused by imbalanced dev_hold() and dev_put() calls. The general idea is to be able to precisely pair each decrement with a corresponding prior increment. Both share a cookie, basically a pointer to private data storing stack traces. This patch adds dev_hold_track() and dev_put_track(). To use these helpers, each data structure owning a refcount should also use a "netdevice_tracker" to pair the hold and put. netdevice_tracker dev_tracker; ... dev_hold_track(dev, &dev_tracker, GFP_ATOMIC); ... dev_put_track(dev, &dev_tracker); Whenever a leak happens, we will get precise stack traces of the point dev_hold_track() happened, at device dismantle phase. We will also get a stack trace if too many dev_put_track() for the same netdevice_tracker are attempted. This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-2/+24
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-02net: annotate data-races on txq->xmit_lock_ownerEric Dumazet1-1/+4
syzbot found that __dev_queue_xmit() is reading txq->xmit_lock_owner without annotations. No serious issue there, let's document what is happening there. BUG: KCSAN: data-race in __dev_queue_xmit / __dev_queue_xmit write to 0xffff888139d09484 of 4 bytes by interrupt on cpu 0: __netif_tx_unlock include/linux/netdevice.h:4437 [inline] __dev_queue_xmit+0x948/0xf70 net/core/dev.c:4229 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_hh_output include/net/neighbour.h:511 [inline] neigh_output include/net/neighbour.h:525 [inline] ip6_finish_output2+0x995/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x3e/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 read to 0xffff888139d09484 of 4 bytes by interrupt on cpu 1: __dev_queue_xmit+0x5e3/0xf70 net/core/dev.c:4213 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_resolve_output+0x3db/0x410 net/core/neighbour.c:1523 neigh_output include/net/neighbour.h:527 [inline] ip6_finish_output2+0x9be/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x8d/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 kcsan_setup_watchpoint+0x94/0x420 kernel/kcsan/core.c:443 folio_test_anon include/linux/page-flags.h:581 [inline] PageAnon include/linux/page-flags.h:586 [inline] zap_pte_range+0x5ac/0x10e0 mm/memory.c:1347 zap_pmd_range mm/memory.c:1467 [inline] zap_pud_range mm/memory.c:1496 [inline] zap_p4d_range mm/memory.c:1517 [inline] unmap_page_range+0x2dc/0x3d0 mm/memory.c:1538 unmap_single_vma+0x157/0x210 mm/memory.c:1583 unmap_vmas+0xd0/0x180 mm/memory.c:1615 exit_mmap+0x23d/0x470 mm/mmap.c:3170 __mmput+0x27/0x1b0 kernel/fork.c:1113 mmput+0x3d/0x50 kernel/fork.c:1134 exit_mm+0xdb/0x170 kernel/exit.c:507 do_exit+0x608/0x17a0 kernel/exit.c:819 do_group_exit+0xce/0x180 kernel/exit.c:929 get_signal+0xfc3/0x1550 kernel/signal.c:2852 arch_do_signal_or_restart+0x8c/0x2e0 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300 do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0xffffffff Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 28712 Comm: syz-executor.0 Tainted: G W 5.16.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211130170155.2331929-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-30bpf, docs: Prune all references to "internal BPF"Christoph Hellwig1-6/+5
The eBPF name has completely taken over from eBPF in general usage for the actual eBPF representation, or BPF for any general in-kernel use. Prune all remaining references to "internal BPF". Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211119163215.971383-4-hch@lst.de
2021-11-30devlink: Simplify devlink resources unregister callLeon Romanovsky1-15/+48
The devlink_resources_unregister() used second parameter as an entry point for the recursive removal of devlink resources. None of the callers outside of devlink core needed to use this field, so let's remove it. As part of this removal, the "struct devlink_resource" was moved from .h to .c file as it is not possible to use in any place in the code except devlink.c. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-30wireguard: device: reset peer src endpoint when netns exitsJason A. Donenfeld1-0/+19
Each peer's endpoint contains a dst_cache entry that takes a reference to another netdev. When the containing namespace exits, we take down the socket and prevent future sockets from being created (by setting creating_net to NULL), which removes that potential reference on the netns. However, it doesn't release references to the netns that a netdev cached in dst_cache might be taking, so the netns still might fail to exit. Since the socket is gimped anyway, we can simply clear all the dst_caches (by way of clearing the endpoint src), which will release all references. However, the current dst_cache_reset function only releases those references lazily. But it turns out that all of our usages of wg_socket_clear_peer_endpoint_src are called from contexts that are not exactly high-speed or bottle-necked. For example, when there's connection difficulty, or when userspace is reconfiguring the interface. And in particular for this patch, when the netns is exiting. So for those cases, it makes more sense to call dst_release immediately. For that, we add a small helper function to dst_cache. This patch also adds a test to netns.sh from Hangbin Liu to ensure this doesn't regress. Tested-by: Hangbin Liu <liuhangbin@gmail.com> Reported-by: Xiumei Mu <xmu@redhat.com> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Cc: Paolo Abeni <pabeni@redhat.com> Fixes: 900575aa33a3 ("wireguard: device: avoid circular netns references") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-29ipv6: fix memory leak in fib6_rule_suppressmsizanoen11-1/+1
The kernel leaks memory when a `fib` rule is present in IPv6 nftables firewall rules and a suppress_prefix rule is present in the IPv6 routing rules (used by certain tools such as wg-quick). In such scenarios, every incoming packet will leak an allocation in `ip6_dst_cache` slab cache. After some hours of `bpftrace`-ing and source code reading, I tracked down the issue to ca7a03c41753 ("ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule"). The problem with that change is that the generic `args->flags` always have `FIB_LOOKUP_NOREF` set[1][2] but the IPv6-specific flag `RT6_LOOKUP_F_DST_NOREF` might not be, leading to `fib6_rule_suppress` not decreasing the refcount when needed. How to reproduce: - Add the following nftables rule to a prerouting chain: meta nfproto ipv6 fib saddr . mark . iif oif missing drop This can be done with: sudo nft create table inet test sudo nft create chain inet test test_chain '{ type filter hook prerouting priority filter + 10; policy accept; }' sudo nft add rule inet test test_chain meta nfproto ipv6 fib saddr . mark . iif oif missing drop - Run: sudo ip -6 rule add table main suppress_prefixlength 0 - Watch `sudo slabtop -o | grep ip6_dst_cache` to see memory usage increase with every incoming ipv6 packet. This patch exposes the protocol-specific flags to the protocol specific `suppress` function, and check the protocol-specific `flags` argument for RT6_LOOKUP_F_DST_NOREF instead of the generic FIB_LOOKUP_NOREF when decreasing the refcount, like this. [1]: https://github.com/torvalds/linux/blob/ca7a03c4175366a92cee0ccc4fec0038c3266e26/net/ipv6/fib6_rules.c#L71 [2]: https://github.com/torvalds/linux/blob/ca7a03c4175366a92cee0ccc4fec0038c3266e26/net/ipv6/fib6_rules.c#L99 Link: https://bugzilla.kernel.org/show_bug.cgi?id=215105 Fixes: ca7a03c41753 ("ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule") Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-29devlink: Remove misleading internal_flags from health reporter dumpLeon Romanovsky1-2/+0
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET command doesn't have .doit callback and has no use in internal_flags at all. Remove this misleading assignment. Fixes: e44ef4e4516c ("devlink: Hang reporter's dump method on a dumpit cb") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-29net: Write lock dev_base_lock without disabling bottom halves.Sebastian Andrzej Siewior3-14/+14
The writer acquires dev_base_lock with disabled bottom halves. The reader can acquire dev_base_lock without disabling bottom halves because there is no writer in softirq context. On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts as a lock to ensure that resources, that are protected by disabling bottom halves, remain protected. This leads to a circular locking dependency if the lock acquired with disabled bottom halves (as in write_lock_bh()) and somewhere else with enabled bottom halves (as by read_lock() in netstat_show()) followed by disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout() -> spin_lock_bh()). This is the reverse locking order. All read_lock() invocation are from sysfs callback which are not invoked from softirq context. Therefore there is no need to disable bottom halves while acquiring a write lock. Acquire the write lock of dev_base_lock without disabling bottom halves. Reported-by: Pei Zhang <pezhang@redhat.com> Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+1
drivers/net/ipa/ipa_main.c 8afc7e471ad3 ("net: ipa: separate disabling setup from modem stop") 76b5fbcd6b47 ("net: ipa: kill ipa_modem_init()") Duplicated include, drop one. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-25net: allow SO_MARK with CAP_NET_RAWMaciej Żenczykowski1-1/+2
A CAP_NET_RAW capable process can already spoof (on transmit) anything it desires via raw packet sockets... There is no good reason to not allow it to also be able to play routing tricks on packets from its own normal sockets. There is a desire to be able to use SO_MARK for routing table selection (via ip rule fwmark) from within a user process without having to run it as root. Granting it CAP_NET_RAW is much less dangerous than CAP_NET_ADMIN (CAP_NET_RAW doesn't permit persistent state change, while CAP_NET_ADMIN does - by for example allowing the reconfiguration of the routing tables and/or bringing up/down devices). Let's keep CAP_NET_ADMIN for persistent state changes, while using CAP_NET_RAW for non-configuration related stuff. Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20211123203715.193413-1-zenczykowski@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-25net: allow CAP_NET_RAW to setsockopt SO_PRIORITYMaciej Żenczykowski1-0/+1
CAP_NET_ADMIN is and should continue to be about configuring the system as a whole, not about configuring per-socket or per-packet parameters. Sending and receiving raw packets is what CAP_NET_RAW is all about. It can already send packets with any VLAN tag, and any IPv4 TOS mark, and any IPv6 TCLASS mark, simply by virtue of building such a raw packet. Not to mention using any protocol and source/ /destination ip address/port tuple. These are the fields that networking gear uses to prioritize packets. Hence, a CAP_NET_RAW process is already capable of affecting traffic prioritization after it hits the wire. This change makes it capable of affecting traffic prioritization even in the host at the nic and before that in the queueing disciplines (provided skb->priority is actually being used for prioritization, and not the TOS/TCLASS field) Hence it makes sense to allow a CAP_NET_RAW process to set the priority of sockets and thus packets it sends. Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20211123203702.193221-1-zenczykowski@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-23net: remove .ndo_change_proto_downJakub Kicinski3-32/+5
.ndo_change_proto_down was added seemingly to enable out-of-tree implementations. Over 2.5yrs later we still have no real users upstream. Hardwire the generic implementation for now, we can revert once real users materialize. (rocker is a test vehicle, not a user.) We need to drop the optimization on the sysfs side, because unlike ndos priv_flags will be changed at runtime, so we'd need READ_ONCE/WRITE_ONCE everywhere.. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-22devlink: Add 'enable_iwarp' generic device paramShiraz Saleem1-0/+5
Add a new device generic parameter to enable and disable iWARP functionality on a multi-protocol RDMA device. Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Tested-by: Leszek Kaliszczuk <leszek.kaliszczuk@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-11-22skbuff: Switch structure bounds to struct_group()Kees Cook1-9/+5
In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Replace the existing empty member position markers "headers_start" and "headers_end" with a struct_group(). This will allow memcpy() and sizeof() to more easily reason about sizes, and improve readability. "pahole" shows no size nor member offset changes to struct sk_buff. "objdump -d" shows no object code changes (outside of WARNs affected by source line number changes). Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> # drivers/net/wireguard/* Link: https://lore.kernel.org/lkml/20210728035006.GD35706@embeddedor Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-22skbuff: Move conditional preprocessor directives out of struct sk_buffKees Cook1-5/+5
In preparation for using the struct_group() macro in struct sk_buff, move the conditional preprocessor directives out of the region of struct sk_buff that will be enclosed by struct_group(). While GCC and Clang are happy with conditional preprocessor directives here, sparse is not, even under -Wno-directive-within-macro[1], as would be seen under a C=1 build: net/core/filter.c: note: in included file (through include/linux/netlink.h, include/linux/sock_diag.h): ./include/linux/skbuff.h:820:1: warning: directive in macro's argument list ./include/linux/skbuff.h:822:1: warning: directive in macro's argument list ./include/linux/skbuff.h:846:1: warning: directive in macro's argument list ./include/linux/skbuff.h:848:1: warning: directive in macro's argument list Additionally remove empty macro argument definitions and usage. "objdump -d" shows no object code differences. [1] https://www.spinics.net/lists/linux-sparse/msg10857.html Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-22net, neigh: Fix crash in v6 module initialization error pathDaniel Borkmann1-0/+1
When IPv6 module gets initialized, but it's hitting an error in inet6_init() where it then needs to undo all the prior initialization work, it also might do a call to ndisc_cleanup() which then calls neigh_table_clear(). In there is a missing timer cancellation of the table's managed_work item. The kernel test robot explicitly triggered this error path and caused a UAF crash similar to the below: [...] [ 28.833183][ C0] BUG: unable to handle page fault for address: f7a43288 [ 28.833973][ C0] #PF: supervisor write access in kernel mode [ 28.834660][ C0] #PF: error_code(0x0002) - not-present page [ 28.835319][ C0] *pde = 06b2c067 *pte = 00000000 [ 28.835853][ C0] Oops: 0002 [#1] PREEMPT [ 28.836367][ C0] CPU: 0 PID: 303 Comm: sed Not tainted 5.16.0-rc1-00233-g83ff5faa0d3b #7 [ 28.837293][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014 [ 28.838338][ C0] EIP: __run_timers.constprop.0+0x82/0x440 [...] [ 28.845607][ C0] Call Trace: [ 28.845942][ C0] <SOFTIRQ> [ 28.846333][ C0] ? check_preemption_disabled.isra.0+0x2a/0x80 [ 28.846975][ C0] ? __this_cpu_preempt_check+0x8/0xa [ 28.847570][ C0] run_timer_softirq+0xd/0x40 [ 28.848050][ C0] __do_softirq+0xf5/0x576 [ 28.848547][ C0] ? __softirqentry_text_start+0x10/0x10 [ 28.849127][ C0] do_softirq_own_stack+0x2b/0x40 [ 28.849749][ C0] </SOFTIRQ> [ 28.850087][ C0] irq_exit_rcu+0x7d/0xc0 [ 28.850587][ C0] common_interrupt+0x2a/0x40 [ 28.851068][ C0] asm_common_interrupt+0x119/0x120 [...] Note that IPv6 module cannot be unloaded as per 8ce440610357 ("ipv6: do not allow ipv6 module to be removed") hence this can only be seen during module initialization error. Tested with kernel test robot's reproducer. Fixes: 7482e3841d52 ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Li Zhijian <zhijianx.li@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-22net-sysfs: Slightly optimize 'xps_queue_show()'Christophe JAILLET1-1/+1
The 'mask' bitmap is local to this function. So the non-atomic '__set_bit()' can be used to save a few cycles. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-22net: annotate accesses to dev->gso_max_segsEric Dumazet3-4/+5
dev->gso_max_segs is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add netif_set_gso_max_segs() helper. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs() where we can to better document what is going on. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-22net: annotate accesses to dev->gso_max_sizeEric Dumazet1-1/+2
dev->gso_max_size is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size() where we can to better document what is going on. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-20net: kunit: add a test for dev_addr_listsJakub Kicinski2-0/+238
Add a KUnit test for the dev_addr API. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-20dev_addr_list: put the first addr on the treeJakub Kicinski1-28/+34
Since all netdev->dev_addr modifications go via dev_addr_mod() we can put it on the list. When address is change remove it and add it back. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-20dev_addr: add a modification checkJakub Kicinski2-0/+20
netdev->dev_addr should only be modified via helpers, but someone may be casting off the const. Add a runtime check to catch abuses. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-20net: unexport dev_addr_init() & dev_addr_flush()Jakub Kicinski1-2/+0
There are no module callers in-tree and it's hard to justify why anyone would init or flush addresses of a netdev (note the flush is more of a destructor, it frees netdev->dev_addr). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-20net: constify netdev->dev_addrJakub Kicinski1-0/+10
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. We converted all users to make modifications via appropriate helpers, make netdev->dev_addr const. The update helpers need to upcast from the buffer to struct netdev_hw_addr. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-20bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmapJohn Fastabend2-1/+9
When a sock is added to a sock map we evaluate what proto op hooks need to be used. However, when the program is removed from the sock map we have not been evaluating if that changes the required program layout. Before the patch listed in the 'fixes' tag this was not causing failures because the base program set handles all cases. Specifically, the case with a stream parser and the case with out a stream parser are both handled. With the fix below we identified a race when running with a proto op that attempts to read skbs off both the stream parser and the skb->receive_queue. Namely, that a race existed where when the stream parser is empty checking the skb->receive_queue from recvmsg at the precies moment when the parser is paused and the receive_queue is not empty could result in skipping the stream parser. This may break a RX policy depending on the parser to run. The fix tag then loads a specific proto ops that resolved this race. But, we missed removing that proto ops recv hook when the sock is removed from the sockmap. The result is the stream parser is stopped so no more skbs will be aggregated there, but the hook and BPF program continues to be attached on the psock. User space will then get an EBUSY when trying to read the socket because the recvmsg() handler is now waiting on a stopped stream parser. To fix we rerun the proto ops init() function which will look at the new set of progs attached to the psock and rest the proto ops hook to the correct handlers. And in the above case where we remove the sock from the sock map the RX prog will no longer be listed so the proto ops is removed. Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211119181418.353932-3-john.fastabend@gmail.com
2021-11-20bpf, sockmap: Attach map progs to psock early for feature probesJohn Fastabend1-4/+6
When a TCP socket is added to a sock map we look at the programs attached to the map to determine what proto op hooks need to be changed. Before the patch in the 'fixes' tag there were only two categories -- the empty set of programs or a TX policy. In any case the base set handled the receive case. After the fix we have an optimized program for receive that closes a small, but possible, race on receive. This program is loaded only when the map the psock is being added to includes a RX policy. Otherwise, the race is not possible so we don't need to handle the race condition. In order for the call to sk_psock_init() to correctly evaluate the above conditions all progs need to be set in the psock before the call. However, in the current code this is not the case. We end up evaluating the requirements on the old prog state. If your psock is attached to multiple maps -- for example a tx map and rx map -- then the second update would pull in the correct maps. But, the other pattern with a single rx enabled map the correct receive hooks are not used. The result is the race fixed by the patch in the fixes tag below may still be seen in this case. To fix we simply set all psock->progs before doing the call into sock_map_init(). With this the init() call gets the full list of programs and chooses the correct proto ops on the first iteration instead of requiring the second update to pull them in. This fixes the race case when only a single map is used. Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211119181418.353932-2-john.fastabend@gmail.com
2021-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-10/+16
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-18devlink: Don't throw an error if flash notification sent before devlink visibleLeon Romanovsky1-1/+3
The mlxsw driver calls to various devlink flash routines even before users can get any access to the devlink instance itself. For example, mlxsw_core_fw_rev_validate() one of such functions. __mlxsw_core_bus_device_register -> mlxsw_core_fw_rev_validate -> mlxsw_core_fw_flash -> mlxfw_firmware_flash -> mlxfw_status_notify -> devlink_flash_update_status_notify -> __devlink_flash_update_notify -> WARN_ON(...) It causes to the WARN_ON to trigger warning about devlink not registered. Fixes: cf530217408e ("devlink: Notify users when objects are accessible") Reported-by: Danielle Ratson <danieller@nvidia.com> Tested-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-18page_pool: Revert "page_pool: disable dma mapping support..."Yunsheng Lin1-6/+4
This reverts commit d00e60ee54b12de945b8493cf18c1ada9e422514. As reported by Guillaume in [1]: Enabling LPAE always enables CONFIG_ARCH_DMA_ADDR_T_64BIT in 32-bit systems, which breaks the bootup proceess when a ethernet driver is using page pool with PP_FLAG_DMA_MAP flag. As we were hoping we had no active consumers for such system when we removed the dma mapping support, and LPAE seems like a common feature for 32 bits system, so revert it. 1. https://www.spinics.net/lists/netdev/msg779890.html Fixes: d00e60ee54b1 ("page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reported-by: "kernelci.org bot" <bot@kernelci.org> Tested-by: "kernelci.org bot" <bot@kernelci.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-17net: use an atomic_long_t for queue->trans_timeoutEric Dumazet1-5/+1
tx_timeout_show() assumed dev_watchdog() would stop all the queues, to fetch queue->trans_timeout under protection of the queue->_xmit_lock. As we want to no longer disrupt transmits, we use an atomic_long_t instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: david decotigny <david.decotigny@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-17net: align static siphash keysEric Dumazet2-3/+3
siphash keys use 16 bytes. Define siphash_aligned_key_t macro so that we can make sure they are not crossing a cache line boundary. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski1-0/+6
Daniel Borkmann says: ==================== pull-request: bpf 2021-11-16 We've added 12 non-merge commits during the last 5 day(s) which contain a total of 23 files changed, 573 insertions(+), 73 deletions(-). The main changes are: 1) Fix pruning regression where verifier went overly conservative rejecting previsouly accepted programs, from Alexei Starovoitov and Lorenz Bauer. 2) Fix verifier TOCTOU bug when using read-only map's values as constant scalars during verification, from Daniel Borkmann. 3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson. 4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker. 5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_*() helpers in tracing programs due to deadlock possibilities, from Dmitrii Banshchikov. 6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang. 7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin. 8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: udp: Validate checksum in udp_read_sock() bpf: Fix toctou on read-only map's constant scalar tracking samples/bpf: Fix build error due to -isystem removal selftests/bpf: Add tests for restricted helpers bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs libbpf: Perform map fd cleanup for gen_loader in case of error samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu tools/runqslower: Fix cross-build samples/bpf: Fix summary per-sec stats in xdp_sample_user selftests/bpf: Check map in map pruning bpf: Fix inner map state pruning regression. xsk: Fix crash on double free in buffer pool ==================== Link: https://lore.kernel.org/r/20211116141134.6490-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-16net: merge net->core.prot_inuse and net->core.sock_inuseEric Dumazet1-11/+1
net->core.sock_inuse is a per cpu variable (int), while net->core.prot_inuse is another per cpu variable of 64 integers. per cpu allocator tend to place them in very different places. Grouping them together makes sense, since it makes updates potentially faster, if hitting the same cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16net: make sock_inuse_add() availableEric Dumazet1-10/+0
MPTCP hard codes it, let us instead provide this helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16net: inline sock_prot_inuse_add()Eric Dumazet1-11/+0
sock_prot_inuse_add() is very small, we can inline it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16net: gro: populate net/core/gro.cEric Dumazet2-667/+649
Move gro code and data from net/core/dev.c to net/core/gro.c to ease maintenance. gro_normal_list() and gro_normal_one() are inlined because they are called from both files. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16net: gro: move skb_gro_receive into net/core/gro.cEric Dumazet3-118/+119
net/core/gro.c will contain all core gro functions, to shrink net/core/skbuff.c and net/core/dev.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16net: gro: move skb_gro_receive_list to udp_offload.cEric Dumazet1-26/+0
This helper is used once, no need to keep it in fat net/core/skbuff.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16net: move gro definitions to include/net/gro.hEric Dumazet1-0/+1
include/linux/netdevice.h became too big, move gro stuff into include/net/gro.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16tcp: add RETPOLINE mitigation to sk_backlog_rcvEric Dumazet1-1/+4
Use INDIRECT_CALL_INET() to avoid an indirect call when/if CONFIG_RETPOLINE=y Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16net: remove sk_route_nocapsEric Dumazet1-1/+2
Instead of using a full netdev_features_t, we can use a single bit, as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from sk->sk_route_cap. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16net: remove sk_route_forced_capsEric Dumazet1-1/+3
We were only using one bit, and we can replace it by sk_is_tcp() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16net: use sk_is_tcp() in more placesEric Dumazet2-8/+4
Move sk_is_tcp() to include/net/sock.h and use it where we can. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progsDmitrii Banshchikov1-0/+6
Use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in tracing progs may result in locking issues. bpf_ktime_get_coarse_ns() uses ktime_get_coarse_ns() time accessor that isn't safe for any context: ====================================================== WARNING: possible circular locking dependency detected 5.15.0-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor.4/14877 is trying to acquire lock: ffffffff8cb30008 (tk_core.seq.seqcount){----}-{0:0}, at: ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255 but task is already holding lock: ffffffff90dbf200 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_deactivate+0x61/0x400 lib/debugobjects.c:735 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&obj_hash[i].lock){-.-.}-{2:2}: lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0xd1/0x120 kernel/locking/spinlock.c:162 __debug_object_init+0xd9/0x1860 lib/debugobjects.c:569 debug_hrtimer_init kernel/time/hrtimer.c:414 [inline] debug_init kernel/time/hrtimer.c:468 [inline] hrtimer_init+0x20/0x40 kernel/time/hrtimer.c:1592 ntp_init_cmos_sync kernel/time/ntp.c:676 [inline] ntp_init+0xa1/0xad kernel/time/ntp.c:1095 timekeeping_init+0x512/0x6bf kernel/time/timekeeping.c:1639 start_kernel+0x267/0x56e init/main.c:1030 secondary_startup_64_no_verify+0xb1/0xbb -> #0 (tk_core.seq.seqcount){----}-{0:0}: check_prev_add kernel/locking/lockdep.c:3051 [inline] check_prevs_add kernel/locking/lockdep.c:3174 [inline] validate_chain+0x1dfb/0x8240 kernel/locking/lockdep.c:3789 __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5015 lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625 seqcount_lockdep_reader_access+0xfe/0x230 include/linux/seqlock.h:103 ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255 ktime_get_coarse include/linux/timekeeping.h:120 [inline] ktime_get_coarse_ns include/linux/timekeeping.h:126 [inline] ____bpf_ktime_get_coarse_ns kernel/bpf/helpers.c:173 [inline] bpf_ktime_get_coarse_ns+0x7e/0x130 kernel/bpf/helpers.c:171 bpf_prog_a99735ebafdda2f1+0x10/0xb50 bpf_dispatcher_nop_func include/linux/bpf.h:721 [inline] __bpf_prog_run include/linux/filter.h:626 [inline] bpf_prog_run include/linux/filter.h:633 [inline] BPF_PROG_RUN_ARRAY include/linux/bpf.h:1294 [inline] trace_call_bpf+0x2cf/0x5d0 kernel/trace/bpf_trace.c:127 perf_trace_run_bpf_submit+0x7b/0x1d0 kernel/events/core.c:9708 perf_trace_lock+0x37c/0x440 include/trace/events/lock.h:39 trace_lock_release+0x128/0x150 include/trace/events/lock.h:58 lock_release+0x82/0x810 kernel/locking/lockdep.c:5636 __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:149 [inline] _raw_spin_unlock_irqrestore+0x75/0x130 kernel/locking/spinlock.c:194 debug_hrtimer_deactivate kernel/time/hrtimer.c:425 [inline] debug_deactivate kernel/time/hrtimer.c:481 [inline] __run_hrtimer kernel/time/hrtimer.c:1653 [inline] __hrtimer_run_queues+0x2f9/0xa60 kernel/time/hrtimer.c:1749 hrtimer_interrupt+0x3b3/0x1040 kernel/time/hrtimer.c:1811 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline] __sysvec_apic_timer_interrupt+0xf9/0x270 arch/x86/kernel/apic/apic.c:1103 sysvec_apic_timer_interrupt+0x8c/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline] _raw_spin_unlock_irqrestore+0xd4/0x130 kernel/locking/spinlock.c:194 try_to_wake_up+0x702/0xd20 kernel/sched/core.c:4118 wake_up_process kernel/sched/core.c:4200 [inline] wake_up_q+0x9a/0xf0 kernel/sched/core.c:953 futex_wake+0x50f/0x5b0 kernel/futex/waitwake.c:184 do_futex+0x367/0x560 kernel/futex/syscalls.c:127 __do_sys_futex kernel/futex/syscalls.c:199 [inline] __se_sys_futex+0x401/0x4b0 kernel/futex/syscalls.c:180 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae There is a possible deadlock with bpf_timer_* set of helpers: hrtimer_start() lock_base(); trace_hrtimer...() perf_event() bpf_run() bpf_timer_start() hrtimer_start() lock_base() <- DEADLOCK Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_PERF_EVENT and BPF_PROG_TYPE_RAW_TRACEPOINT prog types. Fixes: d05512618056 ("bpf: Add bpf_ktime_get_coarse_ns helper") Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.") Reported-by: syzbot+43fd005b5a1b4d10781e@syzkaller.appspotmail.com Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211113142227.566439-2-me@ubique.spb.ru