Age | Commit message (Collapse) | Author | Files | Lines |
|
syzbot found another way to trigger the infamous WARN_ON_ONCE(delta < len)
in skb_try_coalesce() [1]
I was able to root cause the issue to kfence.
When kfence is in action, the following assertion is no longer true:
int size = xxxx;
void *ptr1 = kmalloc(size, gfp);
void *ptr2 = kmalloc(size, gfp);
if (ptr1 && ptr2)
ASSERT(ksize(ptr1) == ksize(ptr2));
We attempted to fix these issues in the blamed commits, but forgot
that TCP was possibly shifting data after skb_unclone_keeptruesize()
has been used, notably from tcp_retrans_try_collapse().
So we not only need to keep same skb->truesize value,
we also need to make sure TCP wont fill new tailroom
that pskb_expand_head() was able to get from a
addr = kmalloc(...) followed by ksize(addr)
Split skb_unclone_keeptruesize() into two parts:
1) Inline skb_unclone_keeptruesize() for the common case,
when skb is not cloned.
2) Out of line __skb_unclone_keeptruesize() for the 'slow path'.
WARNING: CPU: 1 PID: 6490 at net/core/skbuff.c:5295 skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295
Modules linked in:
CPU: 1 PID: 6490 Comm: syz-executor161 Not tainted 5.17.0-rc4-syzkaller-00229-g4f12b742eb2b #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295
Code: bf 01 00 00 00 0f b7 c0 89 c6 89 44 24 20 e8 62 24 4e fa 8b 44 24 20 83 e8 01 0f 85 e5 f0 ff ff e9 87 f4 ff ff e8 cb 20 4e fa <0f> 0b e9 06 f9 ff ff e8 af b2 95 fa e9 69 f0 ff ff e8 95 b2 95 fa
RSP: 0018:ffffc900063af268 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 00000000ffffffd5 RCX: 0000000000000000
RDX: ffff88806fc05700 RSI: ffffffff872abd55 RDI: 0000000000000003
RBP: ffff88806e675500 R08: 00000000ffffffd5 R09: 0000000000000000
R10: ffffffff872ab659 R11: 0000000000000000 R12: ffff88806dd554e8
R13: ffff88806dd9bac0 R14: ffff88806dd9a2c0 R15: 0000000000000155
FS: 00007f18014f9700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020002000 CR3: 000000006be7a000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
tcp_try_coalesce net/ipv4/tcp_input.c:4651 [inline]
tcp_try_coalesce+0x393/0x920 net/ipv4/tcp_input.c:4630
tcp_queue_rcv+0x8a/0x6e0 net/ipv4/tcp_input.c:4914
tcp_data_queue+0x11fd/0x4bb0 net/ipv4/tcp_input.c:5025
tcp_rcv_established+0x81e/0x1ff0 net/ipv4/tcp_input.c:5947
tcp_v4_do_rcv+0x65e/0x980 net/ipv4/tcp_ipv4.c:1719
sk_backlog_rcv include/net/sock.h:1037 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2779
release_sock+0x54/0x1b0 net/core/sock.c:3311
sk_wait_data+0x177/0x450 net/core/sock.c:2821
tcp_recvmsg_locked+0xe28/0x1fd0 net/ipv4/tcp.c:2457
tcp_recvmsg+0x137/0x610 net/ipv4/tcp.c:2572
inet_recvmsg+0x11b/0x5e0 net/ipv4/af_inet.c:850
sock_recvmsg_nosec net/socket.c:948 [inline]
sock_recvmsg net/socket.c:966 [inline]
sock_recvmsg net/socket.c:962 [inline]
____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
___sys_recvmsg+0x127/0x200 net/socket.c:2674
__sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: c4777efa751d ("net: add and use skb_unclone_keeptruesize() helper")
Fixes: 097b9146c0e2 ("net: fix up truesize of cloned skb in skb_prepare_for_shift()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We have multiple places where this helper is convenient,
and plan using it in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch add a new bonding option ns_ip6_target, which correspond
to the arp_ip_target. With this we set IPv6 targets and send IPv6 NS
request to determine the health of the link.
For other related options like the validation, we still use
arp_validate, and will change to ns_validate later.
Note: the sysfs configuration support was removed based on
https://lore.kernel.org/netdev/8863.1645071997@famine
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a new bonding parameter ns_targets to store IPv6 address.
Add required bond_ns_send/rcv functions first before adding
IPv6 address option setting.
Add two functions bond_send/rcv_validate so we can send/recv
ARP and NS at the same time.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Adding an extra storage field for bond_opt_value so we can set large
bytes of data for bonding options in future, e.g. IPv6 address.
Define a new call bond_opt_initextra(). Also change the checking order of
__bond_opt_init() and check values first.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch separate NS message allocation steps from ndisc_send_ns(),
so it could be used in other places, like bonding, to allocate and
send IPv6 NS message.
Also export ndisc_send_skb() and ndisc_ns_create() for later bonding usage.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In case user space sends a packet destined to a broadcast address when a
matching broadcast route is not configured, the kernel will create a
unicast neighbour entry that will never be resolved [1].
When the broadcast route is configured, the unicast neighbour entry will
not be invalidated and continue to linger, resulting in packets being
dropped.
Solve this by invalidating unresolved neighbour entries for broadcast
addresses after routes for these addresses are internally configured by
the kernel. This allows the kernel to create a broadcast neighbour entry
following the next route lookup.
Another possible solution that is more generic but also more complex is
to have the ARP code register a listener to the FIB notification chain
and invalidate matching neighbour entries upon the addition of broadcast
routes.
It is also possible to wave off the issue as a user space problem, but
it seems a bit excessive to expect user space to be that intimately
familiar with the inner workings of the FIB/neighbour kernel code.
[1] https://lore.kernel.org/netdev/55a04a8f-56f3-f73c-2aea-2195923f09d1@huawei.com/
Reported-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Replace tcp_drop() used in tcp_data_queue_ofo with tcp_drop_reason().
Following drop reasons are introduced:
SKB_DROP_REASON_TCP_OFOMERGE
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Replace tcp_drop() used in tcp_data_queue() with tcp_drop_reason().
Following drop reasons are introduced:
SKB_DROP_REASON_TCP_ZEROWINDOW
SKB_DROP_REASON_TCP_OLD_DATA
SKB_DROP_REASON_TCP_OVERWINDOW
SKB_DROP_REASON_TCP_OLD_DATA is used for the case that end_seq of skb
less than the left edges of receive window. (Maybe there is a better
name?)
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Replace tcp_drop() used in tcp_rcv_established() with tcp_drop_reason().
Following drop reasons are added:
SKB_DROP_REASON_TCP_FLAGS
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pass the address of drop_reason to tcp_add_backlog() to store the
reasons for skb drops when fails. Following drop reasons are
introduced:
SKB_DROP_REASON_SOCKET_BACKLOG
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pass the address of drop reason to tcp_v4_inbound_md5_hash() and
tcp_v6_inbound_md5_hash() to store the reasons for skb drops when this
function fails. Therefore, the drop reason can be passed to
kfree_skb_reason() when the skb needs to be freed.
Following drop reasons are added:
SKB_DROP_REASON_TCP_MD5NOTFOUND
SKB_DROP_REASON_TCP_MD5UNEXPECTED
SKB_DROP_REASON_TCP_MD5FAILURE
SKB_DROP_REASON_TCP_MD5* above correspond to LINUX_MIB_TCPMD5*
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
phylink_config's pcs_poll is no longer used, let's get rid of it.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
With drivers converted over to using phylink PCS, there is no need for
the struct dsa_switch member "pcs_poll" to exist anymore - there is a
flag in the struct phylink_pcs which indicates whether this PCS needs
to be polled which supersedes this.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds support for MRT6MSG_WRMIFWHOLE which is used to pass
full packet and real vif id when the incoming interface is wrong.
While the RP and FHR are setting up state we need to be sending the
registers encapsulated with all the data inside otherwise we lose it.
The RP then decapsulates it and forwards it to the interested parties.
Currently with WRONGMIF we can only be sending empty register packets
and will lose that data.
This behaviour can be enabled by using MRT_PIM with
val == MRT6MSG_WRMIFWHOLE. This doesn't prevent MRT6MSG_WRONGMIF from
happening, it happens in addition to it, also it is controlled by the same
throttling parameters as WRONGMIF (i.e. 1 packet per 3 seconds currently).
Both messages are generated to keep backwards compatibily and avoid
breaking someone who was enabling MRT_PIM with val == 4, since any
positive val is accepted and treated the same.
Signed-off-by: Mobashshera Rasool <mobash.rasool.linux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, we have mctp_address_ok(), which checks if an EID is in the
"valid" range of 8-254 inclusive. However, 0 and 255 may also be valid
addresses, depending on context. 0 is the NULL EID, which may be set
when physical addressing is used. 255 is valid as a destination address
for broadcasts.
This change renames mctp_address_ok to mctp_address_unicast, and adds
similar helpers for broadcast and null EIDs, which will be used in an
upcoming commit.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch adds a new protocol attribute to IPv4 and IPv6 addresses.
Inspiration was taken from the protocol attribute of routes. User space
applications like iproute2 can set/get the protocol with the Netlink API.
The attribute is stored as an 8-bit unsigned integer.
The protocol attribute is set by kernel for these categories:
- IPv4 and IPv6 loopback addresses
- IPv6 addresses generated from router announcements
- IPv6 link local addresses
User space may pass custom protocols, not defined by the kernel.
Grouping addresses on their origin is useful in scenarios where you want
to distinguish between addresses based on who added them, e.g. kernel
vs. user space.
Tagging addresses with a string label is an existing feature that could be
used as a solution. Unfortunately the max length of a label is
15 characters, and for compatibility reasons the label must be prefixed
with the name of the device followed by a colon. Since device names also
have a max length of 15 characters, only -1 characters is guaranteed to be
available for any origin tag, which is not that much.
A reference implementation of user space setting and getting protocols
is available for iproute2:
https://github.com/westermo/iproute2/commit/9a6ea18bd79f47f293e5edc7780f315ea42ff540
Signed-off-by: Jacques de Laval <Jacques.De.Laval@westermo.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220217150202.80802-1-Jacques.De.Laval@westermo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add DSA support for the phylink mac_select_pcs() method so DSA drivers
can return provide phylink with the appropriate PCS for the PHY
interface mode.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Following the cited commit, sparse started complaining about:
../include/net/gro.h:58:1: warning: directive in macro's argument list
../include/net/gro.h:59:1: warning: directive in macro's argument list
Fix that by moving the defines out of the struct_group() macro.
Fixes: de5a1f3ce4c8 ("net: gro: minor optimization for dev_gro_receive()")
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Acked-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduced in commit cf963573039a ("net: dsa: Allow providing PHY
statistics from CPU port"), it appears these were never used.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220216193726.2926320-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fast path bpf marge for some -next work.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Alexei Starovoitov says:
====================
pull-request: bpf 2022-02-17
We've added 8 non-merge commits during the last 7 day(s) which contain
a total of 8 files changed, 119 insertions(+), 15 deletions(-).
The main changes are:
1) Add schedule points in map batch ops, from Eric.
2) Fix bpf_msg_push_data with len 0, from Felix.
3) Fix crash due to incorrect copy_map_value, from Kumar.
4) Fix crash due to out of bounds access into reg2btf_ids, from Kumar.
5) Fix a bpf_timer initialization issue with clang, from Yonghong.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add schedule points in batch ops
bpf: Fix crash due to out of bounds access into reg2btf_ids.
selftests: bpf: Check bpf_msg_push_data return value
bpf: Fix a bpf_timer initialization issue
bpf: Emit bpf_timer in vmlinux BTF
selftests/bpf: Add test for bpf_timer overwriting crash
bpf: Fix crash due to incorrect copy_map_value
bpf: Do not try bpf_msg_push_data with len 0
====================
Link: https://lore.kernel.org/r/20220217190000.37925-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
No conflicts.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless and netfilter.
Current release - regressions:
- dsa: lantiq_gswip: fix use after free in gswip_remove()
- smc: avoid overwriting the copies of clcsock callback functions
Current release - new code bugs:
- iwlwifi:
- fix use-after-free when no FW is present
- mei: fix the pskb_may_pull check in ipv4
- mei: retry mapping the shared area
- mvm: don't feed the hardware RFKILL into iwlmei
Previous releases - regressions:
- ipv6: mcast: use rcu-safe version of ipv6_get_lladdr()
- tipc: fix wrong publisher node address in link publications
- iwlwifi: mvm: don't send SAR GEO command for 3160 devices, avoid FW
assertion
- bgmac: make idm and nicpm resource optional again
- atl1c: fix tx timeout after link flap
Previous releases - always broken:
- vsock: remove vsock from connected table when connect is
interrupted by a signal
- ping: change destination interface checks to match raw sockets
- crypto: af_alg - get rid of alg_memory_allocated to avoid confusing
semantics (and null-deref) after SO_RESERVE_MEM was added
- ipv6: make exclusive flowlabel checks per-netns
- bonding: force carrier update when releasing slave
- sched: limit TC_ACT_REPEAT loops
- bridge: multicast: notify switchdev driver whenever MC processing
gets disabled because of max entries reached
- wifi: brcmfmac: fix crash in brcm_alt_fw_path when WLAN not found
- iwlwifi: fix locking when "HW not ready"
- phy: mediatek: remove PHY mode check on MT7531
- dsa: mv88e6xxx: flush switchdev FDB workqueue before removing VLAN
- dsa: lan9303:
- fix polarity of reset during probe
- fix accelerated VLAN handling"
* tag 'net-5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits)
bonding: force carrier update when releasing slave
nfp: flower: netdev offload check for ip6gretap
ipv6: fix data-race in fib6_info_hw_flags_set / fib6_purge_rt
ipv4: fix data races in fib_alias_hw_flags_set
net: dsa: lan9303: add VLAN IDs to master device
net: dsa: lan9303: handle hwaccel VLAN tags
vsock: remove vsock from connected table when connect is interrupted by a signal
Revert "net: ethernet: bgmac: Use devm_platform_ioremap_resource_byname"
ping: fix the dif and sdif check in ping_lookup
net: usb: cdc_mbim: avoid altsetting toggling for Telit FN990
net: sched: limit TC_ACT_REPEAT loops
tipc: fix wrong notification node addresses
net: dsa: lantiq_gswip: fix use after free in gswip_remove()
ipv6: per-netns exclusive flowlabel checks
net: bridge: multicast: notify switchdev driver whenever MC processing gets disabled
CDC-NCM: avoid overflow in sanity checking
mctp: fix use after free
net: mscc: ocelot: fix use-after-free in ocelot_vlan_del()
bonding: fix data-races around agg_select_timer
dpaa2-eth: Initialize mutex used in one step timestamping path
...
|
|
Because fib6_info_hw_flags_set() is called without any synchronization,
all accesses to gi6->offload, fi->trap and fi->offload_failed
need some basic protection like READ_ONCE()/WRITE_ONCE().
BUG: KCSAN: data-race in fib6_info_hw_flags_set / fib6_purge_rt
read to 0xffff8881087d5886 of 1 bytes by task 13953 on cpu 0:
fib6_drop_pcpu_from net/ipv6/ip6_fib.c:1007 [inline]
fib6_purge_rt+0x4f/0x580 net/ipv6/ip6_fib.c:1033
fib6_del_route net/ipv6/ip6_fib.c:1983 [inline]
fib6_del+0x696/0x890 net/ipv6/ip6_fib.c:2028
__ip6_del_rt net/ipv6/route.c:3876 [inline]
ip6_del_rt+0x83/0x140 net/ipv6/route.c:3891
__ipv6_dev_ac_dec+0x2b5/0x370 net/ipv6/anycast.c:374
ipv6_dev_ac_dec net/ipv6/anycast.c:387 [inline]
__ipv6_sock_ac_close+0x141/0x200 net/ipv6/anycast.c:207
ipv6_sock_ac_close+0x79/0x90 net/ipv6/anycast.c:220
inet6_release+0x32/0x50 net/ipv6/af_inet6.c:476
__sock_release net/socket.c:650 [inline]
sock_close+0x6c/0x150 net/socket.c:1318
__fput+0x295/0x520 fs/file_table.c:280
____fput+0x11/0x20 fs/file_table.c:313
task_work_run+0x8e/0x110 kernel/task_work.c:164
tracehook_notify_resume include/linux/tracehook.h:189 [inline]
exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
exit_to_user_mode_prepare+0x160/0x190 kernel/entry/common.c:207
__syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300
do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
write to 0xffff8881087d5886 of 1 bytes by task 1912 on cpu 1:
fib6_info_hw_flags_set+0x155/0x3b0 net/ipv6/route.c:6230
nsim_fib6_rt_hw_flags_set drivers/net/netdevsim/fib.c:668 [inline]
nsim_fib6_rt_add drivers/net/netdevsim/fib.c:691 [inline]
nsim_fib6_rt_insert drivers/net/netdevsim/fib.c:756 [inline]
nsim_fib6_event drivers/net/netdevsim/fib.c:853 [inline]
nsim_fib_event drivers/net/netdevsim/fib.c:886 [inline]
nsim_fib_event_work+0x284f/0x2cf0 drivers/net/netdevsim/fib.c:1477
process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
worker_thread+0x616/0xa70 kernel/workqueue.c:2454
kthread+0x2c7/0x2e0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30
value changed: 0x22 -> 0x2a
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 1912 Comm: kworker/1:3 Not tainted 5.16.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events nsim_fib_event_work
Fixes: 0c5fcf9e249e ("IPv6: Add "offload failed" indication to routes")
Fixes: bb3c4ab93e44 ("ipv6: Add "offload" and "trap" indications to routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Amit Cohen <amcohen@nvidia.com>
Cc: Ido Schimmel <idosch@nvidia.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20220216173217.3792411-2-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Historically, the felix DSA driver has installed special traps such that
PTP over L2 works with the ocelot-8021q tagging protocol; commit
0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP
timestamping") has the details.
Then the ocelot switch library also gained more comprehensive support
for PTP traps through commit 96ca08c05838 ("net: mscc: ocelot: set up
traps for PTP packets").
Right now, PTP over L2 works using ocelot-8021q via the traps it has set
for itself, but nothing else does. Consolidating the two code blocks
would make ocelot-8021q gain support for PTP over L4 and tc-flower
traps, and at the same time avoid some code and TCAM duplication.
The traps are similar in intent, but different in execution, so some
explanation is required. The traps set up by felix_setup_mmio_filtering()
are VCAP IS1 filters, which have a PAG that chains them to a VCAP IS2
filter, and the IS2 is where the 'trap' action resides. The traps set up
by ocelot_trap_add(), on the other hand, have a single filter, in VCAP
IS2. The reason for chaining VCAP IS1 and IS2 in Felix was to ensure
that the hardcoded traps take precedence and cannot be overridden by the
Ocelot switch library.
So in principle, the PTP traps needed for ocelot-8021q in the Felix
driver can rely on ocelot_trap_add(), but the filters need to be patched
to account for a quirk that LS1028A has: the quirk_no_xtr_irq described
in commit 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP
timestamping"). Live-patching is done by iterating through the trap list
every time we know it has been updated, and transforming a trap into a
redirect + CPU copy if ocelot-8021q is in use.
Making the DSA ocelot-8021q tagger work with the Ocelot traps means we
can eliminate the dedicated OCELOT_VCAP_IS1_TAG_8021Q_PTP_MMIO and
OCELOT_VCAP_IS2_TAG_8021Q_PTP_MMIO cookies. To minimize the patch delta,
OCELOT_VCAP_IS2_MRP_TRAP takes the place of OCELOT_VCAP_IS2_TAG_8021Q_PTP_MMIO
(the alternative would have been to left-shift all cookie numbers by 1).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The ocelot switch library does not need this information, but the felix
DSA driver does.
As a reminder, the VSC9959 switch in LS1028A doesn't have an IRQ line
for packet extraction, so to be notified that a PTP packet needs to be
dequeued, it receives that packet also over Ethernet, by setting up a
packet trap. The Felix driver needs to install special kinds of traps
for packets in need of RX timestamps, such that the packets are
replicated both over Ethernet and over the CPU port module.
But the Ocelot switch library sets up more than one trap for PTP event
messages; it also traps PTP general messages, MRP control messages etc.
Those packets don't need PTP timestamps, so there's no reason for the
Felix driver to send them to the CPU port module.
By knowing which traps need PTP timestamps, the Felix driver can
adjust the traps installed using ocelot_trap_add() such that only those
will actually get delivered to the CPU port module.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When using the ocelot-8021q tagging protocol, the CPU port isn't
configured as an NPI port, but is a regular port. So a "trap to CPU"
operation is actually a "redirect" operation. So DSA needs to set up the
trapping action one way or another, depending on the tagging protocol in
use.
To ease DSA's work of modifying the action, keep all currently installed
traps in a list, so that DSA can live-patch them when the tagging
protocol changes.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
OCELOT_VCAP_IS2_TAG_8021Q_TXVLAN overlaps with OCELOT_VCAP_IS2_MRP_REDIRECT.
To avoid this, make OCELOT_VCAP_IS2_MRP_REDIRECT take the cookie region
from N to 2 * N - 1 (where N is ocelot->num_phys_ports).
To avoid any risk that the singleton (not per port) VCAP IS2 filters
overlap with per-port VCAP IS2 filters, we must ensure that the number
of singleton filters is smaller than the number of physical ports.
This is true right now, but may change in the future as switches with
less ports get supported, or more singleton filters get added. So to be
future-proof, let's move the singleton filters at the end of the range,
where they won't overlap with anything to their right.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The MRP assist code installs a VCAP IS2 trapping rule for each port, but
since the key and the action is the same, just the ingress port mask
differs, there isn't any need to do this. We can save some space in the
TCAM by using a single filter and adjusting the ingress port mask.
Reuse the ocelot_trap_add() and ocelot_trap_del() functions for this
purpose.
Now that the cookies are no longer per port, we need to change the
allocation scheme such that MRP traps use a fixed number.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
MRP frames are configured to be trapped to the CPU queue 7, and this
number is reflected in the extraction header. However, the information
isn't used anywhere, so just leave MRP frames to go to CPU queue 0
unless needed otherwise.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Every use case that needed VCAP filters (in order: DSA tag_8021q, MRP,
PTP traps) has hardcoded filter identifiers that worked well enough for
that use case alone. But when two or more of those use cases would be
used together, some of those identifiers would overlap, leading to
breakage.
Add definitions for each cookie and centralize them in ocelot_vcap.h,
such that the overlaps are more obvious.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Options parsing in now done from mptcp_incoming_options().
mptcp_parse_option() has been removed from mptcp.h when CONFIG_MPTCP is
defined but not when it is not.
Fixes: cfde141ea3fa ("mptcp: move option parsing into mptcp_incoming_options()")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Ipv6 flowlabels historically require a reservation before use.
Optionally in exclusive mode (e.g., user-private).
Commit 59c820b2317f ("ipv6: elide flowlabel check if no exclusive
leases exist") introduced a fastpath that avoids this check when no
exclusive leases exist in the system, and thus any flowlabel use
will be granted.
That allows skipping the control operation to reserve a flowlabel
entirely. Though with a warning if the fast path fails:
This is an optimization. Robust applications still have to revert to
requesting leases if the fast path fails due to an exclusive lease.
Still, this is subtle. Better isolate network namespaces from each
other. Flowlabels are per-netns. Also record per-netns whether
exclusive leases are in use. Then behavior does not change based on
activity in other netns.
Changes
v2
- wrap in IS_ENABLED(CONFIG_IPV6) to avoid breakage if disabled
Fixes: 59c820b2317f ("ipv6: elide flowlabel check if no exclusive leases exist")
Link: https://lore.kernel.org/netdev/MWHPR2201MB1072BCCCFCE779E4094837ACD0329@MWHPR2201MB1072.namprd22.prod.outlook.com/
Reported-by: Congyu Liu <liu3101@purdue.edu>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Tested-by: Congyu Liu <liu3101@purdue.edu>
Link: https://lore.kernel.org/r/20220215160037.1976072-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, DSA programs VLANs on shared (DSA and CPU) ports each time it
does so on user ports. This is good for basic functionality but has
several limitations:
- the VLAN group which must reach the CPU may be radically different
from the VLAN group that must be autonomously forwarded by the switch.
In other words, the admin may want to isolate noisy stations and avoid
traffic from them going to the control processor of the switch, where
it would just waste useless cycles. The bridge already supports
independent control of VLAN groups on bridge ports and on the bridge
itself, and when VLAN-aware, it will drop packets in software anyway
if their VID isn't added as a 'self' entry towards the bridge device.
- Replaying host FDB entries may depend, for some drivers like mv88e6xxx,
on replaying the host VLANs as well. The 2 VLAN groups are
approximately the same in most regular cases, but there are corner
cases when timing matters, and DSA's approximation of replicating
VLANs on shared ports simply does not work.
- If a user makes the bridge (implicitly the CPU port) join a VLAN by
accident, there is no way for the CPU port to isolate itself from that
noisy VLAN except by rebooting the system. This is because for each
VLAN added on a user port, DSA will add it on shared ports too, but
for each VLAN deletion on a user port, it will remain installed on
shared ports, since DSA has no good indication of whether the VLAN is
still in use or not.
Now that the bridge driver emits well-balanced SWITCHDEV_OBJ_ID_PORT_VLAN
addition and removal events, DSA has a simple and straightforward task
of separating the bridge port VLANs (these have an orig_dev which is a
DSA slave interface, or a LAG interface) from the host VLANs (these have
an orig_dev which is a bridge interface), and to keep a simple reference
count of each VID on each shared port.
Forwarding VLANs must be installed on the bridge ports and on all DSA
ports interconnecting them. We don't have a good view of the exact
topology, so we simply install forwarding VLANs on all DSA ports, which
is what has been done until now.
Host VLANs must be installed primarily on the dedicated CPU port of each
bridge port. More subtly, they must also be installed on upstream-facing
and downstream-facing DSA ports that are connecting the bridge ports and
the CPU. This ensures that the mv88e6xxx's problem (VID of host FDB
entry may be absent from VTU) is still addressed even if that switch is
in a cross-chip setup, and it has no local CPU port.
Therefore:
- user ports contain only bridge port (forwarding) VLANs, and no
refcounting is necessary
- DSA ports contain both forwarding and host VLANs. Refcounting is
necessary among these 2 types.
- CPU ports contain only host VLANs. Refcounting is also necessary.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
interfaces
The switchdev_handle_port_obj_add() helper is good for replicating a
port object on the lower interfaces of @dev, if that object was emitted
on a bridge, or on a bridge port that is a LAG.
However, drivers that use this helper limit themselves to a box from
which they can no longer intercept port objects notified on neighbor
ports ("foreign interfaces").
One such driver is DSA, where software bridging with foreign interfaces
such as standalone NICs or Wi-Fi APs is an important use case. There, a
VLAN installed on a neighbor bridge port roughly corresponds to a
forwarding VLAN installed on the DSA switch's CPU port.
To support this use case while also making use of the benefits of the
switchdev_handle_* replication helper for port objects, introduce a new
variant of these functions that crawls through the neighbor ports of
@dev, in search of potentially compatible switchdev ports that are
interested in the event.
The strategy is identical to switchdev_handle_fdb_event_to_device():
if @dev wasn't a switchdev interface, then go one step upper, and
recursively call this function on the bridge that this port belongs to.
At the next recursion step, __switchdev_handle_port_obj_add() will
iterate through the bridge's lower interfaces. Among those, some will be
switchdev interfaces, and one will be the original @dev that we came
from. To prevent infinite recursion, we must suppress reentry into the
original @dev, and just call the @add_cb for the switchdev_interfaces.
It looks like this:
br0
/ | \
/ | \
/ | \
swp0 swp1 eth0
1. __switchdev_handle_port_obj_add(eth0)
-> check_cb(eth0) returns false
-> eth0 has no lower interfaces
-> eth0's bridge is br0
-> switchdev_lower_dev_find(br0, check_cb, foreign_dev_check_cb))
finds br0
2. __switchdev_handle_port_obj_add(br0)
-> check_cb(br0) returns false
-> netdev_for_each_lower_dev
-> check_cb(swp0) returns true, so we don't skip this interface
3. __switchdev_handle_port_obj_add(swp0)
-> check_cb(swp0) returns true, so we call add_cb(swp0)
(back to netdev_for_each_lower_dev from 2)
-> check_cb(swp1) returns true, so we don't skip this interface
4. __switchdev_handle_port_obj_add(swp1)
-> check_cb(swp1) returns true, so we call add_cb(swp1)
(back to netdev_for_each_lower_dev from 2)
-> check_cb(eth0) returns false, so we skip this interface to
avoid infinite recursion
Note: eth0 could have been a LAG, and we don't want to suppress the
recursion through its lowers if those exist, so when check_cb() returns
false, we still call switchdev_lower_dev_find() to estimate whether
there's anything worth a recursion beneath that LAG. Using check_cb()
and foreign_dev_check_cb(), switchdev_lower_dev_find() not only figures
out whether the lowers of the LAG are switchdev, but also whether they
actively offload the LAG or not (whether the LAG is "foreign" to the
switchdev interface or not).
The port_obj_info->orig_dev is preserved across recursive calls, so
switchdev drivers still know on which device was this notification
originally emitted.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
br_switchdev_port_vlan_add() currently emits a SWITCHDEV_PORT_OBJ_ADD
event with a SWITCHDEV_OBJ_ID_PORT_VLAN for 2 distinct cases:
- a struct net_bridge_vlan got created
- an existing struct net_bridge_vlan was modified
This makes it impossible for switchdev drivers to properly balance
PORT_OBJ_ADD with PORT_OBJ_DEL events, so if we want to allow that to
happen, we must provide a way for drivers to distinguish between a
VLAN with changed flags and a new one.
Annotate struct switchdev_obj_port_vlan with a "bool changed" that
distinguishes the 2 cases above.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Rework use of DMA_BIT_MASK in vmbus to work around a clang bug
(Michael Kelley)
- Fix NUMA topology (Long Li)
- Fix a memory leak in vmbus (Miaoqian Lin)
- One minor clean-up patch (Cai Huoqing)
* tag 'hyperv-fixes-signed-20220215' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Drivers: hv: utils: Make use of the helper macro LIST_HEAD()
Drivers: hv: vmbus: Rework use of DMA_BIT_MASK(64)
Drivers: hv: vmbus: Fix memory leak in vmbus_add_channel_kobj
PCI: hv: Fix NUMA node assignment when kernel boots with custom NUMA topology
|
|
syzbot reported that two threads might write over agg_select_timer
at the same time. Make agg_select_timer atomic to fix the races.
BUG: KCSAN: data-race in bond_3ad_initiate_agg_selection / bond_3ad_state_machine_handler
read to 0xffff8881242aea90 of 4 bytes by task 1846 on cpu 1:
bond_3ad_state_machine_handler+0x99/0x2810 drivers/net/bonding/bond_3ad.c:2317
process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
worker_thread+0x616/0xa70 kernel/workqueue.c:2454
kthread+0x1bf/0x1e0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30
write to 0xffff8881242aea90 of 4 bytes by task 25910 on cpu 0:
bond_3ad_initiate_agg_selection+0x18/0x30 drivers/net/bonding/bond_3ad.c:1998
bond_open+0x658/0x6f0 drivers/net/bonding/bond_main.c:3967
__dev_open+0x274/0x3a0 net/core/dev.c:1407
dev_open+0x54/0x190 net/core/dev.c:1443
bond_enslave+0xcef/0x3000 drivers/net/bonding/bond_main.c:1937
do_set_master net/core/rtnetlink.c:2532 [inline]
do_setlink+0x94f/0x2500 net/core/rtnetlink.c:2736
__rtnl_newlink net/core/rtnetlink.c:3414 [inline]
rtnl_newlink+0xfeb/0x13e0 net/core/rtnetlink.c:3529
rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5594
netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmsg+0x195/0x230 net/socket.c:2496
__do_sys_sendmsg net/socket.c:2505 [inline]
__se_sys_sendmsg net/socket.c:2503 [inline]
__x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x00000050 -> 0x0000004f
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 25910 Comm: syz-executor.1 Tainted: G W 5.17.0-rc4-syzkaller-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
WWAN driver call's wwan_get_debugfs_dir() to obtain
WWAN debugfs dir entry. As part of this procedure it
returns a reference to a found device.
Since there is no debugfs interface available at WWAN
subsystem, it is not possible to drop dev reference post
debugfs use. This leads to side effects like post wwan
driver load and reload the wwan instance gets increment
from wwanX to wwanX+1.
A new debugfs interface is added in wwan subsystem so that
wwan driver can drop the obtained dev reference post debugfs
use.
void wwan_put_debugfs_dir(struct dentry *dir)
Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Dave suggested a while ago (eleven years by now) "Let's make netif_rx()
work in all contexts and get rid of netif_rx_ni()". Eric agreed and
pointed out that modern devices should use netif_receive_skb() to avoid
the overhead.
In the meantime someone added another variant, netif_rx_any_context(),
which behaves as suggested.
netif_rx() must be invoked with disabled bottom halves to ensure that
pending softirqs, which were raised within the function, are handled.
netif_rx_ni() can be invoked only from process context (bottom halves
must be enabled) because the function handles pending softirqs without
checking if bottom halves were disabled or not.
netif_rx_any_context() invokes on the former functions by checking
in_interrupts().
netif_rx() could be taught to handle both cases (disabled and enabled
bottom halves) by simply disabling bottom halves while invoking
netif_rx_internal(). The local_bh_enable() invocation will then invoke
pending softirqs only if the BH-disable counter drops to zero.
Eric is concerned about the overhead of BH-disable+enable especially in
regard to the loopback driver. As critical as this driver is, it will
receive a shortcut to avoid the additional overhead which is not needed.
Add a local_bh_disable() section in netif_rx() to ensure softirqs are
handled if needed.
Provide __netif_rx() which does not disable BH and has a lockdep assert
to ensure that interrupts are disabled. Use this shortcut in the
loopback driver and in drivers/net/*.c.
Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they
can be removed once they are no more users left.
Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
syzbot found a data-race [1] which lead me to add __rcu
annotations to netdev->qdisc, and proper accessors
to get LOCKDEP support.
[1]
BUG: KCSAN: data-race in dev_activate / qdisc_lookup_rcu
write to 0xffff888168ad6410 of 8 bytes by task 13559 on cpu 1:
attach_default_qdiscs net/sched/sch_generic.c:1167 [inline]
dev_activate+0x2ed/0x8f0 net/sched/sch_generic.c:1221
__dev_open+0x2e9/0x3a0 net/core/dev.c:1416
__dev_change_flags+0x167/0x3f0 net/core/dev.c:8139
rtnl_configure_link+0xc2/0x150 net/core/rtnetlink.c:3150
__rtnl_newlink net/core/rtnetlink.c:3489 [inline]
rtnl_newlink+0xf4d/0x13e0 net/core/rtnetlink.c:3529
rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5594
netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmsg+0x195/0x230 net/socket.c:2496
__do_sys_sendmsg net/socket.c:2505 [inline]
__se_sys_sendmsg net/socket.c:2503 [inline]
__x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff888168ad6410 of 8 bytes by task 13560 on cpu 0:
qdisc_lookup_rcu+0x30/0x2e0 net/sched/sch_api.c:323
__tcf_qdisc_find+0x74/0x3a0 net/sched/cls_api.c:1050
tc_del_tfilter+0x1c7/0x1350 net/sched/cls_api.c:2211
rtnetlink_rcv_msg+0x5ba/0x7e0 net/core/rtnetlink.c:5585
netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmsg+0x195/0x230 net/socket.c:2496
__do_sys_sendmsg net/socket.c:2505 [inline]
__se_sys_sendmsg net/socket.c:2503 [inline]
__x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0xffffffff85dee080 -> 0xffff88815d96ec00
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 13560 Comm: syz-executor.2 Not tainted 5.17.0-rc3-syzkaller-00116-gf1baf68e1383-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 470502de5bdb ("net: sched: unlock rules update API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
mv88e6xxx is special among DSA drivers in that it requires the VTU to
contain the VID of the FDB entry it modifies in
mv88e6xxx_port_db_load_purge(), otherwise it will return -EOPNOTSUPP.
Sometimes due to races this is not always satisfied even if external
code does everything right (first deletes the FDB entries, then the
VLAN), because DSA commits to hardware FDB entries asynchronously since
commit c9eb3e0f8701 ("net: dsa: Add support for learning FDB through
notification").
Therefore, the mv88e6xxx driver must close this race condition by
itself, by asking DSA to flush the switchdev workqueue of any FDB
deletions in progress, prior to exiting a VLAN.
Fixes: c9eb3e0f8701 ("net: dsa: Add support for learning FDB through notification")
Reported-by: Rafael Richter <rafael.richter@gin.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some time ago 8965779d2c0e ("ipv6,mcast: always hold idev->lock before mca_lock")
switched ipv6_get_lladdr() to __ipv6_get_lladdr(), which is rcu-unsafe
version. That was OK, because idev->lock was held for these codepaths.
In 88e2ca308094 ("mld: convert ifmcaddr6 to RCU") these external locks were
removed, so we probably need to restore the original rcu-safe call.
Otherwise, we occasionally get a machine crashed/stalled with the following
in dmesg:
[ 3405.966610][T230589] general protection fault, probably for non-canonical address 0xdead00000000008c: 0000 [#1] SMP NOPTI
[ 3405.982083][T230589] CPU: 44 PID: 230589 Comm: kworker/44:3 Tainted: G O 5.15.19-cloudflare-2022.2.1 #1
[ 3405.998061][T230589] Hardware name: SUPA-COOL-SERV
[ 3406.009552][T230589] Workqueue: mld mld_ifc_work
[ 3406.017224][T230589] RIP: 0010:__ipv6_get_lladdr+0x34/0x60
[ 3406.025780][T230589] Code: 57 10 48 83 c7 08 48 89 e5 48 39 d7 74 3e 48 8d 82 38 ff ff ff eb 13 48 8b 90 d0 00 00 00 48 8d 82 38 ff ff ff 48 39 d7 74 22 <66> 83 78 32 20 77 1b 75 e4 89 ca 23 50 2c 75 dd 48 8b 50 08 48 8b
[ 3406.055748][T230589] RSP: 0018:ffff94e4b3fc3d10 EFLAGS: 00010202
[ 3406.065617][T230589] RAX: dead00000000005a RBX: ffff94e4b3fc3d30 RCX: 0000000000000040
[ 3406.077477][T230589] RDX: dead000000000122 RSI: ffff94e4b3fc3d30 RDI: ffff8c3a31431008
[ 3406.089389][T230589] RBP: ffff94e4b3fc3d10 R08: 0000000000000000 R09: 0000000000000000
[ 3406.101445][T230589] R10: ffff8c3a31430000 R11: 000000000000000b R12: ffff8c2c37887100
[ 3406.113553][T230589] R13: ffff8c3a39537000 R14: 00000000000005dc R15: ffff8c3a31431000
[ 3406.125730][T230589] FS: 0000000000000000(0000) GS:ffff8c3b9fc80000(0000) knlGS:0000000000000000
[ 3406.138992][T230589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3406.149895][T230589] CR2: 00007f0dfea1db60 CR3: 000000387b5f2000 CR4: 0000000000350ee0
[ 3406.162421][T230589] Call Trace:
[ 3406.170235][T230589] <TASK>
[ 3406.177736][T230589] mld_newpack+0xfe/0x1a0
[ 3406.186686][T230589] add_grhead+0x87/0xa0
[ 3406.195498][T230589] add_grec+0x485/0x4e0
[ 3406.204310][T230589] ? newidle_balance+0x126/0x3f0
[ 3406.214024][T230589] mld_ifc_work+0x15d/0x450
[ 3406.223279][T230589] process_one_work+0x1e6/0x380
[ 3406.232982][T230589] worker_thread+0x50/0x3a0
[ 3406.242371][T230589] ? rescuer_thread+0x360/0x360
[ 3406.252175][T230589] kthread+0x127/0x150
[ 3406.261197][T230589] ? set_kthread_struct+0x40/0x40
[ 3406.271287][T230589] ret_from_fork+0x22/0x30
[ 3406.280812][T230589] </TASK>
[ 3406.288937][T230589] Modules linked in: ... [last unloaded: kheaders]
[ 3406.476714][T230589] ---[ end trace 3525a7655f2f3b9e ]---
Fixes: 88e2ca308094 ("mld: convert ifmcaddr6 to RCU")
Reported-by: David Pinilla Caparros <dpini@cloudflare.com>
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Create and utilize bulk regmap reads instead of single access for gathering
stats. The background reading of statistics happens frequently, and over
a few contiguous memory regions.
High speed PCIe buses and MMIO access will probably see negligible
performance increase. Lower speed buses like SPI and I2C could see
significant performance increase, since the bus configuration and register
access times account for a large percentage of data transfer time.
Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Regmap supports bulk register reads. Ocelot does not. This patch adds
support for Ocelot to invoke bulk regmap reads. That will allow any driver
that performs consecutive reads over memory regions to optimize that
access.
Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the ocelot.h file, several read / write macros were split across
multiple lines, while others weren't. Split all macros that exceed the 80
character column width and match the style of the rest of the file.
Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fix from Borislav Petkov:
"Fix a case where objtool would mistakenly warn about instructions
being unreachable"
* tag 'objtool_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bug: Merge annotate_reachable() into _BUG_FLAGS() asm
|
|
With GCC 12, -Wstringop-overread was warning about an implicit cast from
char[6] to char[8]. However, the extra 2 bytes are always thrown away,
alignment doesn't matter, and the risk of hitting the edge of unallocated
memory has been accepted, so this prototype can just be converted to a
regular char *. Silences:
net/core/dev.c: In function ‘bpf_prog_run_generic_xdp’: net/core/dev.c:4618:21: warning: ‘ether_addr_equal_64bits’ reading 8 bytes from a region of size 6 [-Wstringop-overread]
4618 | orig_host = ether_addr_equal_64bits(eth->h_dest, > skb->dev->dev_addr);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
net/core/dev.c:4618:21: note: referencing argument 1 of type ‘const u8[8]’ {aka ‘const unsigned char[8]’}
net/core/dev.c:4618:21: note: referencing argument 2 of type ‘const u8[8]’ {aka ‘const unsigned char[8]’}
In file included from net/core/dev.c:91: include/linux/etherdevice.h:375:20: note: in a call to function ‘ether_addr_equal_64bits’
375 | static inline bool ether_addr_equal_64bits(const u8 addr1[6+2],
| ^~~~~~~~~~~~~~~~~~~~~~~
Reported-by: Marc Kleine-Budde <mkl@pengutronix.de>
Tested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/netdev/20220212090811.uuzk6d76agw2vv73@pengutronix.de
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Merge misc fixes from Andrew Morton:
"5 patches.
Subsystems affected by this patch series: binfmt, procfs, and mm
(vmscan, memcg, and kfence)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
kfence: make test case compatible with run time set sample interval
mm: memcg: synchronize objcg lists with a dedicated spinlock
mm: vmscan: remove deadlock due to throttling failing to make progress
fs/proc: task_mmu.c: don't read mapcount for migration entry
fs/binfmt_elf: fix PT_LOAD p_align values for loaders
|