summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2025-02-20flow_dissector: Fix handling of mixed port and port-range keysCong Wang1-12/+19
This patch fixes a bug in TC flower filter where rules combining a specific destination port with a source port range weren't working correctly. The specific case was when users tried to configure rules like: tc filter add dev ens38 ingress protocol ip flower ip_proto udp \ dst_port 5000 src_port 2000-3000 action drop The root cause was in the flow dissector code. While both FLOW_DISSECTOR_KEY_PORTS and FLOW_DISSECTOR_KEY_PORTS_RANGE flags were being set correctly in the classifier, the __skb_flow_dissect_ports() function was only populating one of them: whichever came first in the enum check. This meant that when the code needed both a specific port and a port range, one of them would be left as 0, causing the filter to not match packets as expected. Fix it by removing the either/or logic and instead checking and populating both key types independently when they're in use. Fixes: 8ffb055beae5 ("cls_flower: Fix the behavior using port ranges with hw-offload") Reported-by: Qiang Zhang <dtzq01@gmail.com> Closes: https://lore.kernel.org/netdev/CAPx+-5uvFxkhkz4=j_Xuwkezjn9U6kzKTD5jz4tZ9msSJ0fOJA@mail.gmail.com/ Cc: Yoshiki Komachi <komachi.yoshiki@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250218043210.732959-2-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: fib_rules: Enable port mask usageIdo Schimmel1-2/+2
Allow user space to configure FIB rules that match on the source and destination ports with a mask, now that support has been added to the FIB rule core and the IPv4 and IPv6 address families. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250217134109.311176-6-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20ipv6: fib_rules: Add port mask matchingIdo Schimmel1-4/+4
Extend IPv6 FIB rules to match on source and destination ports using a mask. Note that the mask is only set when not matching on a range. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250217134109.311176-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20ipv4: fib_rules: Add port mask matchingIdo Schimmel1-4/+4
Extend IPv4 FIB rules to match on source and destination ports using a mask. Note that the mask is only set when not matching on a range. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250217134109.311176-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: fib_rules: Add port mask supportIdo Schimmel1-1/+66
Add support for configuring and deleting rules that match on source and destination ports using a mask as well as support for dumping such rules to user space. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250217134109.311176-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: fib_rules: Add port mask attributesIdo Schimmel1-0/+2
Add attributes that allow matching on source and destination ports with a mask. Matching on the source port with a mask is needed in deployments where users encode path information into certain bits of the UDP source port. Temporarily set the type of the attributes to 'NLA_REJECT' while support is being added. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250217134109.311176-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20SUNRPC: Handle -ETIMEDOUT return from tlshdBenjamin Coddington1-1/+9
If the TLS handshake attempt returns -ETIMEDOUT, we currently translate that error into -EACCES. This becomes problematic for cases where the RPC layer is attempting to re-connect in paths that don't resonably handle -EACCES, for example: writeback. The RPC layer can handle -ETIMEDOUT quite well, however - so if the handshake returns this error let's just pass it along. Fixes: 75eb6af7acdf ("SUNRPC: Add a TCP-with-TLS RPC transport class") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-02-20SUNRPC: Prevent looping due to rpc_signal_task() racesTrond Myklebust1-2/+0
If rpc_signal_task() is called while a task is in an rpc_call_done() callback function, and the latter calls rpc_restart_call(), the task can end up looping due to the RPC_TASK_SIGNALLED flag being set without the tk_rpc_status being set. Removing the redundant mechanism for signalling the task fixes the looping behaviour. Reported-by: Li Lingfeng <lilingfeng3@huawei.com> Fixes: 39494194f93b ("SUNRPC: Fix races with rpc_killall_tasks()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-02-19bpf: Support TCP_RTO_MAX_MS for bpf_setsockoptJason Xing1-0/+1
Some applications don't want to wait for too long because the time of retransmission increases exponentially and can reach more than 10 seconds, for example. Eric implements the core logic on supporting rto max feature in the stack previously. Based on that, we can support it for BPF use. This patch reuses the same logic of TCP_RTO_MAX_MS in do_tcp_setsockopt() and do_tcp_getsockopt(). BPF program can call bpf_{set/get}sockopt() to set/get the maximum value of RTO. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219081333.56378-2-kerneljasonxing@gmail.com
2025-02-19dev: Use rtnl_net_dev_lock() in unregister_netdev().Kuniyuki Iwashima1-4/+2
The following sequence is basically illegal when dev was fetched without lookup because dev_net(dev) might be different after holding rtnl_net_lock(): net = dev_net(dev); rtnl_net_lock(net); Let's use rtnl_net_dev_lock() in unregister_netdev(). Note that there is no real bug in unregister_netdev() for now because RTNL protects the scope even if dev_net(dev) is changed before/after RTNL. Fixes: 00fb9823939e ("dev: Hold per-netns RTNL in (un)?register_netdev().") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250217191129.19967-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19net: Fix dev_net(dev) race in unregister_netdevice_notifier_dev_net().Kuniyuki Iwashima1-4/+44
After the cited commit, dev_net(dev) is fetched before holding RTNL and passed to __unregister_netdevice_notifier_net(). However, dev_net(dev) might be different after holding RTNL. In the reported case [0], while removing a VF device, its netns was being dismantled and the VF was moved to init_net. So the following sequence is basically illegal when dev was fetched without lookup: net = dev_net(dev); rtnl_net_lock(net); Let's use a new helper rtnl_net_dev_lock() to fix the race. It fetches dev_net_rcu(dev), bumps its net->passive, and checks if dev_net_rcu(dev) is changed after rtnl_net_lock(). [0]: BUG: KASAN: slab-use-after-free in notifier_call_chain (kernel/notifier.c:75 (discriminator 2)) Read of size 8 at addr ffff88810cefb4c8 by task test-bridge-lag/21127 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:123) print_report (mm/kasan/report.c:379 mm/kasan/report.c:489) kasan_report (mm/kasan/report.c:604) notifier_call_chain (kernel/notifier.c:75 (discriminator 2)) call_netdevice_notifiers_info (net/core/dev.c:2011) unregister_netdevice_many_notify (net/core/dev.c:11551) unregister_netdevice_queue (net/core/dev.c:11487) unregister_netdev (net/core/dev.c:11635) mlx5e_remove (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:6552 drivers/net/ethernet/mellanox/mlx5/core/en_main.c:6579) mlx5_core auxiliary_bus_remove (drivers/base/auxiliary.c:230) device_release_driver_internal (drivers/base/dd.c:1275 drivers/base/dd.c:1296) bus_remove_device (./include/linux/kobject.h:193 drivers/base/base.h:73 drivers/base/bus.c:583) device_del (drivers/base/power/power.h:142 drivers/base/core.c:3855) mlx5_rescan_drivers_locked (./include/linux/auxiliary_bus.h:241 drivers/net/ethernet/mellanox/mlx5/core/dev.c:333 drivers/net/ethernet/mellanox/mlx5/core/dev.c:535 drivers/net/ethernet/mellanox/mlx5/core/dev.c:549) mlx5_core mlx5_unregister_device (drivers/net/ethernet/mellanox/mlx5/core/dev.c:468) mlx5_core mlx5_uninit_one (./include/linux/instrumented.h:68 ./include/asm-generic/bitops/instrumented-non-atomic.h:141 drivers/net/ethernet/mellanox/mlx5/core/main.c:1563) mlx5_core remove_one (drivers/net/ethernet/mellanox/mlx5/core/main.c:965 drivers/net/ethernet/mellanox/mlx5/core/main.c:2019) mlx5_core pci_device_remove (./include/linux/pm_runtime.h:129 drivers/pci/pci-driver.c:475) device_release_driver_internal (drivers/base/dd.c:1275 drivers/base/dd.c:1296) unbind_store (drivers/base/bus.c:245) kernfs_fop_write_iter (fs/kernfs/file.c:338) vfs_write (fs/read_write.c:587 (discriminator 1) fs/read_write.c:679 (discriminator 1)) ksys_write (fs/read_write.c:732) do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) RIP: 0033:0x7f6a4d5018b7 Fixes: 7fb1073300a2 ("net: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_dev_net().") Reported-by: Yael Chemla <ychemla@nvidia.com> Closes: https://lore.kernel.org/netdev/146eabfe-123c-4970-901e-e961b4c09bc3@nvidia.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250217191129.19967-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19net: Add net_passive_inc() and net_passive_dec().Kuniyuki Iwashima1-4/+4
net_drop_ns() is NULL when CONFIG_NET_NS is disabled. The next patch introduces a function that increments and decrements net->passive. As a prep, let's rename and export net_free() to net_passive_dec() and add net_passive_inc(). Suggested-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/CANn89i+oUCt2VGvrbrweniTendZFEh+nwS=uonc004-aPkWy-Q@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250217191129.19967-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19ipv6: initialize inet socket cookies with sockcm_initWillem de Bruijn3-12/+3
Avoid open coding the same logic. Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250214222720.3205500-8-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19ipv6: replace ipcm6_init calls with ipcm6_init_skWillem de Bruijn3-20/+3
This initializes tclass and dontfrag before cmsg parsing, removing the need for explicit checks against -1 in each caller. Leave hlimit set to -1, because its full initialization (in ip6_sk_dst_hoplimit) requires more state (dst, flowi6, ..). This also prepares for calling sockcm_init in a follow-on patch. Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250214222720.3205500-7-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19icmp: reflect tos through ip cookie rather than updating inet_skWillem de Bruijn1-4/+2
Do not modify socket fields if it can be avoided. The current code predates the introduction of ip cookies in commit aa6615814533 ("ipv4: processing ancillary IP_TOS or IP_TTL"). Now that cookies exist and support tos, update that field directly. Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250214222720.3205500-6-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19ipv4: remove get_rttosWillem de Bruijn3-9/+9
Initialize the ip cookie tos field when initializing the cookie, in ipcm_init_sk. The existing code inverts the standard pattern for initializing cookie fields. Default is to initialize the field from the sk, then possibly overwrite that when parsing cmsgs (the unlikely case). This field inverts that, setting the field to an illegal value and after cmsg parsing checking whether the value is still illegal and thus should be overridden. Be careful to always apply mask INET_DSCP_MASK, as before. Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250214222720.3205500-5-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19net: initialize mark in sockcm_initWillem de Bruijn2-6/+5
Avoid open coding initialization of sockcm fields. Avoid reading the sk_priority field twice. This ensures all callers, existing and future, will correctly try a cmsg passed mark before sk_mark. This patch extends support for cmsg mark to: packet_spkt and packet_tpacket and net/can/raw.c. This patch extends support for cmsg priority to: packet_spkt and packet_tpacket. Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250214222720.3205500-3-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19tcp: only initialize sockcm tsflags fieldWillem de Bruijn1-1/+1
TCP only reads the tsflags field. Don't bother initializing others. Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250214222720.3205500-2-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19net: Remove redundant variable declaration in __dev_change_flags()Breno Leitao1-1/+1
The old_flags variable is declared twice in __dev_change_flags(), causing a shadow variable warning. This patch fixes the issue by removing the redundant declaration, reusing the existing old_flags variable instead. net/core/dev.c:9225:16: warning: declaration shadows a local variable [-Wshadow] 9225 | unsigned int old_flags = dev->flags; | ^ net/core/dev.c:9185:15: note: previous declaration is here 9185 | unsigned int old_flags = dev->flags; | ^ 1 warning generated. Remove the redundant inner declaration and reuse the existing old_flags variable since its value is not needed outside the if block, and it is safe to reuse the variable. This eliminates the warning while maintaining the same functionality. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250217-old_flags-v2-1-4cda3b43a35f@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19netlink: Unset cb_running when terminating dump on releaseSiddh Raman Pant1-0/+1
When we terminated the dump, the callback isn't running, so cb_running should be set to false to be logically consistent. cb_running signifies whether a dump is ongoing. It is set to true in cb->start(), and is checked in netlink_dump() to be true initially. After the dump, it is set to false in the same function. This is just a cleanup, no path should access this field on a closed socket. Signed-off-by: Siddh Raman Pant <siddh.raman.pant@oracle.com> Link: https://patch.msgid.link/aff028e3eb2b768b9895fa6349fa1981ae22f098.camel@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19tcp: adjust rcvq_space after updating scaling ratioJakub Kicinski1-3/+9
Since commit under Fixes we set the window clamp in accordance to newly measured rcvbuf scaling_ratio. If the scaling_ratio decreased significantly we may put ourselves in a situation where windows become smaller than rcvq_space, preventing tcp_rcv_space_adjust() from increasing rcvbuf. The significant decrease of scaling_ratio is far more likely since commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio"), which increased the "default" scaling ratio from ~30% to 50%. Hitting the bad condition depends a lot on TCP tuning, and drivers at play. One of Meta's workloads hits it reliably under following conditions: - default rcvbuf of 125k - sender MTU 1500, receiver MTU 5000 - driver settles on scaling_ratio of 78 for the config above. Initial rcvq_space gets calculated as TCP_INIT_CWND * tp->advmss (10 * 5k = 50k). Once we find out the true scaling ratio and MSS we clamp the windows to 38k. Triggering the condition also depends on the message sequence of this workload. I can't repro the problem with simple iperf or TCP_RR-style tests. Fixes: a2cbb1603943 ("tcp: Update window clamping condition") Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250217232905.3162187-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18trace: tcp: Add tracepoint for tcp_cwnd_reduction()Breno Leitao1-0/+2
Add a lightweight tracepoint to monitor TCP congestion window adjustments via tcp_cwnd_reduction(). This tracepoint enables tracking of: - TCP window size fluctuations - Active socket behavior - Congestion window reduction events Meta has been using BPF programs to monitor this function for years. Adding a proper tracepoint provides a stable API for all users who need to monitor TCP congestion window behavior. Use DECLARE_TRACE instead of TRACE_EVENT to avoid creating trace event infrastructure and exporting to tracefs, keeping the implementation minimal. (Thanks Steven Rostedt) Given that this patch creates a rawtracepoint, you could hook into it using regular tooling, like bpftrace, using regular rawtracepoint infrastructure, such as: rawtracepoint:tcp_cwnd_reduction_tp { .... } Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250214-cwnd_tracepoint-v2-1-ef8d15162d95@debian.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-18page_pool: avoid infinite loop to schedule delayed workerJason Xing1-1/+7
We noticed the kworker in page_pool_release_retry() was waken up repeatedly and infinitely in production because of the buggy driver causing the inflight less than 0 and warning us in page_pool_inflight()[1]. Since the inflight value goes negative, it means we should not expect the whole page_pool to get back to work normally. This patch mitigates the adverse effect by not rescheduling the kworker when detecting the inflight negative in page_pool_release_retry(). [1] [Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------ [Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages ... [Mon Feb 10 20:36:11 2025] Call Trace: [Mon Feb 10 20:36:11 2025] page_pool_release_retry+0x23/0x70 [Mon Feb 10 20:36:11 2025] process_one_work+0x1b1/0x370 [Mon Feb 10 20:36:11 2025] worker_thread+0x37/0x3a0 [Mon Feb 10 20:36:11 2025] kthread+0x11a/0x140 [Mon Feb 10 20:36:11 2025] ? process_one_work+0x370/0x370 [Mon Feb 10 20:36:11 2025] ? __kthread_cancel_work+0x40/0x40 [Mon Feb 10 20:36:11 2025] ret_from_fork+0x35/0x40 [Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]--- Note: before this patch, the above calltrace would flood the dmesg due to repeated reschedule of release_dw kworker. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250214064250.85987-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-18vsock/bpf: Warn on socket without transportMichal Luczaj2-1/+4
In the spirit of commit 91751e248256 ("vsock: prevent null-ptr-deref in vsock_*[has_data|has_space]"), armorize the "impossible" cases with a warning. Fixes: 634f1a7110b4 ("vsock: support sockmap") Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-18sockmap, vsock: For connectible sockets allow only connectedMichal Luczaj1-0/+3
sockmap expects all vsocks to have a transport assigned, which is expressed in vsock_proto::psock_update_sk_prot(). However, there is an edge case where an unconnected (connectible) socket may lose its previously assigned transport. This is handled with a NULL check in the vsock/BPF recv path. Another design detail is that listening vsocks are not supposed to have any transport assigned at all. Which implies they are not supported by the sockmap. But this is complicated by the fact that a socket, before switching to TCP_LISTEN, may have had some transport assigned during a failed connect() attempt. Hence, we may end up with a listening vsock in a sockmap, which blows up quickly: KASAN: null-ptr-deref in range [0x0000000000000120-0x0000000000000127] CPU: 7 UID: 0 PID: 56 Comm: kworker/7:0 Not tainted 6.14.0-rc1+ Workqueue: vsock-loopback vsock_loopback_work RIP: 0010:vsock_read_skb+0x4b/0x90 Call Trace: sk_psock_verdict_data_ready+0xa4/0x2e0 virtio_transport_recv_pkt+0x1ca8/0x2acc vsock_loopback_work+0x27d/0x3f0 process_one_work+0x846/0x1420 worker_thread+0x5b3/0xf80 kthread+0x35a/0x700 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 For connectible sockets, instead of relying solely on the state of vsk->transport, tell sockmap to only allow those representing established connections. This aligns with the behaviour for AF_INET and AF_UNIX. Fixes: 634f1a7110b4 ("vsock: support sockmap") Signed-off-by: Michal Luczaj <mhal@rbox.co> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-18xfrm: Switch to use hrtimer_setup()Nam Cao2-6/+4
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/d338b246f087ee2b2a305348c896449e107a7ff4.1738746872.git.namcao@linutronix.de
2025-02-18can: Switch to use hrtimer_setup()Nam Cao4-26/+16
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Most of this patch is generated by Coccinelle. Except for the TX thrtimer in bcm_tx_setup() because this timer is not used and the callback function is never set. For this particular case, set the callback to hrtimer_dummy_timeout() Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Kleine-Budde <mkl@pengutronix.de> Link: https://lore.kernel.org/all/a3a6be42c818722ad41758457408a32163bfd9a0.1738746872.git.namcao@linutronix.de
2025-02-18tcp: Switch to use hrtimer_setup()Nam Cao1-6/+4
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/a16c227cc6882d8aecf658e6a7e38b74e7fd7573.1738746872.git.namcao@linutronix.de
2025-02-18mac802154: Switch to use hrtimer_setup()Nam Cao1-2/+2
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/5a89d86092d11a1d0cdfdfaa33990aaf7f1eec87.1738746872.git.namcao@linutronix.de
2025-02-18net/sched: Switch to use hrtimer_setup()Nam Cao3-8/+4
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/f1d4843589dd924743b35de1f0920c4a4c43be01.1738746872.git.namcao@linutronix.de
2025-02-18netdev: Switch to use hrtimer_setup()Nam Cao1-2/+1
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/861d9b287f2b5206b853dbcc0f302127f2a7f8ac.1738746872.git.namcao@linutronix.de
2025-02-18net: move stale comment about ntuple validationJakub Kicinski1-1/+3
Gal points out that the comment now belongs further down, since the original if condition was split into two in commit de7f7582dff2 ("net: ethtool: prevent flow steering to RSS contexts which don't exist") Link: https://lore.kernel.org/de4a2a8a-1eb9-4fa8-af87-7526e58218e9@nvidia.com Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250214224340.2268691-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18netdev-genl: Add an XSK attribute to queuesJoe Damato1-0/+12
Expose a new per-queue nest attribute, xsk, which will be present for queues that are being used for AF_XDP. If the queue is not being used for AF_XDP, the nest will not be present. In the future, this attribute can be extended to include more data about XSK as it is needed. Signed-off-by: Joe Damato <jdamato@fastly.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250214211255.14194-3-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18net: use napi_id_valid helperStefano Jordhani5-7/+7
In commit 6597e8d35851 ("netdev-genl: Elide napi_id when not present"), napi_id_valid function was added. Use the helper to refactor open-coded checks in the source. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Stefano Jordhani <sjordhani@gmail.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> # for iouring Link: https://patch.msgid.link/20250214181801.931-1-sjordhani@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18drop_monitor: fix incorrect initialization orderGavrilov Ilia1-15/+14
Syzkaller reports the following bug: BUG: spinlock bad magic on CPU#1, syz-executor.0/7995 lock: 0xffff88805303f3e0, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 CPU: 1 PID: 7995 Comm: syz-executor.0 Tainted: G E 5.10.209+ #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x119/0x179 lib/dump_stack.c:118 debug_spin_lock_before kernel/locking/spinlock_debug.c:83 [inline] do_raw_spin_lock+0x1f6/0x270 kernel/locking/spinlock_debug.c:112 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:117 [inline] _raw_spin_lock_irqsave+0x50/0x70 kernel/locking/spinlock.c:159 reset_per_cpu_data+0xe6/0x240 [drop_monitor] net_dm_cmd_trace+0x43d/0x17a0 [drop_monitor] genl_family_rcv_msg_doit+0x22f/0x330 net/netlink/genetlink.c:739 genl_family_rcv_msg net/netlink/genetlink.c:783 [inline] genl_rcv_msg+0x341/0x5a0 net/netlink/genetlink.c:800 netlink_rcv_skb+0x14d/0x440 net/netlink/af_netlink.c:2497 genl_rcv+0x29/0x40 net/netlink/genetlink.c:811 netlink_unicast_kernel net/netlink/af_netlink.c:1322 [inline] netlink_unicast+0x54b/0x800 net/netlink/af_netlink.c:1348 netlink_sendmsg+0x914/0xe00 net/netlink/af_netlink.c:1916 sock_sendmsg_nosec net/socket.c:651 [inline] __sock_sendmsg+0x157/0x190 net/socket.c:663 ____sys_sendmsg+0x712/0x870 net/socket.c:2378 ___sys_sendmsg+0xf8/0x170 net/socket.c:2432 __sys_sendmsg+0xea/0x1b0 net/socket.c:2461 do_syscall_64+0x30/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x62/0xc7 RIP: 0033:0x7f3f9815aee9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f3f972bf0c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f3f9826d050 RCX: 00007f3f9815aee9 RDX: 0000000020000000 RSI: 0000000020001300 RDI: 0000000000000007 RBP: 00007f3f981b63bd R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000006e R14: 00007f3f9826d050 R15: 00007ffe01ee6768 If drop_monitor is built as a kernel module, syzkaller may have time to send a netlink NET_DM_CMD_START message during the module loading. This will call the net_dm_monitor_start() function that uses a spinlock that has not yet been initialized. To fix this, let's place resource initialization above the registration of a generic netlink family. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 9a8afc8d3962 ("Network Drop Monitor: Adding drop monitor implementation & Netlink protocol") Cc: stable@vger.kernel.org Signed-off-by: Ilia Gavrilov <Ilia.Gavrilov@infotecs.ru> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250213152054.2785669-1-Ilia.Gavrilov@infotecs.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-15ndisc: ndisc_send_redirect() cleanupEric Dumazet1-5/+3
ndisc_send_redirect() is always called under rcu_read_lock(). It can use dev_net_rcu() and avoid one redundant rcu_read_lock()/rcu_read_unlock() pair. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250214140705.2105890-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-15net/sched: cls_api: fix error handling causing NULL dereferencePierre Riteau1-1/+1
tcf_exts_miss_cookie_base_alloc() calls xa_alloc_cyclic() which can return 1 if the allocation succeeded after wrapping. This was treated as an error, with value 1 returned to caller tcf_exts_init_ex() which sets exts->actions to NULL and returns 1 to caller fl_change(). fl_change() treats err == 1 as success, calling tcf_exts_validate_ex() which calls tcf_action_init() with exts->actions as argument, where it is dereferenced. Example trace: BUG: kernel NULL pointer dereference, address: 0000000000000000 CPU: 114 PID: 16151 Comm: handler114 Kdump: loaded Not tainted 5.14.0-503.16.1.el9_5.x86_64 #1 RIP: 0010:tcf_action_init+0x1f8/0x2c0 Call Trace: tcf_action_init+0x1f8/0x2c0 tcf_exts_validate_ex+0x175/0x190 fl_change+0x537/0x1120 [cls_flower] Fixes: 80cd22c35c90 ("net/sched: cls_api: Support hardware miss to tc action") Signed-off-by: Pierre Riteau <pierre@stackhpc.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250213223610.320278-1-pierre@stackhpc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-15vsock/virtio: fix variables initialization during resumingJunnan Wu1-3/+7
When executing suspend to ram twice in a row, the `rx_buf_nr` and `rx_buf_max_nr` increase to three times vq->num_free. Then after virtqueue_get_buf and `rx_buf_nr` decreased in function virtio_transport_rx_work, the condition to fill rx buffer (rx_buf_nr < rx_buf_max_nr / 2) will never be met. It is because that `rx_buf_nr` and `rx_buf_max_nr` are initialized only in virtio_vsock_probe(), but they should be reset whenever virtqueues are recreated, like after a suspend/resume. Move the `rx_buf_nr` and `rx_buf_max_nr` initialization in virtio_vsock_vqs_init(), so we are sure that they are properly initialized, every time we initialize the virtqueues, either when we load the driver or after a suspend/resume. To prevent erroneous atomic load operations on the `queued_replies` in the virtio_transport_send_pkt_work() function which may disrupt the scheduling of vsock->rx_work when transmitting reply-required socket packets, this atomic variable must undergo synchronized initialization alongside the preceding two variables after a suspend/resume. Fixes: bd50c5dc182b ("vsock/virtio: add support for device suspend/resume") Link: https://lore.kernel.org/virtualization/20250207052033.2222629-1-junnan01.wu@samsung.com/ Co-developed-by: Ying Gao <ying01.gao@samsung.com> Signed-off-by: Ying Gao <ying01.gao@samsung.com> Signed-off-by: Junnan Wu <junnan01.wu@samsung.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250214012200.1883896-1-junnan01.wu@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-15inet: consolidate inet_csk_clone_lock()Eric Dumazet5-23/+25
Current inet_sock_set_state trace from inet_csk_clone_lock() is missing many details : ... sock:inet_sock_set_state: family=AF_INET6 protocol=IPPROTO_TCP \ sport=4901 dport=0 \ saddr=127.0.0.6 daddr=0.0.0.0 \ saddrv6=:: daddrv6=:: \ oldstate=TCP_LISTEN newstate=TCP_SYN_RECV Only the sport gives the listener port, no other parts of the n-tuple are correct. In this patch, I initialize relevant fields of the new socket before calling inet_sk_set_state(newsk, TCP_SYN_RECV). We now have a trace including all the source/destination bits. ... sock:inet_sock_set_state: family=AF_INET6 protocol=IPPROTO_TCP \ sport=4901 dport=47648 \ saddr=127.0.0.6 daddr=127.0.0.6 \ saddrv6=2002:a05:6830:1f85:: daddrv6=2001:4860:f803:65::3 \ oldstate=TCP_LISTEN newstate=TCP_SYN_RECV Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250212131328.1514243-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-15inet: reduce inet_csk_clone_lock() indent levelEric Dumazet1-23/+27
Return early from inet_csk_clone_lock() if the socket allocation failed, to reduce the indentation level. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250212131328.1514243-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-15udp: use EXPORT_IPV6_MOD[_GPL]()Eric Dumazet1-32/+31
Use EXPORT_IPV6_MOD[_GPL]() for symbols that don't need to be exported unless CONFIG_IPV6=m udp_table is no longer used from any modules, and does not need to be exported anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Link: https://patch.msgid.link/20250212132418.1524422-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-15tcp: use EXPORT_IPV6_MOD[_GPL]()Eric Dumazet9-73/+71
Use EXPORT_IPV6_MOD[_GPL]() for symbols that don't need to be exported unless CONFIG_IPV6=m tcp_hashinfo and tcp_openreq_init_rwin() are no longer used from any module anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Link: https://patch.msgid.link/20250212132418.1524422-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-15inetpeer: use EXPORT_IPV6_MOD[_GPL]()Eric Dumazet1-4/+4
Use EXPORT_IPV6_MOD[_GPL]() for symbols that do not need to to be exported unless CONFIG_IPV6=m Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Link: https://patch.msgid.link/20250212132418.1524422-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski38-335/+463
Cross-merge networking fixes after downstream PR (net-6.14-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13Merge tag 'net-6.14-rc3' of ↵Linus Torvalds36-333/+457
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, wireless and bluetooth. Kalle Valo steps down after serving as the WiFi driver maintainer for over a decade. Current release - fix to a fix: - vsock: orphan socket after transport release, avoid null-deref - Bluetooth: L2CAP: fix corrupted list in hci_chan_del Current release - regressions: - eth: - stmmac: correct Rx buffer layout when SPH is enabled - iavf: fix a locking bug in an error path - rxrpc: fix alteration of headers whilst zerocopy pending - s390/qeth: move netif_napi_add_tx() and napi_enable() from under BH - Revert "netfilter: flowtable: teardown flow if cached mtu is stale" Current release - new code bugs: - rxrpc: fix ipv6 path MTU discovery, only ipv4 worked - pse-pd: fix deadlock in current limit functions Previous releases - regressions: - rtnetlink: fix netns refleak with rtnl_setlink() - wifi: brcmfmac: use random seed flag for BCM4355 and BCM4364 firmware Previous releases - always broken: - add missing RCU protection of struct net throughout the stack - can: rockchip: bail out if skb cannot be allocated - eth: ti: am65-cpsw: base XDP support fixes Misc: - ethtool: tsconfig: update the format of hwtstamp flags, changes the uAPI but this uAPI was not in any release yet" * tag 'net-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) net: pse-pd: Fix deadlock in current limit functions rxrpc: Fix ipv6 path MTU discovery Reapply "net: skb: introduce and use a single page frag cache" s390/qeth: move netif_napi_add_tx() and napi_enable() from under BH mlxsw: Add return value check for mlxsw_sp_port_get_stats_raw() ipv6: mcast: add RCU protection to mld_newpack() team: better TEAM_OPTION_TYPE_STRING validation Bluetooth: L2CAP: Fix corrupted list in hci_chan_del Bluetooth: btintel_pcie: Fix a potential race condition Bluetooth: L2CAP: Fix slab-use-after-free Read in l2cap_send_cmd net: ethernet: ti: am65_cpsw: fix tx_cleanup for XDP case net: ethernet: ti: am65-cpsw: fix RX & TX statistics for XDP_TX case net: ethernet: ti: am65-cpsw: fix memleak in certain XDP cases vsock/test: Add test for SO_LINGER null ptr deref vsock: Orphan socket after transport release MAINTAINERS: Add sctp headers to the general netdev entry Revert "netfilter: flowtable: teardown flow if cached mtu is stale" iavf: Fix a locking bug in an error path rxrpc: Fix alteration of headers whilst zerocopy pending net: phylink: make configuring clock-stop dependent on MAC support ...
2025-02-13rxrpc: Fix ipv6 path MTU discoveryDavid Howells1-0/+7
rxrpc path MTU discovery currently only makes use of ICMPv4, but not ICMPv6, which means that pmtud for IPv6 doesn't work correctly. Fix it to check for ICMPv6 messages also. Fixes: eeaedc5449d9 ("rxrpc: Implement path-MTU probing using padded PING ACKs (RFC8899)") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/3517283.1739359284@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13Merge tag 'for-net-2025-02-13' of ↵Jakub Kicinski2-98/+86
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - btintel_pcie: Fix a potential race condition - L2CAP: Fix slab-use-after-free Read in l2cap_send_cmd - L2CAP: Fix corrupted list in hci_chan_del * tag 'for-net-2025-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: L2CAP: Fix corrupted list in hci_chan_del Bluetooth: btintel_pcie: Fix a potential race condition Bluetooth: L2CAP: Fix slab-use-after-free Read in l2cap_send_cmd ==================== Link: https://patch.msgid.link/20250213162446.617632-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13Merge tag 'nf-25-02-13' of ↵Jakub Kicinski1-6/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains one revert for: 1) Revert flowtable entry teardown cycle when skbuff exceeds mtu to deal with DF flag unset scenarios. This is reverts a patch coming in the previous merge window (available in 6.14-rc releases). * tag 'nf-25-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: Revert "netfilter: flowtable: teardown flow if cached mtu is stale" ==================== Link: https://patch.msgid.link/20250213100502.3983-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13Reapply "net: skb: introduce and use a single page frag cache"Jakub Kicinski2-22/+98
This reverts commit 011b0335903832facca86cd8ed05d7d8d94c9c76. Sabrina reports that the revert may trigger warnings due to intervening changes, especially the ability to rise MAX_SKB_FRAGS. Let's drop it and revisit once that part is also ironed out. Fixes: 011b03359038 ("Revert "net: skb: introduce and use a single page frag cache"") Reported-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/6bf54579233038bc0e76056c5ea459872ce362ab.1739375933.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13ipv6: mcast: add RCU protection to mld_newpack()Eric Dumazet1-4/+10
mld_newpack() can be called without RTNL or RCU being held. Note that we no longer can use sock_alloc_send_skb() because ipv6.igmp_sk uses GFP_KERNEL allocations which can sleep. Instead use alloc_skb() and charge the net->ipv6.igmp_sk socket under RCU protection. Fixes: b8ad0cbc58f7 ("[NETNS][IPV6] mcast - handle several network namespace") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250212141021.1663666-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>