summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2026-02-19net/mlx5: Fix misidentification of write combining CQE during poll loopGal Pressman1-9/+5
The write combining completion poll loop uses usleep_range() which can sleep much longer than requested due to scheduler latency. Under load, we witnessed a 20ms+ delay until the process was rescheduled, causing the jiffies based timeout to expire while the thread is sleeping. The original do-while loop structure (poll, sleep, check timeout) would exit without a final poll when waking after timeout, missing a CQE that arrived during sleep. Instead of the open-coded while loop, use the kernel's poll_timeout_us() which always performs an additional check after the sleep expiration, and is less error-prone. Note: poll_timeout_us() doesn't accept a sleep range, by passing 10 sleep_us the sleep range effectively changes from 2-10 to 3-10 usecs. Fixes: d98995b4bf98 ("net/mlx5: Reimplement write combining test") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <Jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260218072904.1764634-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19net/mlx5e: Fix misidentification of ASO CQE during poll loopGal Pressman2-14/+6
The ASO completion poll loop uses usleep_range() which can sleep much longer than requested due to scheduler latency. Under load, we witnessed a 20ms+ delay until the process was rescheduled, causing the jiffies based timeout to expire while the thread is sleeping. The original do-while loop structure (poll, sleep, check timeout) would exit without a final poll when waking after timeout, missing a CQE that arrived during sleep. Instead of the open-coded while loop, use the kernel's read_poll_timeout() which always performs an additional check after the sleep expiration, and is less error-prone. Note: read_poll_timeout() doesn't accept a sleep range, by passing 10 sleep_us the sleep range effectively changes from 2-10 to 3-10 usecs. Fixes: 739cfa34518e ("net/mlx5: Make ASO poll CQ usable in atomic context") Fixes: 7e3fce82d945 ("net/mlx5e: Overcome slow response for first macsec ASO WQE") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <Jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260218072904.1764634-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19net/mlx5: Fix multiport device check over light SFsShay Drory1-2/+2
Driver is using num_vhca_ports capability to distinguish between multiport master device and multiport slave device. num_vhca_ports is a capability the driver sets according to the MAX num_vhca_ports capability reported by FW. On the other hand, light SFs doesn't set the above capbility. This leads to wrong results whenever light SFs is checking whether he is a multiport master or slave. Therefore, use the MAX capability to distinguish between master and slave devices. Fixes: e71383fb9cd1 ("net/mlx5: Light probe local SFs") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <Jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260218072904.1764634-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19bonding: alb: fix UAF in rlb_arp_recv during bond up/downHangbin Liu1-1/+5
The ALB RX path may access rx_hashtbl concurrently with bond teardown. During rapid bond up/down cycles, rlb_deinitialize() frees rx_hashtbl while RX handlers are still running, leading to a null pointer dereference detected by KASAN. However, the root cause is that rlb_arp_recv() can still be accessed after setting recv_probe to NULL, which is actually a use-after-free (UAF) issue. That is the reason for using the referenced commit in the Fixes tag. [ 214.174138] Oops: general protection fault, probably for non-canonical address 0xdffffc000000001d: 0000 [#1] SMP KASAN PTI [ 214.186478] KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef] [ 214.194933] CPU: 30 UID: 0 PID: 2375 Comm: ping Kdump: loaded Not tainted 6.19.0-rc8+ #2 PREEMPT(voluntary) [ 214.205907] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.14.0 01/14/2022 [ 214.214357] RIP: 0010:rlb_arp_recv+0x505/0xab0 [bonding] [ 214.220320] Code: 0f 85 2b 05 00 00 48 b8 00 00 00 00 00 fc ff df 40 0f b6 ed 48 c1 e5 06 49 03 ad 78 01 00 00 48 8d 7d 28 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 06 0f 8e 12 05 00 00 80 7d 28 00 0f 84 8c 00 [ 214.241280] RSP: 0018:ffffc900073d8870 EFLAGS: 00010206 [ 214.247116] RAX: dffffc0000000000 RBX: ffff888168556822 RCX: ffff88816855681e [ 214.255082] RDX: 000000000000001d RSI: dffffc0000000000 RDI: 00000000000000e8 [ 214.263048] RBP: 00000000000000c0 R08: 0000000000000002 R09: ffffed11192021c8 [ 214.271013] R10: ffff8888c9010e43 R11: 0000000000000001 R12: 1ffff92000e7b119 [ 214.278978] R13: ffff8888c9010e00 R14: ffff888168556822 R15: ffff888168556810 [ 214.286943] FS: 00007f85d2d9cb80(0000) GS:ffff88886ccb3000(0000) knlGS:0000000000000000 [ 214.295966] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 214.302380] CR2: 00007f0d047b5e34 CR3: 00000008a1c2e002 CR4: 00000000001726f0 [ 214.310347] Call Trace: [ 214.313070] <IRQ> [ 214.315318] ? __pfx_rlb_arp_recv+0x10/0x10 [bonding] [ 214.320975] bond_handle_frame+0x166/0xb60 [bonding] [ 214.326537] ? __pfx_bond_handle_frame+0x10/0x10 [bonding] [ 214.332680] __netif_receive_skb_core.constprop.0+0x576/0x2710 [ 214.339199] ? __pfx_arp_process+0x10/0x10 [ 214.343775] ? sched_balance_find_src_group+0x98/0x630 [ 214.349513] ? __pfx___netif_receive_skb_core.constprop.0+0x10/0x10 [ 214.356513] ? arp_rcv+0x307/0x690 [ 214.360311] ? __pfx_arp_rcv+0x10/0x10 [ 214.364499] ? __lock_acquire+0x58c/0xbd0 [ 214.368975] __netif_receive_skb_one_core+0xae/0x1b0 [ 214.374518] ? __pfx___netif_receive_skb_one_core+0x10/0x10 [ 214.380743] ? lock_acquire+0x10b/0x140 [ 214.385026] process_backlog+0x3f1/0x13a0 [ 214.389502] ? process_backlog+0x3aa/0x13a0 [ 214.394174] __napi_poll.constprop.0+0x9f/0x370 [ 214.399233] net_rx_action+0x8c1/0xe60 [ 214.403423] ? __pfx_net_rx_action+0x10/0x10 [ 214.408193] ? lock_acquire.part.0+0xbd/0x260 [ 214.413058] ? sched_clock_cpu+0x6c/0x540 [ 214.417540] ? mark_held_locks+0x40/0x70 [ 214.421920] handle_softirqs+0x1fd/0x860 [ 214.426302] ? __pfx_handle_softirqs+0x10/0x10 [ 214.431264] ? __neigh_event_send+0x2d6/0xf50 [ 214.436131] do_softirq+0xb1/0xf0 [ 214.439830] </IRQ> The issue is reproducible by repeatedly running ip link set bond0 up/down while receiving ARP messages, where rlb_arp_recv() can race with rlb_deinitialize() and dereference a freed rx_hashtbl entry. Fix this by setting recv_probe to NULL and then calling synchronize_net() to wait for any concurrent RX processing to finish. This ensures that no RX handler can access rx_hashtbl after it is freed in bond_alb_deinitialize(). Reported-by: Liang Li <liali@redhat.com> Fixes: 3aba891dde38 ("bonding: move processing of recv handlers into handle_frame()") Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Jay Vosburgh <jv@jvosburgh.net> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20260218060919.101574-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19bnge: fix reserving resources from FWVikas Gupta1-1/+1
HWRM_FUNC_CFG is used to reserve resources, whereas HWRM_FUNC_QCFG is intended for querying resource information from the firmware. Since __bnge_hwrm_reserve_pf_rings() reserves resources for a specific PF, the command type should be HWRM_FUNC_CFG. Fixes: 627c67f038d2 ("bng_en: Add resource management support") Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260218052755.4097468-1-vikas.gupta@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19eth: fbnic: Advertise supported XDP features.Dimitri Daskalakis1-0/+2
Drivers are supposed to advertise the XDP features they support. This was missed while adding XDP support. Before: $ ynl --family netdev --dump dev-get ... {'ifindex': 3, 'xdp-features': set(), 'xdp-rx-metadata-features': set(), 'xsk-features': set()}, ... After: $ ynl --family netdev --dump dev-get ... {'ifindex': 3, 'xdp-features': {'basic', 'rx-sg'}, 'xdp-rx-metadata-features': set(), 'xsk-features': set()}, ... Fixes: 168deb7b31b2 ("eth: fbnic: Add support for XDP_TX action") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260218030620.3329608-1-dimitri.daskalakis1@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-19rds: tcp: fix uninit-value in __inet_bindTabrez Ahmed1-1/+1
KMSAN reported an uninit-value access in __inet_bind() when binding an RDS TCP socket. The uninitialized memory originates from rds_tcp_conn_alloc(), which uses kmem_cache_alloc() to allocate the rds_tcp_connection structure. Specifically, the field 't_client_port_group' is incremented in rds_tcp_conn_path_connect() without being initialized first: if (++tc->t_client_port_group >= port_groups) Since kmem_cache_alloc() does not zero the memory, this field contains garbage, leading to the KMSAN report. Fix this by using kmem_cache_zalloc() to ensure the structure is zero-initialized upon allocation. Reported-by: syzbot+aae646f09192f72a68dc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=aae646f09192f72a68dc Tested-by: syzbot+aae646f09192f72a68dc@syzkaller.appspotmail.com Fixes: a20a6992558f ("net/rds: Encode cp_index in TCP source port") Signed-off-by: Tabrez Ahmed <tabreztalks@gmail.com> Reviewed-by: Charalampos Mitrodimas <charmitro@posteo.net> Reviewed-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260217135350.33641-1-tabreztalks@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-19net/rds: Fix NULL pointer dereference in rds_tcp_accept_oneAllison Henderson1-3/+17
Save a local pointer to new_sock->sk and hold a reference before installing callbacks in rds_tcp_accept_one. After rds_tcp_set_callbacks() or rds_tcp_reset_callbacks(), tc->t_sock is set to new_sock which may race with the shutdown path. A concurrent rds_tcp_conn_path_shutdown() may call sock_release(), which sets new_sock->sk = NULL and may eventually free sk when the refcount reaches zero. Subsequent accesses to new_sock->sk->sk_state would dereference NULL, causing the crash. The fix saves a local sk pointer before callbacks are installed so that sk_state can be accessed safely even after new_sock->sk is nulled, and uses sock_hold()/sock_put() to ensure sk itself remains valid for the duration. Fixes: 826c1004d4ae ("net/rds: rds_tcp_conn_path_shutdown must not discard messages") Reported-by: syzbot+96046021045ffe6d7709@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=96046021045ffe6d7709 Signed-off-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260216222643.2391390-1-achender@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-19octeontx2-af: Fix default entries mcam entry actionHariprasad Kelam1-19/+22
As per design, AF should update the default MCAM action only when mcam_index is -1. A bug in the previous patch caused default entries to be changed even when the request was not for them. Fixes: 570ba37898ec ("octeontx2-af: Update RSS algorithm index") Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260216090338.1318976-1-hkelam@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19Merge tag 'nf-26-02-17' of ↵Jakub Kicinski20-303/+166
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== netfilter: updates for net The following patchset contains Netfilter fixes for *net*: 1) Add missing __rcu annotations to NAT helper hook pointers in Amanda, FTP, IRC, SNMP and TFTP helpers. From Sun Jian. 2-4): - Add global spinlock to serialize nft_counter fetch+reset operations. - Use atomic64_xchg() for nft_quota reset instead of read+subtract pattern. Note AI review detects a race in this change but it isn't new. The 'racing' bit only exists to prevent constant stream of 'quota expired' notifications. - Revert commit_mutex usage in nf_tables reset path, it caused circular lock dependency. All from Brian Witte. 5) Fix uninitialized l3num value in nf_conntrack_h323 helper. 6) Fix musl libc compatibility in netfilter_bridge.h UAPI header. This change isn't nice (UAPI headers should not include libc headers), but as-is musl builds may fail due to redefinition of struct ethhdr. 7) Fix protocol checksum validation in IPVS for IPv6 with extension headers, from Julian Anastasov. 8) Fix device reference leak in IPVS when netdev goes down. Also from Julian. 9) Remove WARN_ON_ONCE when accessing forward path array, this can trigger with sufficiently long forward paths. From Pablo Neira Ayuso. 10) Fix use-after-free in nf_tables_addchain() error path, from Inseo An. * tag 'nf-26-02-17' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: fix use-after-free in nf_tables_addchain() net: remove WARN_ON_ONCE when accessing forward path array ipvs: do not keep dest_dst if dev is going down ipvs: skip ipv6 extension headers for csum checks include: uapi: netfilter_bridge.h: Cover for musl libc netfilter: nf_conntrack_h323: don't pass uninitialised l3num value netfilter: nf_tables: revert commit_mutex usage in reset path netfilter: nft_quota: use atomic64_xchg for reset netfilter: nft_counter: serialize reset with spinlock netfilter: annotate NAT helper hook pointers with __rcu ==================== Link: https://patch.msgid.link/20260217163233.31455-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19net/mlx5e: XSK, Fix unintended ICOSQ changeTariq Toukan4-10/+21
XSK wakeup must use the async ICOSQ (with proper locking), as it is not guaranteed to run on the same CPU as the channel. The commit that converted the NAPI trigger path to use the sync ICOSQ incorrectly applied the same change to XSK, causing XSK wakeups to use the sync ICOSQ as well. Revert XSK flows to use the async ICOSQ. XDP program attach/detach triggers channel reopen, while XSK pool enable/disable can happen on-the-fly via NDOs without reopening channels. As a result, xsk_pool state cannot be reliably used at mlx5e_open_channel() time to decide whether an async ICOSQ is needed. Update the async_icosq_needed logic to depend on the presence of an XDP program rather than the xsk_pool, ensuring the async ICOSQ is available when XSK wakeups are enabled. This fixes multiple issues: 1. Illegal synchronize_rcu() in an RCU read- side critical section via mlx5e_xsk_wakeup() -> mlx5e_trigger_napi_icosq() -> synchronize_net(). The stack holds RCU read-lock in xsk_poll(). 2. Hitting a NULL pointer dereference in mlx5e_xsk_wakeup(): [] BUG: kernel NULL pointer dereference, address: 0000000000000240 [] #PF: supervisor read access in kernel mode [] #PF: error_code(0x0000) - not-present page [] PGD 0 P4D 0 [] Oops: Oops: 0000 [#1] SMP [] CPU: 0 UID: 0 PID: 2255 Comm: qemu-system-x86 Not tainted 6.19.0-rc5+ #229 PREEMPT(none) [] Hardware name: [...] [] RIP: 0010:mlx5e_xsk_wakeup+0x53/0x90 [mlx5_core] Reported-by: Daniel Borkmann <daniel@iogearbox.net> Closes: https://lore.kernel.org/all/20260123223916.361295-1-daniel@iogearbox.net/ Fixes: 56aca3e0f730 ("net/mlx5e: Use regular ICOSQ for triggering NAPI") Tested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Acked-by: Alice Mikityanska <alice.kernel@fastmail.im> Link: https://patch.msgid.link/20260217074525.1761454-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19Merge branch 'icmp-better-deal-with-ddos'Jakub Kicinski5-19/+31
Eric Dumazet says: ==================== icmp: better deal with DDOS When dealing with death of big UDP servers, admins might want to increase net.ipv4.icmp_msgs_per_sec and net.ipv4.icmp_msgs_burst to big values (2,000,000 or more). They also might need to tune the per-host ratelimit to 1ms or 0ms in favor of the global rate limit. This series fixes bugs showing up in all these needs. ==================== Link: https://patch.msgid.link/20260216142832.3834174-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zeroEric Dumazet1-2/+6
If net.ipv6.icmp.ratelimit is zero we do not have to call inet_getpeer_v6() and inet_peer_xrlim_allow(). Both can be very expensive under DDOS. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216142832.3834174-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zeroEric Dumazet1-4/+10
If net.ipv4.icmp_ratelimit is zero, we do not have to call inet_getpeer_v4() and inet_peer_xrlim_allow(). Both can be very expensive under DDOS. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216142832.3834174-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow()Eric Dumazet3-10/+6
Following part was needed before the blamed commit, because inet_getpeer_v6() second argument was the prefix. /* Give more bandwidth to wider prefixes. */ if (rt->rt6i_dst.plen < 128) tmo >>= ((128 - rt->rt6i_dst.plen)>>5); Now inet_getpeer_v6() retrieves hosts, we need to remove @tmo adjustement or wider prefixes likes /24 allow 8x more ICMP to be sent for a given ratelimit. As we had this issue for a while, this patch changes net.ipv6.icmp.ratelimit default value from 1000ms to 100ms to avoid potential regressions. Also add a READ_ONCE() when reading net->ipv6.sysctl.icmpv6_time. Fixes: fd0273d7939f ("ipv6: Remove external dependency on rt6i_dst and rt6i_src") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260216142832.3834174-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19inet: move icmp_global_{credit,stamp} to a separate cache lineEric Dumazet1-2/+7
icmp_global_credit was meant to be changed ~1000 times per second, but if an admin sets net.ipv4.icmp_msgs_per_sec to a very high value, icmp_global_credit changes can inflict false sharing to surrounding fields that are read mostly. Move icmp_global_credit and icmp_global_stamp to a separate cacheline aligned group. Fixes: b056b4cd9178 ("icmp: move icmp_global.credit and icmp_global.stamp to per netns storage") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216142832.3834174-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19icmp: prevent possible overflow in icmp_global_allow()Eric Dumazet1-1/+2
Following expression can overflow if sysctl_icmp_msgs_per_sec is big enough. sysctl_icmp_msgs_per_sec * delta / HZ; Fixes: 4cdf507d5452 ("icmp: add a global rate limitation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216142832.3834174-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19selftests/net: packetdrill: add ipv4-mapped-ipv6 testsEric Dumazet1-1/+10
Add ipv4-mapped-ipv6 case to ksft_runner.sh before an upcoming TCP fix in this area. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260217142924.1853498-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18eth: fbnic: Add validation for MTU changesDimitri Daskalakis1-0/+18
Increasing the MTU beyond the HDS threshold causes the hardware to fragment packets across multiple buffers. If a single-buffer XDP program is attached, the driver will drop all multi-frag frames. While we can't prevent a remote sender from sending non-TCP packets larger than the MTU, this will prevent users from inadvertently breaking new TCP streams. Traditionally, drivers supported XDP with MTU less than 4Kb (packet per page). Fbnic currently prevents attaching XDP when MTU is too high. But it does not prevent increasing MTU after XDP is attached. Fixes: 1b0a3950dbd4 ("eth: fbnic: Add XDP pass, drop, abort support") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2026-02-18net/sched: act_skbedit: fix divide-by-zero in tcf_skbedit_hash()Ruitong Liu1-1/+5
Commit 38a6f0865796 ("net: sched: support hash selecting tx queue") added SKBEDIT_F_TXQ_SKBHASH support. The inclusive range size is computed as: mapping_mod = queue_mapping_max - queue_mapping + 1; The range size can be 65536 when the requested range covers all possible u16 queue IDs (e.g. queue_mapping=0 and queue_mapping_max=U16_MAX). That value cannot be represented in a u16 and previously wrapped to 0, so tcf_skbedit_hash() could trigger a divide-by-zero: queue_mapping += skb_get_hash(skb) % params->mapping_mod; Compute mapping_mod in a wider type and reject ranges larger than U16_MAX to prevent params->mapping_mod from becoming 0 and avoid the crash. Fixes: 38a6f0865796 ("net: sched: support hash selecting tx queue") Cc: stable@vger.kernel.org # 6.12+ Signed-off-by: Ruitong Liu <cnitlrt@gmail.com> Link: https://patch.msgid.link/20260213175948.1505257-1-cnitlrt@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18vsock: document namespace mode sysctlsStefano Garzarella1-2/+50
Add documentation for the vsock per-namespace sysctls (`ns_mode` and `child_ns_mode`) to Documentation/admin-guide/sysctl/net.rst. These sysctls were introduced by commit eafb64f40ca4 ("vsock: add netns to vsock core"). Document the two namespace modes (`global` and `local`), the inheritance behavior of `child_ns_mode`, and the restriction preventing local namespaces from setting `child_ns_mode` to `global`. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20260216163147.236844-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18net: ethernet: ec_bhf: Fix dma_free_coherent() dma handleThomas Fourier1-1/+1
dma_free_coherent() in error path takes priv->rx_buf.alloc_len as the dma handle. This would lead to improper unmapping of the buffer. Change the dma handle to priv->rx_buf.alloc_phys. Fixes: 6af55ff52b02 ("Driver for Beckhoff CX5020 EtherCAT master module.") Cc: <stable@vger.kernel.org> Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Link: https://patch.msgid.link/20260213164340.77272-2-fourier.thomas@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18macvlan: observe an RCU grace period in macvlan_common_newlink() error pathEric Dumazet1-0/+5
valis reported that a race condition still happens after my prior patch. macvlan_common_newlink() might have made @dev visible before detecting an error, and its caller will directly call free_netdev(dev). We must respect an RCU period, either in macvlan or the core networking stack. After adding a temporary mdelay(1000) in macvlan_forward_source_one() to open the race window, valis repro was: ip link add p1 type veth peer p2 ip link set address 00:00:00:00:00:20 dev p1 ip link set up dev p1 ip link set up dev p2 ip link add mv0 link p2 type macvlan mode source (ip link add invalid% link p2 type macvlan mode source macaddr add 00:00:00:00:00:20 &) ; sleep 0.5 ; ping -c1 -I p1 1.2.3.4 PING 1.2.3.4 (1.2.3.4): 56 data bytes RTNETLINK answers: Invalid argument BUG: KASAN: slab-use-after-free in macvlan_forward_source (drivers/net/macvlan.c:408 drivers/net/macvlan.c:444) Read of size 8 at addr ffff888016bb89c0 by task e/175 CPU: 1 UID: 1000 PID: 175 Comm: e Not tainted 6.19.0-rc8+ #33 NONE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Call Trace: <IRQ> dump_stack_lvl (lib/dump_stack.c:123) print_report (mm/kasan/report.c:379 mm/kasan/report.c:482) ? macvlan_forward_source (drivers/net/macvlan.c:408 drivers/net/macvlan.c:444) kasan_report (mm/kasan/report.c:597) ? macvlan_forward_source (drivers/net/macvlan.c:408 drivers/net/macvlan.c:444) macvlan_forward_source (drivers/net/macvlan.c:408 drivers/net/macvlan.c:444) ? tasklet_init (kernel/softirq.c:983) macvlan_handle_frame (drivers/net/macvlan.c:501) Allocated by task 169: kasan_save_stack (mm/kasan/common.c:58) kasan_save_track (./arch/x86/include/asm/current.h:25 mm/kasan/common.c:70 mm/kasan/common.c:79) __kasan_kmalloc (mm/kasan/common.c:419) __kvmalloc_node_noprof (./include/linux/kasan.h:263 mm/slub.c:5657 mm/slub.c:7140) alloc_netdev_mqs (net/core/dev.c:12012) rtnl_create_link (net/core/rtnetlink.c:3648) rtnl_newlink (net/core/rtnetlink.c:3830 net/core/rtnetlink.c:3957 net/core/rtnetlink.c:4072) rtnetlink_rcv_msg (net/core/rtnetlink.c:6958) netlink_rcv_skb (net/netlink/af_netlink.c:2550) netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1894) __sys_sendto (net/socket.c:727 net/socket.c:742 net/socket.c:2206) __x64_sys_sendto (net/socket.c:2209) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131) Freed by task 169: kasan_save_stack (mm/kasan/common.c:58) kasan_save_track (./arch/x86/include/asm/current.h:25 mm/kasan/common.c:70 mm/kasan/common.c:79) kasan_save_free_info (mm/kasan/generic.c:587) __kasan_slab_free (mm/kasan/common.c:287) kfree (mm/slub.c:6674 mm/slub.c:6882) rtnl_newlink (net/core/rtnetlink.c:3845 net/core/rtnetlink.c:3957 net/core/rtnetlink.c:4072) rtnetlink_rcv_msg (net/core/rtnetlink.c:6958) netlink_rcv_skb (net/netlink/af_netlink.c:2550) netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1894) __sys_sendto (net/socket.c:727 net/socket.c:742 net/socket.c:2206) __x64_sys_sendto (net/socket.c:2209) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131) Fixes: f8db6475a836 ("macvlan: fix error recovery in macvlan_common_newlink()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: valis <sec@valis.email> Link: https://patch.msgid.link/20260213142557.3059043-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18selftests: tc_actions: don't dump 2MB of \0 to stdoutJakub Kicinski1-1/+1
Since we started running selftests in NIPA we have been seeing tc_actions.sh generate a soft lockup warning on ~20% of the runs. On the pre-netdev foundation setup it was actually a missed irq splat from the console. Now it's either that or a lockup. I initially suspected a socket locking issue since the test is exercising local loopback with act_mirred. After hours of staring at this I noticed in strace that ncat when -o $file is specified _both_ saves the output to the file and still prints it to stdout. Because the file being sent is constructed with: dd conv=sparse status=none if=/dev/zero bs=1M count=2 of=$mirred ^^^^^^^^^ the data printed is all \0. Most terminals don't display nul characters (and neither does vng output capture save them). But QEMU's serial console still has to poke them thru which is very slow and causes the lockup (if the file is >600kB). Replace the '-o $file' with '> $file'. This speeds the test up from 2m20s to 18s on debug kernels, and prevents the warnings. Fixes: ca22da2fbd69 ("act_mirred: use the backlog for nested calls to mirred ingress") Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260214035159.2119699-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18ipv6: addrconf: reduce default temp_valid_lft to 2 daysFernando Fernandez Mancera1-1/+2
This is a recommendation from RFC 8981 and it was intended to be changed by commit 969c54646af0 ("ipv6: Implement draft-ietf-6man-rfc4941bis") but it only changed the sysctl documentation. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260214172543.5783-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18ping: annotate data-races in ping_lookup()Eric Dumazet1-12/+19
isk->inet_num, isk->inet_rcv_saddr and sk->sk_bound_dev_if are read locklessly in ping_lookup(). Add READ_ONCE()/WRITE_ONCE() annotations. The race on isk->inet_rcv_saddr is probably coming from IPv6 support, but does not deserve a specific backport. Fixes: dbca1596bbb0 ("ping: convert to RCU lookups, get rid of rwlock") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216100149.3319315-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18net: dsa: MxL862xx: don't force-enable MAXLINEAR_GPHYArnd Bergmann1-1/+0
The newly added dsa driver attempts to enable the corresponding PHY driver, but that one has additional dependencies that may not be available: WARNING: unmet direct dependencies detected for MAXLINEAR_GPHY Depends on [m]: NETDEVICES [=y] && PHYLIB [=y] && (HWMON [=m] || HWMON [=m]=n [=n]) Selected by [y]: - NET_DSA_MXL862 [=y] && NETDEVICES [=y] && NET_DSA [=y] aarch64-linux-ld: drivers/net/phy/mxl-gpy.o: in function `gpy_probe': mxl-gpy.c:(.text.gpy_probe+0x13c): undefined reference to `devm_hwmon_device_register_with_info' aarch64-linux-ld: drivers/net/phy/mxl-gpy.o: in function `gpy_hwmon_read': mxl-gpy.c:(.text.gpy_hwmon_read+0x48): undefined reference to `polynomial_calc' There is actually no compile-time dependency, as DSA correctly uses the PHY abstractions. Remove the 'select' statement to reduce the complexity. Fixes: 23794bec1cb6 ("net: dsa: add basic initial driver for MxL862xx switches") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Daniel Golle <daniel@makrotopia.org> Link: https://patch.msgid.link/20260216105522.2382373-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18dpll: zl3073x: Fix ref frequency settingIvan Vecera1-0/+2
The frequency for an input reference is computed as: frequency = freq_base * freq_mult * freq_ratio_m / freq_ratio_n Before commit 5bc02b190a3fb ("dpll: zl3073x: Cache all reference properties in zl3073x_ref"), zl3073x_dpll_input_pin_frequency_set() explicitly wrote 1 to both the REF_RATIO_M and REF_RATIO_N hardware registers whenever a new frequency was set. This ensured the FEC ratio was always reset to 1:1 alongside the new base/multiplier values. The refactoring in that commit introduced zl3073x_ref_freq_set() to update the cached ref state, but this helper only sets freq_base and freq_mult without resetting freq_ratio_m and freq_ratio_n to 1. Because zl3073x_ref_state_set() uses a compare-and-write strategy, unchanged ratio fields are never written to the hardware. If the device previously had non-unity FEC ratio values, they remain in effect after a frequency change, resulting in an incorrect computed frequency. Explicitly set freq_ratio_m and freq_ratio_n to 1 in zl3073x_ref_freq_set() to restore the original behavior. Fixes: 5bc02b190a3fb ("dpll: zl3073x: Cache all reference properties in zl3073x_ref") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260216194007.680416-1-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18net: do not delay zero-copy skbs in skb_attempt_defer_free()Eric Dumazet1-1/+6
After the blamed commit, TCP tx zero copy notifications could be arbitrarily delayed and cause regressions in applications waiting for them. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs") Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260216193653.627617-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18net: psp: select CONFIG_SKB_EXTENSIONSArnd Bergmann1-0/+1
psp now uses skb extensions, failing to build when that is disabled: In file included from include/net/psp.h:7, from net/psp/psp_sock.c:9: include/net/psp/functions.h: In function '__psp_skb_coalesce_diff': include/net/psp/functions.h:60:13: error: implicit declaration of function 'skb_ext_find'; did you mean 'skb_ext_copy'? [-Wimplicit-function-declaration] 60 | a = skb_ext_find(one, SKB_EXT_PSP); | ^~~~~~~~~~~~ | skb_ext_copy include/net/psp/functions.h:60:31: error: 'SKB_EXT_PSP' undeclared (first use in this function) 60 | a = skb_ext_find(one, SKB_EXT_PSP); | ^~~~~~~~~~~ include/net/psp/functions.h:60:31: note: each undeclared identifier is reported only once for each function it appears in include/net/psp/functions.h: In function '__psp_sk_rx_policy_check': include/net/psp/functions.h:94:53: error: 'SKB_EXT_PSP' undeclared (first use in this function) 94 | struct psp_skb_ext *pse = skb_ext_find(skb, SKB_EXT_PSP); | ^~~~~~~~~~~ net/psp/psp_sock.c: In function 'psp_sock_recv_queue_check': net/psp/psp_sock.c:164:41: error: 'SKB_EXT_PSP' undeclared (first use in this function) 164 | pse = skb_ext_find(skb, SKB_EXT_PSP); | ^~~~~~~~~~~ Select the Kconfig symbol as we do from its other users. Fixes: 6b46ca260e22 ("net: psp: add socket security association code") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20260216105500.2382181-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18ipv6: fix a race in ip6_sock_set_v6only()Eric Dumazet1-4/+7
It is unlikely that this function will be ever called with isk->inet_num being not zero. Perform the check on isk->inet_num inside the locked section for complete safety. Fixes: 9b115749acb24 ("ipv6: add ip6_sock_set_v6only") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260216102202.3343588-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-17netfilter: nf_tables: fix use-after-free in nf_tables_addchain()Inseo An1-0/+1
nf_tables_addchain() publishes the chain to table->chains via list_add_tail_rcu() (in nft_chain_add()) before registering hooks. If nf_tables_register_hook() then fails, the error path calls nft_chain_del() (list_del_rcu()) followed by nf_tables_chain_destroy() with no RCU grace period in between. This creates two use-after-free conditions: 1) Control-plane: nf_tables_dump_chains() traverses table->chains under rcu_read_lock(). A concurrent dump can still be walking the chain when the error path frees it. 2) Packet path: for NFPROTO_INET, nf_register_net_hook() briefly installs the IPv4 hook before IPv6 registration fails. Packets entering nft_do_chain() via the transient IPv4 hook can still be dereferencing chain->blob_gen_X when the error path frees the chain. Add synchronize_rcu() between nft_chain_del() and the chain destroy so that all RCU readers -- both dump threads and in-flight packet evaluation -- have finished before the chain is freed. Fixes: 91c7b38dc9f0 ("netfilter: nf_tables: use new transaction infrastructure to handle chain") Signed-off-by: Inseo An <y0un9sa@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17net: remove WARN_ON_ONCE when accessing forward path arrayPablo Neira Ayuso1-1/+1
Although unlikely, recent support for IPIP tunnels increases chances of reaching this WARN_ON_ONCE if userspace manages to build a sufficiently long forward path. Remove it. Fixes: ddb94eafab8b ("net: resolve forwarding path from virtual netdevice and HW destination address") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17ipvs: do not keep dest_dst if dev is going downJulian Anastasov1-10/+36
There is race between the netdev notifier ip_vs_dst_event() and the code that caches dst with dev that is going down. As the FIB can be notified for the closed device after our handler finishes, it is possible valid route to be returned and cached resuling in a leaked dev reference until the dest is not removed. To prevent new dest_dst to be attached to dest just after the handler dropped the old one, add a netif_running() check to make sure the notifier handler is not currently running for device that is closing. Fixes: 7a4f0761fce3 ("IPVS: init and cleanup restructuring") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17ipvs: skip ipv6 extension headers for csum checksJulian Anastasov3-39/+20
Protocol checksum validation fails for IPv6 if there are extension headers before the protocol header. iph->len already contains its offset, so use it to fix the problem. Fixes: 2906f66a5682 ("ipvs: SCTP Trasport Loadbalancing Support") Fixes: 0bbdd42b7efa ("IPVS: Extend protocol DNAT/SNAT and state handlers") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17include: uapi: netfilter_bridge.h: Cover for musl libcPhil Sutter1-0/+4
Musl defines its own struct ethhdr and thus defines __UAPI_DEF_ETHHDR to zero. To avoid struct redefinition errors, user space is therefore supposed to include netinet/if_ether.h before (or instead of) linux/if_ether.h. To relieve them from this burden, include the libc header here if not building for kernel space. Reported-by: Alyssa Ross <hi@alyssa.is> Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17netfilter: nf_conntrack_h323: don't pass uninitialised l3num valueFlorian Westphal1-5/+5
Mihail Milev reports: Error: UNINIT (CWE-457): net/netfilter/nf_conntrack_h323_main.c:1189:2: var_decl: Declaring variable "tuple" without initializer. net/netfilter/nf_conntrack_h323_main.c:1197:2: uninit_use_in_call: Using uninitialized value "tuple.src.l3num" when calling "__nf_ct_expect_find". net/netfilter/nf_conntrack_expect.c:142:2: read_value: Reading value "tuple->src.l3num" when calling "nf_ct_expect_dst_hash". 1195| tuple.dst.protonum = IPPROTO_TCP; 1196| 1197|-> exp = __nf_ct_expect_find(net, nf_ct_zone(ct), &tuple); 1198| if (exp && exp->master == ct) 1199| return exp; Switch this to a C99 initialiser and set the l3num value. Fixes: f587de0e2feb ("[NETFILTER]: nf_conntrack/nf_nat: add H.323 helper port") Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17netfilter: nf_tables: revert commit_mutex usage in reset pathBrian Witte1-206/+42
It causes circular lock dependency between commit_mutex, nfnl_subsys_ipset and nlk_cb_mutex when nft reset, ipset list, and iptables-nft with '-m set' rule run at the same time. Previous patches made it safe to run individual reset handlers concurrently so commit_mutex is no longer required to prevent this. Fixes: bd662c4218f9 ("netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests") Fixes: 3d483faa6663 ("netfilter: nf_tables: Add locking for NFT_MSG_GETSETELEM_RESET requests") Fixes: 3cb03edb4de3 ("netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests") Link: https://lore.kernel.org/all/aUh_3mVRV8OrGsVo@strlen.de/ Reported-by: <syzbot+ff16b505ec9152e5f448@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=ff16b505ec9152e5f448 Signed-off-by: Brian Witte <brianwitte@mailfence.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17netfilter: nft_quota: use atomic64_xchg for resetBrian Witte1-6/+7
Use atomic64_xchg() to atomically read and zero the consumed value on reset, which is simpler than the previous read+sub pattern and doesn't require lock serialization. Fixes: bd662c4218f9 ("netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests") Fixes: 3d483faa6663 ("netfilter: nf_tables: Add locking for NFT_MSG_GETSETELEM_RESET requests") Fixes: 3cb03edb4de3 ("netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests") Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Brian Witte <brianwitte@mailfence.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17netfilter: nft_counter: serialize reset with spinlockBrian Witte1-4/+16
Add a global static spinlock to serialize counter fetch+reset operations, preventing concurrent dump-and-reset from underrunning values. The lock is taken before fetching the total so that two parallel resets cannot both read the same counter values and then both subtract them. A global lock is used for simplicity since resets are infrequent. If this becomes a bottleneck, it can be replaced with a per-net lock later. Fixes: bd662c4218f9 ("netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests") Fixes: 3d483faa6663 ("netfilter: nf_tables: Add locking for NFT_MSG_GETSETELEM_RESET requests") Fixes: 3cb03edb4de3 ("netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests") Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Brian Witte <brianwitte@mailfence.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17netfilter: annotate NAT helper hook pointers with __rcuSun Jian10-32/+34
The NAT helper hook pointers are updated and dereferenced under RCU rules, but lack the proper __rcu annotation. This makes sparse report address space mismatches when the hooks are used with rcu_dereference(). Add the missing __rcu annotations to the global hook pointer declarations and definitions in Amanda, FTP, IRC, SNMP and TFTP. No functional change intended. Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17Merge branch 'selftests-forwarding-fix-br_netfilter-related-test-failures'Paolo Abeni4-11/+33
Aleksei Oladko says: ==================== selftests: forwarding: fix br_netfilter related test failures This patch series fixes kselftests that fail when the br_nefilter module is loaded. The failures occur because the tests generate packets that are either modified or encapsulated, but their IP headers are not fully correct for sanity checks performed by be_netfilter. Signed-off-by: Aleksei Oladko <aleksey.oladko@virtuozzo.com> ==================== Link: https://patch.msgid.link/20260213131907.43351-1-aleksey.oladko@virtuozzo.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-17selftests: forwarding: fix pedit tests failure with br_netfilter enabledAleksei Oladko2-0/+16
The tests use the tc pedit action to modify the IPv4 source address ("pedit ex munge ip src set"), but the IP header checksum is not recalculated after the modification. As a result, the modified packet fails sanity checks in br_netfilter after bridging and is dropped, which causes the test to fail. Fix this by ensuring net.bridge.bridge-nf-call-iptables is set to 0 during the test execution. This prevents the bridge from passing L2 traffic to netfilter, bypassing the checksum validation that causes the test failure. Fixes: 92ad3828944e ("selftests: forwarding: Add a test for pedit munge SIP and DIP") Fixes: 226657ba2389 ("selftests: forwarding: Add a forwarding test for pedit munge dsfield") Signed-off-by: Aleksei Oladko <aleksey.oladko@virtuozzo.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260213131907.43351-4-aleksey.oladko@virtuozzo.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-17selftests: forwarding: vxlan_bridge_1d_ipv6: fix test failure with ↵Aleksei Oladko1-1/+1
br_netfilter enabled The test generates VXLAN traffic using mausezahn, where the encapsulated inner IPv6 packet has an incorrect payload length set in the IPv6 header. After VXLAN decapsulation, such packets do not pass sanity checks in br_netfilter and are dropped, which causes the test to fail. Fix this by setting the correct IPv6 payload length for the encapsulated packet generated by mausezahn, so that the packet is accepted by br_netfilter. tools/testing/selftests/net/forwarding/vxlan_bridge_1d_ipv6.sh lines 698-706 )"00:03:"$( : Payload length )"3a:"$( : Next header )"04:"$( : Hop limit )"$saddr:"$( : IP saddr )"$daddr:"$( : IP daddr )"80:"$( : ICMPv6.type )"00:"$( : ICMPv6.code )"00:"$( : ICMPv6.checksum ) Data after IPv6 header: • 80: — 1 byte (ICMPv6 type) • 00: — 1 byte (ICMPv6 code) • 00: — 1 byte (ICMPv6 checksum, truncated) Total: 3 bytes → 00:03 is correct. The old value 00:08 did not match the actual payload size. Fixes: b07e9957f220 ("selftests: forwarding: Add VxLAN tests with a VLAN-unaware bridge for IPv6") Signed-off-by: Aleksei Oladko <aleksey.oladko@virtuozzo.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260213131907.43351-3-aleksey.oladko@virtuozzo.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-17selftests: forwarding: vxlan_bridge_1d: fix test failure with br_netfilter ↵Aleksei Oladko1-10/+16
enabled The test generates VXLAN traffic using mausezahn, where the encapsulated inner IPv4 packet contains a zero IP header checksum. After VXLAN decapsulation, such packets do not pass sanity checks in br_netfilter and are dropped, which causes the test to fail. Fix this by calculating and setting a valid IPv4 header checksum for the encapsulated packet generated by mausezahn, so that the packet is accepted by br_netfilter. Fixed by using the payload_template_calc_checksum() / payload_template_expand_checksum() helpers that are only available in v6.3 and newer kernels. Fixes: a0b61f3d8ebf ("selftests: forwarding: vxlan_bridge_1d: Add an ECN decap test") Signed-off-by: Aleksei Oladko <aleksey.oladko@virtuozzo.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260213131907.43351-2-aleksey.oladko@virtuozzo.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-17net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RTEric Dumazet1-5/+12
CONFIG_PREEMPT_RT is special, make this clear in backlog_lock_irq_save() and backlog_unlock_irq_restore(). The issue shows up with CONFIG_DEBUG_IRQFLAGS=y raw_local_irq_restore() called with IRQs enabled WARNING: kernel/locking/irqflag-debug.c:10 at warn_bogus_irq_restore+0xc/0x20 kernel/locking/irqflag-debug.c:10, CPU#1: aoe_tx0/1321 Modules linked in: CPU: 1 UID: 0 PID: 1321 Comm: aoe_tx0 Not tainted syzkaller #0 PREEMPT_{RT,(full)} Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026 RIP: 0010:warn_bogus_irq_restore+0xc/0x20 kernel/locking/irqflag-debug.c:10 Call Trace: <TASK> backlog_unlock_irq_restore net/core/dev.c:253 [inline] enqueue_to_backlog+0x525/0xcf0 net/core/dev.c:5347 netif_rx_internal+0x120/0x550 net/core/dev.c:5659 __netif_rx+0xa9/0x110 net/core/dev.c:5679 loopback_xmit+0x43a/0x660 drivers/net/loopback.c:90 __netdev_start_xmit include/linux/netdevice.h:5275 [inline] netdev_start_xmit include/linux/netdevice.h:5284 [inline] xmit_one net/core/dev.c:3864 [inline] dev_hard_start_xmit+0x2df/0x830 net/core/dev.c:3880 __dev_queue_xmit+0x16f4/0x3990 net/core/dev.c:4829 dev_queue_xmit include/linux/netdevice.h:3384 [inline] Fixes: 27a01c1969a5 ("net: fully inline backlog_unlock_irq_restore()") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260213120427.2914544-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-17printk: add CONFIG_PRINTK dependency for netconsoleArnd Bergmann1-0/+1
The 'select PRINTK_EXECUTION_CTX' line now causes a harmless warning when NETCONSOLE_DYNAMIC is enabled but PRINTK is not: WARNING: unmet direct dependencies detected for PRINTK_EXECUTION_CTX Depends on [n]: PRINTK [=n] Selected by [y]: - NETCONSOLE_DYNAMIC [=y] && NETDEVICES [=y] && NET_CORE [=y] && NETCONSOLE [=y] && SYSFS [=y] && CONFIGFS_FS [=y] && (NETCONSOLE [=y]!=y [=y] || CONFIGFS_FS [=y]!=m [=m]) In that configuration, the netconsole driver is useless anyway, so avoid this with an added dependency that prevents CONFIG_NETCONSOLE to be enabled without CONFIG_PRINTK. Fixes: 60325c27d3cf ("printk: Add execution context (task name/CPU) to printk_info") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260213074431.1729627-1-arnd@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-17Merge branch 'net-bridge-mcast-fix-mdb_n_entries-counting-warning'Paolo Abeni2-29/+106
Nikolay Aleksandrov says: ==================== net: bridge: mcast: fix mdb_n_entries counting warning The first patch fixes a warning in the mdb_n_entries code which was reported by syzbot, the second patch adds tests for different ways which used to trigger said warning. For more information please check the individual commit messages. ==================== Link: https://patch.msgid.link/20260213070031.1400003-1-nikolay@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-17selftests: forwarding: bridge_mdb_max: add tests for mdb_n_entries warningNikolay Aleksandrov1-2/+88
Recently we were able to trigger a warning in the mdb_n_entries counting code. Add tests that exercise different ways which used to trigger that warning. Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://patch.msgid.link/20260213070031.1400003-3-nikolay@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-17net: bridge: mcast: always update mdb_n_entries for vlan contextsNikolay Aleksandrov1-27/+18
syzbot triggered a warning[1] about the number of mdb entries in a context. It turned out that there are multiple ways to trigger that warning today (some got added during the years), the root cause of the problem is that the increase is done conditionally, and over the years these different conditions increased so there were new ways to trigger the warning, that is to do a decrease which wasn't paired with a previous increase. For example one way to trigger it is with flush: $ ip l add br0 up type bridge vlan_filtering 1 mcast_snooping 1 $ ip l add dumdum up master br0 type dummy $ bridge mdb add dev br0 port dumdum grp 239.0.0.1 permanent vid 1 $ ip link set dev br0 down $ ip link set dev br0 type bridge mcast_vlan_snooping 1 ^^^^ this will enable snooping, but will not update mdb_n_entries because in __br_multicast_enable_port_ctx() we check !netif_running $ bridge mdb flush dev br0 ^^^ this will trigger the warning because it will delete the pg which we added above, which will try to decrease mdb_n_entries Fix the problem by removing the conditional increase and always keep the count up-to-date while the vlan exists. In order to do that we have to first initialize it on port-vlan context creation, and then always increase or decrease the value regardless of mcast options. To keep the current behaviour we have to enforce the mdb limit only if the context is port's or if the port-vlan's mcast snooping is enabled. [1] ------------[ cut here ]------------ n == 0 WARNING: net/bridge/br_multicast.c:718 at br_multicast_port_ngroups_dec_one net/bridge/br_multicast.c:718 [inline], CPU#0: syz.4.4607/22043 WARNING: net/bridge/br_multicast.c:718 at br_multicast_port_ngroups_dec net/bridge/br_multicast.c:771 [inline], CPU#0: syz.4.4607/22043 WARNING: net/bridge/br_multicast.c:718 at br_multicast_del_pg+0x1bbe/0x1e20 net/bridge/br_multicast.c:825, CPU#0: syz.4.4607/22043 Modules linked in: CPU: 0 UID: 0 PID: 22043 Comm: syz.4.4607 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026 RIP: 0010:br_multicast_port_ngroups_dec_one net/bridge/br_multicast.c:718 [inline] RIP: 0010:br_multicast_port_ngroups_dec net/bridge/br_multicast.c:771 [inline] RIP: 0010:br_multicast_del_pg+0x1bbe/0x1e20 net/bridge/br_multicast.c:825 Code: 41 5f 5d e9 04 7a 48 f7 e8 3f 73 5c f7 90 0f 0b 90 e9 cf fd ff ff e8 31 73 5c f7 90 0f 0b 90 e9 16 fd ff ff e8 23 73 5c f7 90 <0f> 0b 90 e9 60 fd ff ff e8 15 73 5c f7 eb 05 e8 0e 73 5c f7 48 8b RSP: 0018:ffffc9000c207220 EFLAGS: 00010293 RAX: ffffffff8a68042d RBX: ffff88807c6f1800 RCX: ffff888066e90000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffff888066e90000 R09: 000000000000000c R10: 000000000000000c R11: 0000000000000000 R12: ffff8880303ef800 R13: dffffc0000000000 R14: ffff888050eb11c4 R15: 1ffff1100a1d6238 FS: 00007fa45921b6c0(0000) GS:ffff8881256f5000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa4591f9ff8 CR3: 0000000081df2000 CR4: 00000000003526f0 Call Trace: <TASK> br_mdb_flush_pgs net/bridge/br_mdb.c:1525 [inline] br_mdb_flush net/bridge/br_mdb.c:1544 [inline] br_mdb_del_bulk+0x5e2/0xb20 net/bridge/br_mdb.c:1561 rtnl_mdb_del+0x48a/0x640 net/core/rtnetlink.c:-1 rtnetlink_rcv_msg+0x77e/0xbe0 net/core/rtnetlink.c:6967 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0xa68/0xad0 net/socket.c:2592 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2646 __sys_sendmsg net/socket.c:2678 [inline] __do_sys_sendmsg net/socket.c:2683 [inline] __se_sys_sendmsg net/socket.c:2681 [inline] __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2681 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa45839aeb9 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fa45921b028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007fa458615fa0 RCX: 00007fa45839aeb9 RDX: 0000000000000000 RSI: 00002000000000c0 RDI: 0000000000000004 RBP: 00007fa458408c1f R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fa458616038 R14: 00007fa458615fa0 R15: 00007fff0b59fae8 </TASK> Fixes: b57e8d870d52 ("net: bridge: Maintain number of MDB entries in net_bridge_mcast_port") Reported-by: syzbot+d5d1b7343531d17bd3c5@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/aYrWbRp83MQR1ife@debil/T/#t Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://patch.msgid.link/20260213070031.1400003-2-nikolay@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>