kernel/linux.git/Documentation/networking, branch v7.2-rc1

vlan: defer real device state propagation to netdev_work

2026-06-25T17:18:40+00:00

vlan_device_event() generates nested UP/DOWN, MTU and feature change events. It executes an event for the VLAN device directly from the notifier - while the locks of the lower device are held. This causes deadlocks, for example: bond (3) bond_update_speed_duplex(vlan) | ^ v vlan (2) UP(vlan) (4) vlan_ethtool_get_link_ksettings() | ^ v dummy (1) UP(dummy) (5) __ethtool_get_link_ksettings() The dummy device is ops locked, vlan creates a nested event (2), then bond wants to ask vlan for link state (3). bond uses the "I'm already holding the instance lock" flavor of API. But in this case the lock held refers to vlan itself. We hit vlan's link settings trampoline (4) and call __ethtool_get_link_ksettings() which tries to lock dummy. Deadlock. There's no clean way for us to tell the vlan_ethtool_get_link_ksettings() that the caller is already in lower device's critical section. Defer the propagation to the per-netdev work facility instead: the notifier only schedules netdev_work_sched(vlandev, VLAN_WORK_*), and ndo_work (vlan_dev_work) applies the change later. Hopefully nobody expects the VLAN state changes to be instantaneous. If someone does expect the changes to be instantaneous we will have to do the same thing Stan did for rx_mode and "strategically" place sync calls, to make sure such delayed works are executed after we drop the ops lock but before we drop rtnl_lock. Stan suggests that if we need that down the line we may consider reshaping the mechanism into "async notifications". AFAICT only vlan does this sort of netdev open chaining, so as a first try I think that sticking the complexity into the vlan code makes sense. One corner case is that we need to cancel the event if user explicitly changes the state before work could run. Consider the following operations with vlan0 on top of dummy0: ip link set dev dummy0 up # queues work to up vlan0 ip link set dev vlan0 down # user explicitly downs the vlan ndo_work # acts on the stale event Reported-by: syzbot+09da62a8b78959ceb8bb@syzkaller.appspotmail.com Reported-by: syzbot+cb67c392b0b8f0fd0fc1@syzkaller.appspotmail.com Reported-by: syzbot+9bb8bd77f3966641f298@syzkaller.appspotmail.com Fixes: 9f275c2e9020 ("net: ethtool: make sure __ethtool_get_link_ksettings() is ops-locked") Reviewed-by: Kuniyuki Iwashima Reviewed-by: Nicolai Buchwitz Acked-by: Stanislav Fomichev Link: https://patch.msgid.link/20260624182018.2445732-4-kuba@kernel.org Signed-off-by: Jakub Kicinski

appletalk: stop storing per-interface state in struct net_device

2026-06-16T21:37:06+00:00

AppleTalk keeps its per-interface control block (struct atalk_iface) directly in struct netdevice (dev->atalk_ptr). This is the only thing tying the protocol into the core net_device layout and is the sole blocker to moving AppleTalk out of tree. Replace dev->atalk_ptr with a small ifindex-keyed hashtable internal to ddp.c. The existing atalk_interfaces list stays the owner of the iface objects; the hashtable is purely a fast dev->iface index and reuses the same atalk_interfaces_lock. AFAICT this patch does not make this code any more racy than it already is, I'm sure Sashiko will point out some basically existing bugs. AFAICT atalk_interfaces_lock is the innermost lock already. Acked-by: Stephen Hemminger Link: https://patch.msgid.link/20260615222935.947233-2-kuba@kernel.org Signed-off-by: Jakub Kicinski

tcp: rehash onto different local ECMP path on retransmit timeout

2026-06-15T22:57:31+00:00

Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO, PLB, and spurious-retransmission events, but the cached route is reused and the new hash is not propagated into the ECMP path selection logic. Two changes are needed to make rehash select a different local ECMP path: 1. Add __sk_dst_reset() alongside sk_rethink_txhash() in tcp_write_timeout(), tcp_rcv_spurious_retrans(), and tcp_plb_check_rehash() so the cached dst is invalidated and the next transmit triggers a fresh route lookup. 2. Set fl6->mp_hash from sk_txhash (or tcp_rsk(req)->txhash for SYN/ACK retransmits and syncookies) in tcp_v6_connect(), inet6_sk_rebuild_header(), inet6_csk_route_req(), inet6_csk_route_socket(), tcp_v6_send_response(), and cookie_v6_check() so fib6_select_path() picks a path based on the new hash. The mp_hash override only applies to fib_multipath_hash_policy 0 (the default L3 policy). Its hash includes the flow label, but that is 0 by default -- np->flow_label is unset, and auto_flowlabels only computes the on-wire label later, per packet -- so flows to the same peer share one local path. Keying the hash on sk_txhash makes the local path per-connection and lets a rehash re-select it. Policies 1-3 are left unchanged. The mp_hash assignment is factored into a small helper, ip6_ecmp_set_mp_hash(), shared by inet6_csk_route_req(), inet6_csk_route_socket(), tcp_v6_connect(), inet6_sk_rebuild_header(), tcp_v6_send_response(), and cookie_v6_check(). It applies (txhash >> 1) ?: 1 for policy 0 (the >> 1 keeps mp_hash in the 31-bit range; ?: 1 keeps it non-zero, since 0 would fall back to rt6_multipath_hash()). inet6_csk_route_socket() calls it only for sk_protocol == IPPROTO_TCP so that non-TCP callers (e.g., L2TP via inet6_csk_xmit) fall through to rt6_multipath_hash() and retain their existing flow-key-based ECMP behavior. tcp_v6_send_response() also sets mp_hash from the response txhash so that a control packet (a RST from the full socket, or an ACK from a time-wait socket) selects the same local ECMP nexthop as the connection's txhash rather than falling back to the flow hash. The time-wait socket's tw_txhash is copied from sk_txhash when the connection enters TIME_WAIT, so it reflects any rehash that occurred. Setting mp_hash explicitly is necessary because the default ECMP hash derives from fl6->flowlabel via np->flow_label, which is not updated from sk_txhash (REPFLOW is off by default). ip6_make_flowlabel() cannot help either, as it runs after the route lookup. As a consequence, for policy 0 the local ECMP path of an IPv6 TCP flow follows sk_txhash even when fl6->flowlabel is non-zero, e.g. a reflected (REPFLOW) or explicitly set (IPV6_FLOWLABEL_MGR) flow label. This is intentional: only local path selection changes, so rehash can recover from a failed path; the on-wire flow label is unchanged. sk_set_txhash() is moved before ip6_dst_lookup_flow() in tcp_v6_connect() so the initial ECMP path is selected by the same txhash that subsequent route rebuilds will use. This avoids unintended path changes when the cached dst is naturally invalidated (e.g., by PMTU discovery or route changes). The rehash sites (tcp_write_timeout(), tcp_plb_check_rehash(), and tcp_rcv_spurious_retrans()) call __sk_rethink_txhash_reset_dst(), which re-rolls the txhash and, when it changed, drops the cached dst so the next transmit re-runs route selection. The dst reset is guarded by sk->sk_family == AF_INET6 since IPv4 ECMP does not currently use sk_txhash for path selection. For IPv4-mapped IPv6 sockets this produces a redundant dst reset on a cold path (RTO/PLB); the subsequent IPv4 route lookup returns the same result. The helper is deliberately separate from sk_rethink_txhash() itself: dst_negative_advice() calls sk_rethink_txhash() before its own dst op, so resetting the dst inside sk_rethink_txhash() would skip that op (e.g. rt6_remove_exception_rt()). For syncookies, cookie_init_sequence() computes the cookie value before route_req() and sets txhash so the SYN-ACK selects the same ECMP path that cookie_v6_check() will use when the full socket is created. cookie_tcp_reqsk_init() derives txhash from the cookie so the full socket's ECMP path matches the SYN-ACK. Both the SYN-ACK assignment in tcp_conn_request() and the full-socket assignment in cookie_tcp_reqsk_init() set txhash from the cookie for IPv4 and IPv6 alike. On IPv6 this drives ECMP path selection; on IPv4, which does not use sk_txhash for ECMP, it only affects TX-queue selection. That selection scales the hash by its high bits (reciprocal_scale()), which are uniform in the keyed secure_tcp_syn_cookie() output -- the MSS index only perturbs the low bits -- so the queue distribution matches net_tx_rndhash(). cookie_init_sequence() is split from the former version that also called tcp_synq_overflow() and incremented SYNCOOKIESSENT; those side effects are now in cookie_record_sent(), called after route_req() succeeds so they are not bumped when route_req() fails. cookie_record_sent() is guarded by CONFIG_SYN_COOKIES to match the guard on tcp_synq_overflow(). route_req() receives 0 as tw_isn for the syncookie path so that tcp_v6_init_req() still saves ireq->pktopts for REPFLOW flowlabel reflection and IPv6 cmsg options. The ecn_ok clear for syncookies without timestamps stays after tcp_ecn_create_request() so it takes precedence. Signed-off-by: Neil Spring Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20260615042158.1600746-2-ntspring@meta.com Signed-off-by: Jakub Kicinski

docs: net: fix minor issues with strparser docs

2026-06-15T22:42:53+00:00

Not sure if anyone would read this doc, but the API has evolved since it was written. Update to: - show the int return type for strp_init() - refer to strp_data_ready(), not the old strp_tcp_data_ready() name - direct users to strp_msg(skb) for strparser metadata instead of treating skb->cb as struct strp_msg directly Link: https://patch.msgid.link/20260613165846.2913092-4-kuba@kernel.org Signed-off-by: Jakub Kicinski

docs: net: fix minor issues with devlink docs

2026-06-15T22:42:53+00:00

Update devlink documentation to match current code: - describe health reporter defaults (it's currently under "callbacks"), best-effort auto-dump, and port-scoped reporters - fix generic parameter names and values - fix nested devlink setup wording and registration ordering Link: https://patch.msgid.link/20260613165846.2913092-3-kuba@kernel.org Signed-off-by: Jakub Kicinski

docs: net: tls-offload: document tls_dev_del, tls_dev_resync, and rekey

2026-06-15T22:42:52+00:00

Fill in some gaps in the TLS offload doc: - describe the tls_dev_del and tls_dev_resync callbacks - add a mention of rekeying being out of scope for now Reviewed-by: Sabrina Dubroca Link: https://patch.msgid.link/20260613165846.2913092-2-kuba@kernel.org Signed-off-by: Jakub Kicinski

Merge tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

2026-06-15T21:09:57+00:00

Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next. More specifically, this contains conncount rework to address AI related reports, assorted Netfiter updates and two small incremental updates on IPVS: 1) Replace old obsolete workqueues (system_wq, system_unbound_wq) in IPVS, from Marco Crivellari. 2) Replace WARN_ON{_ONCE} by DEBUG_NET_WARN_ON_ONCE in nf_tables. In the recent years, reporters say that the use of WARN_ON{_ONCE} in conjunction with panic_on_warn=1 results in DoS. Let's replace it by DEBUG_NET_WARN_ON_ONCE so this is only exercised by test infrastructure and fuzzers, while also providing context to AI agents. From Fernando F. Mancera. Five patches from Florian Westphal to address AI reports in the conncount infrastructures: 3) Fix missing rcu read lock section when calling __ovs_ct_limit_get_zone_limit(). 4) Add a dedicate lock per rbtree tree, this increases memory usage but it should improve scalability. 5) Add a helper function to find the rbtree node, no functional changes are intented. 6) Add sequence counter to detect concurrent tree modifications and retry lookups. 7) Add locks to GC conncount walk and address other nitpicks. Then, several assorted updates: 8) Defensive Tree-wide addition of NULL checks for ct extensions. 9) Bail out if flowtable bypass cannot be fully set up from the flow offload expression, instead of lazy building a likely incomplete one. 10) Fix documentation for the new conn_max sysctl toggle in IPVS. 11) Add nf_dev_xmit_recursion*() helpers and use them, to address recent AI reports. * tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_dup_netdev: add nf_dev_xmit_recursion*() helpers and use them ipvs: fix doc syntax for conn_max sysctl netfilter: flowtable: bail out if forward path cannot be discovered netfilter: conntrack: check NULL when retrieving ct extension netfilter: nf_conncount: gc and rcu fixes netfilter: nf_conncount: add sequence counter to detect tree modifications netfilter: nf_conncount: split count_tree_node rbtree walk into helper netfilter: nf_conncount: use per nf_conncount_data spinlocks netfilter: nf_conncount: callers must hold rcu read lock netfilter: nf_tables: use DEBUG_NET_WARN_ON_ONCE in packet and control paths ipvs: Replace use of system_unbound_wq with system_dfl_long_wq ==================== Link: https://patch.msgid.link/20260614114605.474783-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski

ipvs: fix doc syntax for conn_max sysctl

2026-06-14T10:51:55+00:00

Fix the docutils error reported by kernel test robot for the new conn_max sysctl: Documentation/networking/ipvs-sysctl.rst:76: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils] Documentation/networking/ipvs-sysctl.rst:76: ERROR: Unexpected section title or transition. Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202606071851.Dc1H7hOO-lkp@intel.com/ Fixes: 4a15044a2b06 ("ipvs: add conn_max sysctl to limit connections") Signed-off-by: Julian Anastasov Signed-off-by: Pablo Neira Ayuso

dt-bindings: net: dsa: Convert lan9303.txt to yaml format

2026-06-13T22:00:07+00:00

Convert lan9303.txt to yaml format to fix below CHECK_DTBS warnings: arch/arm/boot/dts/nxp/imx/imx53-kp-hsc.dtb: /soc/bus@50000000/i2c@53fec000/switch@a: failed to match any schema with compatible: ['smsc,lan9303-i2c'] Additional changes: - rename switch-phy to switch in example. Reviewed-by: Rob Herring (Arm) Signed-off-by: Frank Li Link: https://patch.msgid.link/20260610150533.515914-1-Frank.Li@oss.nxp.com Signed-off-by: Jakub Kicinski

Merge tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

2026-06-13T20:16:39+00:00

Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2026-06-12 1) Replace the open-coded manual cleanup in xfrm_add_policy() error path with xfrm_policy_destroy() for consistency with xfrm_policy_construct(). From Deepanshu Kartikey. 2) Limit XFRMA_TFCPAD to a sensible maximum (max IP length, 64k) since u32 is excessive for traffic flow confidentiality padding. From David Ahern. 3) Add a new netlink message XFRM_MSG_MIGRATE_STATE that allows migrating individual IPsec SAs independently of their policies. The existing XFRM_MSG_MIGRATE is tightly coupled to policy+SA migration, lacks SPI for unique SA identification, and cannot express reqid changes or migrate Transport mode selectors. The new interface identifies the SA via SPI and mark, supports reqid changes, address family changes, encap removal, and uses an atomic create+install flow under x->lock to prevent SN/IV reuse during AEAD SA migration. From Antony Antony. * tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: add documentation for XFRM_MSG_MIGRATE_STATE xfrm: restrict netlink attributes for XFRM_MSG_MIGRATE_STATE xfrm: add XFRM_MSG_MIGRATE_STATE for single SA migration xfrm: make xfrm_dev_state_add xuo parameter const xfrm: extract address family and selector validation helpers xfrm: refactor XFRMA_MTIMER_THRESH validation into a helper xfrm: move encap and xuo into struct xfrm_migrate xfrm: add error messages to state migration xfrm: add state synchronization after migration xfrm: check family before comparing addresses in migrate xfrm: split xfrm_state_migrate into create and install functions xfrm: rename reqid in xfrm_migrate xfrm: fix NAT-related field inheritance in SA migration xfrm: allow migration from UDP encapsulated to non-encapsulated ESP xfrm: add extack to xfrm_init_state xfrm: remove redundant assignments xfrm: Reject excessive values for XFRMA_TFCPAD xfrm: cleanup error path in xfrm_add_policy() ==================== Link: https://patch.msgid.link/20260612074725.1760473-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski