Age | Commit message | Author | Files | Lines
2025-11-20 | net/mlx5: Move SF dev table notifier registration outside the PF devlink lock | Cosmin Ratiu | 4 | -17/+49
This completes the previous patches by moving notifier registration for SF dev tables outside the devlink locked critical section in mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() / mlx5_mdev_uninit() functions. This is only done for non-SFs, since SFs do not have a SF HW table themselves. After this patch, notifiers can grab the PF devlink lock (soon to be necessary) without creating a locking cycle. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763325940-1231508-7-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net/mlx5: Move the SF table notifiers outside the devlink lock | Cosmin Ratiu | 4 | -33/+78
Move the SF table notifiers registration/unregistration outside of mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() / mlx5_mdev_uninit() functions. This is only done for non-SFs, since SFs do not have a SF table themselves and thus don't need notifiers. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763325940-1231508-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net/mlx5: Move the SF HW table notifier outside the devlink lock | Cosmin Ratiu | 4 | -35/+54
Move the SF HW table notifier registration/unregistration outside of mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() / mlx5_mdev_uninit() functions. This is only done for non-SFs, since SFs do not have a SF HW table themselves. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763325940-1231508-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net/mlx5: Move the vhca event notifier outside of the devlink lock | Cosmin Ratiu | 6 | -48/+35
The vhca event notifier consists of an atomic notifier for vhca state changes (used for SF events), multiple workqueues and a blocking notifier chain for delivering the vhca state change events for further processing. This patch moves the vhca notifier head outside of mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() / mlx5_mdev_uninit() functions. This allows called notifiers to grab the PF devlink lock which was previously impossible because it would create a circular lock dependency. mlx5_vhca_event_stop() is now called earlier in the cleanup phase and flushes the workqueues to ensure that after the call, there are no pending events. This simplifies the cleanup flow for vhca event consumers. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763325940-1231508-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net/mlx5: Move the esw mode notifier chain outside the devlink lock | Cosmin Ratiu | 5 | -12/+17
The esw mode change notifier chain is initialized/cleaned up in mlx5_init_one() / mlx5_uninit_one() with the devlink lock held. Move the notifier head from the eswitch struct into mlx5_priv directly, and initialize it outside the critical section. This will allow notifier registration to happen earlier in the init procedure in subsequent patches. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763325940-1231508-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net/mlx5: Initialize events outside devlink lock | Cosmin Ratiu | 1 | -10/+24
Move event init/cleanup outside of mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() / mlx5_mdev_uninit() functions. By doing this, we avoid the events being reinitialized on devlink reload and, more importantly, the events->sw_nh notifier chain becomes available earlier in the init procedure, which will be used in subsequent patches. This makes sense because the events struct is pure software, independent of any HW details. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763325940-1231508-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | Merge branch 'net-adjust-conservative-values-around-napi' | Jakub Kicinski | 1 | -8/+12
Jason Xing says: ==================== net: adjust conservative values around napi In summary, this series keeps at least 96 skbs per cpu and frees 32 skbs at a time. More of the initial discussion with Eric can be seen at the link [1]. [1]: https://lore.kernel.org/all/CAL+tcoBEEjO=-yvE7ZJ4sB2smVBzUht1gJN85CenJhOKV ==================== Link: https://patch.msgid.link/20251118070646.61344-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net: prefetch the next skb in napi_skb_cache_get() | Jason Xing | 1 | -0/+2
After getting the current skb in napi_skb_cache_get(), the next skb in cache is highly likely to be used soon, so prefetch would be helpful. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251118070646.61344-5-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
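A minimal sketch of the idea, using the upstream helper names but a simplified cache layout (the real function also handles refilling from the slab and KASAN bookkeeping):

    static struct sk_buff *napi_skb_cache_get(void)
    {
        struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
        struct sk_buff *skb;

        if (unlikely(!nc->skb_count))
            return NULL;    /* the real code refills from the slab here */

        skb = nc->skb_cache[--nc->skb_count];
        /* The next pop will likely happen soon; warm up that skb now. */
        if (nc->skb_count)
            prefetch(nc->skb_cache[nc->skb_count - 1]);

        return skb;
    }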
2025-11-20 | net: use NAPI_SKB_CACHE_FREE to keep 32 as default to do bulk free | Jason Xing | 1 | -6/+8
- Replace NAPI_SKB_CACHE_HALF with NAPI_SKB_CACHE_FREE
- Only free 32 skbs at a time in napi_skb_cache_put()
Since the first patch adjusted NAPI_SKB_CACHE_SIZE to 128, the number of packets freed in softirq context increased from 32 to 64. Considering that a subsequent net_rx_action() calling napi_poll() a few times can easily consume the 64 available slots, and that we can afford to keep a higher number of sk_buffs in per-cpu storage, decrease NAPI_SKB_CACHE_FREE back to 32. So the logic is now: 1) keep 96 skbs, 2) free 32 skbs at a time. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251118070646.61344-4-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
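A sketch of the resulting put-side logic with the constants from this series; the cache struct and net_hotdata.skbuff_cache names follow current kernels, while the body is illustrative (the real code also poisons the freed entries for KASAN):

    #define NAPI_SKB_CACHE_SIZE    128    /* per-cpu cache capacity */
    #define NAPI_SKB_CACHE_FREE    32     /* skbs returned to the slab per flush */

    static void napi_skb_cache_put(struct sk_buff *skb)
    {
        struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

        nc->skb_cache[nc->skb_count++] = skb;
        if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
            /* Free the newest 32 entries, keeping 96 skbs cached. */
            kmem_cache_free_bulk(net_hotdata.skbuff_cache,
                                 NAPI_SKB_CACHE_FREE,
                                 nc->skb_cache + NAPI_SKB_CACHE_SIZE -
                                 NAPI_SKB_CACHE_FREE);
            nc->skb_count -= NAPI_SKB_CACHE_FREE;
        }
    }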
2025-11-20 | net: increase default NAPI_SKB_CACHE_BULK to 32 | Jason Xing | 1 | -1/+1
The previous value of 16 is a bit conservative, so adjust it along with NAPI_SKB_CACHE_SIZE to minimize how often napi_skb_cache_get*() has to fall back to memory allocation. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251118070646.61344-3-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net: increase default NAPI_SKB_CACHE_SIZE to 128 | Jason Xing | 1 | -1/+1
After commit b61785852ed0 ("net: increase skb_defer_max default to 128") raised the sysctl_skb_defer_max default to avoid many calls to kick_defer_list_purge(), the same reasoning can be applied to NAPI_SKB_CACHE_SIZE, which was proposed in 2016. It's a trade-off between using pre-allocated memory in skb_cache and saving on the somewhat heavy function calls otherwise needed in softirq context. With this patch applied, we have more skbs per cpu to accelerate the sending path that needs to acquire new skbs. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251118070646.61344-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | Merge branch 'disable-clkout-on-rtl8211f-d-i-vd-cg' | Jakub Kicinski | 1 | -57/+102
Vladimir Oltean says: ==================== Disable CLKOUT on RTL8211F(D)(I)-VD-CG The Realtek RTL8211F(D)(I)-VD-CG is similar to other RTL8211F models in that the CLKOUT signal can be turned off - a feature requested to reduce EMI, and implemented via "realtek,clkout-disable" as documented in Documentation/devicetree/bindings/net/realtek,rtl82xx.yaml. It is also dissimilar to said PHY models because it has no PHYCR2 register, and disabling CLKOUT is done through some other register. The strategy adopted in this 6-patch series is to make the PHY driver think not in terms of "priv->has_phycr2" and "priv->phycr2", but in terms of higher-level features ("priv->disable_clk_out"), while maintaining behaviour. Then, the logic is extended for the new PHY. Very loosely based on previous work from Clark Wang, who took a different approach: pretending that the RTL8211FVD_CLKOUT_REG is actually this PHY's PHYCR2. ==================== Link: https://patch.msgid.link/20251117234033.345679-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net: phy: realtek: create rtl8211f_config_phy_eee() helper | Vladimir Oltean | 1 | -11/+12
To simplify the rtl8211f_config_init() control flow and get rid of "early" returns for PHYs where the PHYCR2 register is absent, move the entire logic sub-block that deals with disabling PHY-mode EEE to a separate function. There, it is much more obvious what the early "return 0" skips, and it becomes more difficult to accidentally skip unintended stuff. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20251117234033.345679-7-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net: phy: realtek: eliminate priv->phycr1 variable | Vladimir Oltean | 1 | -16/+28
Previous changes have replaced the machine-level priv->phycr2 with a high-level priv->disable_clk_out. This created a discrepancy with priv->phycr1 which is resolved here, for uniformity. One advantage of this new implementation is that we don't read priv->phycr1 in rtl821x_probe() if we're never going to modify it. We never test the positive return code from phy_modify_mmd_changed(), so we could just as well use phy_modify_mmd(). I took the ALDPS feature description from commit d90db36a9e74 ("net: phy: realtek: add dt property to enable ALDPS mode") and transformed it into a function comment - the feature is sufficiently non-obvious to deserve that. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251117234033.345679-6-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net: phy: realtek: allow CLKOUT to be disabled on RTL8211F(D)(I)-VD-CG | Vladimir Oltean | 1 | -9/+22
Add CLKOUT disable support for RTL8211F(D)(I)-VD-CG. Like with other PHY variants, this feature might be requested by customers when the clock output is not used, in order to reduce electromagnetic interference (EMI). In the common driver, the CLKOUT configuration is done through PHYCR2. The RTL_8211FVD_PHYID is singled out as not having that register, and execution in rtl8211f_config_init() returns early after commit 2c67301584f2 ("net: phy: realtek: Avoid PHYCR2 access if PHYCR2 not present"). But actually CLKOUT is configured through a different register for this PHY. Instead of pretending this is PHYCR2 (which it is not), just add some code for modifying this register inside the rtl8211f_disable_clk_out() function, and move that outside the code portion that runs only if PHYCR2 exists. In practice this reorders the PHYCR2 writes to disable PHY-mode EEE and to disable the CLKOUT for the normal RTL8211F variants, but this should be perfectly fine. It was not noted that RTL8211F(D)(I)-VD-CG would need a genphy_soft_reset() call after disabling the CLKOUT. Despite that, we do it out of caution and for symmetry with the other RTL8211F models. Co-developed-by: Clark Wang <xiaoning.wang@nxp.com> Signed-off-by: Clark Wang <xiaoning.wang@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251117234033.345679-5-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net: phy: realtek: eliminate has_phycr2 variable | Vladimir Oltean | 1 | -4/+2
This variable is assigned in rtl821x_probe() and used in rtl8211f_config_init(), which is more complex than it needs to be. Simply testing the same condition from rtl821x_probe() in rtl8211f_config_init() yields the same result (the PHY driver ID is a runtime invariant), but with one temporary variable less. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251117234033.345679-4-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net: phy: realtek: eliminate priv->phycr2 variable | Vladimir Oltean | 1 | -15/+23
The RTL8211F(D)(I)-VD-CG PHY also has support for disabling the CLKOUT, and we'd like to introduce the "realtek,clkout-disable" property for that. But it isn't done through the PHYCR2 register, and it becomes awkward to have the driver pretend that it is. So just replace the machine-level "u16 phycr2" variable with a logical "bool disable_clk_out", which scales better to the other PHY as well. The change is completely functionally equivalent. Before, if the device tree property was absent, priv->phycr2 would contain the RTL8211F_CLKOUT_EN bit as read from hardware. Now, we don't save priv->phycr2; we just don't call phy_modify_paged() at all. Also, we can simply call phy_modify_paged() with the "set" argument set to 0. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20251117234033.345679-3-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
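A sketch of the resulting shape, assuming the register/page constants already present in the driver; the helper body is illustrative, not the literal patch:

    struct rtl821x_priv {
        u16 phycr1;
        bool disable_clk_out;    /* was: u16 phycr2 */
    };

    static int rtl8211f_disable_clk_out(struct phy_device *phydev)
    {
        struct rtl821x_priv *priv = phydev->priv;

        if (!priv->disable_clk_out)
            return 0;    /* leave the hardware default untouched */

        /* Clear CLKOUT_EN; the "set" argument is simply 0. */
        return phy_modify_paged(phydev, 0xa43, RTL8211F_PHYCR2,
                                RTL8211F_CLKOUT_EN, 0);
    }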
2025-11-20 | net: phy: realtek: create rtl8211f_config_rgmii_delay() | Vladimir Oltean | 1 | -26/+39
The control flow in rtl8211f_config_init() has some pitfalls which were probably unintended. Specifically, it has an early return:

    switch (phydev->interface) {
    ...
    default:
        /* the rest of the modes imply leaving delay as is. */
        return 0;
    }

which exits the entire config_init() function. This means it also skips doing things such as disabling CLKOUT or disabling PHY-mode EEE. For the RTL8211FS, which uses PHY_INTERFACE_MODE_SGMII, this might be a problem. However, I don't know that it is, so there is no Fixes: tag. The issue was observed through code inspection. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251117234033.345679-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
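A sketch of the refactor, with the delay-bit application elided; moving the switch into its own helper turns the early "return 0" into "nothing to do for this interface mode" instead of aborting all of config_init():

    static int rtl8211f_config_rgmii_delay(struct phy_device *phydev)
    {
        switch (phydev->interface) {
        case PHY_INTERFACE_MODE_RGMII:
        case PHY_INTERFACE_MODE_RGMII_ID:
        case PHY_INTERFACE_MODE_RGMII_RXID:
        case PHY_INTERFACE_MODE_RGMII_TXID:
            /* select and apply the TX/RX delay bits here */
            return 0;
        default:
            return 0;    /* non-RGMII: leave delays as-is */
        }
    }

    static int rtl8211f_config_init(struct phy_device *phydev)
    {
        int ret = rtl8211f_config_rgmii_delay(phydev);

        if (ret < 0)
            return ret;
        /* CLKOUT / PHY-mode EEE handling now runs for SGMII too */
        return 0;
    }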
2025-11-20 | net: vmxnet3: convert to use .get_rx_ring_count | Breno Leitao | 1 | -15/+3
Convert the vmxnet3 driver to use the new .get_rx_ring_count ethtool operation instead of implementing .get_rxnfc solely for handling ETHTOOL_GRXRINGS command. This simplifies the code by removing the switch statement and replacing it with a direct return of the queue count. The new callback provides the same functionality in a more direct way, following the ongoing ethtool API modernization. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20251118-vmxnet3_grxrings-v1-1-ed8abddd2d52@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
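A sketch of the conversion, assuming the new ethtool op returns the ring count as a u32 (the callback name comes from the commit title; the body is illustrative):

    static u32 vmxnet3_get_rx_ring_count(struct net_device *netdev)
    {
        struct vmxnet3_adapter *adapter = netdev_priv(netdev);

        return adapter->num_rx_queues;
    }

    static const struct ethtool_ops vmxnet3_ethtool_ops = {
        /* .get_rxnfc is no longer needed just for ETHTOOL_GRXRINGS */
        .get_rx_ring_count = vmxnet3_get_rx_ring_count,
    };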
2025-11-20 | Merge branch 'net-mana-enforce-tx-sge-limit-and-fix-error-cleanup' | Jakub Kicinski | 5 | -12/+53
Aditya Garg says: ==================== net: mana: Enforce TX SGE limit and fix error cleanup Add pre-transmission checks to block SKBs that exceed the hardware's SGE limit. Force software segmentation for GSO traffic and linearize non-GSO packets as needed. Update TX error handling to drop failed SKBs and unmap resources immediately. ==================== Link: https://patch.msgid.link/1763464269-10431-1-git-send-email-gargaditya@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net: mana: Drop TX skb on post_work_request failure and unmap resources | Aditya Garg | 3 | -9/+5
Drop TX packets when posting the work request fails and ensure DMA mappings are always cleaned up. Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/1763464269-10431-3-git-send-email-gargaditya@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | net: mana: Handle SKB if TX SGEs exceed hardware limit | Aditya Garg | 4 | -3/+48
The MANA hardware supports a maximum of 30 scatter-gather entries (SGEs) per TX WQE. Exceeding this limit can cause TX failures. Add an ndo_features_check() callback to validate the SKB layout before transmission. For GSO SKBs that would exceed the hardware SGE limit, clear NETIF_F_GSO_MASK to enforce software segmentation in the stack. Add a fallback in mana_start_xmit() to linearize non-GSO SKBs that still exceed the SGE limit. Also, add an ethtool counter for linearized SKBs. Co-developed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/1763464269-10431-2-git-send-email-gargaditya@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
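A sketch of the two checks; MAX_TX_WQE_SGL_ENTRIES stands for the 30-SGE limit named in the commit, while mana_skb_sge_count() and the tx_linearized counter are hypothetical names used here for illustration:

    static netdev_features_t mana_features_check(struct sk_buff *skb,
                                                 struct net_device *ndev,
                                                 netdev_features_t features)
    {
        if (skb_is_gso(skb) &&
            mana_skb_sge_count(skb) > MAX_TX_WQE_SGL_ENTRIES)
            features &= ~NETIF_F_GSO_MASK;    /* force software GSO */

        return features;
    }

    /* In mana_start_xmit(), non-GSO skbs get a linearize fallback: */
    if (!skb_is_gso(skb) && mana_skb_sge_count(skb) > MAX_TX_WQE_SGL_ENTRIES) {
        if (skb_linearize(skb))
            goto tx_drop;                     /* free and count the skb */
        apc->eth_stats.tx_linearized++;       /* new ethtool counter */
    }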
2025-11-20 | octeontx2-af: Skip TM tree print for disabled SQs | Anshumali Gaur | 1 | -0/+3
Currently, the TM tree print dumps the topology of all SQs, including those which are not enabled; this results in redundant output for inactive SQs. This patch adds a check in print_tm_tree() to skip printing the TM tree hierarchy if the SQ is not enabled. Signed-off-by: Anshumali Gaur <agaur@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251118054235.1599714-1-agaur@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 | dt-bindings: net: mediatek,net: Correct bindings for MT7981 | Sjoerd Simons | 1 | -3/+23
Different SoCs have different numbers of Wireless Ethernet Dispatch (WED) units:
- MT7981: Has 1 WED unit
- MT7986: Has 2 WED units
- MT7988: Has 2 WED units
Update the binding to reflect these hardware differences. The MT7981 also uses infracfg for PHY switching, so allow that property. Signed-off-by: Sjoerd Simons <sjoerd@collabora.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://patch.msgid.link/20251115-openwrt-one-network-v4-6-48cbda2969ac@collabora.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | Merge branch 'net-stmmac-sanitise-stmmac_is_jumbo_frm' | Jakub Kicinski | 4 | -16/+10
Russell King says: ==================== net: stmmac: sanitise stmmac_is_jumbo_frm() stmmac_is_jumbo_frm() takes skb->len, which is unsigned int, but the parameter is passed as an "int" and then tested using signed comparisons. This can cause bugs. Change the parameter to be unsigned. Also arrange for it to return a bool. ==================== Link: https://patch.msgid.link/aRxDqJSWxOdOaRt4@shell.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | net: stmmac: stmmac_is_jumbo_frm() returns boolean | Russell King (Oracle) | 4 | -16/+10
stmmac_is_jumbo_frm() returns whether the driver considers the frame size to be a jumbo frame, and thus returns 0/1 values. This is boolean, so convert it to return a boolean and use false/true instead. Also convert stmmac_xmit()'s is_jumbo to be bool, which causes several variables to be repositioned to keep it in reverse Christmas-tree order. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1vLIWW-0000000Ewkl-21Ia@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | net: stmmac: stmmac_is_jumbo_frm() len should be unsigned | Russell King (Oracle) | 3 | -3/+3
stmmac_is_jumbo_frm() and the is_jumbo_frm() methods take skb->len which is an unsigned int. Avoid an implicit cast to "int" via the method parameter and then incorrectly doing signed comparisons on this unsigned value. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1vLIWR-0000000Ewkf-1Tdx@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
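A before/after sketch of the signature change across these two patches; BUF_SIZE_4KiB stands in for the driver's actual threshold logic:

    /* before: skb->len is silently converted to int, then compared signed */
    static int old_is_jumbo_frm(int len, int enh_desc)
    {
        return len >= BUF_SIZE_4KiB;
    }

    /* after: unsigned length in, boolean verdict out */
    static bool new_is_jumbo_frm(unsigned int len, int enh_desc)
    {
        return len >= BUF_SIZE_4KiB;
    }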
2025-11-19 | net: stmmac: convert priv->sph* to boolean and rename | Russell King (Oracle) | 4 | -17/+17
The priv->sph* variables only ever hold 'true' and 'false', yet they are declared as int. Change their type to bool, and rename them to make their usage clearer. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1vLIDN-0000000Evur-2NLU@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | selftests: fib_tests: add fib6 from ra to static test | Fernando Fernandez Mancera | 1 | -1/+65
The new test checks that a route that has been promoted from RA-learned to static does not switch back when a new RA message arrives. In addition, it checks that the route is owned by RA again when the static address is removed. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20251115095939.6967-2-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | ipv6: clear RA flags when adding a static route | Fernando Fernandez Mancera | 1 | -0/+4
When an IPv6 Router Advertisement (RA) is received for a prefix, the kernel creates the corresponding on-link route with the RTF_ADDRCONF and RTF_PREFIX_RT flags configured, plus RTF_EXPIRES if a lifetime is set. If a user later configures a static IPv6 address on the same prefix, the kernel clears the RTF_EXPIRES flag but doesn't clear RTF_ADDRCONF and RTF_PREFIX_RT. When the next RA for that prefix is received, the kernel sees the route as RA-learned and wrongly configures the lifetime again. This is problematic because if the route expires, the static address won't have the corresponding on-link route. This fix clears the RTF_ADDRCONF and RTF_PREFIX_RT flags, preventing the lifetime from being configured when the next RA arrives. If the static address is deleted, the route becomes RA-learned again. Fixes: 14ef37b6d00e ("ipv6: fix route lookup in addrconf_prefix_rcv()") Reported-by: Garri Djavadyan <g.djavadyan@gmail.com> Closes: https://lore.kernel.org/netdev/ba807d39aca5b4dcf395cc11dca61a130a52cfd3.camel@gmail.com/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20251115095939.6967-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
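A sketch of the fix as a hypothetical helper; the exact call site in the addrconf path is not shown:

    static void fib6_route_make_static(struct fib6_info *rt)
    {
        /* already done today when the static address is added */
        fib6_clean_expires(rt);
        /* the fix: drop the RA-owned markers so the next RA does
         * not re-arm the route's lifetime
         */
        rt->fib6_flags &= ~(RTF_ADDRCONF | RTF_PREFIX_RT);
    }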
2025-11-19 | Merge branch 'af_unix-gc-cleanup-and-optimisation' | Jakub Kicinski | 3 | -52/+51
Kuniyuki Iwashima says: ==================== af_unix: GC cleanup and optimisation. Currently, AF_UNIX GC is triggered from close() and sendmsg() based on the number of inflight AF_UNIX sockets. This is because the old GC implementation had no idea of the shape of the graph formed by SCM_RIGHTS references. The new GC knows whether cyclic references (could) exist. This series refines such conditions not to trigger GC unless really needed. ==================== Link: https://patch.msgid.link/20251115020935.2643121-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | af_unix: Consolidate unix_schedule_gc() and wait_for_unix_gc(). | Kuniyuki Iwashima | 3 | -21/+11
unix_schedule_gc() and wait_for_unix_gc() share some code. Let's consolidate the two. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251115020935.2643121-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | af_unix: Remove unix_tot_inflight. | Kuniyuki Iwashima | 1 | -3/+0
unix_tot_inflight is no longer used. Let's remove it. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251115020935.2643121-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | af_unix: Refine wait_for_unix_gc(). | Kuniyuki Iwashima | 1 | -13/+8
unix_tot_inflight is a poor metric, telling only the number of inflight AF_UNIX sockets, and we should use unix_graph_state instead. Also, if the receiver is catching up with the passed fds, the sender does not need to schedule GC. GC only helps unreferenced cyclic SCM_RIGHTS references, and in such a situation, a malicious sendmsg() will continue to call wait_for_unix_gc() and hit the UNIX_INFLIGHT_SANE_USER condition. Let's make only malicious users schedule GC and wait for it to finish if a cyclic reference existed during the previous GC run. Then, sane users pay almost no cost for wait_for_unix_gc(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251115020935.2643121-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | af_unix: Don't call wait_for_unix_gc() on every sendmsg(). | Kuniyuki Iwashima | 3 | -8/+6
We have been calling wait_for_unix_gc() on every sendmsg() in case there are too many inflight AF_UNIX sockets. This is also because the old GC implementation had poor knowledge of the inflight sockets and had to suspect every sendmsg(). This was improved by commit d9f21b361333 ("af_unix: Try to run GC async."), but we do not even need to call wait_for_unix_gc() if the process is not sending AF_UNIX sockets. The wait_for_unix_gc() call only helps when a malicious process continues to create cyclic references, and we can detect that in a better place and slow it down. Let's move wait_for_unix_gc() to unix_prepare_fpl() that is called only when AF_UNIX socket fd is passed via SCM_RIGHTS. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251115020935.2643121-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | af_unix: Don't trigger GC from close() if unnecessary. | Kuniyuki Iwashima | 3 | -14/+19
We have been triggering GC on every close() if there is even one inflight AF_UNIX socket. This is because the old GC implementation had no idea of the graph shape formed by SCM_RIGHTS references. The new GC knows whether there could be a cyclic reference or not, and we can do better. Let's not trigger GC from close() if there is no cyclic reference or GC is already in progress. While at it, unix_gc() is renamed to unix_schedule_gc() as it does not actually perform GC since commit 8b90a9f819dc ("af_unix: Run GC on only one CPU."). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251115020935.2643121-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | af_unix: Simplify GC state. | Kuniyuki Iwashima | 1 | -9/+12
GC manages its state with two variables, unix_graph_maybe_cyclic and unix_graph_grouped, both of which are set to false in the initial state. When an AF_UNIX socket is passed to an in-flight AF_UNIX socket, unix_update_graph() sets unix_graph_maybe_cyclic to true and unix_graph_grouped to false, making the next GC invocation call unix_walk_scc() to group SCCs. Once unix_walk_scc() finishes, sockets in the same SCC are linked via vertex->scc_entry. Then, unix_graph_grouped is set to true so that the following GC invocations can skip Tarjan's algorithm and simply iterate through the list in unix_walk_scc_fast(). In addition, if we know there is at least one cyclic reference, we set unix_graph_maybe_cyclic to true so that we do not skip GC. So the state (unix_graph_maybe_cyclic, unix_graph_grouped) transitions as follows: (false, false) -> (true, false) -> (true, true) or (false, true), with the latter two states transitioning back to (true, false) whenever the graph changes again. There is no transition back to the initial state where both variables are false. If we consider the initial state as grouped, we can see that the GC actually has a tristate. Let's consolidate the two variables into one enum. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251115020935.2643121-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
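A sketch of the consolidated tristate; the enum and state names are illustrative:

    enum unix_graph_state {
        UNIX_GRAPH_NOT_CYCLIC,      /* grouped, no cyclic SCC found */
        UNIX_GRAPH_MAYBE_CYCLIC,    /* edges changed, must re-run Tarjan */
        UNIX_GRAPH_CYCLIC,          /* grouped, at least one cyclic SCC */
    };

    /* the initial state counts as "grouped with no cycles" */
    static enum unix_graph_state unix_graph_state = UNIX_GRAPH_NOT_CYCLIC;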
2025-11-19 | af_unix: Count cyclic SCC. | Kuniyuki Iwashima | 1 | -10/+21
__unix_walk_scc() and unix_walk_scc_fast() call unix_scc_cyclic() for each SCC to check if it forms a cyclic reference, so that we can skip GC on the following invocations when no SCC has a cycle. If we count the number of cyclic SCCs in __unix_walk_scc(), we can simplify unix_walk_scc_fast(), because the number of cyclic SCCs only changes when a SCC is garbage-collected. So, let's count cyclic SCCs in __unix_walk_scc() and decrement the counter in unix_walk_scc_fast() when performing garbage collection. Note that we will use this counter in a later patch to check whether a cycle existed in the previous GC run. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251115020935.2643121-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | Merge branch 'net-mlx5-misc-changes-2025-11-17' | Jakub Kicinski | 14 | -51/+111
Tariq Toukan says: ==================== net/mlx5: misc changes 2025-11-17 This series contains misc enhancements to the mlx5 driver. ==================== Link: https://patch.msgid.link/1763415729-1238421-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | net/mlx5: Use EOPNOTSUPP instead of ENOTSUPP | Tariq Toukan | 5 | -8/+8
Per Documentation/dev-tools/checkpatch.rst, ENOTSUPP is not a standard error code and should be avoided. EOPNOTSUPP should be used instead. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/1763415729-1238421-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | net/mlx5: Abort new commands if all command slots are stalled | Saeed Mahameed | 2 | -0/+56
In case of a FW issue, the FW might stop responding to commands, locking out the kernel for a long period of time, e.g. rtnl_lock held while ethtool is trying to collect stats, waiting for the FW to respond to multiple commands that will all time out. While there is no immediate indication of the FW lockout, we can safely assume that something is wrong when all command slots are busy and in a timeout state and no FW completion was received on any of them. In such a case, start immediately failing new commands. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763415729-1238421-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
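A sketch of the bail-out condition; the field names are illustrative placeholders, not the mlx5 command-interface internals:

    static bool mlx5_cmd_all_slots_stalled(struct mlx5_cmd *cmd)
    {
        /* no free slot, every in-flight command already timed out,
         * and no FW completion has arrived on any of them
         */
        return !cmd->free_slots &&
               cmd->timed_out_cmds == cmd->max_slots &&
               !cmd->fw_completion_seen;
    }

    /* new commands then fail fast instead of queueing behind a dead FW */
    if (mlx5_cmd_all_slots_stalled(cmd))
        return -EIO;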
2025-11-19 | net/mlx5: Remove redundant bw_share minimal value assignment | Carolina Jubran | 1 | -7/+0
Remove the unnecessary logic that sets bw_share to the minimal value when the parent has bw_share configured but the nodes don't have min_rate. This check is redundant because the parent bandwidth acts as the upper bound regardless, and the firmware always enforces the topmost bandwidth constraint. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763415729-1238421-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | net/mlx5e: Recover SQ on excessive PTP TX timestamp delta | Carolina Jubran | 3 | -8/+17
Extend the TX timestamp handler to recover the SQ when the difference between the port and CQE TX timestamps is abnormally large. The current logic aborts timestamp delivery if the delta exceeds 1/128 seconds, which matches the maximum expected packet interval in ptp4l. A larger delta makes the timestamps unreliable. This change adds recovery if the delta exceeds 0.5 seconds. Such a large gap should not occur in normal operation and indicates that firmware is stuck or metadata tracking is out of sync, leading to stale or mismatched timestamps. Recovering the SQ ensures forward progress and avoids silently dropping invalid timestamps. The timestamp handler now takes mlx5e_ptpsq directly to access both CQ stats and the recovery state. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763415729-1238421-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
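A sketch of the threshold logic; the constants follow the 1/128 s and 0.5 s values in the commit, while the helper names are illustrative:

    #define MLX5E_PTP_TS_DELTA_ABORT_NS      (NSEC_PER_SEC / 128)
    #define MLX5E_PTP_TS_DELTA_RECOVER_NS    (NSEC_PER_SEC / 2)

    static void mlx5e_ptp_handle_ts_delta(struct mlx5e_ptpsq *ptpsq,
                                          u64 port_ts, u64 cqe_ts)
    {
        u64 delta = port_ts > cqe_ts ? port_ts - cqe_ts
                                     : cqe_ts - port_ts;

        if (delta <= MLX5E_PTP_TS_DELTA_ABORT_NS)
            return;                          /* timestamp is usable */

        /* beyond 1/128 s: abort delivery, the timestamp is unreliable */
        if (delta > MLX5E_PTP_TS_DELTA_RECOVER_NS)
            mlx5e_ptpsq_recover(ptpsq);      /* hypothetical: queue SQ recovery */
    }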
2025-11-19 | net/mlx5: Refactor EEPROM query error handling to return status separately | Gal Pressman | 3 | -28/+30
Matthew and Jakub reported [1] issues where inventory automation tools are calling EEPROM query repeatedly on a port that doesn't have an SFP connected, resulting in millions of error prints. Move MCIA register status extraction from the query functions to the callers, allowing use of extack reporting instead of a dmesg print when using the netlink API. [1] https://lore.kernel.org/netdev/20251028194011.39877-1-mattc@purestorage.com/ Cc: Matthew W Carlis <mattc@purestorage.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763415729-1238421-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
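A sketch of the caller side after the refactor, assuming the query helper gains a status out-parameter (NL_SET_ERR_MSG_FMT_MOD is the stock extack helper):

    /* caller side, e.g. the netlink module-EEPROM path: */
    u8 status = 0;
    int err;

    err = mlx5_query_module_eeprom_by_page(mdev, &params, data, &status);
    if (err)
        return err;

    if (status) {
        /* module/page problem, not a driver bug: report via extack,
         * no dmesg spam
         */
        NL_SET_ERR_MSG_FMT_MOD(extack,
                               "EEPROM query failed, MCIA status 0x%x",
                               status);
        return -EIO;
    }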
2025-11-19 | netlink: specs: support ipv4-or-v6 for dual-stack fields | Hangbin Liu | 8 | -21/+23
Since commit 1b255e1beabf ("tools: ynl: add ipv4-or-v6 display hint"), we can display either IPv4 or IPv6 addresses for a single field based on the address family. However, most dual-stack fields still use the ipv4 display hint. This update changes them to use the new ipv4-or-v6 display hint and converts IPv4-only fields to use the u32 type. Field changes:
- v4-or-v6:
  - IFA_ADDRESS, IFA_LOCAL
  - IFLA_GRE_LOCAL, IFLA_GRE_REMOTE
  - IFLA_VTI_LOCAL, IFLA_VTI_REMOTE
  - IFLA_IPTUN_LOCAL, IFLA_IPTUN_REMOTE
  - NDA_DST
  - RTA_DST, RTA_SRC, RTA_GATEWAY, RTA_PREFSRC
  - FRA_SRC, FRA_DST
- ipv4:
  - IFA_BROADCAST
  - IFLA_GENEVE_REMOTE
  - IFLA_IPTUN_6RD_RELAY_PREFIX
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20251117024457.3034-3-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | tools: ynl: Add MAC address parsing support | Hangbin Liu | 1 | -0/+9
Add missing support for parsing MAC addresses when display_hint is 'mac' in the YNL library. This enables YNL CLI to accept MAC address strings for attributes like lladdr in rt-neigh operations. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20251117024457.3034-2-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | Merge branch 'net-expand-napi_skb_cache-use' | Jakub Kicinski | 1 | -17/+31
Eric Dumazet says: ==================== net: expand napi_skb_cache use This is a followup of commit e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs"). Now the per-cpu napi_skb_cache is populated from TX completion path, we can make use of this cache, especially for cpus not used from a driver NAPI poll (primary user of napi_cache). With this series, I consistently reach 130 Mpps on my UDP tx stress test and reduce SLUB spinlock contention to smaller values. ==================== Link: https://patch.msgid.link/20251116202717.1542829-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | net: use napi_skb_cache even in process context | Eric Dumazet | 1 | -0/+5
This is a followup of commit e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs"). Now the per-cpu napi_skb_cache is populated from TX completion path, we can make use of this cache, especially for cpus not used from a driver NAPI poll (primary user of napi_cache). We can use the napi_skb_cache only if current context is not from hard irq. With this patch, I consistently reach 130 Mpps on my UDP tx stress test and reduce SLUB spinlock contention to smaller values. Note there is still some SLUB contention for skb->head allocations. I had to tune /sys/kernel/slab/skbuff_small_head/cpu_partial and /sys/kernel/slab/skbuff_small_head/min_partial depending on the platform taxonomy. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251116202717.1542829-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
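A sketch of the gate added in __alloc_skb(), assuming bh protection around the per-cpu pop; in hard-irq context the cache is left alone:

    struct sk_buff *skb = NULL;

    if (!in_hardirq()) {
        local_bh_disable();
        skb = napi_skb_cache_get();    /* may return NULL */
        local_bh_enable();
    }
    if (!skb)
        skb = kmem_cache_alloc_node(net_hotdata.skbuff_cache,
                                    gfp_mask, node);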
2025-11-19 | net: __alloc_skb() cleanup | Eric Dumazet | 1 | -10/+18
This patch refactors __alloc_skb() to prepare the following one, and does not change functionality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251116202717.1542829-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 | net: add a new @alloc parameter to napi_skb_cache_get() | Eric Dumazet | 1 | -7/+8
We want to be able, in the last patch of the series, to get an skb from the napi_skb_cache in process context, if there is one available. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251116202717.1542829-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>