Age | Commit message | Author | Files changed | Lines (-/+)

2025-05-15 | ovpn: improve 'no route to host' debug message | Antonio Quartulli | 2 | -3/+13

When debugging a 'no route to host' error it can be beneficial to know the address of the unreachable destination. Print it along with the debugging text.

While at it, add a missing parenthesis in a different debugging message inside ovpn_peer_endpoints_update().

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

2025-05-15 | ovpn: drop useless reg_state check in keepalive worker | Antonio Quartulli | 1 | -2/+1

The keepalive worker is cancelled before calling unregister_netdevice_queue(), therefore it will never hit a situation where the reg_state can be different than NETDEV_REGISTERED. For this reason, checking reg_state is useless and the condition can be removed.

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

2025-05-15 | selftest/net/ovpn: extend coverage with more test cases | Antonio Quartulli | 5 | -11/+34

To increase code coverage, extend the ovpn selftests with the following cases:

* connect UDP peers using a mix of IPv6 and IPv4 at the transport layer
* run full test with tunnel MTU equal to transport MTU (exercising IP layer fragmentation)
* ping "LAN IP" served by VPN peer ("LAN behind a client" test case)

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

2025-05-15 | ovpn: fix ndo_start_xmit return value on error | Antonio Quartulli | 1 | -1/+1

ndo_start_xmit is basically expected to always return NETDEV_TX_OK. However, in case of error, it was currently returning NET_XMIT_DROP, which is not a valid netdev_tx_t return value, leading to misinterpretation. Change ndo_start_xmit to always return NETDEV_TX_OK to signal back to the caller that the packet was handled (even if dropped).

Effects of this bug can be seen when sending IPv6 packets having no peer to forward them to:

  $ ip netns exec ovpn-server oping -c20 fd00:abcd:220:201::1
  PING fd00:abcd:220:201::1 (fd00:abcd:220:201::1) 56 bytes of data.
  ping_send failed: No buffer space available
  ping_sendto: No buffer space available
  ping_send failed: No buffer space available
  ping_sendto: No buffer space available
  ...

Fixes: c2d950c4672a ("ovpn: add basic interface creation/destruction/management routines")
Reported-by: Gert Doering <gert@greenie.muc.de>
Closes: https://github.com/OpenVPN/ovpn-net-next/issues/5
Tested-by: Gert Doering <gert@greenie.muc.de>
Acked-by: Gert Doering <gert@greenie.muc.de>
Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31591.html
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

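A minimal sketch of the corrected error path (the function layout is illustrative; only the return value is the point of the fix):

  static netdev_tx_t ovpn_net_xmit(struct sk_buff *skb, struct net_device *dev)
  {
          /* ... encapsulation and peer lookup ... */

  drop:
          dev_core_stats_tx_dropped_inc(dev);
          kfree_skb(skb);
          /* NETDEV_TX_OK tells the stack the skb was consumed, even if
           * it was dropped; NET_XMIT_DROP is not a valid netdev_tx_t
           * value here and gets misinterpreted by callers.
           */
          return NETDEV_TX_OK;
  }
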
2025-05-15 | selftest/net/ovpn: fix crash in case of getaddrinfo() failure | Antonio Quartulli | 1 | -2/+8

getaddrinfo() may fail with an error code different from EAI_FAIL or EAI_NONAME, however in this case we still try to free the results object, thus leading to a crash. Fix this by bailing out on any possible error.

Fixes: 959bc330a439 ("testing/selftests: add test tool and scripts for ovpn module")
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

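The pattern of the fix, sketched (variable names illustrative): 'results' is only valid when getaddrinfo() returns 0, so any non-zero return must bail out before freeing it.

  struct addrinfo hints = { .ai_family = AF_UNSPEC, .ai_socktype = SOCK_DGRAM };
  struct addrinfo *results;
  int ret = getaddrinfo(host, port, &hints, &results);

  if (ret) {
          /* bail out on ANY error: 'results' was never populated */
          fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(ret));
          return -1;
  }
  /* ... use results ... */
  freeaddrinfo(results);
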
2025-05-15 | ovpn: don't drop skb's dst when xmitting packet | Antonio Quartulli | 2 | -0/+7

When routing a packet to a LAN behind a peer, ovpn needs to inspect the route entry that brought the packet there in the first place. If this packet is truly routable, the route entry provides the GW to be used when looking up the VPN peer to send the packet to.

However, the route entry is currently dropped before entering the ovpn xmit function, because the IFF_XMIT_DST_RELEASE priv_flag is enabled by default.

Clear the IFF_XMIT_DST_RELEASE flag during interface setup to allow the route entry (skb's dst) to survive and thus be inspected by the ovpn routing logic.

Fixes: a3aaef8cd173 ("ovpn: implement peer lookup logic")
Reported-by: Gert Doering <gert@greenie.muc.de>
Closes: https://github.com/OpenVPN/ovpn-net-next/issues/2
Tested-by: Gert Doering <gert@greenie.muc.de>
Acked-by: Gert Doering <gert@greenie.muc.de> # as a primary user
Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31583.html
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

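A sketch of the setup change (function name and placement illustrative; clearing the flag is what matters):

  static void ovpn_setup(struct net_device *dev)
  {
          /* ... */
          /* keep the skb's dst across ndo_start_xmit so the peer lookup
           * can inspect the route entry that brought the packet here
           */
          dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
  }
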
2025-05-15 | ovpn: set skb->ignore_df = 1 before sending IPv6 packets out | Antonio Quartulli | 1 | -0/+10

IPv6 user packets (sent over the tunnel) may be larger than the outgoing interface MTU after encapsulation. When this happens ovpn should allow the kernel to fragment them because they are "locally generated".

To achieve the above, we must set skb->ignore_df = 1 so that ip6_fragment() can be made aware of this decision. Failing to do so will result in ip6_fragment() dropping the packet thinking it was "routed".

No change is required in the IPv4 path, because when calling udp_tunnel_xmit_skb() we already pass the 'df' argument set to 0, therefore the resulting datagram is allowed to be fragmented if need be.

Fixes: 08857b5ec5d9 ("ovpn: implement basic TX path (UDP)")
Reported-by: Gert Doering <gert@greenie.muc.de>
Closes: https://github.com/OpenVPN/ovpn-net-next/issues/3
Tested-by: Gert Doering <gert@greenie.muc.de>
Acked-by: Gert Doering <gert@greenie.muc.de> # as primary user
Link: https://mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31577.html
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

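A sketch of the IPv6 TX change (the argument list follows udp_tunnel6_xmit_skb(); the surrounding variables are illustrative):

  /* encapsulated IPv6 packets are locally generated: let ip6_fragment()
   * fragment them instead of dropping them as if they were "routed"
   */
  skb->ignore_df = 1;
  udp_tunnel6_xmit_skb(dst, sk, skb, dev, &saddr, &daddr, prio, ttl,
                       label, sport, dport, nocheck);
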
2025-05-15 | MAINTAINERS: update git URL for ovpn | Antonio Quartulli | 1 | -1/+1

Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

2025-05-15 | MAINTAINERS: add Sabrina as official reviewer for ovpn | Antonio Quartulli | 1 | -0/+1

Sabrina put considerable effort into reviewing the ovpn module during its official submission to netdev. In doing so she gained extensive knowledge of the module architecture and implementation.

Make her an official reviewer, so that she can support me in reviewing and acking new patches.

Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>

2025-05-14 | net: enetc: fix implicit declaration of function FIELD_PREP | Wei Fang | 1 | -0/+1

The kernel test robot reported the following error:

  drivers/net/ethernet/freescale/enetc/ntmp.c: In function 'ntmp_fill_request_hdr':
  drivers/net/ethernet/freescale/enetc/ntmp.c:203:38: error: implicit declaration of function 'FIELD_PREP' [-Wimplicit-function-declaration]
    203 |         cbd->req_hdr.access_method = FIELD_PREP(NTMP_ACCESS_METHOD,
        |                                      ^~~~~~~~~~

Therefore, add "bitfield.h" to ntmp_private.h to fix this issue.

Fixes: 4701073c3deb ("net: enetc: add initial netc-lib driver to support NTMP")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505101047.NTMcerZE-lkp@intel.com/
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

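For reference, FIELD_PREP() is provided by <linux/bitfield.h>, so the fix is a single include in ntmp_private.h. FIELD_PREP() packs a value into the bits selected by a mask (the mask definition below is illustrative, not the driver's actual value):

  #include <linux/bitfield.h>

  #define NTMP_ACCESS_METHOD GENMASK(31, 28)  /* illustrative mask */

  /* shifts 'method' into the bit positions selected by the mask */
  cbd->req_hdr.access_method = FIELD_PREP(NTMP_ACCESS_METHOD, method);
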
2025-05-14 | net: wangxun: Correct clerical errors in comments | Jiawen Wu | 2 | -2/+2

There are wrong "#endif" comments in .h files that need to be corrected.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2025-05-14 | net: phy: remove stub for mdiobus_register_board_info | Heiner Kallweit | 1 | -9/+0

The functionality of mdiobus_register_board_info() typically isn't optional for the caller. Therefore remove the stub.

Note: Currently we have only one caller of mdiobus_register_board_info(), in a DSA/PHYLINK context. Therefore CONFIG_MDIO_DEVICE is selected anyway.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/410a2222-c4e8-45b0-9091-d49674caeb00@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | net: mlxsw: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set() | Vladimir Oltean | 4 | -72/+48

New timestamping API was introduced in commit 66f7223039c0 ("net: add NDOs for configuring hardware timestamping") from kernel v6.6. It is time to convert the mlxsw driver to the new API, so that the ndo_eth_ioctl() path can be removed completely.

The UAPI is still ioctl-only, but it's best to remove the "ioctl" mentions from the driver in case a netlink variant appears.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250512154411.848614-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | net: ipa: Make the SMEM item ID constant | Konrad Dybcio | 11 | -21/+11

It can't vary, stop storing the same magic number everywhere.

Signed-off-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Alex Elder <elder@kernel.org>
Link: https://patch.msgid.link/20250512-topic-ipa_smem-v1-1-302679514a0d@oss.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | net: enetc: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set() | Vladimir Oltean | 4 | -26/+31

New timestamping API was introduced in commit 66f7223039c0 ("net: add NDOs for configuring hardware timestamping") from kernel v6.6. It is time to convert the ENETC driver to the new API, so that the ndo_eth_ioctl() path can be removed completely.

Move the enetc_hwtstamp_get() and enetc_hwtstamp_set() calls away from enetc_ioctl() to dedicated net_device_ops for the LS1028A PF and VF (NETC v4 does not yet implement enetc_ioctl()), adapt the prototypes and export these symbols (enetc_ioctl() is also exported).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20250512112402.4100618-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | net: txgbe: Fix pending interrupt | Jiawen Wu | 1 | -6/+1

For unknown reasons, sometimes the value of the MISC interrupt is 0 in the IRQ handler. In this case, wx_intr_enable() should also be invoked to clear the interrupt. Otherwise, the next interrupt would never be reported.

Fixes: a9843689e2de ("net: txgbe: add sriov function support")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/F4F708403CE7090B+20250512100652.139510-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | Merge branch 'net-mlx5-hws-complex-matchers-and-rehash-mechanism-fixes' | Jakub Kicinski | 13 | -179/+2178

Tariq Toukan says:

====================
net/mlx5: HWS, Complex Matchers and rehash mechanism fixes

Motivation:
-----------
A matcher can match a certain set of match parameters. However, the number and size of match params for a single matcher are limited — all the parameters must fit within a single definer. A common example of this limitation is IPv6 address matching, where matching both source and destination IPs requires more bits than a single definer can support.

SW Steering addresses this limitation by chaining multiple Steering Table Entries (STEs) within the same matcher, where each STE matches on a subset of the parameters. In HW Steering, such chaining is not possible — the matcher's STEs are managed in a hash table, and a single definer is used to calculate the hash index for STEs.

Overview:
---------
To address this limitation in HW Steering, we introduce *Complex Matchers*, which consist of two chained matchers. This allows matching on twice as many parameters. Complex Matchers are filled with *Complex Rules* — rules that are split into two parts and inserted into their respective matchers.

The first half of the Complex Matcher is a regular matcher and points to the second half, which is an *Isolated Matcher*. An Isolated Matcher has its own isolated table and is accessible only by traffic coming from the first half of the Complex Matcher.

This splitting of matchers/rules into multiple parts is transparent to users. It is hidden behind the BWC HWS API. It becomes visible only when dumping steering debug information, where the Complex Matcher appears as two separate matchers: one in the user-created table and another in its isolated table.

Implementation Details:
-----------------------
All user actions are performed on the second part of the rules only. The first part handles matching and applies two actions: modify header (set metadata, see details below) and go-to-table (directing traffic to the isolated table containing the isolated matcher).

Rule updates (updating rule actions) are applied to the second part of the rule since user-provided actions are not executed in the first matcher. We use the REG_C_6 metadata register to set and match on a unique per-rule tag (see details below).

Splitting rules into two parts introduces new challenges:

1. Invalid Combinations

Consider two rules with different matching values:
- Rule 1: A+B
- Rule 2: C+D

Let's split the rules into two parts as follows:

  |-----Complex Matcher-------|
  |                           |
  | 1st matcher   2nd matcher |
  |   |---|         |---|     |
  |   | A |         | B |     |
  |   |---| ----->  |---|     |
  |   | C |         | D |     |
  |   |---|         |---|     |
  |                           |
  |---------------------------|

Splitting these rules results in invalid combinations: A+D and C+B. Any packet that matched on A will be forwarded to the 2nd matcher, where it will try to match on B (which is legal, and it is what the user asked for), but it will also try to match on D (which is not what the user asked for).

To resolve this, we assign unique tags to each rule on the first matcher and match on these tags on the second matcher:

  |----------|     |---------|
  | A        |     | B, TagA |
  | action:  |     |         |
  | set TagA |     |         |
  |----------| --> |---------|
  | C        |     | D, TagB |
  | action:  |     |         |
  | set TagB |     |         |
  |----------|     |---------|

2. Duplicated Entries

Consider two rules with overlapping values:
- Rule 1: A+B
- Rule 2: A+D

Let's split the rules into two parts as follows:

  |---|     |---|
  | A |     | B |
  |---| --> |---|
  |   |     | D |
  |---|     |---|

This leads to duplicated entries on the first matcher, which HWS doesn't allow: a subsequent delete of either of the rules would delete the only entry in the first matcher, leaving the remaining rule broken. To address this, we use a reference count for entries in the first matcher and delete STEs only when their refcount reaches zero.

Both challenges are resolved by having a per-matcher data structure (implemented with an rhashtable) that manages refcounts for the first part of the rules and holds unique tags (managed via IDA) for these rules to set and to match on the second matcher.

Limitations:
------------
We utilize metadata register REG_C_6 in this implementation, so its usage anywhere along a flow that might include the need for a Complex Matcher is prohibited.

The number and size of match parameters remain limited — now constrained by what can be represented by two definers instead of one. This architectural limitation arises from the structure of Complex Matchers. If future requirements demand more parameters, Complex Matchers can be extended beyond two matchers.

Additionally, there is an implementation limit of 32 match parameters per matcher (disregarding parameter size). This limit can be lifted if needed.

Patches:
--------
- Patches 1-3: small additions/refactoring in preparation for Complex Matcher: expose mlx5hws_table_ft_set_next_ft() in a header, add a definer function to convert a field name enum to a string, expose the polling function mlx5hws_bwc_queue_poll() in a header.
- Patch 4: in preparation for Complex Matcher, add support for the Isolated Matcher.
- Patch 5: the main patch - Complex Matchers implementation.
- Patch 6: fix the usecase where rule insertion was failing, but rehash couldn't be initiated if the number of rules in the table was below the rehash threshold.
- Patch 7: fix the usecase where many rules in parallel would require rehash, due to the way the counting of rules was done.
- Patch 8: fix the case where rules were requiring action template extension in parallel, leading to unneeded extensions with the same templates.
- Patch 9: refactor and simplify the rehash loop.
- Patch 10: dump error completion details, which helps a lot in trying to understand what went wrong, especially during rehash.
====================

Link: https://patch.msgid.link/1746992290-568936-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | net/mlx5: HWS, dump bad completion details | Yevgeny Kliteynik | 2 | -3/+120

Failing to insert/delete a rule should not happen. If it does happen, it would be good to know at which stage it happened and what was the failure. This patch adds printing of bad CQE details.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-11-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | net/mlx5: HWS, rework rehash loop | Yevgeny Kliteynik | 1 | -75/+52

Rework the rehash loop, simplifying the code and making it less error prone:

- Instead of doing round-robin on all the queues with a batch of rules in each cycle, just go over all the queues and move all the rules that belong to this queue.
- If at some stage of moving a rule we get a failure (which should not happen), this can't be rolled back. So instead of aborting rehash and leaving the matcher in a broken state, allow the loop to continue: attempt to move the rest of the rules and delete the old matcher. A rule that failed to move to a new matcher will lose its match STE once the rehash is completed and the old matcher is deleted, so the rule won't match any traffic any more. This rule's packets will fall back to the steering pipeline w/o HW offload. The rehash procedure will return an error, which will cause the rule insertion to fail for the rule that started this whole rehash.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-10-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | net/mlx5: HWS, fix redundant extension of action templates | Yevgeny Kliteynik | 1 | -51/+54

When a rule is inserted into a matcher, we search for a suitable action template. If such a template is not found, the action template array is extended with the new template. However, when several threads are performing this in parallel, there is a race: we can end up extending the action template array with the same template more than once.

This patch does the following:

- refactor the code to find the action template index in rule create and update, moving the common code into an auxiliary function
- after locking all the queues, check again whether the action template array still needs to be extended

Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-9-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | net/mlx5: HWS, fix counting of rules in the matcher | Yevgeny Kliteynik | 1 | -5/+5

Currently the counter that counts the number of rules in a matcher is increased only when rule insertion is completed. In a multi-threaded usecase this can lead to a scenario where many rules are in the process of insertion in the same matcher, while none of them has completed the insertion and the rule counter is not updated. This results in a rule insertion failure for many of them at the first attempt, which leads to all of them requiring rehash and requiring locking of all the queue locks.

This patch fixes the case by increasing the rule counter at the beginning of the insertion process and decreasing it in case of any failure.

Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-8-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

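A sketch of the counting change (field and function names are illustrative, not the exact mlx5 code): count the rule up-front so concurrent inserters see the true load, and roll back on any failure path.

  /* count the rule before insertion starts, not after it completes */
  atomic_inc(&bwc_matcher->num_of_rules);

  ret = bwc_rule_create_sync(bwc_rule);
  if (ret) {
          /* roll back the optimistic increment on any failure */
          atomic_dec(&bwc_matcher->num_of_rules);
          return ret;
  }
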
2025-05-14 | net/mlx5: HWS, force rehash when rule insertion failed | Yevgeny Kliteynik | 2 | -2/+7

Rules are inserted into the hash table in accordance with their hash index. When a certain number of rules is reached, the table is rehashed: a bigger new table is allocated and all the rules are moved there.

But sometimes a new rule can't be inserted into the hash table because its index is full, even though the number of rules in the table is well below the threshold. The hash function is not perfect, so such cases are not rare. When that happens, we want to do the same rehash, in order to increase the table size and lower the probability for such cases.

This patch fixes the usecase where rule insertion was failing but rehash couldn't be initiated due to a low number of rules: it adds a flag that denotes that rehash is required, even if the number of rules in the table is below the rehash threshold.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-7-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

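The idea, sketched (the flag name, the helper names, and the failure test are illustrative assumptions, not the upstream identifiers):

  /* a full hash index means we want a rehash even below the threshold */
  bool index_full = insertion_failed_on_full_index(ret); /* hypothetical */

  if (index_full)
          bwc_matcher->rehash_required = true;

  if (bwc_matcher->rehash_required ||
      atomic_read(&bwc_matcher->num_of_rules) > rehash_threshold(bwc_matcher))
          ret = bwc_matcher_rehash_size(bwc_matcher);
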
2025-05-14 | net/mlx5: HWS, support complex matchers | Yevgeny Kliteynik | 4 | -27/+1410

This patch adds support for Complex Matchers/Rules.

Overview:
---------
A matcher can match on a certain set of match parameters. However, the number and size of match params for a single matcher are limited: all the parameters must fit within a single definer. A common example of this limitation is IPv6 address matching, where matching both source and destination IPs requires more bits than a single definer can support.

SW Steering addresses this limitation by chaining multiple Steering Table Entries (STEs) within the same matcher, where each STE matches on a subset of the parameters. In HW Steering, such chaining is not possible — the matcher's STEs are managed in a hash table, and a single definer is used to calculate the hash index for STEs.

To address this limitation in HW Steering, we introduce Complex Matchers, which consist of two chained matchers. This allows matching on twice as many parameters. Complex Matchers are filled with Complex Rules — rules that are split into two parts and inserted into their respective matchers.

The first half of the Complex Matcher is a regular matcher and points to the second half, which is an Isolated Matcher. An Isolated Matcher has its own isolated table and is accessible only by traffic coming from the first half of the Complex Matcher.

This splitting of matchers/rules into multiple parts is transparent to users. It is hidden under the BWC HWS API. It becomes visible only when dumping steering debug information, where the Complex Matcher appears as two separate matchers: one in the user-created table and another in its isolated table.

Some implementation details:
----------------------------
All user actions are performed on the second part of the rules only. The first part handles matching and applies two actions: modify header (set metadata, see details below) and go-to-table (directing traffic to the isolated table containing the isolated matcher).

Rule updates (updating rule actions) are applied to the second part of the rule since user-provided actions are not executed in the first matcher. We use the REG_C_6 metadata register to set and match on a unique per-rule tag (see details below).

Splitting rules into two parts introduces new challenges:

1. Invalid Combinations

Consider two rules with different matching values:
- Rule 1: A+B
- Rule 2: C+D

Let's split the rules into two parts as follows:

  |---|     |---|
  | A |     | B |
  |---| --> |---|
  | C |     | D |
  |---|     |---|

Splitting these rules results in invalid combinations like A+D and C+B. To resolve this, we assign unique tags to each rule on the first matcher and match these tags on the second matcher (the tag is implemented through a modify_hdr action that sets a value in metadata register REG_C_6):

  |----------|     |---------|
  | A        |     | B, TagA |
  | action:  |     |         |
  | set TagA |     |         |
  |----------| --> |---------|
  | C        |     | D, TagB |
  | action:  |     |         |
  | set TagB |     |         |
  |----------|     |---------|

2. Duplicated Entries

Consider two rules with overlapping values:
- Rule 1: A+B
- Rule 2: A+D

Let's split the rules into two parts as follows:

  |---|     |---|
  | A |     | B |
  |---| --> |---|
  |   |     | D |
  |---|     |---|

This leads to duplicated entries on the first matcher, which HWS doesn't allow: a subsequent delete of either of the rules would delete the only entry in the first matcher, leaving the remaining rule broken. To address this, we use a reference count for entries in the first matcher and delete STEs only when their refcount reaches zero.

Both challenges are resolved by having a per-matcher data structure (implemented with an rhashtable) that manages refcounts for the first part of the rules and holds unique tags (managed via IDA) for these rules to set and to match on the second matcher.

Limitations:
------------
We utilize metadata register REG_C_6 in this implementation, so its usage anywhere along the steering of a flow that might include the need for a Complex Matcher is prohibited.

The number and size of match parameters remain limited — now constrained by what can be represented by two definers instead of one. This architectural limitation arises from the structure of Complex Matchers. If future requirements demand more parameters, Complex Matchers can be extended beyond two matchers.

Additionally, there is an implementation limit of 32 match parameters per rule (disregarding parameter size). This limit can be lifted if needed.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

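A sketch of the per-matcher bookkeeping described above (struct and field names are illustrative assumptions, not the upstream definitions): an rhashtable entry keyed by the first-half match values holds a refcount for rules sharing that first-half STE and an IDA-allocated tag that the first matcher sets in REG_C_6 and the second matcher matches on.

  struct complex_hash_node {
          struct rhash_head hash_node;  /* linkage in the per-matcher rhashtable */
          u32 match_key;                /* first half of the rule's match values */
          refcount_t refcount;          /* rules sharing this first-half entry   */
          int tag;                      /* from ida_alloc(); set in REG_C_6      */
  };
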
2025-05-14 | net/mlx5: HWS, introduce isolated matchers | Yevgeny Kliteynik | 3 | -1/+288

In preparation for complex matcher support, introduce the isolated matcher.

Isolated matcher is a matcher that has its own isolated table. It is used as the second half of the complex matcher: when the rule is split into two parts (complex rule), then matching on the first part will send the packet to the isolated matcher that will try to match on the second part. In case of miss, the packet goes back to the matcher's end flow table.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | net/mlx5: HWS, expose polling function in header file | Yevgeny Kliteynik | 2 | -13/+21

In preparation for complex matcher support, expose the function that polls a queue for completion (mlx5hws_bwc_queue_poll) in a header file, so that it can be used by the complex matcher code.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | net/mlx5: HWS, add definer function to get field name str | Yevgeny Kliteynik | 2 | -0/+214

In preparation for complex matcher support, add a function for converting definer fname to str, which will be used in following patches.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-14 | net/mlx5: HWS, expose function mlx5hws_table_ft_set_next_ft in header | Yevgeny Kliteynik | 2 | -8/+13

In preparation for complex matcher support, make function mlx5hws_table_ft_set_next_ft() non-static and expose it in a header.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-13 | Merge branch 'amd-xgbe-add-support-for-amd-renoir' | Paolo Abeni | 5 | -57/+227

Raju Rangoju says:

====================
amd-xgbe: add support for AMD Renoir

Add support for a new AMD Ethernet device called "Renoir". It has a new PCI ID; add this to the current list of supported devices in the amd-xgbe driver. Also, the BAR1 addresses cannot be used to access the PCS registers on the Renoir platform, so use indirect addressing via SMN instead.
====================

Link: https://patch.msgid.link/20250509155325.720499-1-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | amd-xgbe: add support for new pci device id 0x1641 | Raju Rangoju | 1 | -0/+18

Add support for new pci device id 0x1641 to register Renoir device with PCIe.

Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250509155325.720499-6-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | amd-xgbe: Add XGBE_XPCS_ACCESS_V3 support to xgbe_pci_probe() | Raju Rangoju | 3 | -7/+33

A new version of the XPCS access routines has been introduced; add support to xgbe_pci_probe() to use these routines.

Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250509155325.720499-5-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | amd-xgbe: add support for new XPCS routines | Raju Rangoju | 3 | -0/+122

Add the necessary support to enable the Renoir ethernet device. Since the BAR1 address cannot be used to access the XPCS registers on Renoir, use the smn functions.

Some of the ethernet add-in-cards have dual PHY but share a single MDIO line (between the ports). In such cases, link inconsistencies are noticed during heavy traffic and during reboot stress tests. Using smn calls helps avoid such race conditions.

Suggested-by: Sudheesh Mavila <sudheesh.mavila@amd.com>
Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250509155325.720499-4-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | amd-xgbe: reorganize the xgbe_pci_probe() code path | Raju Rangoju | 2 | -14/+25

Reorganize the xgbe_pci_probe() code path to convert if/else statements to a switch case, which makes it easier to add future code and keeps the code cleaner.

Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250509155325.720499-3-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | amd-xgbe: reorganize the code of XPCS access | Raju Rangoju | 1 | -36/+29

The xgbe_{read/write}_mmd_regs_v* functions have common code which can be moved to helper functions. Add new helper functions to calculate the mmd_address for v1/v2 of xpcs access.

Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250509155325.720499-2-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | Merge branch 'tools-ynl-gen-support-sub-types-for-binary-attributes' | Paolo Abeni | 1 | -3/+60

Jakub Kicinski says:

====================
tools: ynl-gen: support sub-types for binary attributes

Binary attributes have sub-type annotations which either indicate that the binary object should be interpreted as a raw / C array of a simple type (e.g. u32), or that it's a struct. Use this information in the C codegen instead of outputting void * for all binary attrs. It doesn't make a huge difference in the genl families, but in classic Netlink there are a lot more structs.

v1: https://lore.kernel.org/20250508022839.1256059-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250509154213.1747885-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | tools: ynl-gen: support struct for binary attributes | Jakub Kicinski | 1 | -1/+20

Support using a struct pointer for binary attrs. The len field is maintained because the structs may grow with newer kernel versions, or, which matters more, be shorter if the binary is built against a newer uAPI than the kernel it's executed against.

Since we are storing a pointer to a struct type, always allocate at least the amount of memory needed by the struct per current uAPI headers (unused memory is zeroed). Technically users should check the length field, but per modern ASAN checks, storing a short object under a pointer seems like a bad idea.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250509154213.1747885-4-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | tools: ynl-gen: auto-indent else | Jakub Kicinski | 1 | -0/+1

We auto-indent if statements (increase the indent of the subsequent line by 1); do the same thing for else branches without a block. There haven't been any else branches before, but we're about to add one.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250509154213.1747885-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | tools: ynl-gen: support sub-type for binary attributes | Jakub Kicinski | 1 | -3/+40

Sub-type annotation on binary attributes may indicate that the attribute carries an array of simple types (also referred to as "C array" in docs). Support rendering them as such in the C user code. For example for u32, instead of:

  struct {
          u32 arr;
  } _len;
  void *arr;

render:

  struct {
          u32 arr;
  } _count;
  __u32 *arr;

Note that count is the number of elements while len was the length in bytes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250509154213.1747885-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | Merge branch 'device-memory-tcp-tx' | Paolo Abeni | 30 | -86/+1007

Mina Almasry says:

====================
Device memory TCP TX

The TX path had been dropped from the Device Memory TCP patch series post RFCv1 [1], to make that series slightly easier to review. This series rebases the implementation of the TX path on top of the net_iov/netmem framework agreed upon and merged. The motivation for the feature is thoroughly described in the docs & cover letter of the original proposal, so I don't repeat the lengthy descriptions here, but they are available in [1].

Full outline on usage of the TX path is detailed in the documentation included with this series. A test example is available via the kselftest included in the series as well. The series is relatively small, as the TX path for this feature largely piggybacks on the existing MSG_ZEROCOPY implementation.

Patch Overview:
---------------
1. Documentation & tests to give high level overview of the feature being added.
2. Add netmem refcounting needed for the TX path.
3. Devmem TX netlink API.
4. Devmem TX net stack implementation.
5. Make dma-buf unbinding scheduled work to handle TX cases where it gets freed from contexts where we can't sleep.
6. Add devmem TX documentation.
7. Add scaffolding enabling driver support for netmem_tx. Add helpers, driver feature flag, and docs to enable drivers to declare netmem_tx support.
8. Guard netmem_tx against being enabled against drivers that don't support it.
9. Add devmem_tx selftests. Add TX path to ncdevmem and add a test to devmem.py.

Testing:
--------
Testing is very similar to the devmem TCP RX path. The ncdevmem test used for the RX path is now augmented with client functionality to test the TX path.

* Test Setup:
  Kernel: net-next with this RFC and memory provider API cherry-picked locally.
  Hardware: Google Cloud A3 VMs.
  NIC: GVE with header split & RSS & flow steering support.

Performance results are not included with this version, unfortunately. I'm having issues running the dma-buf exporter driver against the upstream kernel on my test setup. The issues are specific to that dma-buf exporter and do not affect this patch series. I plan to follow up this series with perf fixes if the tests point to issues once they're up and running.

Special thanks to Stan who took a stab at rebasing the TX implementation on top of the netmem/net_iov framework merged. Parts of his proposal [2] that are reused as-is are forked off into their own patches to give full credit.

[1] https://lore.kernel.org/netdev/20240909054318.1809580-1-almasrymina@google.com/
[2] https://lore.kernel.org/netdev/20240913150913.1280238-2-sdf@fomichev.me/T/#m066dd407fbed108828e2c40ae50e3f4376ef57fd

Cc: sdf@fomichev.me
Cc: asml.silence@gmail.com
Cc: dw@davidwei.uk
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Cc: Pedro Tammela <pctammela@mojatatu.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>

v14: https://lore.kernel.org/netdev/20250429032645.363766-1-almasrymina@google.com/
v13: https://lore.kernel.org/netdev/20250425204743.617260-1-almasrymina@google.com/
v12: https://lore.kernel.org/netdev/20250423031117.907681-1-almasrymina@google.com/
v11: https://lore.kernel.org/netdev/20250423031117.907681-1-almasrymina@google.com/
v10: https://lore.kernel.org/netdev/20250417231540.2780723-1-almasrymina@google.com/
v9: https://lore.kernel.org/netdev/20250415224756.152002-1-almasrymina@google.com/
v8: https://lore.kernel.org/netdev/20250308214045.1160445-1-almasrymina@google.com/
v7: https://lore.kernel.org/netdev/20250227041209.2031104-1-almasrymina@google.com/
v6: https://lore.kernel.org/netdev/20250222191517.743530-1-almasrymina@google.com/
v5: https://lore.kernel.org/netdev/20250220020914.895431-1-almasrymina@google.com/
v4: https://lore.kernel.org/netdev/20250203223916.1064540-1-almasrymina@google.com/
v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*
====================

Link: https://patch.msgid.link/20250508004830.4100853-1-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | selftests: ncdevmem: Implement devmem TCP TX | Mina Almasry | 2 | -15/+311

Add support for devmem TX in ncdevmem. This is a combination of the ncdevmem from the devmem TCP series RFCv1 which included the TX path, and work by Stan to include the netlink API, refactored on top of his generic memory_provider support.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-10-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | net: check for driver support in netmem TX | Mina Almasry | 3 | -2/+45

We should not enable netmem TX for drivers that don't declare support.

Check for driver netmem TX support during devmem TX binding and fail if the driver does not have the functionality. Check for driver support in validate_xmit_skb as well.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-9-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | gve: add netmem TX support to GVE DQO-RDA mode | Mina Almasry | 2 | -3/+8

Use netmem_dma_*() helpers in gve_tx_dqo.c DQO-RDA paths to enable netmem TX support in that mode. Declare support for netmem TX in GVE DQO-RDA mode.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Harshitha Ramamurthy <hramamurthy@google.com>
Link: https://patch.msgid.link/20250508004830.4100853-8-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | net: enable driver support for netmem TX | Mina Almasry | 5 | -2/+49

Drivers need to make sure not to pass netmem dma-addrs to the dma-mapping API in order to support netmem TX.

Add netmem_dma_*() helpers that enable special handling of netmem dma-addrs, for drivers to use. Document in netmem.rst what drivers need to do to support netmem TX.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-7-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

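A conceptual sketch of the guard such a helper provides (the function below is illustrative, not the upstream netmem_dma_*() definition): net_iov frags carry dma-addrs owned by the dmabuf binding, so the driver must not hand them to the dma-mapping API.

  static inline void driver_unmap_netmem(struct device *dev, netmem_ref netmem,
                                         dma_addr_t dma, size_t size,
                                         enum dma_data_direction dir)
  {
          /* the binding owns the mapping of net_iovs; nothing to unmap */
          if (netmem_is_net_iov(netmem))
                  return;

          dma_unmap_page_attrs(dev, dma, size, dir, 0);
  }
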
2025-05-13 | net: add devmem TCP TX documentation | Mina Almasry | 1 | -4/+146

Add documentation outlining the usage and details of the devmem TCP TX API.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-6-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | net: devmem: Implement TX path | Mina Almasry | 13 | -60/+340

Augment dmabuf binding to be able to handle TX. In addition to all the RX binding, we also create the tx_vec needed for the TX path.

Provide an API for sendmsg to be able to send dmabufs bound to this device:

- Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
- MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.

Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY implementation, while disabling instances where MSG_ZEROCOPY falls back to copying.

We additionally pipe the binding down to the new zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems instead of the traditional page netmems.

We also special-case skb_frag_dma_map to return the dma-address of these dmabuf net_iovs instead of attempting to map pages.

The TX path may release the dmabuf in a context where we cannot wait. This happens when the user unbinds a TX dmabuf while there are still references to its netmems in the TX path. In that case, the netmems will be put_netmem'd from a context where we can't unmap the dmabuf. Resolve this by making __net_devmem_dmabuf_binding_free schedule_work'd.

Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat of the implementation came from devmem TCP RFC v1 [1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov.

Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

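A userspace sketch of the send side described above (simplified; the dmabuf_tx_cmsg field name and the offset-as-iov_base convention are assumptions based on this commit message, not verified uAPI): the dmabuf id obtained from the bind-tx netlink call is passed via an SCM_DEVMEM_DMABUF cmsg together with MSG_ZEROCOPY.

  char ctrl[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))] = {};
  struct dmabuf_tx_cmsg ddmabuf = { .dmabuf_id = dmabuf_id }; /* field name assumed */
  struct iovec iov = { .iov_base = (void *)offset_in_dmabuf, .iov_len = len };
  struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };
  struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

  cm->cmsg_level = SOL_SOCKET;
  cm->cmsg_type = SCM_DEVMEM_DMABUF;
  cm->cmsg_len = CMSG_LEN(sizeof(ddmabuf));
  memcpy(CMSG_DATA(cm), &ddmabuf, sizeof(ddmabuf));

  sendmsg(fd, &msg, MSG_ZEROCOPY);
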
2025-05-13 | net: devmem: TCP tx netlink api | Stanislav Fomichev | 6 | -0/+34

Add bind-tx netlink call to attach dmabuf for TX; queue is not required, only ifindex and dmabuf fd for attachment.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250508004830.4100853-4-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | net: add get_netmem/put_netmem support | Mina Almasry | 5 | -2/+65

Currently net_iovs support only pp ref counts, and do not support a page ref equivalent. This is fine for the RX path as net_iovs are used exclusively with the pp and only pp refcounting is needed there. The TX path however does not use pp ref counts, thus, support for get_page/put_page equivalent is needed for netmem.

Support get_netmem/put_netmem. Check the type of the netmem before passing it to page or net_iov specific code to obtain a page ref equivalent.

For dmabuf net_iovs, we obtain a ref on the underlying binding. This ensures the entire binding doesn't disappear until all the net_iovs have been put_netmem'ed. We do not need to track the refcount of individual dmabuf net_iovs as we don't allocate/free them from a pool similar to what the buddy allocator does for pages.

This code is written to be extensible by other net_iov implementers. get_netmem/put_netmem will check the type of the netmem and route it to the correct helper:

  pages -> [get|put]_page()
  dmabuf net_iovs -> net_devmem_[get|put]_net_iov()
  new net_iovs -> new helpers

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-3-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

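A sketch of the type dispatch on the put side (simplified; the exact upstream body may differ):

  void put_netmem(netmem_ref netmem)
  {
          /* dmabuf net_iovs drop a ref on the underlying binding */
          if (netmem_is_net_iov(netmem)) {
                  net_devmem_put_net_iov(netmem_to_net_iov(netmem));
                  return;
          }

          /* plain page netmems use the normal page refcount */
          put_page(netmem_to_page(netmem));
  }
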
2025-05-13 | netmem: add niov->type attribute to distinguish different net_iov types | Mina Almasry | 3 | -2/+13

Later patches in the series add TX net_iovs where there is no pp associated, so we can't rely on niov->pp->mp_ops to tell what the type of the net_iov is. Add a type enum to the net_iov which tells us the net_iov type.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250508004830.4100853-2-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13 | net: dsa: b53: implement setting ageing time | Jonas Gorski | 4 | -0/+37

b53 supported switches support configuring ageing time between 1 and 1,048,575 seconds, so add an appropriate setter.

This allows b53 to pass the FDB learning test for both vlan aware and vlan unaware bridges.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250510092211.276541-1-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

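A sketch of what such a DSA setter looks like (register and field names are illustrative assumptions; only the dsa_switch_ops signature, which takes milliseconds, and the 1 to 1,048,575 second range are from the source):

  static int b53_set_ageing_time(struct dsa_switch *ds, unsigned int msecs)
  {
          struct b53_device *dev = ds->priv;
          u32 secs = DIV_ROUND_UP(msecs, 1000);

          /* hardware encodes the ageing time in whole seconds */
          if (secs < 1 || secs > 1048575)
                  return -ERANGE;

          /* register/field names below are hypothetical */
          return b53_write32(dev, B53_MGMT_PAGE, B53_AGING_TIME_CONTROL,
                             AGE_CHANGE | secs);
  }
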
2025-05-13 | net: mlx4: add SOF_TIMESTAMPING_TX_SOFTWARE flag when getting ts info | Jason Xing | 1 | -0/+1

As mlx4 has implemented skb_tx_timestamp() in mlx4_en_xmit(), the SOFTWARE flag is surely needed when users are trying to get timestamp information.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250510093442.79711-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-13 | netlink: fix policy dump for int with validation callback | Jakub Kicinski | 2 | -0/+11

Recent devlink change added validation of an integer value via NLA_POLICY_VALIDATE_FN, for sparse enums. Handle this in policy dump. We can't extract any info out of the callback, so report only the type.

Fixes: 429ac6211494 ("devlink: define enum for attr types of dynamic attributes")
Reported-by: syzbot+01eb26848144516e7f0a@syzkaller.appspotmail.com
Link: https://patch.msgid.link/20250509212751.1905149-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>