summaryrefslogtreecommitdiff
path: root/drivers/net
AgeCommit message (Collapse)AuthorFilesLines
2021-02-07net: ena: Update XDP verdict upon failureShay Agroskin1-1/+5
The verdict returned from ena_xdp_execute() is used to determine the fate of the RX buffer's page. In case of XDP Redirect/TX error the verdict should be set to XDP_ABORTED, otherwise the page won't be freed. Fixes: a318c70ad152 ("net: ena: introduce XDP redirect implementation") Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: ipa: avoid field overflowAlex Elder1-8/+14
It's possible that the length passed to ipa_header_size_encoded() is larger than what can be represented by the HDR_LEN field alone (starting with IPA v4.5). If we attempted that, u32_encode_bits() would trigger a build-time error. Avoid this problem by masking off high-order bits of the value encoded as the lower portion of the header length. The same sort of problem exists in ipa_metadata_offset_encoded(), so implement the same fix there. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: ipa: get rid of status size constraintAlex Elder1-3/+0
There is a build-time check that the packet status structure is a multiple of 4 bytes in size. It's not clear where that constraint comes from, but the structure defines what hardware provides so its definition won't change. Get rid of the check; it adds no value. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: ipa: use a Boolean rather than count when replenishingAlex Elder1-16/+19
The count argument to ipa_endpoint_replenish() is only ever 0 or 1, and always will be (because we always handle each receive buffer in a single transaction). Rename the argument to be add_one and change it to be Boolean. Update the function description to reflect the current code. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: ipa: remove two unused register definitionsAlex Elder1-10/+0
We do not support inter-EE channel or event ring commands. Inter-EE interrupts are disabled (and never re-enabled) for all channels and event rings, so we have no need for the GSI registers that clear those interrupt conditions. So remove their definitions. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: ipa: do not cache event ring stateAlex Elder2-19/+21
An event ring's state only needs to be known when it is allocated, reset, or deallocated. We check an event ring's state both before and after performing an event ring control command that changes its state. These are only issued at startup and shutdown, so there is very little value in caching the state. Stop recording a copy of the channel's last known state, and instead fetch the true state from hardware whenever it's needed. In such cases, *do* record the state in a local variable, in case an error message reports it (so the value reported is the value seen). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: ipa: synchronize NAPI only for suspendAlex Elder1-8/+10
When stopping a channel, gsi_channel_stop() will ensure NAPI polling is complete when it calls napi_disable(). So there is no need to call napi_synchronize() in that case. Move the call to napi_synchronize() out of __gsi_channel_stop() and into gsi_channel_suspend(), so it's only used where needed. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: ipa: move mutex calls into __gsi_channel_stop()Alex Elder1-11/+17
Move the mutex calls out of gsi_channel_stop_retry() and into __gsi_channel_stop(), to make the latter more semantically similar to __gsi_channel_start(). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: dsa: felix: propagate the LAG offload ops towards the ocelot libVladimir Oltean2-6/+32
The ocelot switch has been supporting LAG offload since its initial commit, however felix could not make use of that, due to lack of a LAG abstraction in DSA. Now that we have that, let's forward DSA's calls towards the ocelot library, who will deal with setting up the bonding. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: mscc: ocelot: rebalance LAGs on link up/down eventsVladimir Oltean3-9/+63
At present there is an issue when ocelot is offloading a bonding interface, but one of the links of the physical ports goes down. Traffic keeps being hashed towards that destination, and of course gets dropped on egress. Monitor the netdev notifier events emitted by the bonding driver for changes in the physical state of lower interfaces, to determine which ports are active and which ones are no longer. Then extend ocelot_get_bond_mask to return either the configured bonding interfaces, or the active ones, depending on a boolean argument. The code that does rebalancing only needs to do so among the active ports, whereas the bridge forwarding mask and the logical port IDs still need to look at the permanently bonded ports. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: mscc: ocelot: rename aggr_count to num_ports_in_lagVladimir Oltean1-4/+3
It makes it a bit easier to read and understand the code that deals with balancing the 16 aggregation codes among the ports in a certain LAG. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: mscc: ocelot: drop the use of the "lags" arrayVladimir Oltean1-56/+39
We can now simplify the implementation by always using ocelot_get_bond_mask to look up the other ports that are offloading the same bonding interface as us. In ocelot_set_aggr_pgids, the code had a way to uniquely iterate through LAGs. We need to achieve the same behavior by marking each LAG as visited, which we do now by using a temporary 32-bit "visited" bitmask. This is ok and we do not need dynamic memory allocation, because we know that this switch architecture will not have more than 32 ports (the PGID port masks are 32-bit anyway). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: mscc: ocelot: set up logical port IDs centrallyVladimir Oltean1-19/+28
The setup of logical port IDs is done in two places: from the inconclusively named ocelot_setup_lag and from ocelot_port_lag_leave, a function that also calls ocelot_setup_lag (which apparently does an incomplete setup of the LAG). To improve this situation, we can rename ocelot_setup_lag into ocelot_setup_logical_port_ids, and drop the "lag" argument. It will now set up the logical port IDs of all switch ports, which may be just slightly more inefficient but more maintainable. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: mscc: ocelot: avoid unneeded "lp" variable in LAG joinVladimir Oltean1-10/+6
The index of the LAG is equal to the logical port ID that all the physical port members have, which is further equal to the index of the first physical port that is a member of the LAG. The code gets a bit carried away with logic like this: if (a == b) c = a; else c = b; which can be simplified, of course, into: c = b; (with a being port, b being lp, c being lag) This further makes the "lp" variable redundant, since we can use "lag" everywhere where "lp" (logical port) was used. So instead of a "c = b" assignment, we can do a complete deletion of b. Only one comment here: if (bond_mask) { lp = __ffs(bond_mask); ocelot->lags[lp] = 0; } lp was clobbered before, because it was used as a temporary variable to hold the new smallest port ID from the bond. Now that we don't have "lp" any longer, we'll just avoid the temporary variable and zeroize the bonding mask directly. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: mscc: ocelot: set up the bonding mask in a way that avoids a net_deviceVladimir Oltean1-7/+22
Since this code should be called from pure switchdev as well as from DSA, we must find a way to determine the bonding mask not by looking directly at the net_device lowers of the bonding interface, since those could have different private structures. We keep a pointer to the bonding upper interface, if present, in struct ocelot_port. Then the bonding mask becomes the bitwise OR of all ports that have the same bonding upper interface. This adds a duplication of functionality with the current "lags" array, but the duplication will be short-lived, since further patches will remove the latter completely. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: mscc: ocelot: use ipv6 in the aggregation codeVladimir Oltean1-1/+4
IPv6 header information is not currently part of the entropy source for the 4-bit aggregation code used for LAG offload, even though it could be. The hardware reference manual says about these fields: ANA::AGGR_CFG.AC_IP6_TCPUDP_PORT_ENA Use IPv6 TCP/UDP port when calculating aggregation code. Configure identically for all ports. Recommended value is 1. ANA::AGGR_CFG.AC_IP6_FLOW_LBL_ENA Use IPv6 flow label when calculating AC. Configure identically for all ports. Recommended value is 1. Integration with the xmit_hash_policy of the bonding interface is TBD. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: mscc: ocelot: don't refuse bonding interfaces we can't offloadVladimir Oltean3-28/+17
Since switchdev/DSA exposes network interfaces that fulfill many of the same user space expectations that dedicated NICs do, it makes sense to not deny bonding interfaces with a bonding policy that we cannot offload, but instead allow the bonding driver to select the egress interface in software. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: mscc: ocelot: use a switch-case statement in ocelot_netdevice_eventVladimir Oltean1-23/+45
Make ocelot's net device event handler more streamlined by structuring it in a similar way with others. The inspiration here was dsa_slave_netdevice_event. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: mscc: ocelot: rename ocelot_netdevice_port_event to ↵Vladimir Oltean1-32/+27
ocelot_netdevice_changeupper ocelot_netdevice_port_event treats a single event, NETDEV_CHANGEUPPER. So we can remove the check for the type of event, and rename the function to be more suggestive, since there already is a function with a very similar name of ocelot_netdevice_event. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: hns3: replace macro of max qset number with specificationGuangbin Huang6-6/+12
The max qset number is a fixed value now and it is defined by a macro. In order to support other value in different kinds of device, it is better to use specification queried from firmware to replace macro. Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: hns3: debugfs add max tm rate specification printGuangbin Huang1-0/+1
In order to add a method to check the specification of max tm rate for debugging, function hns3_dbg_dev_specs() adds this value print. Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: hns3: add support for obtaining the maximum frame sizeYufeng Mo9-8/+20
Since the newer hardware may supports different frame size, so add support to obtain the capability from the firmware instead of the fixed value. Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: hns3: optimize the code when update the tc infoGuoJia Liao3-7/+12
When update the TC info for NIC, there are some differences between PF and VF. Currently, four "vport->vport_id" are used to distinguish PF or VF. So merge them into one to improve readability and maintainability of code. Signed-off-by: GuoJia Liao <liaoguojia@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: hns3: RSS indirection table use device specificationGuangbin Huang6-41/+66
As RSS indirection table size may be different in different hardware. Instead of using macro, this value is better to use device specification which querying from firmware. BTW, RSS indirection table should be allocated by the queried size instead the static array. .get_rss_indir_size in struct hnae3_ae_ops is not used now, so remove it as well. Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: hns3: add api capability bits for firmwareJian Shen4-2/+30
To improve the compatibility of firmware for driver, help firmware to deal with different api commands, add api capability bits when initialize the command queue. Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: dpaa2-mac: add backplane link mode supportRussell King2-3/+6
Add support for backplane link mode, which is, according to discussions with NXP earlier in the year, is a mode where the OS (Linux) is able to manage the PCS and Serdes itself. This commit prepares the ground work for allowing 1G fiber connections to be used with DPAA2 on the SolidRun CEX7 platforms. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: dpaa2-mac: add 1000BASE-X supportRussell King1-3/+17
Now that pcs-lynx supports 1000BASE-X, add support for this interface mode to dpaa2-mac. pcs-lynx can be switched at runtime between SGMII and 1000BASE-X mode, so allow dpaa2-mac to switch between these as well. This commit prepares the ground work for allowing 1G fiber connections to be used with DPAA2 on the SolidRun CEX7 platforms. Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-07net: pcs: add pcs-lynx 1000BASE-X supportRussell King1-0/+36
Add support for 1000BASE-X to pcs-lynx for the LX2160A. This commit prepares the ground work for allowing 1G fiber connections to be used with DPAA2 on the SolidRun CEX7 platforms. Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06net: wan: farsync: use new tasklet APIEmil Renner Berthing1-6/+6
This converts the driver to use the new tasklet API introduced in commit 12cc923f1ccc ("tasklet: Introduce new initialization API") The new API changes the argument passed to callback functions, but fortunately it is unused so it is straight forward to use DECLARE_TASKLET rather than DECLARE_TASLKLET_OLD. Signed-off-by: Emil Renner Berthing <kernel@esmil.dk> Link: https://lore.kernel.org/r/20210204173947.92884-1-kernel@esmil.dk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06net: dpaa2: Use napi_alloc_frag_align() to avoid the memory wasteKevin Hao1-2/+1
The napi_alloc_frag_align() will guarantee that a correctly align buffer address is returned. So use this function to simplify the buffer alloc and avoid the unnecessary memory waste. Signed-off-by: Kevin Hao <haokexin@gmail.com> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06net: octeontx2: Use napi_alloc_frag_align() to avoid the memory wasteKevin Hao1-2/+1
The napi_alloc_frag_align() will guarantee that a correctly align buffer address is returned. So use this function to simplify the buffer alloc and avoid the unnecessary memory waste. Signed-off-by: Kevin Hao <haokexin@gmail.com> Tested-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06net: dwc-xlgmac: Fix spelling mistake in function nameColin Ian King3-3/+3
There is a spelling mistake in the function name alloc_channles_and_rings. Fix this by renaming it to alloc_channels_and_rings. Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20210204094944.51460-1-colin.king@canonical.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06net: mhi-net: Add re-aggregation of fragmented packetsLoic Poulain1-10/+64
When device side MTU is larger than host side MTU, the packets (typically rmnet packets) are split over multiple MHI transfers. In that case, fragments must be re-aggregated to recover the packet before forwarding to upper layer. A fragmented packet result in -EOVERFLOW MHI transaction status for each of its fragments, except the final one. Such transfer was previously considered as error and fragments were simply dropped. This change adds re-aggregation mechanism using skb chaining, via skb frag_list. A warning (once) is printed since this behavior usually comes from a misconfiguration of the device (e.g. modem MTU). Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Link: https://lore.kernel.org/r/1612428002-12333-1-git-send-email-loic.poulain@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06net: qualcomm: rmnet: Fix rx_handler for non-linear skbsLoic Poulain1-0/+5
There is no guarantee that rmnet rx_handler is only fed with linear skbs, but current rmnet implementation does not check that, leading to crash in case of non linear skbs processed as linear ones. Fix that by ensuring skb linearization before processing. Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Acked-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Link: https://lore.kernel.org/r/1612428002-12333-2-git-send-email-loic.poulain@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06net: ethernet: ti: fix netdevice stats for XDPLorenzo Bianconi4-9/+13
Align netdevice statistics when the device is running in XDP mode to other upstream drivers. In particular report to user-space rx packets even if they are not forwarded to the networking stack (XDP_PASS) but if they are redirected (XDP_REDIRECT), dropped (XDP_DROP) or sent back using the same interface (XDP_TX). This patch allows the system administrator to verify the device is receiving data correctly. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/a457cb17dd9c58c116d64ee34c354b2e89c0ff8f.1612375372.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06dpaa2-eth: Simplify the calculation of variablesJiapeng Chong1-1/+1
Fix the following coccicheck warnings: ./drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c:1651:36-38: WARNING !A || A && B is equivalent to !A || B. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> Link: https://lore.kernel.org/r/1612260157-128026-1-git-send-email-jiapeng.chong@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06ibmvnic: Clear failover_pending if unable to scheduleSukadev Bhattiprolu1-1/+16
Normally we clear the failover_pending flag when processing the reset. But if we are unable to schedule a failover reset we must clear the flag ourselves. We could fail to schedule the reset if we are in PROBING state (eg: when booting via kexec) or because we could not allocate memory. Thanks to Cris Forno for helping isolate the problem and for testing. Fixes: 1d8504937478 ("powerpc/vnic: Extend "failover pending" window") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Tested-by: Cristobal Forno <cforno12@linux.ibm.com> Link: https://lore.kernel.org/r/20210203050802.680772-1-sukadev@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06Merge tag 'wireless-drivers-next-2021-02-05' of ↵Jakub Kicinski41-170/+283
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next Kalle Valo says: ==================== wireless-drivers-next patches for v5.12 First set of patches for v5.12. A smaller pull request this time, biggest feature being a better key handling for ath9k. And of course the usual fixes and cleanups all over. Major changes: ath9k * more robust encryption key cache management brcmfmac * support BCM4365E with 43666 ChipCommon chip ID * tag 'wireless-drivers-next-2021-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next: (35 commits) iwl4965: do not process non-QOS frames on txq->sched_retry path mt7601u: process tx URBs with status EPROTO properly wlcore: Fix command execute failure 19 for wl12xx mt7601u: use ieee80211_rx_list to pass frames to the network stack as a batch rtw88: 8723de: adjust the LTR setting rtlwifi: rtl8821ae: fix bool comparison in expressions rtlwifi: rtl8192se: fix bool comparison in expressions rtlwifi: rtl8188ee: fix bool comparison in expressions rtlwifi: rtl8192c-common: fix bool comparison in expressions rtlwifi: rtl_pci: fix bool comparison in expressions wlcore: Downgrade exceeded max RX BA sessions to debug wilc1000: use flexible-array member instead of zero-length array brcmfmac: clear EAP/association status bits on linkdown events brcmfmac: Delete useless kfree code qtnfmac_pcie: Use module_pci_driver mt7601u: check the status of device in calibration mt7601u: process URBs in status EPROTO properly brcmfmac: support BCM4365E with 43666 ChipCommon chip ID wilc1000: fix spelling mistake in Kconfig "devision" -> "division" mwifiex: pcie: Drop bogus __refdata annotation ... ==================== Link: https://lore.kernel.org/r/20210205161901.C7F83C433ED@smtp.codeaurora.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06Merge tag 'wireless-drivers-2021-02-05' of ↵Jakub Kicinski2-9/+7
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers Kalle Valo says: ==================== wireless-drivers fixes for v5.11 Third, and most likely the last, set of fixes for v5.11. Two very small fixes. ath9k * fix build regression related to LEDS_CLASS mt76 * fix a memory leak * tag 'wireless-drivers-2021-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers: mt76: dma: fix a possible memory leak in mt76_add_fragment() ath9k: fix build error with LEDS_CLASS=m ==================== Link: https://lore.kernel.org/r/20210205163434.14D94C433ED@smtp.codeaurora.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06net/mlx5e: Handle FIB events to update tunnel endpoint deviceVlad Buslov7-67/+773
Process FIB route update events to dynamically update the stack device rules when tunnel routing changes. Use rtnl lock to prevent FIB event handler from running concurrently with neigh update and neigh stats workqueue tasks. Use encap_tbl_lock mutex to synchronize with TC rule update path that doesn't use rtnl lock. FIB event workflow for encap flows: - Unoffload all flows attached to route encaps from slow or fast path depending on encap destination endpoint neigh state. - Update encap IP header according to new route dev. - Update flows mod_hdr action that is responsible for overwriting reg_c0 source port bits to source port of new underlying VF of new route dev. This step requires changing flow create/delete code to save flow parse attribute mod_hdr_acts structure for whole flow lifetime instead of deallocating it after flow creation. Refactor mod_hdr code to allow saving id of individual mod_hdr actions and updating them with dedicated helper. - Offload all flows to either slow or fast path depending on encap destination endpoint neigh state. FIB event workflow for decap flows: - Unoffload all route flows from hardware. When last route flow is deleted all indirect table rules for the route dev will also be deleted. - Update flow attr decap_vport and destination MAC according to underlying VF of new rote dev. - Offload all route flows back to hardware creating new indirect table rules according to updated flow attribute data. Extract some neigh update code to helper functions to be used by both neigh update and route update infrastructure. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-06net/mlx5e: Rename some encap-specific API to generic namesVlad Buslov5-9/+9
Some of the encap-specific functions and fields will also be used by route update infrastructure in following patches. Rename them to generic names. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-06net/mlx5e: TC preparation refactoring for routing update eventVlad Buslov5-9/+288
Following patch in series implement routing update event which requires ability to modify rule match_to_reg modify header actions dynamically during rule lifetime. In order to accommodate such behavior, refactor and extend TC infrastructure in following ways: - Modify mod_hdr infrastructure to preserve its parse attribute for whole rule lifetime, instead of deallocating it after rule creation. - Extend match_to_reg infrastructure with new function mlx5e_tc_match_to_reg_set_and_get_id() that returns mod_hdr action id that can be used afterwards to update the action, and mlx5e_tc_match_to_reg_mod_hdr_change() that can modify existing actions by its id. - Extend tun API with new functions mlx5e_tc_tun_update_header_ipv{4|6}() that are used to updated existing encap entry tunnel header. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-06net/mlx5e: Refactor neigh update infrastructureVlad Buslov9-31/+35
Following patches in series implements route update which can cause encap entries to migrate between routing devices. Consecutively, their parent nhe's need to be also transferable between devices instead of having neigh device as a part of their immutable key. Move neigh device from struct mlx5_neigh to struct mlx5e_neigh_hash_entry and check that nhe and neigh devices are the same in workqueue neigh update handler. Save neigh net_device that can change dynamically in dedicated nhe->dev field. With FIB event handler that is implemented in following patches changing nhe->dev, NETEVENT_DELAY_PROBE_TIME_UPDATE handler can concurrently access the nhe entry when traversing neigh list under rcu read lock. Processing stale values in that handler doesn't change the handler logic, so just wrap all accesses to the dev pointer in {WRITE|READ}_ONCE() helpers. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-06net/mlx5e: Create route entry infrastructureVlad Buslov7-11/+290
Implement dedicated route entry infrastructure to be used in following patch by route update event. Both encap (indirectly through their corresponding encap entries) and decap (directly) flows are attached to routing entry. Since route update also requires updating encap (route device MAC address is a source MAC address of tunnel encapsulation), same encap_tbl_lock mutex is used for synchronization. The new infrastructure looks similar to existing infrastructures for shared encap, mod_hdr and hairpin entries: - Per-eswitch hash table is used for quick entry lookup. - Flows are attached to per-entry linked list and hold reference to entry during their lifetime. - Atomic reference counting and rcu mechanisms are used as synchronization primitives for concurrent access. The infrastructure also enables connection tracking on stacked devices topology by attaching CT chain 0 flow on tunneling dev to decap route entry. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-06net/mlx5e: Extract tc tunnel encap/decap code to dedicated fileVlad Buslov7-885/+947
Following patches in series extend the extracted code with routing infrastructure. To improve code modularity created a dedicated tc_tun_encap.c source file and move encap/decap related code to the new file. Export code that is used by both regular TC code and encap/decap code into tc_priv.h (new header intended to be used only by TC module). Rename some exported functions by adding "mlx5e_" prefix to their names. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-06net/mlx5e: Match recirculated packet miss in slow table using reg_c1Vlad Buslov4-7/+134
Previous patch in series that implements stack devices RX path implements indirect table rules that match on tunnel VNI. After such rule is created all tunnel traffic is recirculated to root table. However, recirculated packet might not match on any rules installed in the table (for example, when IP traffic follows ARP traffic). In that case packets appear on representor of tunnel endpoint VF instead being redirected to the VF itself. Extend slow table with additional flow group that matches on reg_c0 (source port value set by indirect tables implemented by previous patch in series) and reg_c1 (special 0xFFF mark). When creating offloads fdb tables, install one rule per VF vport to match on recirculated miss packets and redirect them to appropriate VF vport. Modify indirect tables code to also rewrite reg_c1 with special 0xFFF mark. Implementation reuses reg_c1 tunnel id bits. This is safe to do because recirculated packets are always matched before decapsulation. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-06net/mlx5e: Refactor reg_c1 usageVlad Buslov4-9/+7
Following patch in series uses reg_c1 in eswitch code. To use reg_c1 helpers in both TC and eswitch code, refactor existing helpers according to similar use case of reg_c0 and move the functionality into eswitch.h. Calculate reg mappings length from new defines to ensure that they are always in sync and only need to be changed in single place. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-06net/mlx5e: VF tunnel RX traffic offloadingVlad Buslov5-8/+271
When tunnel endpoint is on VF the encapsulated RX traffic is exposed on the representor of the VF without any further processing of rules installed on the VF. Detect such case by checking if the device returned by route lookup in decap rule handling code is a mlx5 VF and handle it with new redirection tables API. Example TC rules for VF tunnel traffic: 1. Rule that encapsulates the tunneled flow and redirects packets from source VF rep to tunnel device: $ tc -s filter show dev enp8s0f0_1 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac 0a:40:bd:30:89:99 src_mac ca:2e:a7:3f:f5:0f eth_type ipv4 ip_tos 0/0x3 ip_flags nofrag in_hw in_hw_count 1 action order 1: tunnel_key set src_ip 7.7.7.5 dst_ip 7.7.7.1 key_id 98 dst_port 4789 nocsum ttl 64 pipe index 1 ref 1 bind 1 installed 411 sec used 411 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 no_percpu used_hw_stats delayed action order 2: mirred (Egress Redirect to device vxlan_sys_4789) stolen index 1 ref 1 bind 1 installed 411 sec used 0 sec Action statistics: Sent 5615833 bytes 4028 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 5615833 bytes 4028 pkt backlog 0b 0p requeues 0 cookie bb406d45d343bf7ade9690ae80c7cba4 no_percpu used_hw_stats delayed 2. Rule that redirects from tunnel device to UL rep: $ tc -s filter show dev vxlan_sys_4789 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac ca:2e:a7:3f:f5:0f src_mac 0a:40:bd:30:89:99 eth_type ipv4 enc_dst_ip 7.7.7.5 enc_src_ip 7.7.7.1 enc_key_id 98 enc_dst_port 4789 enc_tos 0 ip_flags nofrag in_hw in_hw_count 1 action order 1: tunnel_key unset pipe index 2 ref 1 bind 1 installed 434 sec used 434 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 used_hw_stats delayed action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen index 4 ref 1 bind 1 installed 434 sec used 0 sec Action statistics: Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 129936 bytes 1082 pkt backlog 0b 0p requeues 0 cookie ac17cf398c4c69e4a5b2f7aabd1b88ff no_percpu used_hw_stats delayed Co-developed-by: Dmytro Linkin <dlinkin@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-06net/mlx5e: Remove redundant match on tunnel destination macVlad Buslov1-8/+0
Remove hardcoded match on tunnel destination MAC address. Such match is no longer required and would be wrong for stacked devices topology where encapsulation destination MAC address will be the address of tunnel VF that can change dynamically on route change (implemented in following patches in the series). Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-06net/mlx5: E-Switch, Indirect table infrastructureVlad Buslov6-0/+616
Indirect table infrastructure is used to allow fully processing VF tunnel traffic in hardware. Kernel software model uses two TC rules for such traffic: UL rep to tunnel device, then tunnel VF rep to destination VF rep. To implement such pipeline driver needs to program the hardware after matching on UL rule to overwrite source vport from UL to tunnel VF and recirculate the packet to the root table to allow matching on the rule installed on tunnel VF. For this indirect table matches all encapsulated traffic by tunnel parameters and all other IP traffic is sent to tunnel VF by the miss rule. Indirect table API overview: - mlx5_esw_indir_table_{init|destroy}() - init and destroy opaque indirect table object. - mlx5_esw_indir_table_get() - get or create new table according to vport id and IP version. Table has following pre-created groups: recirculation group with match on ethertype and VNI (rules that match encapsulated packets are installed to this group) and forward group with default/miss rule that forwards to vport of tunnel endpoint VF (rule for regular non-encapsulated packets). - mlx5_esw_indir_table_put() - decrease reference to the indirect table and matching rule (for encapsulated traffic). - mlx5_esw_indir_table_needed() - check that in_port is an uplink port and out_port is VF on the same eswitch, verify that the rule is for IP traffic and source port rewrite functionality can be used. - mlx5_esw_indir_table_decap_vport() - function returns decap vport of flow attribute. Co-developed-by: Dmytro Linkin <dlinkin@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>