kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
2021-08-22	sfc: falcon: Read VPD with pci_vpd_alloc()	Heiner Kallweit	1	-16/+14
	This is the same as 5119e20facfa "sfc: Read VPD with pci_vpd_alloc()", just for the falcon chip version. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	mlxsw: spectrum_router: Increase parsing depth for multipath hash	Amit Cohen	2	-1/+44
	Commit 01848e05f8bb ("mlxsw: spectrum_router: Add support for inner layer 3 multipath hash policy") and commit daeabf89eb89 ("mlxsw: spectrum_router: Add support for custom multipath hash policy") added support for multipath hash policies where the hash is calculated based on inner packet fields. For IPv6-in-IPv6 packets, the default parsing depth (96 bytes) is not enough when these policies are used. Therefore, for such cases, call the new API to increase / decrease the parsing depth as necessary. Care is taken to ensure the API is not called multiple times. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	mlxsw: Remove old parsing depth infrastructure	Amit Cohen	2	-69/+0
	The previous patches added new API to handle parsing depth and converted the existing code to use it. Remove the old infrastructure which is not needed anymore. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	mlxsw: Convert existing consumers to use new API for parsing configuration	Amit Cohen	2	-8/+22
	Convert VxLAN and PTP modules to increase parsing depth using new API that was added in the previous patch. Separate MPRS register's configuration to VxLAN related configuration and parsing depth configuration. Handle each one using the appropriate API. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	mlxsw: spectrum: Add infrastructure for parsing configuration	Amit Cohen	2	-0/+94
	Spectrum ASICs have a configurable limit on how deep into the packet they parse. By default, the limit is 96 bytes. There are several cases where this parsing depth is not enough and there is a need to increase it. Currently, increasing parsing depth is maintained as part of VxLAN module, because the MPRS register which configures parsing depth also configures UDP destination port number used for VxLAN encapsulation and decapsulation. Add an API for increasing parsing depth as part of spectrum.c code, so that it will be possible to use it from other modules. In addition, add an API for setting UDP destination port and protect it using a dedicated lock for saving parsing configurations. The lock is needed as not all the callers hold RTNL lock. Maintain a counter for increased parsing depth consumers. For first consumer subscription, increase the parsing depth and for last consumer unsubscription, set parsing depth to default value. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	octeontx2-af: cn10k: Use FLIT0 register instead of FLIT1	Geetha sowjanya	2	-3/+3
	RVU SMMU widget stores the final translated PA at RVU_AF_SMMU_TLN_FLIT0<57:18> instead of FLIT1 register. This patch fixes the address translation logic to use the correct register. Fixes: 893ae97214c3 ("octeontx2-af: cn10k: Support configurable LMTST regions") Signed-off-by: Geetha sowjanya <gakula@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	octeontx2-pf: Fix algorithm index in MCAM rules with RSS action	Sunil Goutham	3	-0/+15
	Otherthan setting action as RSS in NPC MCAM entry, RSS flowkey algorithm index also needs to be set. Otherwise whatever algorithm is defined at flowkey index '0' will be considered by HW and pkt flows will be distributed as such. Fix this by saving the flowkey index sent by admin function while initializing RSS and then use it when framing MCAM rules. Fixes: 81a4362016e7 ("octeontx2-pf: Add RSS multi group support") Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	octeontx2-pf: Don't install VLAN offload rule if netdev is down	Sunil Goutham	1	-1/+2
	Whenever user changes interface MAC address both default DMAC based MCAM rule and VLAN offload (for strip) rules are updated with new MAC address. To update or install VLAN offload rule PF driver needs interface's receive channel info, which is retrieved from admin function at the time of NIXLF initialization. If user changes MAC address before interface is UP, VLAN offload rule installation will fail and throw error as receive channel is not valid. To avoid this, skip VLAN offload rule installation if netdev is not UP. This rule will anyway be reinslatted as part of open() call. Fixes: fd9d7859db6c ("octeontx2-pf: Implement ingress/egress VLAN offload") Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	octeontx2-af: Check capability flag while freeing ipolicer memory	Geetha sowjanya	1	-3/+6
	Bandwidth profiles (ipolicer structure)is implemented only on CN10K platform. But current code try to free the ipolicer memory without checking the capibility flag leading to driver crash on OCTEONTX2 platform. This patch fixes the issue by add capability flag check. Fixes: e8e095b3b3700 ("octeontx2-af: cn10k: Bandwidth profiles config support") Signed-off-by: Geetha sowjanya <gakula@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	octeontx2-af: Use DMA_ATTR_FORCE_CONTIGUOUS attribute in DMA alloc	Geetha sowjanya	1	-5/+6
	CN10K platform requires physically contiguous memory for LMTST operations which goes beyond a single page. Not having physically contiguous memory will result in HW fetching transmit descriptors from a wrong memory location. Hence use DMA_ATTR_FORCE_CONTIGUOUS attribute while allocating LMTST regions. Signed-off-by: Geetha sowjanya <gakula@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	octeontx2-pf: send correct vlan priority mask to npc_install_flow_req	Naveen Mamindlapalli	1	-2/+2
	This patch corrects the erroneous vlan priority mask field that was send to npc_install_flow_req. Fixes: 1d4d9e42c240 ("octeontx2-pf: Add tc flower hardware offload on ingress traffic") Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	octeontx2-pf: Don't mask out supported link modes	Hariprasad Kelam	1	-5/+0
	Supported link modes are updated by firmware in shared structure per interface. Kernel uses this value to display supported link modes via ethtool. Currently there is extra validation that firmware updated modes are validated against internal list of supported modes. As intenal list of supported modes are not updated frequently new modes supported by firmware are not updated to ethtool. Hence remove extra validation and report all firmware updated modes. Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	octeontx2-af: Handle return value in block reset.	Geetha sowjanya	1	-1/+4
	Print debug message if any of the RVU hardware blocks reset fails. Signed-off-by: Geetha sowjanya <gakula@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	octeontx2-af: cn10k: Fix SDP base channel number	Subbaraya Sundeep	2	-11/+22
	As per hardware the base channel number configured for programmable channels of a block must be multiple of number of channels of that block. This condition is not met for SDP base channel currently. Hence this patch ensures all the base channel numbers of all blocks are multiple of number of channels present in the blocks. Also instead of hardcoding SDP number of channels the same is read from the NIX_AF_CONST1 register. Fixes: 242da439214b ("octeontx2-af: cn10k: Add support for programmable") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	octeontx2-pf: Fix NIX1_RX interface backpressure	Subbaraya Sundeep	1	-0/+15
	'bp_ena' in Aura context is NIX block index, setting it zero will always backpressure NIX0 block, even if NIXLF belongs to NIX1. Hence fix this by setting it appropriately based on NIX block address. Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	net: ipa: rename "ipa_clock.c"	Alex Elder	6	-8/+8
	Finally, rename "ipa_clock.c" to be "ipa_power.c" and "ipa_clock.h" to be "ipa_power.h". Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	net: ipa: rename ipa_clock_* symbols	Alex Elder	17	-173/+171
	Rename a number of functions to clarify that there is no longer a notion of an "IPA clock," but rather that the functions are more generally related to IPA power management. ipa_clock_enable() -> ipa_power_enable() ipa_clock_disable() -> ipa_power_disable() ipa_clock_rate() -> ipa_core_clock_rate() ipa_clock_init() -> ipa_power_init() ipa_clock_exit() -> ipa_power_exit() Rename the ipa_clock structure to be ipa_power. Rename all variables and fields using that structure type "power" rather than "clock". Rename the ipa_clock_data structure to be ipa_power_data, and more broadly, just substitute "power" for "clock" in places that previously represented things related to the "IPA clock". Update comments throughout. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22	net: ipa: use autosuspend	Alex Elder	6	-25/+40
	Use runtime power management autosuspend. Up until this point, we only suspended the IPA hardware for system suspend; now we'll suspend it aggressively using runtime power management, setting the initial autosuspend delay to half a second of inactivity. Replace pm_runtime_put() calls with pm_runtime_put_autosuspend(), call pm_runtime_mark_last_busy() before each of those. In places where we're shutting things down, or decrementing power references for errors, use pm_runtime_put_noidle() instead. Finally, remove ipa_runtime_idle(), so the ->runtime_suspend callback will occur if idle. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	octeontx2-pf: Add check for non zero mcam flows	Sunil Goutham	2	-0/+25
	This patch ensures that mcam flows are allocated before adding or destroying the flows. Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: ipa: kill ipa_clock_get()	Alex Elder	3	-48/+7
	The only remaining user of the ipa_clock_{get,put}() interface is ipa_isr_thread(). Replace calls to ipa_clock_get() there calling pm_runtime_get_sync() instead. And call pm_runtime_put() there rather than ipa_clock_put(). Warn if we ever get an error. With that, we can get rid of ipa_clock_get() and ipa_clock_put(). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: ipa: don't use ipa_clock_get() in "ipa_modem.c"	Alex Elder	1	-17/+23
	When we open or close the modem network device we need to ensure the hardware is powered. Replace the callers of ipa_clock_get() found in ipa_open() and ipa_stop() with calls to pm_runtime_get_sync(). If an error is returned, simply return that error to the caller (without any error or warning message). This could conceivably occur if the function was called while the system was suspended, but that really shouldn't happen. Replace corresponding calls to ipa_clock_put() with pm_runtime_put() also. If the modem crashes we also need to ensure the hardware is powered to recover. If getting power returns an error there's not much we can do, but at least report the error. (Ideally the remoteproc SSR code would ensure the AP was not suspended when it sends the notification, but that is not (yet) the case.) Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: ipa: don't use ipa_clock_get() in "ipa_uc.c"	Alex Elder	1	-9/+13
	Replace the ipa_clock_get() call in ipa_uc_clock() when taking the "proxy" clock reference for the microcontroller with a call to pm_runtime_get_sync(). Replace calls of ipa_clock_put() for the microcontroller with pm_runtime_put() calls instead. There is a chance we get an error when taking the microcontroller power reference. This is an unlikely scenario, where system suspend is initiated just before we learn the modem is booting. For now we'll just accept that this could occur, and report it if it does. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: ipa: don't use ipa_clock_get() in "ipa_smp2p.c"	Alex Elder	1	-8/+11
	If the "modem-init" Device Tree property is present for a platform, the modem performs early IPA hardware initialization, and signals this is complete with an "ipa-setup-ready" SMP2P interrupt. This triggers a call to ipa_setup(), which requires the hardware to be powered. Replace the call to ipa_clock_get() in this case with a call to pm_runtime_get_sync(). And replace the corresponding calls to ipa_clock_put() with calls to pm_runtime_put() instead. There is a chance we get an error when taking this power reference. This is an unlikely scenario, where system suspend is initiated just before the modem signals it has finished initializing the IPA hardware. For now we'll just accept that this could occur, and report it if it does. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: ipa: don't use ipa_clock_get() in "ipa_main.c"	Alex Elder	1	-10/+11
	We need the hardware to be powered starting at the config stage of initialization when the IPA driver probes. And we need it powered when the driver is removed, at least until the deconfig stage has completed. Replace callers of ipa_clock_get() in ipa_probe() and ipa_exit(), calling pm_runtime_get_sync() instead. Replace the corresponding callers of ipa_clock_put(), calling pm_runtime_put() instead. The only error we expect when getting power would occur when the system is suspended. The ->probe and ->remove driver callbacks won't be called when suspended, so issue a WARN() call if an error is seen getting power. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: ipa: fix TX queue race	Alex Elder	3	-2/+94
	Jakub Kicinski pointed out a race condition in ipa_start_xmit() in a recently-accepted series of patches: https://lore.kernel.org/netdev/20210812195035.2816276-1-elder@linaro.org/ We are stopping the modem TX queue in that function if the power state is not active. We restart the TX queue again once hardware resume is complete. TX path Power Management ------- ---------------- pm_runtime_get(); no power Start resume Stop TX queue ... pm_runtime_put() Resume complete return NETDEV_TX_BUSY Start TX queue pm_runtime_get() Power present, transmit pm_runtime_put() (auto-suspend) The issue is that the power management (resume) activity and the network transmit activity can occur concurrently, and there's a chance the queue will be stopped after it has been started again. TX path Power Management ------- ---------------- Resume underway pm_runtime_get(); no power ... Resume complete Start TX queue Stop TX queue <-- No more transmits after this pm_runtime_put() return NETDEV_TX_BUSY We address this using a STARTED flag to indicate when the TX queue has been started from the resume path, and a spinlock to make the flag and queue updates happen atomically. TX path Power Management ------- ---------------- Resume underway pm_runtime_get(); no power Resume complete start TX queue \ If STARTED flag is not set: > atomic Stop TX queue set STARTED flag / pm_runtime_put() return NETDEV_TX_BUSY A second flag is used to address a different race that involves another path requesting power. TX path Other path Power Management ------- ---------- ---------------- pm_runtime_get_sync() Resume Start TX queue \ atomic Set STARTED flag / (do its thing) pm_runtime_put() (auto-suspend) pm_runtime_get() Mark delayed resume STARTED is set, so do not stop TX queue <-- Queue should be stopped here pm_runtime_put() return NETDEV_TX_BUSY Suspend done, resume Resume complete pm_runtime_get() Stop TX queue (STARTED is not set) Start TX queue \ atomic pm_runtime_put() Set STARTED flag / return NETDEV_TX_BUSY So a STOPPED flag is set in the transmit path when it has stopped the TX queue, and this pair of operations is also protected by the spinlock. The resume path only restarts the TX queue if the STOPPED flag is set. This case isn't a major problem, but it avoids the "non-trivial amount of useless work" done by the networking stack when NETDEV_TX_BUSY is returned. Fixes: 6b51f802d652b ("net: ipa: ensure hardware has power in ipa_start_xmit()") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: mscc: ocelot: use helpers for port VLAN membership	Vladimir Oltean	1	-20/+40
	This is a mostly cosmetic patch that creates some helpers for accessing the VLAN table. These helpers are also a bit more careful in that they do not modify the ocelot->vlan_mask unless the hardware operation succeeded. Not all callers check the return value (the init code doesn't), but anyway. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: mscc: ocelot: transmit the VLAN filtering restrictions via extack	Vladimir Oltean	3	-7/+9
	We need to transmit more restrictions in future patches, convert this one to netlink extack. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: mscc: ocelot: transmit the "native VLAN" error via extack	Vladimir Oltean	3	-21/+24
	We need to reject some more configurations in future patches, convert the existing one to netlink extack. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: mscc: ocelot: allow probing to continue with ports that fail to register	Vladimir Oltean	1	-3/+4
	The existing ocelot device trees, like ocelot_pcb123.dts for example, have SERDES ports (ports 4 and higher) that do not have status = "disabled"; but on the other hand do not have a phy-handle or a fixed-link either. So from the perspective of phylink, they have broken DT bindings. Since the blamed commit, probing for the entire switch will fail when such a device tree binding is encountered on a port. There used to be this piece of code which skipped ports without a phy-handle: phy_node = of_parse_phandle(portnp, "phy-handle", 0); if (!phy_node) continue; but now it is gone. Anyway, fixed-link setups are a thing which should work out of the box with phylink, so it would not be in the best interest of the driver to add that check back. Instead, let's look at what other drivers do. Since commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal"), DSA continues after a switch port fails to register, and works only with the ports that succeeded. We can achieve the same behavior in ocelot by unregistering the devlink port for ports where ocelot_port_phylink_create() failed (called via ocelot_probe_port), and clear the bit in devlink_ports_registered for that port. This will make the next iteration reconsider the port that failed to probe as an unused port, and re-register a devlink port of type UNUSED for it. No other cleanup should need to be performed, since ocelot_probe_port() should be self-contained when it fails. Fixes: e6e12df625f2 ("net: mscc: ocelot: convert to phylink") Reported-and-tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: mscc: ocelot: be able to reuse a devlink_port after teardown	Horatiu Vultur	1	-0/+1
	There are cases where we would like to continue probing the switch even if one port has failed to probe. When that happens, we need to unregister a devlink_port of type DEVLINK_PORT_FLAVOUR_PHYSICAL and re-register it of type DEVLINK_PORT_FLAVOUR_UNUSED. This is fine, except when calling devlink_port_attrs_set on a structure on which devlink_port_register has been previously called, there is a WARN_ON in devlink_port_attrs_set that devlink_port->devlink must be NULL. So don't assume that the memory behind dlp is clean when calling ocelot_port_devlink_init, just zero-initialize it. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: dpaa2-switch: call dpaa2_switch_port_disconnect_mac on probe error path	Vladimir Oltean	1	-5/+14
	Currently when probing returns an error, the netdev is freed but phylink_disconnect is not called. Create a common function between the unbind path and the error path, call it the opposite of dpaa2_switch_probe_port: dpaa2_switch_remove_port, and call it from both the unbind and the error path. Fixes: 84cba72956fd ("dpaa2-switch: integrate the MAC endpoint support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: dpaa2-switch: phylink_disconnect_phy needs rtnl_lock	Vladimir Oltean	1	-0/+4
	There is an ASSERT_RTNL in phylink_disconnect_phy which triggers whenever dpaa2_switch_port_disconnect_mac is called. To follow the pattern established by dpaa2_eth_disconnect_mac, take the rtnl_mutex every time we call dpaa2_switch_port_disconnect_mac. Fixes: 84cba72956fd ("dpaa2-switch: integrate the MAC endpoint support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: phy: gmii2rgmii: Support PHY loopback	Gerhard Engleder	1	-11/+35
	Configure speed if loopback is used. read_status is not called for loopback. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: phy: Uniform PHY driver access	Gerhard Engleder	1	-3/+1
	struct phy_device contains a pointer to the PHY driver and nearly everywhere this pointer is used to access the PHY driver. Only mdio_bus_phy_may_suspend() is still using to_phy_driver() instead of the PHY driver pointer. Uniform PHY driver access by eliminating to_phy_driver() use in mdio_bus_phy_may_suspend(). Only phy_bus_match() and phy_probe() are still using to_phy_driver(), because PHY driver pointer is not available there. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: phy: Support set_loopback override	Gerhard Engleder	1	-5/+4
	phy_read_status and various other PHY functions support PHY specific overriding of driver functions by using a PHY specific pointer to the PHY driver. Add support of PHY specific override to phy_loopback too. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net: sparx5: switchdev: adding frame DMA functionality	Steen Hegelund	7	-10/+693
	This add frame DMA functionality to the Sparx5 platform. Ethernet frames can be extracted or injected autonomously to or from the device’s DDR3/DDR3L memory and/or PCIe memory space. Linked list data structures in memory are used for injecting or extracting Ethernet frames. The FDMA generates interrupts when frame extraction or injection is done and when the linked lists need updating. The FDMA implements two extraction channels, one per switch core port towards the VCore CPU system and a total of six injection channels. Extraction channels are mapped one-to-one to the CPU ports, while injection channels can be individually assigned to any CPU port. - FDMA channel 0 through 5 corresponds to CPU port 0 injection direction FDMA_CH_CFG[channel].CH_INJ_PORT is set to 0. - FDMA channel 0 through 5 corresponds to CPU port 1 injection direction when FDMA_CH_CFG[channel].CH_INJ_PORT is set to 1. - FDMA channel 6 corresponds to CPU port 0 extraction direction. - FDMA channel 7 corresponds to CPU port 1 extraction direction. The FDMA implements a strict priority scheme among channels. Extraction channels are prioritized over injection channels and secondarily channels with higher channel number are prioritized over channels with lower number. On the other hand, ports are being served on an equal-bandwidth principle both on injection and extraction directions. The equal-bandwidth principle will not force an equal bandwidth. Instead, it ensures that the ports perform at their best considering the operating conditions. When more than one injection channel is enabled for injection on the same CPU port, priority determines which channel can inject data. Ownership is re-arbitrated on frame boundaries. The FDMA processes linked lists of DMA Control Block Structures (DCBs). The DCBs have the same basic structure for both injection and extraction. A DCB must be placed on a 64-bit word-aligned address in memory. Each DCB has a per-channel configurable amount of associated data blocks in memory, where the frame data is stored. The data blocks that are used by extraction channels must be placed on 64-bit word aligned addresses in memory, and their length must be a multiple of 128 bytes. A DCB carries the pointer to the next DCB of the linked list, the INFO word which holds information for the DCB, and a pair of status word and memory pointer for every data block that it is associated with. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	Merge tag 'mlx5-updates-2021-08-19' of ↵	David S. Miller	22	-702/+1769
	git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2021-08-19 This series introduces the support for two new mlx5 features: 1) Sample offload for tunneled traffic 2) devlink rate objects support 1) From Chris Mi: Sample offload for tunneled traffic ===================================================== Background and solution ----------------------- Currently the sample offload actions send the encapsulated packet to software. This series de-capsulates the packet before performing the sampling and set the tunnel properties on the skb metadata fields to make the behavior consistent with OVS sFlow. If de-capsulating first, we can't use the same match like before in default table. So instantiate a post action instance to continue processing the action list. If HW can preserve reg_c, also use the post action instance. Post action infrastructure -------------------------- Some tc actions are modeled in hardware using multiple tables causing a tc action list split. For example, CT action is modeled by jumping to a ct table which is controlled by nf flow table. sFlow jumps in hardware to a sample table, which continues to a "default table" where it should continue processing the action list. Multi table actions are modeled in hardware using a unique fte_id. The fte_id is set before jumping to a table. Split actions continue to a post-action table where the matched fte_id value continues the execution the tc action list. This series also introduces post action infrastructure. Both ct and sample use it. Sample for tunnel in TC SW -------------------------- tc filter add dev vxlan1 protocol ip parent ffff: prio 3 \ flower src_mac 24:25:d0:e1:00:00 dst_mac 02:25:d0:13:01:02 \ enc_src_ip 192.168.1.14 enc_dst_ip 192.168.1.13 \ enc_dst_port 4789 enc_key_id 4 \ action sample rate 1 group 6 \ action tunnel_key unset \ action mirred egress redirect dev enp4s0f0_1 MLX5 sample HW offload ---------------------- For the following typical flow table: +-------------------------------+ + original flow table + +-------------------------------+ + original match + +-------------------------------+ + sample action + other actions + +-------------------------------+ We translate the tc filter with sample action to the following HW model: +---------------------+ + original flow table + +---------------------+ + original match + +---------------------+ \| set fte_id (if reg_c preserve cap) \| do decap v +------------------------------------------------+ + Flow Sampler Object + +------------------------------------------------+ + sample ratio + +------------------------------------------------+ + sample table id \| default table id + +------------------------------------------------+ \| \| v v +-----------------------------+ +-------------------+ + sample table + + default table + +-----------------------------+ +-------------------+ + forward to management vport + \| +-----------------------------+ \| +-------+------+ \| \|reg_c preserve cap \| \|or decap action v v +-----------------+ +-------------+ + per vport table + + post action + +-----------------+ +-------------+ + original match + +-----------------+ + other actions + +-----------------+ 2) From Dmytro Linkin: devlink rate object support for mlx5_core driver ======================================================================= HIGH-LEVEL OVERVIEW Devlink leaf rate objects created per vport (VF/SF, and PF on BlueField) in switchdev mode on devlink port registration. Implement devlink ops callbacks to create/destroy rate groups, set TX rate values of the vport/group, assign vport to the group. Driver accepts TX rate values as fraction of 1Mbps. Refactor existing eswitch QoS infrastructure to be accessible by legacy NDO rate API and new devlink rate API. NDO rate API is not removed/disabled in switchdev mode to not break existing users. Rate values configured with NDO rate API are not visible for devlink infrastructure, therefore APIs should not be used simultaneously. IMPLEMENTATION DETAILS Driver provide two level rate hierarchy to manage bandwidth - group level and vport level. Initially each vport added to internal unlimited group created by default. Each rate element (vport or group) receive bandwidth relative to its parent element (for groups the parent is a physical link itself) in a Round Robin manner, where element get bandwidth value according to its weight. Example: Created four rate groups with tx_share limits: $ devlink port function rate add \ pci/0000:06:00.0/group_1 tx_share 30gbit $ devlink port function rate add \ pci/0000:06:00.0/group_2 tx_share 20gbit $ devlink port function rate add \ pci/0000:06:00.0/group_3 tx_share 20gbit $ devlink port function rate add \ pci/0000:06:00.0/group_4 tx_share 10gbit Weights created in HW for each group are relative to the bigest tx_share value, which is 30gbit: <group_1> 1.0 <group_2> 0.67 <group_3> 0.67 <group_4> 0.33 Assuming link speed is 50 Gbit/sec and each group can sustain such amount of traffic, maximum bandwidth is 50 / (1.0 + 0.67 + 0.67 + 0.33) = ~18.75 Gbit/sec. Normilized bandwidth values for groups: <group_1> 18.75 * 1.0 = 18.75 Gbit/sec <group_2> 18.75 * 0.67 = 12.5 Gbit/sec <group_3> 18.75 * 0.67 = 12.5 Gbit/sec <group_4> 18.75 * 0.33 = 6.25 Gbit/sec If in example above group_1 doesn't produce any traffic, then maximum bandwidth becomes 50 / (0.67 + 0.67 + 0.33) = ~30.0 Gbit/sec. Normalized values: <group_2> 30.0 * 0.67 = 20.0 Gbit/sec <group_3> 30.0 * 0.67 = 20.0 Gbit/sec <group_4> 30.0 * 0.33 = 10.0 Gbit/sec Same normalization applied to each vport in the group. Normalized values are internal, therefore driver provides QoS tracepoints for next events: * vport rate element creation/deletion: * vport rate element configuration; * group rate element creation/deletion; * group rate element configuration. PATCHES OVERVIEW 1 - Moving and isolation of eswitch QoS logic in separate file; 2 - Implement devlink leaf rate object support for vports; 3 - Implement rate groups creation/deletion; 4 - Implement TX rate management for the groups; 5 - Implement parent set for vports; 6 - Eswitch QoS tracepoints. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	Merge tag 'for-net-next-2021-08-19' of ↵	David S. Miller	11	-1307/+1698
	git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for Foxconn Mediatek Chip - Add support for LG LGSBWAC92/TWCM-K505D - hci_h5 flow control fixes and suspend support - Switch to use lock_sock for SCO and RFCOMM - Various fixes for extended advertising - Reword Intel's setup on btusb unifying the supported generations ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20	net/mlx5: E-switch, Add QoS tracepoints	Dmytro Linkin	2	-1/+135
	Add tracepoints to log QoS enabling/disabling/configuration for vports and rate groups. Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-20	net/mlx5: E-switch, Allow to add vports to rate groups	Dmytro Linkin	5	-25/+199
	Implement eswitch API that allows updating rate groups. If group pointer is NULL, then move the vport to internal unlimited group zero. Implement devlink_ops->rate_parent_node_set() callback in the terms of the new eswitch group update API. Enable QoS for all group's elements if a group has allocated BW share. Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-20	net/mlx5: E-switch, Allow setting share/max tx rate limits of rate groups	Dmytro Linkin	4	-39/+225
	Provide eswitch API to allow controlling group rate limits. Use it to implement devlink_ops->mlx5_devlink_rate_node_tx_{share\|max}_set(). The share rate will create relative bandwidth share on the groups level while within the group the user can set shared rate on the member vports of that group and this rate will be relative to the group's share rate. The group with the highest shared rate will get a BW share of 100 and the rest of the groups will get a value that reflects the ratio between their share rate and the maximum share rate. Example: Created four rate groups with tx_share limits: $ devlink port function rate add \ pci/0000:06:00.0/group_1 tx_share 30gbit $ devlink port function rate add \ pci/0000:06:00.0/group_2 tx_share 20gbit $ devlink port function rate add \ pci/0000:06:00.0/group_3 tx_share 20gbit $ devlink port function rate add \ pci/0000:06:00.0/group_4 tx_share 10gbit Assuming link speed is 50 Gbit/sec ratio divider will be 50 / (30+20+20+10) = 0.625. Normalized rate values for the groups: <group_1> 30 * 0.625 = 18.75 Gbit/sec <group_2> 20 * 0.625 = 12.5 Gbit/sec <group_3> 20 * 0.625 = 12.5 Gbit/sec <group_4> 10 * 0.625 = 6.25 Gbit/sec Rate group with unlimited tx_share rate will receive minimum BW value (1Mbit/sec) if presented any group with tx_share rate limit. This allow to not drop all packets in case of heavy traffic. Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-20	net/mlx5: E-switch, Introduce rate limiting groups API	Dmytro Linkin	4	-5/+143
	Extend eswitch API with rate limiting groups: - Define new struct mlx5_esw_rate_group that is used to hold all internal group data. - Implement functions that allow creation, destruction and cleanup of groups. - Assign all vports to internal unlimited zero group by default. This commit lays the groundwork for group rate limiting by implementing devlink_ops->rate_node_{new\|del}() callbacks to support creating and deleting groups through devlink rate node objects. APIs that allows setting rates and adding/removing members are implemented in following patches. Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-20	net/mlx5: E-switch, Enable devlink port tx_{share\|max} rate control	Dmytro Linkin	5	-27/+157
	Register devlink rate leaf object for every eswitch vport. Implement devlink ops that enable setting shared and max tx rates through devlink API. Extract common eswitch code from existing tx rate set function that is accessed through NDO to be reused for the devlink. Values configured with NDO API are not visible for the devlink API, therefore shouldn't be used simultaneously. When normalizing the BW share value, dividing the desired minimum rate by the common divider results in losing information since the quotient is rounded down. This has a significant affect on configurations of low rate where the round down eliminates a large percentage of the total rate. To improve the formula, round up the division result to make sure that the BW share is at least the value it was supposed to be and won't lost a significant amount of the expected value. Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-20	net/mlx5: E-switch, Move QoS related code to dedicated file	Dmytro Linkin	7	-316/+346
	Move eswitch QoS related code into dedicated file. Provide eswitch API to access this code meaning it is isolated and restricted to be used only by eswitch.c. Exception is legacy NDO vf set rate, which moved to esw/legacy.c. Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-20	net/mlx5e: TC, Support sample offload action for tunneled traffic	Chris Mi	4	-91/+214
	Currently the sample offload actions send the encapsulated packet to software. This commit decapsulates the packet before performing the sampling and set the tunnel properties on the skb metadata fields to make the behavior consistent with OVS sFlow. If decapsulating first, we can't use the same match like before in default table. So instantiate a post action instance to continue processing the action list. If HW can preserve reg_c, also use the post action instance. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-20	net/mlx5e: TC, Restore tunnel info for sample offload	Chris Mi	5	-14/+37
	Currently the sample offload actions send the encapsulated packet to software. sFlow expects tunneled packets to be decapsulated while having the tunnel properties on the skb metadata fields. Reuse the functions used by connection tracking to map the outer header properties to a unique id. The next patch will use that id to restore the tunnel information of decapsulated packets onto the skb. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-20	net/mlx5e: TC, Remove CONFIG_NET_TC_SKB_EXT dependency when restoring tunnel	Chris Mi	1	-9/+6
	CONFIG_NET_TC_SKB_EXT controls the SKB extension support for restoring chain ids. SKB extension is not required for tunnel restoration. Remove the CONFIG_NET_TC_SKB_EXT dependency as a pre-step for using the tunnel restore methods for sample offload use cases. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-20	net/mlx5e: Refactor ct to use post action infrastructure	Chris Mi	7	-122/+176
	Move post action table management to common library providing add/del/get API. Refactor the ct action offload to use the common API. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-20	net/mlx5e: Introduce post action infrastructure	Chris Mi	3	-1/+81
	Some tc actions are modeled in hardware using multiple tables causing a tc action list split. For example, CT action is modeled by jumping to a ct table which is controlled by nf flowtable. sFlow jumps in hardware to a sample table, which continues to a "default table" where it should continue processing the action list. Multi table actions are modeled in hardware using a unique fte_id. The fte_id is set before jumping to a table. Split actions continue to a post-action table where the matched fte_id value continues the execution the tc action list. Currently the post-action design is implemented only by the ct action. Introduce post action infrastructure as a pre-step for reusing it with the sFlow offload feature. Init and destroy the common post action table. Refactor the ct offload to use the common post table infrastructure in the next patch. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-20	net/mlx5e: CT, Use xarray to manage fte ids	Chris Mi	1	-9/+9
	IDR is deprecated. Use xarray instead. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>