diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2022-12-14 02:47:48 +0300 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2022-12-14 02:47:48 +0300 |
commit | 7e68dd7d07a28faa2e6574dd6b9dbd90cdeaae91 (patch) | |
tree | ae0427c5a3b905f24b3a44b510a9bcf35d9b67a3 /Documentation/networking/devlink | |
parent | 1ca06f1c1acecbe02124f14a37cce347b8c1a90c (diff) | |
parent | 7c4a6309e27f411743817fe74a832ec2d2798a4b (diff) | |
download | linux-7e68dd7d07a28faa2e6574dd6b9dbd90cdeaae91.tar.xz |
Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Allow live renaming when an interface is up
- Add retpoline wrappers for tc, improving considerably the
performances of complex queue discipline configurations
- Add inet drop monitor support
- A few GRO performance improvements
- Add infrastructure for atomic dev stats, addressing long standing
data races
- De-duplicate common code between OVS and conntrack offloading
infrastructure
- A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements
- Netfilter: introduce packet parser for tunneled packets
- Replace IPVS timer-based estimators with kthreads to scale up the
workload with the number of available CPUs
- Add the helper support for connection-tracking OVS offload
BPF:
- Support for user defined BPF objects: the use case is to allocate
own objects, build own object hierarchies and use the building
blocks to build own data structures flexibly, for example, linked
lists in BPF
- Make cgroup local storage available to non-cgroup attached BPF
programs
- Avoid unnecessary deadlock detection and failures wrt BPF task
storage helpers
- A relevant bunch of BPF verifier fixes and improvements
- Veristat tool improvements to support custom filtering, sorting,
and replay of results
- Add LLVM disassembler as default library for dumping JITed code
- Lots of new BPF documentation for various BPF maps
- Add bpf_rcu_read_{,un}lock() support for sleepable programs
- Add RCU grace period chaining to BPF to wait for the completion of
access from both sleepable and non-sleepable BPF programs
- Add support storing struct task_struct objects as kptrs in maps
- Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
values
- Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions
Protocols:
- TCP: implement Protective Load Balancing across switch links
- TCP: allow dynamically disabling TCP-MD5 static key, reverting back
to fast[er]-path
- UDP: Introduce optional per-netns hash lookup table
- IPv6: simplify and cleanup sockets disposal
- Netlink: support different type policies for each generic netlink
operation
- MPTCP: add MSG_FASTOPEN and FastOpen listener side support
- MPTCP: add netlink notification support for listener sockets events
- SCTP: add VRF support, allowing sctp sockets binding to VRF devices
- Add bridging MAC Authentication Bypass (MAB) support
- Extensions for Ethernet VPN bridging implementation to better
support multicast scenarios
- More work for Wi-Fi 7 support, comprising conversion of all the
existing drivers to internal TX queue usage
- IPSec: introduce a new offload type (packet offload) allowing
complete header processing and crypto offloading
- IPSec: extended ack support for more descriptive XFRM error
reporting
- RXRPC: increase SACK table size and move processing into a
per-local endpoint kernel thread, reducing considerably the
required locking
- IEEE 802154: synchronous send frame and extended filtering support,
initial support for scanning available 15.4 networks
- Tun: bump the link speed from 10Mbps to 10Gbps
- Tun/VirtioNet: implement UDP segmentation offload support
Driver API:
- PHY/SFP: improve power level switching between standard level 1 and
the higher power levels
- New API for netdev <-> devlink_port linkage
- PTP: convert existing drivers to new frequency adjustment
implementation
- DSA: add support for rx offloading
- Autoload DSA tagging driver when dynamically changing protocol
- Add new PCP and APPTRUST attributes to Data Center Bridging
- Add configuration support for 800Gbps link speed
- Add devlink port function attribute to enable/disable RoCE and
migratable
- Extend devlink-rate to support strict prioriry and weighted fair
queuing
- Add devlink support to directly reading from region memory
- New device tree helper to fetch MAC address from nvmem
- New big TCP helper to simplify temporary header stripping
New hardware / drivers:
- Ethernet:
- Marvel Octeon CNF95N and CN10KB Ethernet Switches
- Marvel Prestera AC5X Ethernet Switch
- WangXun 10 Gigabit NIC
- Motorcomm yt8521 Gigabit Ethernet
- Microchip ksz9563 Gigabit Ethernet Switch
- Microsoft Azure Network Adapter
- Linux Automation 10Base-T1L adapter
- PHY:
- Aquantia AQR112 and AQR412
- Motorcomm YT8531S
- PTP:
- Orolia ART-CARD
- WiFi:
- MediaTek Wi-Fi 7 (802.11be) devices
- RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
devices
- Bluetooth:
- Broadcom BCM4377/4378/4387 Bluetooth chipsets
- Realtek RTL8852BE and RTL8723DS
- Cypress.CYW4373A0 WiFi + Bluetooth combo device
Drivers:
- CAN:
- gs_usb: bus error reporting support
- kvaser_usb: listen only and bus error reporting support
- Ethernet NICs:
- Intel (100G):
- extend action skbedit to RX queue mapping
- implement devlink-rate support
- support direct read from memory
- nVidia/Mellanox (mlx5):
- SW steering improvements, increasing rules update rate
- Support for enhanced events compression
- extend H/W offload packet manipulation capabilities
- implement IPSec packet offload mode
- nVidia/Mellanox (mlx4):
- better big TCP support
- Netronome Ethernet NICs (nfp):
- IPsec offload support
- add support for multicast filter
- Broadcom:
- RSS and PTP support improvements
- AMD/SolarFlare:
- netlink extened ack improvements
- add basic flower matches to offload, and related stats
- Virtual NICs:
- ibmvnic: introduce affinity hint support
- small / embedded:
- FreeScale fec: add initial XDP support
- Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
- TI am65-cpsw: add suspend/resume support
- Mediatek MT7986: add RX wireless wthernet dispatch support
- Realtek 8169: enable GRO software interrupt coalescing per
default
- Ethernet high-speed switches:
- Microchip (sparx5):
- add support for Sparx5 TC/flower H/W offload via VCAP
- Mellanox mlxsw:
- add 802.1X and MAC Authentication Bypass offload support
- add ip6gre support
- Embedded Ethernet switches:
- Mediatek (mtk_eth_soc):
- improve PCS implementation, add DSA untag support
- enable flow offload support
- Renesas:
- add rswitch R-Car Gen4 gPTP support
- Microchip (lan966x):
- add full XDP support
- add TC H/W offload via VCAP
- enable PTP on bridge interfaces
- Microchip (ksz8):
- add MTU support for KSZ8 series
- Qualcomm 802.11ax WiFi (ath11k):
- support configuring channel dwell time during scan
- MediaTek WiFi (mt76):
- enable Wireless Ethernet Dispatch (WED) offload support
- add ack signal support
- enable coredump support
- remain_on_channel support
- Intel WiFi (iwlwifi):
- enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
- 320 MHz channels support
- RealTek WiFi (rtw89):
- new dynamic header firmware format support
- wake-over-WLAN support"
* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
ipvs: fix type warning in do_div() on 32 bit
net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
net: ipa: add IPA v4.7 support
dt-bindings: net: qcom,ipa: Add SM6350 compatible
bnxt: Use generic HBH removal helper in tx path
IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
selftests: forwarding: Add bridge MDB test
selftests: forwarding: Rename bridge_mdb test
bridge: mcast: Support replacement of MDB port group entries
bridge: mcast: Allow user space to specify MDB entry routing protocol
bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
bridge: mcast: Add support for (*, G) with a source list and filter mode
bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
bridge: mcast: Add a flag for user installed source entries
bridge: mcast: Expose __br_multicast_del_group_src()
bridge: mcast: Expose br_multicast_new_group_src()
bridge: mcast: Add a centralized error path
bridge: mcast: Place netlink policy before validation functions
bridge: mcast: Split (*, G) and (S, G) addition into different functions
bridge: mcast: Do not derive entry type from its filter mode
...
Diffstat (limited to 'Documentation/networking/devlink')
-rw-r--r-- | Documentation/networking/devlink/devlink-info.rst | 5 | ||||
-rw-r--r-- | Documentation/networking/devlink/devlink-port.rst | 168 | ||||
-rw-r--r-- | Documentation/networking/devlink/devlink-region.rst | 13 | ||||
-rw-r--r-- | Documentation/networking/devlink/devlink-trap.rst | 13 | ||||
-rw-r--r-- | Documentation/networking/devlink/etas_es58x.rst | 36 | ||||
-rw-r--r-- | Documentation/networking/devlink/ice.rst | 128 |
6 files changed, 354 insertions, 9 deletions
diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst index 7572bf6de5c1..1242b0e6826b 100644 --- a/Documentation/networking/devlink/devlink-info.rst +++ b/Documentation/networking/devlink/devlink-info.rst @@ -198,6 +198,11 @@ fw.bundle_id Unique identifier of the entire firmware bundle. +fw.bootloader +------------- + +Version of the bootloader. + Future work =========== diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst index 7627b1da01f2..3da590953ce8 100644 --- a/Documentation/networking/devlink/devlink-port.rst +++ b/Documentation/networking/devlink/devlink-port.rst @@ -110,7 +110,7 @@ devlink ports for both the controllers. Function configuration ====================== -A user can configure the function attribute before enumerating the PCI +Users can configure one or more function attributes before enumerating the PCI function. Usually it means, user should configure function attribute before a bus specific device for the function is created. However, when SRIOV is enabled, virtual function devices are created on the PCI bus. @@ -119,9 +119,127 @@ function device to the driver. For subfunctions, this means user should configure port function attribute before activating the port function. A user may set the hardware address of the function using -'devlink port function set hw_addr' command. For Ethernet port function +`devlink port function set hw_addr` command. For Ethernet port function this means a MAC address. +Users may also set the RoCE capability of the function using +`devlink port function set roce` command. + +Users may also set the function as migratable using +'devlink port function set migratable' command. + +Function attributes +=================== + +MAC address setup +----------------- +The configured MAC address of the PCI VF/SF will be used by netdevice and rdma +device created for the PCI VF/SF. + +- Get the MAC address of the VF identified by its unique devlink port index:: + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 + +- Set the MAC address of the VF identified by its unique devlink port index:: + + $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55 + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:11:22:33:44:55 + +- Get the MAC address of the SF identified by its unique devlink port index:: + + $ devlink port show pci/0000:06:00.0/32768 + pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88 + function: + hw_addr 00:00:00:00:00:00 + +- Set the MAC address of the SF identified by its unique devlink port index:: + + $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 + + $ devlink port show pci/0000:06:00.0/32768 + pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88 + function: + hw_addr 00:00:00:00:88:88 + +RoCE capability setup +--------------------- +Not all PCI VFs/SFs require RoCE capability. + +When RoCE capability is disabled, it saves system memory per PCI VF/SF. + +When user disables RoCE capability for a VF/SF, user application cannot send or +receive any RoCE packets through this VF/SF and RoCE GID table for this PCI +will be empty. + +When RoCE capability is disabled in the device using port function attribute, +VF/SF driver cannot override it. + +- Get RoCE capability of the VF device:: + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 roce enable + +- Set RoCE capability of the VF device:: + + $ devlink port function set pci/0000:06:00.0/2 roce disable + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 roce disable + +migratable capability setup +--------------------------- +Live migration is the process of transferring a live virtual machine +from one physical host to another without disrupting its normal +operation. + +User who want PCI VFs to be able to perform live migration need to +explicitly enable the VF migratable capability. + +When user enables migratable capability for a VF, and the HV binds the VF to VFIO driver +with migration support, the user can migrate the VM with this VF from one HV to a +different one. + +However, when migratable capability is enable, device will disable features which cannot +be migrated. Thus migratable cap can impose limitations on a VF so let the user decide. + +Example of LM with migratable function configuration: +- Get migratable capability of the VF device:: + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 migratable disable + +- Set migratable capability of the VF device:: + + $ devlink port function set pci/0000:06:00.0/2 migratable enable + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 migratable enable + +- Bind VF to VFIO driver with migration support:: + + $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind + $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override + $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind + +Attach VF to the VM. +Start the VM. +Perform live migration. + Subfunction ============ @@ -130,10 +248,11 @@ it is deployed. Subfunction is created and deployed in unit of 1. Unlike SRIOV VFs, a subfunction doesn't require its own PCI virtual function. A subfunction communicates with the hardware through the parent PCI function. -To use a subfunction, 3 steps setup sequence is followed. -(1) create - create a subfunction; -(2) configure - configure subfunction attributes; -(3) deploy - deploy the subfunction; +To use a subfunction, 3 steps setup sequence is followed: + +1) create - create a subfunction; +2) configure - configure subfunction attributes; +3) deploy - deploy the subfunction; Subfunction management is done using devlink port user interface. User performs setup on the subfunction management device. @@ -191,13 +310,48 @@ API allows to configure following rate object's parameters: ``tx_max`` Maximum TX rate value. +``tx_priority`` + Allows for usage of strict priority arbiter among siblings. This + arbitration scheme attempts to schedule nodes based on their priority + as long as the nodes remain within their bandwidth limit. The higher the + priority the higher the probability that the node will get selected for + scheduling. + +``tx_weight`` + Allows for usage of Weighted Fair Queuing arbitration scheme among + siblings. This arbitration scheme can be used simultaneously with the + strict priority. As a node is configured with a higher rate it gets more + BW relative to it's siblings. Values are relative like a percentage + points, they basically tell how much BW should node take relative to + it's siblings. + ``parent`` Parent node name. Parent node rate limits are considered as additional limits to all node children limits. ``tx_max`` is an upper limit for children. ``tx_share`` is a total bandwidth distributed among children. +``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case +nodes with the same priority form a WFQ subgroup in the sibling group +and arbitration among them is based on assigned weights. + +Arbitration flow from the high level: + +#. Choose a node, or group of nodes with the highest priority that stays + within the BW limit and are not blocked. Use ``tx_priority`` as a + parameter for this arbitration. + +#. If group of nodes have the same priority perform WFQ arbitration on + that subgroup. Use ``tx_weight`` as a parameter for this arbitration. + +#. Select the winner node, and continue arbitration flow among it's children, + until leaf node is reached, and the winner is established. + +#. If all the nodes from the highest priority sub-group are satisfied, or + overused their assigned BW, move to the lower priority nodes. + Driver implementations are allowed to support both or either rate object types -and setting methods of their parameters. +and setting methods of their parameters. Additionally driver implementation +may export nodes/leafs and their child-parent relationships. Terms and Definitions ===================== diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst index f06dca9a1eb6..9232cd7da301 100644 --- a/Documentation/networking/devlink/devlink-region.rst +++ b/Documentation/networking/devlink/devlink-region.rst @@ -31,6 +31,15 @@ in its ``devlink_region_ops`` structure. If snapshot id is not set in the ``DEVLINK_CMD_REGION_NEW`` request kernel will allocate one and send the snapshot information to user space. +Regions may optionally allow directly reading from their contents without a +snapshot. Direct read requests are not atomic. In particular a read request +of size 256 bytes or larger will be split into multiple chunks. If atomic +access is required, use a snapshot. A driver wishing to enable this for a +region should implement the ``.read`` callback in the ``devlink_region_ops`` +structure. User space can request a direct read by using the +``DEVLINK_ATTR_REGION_DIRECT`` attribute instead of specifying a snapshot +id. + example usage ------------- @@ -65,6 +74,10 @@ example usage $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 length 16 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30 + # Read from the region without a snapshot + $ devlink region read pci/0000:00:05.0/fw-health address 16 length 16 + 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8 + As regions are likely very device or driver specific, no generic regions are defined. See the driver-specific documentation files for information on the specific regions a driver supports. diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst index 90d1381b88de..2c14dfe69b3a 100644 --- a/Documentation/networking/devlink/devlink-trap.rst +++ b/Documentation/networking/devlink/devlink-trap.rst @@ -485,6 +485,16 @@ be added to the following table: - Traps incoming packets that the device decided to drop because the destination MAC is not configured in the MAC table and the interface is not in promiscuous mode + * - ``eapol`` + - ``control`` + - Traps "Extensible Authentication Protocol over LAN" (EAPOL) packets + specified in IEEE 802.1X + * - ``locked_port`` + - ``drop`` + - Traps packets that the device decided to drop because they failed the + locked bridge port check. That is, packets that were received via a + locked port and whose {SMAC, VID} does not correspond to an FDB entry + pointing to the port Driver-specific Packet Traps ============================ @@ -589,6 +599,9 @@ narrow. The description of these groups must be added to the following table: * - ``parser_error_drops`` - Contains packet traps for packets that were marked by the device during parsing as erroneous + * - ``eapol`` + - Contains packet traps for "Extensible Authentication Protocol over LAN" + (EAPOL) packets specified in IEEE 802.1X Packet Trap Policers ==================== diff --git a/Documentation/networking/devlink/etas_es58x.rst b/Documentation/networking/devlink/etas_es58x.rst new file mode 100644 index 000000000000..3b857d82a44c --- /dev/null +++ b/Documentation/networking/devlink/etas_es58x.rst @@ -0,0 +1,36 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +etas_es58x devlink support +========================== + +This document describes the devlink features implemented by the +``etas_es58x`` device driver. + +Info versions +============= + +The ``etas_es58x`` driver reports the following versions + +.. list-table:: devlink info versions implemented + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``fw`` + - running + - Version of the firmware running on the device. Also available + through ``ethtool -i`` as the first member of the + ``firmware-version``. + * - ``fw.bootloader`` + - running + - Version of the bootloader running on the device. Also available + through ``ethtool -i`` as the second member of the + ``firmware-version``. + * - ``board.rev`` + - fixed + - The hardware revision of the device. + * - ``serial_number`` + - fixed + - The USB serial number. Also available through ``lsusb -v``. diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst index 0c89ceb8986d..625efb3777d5 100644 --- a/Documentation/networking/devlink/ice.rst +++ b/Documentation/networking/devlink/ice.rst @@ -189,12 +189,21 @@ device data. * - ``nvm-flash`` - The contents of the entire flash chip, sometimes referred to as the device's Non Volatile Memory. + * - ``shadow-ram`` + - The contents of the Shadow RAM, which is loaded from the beginning + of the flash. Although the contents are primarily from the flash, + this area also contains data generated during device boot which is + not stored in flash. * - ``device-caps`` - The contents of the device firmware's capabilities buffer. Useful to determine the current state and configuration of the device. -Users can request an immediate capture of a snapshot via the -``DEVLINK_CMD_REGION_NEW`` +Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a +snapshot. The ``device-caps`` region requires a snapshot as the contents are +sent by firmware and can't be split into separate reads. + +Users can request an immediate capture of a snapshot for all three regions +via the ``DEVLINK_CMD_REGION_NEW`` command. .. code:: shell @@ -254,3 +263,118 @@ Users can request an immediate capture of a snapshot via the 0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1 + +Devlink Rate +============ + +The ``ice`` driver implements devlink-rate API. It allows for offload of +the Hierarchical QoS to the hardware. It enables user to group Virtual +Functions in a tree structure and assign supported parameters: tx_share, +tx_max, tx_priority and tx_weight to each node in a tree. So effectively +user gains an ability to control how much bandwidth is allocated for each +VF group. This is later enforced by the HW. + +It is assumed that this feature is mutually exclusive with DCB performed +in FW and ADQ, or any driver feature that would trigger changes in QoS, +for example creation of the new traffic class. The driver will prevent DCB +or ADQ configuration if user started making any changes to the nodes using +devlink-rate API. To configure those features a driver reload is necessary. +Correspondingly if ADQ or DCB will get configured the driver won't export +hierarchy at all, or will remove the untouched hierarchy if those +features are enabled after the hierarchy is exported, but before any +changes are made. + +This feature is also dependent on switchdev being enabled in the system. +It's required bacause devlink-rate requires devlink-port objects to be +present, and those objects are only created in switchdev mode. + +If the driver is set to the switchdev mode, it will export internal +hierarchy the moment VF's are created. Root of the tree is always +represented by the node_0. This node can't be deleted by the user. Leaf +nodes and nodes with children also can't be deleted. + +.. list-table:: Attributes supported + :widths: 15 85 + + * - Name + - Description + * - ``tx_max`` + - maximum bandwidth to be consumed by the tree Node. Rate Limit is + an absolute number specifying a maximum amount of bytes a Node may + consume during the course of one second. Rate limit guarantees + that a link will not oversaturate the receiver on the remote end + and also enforces an SLA between the subscriber and network + provider. + * - ``tx_share`` + - minimum bandwidth allocated to a tree node when it is not blocked. + It specifies an absolute BW. While tx_max defines the maximum + bandwidth the node may consume, the tx_share marks committed BW + for the Node. + * - ``tx_priority`` + - allows for usage of strict priority arbiter among siblings. This + arbitration scheme attempts to schedule nodes based on their + priority as long as the nodes remain within their bandwidth limit. + Range 0-7. Nodes with priority 7 have the highest priority and are + selected first, while nodes with priority 0 have the lowest + priority. Nodes that have the same priority are treated equally. + * - ``tx_weight`` + - allows for usage of Weighted Fair Queuing arbitration scheme among + siblings. This arbitration scheme can be used simultaneously with + the strict priority. Range 1-200. Only relative values mater for + arbitration. + +``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case +nodes with the same priority form a WFQ subgroup in the sibling group +and arbitration among them is based on assigned weights. + +.. code:: shell + + # enable switchdev + $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev + + # at this point driver should export internal hierarchy + $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs + + $ devlink port function rate show + pci/0000:4b:00.0/node_25: type node parent node_24 + pci/0000:4b:00.0/node_24: type node parent node_0 + pci/0000:4b:00.0/node_32: type node parent node_31 + pci/0000:4b:00.0/node_31: type node parent node_30 + pci/0000:4b:00.0/node_30: type node parent node_16 + pci/0000:4b:00.0/node_19: type node parent node_18 + pci/0000:4b:00.0/node_18: type node parent node_17 + pci/0000:4b:00.0/node_17: type node parent node_16 + pci/0000:4b:00.0/node_14: type node parent node_5 + pci/0000:4b:00.0/node_5: type node parent node_3 + pci/0000:4b:00.0/node_13: type node parent node_4 + pci/0000:4b:00.0/node_12: type node parent node_4 + pci/0000:4b:00.0/node_11: type node parent node_4 + pci/0000:4b:00.0/node_10: type node parent node_4 + pci/0000:4b:00.0/node_9: type node parent node_4 + pci/0000:4b:00.0/node_8: type node parent node_4 + pci/0000:4b:00.0/node_7: type node parent node_4 + pci/0000:4b:00.0/node_6: type node parent node_4 + pci/0000:4b:00.0/node_4: type node parent node_3 + pci/0000:4b:00.0/node_3: type node parent node_16 + pci/0000:4b:00.0/node_16: type node parent node_15 + pci/0000:4b:00.0/node_15: type node parent node_0 + pci/0000:4b:00.0/node_2: type node parent node_1 + pci/0000:4b:00.0/node_1: type node parent node_0 + pci/0000:4b:00.0/node_0: type node + pci/0000:4b:00.0/1: type leaf parent node_25 + pci/0000:4b:00.0/2: type leaf parent node_25 + + # let's create some custom node + $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0 + + # second custom node + $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom + + # reassign second VF to newly created branch + $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1 + + # assign tx_weight to the VF + $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5 + + # assign tx_share to the VF + $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps |