1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
|
.. SPDX-License-Identifier: GPL-2.0
.. _devlink_port:
============
Devlink Port
============
``devlink-port`` is a port that exists on the device. It has a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour along with port attributes
describe what a port represents.
A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.
Devlink port flavours are described below.
.. list-table:: List of devlink port flavours
:widths: 33 90
* - Flavour
- Description
* - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
- Any kind of physical port. This can be an eswitch physical port or any
other physical port on the device.
* - ``DEVLINK_PORT_FLAVOUR_DSA``
- This indicates a DSA interconnect port.
* - ``DEVLINK_PORT_FLAVOUR_CPU``
- This indicates a CPU port applicable only to DSA.
* - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
- This indicates an eswitch port representing a port of PCI
physical function (PF).
* - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
- This indicates an eswitch port representing a port of PCI
virtual function (VF).
* - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
- This indicates an eswitch port representing a port of PCI
subfunction (SF).
* - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
- This indicates a virtual port for the PCI virtual function.
Devlink port can have a different type based on the link layer described below.
.. list-table:: List of devlink port types
:widths: 23 90
* - Type
- Description
* - ``DEVLINK_PORT_TYPE_ETH``
- Driver should set this port type when a link layer of the port is
Ethernet.
* - ``DEVLINK_PORT_TYPE_IB``
- Driver should set this port type when a link layer of the port is
InfiniBand.
* - ``DEVLINK_PORT_TYPE_AUTO``
- This type is indicated by the user when driver should detect the port
type automatically.
PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical, virtual functions and subfunctions. A function
consists of one or more ports. This port is represented by the devlink eswitch
port.
A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
An eswitch is on the PCI device which supports ports of multiple controllers.
An example view of a system with two controllers::
---------------------------------------------------------
| |
| --------- --------- ------- ------- |
----------- | | vf(s) | | sf(s) | |vf(s)| |sf(s)| |
| server | | ------- ----/---- ---/----- ------- ---/--- ---/--- |
| pci rc |=== | pf0 |______/________/ | pf1 |___/_______/ |
| connect | | ------- ------- |
----------- | | controller_num=1 (no eswitch) |
------|--------------------------------------------------
(internal wire)
|
---------------------------------------------------------
| devlink eswitch ports and reps |
| ----------------------------------------------------- |
| |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
| |pf0 | pf0vfN | pf0sfN | pf1 | pf1vfN |pf1sfN | |
| ----------------------------------------------------- |
| |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
| |pf0 | pf0vfN | pf0sfN | pf1 | pf1vfN |pf1sfN | |
| ----------------------------------------------------- |
| |
| |
----------- | --------- --------- ------- ------- |
| smartNIC| | | vf(s) | | sf(s) | |vf(s)| |sf(s)| |
| pci rc |==| ------- ----/---- ---/----- ------- ---/--- ---/--- |
| connect | | | pf0 |______/________/ | pf1 |___/_______/ |
----------- | ------- ------- |
| |
| local controller_num=0 (eswitch) |
---------------------------------------------------------
In the above example, the external controller (identified by controller number = 1)
doesn't have the eswitch. Local controller (identified by controller number = 0)
has the eswitch. The Devlink instance on the local controller has eswitch
devlink ports for both the controllers.
Function configuration
======================
Users can configure one or more function attributes before enumerating the PCI
function. Usually it means, user should configure function attribute
before a bus specific device for the function is created. However, when
SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, function attribute should be configured before binding virtual
function device to the driver. For subfunctions, this means user should
configure port function attribute before activating the port function.
A user may set the hardware address of the function using
`devlink port function set hw_addr` command. For Ethernet port function
this means a MAC address.
Users may also set the RoCE capability of the function using
`devlink port function set roce` command.
Users may also set the function as migratable using
'devlink port function set migratable' command.
Users may also set the IPsec crypto capability of the function using
`devlink port function set ipsec_crypto` command.
Users may also set the IPsec packet capability of the function using
`devlink port function set ipsec_packet` command.
Function attributes
===================
MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
device created for the PCI VF/SF.
- Get the MAC address of the VF identified by its unique devlink port index::
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00
- Set the MAC address of the VF identified by its unique devlink port index::
$ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:11:22:33:44:55
- Get the MAC address of the SF identified by its unique devlink port index::
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
function:
hw_addr 00:00:00:00:00:00
- Set the MAC address of the SF identified by its unique devlink port index::
$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
function:
hw_addr 00:00:00:00:88:88
RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.
When RoCE capability is disabled, it saves system memory per PCI VF/SF.
When user disables RoCE capability for a VF/SF, user application cannot send or
receive any RoCE packets through this VF/SF and RoCE GID table for this PCI
will be empty.
When RoCE capability is disabled in the device using port function attribute,
VF/SF driver cannot override it.
- Get RoCE capability of the VF device::
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 roce enable
- Set RoCE capability of the VF device::
$ devlink port function set pci/0000:06:00.0/2 roce disable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 roce disable
migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.
User who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.
When user enables migratable capability for a VF, and the HV binds the VF to VFIO driver
with migration support, the user can migrate the VM with this VF from one HV to a
different one.
However, when migratable capability is enable, device will disable features which cannot
be migrated. Thus migratable cap can impose limitations on a VF so let the user decide.
Example of LM with migratable function configuration:
- Get migratable capability of the VF device::
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 migratable disable
- Set migratable capability of the VF device::
$ devlink port function set pci/0000:06:00.0/2 migratable enable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 migratable enable
- Bind VF to VFIO driver with migration support::
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
$ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind
Attach VF to the VM.
Start the VM.
Perform live migration.
IPsec crypto capability setup
-----------------------------
When user enables IPsec crypto capability for a VF, user application can offload
XFRM state crypto operation (Encrypt/Decrypt) to this VF.
When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
processed in software by the kernel.
- Get IPsec crypto capability of the VF device::
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 ipsec_crypto disabled
- Set IPsec crypto capability of the VF device::
$ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 ipsec_crypto enabled
IPsec packet capability setup
-----------------------------
When user enables IPsec packet capability for a VF, user application can offload
XFRM state and policy crypto operation (Encrypt/Decrypt) to this VF, as well as
IPsec encapsulation.
When IPsec packet capability is disabled (default) for a VF, the XFRM state and
policy is processed in software by the kernel.
- Get IPsec packet capability of the VF device::
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 ipsec_packet disabled
- Set IPsec packet capability of the VF device::
$ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 ipsec_packet enabled
Subfunction
============
Subfunction is a lightweight function that has a parent PCI function on which
it is deployed. Subfunction is created and deployed in unit of 1. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.
To use a subfunction, 3 steps setup sequence is followed:
1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;
Subfunction management is done using devlink port user interface.
User performs setup on the subfunction management device.
(1) Create
----------
A subfunction is created using a devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to subfunction management driver (devlink ops) and asks
it to create a subfunction devlink port. Driver then instantiates the
subfunction port and any associated objects such as health reporters and
representor netdevice.
(2) Configure
-------------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on devlink side, the e-switch port representor is created,
but the subfunction device itself is not created. A user might use e-switch port
representor to do settings, putting it into bridge, adding TC rules, etc. A user
might as well configure the hardware address (such as MAC address) of the
subfunction while subfunction is inactive.
(3) Deploy
----------
Once a subfunction is configured, user must activate it to use it. Upon
activation, subfunction management driver asks the subfunction management
device to instantiate the subfunction device on particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.
Rate object management
======================
Devlink provides API to manage tx rates of single devlink port or a group.
This is done through rate objects, which can be one of the two types:
``leaf``
Represents a single devlink port; created/destroyed by the driver. Since leaf
have 1to1 mapping to its devlink port, in user space it is referred as
``pci/<bus_addr>/<port_index>``;
``node``
Represents a group of rate objects (leafs and/or nodes); created/deleted by
request from the userspace; initially empty (no rate objects added). In
userspace it is referred as ``pci/<bus_addr>/<node_name>``, where
``node_name`` can be any identifier, except decimal number, to avoid
collisions with leafs.
API allows to configure following rate object's parameters:
``tx_share``
Minimum TX rate value shared among all other rate objects, or rate objects
that parts of the parent group, if it is a part of the same group.
``tx_max``
Maximum TX rate value.
``tx_priority``
Allows for usage of strict priority arbiter among siblings. This
arbitration scheme attempts to schedule nodes based on their priority
as long as the nodes remain within their bandwidth limit. The higher the
priority the higher the probability that the node will get selected for
scheduling.
``tx_weight``
Allows for usage of Weighted Fair Queuing arbitration scheme among
siblings. This arbitration scheme can be used simultaneously with the
strict priority. As a node is configured with a higher rate it gets more
BW relative to it's siblings. Values are relative like a percentage
points, they basically tell how much BW should node take relative to
it's siblings.
``parent``
Parent node name. Parent node rate limits are considered as additional limits
to all node children limits. ``tx_max`` is an upper limit for children.
``tx_share`` is a total bandwidth distributed among children.
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.
Arbitration flow from the high level:
#. Choose a node, or group of nodes with the highest priority that stays
within the BW limit and are not blocked. Use ``tx_priority`` as a
parameter for this arbitration.
#. If group of nodes have the same priority perform WFQ arbitration on
that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
#. Select the winner node, and continue arbitration flow among it's children,
until leaf node is reached, and the winner is established.
#. If all the nodes from the highest priority sub-group are satisfied, or
overused their assigned BW, move to the lower priority nodes.
Driver implementations are allowed to support both or either rate object types
and setting methods of their parameters. Additionally driver implementation
may export nodes/leafs and their child-parent relationships.
Terms and Definitions
=====================
.. list-table:: Terms and Definitions
:widths: 22 90
* - Term
- Definitions
* - ``PCI device``
- A physical PCI device having one or more PCI buses consists of one or
more PCI controllers.
* - ``PCI controller``
- A controller consists of potentially multiple physical functions,
virtual functions and subfunctions.
* - ``Port function``
- An object to manage the function of a port.
* - ``Subfunction``
- A lightweight function that has parent PCI function on which it is
deployed.
* - ``Subfunction device``
- A bus device of the subfunction, usually on a auxiliary bus.
* - ``Subfunction driver``
- A device driver for the subfunction auxiliary device.
* - ``Subfunction management device``
- A PCI physical function that supports subfunction management.
* - ``Subfunction management driver``
- A device driver for PCI physical function that supports
subfunction management using devlink port interface.
* - ``Subfunction host driver``
- A device driver for PCI physical function that hosts subfunction
devices. In most cases it is same as subfunction management driver. When
subfunction is used on external controller, subfunction management and
host drivers are different.
|