|
Currently, the driver uses two different values for the maximum MTU: one
is stored in mlxsw_port->dev->max_mtu and the second is stored in
mlxsw_port->max_mtu. The second one is set to a value which is queried
from firmware. This value was never tested, and unfortunately is not
really supported. That means that with the existing code, a user can set
the MTU to X, which is not really supported by firmware and which is
bigger than the buffer size which is allocated in PCI.
To make the driver consistent, use only mlxsw_port->dev->max_mtu as the
maximum MTU value; for buffer headroom, add the Ethernet frame headers,
which are not included in mlxsw_port->dev->max_mtu. Remove
mlxsw_port->max_mtu.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/89fa6f804386b918d337e736e14ac291bb947483.1718275854.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, the driver uses ETH_MAX_MTU as the maximum MTU of netdevices.
Instead, use the accurate value which is supported by the driver.
Subtract the Ethernet headers which are taken into account by hardware
for MTU checking, as described in the previous patch.
Set the minimum MTU to ETH_MIN_MTU, as a zero MTU is not really supported.
With this change:
a. The stack will do the MTU checking, so we can remove it from the driver.
b. User space will be able to query the actual MTU limits.
Before this patch:
$ ip -j -d link show dev swp1 | jq | grep mtu
"mtu": 1500,
"min_mtu": 0,
"max_mtu": 65535,
With this patch:
$ ip -j -d link show dev swp1 | jq | grep mtu
"mtu": 1500,
"min_mtu": 68,
"max_mtu": 10218,
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/be8232e38c196ecb607f82c5e000ea427ce22abb.1718275854.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
An Ethernet frame consists of an Ethernet header, payload and FCS. The
MTU value used by the user is the size of the payload, which means that
when the user sets the MTU to X, the total frame size is larger due to
the addition of the Ethernet header and FCS.
Spectrum ASICs take the Ethernet header and FCS into account as part of
the packet size for the MTU check. Adjust the MTU value when the user
sets it, to configure the MTU size which is required by hardware. The Tx
header length which was used by the driver is not relevant for this
calculation; take into account the Ethernet header (with VLAN extension)
and FCS.
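A minimal sketch of the adjustment, with a hypothetical helper name (the
constants come from <linux/if_ether.h> and <linux/if_vlan.h>):

/* Frame size the ASIC checks against for a given user MTU; illustrative. */
static u16 mlxsw_sp_port_mtu_to_frame_size(int mtu)
{
	return mtu + ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN;	/* 1500 -> 1522 */
}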
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/f3203c2477bb8ed18b1e79642fa3e3713e1e55bb.1718275854.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Simon reported that ndo_change_mtu() methods were never
updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted
in commit 501a90c94510 ("inet: protect against too small
mtu values.")
We read dev->mtu without holding RTNL in many places,
with READ_ONCE() annotations.
It is time to convert the ndo_change_mtu() methods to use the
corresponding WRITE_ONCE().
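The resulting pattern, sketched for a generic driver (the WRITE_ONCE()
is the point; the hardware programming step is illustrative):

static int foo_change_mtu(struct net_device *dev, int new_mtu)
{
	/* ... program new_mtu into hardware first ... */
	WRITE_ONCE(dev->mtu, new_mtu);	/* pairs with READ_ONCE(dev->mtu) readers */
	return 0;
}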
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For a report_delta-like interface, such as the one a previous patch
added for collection of NH group statistics, it is easiest to read the
counter and have the HW clear it right away. Thus, change
mlxsw_sp_flow_counter_get() to take a bool indicating whether this
should be done.
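Sketch of the resulting prototype (parameter names illustrative); when
'clear' is true, the register read asks the HW to zero the counter as
part of the same operation:

int mlxsw_sp_flow_counter_get(struct mlxsw_sp *mlxsw_sp,
			      unsigned int counter_index, bool clear,
			      u64 *p_packets, u64 *p_bytes);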
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/6a096ede8ee92d5041e3832242c3bbc137198aba.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
mlxsw_sp stores an array of LAGs. When a port joins a LAG, in case this
LAG is already in use, we only have to increase the reference counter.
Otherwise, we have to search for an unused LAG ID and configure it in
hardware. When a port leaves a LAG, we have to destroy it only for the
last user. This code can be simplified; for such requirements we usually
add get() and put() functions which create and destroy the object.
Add mlxsw_sp_lag_{get,put}() and use them. These functions take care of
the reference counter and hardware configuration if needed. Change the
reference counter to refcount_t type which catches overflow and underflow
issues.
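A minimal sketch of the pattern, with the lookup, create and destroy
helpers left hypothetical:

static struct mlxsw_sp_lag *
mlxsw_sp_lag_get(struct mlxsw_sp *mlxsw_sp, struct net_device *lag_dev)
{
	struct mlxsw_sp_lag *lag;

	lag = mlxsw_sp_lag_find(mlxsw_sp, lag_dev);	/* hypothetical lookup */
	if (lag) {
		refcount_inc(&lag->ref_count);		/* LAG already in use */
		return lag;
	}
	return mlxsw_sp_lag_create(mlxsw_sp, lag_dev);	/* allocate ID, program HW */
}

static void mlxsw_sp_lag_put(struct mlxsw_sp *mlxsw_sp, struct mlxsw_sp_lag *lag)
{
	if (refcount_dec_and_test(&lag->ref_count))
		mlxsw_sp_lag_destroy(mlxsw_sp, lag);	/* last user: tear down HW */
}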
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently, the function mlxsw_sp_lag_index_get() is called twice: first
as part of the NETDEV_PRECHANGEUPPER event and later as part of
NETDEV_CHANGEUPPER. This function will be changed in the next patch. To
simplify the code, call it only once as part of NETDEV_CHANGEUPPER
event and set an error message using 'extack' in case of failure.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The maximum number of LAGs is queried from the core several times. It is
used to allocate the LAG array, and then to iterate over it. In addition, it
is used for PGT initialization. To simplify the code, instead of
querying it several times, store the value as part of 'mlxsw_sp' and use
it.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
A next patch will add mlxsw_sp_lag_{get,put}() functions to handle LAG
reference counting and to create/destroy the LAG only for the first/last
user. Remove the existing mlxsw_sp_lag_get() function and access the LAG
array directly.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The structure mlxsw_sp_upper is used only as a LAG. Rename it to
mlxsw_sp_lag and move it to the spectrum.c file, as it is used only
there. Move the function mlxsw_sp_lag_get() with the structure.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Mark all Spectrum>2 systems as preferring CFF flood mode if supported by
the firmware.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/8a3d2ad96b943f7e3f53f998bd333a14e19cd641.1701183892.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In the CFF flood mode, the driver has to allocate a table within PGT, which
holds flood vectors for router subport FIDs. For LAGs, these flood vectors
obviously have to be maintained dynamically as port membership in a LAG
changes. But even for physical ports, the flood vectors have to be kept
valid, and may not contain enabled bits corresponding to non-existent
ports. It is therefore not possible to precompute the port part of the RSP
table; it has to be maintained as ports come and go due to splits.
To support the RSP table maintenance, add two new ops to the FID ops:
fid_port_init and fid_port_fini, for when a port comes into existence or
joins a LAG, and vice versa. Invoke these ops from
mlxsw_sp_port_fids_init() and mlxsw_sp_port_fids_fini(), which are called
when a port is added and removed, respectively. Also add two new hooks for
LAG maintenance, mlxsw_sp_fid_port_join_lag() / _leave_lag(), which
transitively call into the same ops.
Later patches will add the op implementations themselves; this one just
adds the scaffolding.
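Roughly, the per-family ops grow two hooks; the op names are from the
text above, the signatures are assumed for illustration:

struct mlxsw_sp_fid_ops {
	/* ... existing ops ... */
	int (*fid_port_init)(struct mlxsw_sp_fid_family *fid_family,
			     const struct mlxsw_sp_port *mlxsw_sp_port);
	void (*fid_port_fini)(struct mlxsw_sp_fid_family *fid_family,
			      const struct mlxsw_sp_port *mlxsw_sp_port);
};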
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/234398a23540317abb25f74f920a5c8121faecf0.1701183892.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, mlxsw always uses a "controlled" flood mode on all NVIDIA
Spectrum generations. The following patches will however introduce the
possibility to run a "CFF" (Compressed FID Flooding) mode on newer
machines, if the FW supports it.
Several operations differ between how they need to be done in controlled
mode vs. CFF mode. The per-FID-family ops will thus differ between
controlled and CFF, and the FID family array as such will differ
depending on whether the mode negotiated with the FW is controlled or
CFF.
The simple approach of having several globally visible arrays for
spectrum.c to statically choose from no longer works. Instead, privatize
all FID initialization and finalization logic, and expose it as ops.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/d3fa390d97cf3dbd2f7a28741be69b311e2059e4.1701183891.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
On Spectrum-2, Spectrum-3 and Spectrum-4 machines, request SW
responsibility for placement of the LAG table.
On Spectrum-1, some FW versions claim to support the lag_mode field
despite quietly ignoring any settings made to it. Thus refrain from
attempting to configure lag_mode on those systems at all.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In this patch, if the LAG mode is SW, allocate the LAG table and configure
SGCR to indicate where it was allocated.
We use the default "DDD" (dynamic data duplication) layout of the LAG
table. In DDD mode, the membership information for each LAG is copied
into 8 PGT entries. This is done for performance reasons. The LAG table
then needs to be allocated at an address aligned to 8. Deal with this by
moving the LAG init ahead so that the LAG table is allocated at address 0.
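A worked example of the sizing, assuming a maximum of 128 LAGs (the
define name is hypothetical):

/* DDD: each LAG's membership is replicated in 8 consecutive PGT entries,
 * so the table must start at an 8-aligned PGT address; allocating it
 * first pins it at address 0.
 */
#define MLXSW_SP_LAG_PGT_ENTRIES_PER_LAG	8	/* hypothetical */
u32 lag_pgt_size = 128 * MLXSW_SP_LAG_PGT_ENTRIES_PER_LAG; /* 1024 entries at base 0 */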
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As of commit 151b89f6025a ("mlxsw: spectrum_router: Reuse work neighbor
initialization in work scheduler"), the functions
mlxsw_sp_port_lower_dev_hold() and mlxsw_sp_port_dev_put() have no users.
Drop them.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/d0adcd7cb4ea19416294a0f861100edba84c9f36.1690471774.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Enslaving of front panel ports (and their uppers) to netdevices that
already have uppers is currently forbidden. In the previous patches, a
number of replays have been added. Those ensure that various bits of state,
such as next hops or switchdev objects, are offloaded when they become
relevant due to a mlxsw lower being introduced into the topology.
However, the act of actually enslaving, for example, a front-panel port
to a bridge with uppers has been vetoed so far. In this patch, remove
the vetoes and permit the operation.
mlxsw currently validates creation of "interesting" uppers. Thus creating
VLAN netdevices on top of 802.1ad bridges is forbidden if the bridge has an
mlxsw lower, but permitted in general. This validation code never gets run
when a port is introduced as a lower of an existing netdevice structure.
Thus when enslaving an mlxsw netdevice to netdevices with uppers, invoke
the PRECHANGEUPPER event handler for each netdevice above the one that the
front panel port is being enslaved to. This way the tower of netdevices
above the attachment point is validated.
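Sketched with the in-tree upper-walking helper; the validation callback
name is hypothetical and the walk runs under rcu_read_lock():

static int mlxsw_sp_validate_upper_cb(struct net_device *upper_dev,
				      struct netdev_nested_priv *priv)
{
	/* Re-run PRECHANGEUPPER validation as if upper_dev were being
	 * linked right now; an error vetoes the enslavement via extack.
	 */
	return mlxsw_sp_netdevice_validate_upper(priv->data, upper_dev);
}

/* in the event handler, under RCU: */
err = netdev_walk_all_upper_dev_rcu(master_dev, mlxsw_sp_validate_upper_cb,
				    &priv);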
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a netdevice is removed from a bridge or a LAG, and it has an IP
address, it should join the router and gain a RIF. Do that by replaying
the address addition event on the netdevice.
When handling deslavement of a LAG or its upper from a bridge device, the
replay should be done after all the lowers of the LAG have left the bridge.
Thus these scenarios are handled by passing replay_deslavement of false,
and by invoking, after the lowers have been processed, a new helper,
mlxsw_sp_netdevice_post_lag_event(), which does the per-LAG / -upper
handling, and in particular invokes the replay.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Enslaving of front panel ports (and their uppers) to netdevices that
already have uppers is currently forbidden. When this is permitted, any
uppers with IP addresses need to have the NETDEV_UP inetaddr event
replayed, so that any RIFs are created.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If an IP address is added to a MACVLAN netdevice, the effect is that of
configuring VRRP on the RIF of the netdevice linked to the MACVLAN.
Because the MACVLAN offload is tied to the existence of a RIF at the
linked netdevice, adding a MACVLAN is currently not allowed until a RIF
is present.
If this requirement stays, it will never be possible to attach a first port
into a topology that involves a MACVLAN. Thus topologies would need to be
built in a certain order, which is impractical.
Additionally, IP address removal, which leads to the disappearance of the
RIF that the MACVLAN depends on, cannot be vetoed. Thus even as things
stand now it is possible to get to a state where a MACVLAN netdevice
exists without a RIF, despite having mlxsw lowers. And once the MACVLAN
is un-offloaded due to the RIF getting destroyed, recreating the RIF does
not bring it back.
In this patch, accept that MACVLAN can be created out of order and support
that use case.
One option would seem to be to simply recognize MACVLAN netdevices as
"interesting", and let the existing replay mechanisms take care of the
offload. However, that does not address the necessity to reoffload MACVLAN
once a RIF is created.
Thus add a new replay hook, symmetrical to mlxsw_sp_rif_macvlan_flush(),
called mlxsw_sp_rif_macvlan_replay(), which instead of unwinding the
existing offloads, applies the configuration as if the netdevice were
created just now.
Additionally, remove all vetoes and warning messages that checked for
presence of a RIF at the linked device.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently it never happens that a netdevice that is already a bridge slave
would suddenly become mlxsw upper. The only case where this might be
possible as far as mlxsw is concerned, is with LAG netdevices. But if a LAG
has any upper (e.g. is enslaved), enslaving an mlxsw port to that LAG is
forbidden. Thus the only way to install a LAG between a bridge and an mlxsw
port is by first enslaving the port to the LAG, and then enslaving that LAG
to a bridge. At that point there are no bridge objects (such as port VLANs)
to replay. Those are added afterwards, and notified as they are created.
This holds even for the PVID.
However in the following patches, the requirement that ports be only
enslaved to masters without uppers, is going to be relaxed. It will
therefore be necessary to replay the existing bridge objects. Without this
replay, e.g. the mlxsw bridge_port_vlan objects are not instantiated, which
causes issues later, as a lot of code relies on their presence.
To that end, add a new notifier block whose sole role is to filter the
switchdev events, keeping only those related to the one relevant upper,
and to forward them to the existing switchdev notifier block. Pass the
new notifier block to switchdev_bridge_port_offload() when the bridge
port is created.
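A minimal sketch of such a filtering notifier; the structure, handler
and forwarding target names are hypothetical:

struct mlxsw_sp_bridge_port_nb {
	struct notifier_block nb;
	struct net_device *upper_dev;	/* the one bridge port we care about */
};

static int mlxsw_sp_bridge_port_event(struct notifier_block *nb,
				      unsigned long event, void *ptr)
{
	struct mlxsw_sp_bridge_port_nb *bp_nb =
		container_of(nb, struct mlxsw_sp_bridge_port_nb, nb);
	struct switchdev_notifier_info *info = ptr;

	if (info->dev != bp_nb->upper_dev)
		return NOTIFY_DONE;	/* unrelated netdevice: ignore */
	return mlxsw_sp_switchdev_handle_event(event, ptr); /* existing logic */
}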
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently it never happens that a netdevice that is already a bridge slave
would suddenly become mlxsw upper. The only case where this might be
possible as far as mlxsw is concerned, is with LAG netdevices. But if a LAG
already has an upper, enslaving an mlxsw port to that LAG is forbidden.
Thus the only way to install a LAG between a bridge and an mlxsw port is
by first
enslaving the port to the LAG, and then enslaving that LAG to a bridge.
However in the following patches, the requirement that ports be only
enslaved to masters without uppers, is going to be relaxed. It will
therefore be necessary to join bridges of LAG uppers. Without this replay,
the mlxsw bridge_port objects are not instantiated, which causes issues
later, as a lot of code relies on their presence.
Therefore in this patch, when the first mlxsw physical netdevice is
enslaved to a LAG, consider bridges upper to the LAG (both the direct
master, if any, and any bridge masters of VLAN uppers), and have the
relevant netdevices join their bridges.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When handling deslavement of LAG or its upper from a bridge device, when
the deslaved netdevice has an IP address, it should join the router. This
should be done after all the lowers of the LAG have left the bridge. The
replay intended to cause the device to join the router therefore cannot be
invoked unconditionally in the event handlers themselves. It can be done
right away if the handler is invoked for a sole device, but when it is
invoked repeatedly for each LAG lower, the replay needs to be postponed
until after this processing is done.
To that end, add a boolean parameter, replay_deslavement, to
mlxsw_sp_netdevice_port_upper_event(), mlxsw_sp_netdevice_port_vlan_event()
and one helper on the call path. Have the invocations that are done for
sole netdevices pass true, and those done for LAG lowers pass false.
Nothing depends on this flag at this point, but it removes some noise from
the patch that introduces the replay itself.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently the bridge-related handlers bail out when the event is related to
a netdevice that is not an upper of one of the front-panel ports. In order
to allow enslavement of front-panel ports to bridges that already have
uppers, it will be necessary to replay CHANGEUPPER events to validate that
the configuration is offloadable. In order for the replay to be effective,
it must be possible to ignore unsupported configuration in the context of
an actual notifier event, but to still "veto" these configurations when the
validation is performed.
To that end, introduce two parameters to a number of handlers: mlxsw_sp,
because it will not be possible to deduce that from the netdevice lowers;
and process_foreign to indicate whether netdevices that are not front panel
uppers should be validated.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Move the meat of mlxsw_sp_netdevice_event() to a separate function that
does just the validation. It will later be possible to call this separate
helper for recursive ascent when validating attachment of a front panel
port to a bridge with uppers.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Expose via devlink-resource the maximum number of port range registers
and their current occupancy. Besides the observability benefits, this
resource will be used by subsequent patches for scale and occupancy
tests.
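A sketch of the registration through the generic devlink resource API;
the resource name, ID and occupancy getter are illustrative:

devlink_resource_size_params_init(&size_params, max_regs, max_regs, 1,
				  DEVLINK_RESOURCE_UNIT_ENTRY);
err = devl_resource_register(devlink, "port_range_registers", max_regs,
			     MLXSW_SP_RESOURCE_PORT_RANGE_REGISTERS,
			     DEVLINK_RESOURCE_ID_PARENT_TOP, &size_params);
if (err)
	return err;
devl_resource_occ_get_register(devlink, MLXSW_SP_RESOURCE_PORT_RANGE_REGISTERS,
			       mlxsw_sp_port_range_reg_occ_get, mlxsw_sp);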
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/7945e0c715dc5efb1617f45f7560c1f1bd0bcf8a.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The Spectrum ASICs have a fixed number of port range registers, each of
which maintains the following parameters:
* Minimum and maximum port.
* Apply port range for source port, destination port or both.
* Apply port range for TCP, UDP or both.
* Apply port range for IPv4, IPv6 or both.
Implement a port range core which takes care of the allocation and
configuration of these registers and exposes an API that allows
in-driver consumers (e.g., the ACL code) to request matching on a range
of either source or destination port.
These registers are going to be used for port range matching in the
flower classifier that already matches on EtherType being IPv4 / IPv6 and
IP protocol being TCP / UDP. As such, there is no need to limit these
registers to a specific EtherType or IP protocol, which will increase
the likelihood of a register being shared by multiple flower filters.
It is unlikely that a filter will match on the same range of both source
and destination ports, which is why each register is only configured to
match on either source or destination port. If a filter requires
matching on a range of both source and destination ports, it will
utilize two port range registers and match on the output of both.
For efficient lookup and traversal, use an XArray to store the allocated
port range registers. The XArray uses RCU and an internal spinlock to
synchronise access, so there is no need for a dedicated lock.
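A sketch of the allocation path: xa_alloc() both finds a free register
index within the HW limit and publishes the entry (names illustrative):

struct mlxsw_sp_port_range_reg *prr;
int err;

prr = kzalloc(sizeof(*prr), GFP_KERNEL);
if (!prr)
	return ERR_PTR(-ENOMEM);
err = xa_alloc(&pr_core->prr_xa, &prr->index, prr,
	       XA_LIMIT(0, max_regs - 1), GFP_KERNEL);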
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/674f00539a0072d455847663b5feb504db51a259.1689092769.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, joining a LAG very simply means that the LAG RIF should be
joined by the subport representing untagged traffic. If the RIF does not
exist, it does not have to be created: if the user wants there to be a
RIF for the LAG device, they are supposed to add an IP address, and to do
so after the LAG becomes an mlxsw upper.
We can also assume that the LAG has no uppers, otherwise the enslavement is
not allowed.
In the future, these ordering dependencies should be removed. That means
that joining a LAG will be a more complex operation, possibly involving
lazy RIF creation, and possibly joining / lazily creating RIFs for VLAN
uppers of the LAG. It will be handy to have a dedicated function that
handles all this. The new function mlxsw_sp_router_port_join_lag() is
that function.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The validation logic is already in the router code. Move there the notifier
blocks themselves as well.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Spectrum ASICs have a configurable limit on how deep into the packet
they parse. By default, the limit is 96 bytes.
There are several cases where this parsing depth is not enough and there
is a need to increase it. For example, timestamping of PTP packets and a
FIB multipath hash policy that requires hashing on inner fields. The
driver therefore maintains a reference count that reflects the number of
consumers that require an increased parsing depth.
During reload_down() the parsing depth reference count does not
necessarily drop to zero, but the parsing depth itself is restored to
the default during reload_up() when the firmware is reset. It is
therefore possible to end up in situations where the driver thinks that
the parsing depth was increased (reference count is non-zero), when it
is not.
Fix by making sure that all the consumers that increase the parsing
depth reference count also decrease it during reload_down().
Specifically, make sure that when the routing code is de-initialized it
drops the reference count if it was increased because of a FIB multipath
hash policy that requires hashing on inner fields.
Add a warning if the reference count is not zero after the driver was
de-initialized and explicitly reset it to zero during initialization for
good measure.
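Sketch of the defensive checks, assuming the count is kept in a
refcount_t-style field (the field name is illustrative):

/* init: explicitly start from zero for good measure */
refcount_set(&mlxsw_sp->parsing.parsing_depth_ref, 0);

/* fini: every consumer should have dropped its reference by now */
WARN_ON(refcount_read(&mlxsw_sp->parsing.parsing_depth_ref));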
Fixes: 2d91f0803b84 ("mlxsw: spectrum: Add infrastructure for parsing configuration")
Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/9c35e1b3e6c1d8f319a2449d14e2b86373f3b3ba.1678727526.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cited commit added 'DEVLINK_CMD_PARAM_DEL' notifications whenever the
network namespace of the devlink instance is changed. Specifically, the
notifications are generated after calling reload_down(), but before
calling reload_up(). At this stage, the data structures accessed while
reading the value of the "acl_region_rehash_interval" devlink parameter
are uninitialized, resulting in a use-after-free [1].
Fix by moving the registration and unregistration of the devlink
parameter to the TCAM code where it is actually used. This means that
the parameter is unregistered during reload_down() and then
re-registered during reload_up(), avoiding the use-after-free between
these two operations.
Reproducer:
# ip netns add test123
# devlink dev reload pci/0000:06:00.0 netns test123
[1]
BUG: KASAN: use-after-free in mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xb2/0xd0
Read of size 4 at addr ffff888162fd37d8 by task devlink/1323
[...]
Call Trace:
<TASK>
dump_stack_lvl+0x95/0xbd
print_report+0x181/0x4a1
kasan_report+0xdb/0x200
mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xb2/0xd0
mlxsw_sp_params_acl_region_rehash_intrvl_get+0x32/0x80
devlink_nl_param_fill.constprop.0+0x29a/0x11e0
devlink_param_notify.constprop.0+0xb9/0x250
devlink_notify_unregister+0xbc/0x470
devlink_reload+0x1aa/0x440
devlink_nl_cmd_reload+0x559/0x11b0
genl_family_rcv_msg_doit.isra.0+0x1f8/0x2e0
genl_rcv_msg+0x558/0x7f0
netlink_rcv_skb+0x170/0x440
genl_rcv+0x2d/0x40
netlink_unicast+0x53f/0x810
netlink_sendmsg+0x961/0xe80
__sys_sendto+0x2a4/0x420
__x64_sys_sendto+0xe5/0x1c0
do_syscall_64+0x38/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: 7d7e9169a3ec ("devlink: move devlink reload notifications back in between _down() and _up() calls")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The "acl_region_rehash_interval" devlink parameter is a "runtime"
parameter, making the call to devl_param_driverinit_value_set()
pointless. Before cited commit the function simply returned an error
(that was not checked), but now it emits a WARNING [1].
Fix by removing the function call.
[1]
WARNING: CPU: 0 PID: 7 at net/devlink/leftover.c:10974
devl_param_driverinit_value_set+0x8c/0x90
[...]
Call Trace:
<TASK>
mlxsw_sp2_params_register+0x83/0xb0 [mlxsw_spectrum]
__mlxsw_core_bus_device_register+0x5e5/0x990 [mlxsw_core]
mlxsw_core_bus_device_register+0x42/0x60 [mlxsw_core]
mlxsw_pci_probe+0x1f0/0x230 [mlxsw_pci]
local_pci_probe+0x1a/0x40
work_for_cpu_fn+0xf/0x20
process_one_work+0x1db/0x390
worker_thread+0x1d5/0x3b0
kthread+0xe5/0x110
ret_from_fork+0x1f/0x30
</TASK>
Fixes: 85fe0b324c83 ("devlink: make devlink_param_driverinit_value_set() return void")
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 1d18bb1a4ddd ("devlink: allow registering parameters after
the instance"), as the subject implies, introduced the possibility to
register devlink params even for an already registered devlink instance.
This is a bit problematic, as the consistency of the params list was
originally secured by the fact that it is static during the devlink
lifetime. So in order to protect the params list, take the devlink
instance lock during the params operations. Introduce unlocked function
variants and use them in drivers in locked context. Put lock assertions
in appropriate places.
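The naming follows the devl_/devlink_ convention used elsewhere in
devlink: the devl_ variant asserts the instance lock is held, while the
devlink_ one takes it internally (signatures abridged):

/* caller already holds devlink->lock, e.g. in init/fini flows */
int devl_params_register(struct devlink *devlink,
			 const struct devlink_param *params,
			 size_t params_count);

/* takes and releases devlink->lock internally */
int devlink_params_register(struct devlink *devlink,
			    const struct devlink_param *params,
			    size_t params_count);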
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
String TLV is not supported by old firmware versions, therefore
'struct mlxsw_core' stores the field 'emad.enable_string_tlv', which is
set to true only after a firmware version check.
Instead of assuming that a firmware version check is enough to enable
string TLV, a better solution is to query whether this TLV is supported
via the MGIR register. Add such a query and initialize
'emad.enable_string_tlv' accordingly.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add locked bridge port support by reacting to changes in the
'BR_PORT_LOCKED' flag. When set, enable security checks on the local
port via the previously added SPFSR register.
When security checks are enabled, an incoming packet will trigger an FDB
lookup with the packet's source MAC and the FID it was classified to. If
an FDB entry was not found or was found to be pointing to a different
port, the packet will be dropped. Such packets increment the
"discard_ingress_general" ethtool counter. For added visibility, user
space can trap such packets to the CPU by enabling the "locked_port"
trap. Example:
# devlink trap set pci/0000:06:00.0 trap locked_port action trap
Unlike other configurations done via bridge port flags (e.g., learning,
flooding), security checks are enabled in the device on a per-port basis
and not on a per-{port, VLAN} basis. As such, scenarios where user space
can configure different locking settings for different VLANs configured
on a port need to be vetoed. To that end, veto the following scenarios:
1. Locking is set on a bridge port that is a VLAN upper
2. Locking is set on a bridge port that has VLAN uppers
3. VLAN upper is configured on a locked bridge port
Examples:
# bridge link set dev swp1.10 locked on
Error: mlxsw_spectrum: Locked flag cannot be set on a VLAN upper.
# ip link add link swp1 name swp1.10 type vlan id 10
# bridge link set dev swp1 locked on
Error: mlxsw_spectrum: Locked flag cannot be set on a bridge port that has VLAN uppers.
# bridge link set dev swp1 locked on
# ip link add link swp1 name swp1.10 type vlan id 10
Error: mlxsw_spectrum: VLAN uppers are not supported on a locked port.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add an API to enable or disable security checks on a local port. It will
be used by subsequent patches when the 'BR_PORT_LOCKED' flag is toggled.
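A sketch of such an API on top of SPFSR; the packing helper's argument
list is assumed and error handling is elided:

int mlxsw_sp_port_security_set(struct mlxsw_sp_port *mlxsw_sp_port, bool enable)
{
	char spfsr_pl[MLXSW_REG_SPFSR_LEN];

	mlxsw_reg_spfsr_pack(spfsr_pl, mlxsw_sp_port->local_port, enable);
	return mlxsw_reg_write(mlxsw_sp_port->mlxsw_sp->core, MLXSW_REG(spfsr),
			       spfsr_pl);
}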
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove ndo_get_devlink_port, which is no longer used, along with its
implementations in drivers.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Benefit from the previously implemented tracking of netdev events in
devlink code: instead of calling devlink_port_type_eth_set() and
devlink_port_type_clear() to set the devlink port type and link it to the
related netdev, use the SET_NETDEV_DEVLINK_PORT() macro to assign the
devlink_port pointer to the netdevice which is about to be registered.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that the 32bit UP oddity is gone and 32bit always uses a sequence
count, there is no need for the fetch_irq() variants anymore.
Convert to the regular interface.
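The conversion is mechanical; a typical stats read loop before and after
(the stats layout is illustrative):

/* before */
start = u64_stats_fetch_begin_irq(&stats->syncp);
/* after: 32bit also uses a seqcount now, so the plain variant suffices */
do {
	start = u64_stats_fetch_begin(&stats->syncp);
	packets = u64_stats_read(&stats->rx_packets);
} while (u64_stats_fetch_retry(&stats->syncp, start));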
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Starting from Spectrum-4, the maximum number of LAG IDs can be configured
by software via the CONFIG_PROFILE command during driver initialization.
Add a dedicated instance of 'struct mlxsw_config_profile' for Spectrum-4
and set the 'max_lag' field to 128, which is the same number of LAG
entries as in Spectrum-{2,3}. Without this configuration, firmware
reserves 256 entries (the value of the 'cap_max_lag' resource) at the
beginning of the PGT table for LAG identifiers, which means that fewer
PGT entries will be available.
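Sketch of the dedicated profile; only the new fields are shown and the
rest is assumed to match the Spectrum-2/3 profile:

static const struct mlxsw_config_profile mlxsw_sp4_config_profile = {
	.used_max_lag	= 1,
	.max_lag	= 128,	/* same number of LAG entries as Spectrum-{2,3} */
	/* ... */
};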
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, the driver queries the maximum supported LAG ID from firmware.
This will not be accurate anymore once the driver configures 'max_lag'
via the CONFIG_PROFILE command.
For a resource query, firmware returns the maximum LAG ID which is
supported by hardware. Software can configure firmware not to allocate
entries for all the supported LAGs, and to limit LAG IDs. In this case,
the resource query will not return the actual maximum LAG ID.
Add a helper function for getting this value. In case the 'max_lag' field
was set during initialization, return the value which was used;
otherwise, query firmware for the maximum supported ID.
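The helper could look roughly as follows; the body is illustrative and
the resource macros are the driver's existing accessors:

int mlxsw_core_max_lag(struct mlxsw_core *mlxsw_core, u16 *p_max_lag)
{
	struct mlxsw_driver *driver = mlxsw_core->driver;

	if (driver->profile->used_max_lag) {
		*p_max_lag = driver->profile->max_lag;	/* set via CONFIG_PROFILE */
		return 0;
	}

	if (!MLXSW_CORE_RES_VALID(mlxsw_core, MAX_LAG))
		return -EIO;
	*p_max_lag = MLXSW_CORE_RES_GET(mlxsw_core, MAX_LAG);
	return 0;
}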
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, as part of removing a port, the PTP API is called to clear the
existing configuration and set the 'rx_filter' and 'tx_type' to zero.
The clearing is done before unregistering the netdevice, which means that
there is a window of time in which the user can reconfigure PTP on the
port, and this configuration will not be cleared.
Reorder the operations: clear the PTP configuration after unregistering
the netdevice.
Fixes: 8748642751ede ("mlxsw: spectrum: PTP: Support SIOCGHWTSTAMP, SIOCSHWTSTAMP ioctls")
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In Spectrum-2 and Spectrum-3, the correction field of PTP packets which
are sent as control packets is not updated at the egress port. To
overcome this limitation, PTP packets which require a time stamp should
be sent as data packets with the following details:
1. FID valid = 1
2. FID value above the maximum FID
3. rx_router_port = 1
From Spectrum-4 onwards, this limitation is solved.
Extend the function which handles the Tx header so that, in case the
packet is a PTP packet, it adds a Tx header with type=data and all the
above-mentioned requirements. Add an operation as part of
'struct mlxsw_sp_ptp_ops', to be able to separate the handling of PTP
packets between different ASICs. Use the data packet solution only for
Spectrum-2 and Spectrum-3. Therefore, add a dedicated operation structure
for Spectrum-4, as it will be the same as Spectrum-2 in its PTP
implementation, just without the limitation of control packets.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, Tx completions are reported using Completion Queue Element
version 1 (CQEv1). These elements do not contain the Tx time stamp,
which is fine as Spectrum-1 reads Tx time stamps via a dedicated FIFO
and Spectrum-2 does not currently support PTP.
In preparation for Spectrum-2 PTP support, use CQEv2 for Spectrum-2 and
newer ASICs, as this CQE format encodes the Tx time stamp.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rename the 'read_frc_capable' bit to 'read_clock_capable', since it can
now refer to both the FRC and UTC clocks.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, the FID field in the Tx header is defined, but is not used, as
it is relevant only for data packets. The mlxsw driver currently sends
all host-generated traffic as control packets and not as data packets.
In Spectrum-2 and Spectrum-3, the correction field of PTP packets which
are sent as control packets is not updated at egress port. To overcome this
limitation while adding support for PTP, some packets will be sent as data
packets.
Fix the wrong shift in the definition, to allow using the field later.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The type of time stamp field in the CQE is configured via the
CONFIG_PROFILE command during driver initialization. Add the definition
of the relevant fields to the command's payload and set the type to UTC
for Spectrum-2 and above. This configuration can be done as part of the
preparations to PTP support, as the type of the time stamp will not break
any existing behavior.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Prepare for devlink reload being called with devlink->lock held and
convert the mlxsw driver to use the unlocked devlink API during the init
and fini flows. Take devl_lock() in the reload_down() and reload_up()
ops in the meantime, before the reload command is converted to take the
lock itself.
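In the interim, the reload ops take the lock themselves; a sketch of
reload_down() with an illustrative body:

static int
mlxsw_devlink_core_bus_device_reload_down(struct devlink *devlink,
					  bool netns_change,
					  enum devlink_reload_action action,
					  enum devlink_reload_limit limit,
					  struct netlink_ext_ack *extack)
{
	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);

	devl_lock(devlink);
	mlxsw_core_bus_device_unregister(mlxsw_core, true);	/* unlocked fini flow */
	devl_unlock(devlink);
	return 0;
}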
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After all the preparations for the unified bridge model, finally flip the
mlxsw driver to use the new model.
Change config profile, set 'ubridge' to true and remove the configurations
that are relevant only for the legacy model. Set 'flood_mode' to
'controlled' as the current mode is not supported with unified bridge
model.
Remove all the code which is dedicated to the legacy model. Remove
'struct mlxsw_sp.ubridge' variable which was temporarily added to separate
configurations between the models.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Initialize the PGT table as part of mlxsw_sp_init(). This table will
first be used in the next patch by the FID code to set flooding entries,
and later by the MDB code to add multicast entries.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|