| Age | Commit message | Author | Files | Lines |
|
Pull rdma updates from Jason Gunthorpe:
"Many AI driven bug fixes, and several big driver API cleanups
- Driver bug fixes and minor cleanups in mlx5, hns, rxe, efa, siw,
rtrs, mana, irdma, mlx4. Commonly error path flows, integer
arithmetic overflows on unsafe data, out of bounds access, and use
after free issues under races.
- Second half of the new udata API for drivers focusing on uAPI
response
- bnxt_re supports more options for QP creation that will allow a dv
path in rdma-core
- Untangle the module dependencies so drivers don't link to
ib_uverbs.ko as was originally intended
- Provide a new way to handle umems with a consistent simplified uAPI
and update several drivers to use it. This brings dmabuf support to
more places and more drivers
- Support for mlx5 rate limit and packet pacing for UD and UC
- A batch of fixes for the new shared FRMR pools infrastructure"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (148 commits)
RDMA/irdma: Replace waitqueue and flag with completion
RDMA/hns: Fix memory leak of bonding resources
RDMA/rtrs-srv: Bound RDMA-Write length to chunk size in rdma_write_sg
docs: infiniband: correct name of option to enable the ib_uverbs module
RDMA/bnxt_re: Reject GET_TOGGLE_MEM when toggle page was not allocated
RDMA/bnxt_re: Fail DBR related page allocation UAPIs if the feature is disabled
RDMA/bnxt_re: Avoid repeated requests to allocate WC pages
RDMA/bnxt_re: Proper rollback if the ioremap fails
RDMA/bnxt_re: Add a max slot check for SQ
RDMA/bnxt_re: Avoid displaying the kernel pointer
RDMA/bnxt_re: Free CQ toggle page after firmware teardown
RDMA/bnxt_re: Free SRQ toggle page after firmware teardown
RDMA/bnxt_re: Initialize dpi variable to zero
ABI: sysfs-class-infiniband: minor cleanup
RDMA/mlx5: Release the HW‑provided UAR index rather than the SW one
RDMA/mlx5: Fix undefined shift of user RQ WQE size
RDMA/mlx5: Remove raw RSS QP restrack tracking
RDMA/mlx5: Remove DCT restrack tracking
RDMA/mlx5: Drop FRMR pool handle on UMR revoke failure
RDMA/core: Add ib_frmr_pool_drop for unrecoverable handles
...
|
|
Cross-merge networking fixes after downstream PR (net-7.1-rc8).
Conflicts:
drivers/net/ethernet/wangxun/txgbe/txgbe_aml.c
f67aead16e85 ("net: txgbe: rework service event handling")
57d39faed4c9 ("net: txgbe: improve functions of AML 40G devices")
net/rds/info.c
512db8267b73 ("rds: mark snapshot pages dirty in rds_info_getsockopt()")
6e94eeb2a2a6 ("rds: convert to getsockopt_iter")
Adjacent changes:
include/net/sock.h
1ee90b77b727 ("net: guard timestamp cmsgs to real error queue skbs")
f0de88303d5e ("net: make is_skb_wmem() available to modules")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rate limiting is currently supported only for raw packet QPs, where
the packet pacing index is programmed into the SQC during SQ modify.
Extend rate limit support to UD and UC QPs by setting the pacing
index in the QPC during RTR2RTS and RTS2RTS transitions.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260524-packet-pacing-v1-3-3d79439f8d08@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add the needed capabilities in mlx5_ifc to support packet pacing for UC
and UD QPs.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260524-packet-pacing-v1-1-3d79439f8d08@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
mlx5_query_nic_vport_mac_list() sizes its firmware command buffer using
the PF's log_max_current_uc/mc_list capabilities. When querying a VF
vport with a larger configured max (via devlink), the firmware response
can overflow this buffer:
BUG: KASAN: slab-out-of-bounds in mlx5_query_nic_vport_mac_list+0x453/0x4c0 [mlx5_core]
Read of size 4 at addr ff1100013ffc8a12 by task kworker/u96:2/385
CPU: 12 UID: 0 PID: 385 Comm: kworker/u96:2 Not tainted 7.0.0-rc6+ #1 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)
Workqueue: mlx5_esw_wq esw_vport_change_handler [mlx5_core]
Call Trace:
<TASK>
dump_stack_lvl+0x69/0xa0
print_report+0x176/0x4e4
kasan_report+0xc8/0x100
mlx5_query_nic_vport_mac_list+0x453/0x4c0 [mlx5_core]
esw_update_vport_addr_list+0x2e3/0xda0 [mlx5_core]
esw_vport_change_handle_locked+0xa1f/0x1060 [mlx5_core]
esw_vport_change_handler+0x6a/0x90 [mlx5_core]
process_one_work+0x87f/0x15e0
worker_thread+0x62b/0x1020
kthread+0x375/0x490
ret_from_fork+0x4dc/0x810
ret_from_fork_asm+0x11/0x20
</TASK>
Fix by querying the vport's own HCA caps to size the buffer correctly.
Refactor the function to allocate and return the MAC list internally,
removing the caller's dependency on knowing the correct max.
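
A minimal sketch of the sizing problem described above, using plain integers in place of the real capability queries. The helper names (`mac_list_buf_entries`, `response_fits`) and the example values are hypothetical and are not taken from the driver.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of MAC entries a buffer can hold when it is
 * sized from a log2 capability such as log_max_current_uc_list.
 */
static size_t mac_list_buf_entries(unsigned int log_max_list)
{
	return (size_t)1 << log_max_list;
}

/*
 * Hypothetical check: does a firmware response carrying resp_entries
 * entries fit in a buffer sized from the given capability?
 */
static bool response_fits(unsigned int log_max_list, size_t resp_entries)
{
	return resp_entries <= mac_list_buf_entries(log_max_list);
}
```

With an assumed PF capability of 7 (128 entries) and a VF configured for 512 entries, `response_fits(7, 512)` is false, which corresponds to the out-of-bounds read in the trace. Sizing from the vport's own capability (9 in this example) makes the same response fit.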
Fixes: e16aea2744ab ("net/mlx5: Introduce access functions to modify/query vport mac lists")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260604135849.458060-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2026-06-07
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add sd_group_size bits for SD management
net/mlx5: Update IFC allowed_list_size field bits
====================
Link: https://patch.msgid.link/20260607111157.470978-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, mlx5 is querying the MPIR register to get the number of PFs
that should comprise the SD group.
However, this register does not reflect the correct number in complex
deployments. Hence, add an sd_group_size field to nic_vport_context to
determine the correct number of PFs, and add an sd_group_size capability
bit to indicate whether FW supports it.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260529052359.389413-3-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The vport context allowed_list_size was increased from 12 to 16 bits.
Writing to this field is protected by the log_max_current_uc/mc_list
capabilities. On older FW versions these capabilities are limited
to < 2K and only the high bits of the field are extended. This means
that the change is backward compatible with older FW versions.
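
The backward-compatibility argument can be sketched as below. The sketch assumes the field is modeled as a right-aligned value whose four new bits sit above the original twelve; the macro and function names are illustrative only.

```c
#include <stdint.h>

#define OLD_LIST_SIZE_BITS 12 /* field width before the change */
#define NEW_LIST_SIZE_BITS 16 /* field width after the change */

/* Extract the low `bits` bits of a raw field value. */
static uint32_t field_get(uint32_t raw, unsigned int bits)
{
	return raw & ((UINT32_C(1) << bits) - 1);
}

/*
 * A size reads back identically through the old 12-bit view and the new
 * 16-bit view as long as none of the four added high bits are set.
 */
static int same_in_both_layouts(uint32_t size)
{
	return field_get(size, OLD_LIST_SIZE_BITS) ==
	       field_get(size, NEW_LIST_SIZE_BITS);
}
```

Any size below 2K, which is what older firmware allows through its capabilities, stays within the low 12 bits, so `same_in_both_layouts(2047)` holds while a value such as 4096 needs the extended bits.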
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260529052359.389413-2-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add MLX5_SPF to enum mlx5_func_type so SPFs get their own page counter,
and add the corresponding WARN check at page cleanup. Wait for SPF pages
to be reclaimed during ECPF teardown, alongside the existing host PF and
VF page waits.
SPF page requests are always identified by vhca_id, so the legacy
func_id_to_type() path is not reached for satellite PFs.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260521110843.367329-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a debugfs file exposing per-node DMA pool usage for mlx5_frag_buf
allocations.
# cat /sys/kernel/debug/mlx5/<dev>/frag_buf_dma_pools
node  block_size  used_blocks  allocated_blocks
0     4096        0            0
0     8192        0            0
0     16384       0            0
0     32768       0            0
0     65536       0            0
1     4096        0            0
1     8192        0            0
1     16384       0            0
1     32768       0            0
1     65536       0            0
Signed-off-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260514104925.337570-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add support for VHCA_ID-based page management mode. When the device
firmware advertises the icm_mng_function_id_mode capability with
MLX5_ID_MODE_FUNCTION_VHCA_ID, page management operations between the
driver and firmware may use vhca_id instead of function_id as the
effective function identifier, and the ec_function field is ignored.
Update page management commands to conditionally set ec_function field
only in FUNC_ID mode. Boot page allocation always uses FUNC_ID mode
semantics for backward compatibility, as the capability bit is only
available after set_hca_cap(). If after set_hca_cap() VHCA_ID mode was
set, modify the tracking of the boot pages in page_root_xa to use
vhca_id too.
Add mlx5_esw_vhca_id_to_func_type() to resolve the function type in
VHCA_ID mode, enabling per-type debugfs counters. Use a dedicated
vhca_type_map xarray to provide lockless lookup. Store the resolved
type on each fw_page at allocation time so reclaim and release paths
read it directly without any lookup.
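
The identifier selection described above can be sketched roughly as follows. The enum, struct and function names are hypothetical stand-ins, not the driver's own; only the rules (boot pages use FUNC_ID semantics, ec_function is set only in FUNC_ID mode) come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two page management ID modes. */
enum id_mode { ID_MODE_FUNC_ID, ID_MODE_VHCA_ID };

struct page_cmd_ids {
	uint16_t function_id; /* effective identifier sent to firmware */
	bool ec_function;     /* only meaningful in FUNC_ID mode */
};

/*
 * Boot page allocation always uses FUNC_ID semantics, because the
 * capability is only known after set_hca_cap(). Otherwise, VHCA_ID mode
 * sends the vhca_id and leaves ec_function clear.
 */
static struct page_cmd_ids fill_page_cmd_ids(enum id_mode mode, bool boot,
					     uint16_t func_id,
					     uint16_t vhca_id,
					     bool ec_function)
{
	struct page_cmd_ids ids = { 0 };

	if (boot || mode == ID_MODE_FUNC_ID) {
		ids.function_id = func_id;
		ids.ec_function = ec_function;
	} else {
		ids.function_id = vhca_id;
		ids.ec_function = false;
	}
	return ids;
}
```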
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260506133239.276237-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add the per-function-type debugfs page counters dynamically, after
mlx5_eswitch_init(). When page management operates in vhca_id mode, only
the function acting as either eSwitch or vport manager can initialize
the eSwitch structure and translate the vhca_id to function type for the
functions to which it supplies pages. The next patch will add support
for page management in vhca_id mode.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260506133239.276237-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Representor callbacks can be registered and unregistered while the
E-Switch is already in switchdev mode, and the same E-Switch may also be
reconfigured by devlink, VF changes and SF changes. Serialize these paths
with the per-E-Switch representor mutex instead of relying on ad-hoc bit
state and wait queues.
Take the representor lock around the mode transition, VF/SF representor
changes and representor ops registration. Keep mode_lock and the
representor lock unnested by using the operation flag while the mode lock
is dropped. During mode changes, drop the representor lock around the
auxiliary bus rescan because driver bind/unbind may register or unregister
representor ops.
Split representor ops registration into locked public wrappers and blocked
internal helpers, clear the ops pointer on unregister, and add nested
wrappers for the shared-FDB master IB path that registers peer
representor ops while another E-Switch representor lock is already held.
On unregister, always call __unload_reps_all_vport() before marking reps
unregistered and clearing rep_ops. The per-representor state check makes
this a no-op for types that were not loaded, so unregister no longer has
to infer load state from esw->mode.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260503202726.266415-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add mlx5_dma_pool alloc/free paths, and wire mlx5_frag_buf allocation
and free paths to use them.
mlx5_frag_buf_alloc_node() now selects an mlx5_dma_pool to allocate
fragments from, instead of directly allocating full coherent pages.
mlx5_frag_buf_free() frees from the respective pool.
mlx5_dma_pool_alloc() keeps allocation fast by maintaining pages with
available indexes at the head of the list, so the common allocation path
can take a free index immediately. New backing pages are allocated only
when no free index is available.
mlx5_dma_pool_free() returns released indexes to the pool and frees a
backing page once all of its indexes become free. This avoids keeping
fully free pages for the lifetime of the pool and reduces coherent DMA
memory footprint.
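
A userspace sketch of the allocation policy described above, under stated assumptions: heap memory stands in for coherent DMA pages, each page holds a fixed illustrative number of blocks, and full pages are moved to the tail as one possible way to keep pages with free indexes at the head. All names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define BLOCKS_PER_PAGE 4u                      /* illustrative only */
#define ALL_FREE ((1u << BLOCKS_PER_PAGE) - 1u)

struct pool_page {
	struct pool_page *prev, *next;
	uint32_t free_mask;                     /* bit i set: block i is free */
};

struct pool {
	struct pool_page *head, *tail;          /* pages with free blocks first */
	unsigned int nr_pages;
};

static void page_unlink(struct pool *p, struct pool_page *pg)
{
	if (pg->prev)
		pg->prev->next = pg->next;
	else
		p->head = pg->next;
	if (pg->next)
		pg->next->prev = pg->prev;
	else
		p->tail = pg->prev;
	pg->prev = pg->next = NULL;
}

static void page_add(struct pool *p, struct pool_page *pg, int at_head)
{
	pg->prev = at_head ? NULL : p->tail;
	pg->next = at_head ? p->head : NULL;
	if (pg->prev)
		pg->prev->next = pg;
	else
		p->head = pg;
	if (pg->next)
		pg->next->prev = pg;
	else
		p->tail = pg;
}

/* Take a free index from the head page; back it with a new page if none. */
static int pool_alloc(struct pool *p, struct pool_page **pg_out,
		      unsigned int *idx_out)
{
	struct pool_page *pg = p->head;
	unsigned int idx = 0;

	if (!pg || !pg->free_mask) {
		pg = calloc(1, sizeof(*pg));
		if (!pg)
			return -1;
		pg->free_mask = ALL_FREE;
		page_add(p, pg, 1);
		p->nr_pages++;
	}
	while (!(pg->free_mask & (1u << idx)))
		idx++;
	pg->free_mask &= ~(1u << idx);
	if (!pg->free_mask) {                   /* page is full: move to tail */
		page_unlink(p, pg);
		page_add(p, pg, 0);
	}
	*pg_out = pg;
	*idx_out = idx;
	return 0;
}

/* Return an index; release the backing page once every index is free. */
static void pool_free(struct pool *p, struct pool_page *pg, unsigned int idx)
{
	int was_full = !pg->free_mask;

	pg->free_mask |= 1u << idx;
	if (pg->free_mask == ALL_FREE) {
		page_unlink(p, pg);
		free(pg);
		p->nr_pages--;
	} else if (was_full) {                  /* has free blocks again */
		page_unlink(p, pg);
		page_add(p, pg, 1);
	}
}
```

The common path in `pool_alloc()` only inspects the head page, and `pool_free()` drops a backing page as soon as its last block is returned, so fully free pages are never kept around.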
Signed-off-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260429201429.223809-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Introduce mlx5 DMA pool and pool-page data structures, and add the
creation and teardown paths.
Each NUMA node owns a set of mlx5_dma_pool instances, each one with a
different block size. The sizes are defined as all powers of two
starting from MLX5_ADAPTER_PAGE_SHIFT and up to PAGE_SHIFT. Since
mlx5_frag_bufs are used to back objects whose sizes are encoded relative
to MLX5_ADAPTER_PAGE_SHIFT, a smaller block_shift value cannot be used.
Requests larger than PAGE_SIZE continue to be handled as page-sized
fragments, as in the existing frag-buf allocation model.
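
The size classes can be sketched as below. `ADAPTER_PAGE_SHIFT` stands in for MLX5_ADAPTER_PAGE_SHIFT (4 KiB blocks), the page shift is passed as a parameter instead of using the kernel's PAGE_SHIFT, and the function names are illustrative assumptions.

```c
#include <stddef.h>

#define ADAPTER_PAGE_SHIFT 12 /* stands in for MLX5_ADAPTER_PAGE_SHIFT */

/* One pool per power of two in [ADAPTER_PAGE_SHIFT, page_shift]. */
static unsigned int nr_pools(unsigned int page_shift)
{
	return page_shift - ADAPTER_PAGE_SHIFT + 1;
}

/*
 * Smallest block shift that covers the request, clamped to the valid
 * range: never below the adapter page size, never above the system page
 * size (larger requests are split into page-sized fragments).
 */
static unsigned int block_shift_for(size_t size, unsigned int page_shift)
{
	unsigned int shift = ADAPTER_PAGE_SHIFT;

	while (shift < page_shift && ((size_t)1 << shift) < size)
		shift++;
	return shift;
}
```

On a system with 64 KiB pages (page shift 16) this gives five pools per node, with block sizes from 4096 to 65536 bytes. On a 4 KiB page system there is a single pool per node.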
Signed-off-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260429201429.223809-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Update the query_esw_functions command to support a new response layout
that can report data for multiple network functions. Setting bit 14 of
the op_mod field selects the v1 layout with network_function_params
entries instead of the legacy host_params_context.
The query_host_net_function_v1 read-only capability indicates firmware
support for layout version 1, and query_host_net_function_num_max
advertises the maximum number of network function entries.
Define a new network_function_params layout and a net_function_params
union that groups host_params_context and network_function_params.
Rework the query_esw_functions output to use a flexible array of this
union, and adjust existing driver callers to use it.
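
A small sketch of the layout selection, assuming a caller that already knows whether firmware reported the query_host_net_function_v1 capability. The macro and function names are hypothetical; only the use of bit 14 of op_mod comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for bit 14 of op_mod, which selects the v1 layout. */
#define QUERY_ESW_FUNCTIONS_OP_MOD_V1 (UINT16_C(1) << 14)

/*
 * Request the v1 response layout only when firmware supports it;
 * otherwise keep op_mod at 0 and get the legacy host_params_context.
 */
static uint16_t query_esw_functions_op_mod(bool fw_supports_v1)
{
	return fw_supports_v1 ? QUERY_ESW_FUNCTIONS_OP_MOD_V1 : 0;
}
```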
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260428053851.220089-5-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Drop the unused host_sf_enable array from
mlx5_ifc_query_esw_functions_out_bits layout. This field has been
deprecated in firmware and is not referenced by the mlx5 driver, so it
can be safely removed.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260428053851.220089-4-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add a function_id_type field to the enable_hca and disable_hca command
input layouts in mlx5_ifc.h to allow using vhca_id as the function index
instead of function_id. Firmware support for the new field is indicated
by the function_id_type_vhca_id capability bit, which is already exposed
in hca caps.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260428053851.220089-3-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Rename the vport number enums MLX5_VPORT_PF to MLX5_VPORT_HOST_PF and
MLX5_VPORT_FIRST_VF to MLX5_VPORT_FIRST_HOST_VF to indicate that these
vport indices represent the host PF and its VFs. This prepares the code
for upcoming support of an additional PF type.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260428053851.220089-2-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Pull rdma updates from Jason Gunthorpe:
"The usual collection of driver changes, more core infrastructure
updates than typical this cycle:
- Minor cleanups and kernel-doc fixes in bnxt_re, hns, rdmavt, efa,
ocrdma, erdma, rtrs, hfi1, ionic, and pvrdma
- New udata validation framework and driver updates
- Modernize CQ creation interface in mlx4 and mlx5, manage CQ umem in
core
- Promote UMEM to a core component, split out DMA block iterator
logic
- Introduce FRMR pools with aging, statistics, pinned handles, and
netlink control and use it in mlx5
- Add PCIe TLP emulation support in mlx5
- Extend umem to work with revocable pinned dmabuf's and use it in
irdma
- More net namespace improvements for rxe
- GEN4 hardware support in irdma
- First steps to MW and UC support in mana_ib
- Support for CQ umem and doorbells in bnxt_re
- Drop opa_vnic driver from hfi1
Fixes:
- IB/core zero dmac neighbor resolution race
- GID table memory free
- rxe pad/ICRC validation and r_key async errors
- mlx4 external umem for CQ
- umem DMA attributes on unmap
- mana_ib RX steering on RSS QP destroy"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (116 commits)
RDMA/core: Fix user CQ creation for drivers without create_cq
RDMA/ionic: bound node_desc sysfs read with %.64s
IB/core: Fix zero dmac race in neighbor resolution
RDMA/mana_ib: Support memory windows
RDMA/rxe: Validate pad and ICRC before payload_size() in rxe_rcv
RDMA/core: Prefer NLA_NUL_STRING
RDMA/core: Fix memory free for GID table
RDMA/hns: Remove the duplicate calls to ib_copy_validate_udata_in()
RDMA: Remove redundant = {} for udata req structs
RDMA/irdma: Add missing comp_mask check in alloc_ucontext
RDMA/hns: Add missing comp_mask check in create_qp
RDMA/mlx5: Pull comp_mask validation into ib_copy_validate_udata_in_cm()
RDMA: Use ib_copy_validate_udata_in_cm() for zero comp_mask
RDMA/hns: Use ib_copy_validate_udata_in()
RDMA/mlx4: Use ib_copy_validate_udata_in() for QP
RDMA/mlx4: Use ib_copy_validate_udata_in()
RDMA/mlx5: Use ib_copy_validate_udata_in() for MW
RDMA/mlx5: Use ib_copy_validate_udata_in() for SRQ
RDMA/pvrdma: Use ib_copy_validate_udata_in() for srq
RDMA: Use ib_copy_validate_udata_in() for implicit full structs
...
|
|
Pull VFIO updates from Alex Williamson:
- Update QAT vfio-pci variant driver for Gen 5, 420xx devices (Vijay
Sundar Selvamani, Suman Kumar Chakraborty, Giovanni Cabiddu)
- Fix vfio selftest MMIO DMA mapping selftest (Alex Mastro)
- Conversions to const struct class in support of class_create()
deprecation (Jori Koolstra)
- Improve selftest compiler compatibility by avoiding initializer on
variable-length array (Manish Honap)
- Define new uAPI for drivers supporting migration to advise user-
space of new initial data for reducing target startup latency.
Implemented for mlx5 vfio-pci variant driver (Yishai Hadas)
- Enable vfio selftests on aarch64, not just cross-compiles reporting
arm64 (Ted Logan)
- Update vfio selftest driver support to include additional DSA devices
(Yi Lai)
- Unconditionally include debugfs root pointer in vfio device struct,
avoiding a build failure seen in hisi_acc variant driver without
debugfs otherwise (Arnd Bergmann)
- Add support for the s390 ISM (Internal Shared Memory) device via a
new variant driver. The device is unique in the size of its BAR space
(256TiB) and lack of mmap support (Julian Ruess)
- Enforce that vfio-pci drivers implement a name in their ops structure
for use in sequestering SR-IOV VFs (Alex Williamson)
- Prune leftover group notifier code (Paolo Bonzini)
- Fix Xe vfio-pci variant driver to avoid migration support as a
dependency in the reset path and missing release call (Michał
Winiarski)
* tag 'vfio-v7.1-rc1' of https://github.com/awilliam/linux-vfio: (23 commits)
vfio/xe: Add a missing vfio_pci_core_release_dev()
vfio/xe: Reorganize the init to decouple migration from reset
vfio: remove dead notifier code
vfio/pci: Require vfio_device_ops.name
MAINTAINERS: add VFIO ISM PCI DRIVER section
vfio/ism: Implement vfio_pci driver for ISM devices
vfio/pci: Rename vfio_config_do_rw() to vfio_pci_config_rw_single() and export it
vfio: unhide vdev->debug_root
vfio/qat: add support for Intel QAT 420xx VFs
vfio: selftests: Support DMR and GNR-D DSA devices
vfio: selftests: Build tests on aarch64
vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
vfio/mlx5: consider inflight SAVE during PRE_COPY
net/mlx5: Add IFC bits for migration state
vfio: Adapt drivers to use the core helper vfio_check_precopy_ioctl
vfio: Add support for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2
vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase
vfio: selftests: Fix VLA initialisation in vfio_pci_irq_set()
vfio: uapi: fix comment typo
vfio: mdev: replace mtty_dev->vd_class with a const struct class
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2026-04-09
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add icm_mng_function_id_mode cap bit
net/mlx5: Rename MLX5_PF page counter type to MLX5_SELF
net/mlx5: Add vhca_id_type bit to alias context
mlx5: Remove redundant iseg base
====================
Link: https://patch.msgid.link/20260409110431.154894-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Introduce the capability bit icm_mng_function_id_mode to indicate that
the device firmware uses vhca_id instead of function_id as the effective
identifier for the firmware commands MANAGE_PAGES, QUERY_PAGES, and page
request event.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260403090028.137783-3-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The MLX5_PF enum value in mlx5_func_type is used to track firmware
page allocations for the page manager function itself, which is either
the ECPF on SmartNIC systems or the host PF when there is no ECPF.
Rename it to MLX5_SELF to accurately reflect that this counter tracks
pages allocated by the manager for its own use, regardless of whether
it is a PF or ECPF.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260403090028.137783-2-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add a vhca_id_type bit to the alias context to indicate the type of the
vhca_id passed in vhca_id_to_be_accessed, which can be either HW or SW.
Note that SW_VHCA_ID must be used for the alias to work properly after
migration.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260319122211.27384-3-tariqt@nvidia.com
Reviewed-by: Joe Damato <joe@dama.to>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
iseg_base and base_addr both point to BAR0, making iseg_base redundant.
Remove iseg_base and rely on base_addr instead, reducing the size of
struct mlx5_core_dev.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260319122211.27384-2-tariqt@nvidia.com
Reviewed-by: Joe Damato <joe@dama.to>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add the relevant IFC bits for querying an extra migration state from the
device.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260317161753.18964-5-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2026-03-17
The following pull-request contains common mlx5 updates
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Expose MLX5_UMR_ALIGN definition
{net/RDMA}/mlx5: Add LAG demux table API and vport demux rules
net/mlx5: Add VHCA RX flow destination support for FW steering
net/mlx5: LAG, replace mlx5_get_dev_index with LAG sequence number
net/mlx5: E-switch, modify peer miss rule index to vhca_id
net/mlx5: LAG, use xa_alloc to manage LAG device indices
net/mlx5: LAG, replace pf array with xarray
net/mlx5: Add silent mode set/query and VHCA RX IFC bits
net/mlx5: Add IFC bits for shared headroom pool PBMC support
net/mlx5: Expose TLP emulation capabilities
net/mlx5: Add TLP emulation device capabilities
====================
Link: https://patch.msgid.link/20260317075844.12066-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Expose HW constant value in a shared header, to be used by core/EN
drivers.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260309093435.1850724-10-tariqt@nvidia.com
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Downstream patches will introduce SW-only LAG (e.g. shared_fdb without
HW LAG). In this mode the firmware cannot create the LAG demux table,
but vport demuxing is still required.
Move LAG demux flow-table ownership to the LAG layer and introduce APIs
to init/cleanup the demux table and add/delete per-vport rules. Adjust
the RDMA driver to use the new APIs.
In this mode, the LAG layer will create a flow group that matches vport
metadata. Vports that are not native to the LAG master eswitch add the
demux rule during IB representor load and remove it on unload.
The demux rule forwards traffic from said vports to their native eswitch
manager via a new dest type - MLX5_FLOW_DESTINATION_TYPE_VHCA_RX.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260309093435.1850724-9-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Introduce MLX5_FLOW_DESTINATION_TYPE_VHCA_RX as a new flow steering
destination type.
Wire the new destination through flow steering command setup by mapping
it to MLX5_IFC_FLOW_DESTINATION_TYPE_VHCA_RX and passing the vhca id,
extend forward-destination validation to accept it, and teach the flow
steering tracepoint formatter to print rx_vhca_id.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260309093435.1850724-8-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Introduce mlx5_lag_get_dev_seq() which returns a device's sequence
number within the LAG: the master is always 0 and the remaining devices are
numbered sequentially. This provides a stable index for peer flow tracking and
vport ordering without depending on native_port_num.
Replace mlx5_get_dev_index() usage in en_tc.c (peer flow array
indexing) and ib_rep.c (vport index ordering) with the new API.
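
The numbering rule can be sketched as follows. This models the LAG as a plain array of device ids plus a master id, which is an assumption for illustration; the function name and signature do not match mlx5_lag_get_dev_seq().

```c
/*
 * Hypothetical model: devs[] lists the device ids in the LAG and
 * master_id identifies the master. Returns the sequence number of
 * dev_id, or -1 if it is not part of the LAG.
 */
static int lag_dev_seq(const int *devs, int n, int master_id, int dev_id)
{
	int seq = 1; /* 0 is reserved for the master */
	int i;

	for (i = 0; i < n; i++) {
		if (devs[i] == dev_id)
			return dev_id == master_id ? 0 : seq;
		if (devs[i] != master_id)
			seq++;
	}
	return -1;
}
```

For devices `{10, 20, 30}` with master 20, the master maps to 0 and the others to 1 and 2, regardless of where the master sits in the array.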
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260309093435.1850724-7-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Update the mlx5 IFC headers with newly defined capability and
command-layout bits:
- Add silent_mode_query and rename silent_mode to silent_mode_set cap
fields.
- Add forward_vhca_rx and MLX5_IFC_FLOW_DESTINATION_TYPE_VHCA_RX.
- Expose silent mode fields in the L2 table query command structures.
Update the SD support check to use the new capability name
(silent_mode_set) to match the updated IFC definition.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260309093435.1850724-3-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add hardware interface definitions for shared headroom pool (SHP) in
port buffer management:
- shp_pbmc_pbsr_support: capability bit in PCAM enhanced features
indicating device support for shared headroom pool in PBMC/PBSR.
- shared_headroom_pool: buffer entry in PBMC register (pbmc_reg_bits)
for the shared headroom pool configuration, reusing the bufferx
layout; reduce trailing reserved region accordingly.
Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260309093435.1850724-2-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Use the previously introduced shared devlink infrastructure to create
a shared devlink instance for mlx5 PFs that reside on the same physical
chip. The shared instance is identified by the chip's serial number
extracted from PCI VPD (V3 keyword, with fallback to serial number
for older devices).
Each PF that probes calls mlx5_shd_init() which extracts the chip serial
number and uses devlink_shd_get() to get or create the shared instance.
When a PF is removed, mlx5_shd_uninit() calls devlink_shd_put()
to release the reference. The shared instance is automatically destroyed
when the last PF is removed.
Make the PF devlink instances nested in this shared devlink instance,
allowing userspace to identify which PFs belong to the same physical
chip.
Example:
pci/0000:08:00.0: index 0
  nested_devlink:
    auxiliary/mlx5_core.eth.0
devlink_index/1: index 1
  nested_devlink:
    pci/0000:08:00.0
    pci/0000:08:00.1
auxiliary/mlx5_core.eth.0: index 2
pci/0000:08:00.1: index 3
  nested_devlink:
    auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1: index 4
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260312100407.551173-14-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This series adds support for Transaction Layer Packet (TLP) emulation
response gateway regions, enabling userspace device emulation software
to write TLP responses directly to lower layers without kernel driver
involvement.
Currently, the mlx5 driver exposes VirtIO emulation access regions via
the MLX5_IB_METHOD_VAR_OBJ_ALLOC ioctl. This series extends that
ioctl to also support allocating TLP response gateway channels for
PCI device emulation use cases.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Expose and query TLP device emulation caps on driver load.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
Introduce the hardware structures and definitions needed for the driver
support of TLP emulation in mlx5_ifc.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
Following the mlx5_ib move to FRMR pools, drop all unused MR cache
code.
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260226-frmr_pools-v4-7-95360b54f15e@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The driver uses the num_vhca_ports capability to distinguish between a
multiport master device and a multiport slave device. num_vhca_ports is a
capability the driver sets according to the MAX num_vhca_ports
capability reported by FW. Light SFs, on the other hand, don't set the
above capability.
This leads to wrong results whenever a light SF checks whether it is
a multiport master or slave.
Therefore, use the MAX capability to distinguish between master and
slave devices.
Fixes: e71383fb9cd1 ("net/mlx5: Light probe local SFs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <Jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260218072904.1764634-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rename TAUI/TBASE to GAUI/GBASE in 1600G link mode identifier and its
usage in ethtool and link-info tables.
Reported-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://patch.msgid.link/20260204194324.1723534-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently the driver has an inconsistent behaviour between modes when it
comes to oversized packets that are not dropped through the physical MTU
check in HW. This can happen for Multi Host configurations where each
port has a different MTU.
Current behavior:
1) Striding RQ in linear mode drops the packet in SW and counts it
with oversize_pkts_sw_drop.
2) Striding RQ in non-linear mode allows it like a normal packet.
3) Legacy RQ can't receive oversized packets by design:
the RX WQE uses MTU sized packet buffers.
This inconsistency is not a violation of the netdev policy [1]
but it is better to be consistent across modes.
This patch aligns (2) with (1) and (3). One exception is added for
LRO: don't drop the oversized packet if it is an LRO packet.
As now rq->hw_mtu always needs to be updated during the MTU change flow,
drop the reset avoidance optimization from mlx5e_change_mtu().
Extract the CQE LRO segments reading into a helper function as it
is used twice now.
[1] Documentation/networking/netdevices.rst#L205
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260203072130.1710255-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
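The resulting drop policy can be sketched as a small standalone C model. The enum, struct, and function names below are hypothetical, as is the rule that more than one segment marks an LRO packet; the sketch also applies the LRO exception to both striding modes, which the commit message does not spell out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical RQ modes for illustration; not the driver's own enum. */
enum rq_mode {
	RQ_STRIDING_LINEAR,
	RQ_STRIDING_NONLINEAR,
	RQ_LEGACY,
};

/* Hypothetical per-packet view: length plus LRO segment count from the CQE. */
struct rx_pkt {
	uint32_t len;
	uint8_t lro_num_seg;
};

struct rq_stats {
	uint64_t oversize_pkts_sw_drop;
};

/* Assumption: more than one segment marks an LRO-aggregated packet. */
static bool pkt_is_lro(const struct rx_pkt *pkt)
{
	return pkt->lro_num_seg > 1;
}

/* Returns true if the packet is dropped in SW and counted. */
static bool rq_drop_oversized(enum rq_mode mode, uint32_t hw_mtu,
			      const struct rx_pkt *pkt, struct rq_stats *stats)
{
	if (mode == RQ_LEGACY)
		return false;	/* MTU-sized buffers: oversized packets never arrive */
	if (pkt->len <= hw_mtu || pkt_is_lro(pkt))
		return false;	/* fits the MTU, or exempt as an LRO packet */
	stats->oversize_pkts_sw_drop++;
	return true;
}
```

Because the comparison depends on hw_mtu, the value has to be refreshed on every MTU change, which is consistent with the commit dropping the reset avoidance optimization.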
|
|
Add hardware interface definitions to support extended bandwidth rate
limiting in the QoS Enhanced Transmission Selection (ETS) configuration.
The new fields include:
- max_bw_value: extended from 8-bit to 16-bit in ets_tcn_config_reg,
simplifying the implementation by using a single field instead of
separate MSB/LSB fields.
- qetcr_qshr_max_bw_val_msb: capability bit in qcam_qos_feature_cap_mask
indicating device support for the extended 16-bit max_bw_value field.
These interface additions are prerequisites for increasing the per-TC
rate limit beyond 255 Gbps to support higher-bandwidth NICs.
Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1768200608-1543180-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
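A minimal sketch of why the wider field matters, assuming the value is expressed in units of 1 Gbps (implied by the 255 Gbps figure but not stated as the only unit) and assuming out-of-range requests are clamped rather than rejected. The function name and the ext_cap parameter are hypothetical; ext_cap stands in for the qetcr_qshr_max_bw_val_msb capability bit.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Encode a per-TC rate limit into the max_bw_value field.
 * Without the extended capability only 8 bits are usable, so the
 * largest encodable value is 255; with it, the full 16 bits apply.
 */
static uint16_t ets_encode_max_bw(uint32_t rate, bool ext_cap)
{
	uint32_t limit = ext_cap ? UINT16_MAX : UINT8_MAX;

	/* Assumption: clamp to the field's range instead of failing. */
	return (uint16_t)(rate > limit ? limit : rate);
}
```

Under these assumptions a 400 Gbps request saturates at 255 on a device without the capability and is encoded exactly on a device that has it.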
|
|
Add mlx5_lag_query_bond_speed() to query the aggregated speed of
LAG configurations with a bond device.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add port change event handling logic for MPESW LAG mode, ensuring
VFs are updated when the speed of LAG physical ports changes.
This triggers a speed update workflow when relevant port state changes
occur, enabling consistent and accurate reporting of VF bandwidth.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, vports report only their parent's uplink speed, which in LAG
setups does not reflect the true aggregated bandwidth. This makes it
hard for upper-layer software to optimize load balancing decisions
based on accurate bandwidth information.
Fix the issue by calculating the possible maximum speed of a LAG as
the sum of speeds of all active uplinks that are part of the LAG.
Propagate this effective max speed to vports associated with the LAG
whenever a relevant event occurs, such as physical port link state
changes or LAG creation/modification.
With this change, upper-layer components receive accurate bandwidth
information corresponding to the active members of the LAG and can
make better load balancing decisions.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
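The calculation described above, the sum of the speeds of all active uplinks in the LAG, can be sketched as follows. The struct layout, field names, and the Mbps unit are assumptions for illustration and do not mirror the driver's data structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical uplink description; field names are illustrative only. */
struct uplink {
	bool lag_member;	/* port is part of the LAG */
	bool link_up;		/* port is currently active */
	uint32_t speed_mbps;	/* negotiated link speed */
};

/* Effective max speed of the LAG: sum over active member uplinks. */
static uint64_t lag_max_speed_mbps(const struct uplink *ports, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		if (ports[i].lag_member && ports[i].link_up)
			sum += ports[i].speed_mbps;
	}
	return sum;
}
```

Recomputing this sum on each link state change or LAG creation/modification event, and pushing the result to the associated vports, matches the propagation described in the commit message.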
|
|
Introduce the max_tx_speed field to the query and modify_vport_state
structures.
Add the esw_vport_state_max_tx_speed capability bit, indicating
that the firmware supports modifying the max_tx_speed field via the
MODIFY_VPORT_STATE command.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
This completes the previous patches by moving notifier registration for
SF dev tables outside the devlink locked critical section in
mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() /
mlx5_mdev_uninit() functions.
This is only done for non-SFs, since SFs do not have a SF HW table
themselves.
After this patch, notifiers can grab the PF devlink lock (soon to be
necessary) without creating a locking cycle.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1763325940-1231508-7-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move the SF table notifiers registration/unregistration outside of
mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() /
mlx5_mdev_uninit() functions.
This is only done for non-SFs, since SFs do not have a SF table
themselves and thus don't need notifiers.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1763325940-1231508-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move the SF HW table notifier registration/unregistration outside of
mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() /
mlx5_mdev_uninit() functions.
This is only done for non-SFs, since SFs do not have a SF HW table
themselves.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1763325940-1231508-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|