summaryrefslogtreecommitdiff
path: root/include/linux/mlx5
AgeCommit message (Collapse)AuthorFilesLines
2025-02-17net/mlx5: use do_aux_work for PHC overflow checksVadim Fedorenko1-1/+0
[ Upstream commit e61e6c415ba9ff2b32bb6780ce1b17d1d76238f1 ] The overflow_work is using system wq to do overflow checks and updates for PHC device timecounter, which might be overhelmed by other tasks. But there is dedicated kthread in PTP subsystem designed for such things. This patch changes the work queue to proper align with PTP subsystem and to avoid overloading system work queue. The adjfine() function acts the same way as overflow check worker, we can postpone ptp aux worker till the next overflow period after adjfine() was called. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250107104812.380225-1-vadfed@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-09RDMA/mlx5: Enforce same type port association for multiport RoCEPatrisious Haddad1-0/+6
[ Upstream commit e05feab22fd7dabcd6d272c4e2401ec1acdfdb9b ] Different core device types such as PFs and VFs shouldn't be affiliated together since they have different capabilities, fix that by enforcing type check before doing the affiliation. Fixes: 32f69e4be269 ("{net, IB}/mlx5: Manage port association for multiport RoCE") Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/88699500f690dff1c1852c1ddb71f8a1cc8b956e.1733233480.git.leonro@nvidia.com Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-18net/mlx5: Correct TASR typo into TSARCosmin Ratiu1-1/+1
[ Upstream commit e575d3a6dd22123888defb622b1742aa2d45b942 ] TSAR is the correct spelling (Transmit Scheduling ARbiter). Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240613210036.1125203-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 861cd9b9cb62 ("net/mlx5: Verify support for scheduling element and TSAR type") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-18net/mlx5: Add missing masks and QoS bit masks for scheduling elementsCarolina Jubran1-1/+9
[ Upstream commit 452ef7f86036392005940de54228d42ca0044192 ] Add the missing masks for supported element types and Transmit Scheduling Arbiter (TSAR) types in scheduling elements. Also, add the corresponding bit masks for these types in the QoS capabilities of a NIC scheduler. Fixes: 214baf22870c ("net/mlx5e: Support HTB offload") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-18IB/mlx5: Rename 400G_8X speed to comply to naming conventionPatrisious Haddad1-1/+1
[ Upstream commit b28ad32442bec2f0d9cb660d7d698a1a53c13d08 ] Rename 400G_8X speed to comply to naming convention. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/ac98447cac8379a43fbdb36d56e5fb2b741a97ff.1695204156.git.leon@kernel.org Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Stable-dep-of: 80bf474242b2 ("net/mlx5e: Add missing link mode to ptys2ext_ethtool_map") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-03RDMA/mlx5: Use sq timestamp as QP timestamp when RoCE is disabledOr Har-Toov1-3/+6
[ Upstream commit 0c5275bf75ec3708d95654195ae4ed80d946d088 ] When creating a QP, one of the attributes is TS format (timestamp). In some devices, we have a limitation that all QPs should have the same ts_format. The ts_format is chosen based on the device's capability. The qp_ts_format cap resides under the RoCE caps table, and the cap will be 0 when RoCE is disabled. So when RoCE is disabled, the value that should be queried is sq_ts_format under HCA caps. Consider the case when the system supports REAL_TIME_TS format (0x2), some QPs are created with REAL_TIME_TS as ts_format, and afterwards RoCE gets disabled. When trying to construct a new QP, we can't use the qp_ts_format, that is queried from the RoCE caps table, Since it leads to passing 0x0 (FREE_RUNNING_TS) as the value of the qp_ts_format, which is different than the ts_format of the previously allocated QPs REAL_TIME_TS format (0x2). Thus, to resolve this, read the sq_ts_format, which also reflect the supported ts format for the QP when RoCE is disabled. Fixes: 4806f1e2fee8 ("net/mlx5: Set QP timestamp mode to default") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Link: https://lore.kernel.org/r/32801966eb767c7fd62b8dea3b63991d5fbfe213.1718554199.git.leon@kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12net/mlx5: Fix MTMP register capability offset in MCAM registerGal Pressman1-2/+2
[ Upstream commit 1b9f86c6d53245dab087f1b2c05727b5982142ff ] The MTMP register (0x900a) capability offset is off-by-one, move it to the right place. Fixes: 1f507e80c700 ("net/mlx5: Expose NIC temperature via hardware monitoring kernel API") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12net/mlx5: Add a timeout to acquire the command queue semaphoreAkiva Goldberger1-0/+1
[ Upstream commit 485d65e1357123a697c591a5aeb773994b247ad7 ] Prevent forced completion handling on an entry that has not yet been assigned an index, causing an out of bounds access on idx = -22. Instead of waiting indefinitely for the sem, blocking flow now waits for index to be allocated or a sem acquisition timeout before beginning the timer for FW completion. Kernel log example: mlx5_core 0000:06:00.0: wait_func_handle_exec_timeout:1128:(pid 185911): cmd[-22]: CREATE_UCTX(0xa04) No done completion Fixes: 8e715cd613a1 ("net/mlx5: Set command entry semaphore up once got index free") Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240509112951.590184-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27RDMA/mlx5: Fix fortify source warning while accessing Eth segmentLeon Romanovsky1-1/+4
[ Upstream commit 4d5e86a56615cc387d21c629f9af8fb0e958d350 ] ------------[ cut here ]------------ memcpy: detected field-spanning write (size 56) of single field "eseg->inline_hdr.start" at /var/lib/dkms/mlnx-ofed-kernel/5.8/build/drivers/infiniband/hw/mlx5/wr.c:131 (size 2) WARNING: CPU: 0 PID: 293779 at /var/lib/dkms/mlnx-ofed-kernel/5.8/build/drivers/infiniband/hw/mlx5/wr.c:131 mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib] Modules linked in: 8021q garp mrp stp llc rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) mlx5_core(OE) pci_hyperv_intf mlxdevm(OE) mlx_compat(OE) tls mlxfw(OE) psample nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink mst_pciconf(OE) knem(OE) vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd irqbypass cuse nfsv3 nfs fscache netfs xfrm_user xfrm_algo ipmi_devintf ipmi_msghandler binfmt_misc crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 snd_pcsp aesni_intel crypto_simd cryptd snd_pcm snd_timer joydev snd soundcore input_leds serio_raw evbug nfsd auth_rpcgss nfs_acl lockd grace sch_fq_codel sunrpc drm efi_pstore ip_tables x_tables autofs4 psmouse virtio_net net_failover failover floppy [last unloaded: mlx_compat(OE)] CPU: 0 PID: 293779 Comm: ssh Tainted: G OE 6.2.0-32-generic #32~22.04.1-Ubuntu Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib] Code: 0c 01 00 a8 01 75 25 48 8b 75 a0 b9 02 00 00 00 48 c7 c2 10 5b fd c0 48 c7 c7 80 5b fd c0 c6 05 57 0c 03 00 01 e8 95 4d 93 da <0f> 0b 44 8b 4d b0 4c 8b 45 c8 48 8b 4d c0 e9 49 fb ff ff 41 0f b7 RSP: 0018:ffffb5b48478b570 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffb5b48478b628 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffb5b48478b5e8 R13: ffff963a3c609b5e R14: ffff9639c3fbd800 R15: ffffb5b480475a80 FS: 00007fc03b444c80(0000) GS:ffff963a3dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000556f46bdf000 CR3: 0000000006ac6003 CR4: 00000000003706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? show_regs+0x72/0x90 ? mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib] ? __warn+0x8d/0x160 ? mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib] ? report_bug+0x1bb/0x1d0 ? handle_bug+0x46/0x90 ? exc_invalid_op+0x19/0x80 ? asm_exc_invalid_op+0x1b/0x20 ? mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib] mlx5_ib_post_send_nodrain+0xb/0x20 [mlx5_ib] ipoib_send+0x2ec/0x770 [ib_ipoib] ipoib_start_xmit+0x5a0/0x770 [ib_ipoib] dev_hard_start_xmit+0x8e/0x1e0 ? validate_xmit_skb_list+0x4d/0x80 sch_direct_xmit+0x116/0x3a0 __dev_xmit_skb+0x1fd/0x580 __dev_queue_xmit+0x284/0x6b0 ? _raw_spin_unlock_irq+0xe/0x50 ? __flush_work.isra.0+0x20d/0x370 ? push_pseudo_header+0x17/0x40 [ib_ipoib] neigh_connected_output+0xcd/0x110 ip_finish_output2+0x179/0x480 ? __smp_call_single_queue+0x61/0xa0 __ip_finish_output+0xc3/0x190 ip_finish_output+0x2e/0xf0 ip_output+0x78/0x110 ? __pfx_ip_finish_output+0x10/0x10 ip_local_out+0x64/0x70 __ip_queue_xmit+0x18a/0x460 ip_queue_xmit+0x15/0x30 __tcp_transmit_skb+0x914/0x9c0 tcp_write_xmit+0x334/0x8d0 tcp_push_one+0x3c/0x60 tcp_sendmsg_locked+0x2e1/0xac0 tcp_sendmsg+0x2d/0x50 inet_sendmsg+0x43/0x90 sock_sendmsg+0x68/0x80 sock_write_iter+0x93/0x100 vfs_write+0x326/0x3c0 ksys_write+0xbd/0xf0 ? do_syscall_64+0x69/0x90 __x64_sys_write+0x19/0x30 do_syscall_64+0x59/0x90 ? do_user_addr_fault+0x1d0/0x640 ? exit_to_user_mode_prepare+0x3b/0xd0 ? irqentry_exit_to_user_mode+0x9/0x20 ? irqentry_exit+0x43/0x50 ? exc_page_fault+0x92/0x1b0 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fc03ad14a37 Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007ffdf8697fe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000008024 RCX: 00007fc03ad14a37 RDX: 0000000000008024 RSI: 0000556f46bd8270 RDI: 0000000000000003 RBP: 0000556f46bb1800 R08: 0000000000007fe3 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 R13: 0000556f46bc66b0 R14: 000000000000000a R15: 0000556f46bb2f50 </TASK> ---[ end trace 0000000000000000 ]--- Link: https://lore.kernel.org/r/8228ad34bd1a25047586270f7b1fb4ddcd046282.1706433934.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-15net/mlx5: Check capability for fw_resetMoshe Shemesh1-1/+3
[ Upstream commit 5e6107b499f3fc4748109e1d87fd9603b34f1e0d ] Functions which can't access MFRL (Management Firmware Reset Level) register, have no use of fw_reset structures or events. Remove fw_reset structures allocation and registration for fw reset events notifications for these functions. Having the devlink param enable_remote_dev_reset on functions that don't have this capability is misleading as these functions are not allowed to influence the reset flow. Hence, this patch removes this parameter for such functions. In addition, return not supported on devlink reload action fw_activate for these functions. Fixes: 38b9f903f22b ("net/mlx5: Handle sync reset request event") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01IB/mlx5: Don't expose debugfs entries for RRoCE general parameters if not ↵Mark Zhang1-1/+1
supported [ Upstream commit 43fdbd140238d44e7e847232719fef7d20f9d326 ] debugfs entries for RRoCE general CC parameters must be exposed only when they are supported, otherwise when accessing them there may be a syndrome error in kernel log, for example: $ cat /sys/kernel/debug/mlx5/0000:08:00.1/cc_params/rtt_resp_dscp cat: '/sys/kernel/debug/mlx5/0000:08:00.1/cc_params/rtt_resp_dscp': Invalid argument $ dmesg mlx5_core 0000:08:00.1: mlx5_cmd_out_err:805:(pid 1253): QUERY_CONG_PARAMS(0x824) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x325a82), err(-22) Fixes: 66fb1d5df6ac ("IB/mlx5: Extend debug control for CC parameters") Reviewed-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/e7ade70bad52b7468bdb1de4d41d5fad70c8b71c.1706433934.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-01net/mlx5: Bridge, fix multicast packets sent to uplinkMoshe Shemesh2-1/+2
[ Upstream commit ec7cc38ef9f83553102e84c82536971a81630739 ] To enable multicast packets which are offloaded in bridge multicast offload mode to be sent also to uplink, FTE bit uplink_hairpin_en should be set. Add this bit to FTE for the bridge multicast offload rules. Fixes: 18c2916cee12 ("net/mlx5: Bridge, snoop igmp/mld packets") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-01net/mlx5: Bridge, Enable mcast in smfs steering modeErez Shitrit1-0/+1
[ Upstream commit 653b7eb9d74426397c95061fd57da3063625af65 ] In order to have mcast offloads the driver needs the following: It should know if that mcast comes from wire port, in addition the flow should not be marked as any specific source, that way it will give the flexibility for the driver not to be depended on the way iterator implemented in the FW. Signed-off-by: Erez Shitrit <erezsh@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Stable-dep-of: ec7cc38ef9f8 ("net/mlx5: Bridge, fix multicast packets sent to uplink") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-20RDMA/mlx5: Send events from IB driver about device affiliation statePatrisious Haddad2-0/+4
[ Upstream commit 0d293714ac32650bfb669ceadf7cc2fad8161401 ] Send blocking events from IB driver whenever the device is done being affiliated or if it is removed from an affiliation. This is useful since now the EN driver can register to those event and know when a device is affiliated or not. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://lore.kernel.org/r/a7491c3e483cfd8d962f5f75b9a25f253043384a.1695296682.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Stable-dep-of: 762a55a54eec ("net/mlx5e: Disable IPsec offload support if not FW steering") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-20net/mlx5e: Tidy up IPsec NAT-T SA discoveryLeon Romanovsky1-1/+1
[ Upstream commit c2bf84f1d1a1595dcc45fe867f0e02b331993fee ] IPsec NAT-T packets are UDP encapsulated packets over ESP normal ones. In case they arrive to RX, the SPI and ESP are located in inner header, while the check was performed on outer header instead. That wrong check caused to the situation where received rekeying request was missed and caused to rekey timeout, which "compensated" this failure by completing rekeying. Fixes: d65954934937 ("net/mlx5e: Support IPsec NAT-T functionality") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-20net/mlx5e: Honor user choice of IPsec replay window sizeLeon Romanovsky1-0/+7
[ Upstream commit a5e400a985df8041ed4659ed1462aa9134318130 ] Users can configure IPsec replay window size, but mlx5 driver didn't honor their choice and set always 32bits. Fix assignment logic to configure right size from the beginning. Fixes: 7db21ef4566e ("net/mlx5e: Set IPsec replay sequence numbers") Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-28net/mlx5: Provide an interface to block change of IPsec capabilitiesLeon Romanovsky1-0/+1
mlx5 HW can't perform IPsec offload operation simultaneously both on PF and VFs at the same time. While the previous patches added devlink knobs to change IPsec capabilities dynamically, there is a need to add a logic to block such IPsec capabilities for the cases when IPsec is already configured. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-7-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-28net/mlx5: Add IFC bits to support IPsec enable/disableLeon Romanovsky1-0/+3
Add hardware definitions to allow to control IPSec capabilities. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-6-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-24Merge branch 'mlx5-next' of ↵Jakub Kicinski4-1/+87
https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Leon Romanovsky says: ==================== mlx5 MACsec RoCEv2 support From Patrisious: This series extends previously added MACsec offload support to cover RoCE traffic either. In order to achieve that, we need configure MACsec with offload between the two endpoints, like below: REMOTE_MAC=10:70:fd:43:71:c0 * ip addr add 1.1.1.1/16 dev eth2 * ip link set dev eth2 up * ip link add link eth2 macsec0 type macsec encrypt on * ip macsec offload macsec0 mac * ip macsec add macsec0 tx sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16 * ip macsec add macsec0 rx port 1 address $REMOTE_MAC * ip macsec add macsec0 rx port 1 address $REMOTE_MAC sa 0 pn 1 on key 01 ead3664f508eb06c40ac7104cdae4ce5 * ip addr add 10.1.0.1/16 dev macsec0 * ip link set dev macsec0 up And in a similar manner on the other machine, while noting the keys order would be reversed and the MAC address of the other machine. RDMA traffic is separated through relevant GID entries and in case of IP ambiguity issue - meaning we have a physical GIDs and a MACsec GIDs with the same IP/GID, we disable our physical GID in order to force the user to only use the MACsec GID. v0: https://lore.kernel.org/netdev/20230813064703.574082-1-leon@kernel.org/ * 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: RDMA/mlx5: Handles RoCE MACsec steering rules addition and deletion net/mlx5: Add RoCE MACsec steering infrastructure in core net/mlx5: Configure MACsec steering for ingress RoCEv2 traffic net/mlx5: Configure MACsec steering for egress RoCEv2 traffic IB/core: Reorder GID delete code for RoCE net/mlx5: Add MACsec priorities in RDMA namespaces RDMA/mlx5: Implement MACsec gid addition and deletion net/mlx5: Maintain fs_id xarray per MACsec device inside macsec steering net/mlx5: Remove netdevice from MACsec steering net/mlx5e: Move MACsec flow steering and statistics database from ethernet to core net/mlx5e: Rename MACsec flow steering functions/parameters to suit core naming style net/mlx5: Remove dependency of macsec flow steering on ethernet net/mlx5e: Move MACsec flow steering operations to be used as core library macsec: add functions to get macsec real netdevice and check offload ==================== Link: https://lore.kernel.org/r/20230821073833.59042-1-leon@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-20RDMA/mlx5: Handles RoCE MACsec steering rules addition and deletionPatrisious Haddad1-1/+3
Add RoCE MACsec rules when a gid is added for the MACsec netdevice and handle their cleanup when the gid is removed or the MACsec SA is deleted. Also support alias IP for the MACsec device, as long as we don't have more ips than what the gid table can hold. In addition handle the case where a gid is added but there are still no SAs added for the MACsec device, so the rules are added later on when the SAs are added. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-08-20net/mlx5: Add RoCE MACsec steering infrastructure in corePatrisious Haddad3-0/+36
Adds all the core steering helper functions that are needed in order to setup RoCE steering rules which includes both the RX and TX rules addition and deletion. As well as exporting the function to be ready to use from the IB driver where we expose functions to allow deletion of all rules, which is needed when a GID is deleted, or a deletion of a specific rule when an SA is deleted, and a similar manner for the rules addition. These functions are used in a later patch by IB driver to trigger the rules addition/deletion when needed. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-08-20net/mlx5: Add MACsec priorities in RDMA namespacesPatrisious Haddad1-0/+2
Add MACsec flow steering priorities in RDMA namespaces. This allows adding tables/rules to forward RoCEv2 traffic to the MACsec crypto tables in NIC_TX domain, and accept RoCEv2 traffic from NIC_RX domain. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-08-20RDMA/mlx5: Implement MACsec gid addition and deletionPatrisious Haddad1-0/+44
Handle MACsec IP ambiguity issue, since mlx5 hw can't support programming both the MACsec and the physical gid when they have the same IP address, because it wouldn't know to whom to steer the traffic. Hence in such case we delete the physical gid from the hw gid table, which would then cause all traffic sent over it to fail, and we'll only be able to send traffic over the MACsec gid. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-08-20net/mlx5e: Move MACsec flow steering and statistics database from ethernet ↵Patrisious Haddad1-0/+3
to core Since now MACsec flow steering (macsec_fs) and MACsec statistics (stats) are maintained by the core driver, move their data as well to be saved inside core structures instead of staying part of ethernet MACsec database. In addition cleanup all MACsec stats functions from the ethernet MACsec code and move what's needed to be part of macsec_fs instead. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-08-15net/mlx5: Remove unused MAX HCA capabilitiesShay Drory2-53/+0
Each device cap has two modes: MAX and CUR. The driver maintains a cache of both modes of the capabilities. For most device caps, the MAX cap mode is never used. Hence, remove all driver queries of the MAX mode of the said caps as well as their helper MACROs. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-15net/mlx5: Remove unused CAPsShay Drory2-59/+1
mlx5 driver queries the device for VECTOR_CALC and SHAMPO caps, but there isn't any user who requires them. As well as, MLX5_MCAM_REGS_0x9080_0x90FF is queried but not used. Thus, drop all usages and definitions of the mentioned caps above. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-15net/mlx5: Check with FW that sync reset completed successfullyMoshe Shemesh1-1/+2
Even if the PF driver had no error on his part of the sync reset flow, the firmware can see wider picture as it syncs all the PFs in the flow. So add at end of sync reset flow check with firmware by reading MFRL register and initialization segment that the flow had no issue from firmware point of view too. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-10net/mlx5: Expose NIC temperature via hardware monitoring kernel APIAdham Faris2-2/+15
Expose NIC temperature by implementing hwmon kernel API, which turns current thermal zone kernel API to redundant. For each one of the supported and exposed thermal diode sensors, expose the following attributes: 1) Input temperature. 2) Highest temperature. 3) Temperature label: Depends on the firmware capability, if firmware doesn't support sensors naming, the fallback naming convention would be: "sensorX", where X is the HW spec (MTMP register) sensor index. 4) Temperature critical max value: refers to the high threshold of Warning Event. Will be exposed as `tempY_crit` hwmon attribute (RO attribute). For example for ConnectX5 HCA's this temperature value will be 105 Celsius, 10 degrees lower than the HW shutdown temperature). 5) Temperature reset history: resets highest temperature. For example, for dualport ConnectX5 NIC with a single IC thermal diode sensor will have 2 hwmon directories (one for each PCI function) under "/sys/class/hwmon/hwmon[X,Y]". Listing one of the directories above (hwmonX/Y) generates the corresponding output below: $ grep -H -d skip . /sys/class/hwmon/hwmon0/* Output ======================================================================= /sys/class/hwmon/hwmon0/name:mlx5 /sys/class/hwmon/hwmon0/temp1_crit:105000 /sys/class/hwmon/hwmon0/temp1_highest:48000 /sys/class/hwmon/hwmon0/temp1_input:46000 /sys/class/hwmon/hwmon0/temp1_label:asic grep: /sys/class/hwmon/hwmon0/temp1_reset_history: Permission denied In addition, displaying the sensors data via lm_sensors generates the corresponding output below: $ sensors Output ======================================================================= mlx5-pci-0800 Adapter: PCI adapter asic: +46.0°C (crit = +105.0°C, highest = +48.0°C) mlx5-pci-0801 Adapter: PCI adapter asic: +46.0°C (crit = +105.0°C, highest = +48.0°C) CC: Jean Delvare <jdelvare@suse.com> Signed-off-by: Adham Faris <afaris@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230807180507.22984-3-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-07net/mlx5: Allocate completion EQs dynamicallyMaher Sanalla1-1/+1
This commit enables the dynamic allocation of EQs at runtime, allowing for more flexibility in managing completion EQs and reducing the memory overhead of driver load. Whenever a CQ is created for a given vector index, the driver will lookup to see if there is an already mapped completion EQ for that vector, if so, utilize it. Otherwise, allocate a new EQ on demand and then utilize it for the CQ completion events. Add a protection lock to the EQ table to protect from concurrent EQ creation attempts. While at it, replace mlx5_vector2irqn()/mlx5_vector2eqn() with mlx5_comp_eqn_get() and mlx5_comp_irqn_get() which will allocate an EQ on demand if no EQ is found for the given vector. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-07net/mlx5: Rename mlx5_comp_vectors_count() to mlx5_comp_vectors_max()Maher Sanalla1-1/+1
To accurately represent its purpose, rename the function that retrieves the value of maximum vectors from mlx5_comp_vectors_count() to mlx5_comp_vectors_max(). Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-07net/mlx5: Add IRQ vector to CPU lookup functionMaher Sanalla1-2/+1
Currently, once driver load completes, IRQ requests were performed for all vectors. However, as we move to support dynamic creation of EQs, this will not be the case as some IRQs will not exist at this stage. Thus, in such case, use the default CPU to IRQ mapping which is the serial mapping based on IRQ vector index. Meaning, the n'th vector gets mapped to the n'th CPU. Introduce an API function mlx5_comp_vector_cpu() that takes an IRQ index and provides the corresponding CPU mapping. It utilizes the existing IRQ affinity if defined, or resorts to the default serialized CPU mapping otherwise. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-03net/mlx5e: Make TC and IPsec offloads mutually exclusive on a netdevJianbo Liu1-0/+2
For IPsec packet offload mode, the order of TC offload and IPsec offload on the same netdevice is not aligned with the order in the non-offload software. For example, for RX, the software performs TC first and then IPsec transformation, but the implementation for offload does that in the opposite way. To resolve the difference for now, either IPsec offload or TC offload, not both, is allowed for a specific interface. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/8e2e5e3b0984d785066e8663aaf97b3ba1bb873f.1690802064.git.leon@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03net/mlx5e: Support IPsec packet offload for TX in switchdev modeJianbo Liu1-0/+1
The IPsec encryption is done at the last, so add new prio for IPsec offload in FDB, and put it just lower than the slow path prio and higher than the per-vport prio. Three levels are added for TX. The first one is for ip xfrm policy. The sa table is created in the second level for ip xfrm state. The status table is created at the last to count the number of packets encrypted. The rules, which forward packets to uplink, are changed to forward them to IPsec TX tables first. These rules are restored after those tables are destroyed, which is done immediately when there is no reference to them, just as what does in legacy mode. The support for slow path is added here, by refreshing uplink's channels. But, the handling for TC fast path, which is more complicated, will be added later. Besides, reg c4 is used instead to match reqid. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/cfd0e6ffaf0b8c55ebaa9fb0649b7c504b6b8ec6.1690802064.git.leon@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03net/mlx5e: Handle IPsec offload for RX datapath in switchdev modeJianbo Liu1-0/+3
Reuse tun opts bits in reg c1, to pass IPsec obj id to datapath. As this is only for RX SA and there are only 11 bits, xarray is used to map IPsec obj id to an index, which is between 1 and 0x7ff, and replace obj id to write to reg c1. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/43d60fbcc9cd672a97d7e2a2f7fe6a3d9e9a776d.1690802064.git.leon@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03net/mlx5e: Support IPsec packet offload for RX in switchdev modeJianbo Liu1-0/+1
As decryption must be done first, add new prio for IPsec offload in FDB, and put it just lower than BYPASS prio and higher than TC prio. Three levels are added for RX. The first one is for ip xfrm policy. SA table is created in the second level for ip xfrm state. The status table is created in the last to check the decryption result. If success, packets continue with the next process, or dropped otherwise. For now, the set of reg c1 is removed for swtichdev mode, and the datapath process will be added in the next patch. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/c91063554cf643fb50b99cf093e8a9bf11729de5.1690802064.git.leon@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27net/mlx5: Allocate command stats with xarrayShay Drory1-1/+1
Command stats is an array with more than 2K entries, which amounts to ~180KB. This is way more than actually needed, as only ~190 entries are being used. Therefore, replace the array with xarray. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-07-27net/mlx5: Re-organize mlx5_cmd structShay Drory1-10/+11
Downstream patch will split mlx5_cmd_init() to probe and reload routines. As a preparation, organize mlx5_cmd struct so that any field that will be used in the reload routine are grouped at new nested struct. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-07-27net/mlx5: Devcom, Infrastructure changesRoi Dayan1-2/+2
Update devcom infrastructure to be more generic, without depending on max supported ports definition or a device guid, and also more encapsulated so callers don't need to pass the register devcom component id per event call. Signed-off-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-07-25net/mlx5: Add relevant capabilities bits to support NAT-TLeon Romanovsky1-2/+5
Provide an ability to check if flow steering supports UDP encapsulation and decapsulation of IPsec ESP packets. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-30Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds1-10/+0
Pull rdma updates from Jason Gunthorpe: "This cycle saw a focus on rxe and bnxt_re drivers: - Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma - rxe uses workqueues instead of tasklets - rxe has better compliance around access checks for MRs and rereg_mr - mana supportst he 'v2' FW interface for RX coalescing - hfi1 bug fix for stale cache entries in its MR cache - mlx5 buf fix to handle FW failures when destroying QPs - erdma HW has a new doorbell allocation mechanism for uverbs that is secure - Lots of small cleanups and rework in bnxt_re: - Use the common mmap functions - Support disassociation - Improve FW command flow - support for 'low latency push'" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits) RDMA/bnxt_re: Fix an IS_ERR() vs NULL check RDMA/bnxt_re: Fix spelling mistake "priviledged" -> "privileged" RDMA/bnxt_re: Remove duplicated include in bnxt_re/main.c RDMA/bnxt_re: Refactor code around bnxt_qplib_map_rc() RDMA/bnxt_re: Remove incorrect return check from slow path RDMA/bnxt_re: Enable low latency push RDMA/bnxt_re: Reorg the bar mapping RDMA/bnxt_re: Move the interface version to chip context structure RDMA/bnxt_re: Query function capabilities from firmware RDMA/bnxt_re: Optimize the bnxt_re_init_hwrm_hdr usage RDMA/bnxt_re: Add disassociate ucontext support RDMA/bnxt_re: Use the common mmap helper functions RDMA/bnxt_re: Initialize opcode while sending message RDMA/cma: Remove NULL check before dev_{put, hold} RDMA/rxe: Simplify cq->notify code RDMA/rxe: Fixes mr access supported list RDMA/bnxt_re: optimize the parameters passed to helper functions RDMA/bnxt_re: remove redundant cmdq_bitmap RDMA/bnxt_re: use firmware provided max request timeout RDMA/bnxt_re: cancel all control path command waiters upon error ...
2023-06-27Merge tag 'v6.4' into rdma.git for-nextJason Gunthorpe2-1/+16
Linux 6.4 Resolve conflicts between rdma rc and next in rxe_cq matching linux-next: drivers/infiniband/sw/rxe/rxe_cq.c: https://lore.kernel.org/r/20230622115246.365d30ad@canb.auug.org.au Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-23net/mlx5: Fix reserved at offset in hca_cap registerLama Kayal1-2/+2
A member of struct mlx5_ifc_cmd_hca_cap_bits has been mistakenly assigned the wrong reserved_at offset value. Correct it to align to the right value, thus avoid future miscalculation. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-16net/mlx5: Expose bits for local loopback counterOr Har-Toov1-2/+4
Add needed HW bits for querying local loopback counter and the HCA capability for it. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Avihai Horon <avihaih@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-16net/mlx5: Handle sync reset unload eventMoshe Shemesh2-1/+3
Added a new event handler to firmware sync reset, which is used to support firmware sync reset flow on smart NIC. Adding this new stage to the flow enables the firmware to ensure host PFs unload before ECPFs unload, to avoid race of PFs recovery. If firmware sends sync_reset_unload event to driver the driver should unload and close all HW resources of the function. Once the driver finishes unloading part, it can't get any more events from firmware as event queues are closed, so it polls the reset state field to know when to continue to next stage of the sync reset flow. Added capability bit for supporting sync_reset_unload event. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-16net/mlx5: Expose timeout for sync reset unload stageMoshe Shemesh1-1/+3
Expose new timoueout in Default Timeouts Register to be used on sync reset flow running on smart NIC. In this flow the driver should know how much time to wait from getting unload request till firmware will ask the PF to continue to next stage of the flow. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+12
Cross-merge networking fixes after downstream PR. Conflicts: include/linux/mlx5/driver.h 617f5db1a626 ("RDMA/mlx5: Fix affinity assignment") dc13180824b7 ("net/mlx5: Enable devlink port for embedded cpu VF vports") https://lore.kernel.org/all/20230613125939.595e50b8@canb.auug.org.au/ tools/testing/selftests/net/mptcp/mptcp_join.sh 47867f0a7e83 ("selftests: mptcp: join: skip check if MIB counter not supported") 425ba803124b ("selftests: mptcp: join: support RM_ADDR for used endpoints or not") 45b1a1227a7a ("mptcp: introduces more address related mibs") 0639fa230a21 ("selftests: mptcp: add explicit check for new mibs") https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-16Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds1-0/+12
Pull rdma fixes from Jason Gunthorpe: "This is an unusually large bunch of bug fixes for the later rc cycle, rxe and mlx5 both dumped a lot of things at once. rxe continues to fix itself, and mlx5 is fixing a bunch of "queue counters" related bugs. There is one highly notable bug fix regarding the qkey. This small security check was missed in the original 2005 implementation and it allows some significant issues. Summary: - Two rtrs bug fixes for error unwind bugs - Several rxe bug fixes: * Incorrect Rx packet validation * Using memory without a refcount * Syzkaller found use before initialization * Regression fix for missing locking with the tasklet conversion from this merge window - Have bnxt report the correct link properties to userspace, this was a regression in v6.3 - Several mlx5 bug fixes: * Kernel crash triggerable by userspace for the RAW ethernet profile * Defend against steering refcounting issues created by userspace * Incorrect change of QP port affinity parameters in some LAG configurations - Fix mlx5 Q counters: * Do not over allocate Q counters to allow userspace to use the full port capacity * Kernel crash triggered by eswitch due to mis-use of Q counters * Incorrect mlx5_device for Q counters in some LAG configurations - Properly implement the IBA spec restricting privileged qkeys to root - Always an error when reading from a disassociated device's event queue - isert bug fixes: * Avoid a deadlock with the CM handler and CM ID destruction * Correct list corruption due to incorrect locking * Fix a use after free around connection tear down" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/rxe: Fix rxe_cq_post IB/isert: Fix incorrect release of isert connection IB/isert: Fix possible list corruption in CMA handler IB/isert: Fix dead lock in ib_isert RDMA/mlx5: Fix affinity assignment IB/uverbs: Fix to consider event queue closing also upon non-blocking mode RDMA/uverbs: Restrict usage of privileged QKEYs RDMA/cma: Always set static rate to 0 for RoCE RDMA/mlx5: Fix Q-counters query in LAG mode RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters RDMA/mlx5: Fix Q-counters per vport allocation RDMA/mlx5: Create an indirect flow table for steering anchor RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions RDMA/rxe: Fix the use-before-initialization error of resp_pkts RDMA/bnxt_re: Fix reporting active_{speed,width} attributes RDMA/rxe: Fix ref count error in check_rkey() RDMA/rxe: Fix packet length checks RDMA/rtrs: Fix rxe_dealloc_pd warning RDMA/rtrs: Fix the last iu->buf leak in err path
2023-06-11RDMA/mlx5: Fix affinity assignmentMark Bloch1-0/+12
The cited commit aimed to ensure that Virtual Functions (VFs) assign a queue affinity to a Queue Pair (QP) to distribute traffic when the LAG master creates a hardware LAG. If the affinity was set while the hardware was not in LAG, the firmware would ignore the affinity value. However, this commit unintentionally assigned an affinity to QPs on the LAG master's VPORT even if the RDMA device was not marked as LAG-enabled. In most cases, this was not an issue because when the hardware entered hardware LAG configuration, the RDMA device of the LAG master would be destroyed and a new one would be created, marked as LAG-enabled. The problem arises when a user configures Equal-Cost Multipath (ECMP). In ECMP mode, traffic can be directed to different physical ports based on the queue affinity, which is intended for use by VPORTS other than the E-Switch manager. ECMP mode is supported only if both E-Switch managers are in switchdev mode and the appropriate route is configured via IP. In this configuration, the RDMA device is not destroyed, and we retain the RDMA device that is not marked as LAG-enabled. To ensure correct behavior, Send Queues (SQs) opened by the E-Switch manager through verbs should be assigned strict affinity. This means they will only be able to communicate through the native physical port associated with the E-Switch manager. This will prevent the firmware from assigning affinity and will not allow the SQs to be remapped in case of failover. Fixes: 802dcc7fc5ec ("RDMA/mlx5: Support TX port affinity for VF drivers in LAG mode") Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://lore.kernel.org/r/425b05f4da840bc684b0f7e8ebf61aeb5cef09b0.1685960567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11RDMA/mlx5: Handle DCT QP logic separately from low level QP interfacePatrisious Haddad1-1/+0
Previously when destroying a DCT, if the firmware function for the destruction failed, the common resource would have been destroyed either way, since it was destroyed before the firmware object. Which leads to kernel warning "refcount_t: underflow" which indicates possible use-after-free. Which is triggered when we try to destroy the common resource for the second time and execute refcount_dec_and_test(&common->refcount). So, let's fix the destruction order by factoring out the DCT QP logic to be in separate XArray database. refcount_t: underflow; use-after-free. WARNING: CPU: 8 PID: 1002 at lib/refcount.c:28 refcount_warn_saturate+0xd8/0xe0 Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core overlay mlx5_core fuse CPU: 8 PID: 1002 Comm: python3 Not tainted 5.16.0-rc5+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:refcount_warn_saturate+0xd8/0xe0 Code: ff 48 c7 c7 18 f5 23 82 c6 05 60 70 ff 00 01 e8 d0 0a 45 00 0f 0b c3 48 c7 c7 c0 f4 23 82 c6 05 4c 70 ff 00 01 e8 ba 0a 45 00 <0f> 0b c3 0f 1f 44 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8 01 74 13 RSP: 0018:ffff8881221d3aa8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff8881313e8d40 RCX: ffff88852cc1b5c8 RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff88852cc1b5c0 RBP: ffff888100f70000 R08: ffff88853ffd1ba8 R09: 0000000000000003 R10: 00000000fffff000 R11: 3fffffffffffffff R12: 0000000000000246 R13: ffff888100f71fa0 R14: ffff8881221d3c68 R15: 0000000000000020 FS: 00007efebbb13740(0000) GS:ffff88852cc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005611aac29f80 CR3: 00000001313de004 CR4: 0000000000370ea0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> destroy_resource_common+0x6e/0x95 [mlx5_ib] mlx5_core_destroy_rq_tracked+0x38/0xbe [mlx5_ib] mlx5_ib_destroy_wq+0x22/0x80 [mlx5_ib] ib_destroy_wq_user+0x1f/0x40 [ib_core] uverbs_free_wq+0x19/0x40 [ib_uverbs] destroy_hw_idr_uobject+0x18/0x50 [ib_uverbs] uverbs_destroy_uobject+0x2f/0x190 [ib_uverbs] uobj_destroy+0x3c/0x80 [ib_uverbs] ib_uverbs_cmd_verbs+0x3e4/0xb80 [ib_uverbs] ? uverbs_free_wq+0x40/0x40 [ib_uverbs] ? ip_list_rcv+0xf7/0x120 ? netif_receive_skb_list_internal+0x1b6/0x2d0 ? task_tick_fair+0xbf/0x450 ? __handle_mm_fault+0x11fc/0x1450 ib_uverbs_ioctl+0xa4/0x110 [ib_uverbs] __x64_sys_ioctl+0x3e4/0x8e0 ? handle_mm_fault+0xb9/0x210 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7efebc0be17b Code: 0f 1e fa 48 8b 05 1d ad 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed ac 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe71813e78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffe71813fb8 RCX: 00007efebc0be17b RDX: 00007ffe71813fa0 RSI: 00000000c0181b01 RDI: 0000000000000005 RBP: 00007ffe71813f80 R08: 00005611aae96020 R09: 000000000000004f R10: 00007efebbf9ffa0 R11: 0000000000000246 R12: 00007ffe71813f80 R13: 00007ffe71813f4c R14: 00005611aae2eca0 R15: 00007efeae6c89d0 </TASK> Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/4470888466c8a898edc9833286967529cc5f3c0d.1685953497.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11RDMA/mlx5: Reduce QP table exposureLeon Romanovsky1-9/+0
driver.h is common header to whole mlx5 code base, but struct mlx5_qp_table is used in mlx5_ib driver only. So move that struct to be under sole responsibility of mlx5_ib. Link: https://lore.kernel.org/r/bec0dc1158e795813b135d1143147977f26bf668.1685953497.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>