summaryrefslogtreecommitdiff
path: root/drivers/infiniband
AgeCommit message (Collapse)AuthorFilesLines
2026-04-02RDMA/ionic: Preserve and set Ethernet source MAC after ib_ud_header_init()Abhijit Gangurde1-1/+3
commit a08aaf3968aec5d05cd32c801b8cc0c61da69c41 upstream. ionic_build_hdr() populated the Ethernet source MAC (hdr->eth.smac_h) by passing the header’s storage directly to rdma_read_gid_l2_fields(). However, ib_ud_header_init() is called after that and re-initializes the UD header, which wipes the previously written smac_h. As a result, packets are emitted with an zero source MAC address on the wire. Correct the source MAC by reading the GID-derived smac into a temporary buffer and copy it after ib_ud_header_init() completes. Fixes: e8521822c733 ("RDMA/ionic: Register device ops for control path") Cc: stable@vger.kernel.org # 6.18 Signed-off-by: Abhijit Gangurde <abhijit.gangurde@amd.com> Link: https://patch.msgid.link/20260227061809.2979990-1-abhijit.gangurde@amd.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02RDMA/irdma: Harden depth calculation functionsShiraz Saleem1-17/+22
[ Upstream commit e37afcb56ae070477741fe2d6e61fc0c542cce2d ] An issue was exposed where OS can pass in U32_MAX for SQ/RQ/SRQ size. This can cause integer overflow and truncation of SQ/RQ/SRQ depth returning a success when it should have failed. Harden the functions to do all depth calculations and boundary checking in u64 sizes. Fixes: 563e1feb5f6e ("RDMA/irdma: Add SRQ support") Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/irdma: Return EINVAL for invalid arp index errorTatyana Nikolova1-7/+10
[ Upstream commit 7221f581eefa79ead06e171044f393fb7ee22f87 ] When rdma_connect() fails due to an invalid arp index, user space rdma core reports ENOMEM which is confusing. Modify irdma_make_cm_node() to return the correct error code. Fixes: 146b9756f14c ("RDMA/irdma: Add connection manager") Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/irdma: Fix deadlock during netdev reset with active connectionsAnil Samal1-1/+2
[ Upstream commit 6f52370970ac07d352a7af4089e55e0e6425f827 ] Resolve deadlock that occurs when user executes netdev reset while RDMA applications (e.g., rping) are active. The netdev reset causes ice driver to remove irdma auxiliary driver, triggering device_delete and subsequent client removal. During client removal, uverbs_client waits for QP reference count to reach zero while cma_client holds the final reference, creating circular dependency and indefinite wait in iWARP mode. Skip QP reference count wait during device reset to prevent deadlock. Fixes: c8f304d75f6c ("RDMA/irdma: Prevent QP use after free") Signed-off-by: Anil Samal <anil.samal@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/irdma: Remove reset check from irdma_modify_qp_to_err()Tatyana Nikolova1-2/+0
[ Upstream commit c45c6ebd693b944f1ffe429fdfb6cc1674c237be ] During reset, irdma_modify_qp() to error should be called to disconnect the QP. Without this fix, if not preceded by irdma_modify_qp() to error, the API call irdma_destroy_qp() gets stuck waiting for the QP refcount to go to zero, because the cm_node associated with this QP isn't disconnected. Fixes: 915cc7ac0f8e ("RDMA/irdma: Add miscellaneous utility definitions") Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/irdma: Clean up unnecessary dereference of event->cm_nodeIvan Barrera1-6/+6
[ Upstream commit b415399c9a024d574b65479636f0d4eb625b9abd ] The cm_node is available and the usage of cm_node and event->cm_node seems arbitrary. Clean up unnecessary dereference of event->cm_node. Fixes: 146b9756f14c ("RDMA/irdma: Add connection manager") Signed-off-by: Ivan Barrera <ivan.d.barrera@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/irdma: Remove a NOP wait_event() in irdma_modify_qp_roce()Tatyana Nikolova1-2/+0
[ Upstream commit 5e8f0239731a83753473b7aa91bda67bbdff5053 ] Remove a NOP wait_event() in irdma_modify_qp_roce() which is relevant for iWARP and likely a copy and paste artifact for RoCEv2. The wait event is for sending a reset on a TCP connection, after the reset has been requested in irdma_modify_qp(), which occurs only in iWarp mode. Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs") Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/irdma: Update ibqp state to error if QP is already in error stateTatyana Nikolova1-0/+2
[ Upstream commit 8c1f19a2225cf37b3f8ab0b5a8a5322291cda620 ] In irdma_modify_qp() update ibqp state to error if the irdma QP is already in error state, otherwise the ibqp state which is visible to the consumer app remains stale. Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs") Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/irdma: Initialize free_qp completion before using itJacob Moroni1-1/+1
[ Upstream commit 11a95521fb93c91e2d4ef9d53dc80ef0a755549b ] In irdma_create_qp, if ib_copy_to_udata fails, it will call irdma_destroy_qp to clean up which will attempt to wait on the free_qp completion, which is not initialized yet. Fix this by initializing the completion before the ib_copy_to_udata call. Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs") Signed-off-by: Jacob Moroni <jmoroni@google.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/efa: Fix possible deadlockEthan Tidmore1-0/+1
[ Upstream commit 0f2055db7b630559870afb40fc84490816ab8ec5 ] In the error path for efa_com_alloc_comp_ctx() the semaphore assigned to &aq->avail_cmds is not released. Detected by Smatch: drivers/infiniband/hw/efa/efa_com.c:662 efa_com_cmd_exec() warn: inconsistent returns '&aq->avail_cmds' Add release for &aq->avail_cmds in efa_com_alloc_comp_ctx() error path. Fixes: ef3b06742c8a2 ("RDMA/efa: Fix use of completion ctx after free") Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com> Link: https://patch.msgid.link/20260314045730.1143862-1-ethantidmore06@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/rw: Fall back to direct SGE on MR pool exhaustionChuck Lever1-3/+18
[ Upstream commit 00da250c21b074ea9494c375d0117b69e5b1d0a4 ] When IOMMU passthrough mode is active, ib_dma_map_sgtable_attrs() produces no coalescing: each scatterlist page maps 1:1 to a DMA entry, so sgt.nents equals the raw page count. A 1 MB transfer yields 256 DMA entries. If that count exceeds the device's max_sgl_rd threshold (an optimization hint from mlx5 firmware), rdma_rw_io_needs_mr() steers the operation into the MR registration path. Each such operation consumes one or more MRs from a pool sized at max_rdma_ctxs -- roughly one MR per concurrent context. Under write-intensive workloads that issue many concurrent RDMA READs, the pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns -EAGAIN. Upper layer protocols treat this as a fatal DMA mapping failure and tear down the connection. The max_sgl_rd check is a performance optimization, not a correctness requirement: the device can handle large SGE counts via direct posting, just less efficiently than with MR registration. When the MR pool cannot satisfy a request, falling back to the direct SGE (map_wrs) path avoids the connection reset while preserving the MR optimization for the common case where pool resources are available. Add a fallback in rdma_rw_ctx_init() so that -EAGAIN from rdma_rw_init_mr_wrs() triggers direct SGE posting instead of propagating the error. iWARP devices, which mandate MR registration for RDMA READs, and force_mr debug mode continue to treat -EAGAIN as terminal. Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260313194201.5818-2-cel@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/efa: Fix use of completion ctx after freeYonatan Nachum1-48/+39
[ Upstream commit ef3b06742c8a201d0e83edc9a33a89a4fe3009f8 ] On admin queue completion handling, if the admin command completed with error we print data from the completion context. The issue is that we already freed the completion context in polling/interrupts handler which means we print data from context in an unknown state (it might be already used again). Change the admin submission flow so alloc/dealloc of the context will be symmetric and dealloc will be called after any potential use of the context. Fixes: 68fb9f3e312a ("RDMA/efa: Remove redundant NULL pointer check of CQE") Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com> Reviewed-by: Michael Margolin <mrgolin@amazon.com> Signed-off-by: Yonatan Nachum <ynachum@amazon.com> Link: https://patch.msgid.link/20260308165350.18219-1-ynachum@amazon.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/efa: Improve admin completion context state machineYonatan Nachum1-41/+50
[ Upstream commit dab5825491f7b0ea92a09390f39df0a51100f12f ] Add a new unused state to the admin completion contexts state machine instead of the occupied field. This improves the completion validity check because it now enforce the context to be in submitted state prior to completing it. Also add allocated state as a intermediate state between unused and submitted. Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com> Reviewed-by: Michael Margolin <mrgolin@amazon.com> Signed-off-by: Yonatan Nachum <ynachum@amazon.com> Link: https://patch.msgid.link/20251210130614.36460-3-ynachum@amazon.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Stable-dep-of: ef3b06742c8a ("RDMA/efa: Fix use of completion ctx after free") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02RDMA/efa: Check stored completion CTX command ID with received oneYonatan Nachum1-2/+4
[ Upstream commit 4b01ec0f133b3fe1038dc538d6bfcbd72462d2f0 ] In admin command completion, we receive a CQE with the command ID which is constructed from context index and entropy bits from the admin queue producer counter. To try to detect memory corruptions in the received CQE, validate the full command ID of the fetched context with the CQE command ID. If there is a mismatch, complete the CQE with error. Also use LSBs of the admin queue producer counter to better detect entropy mismatch between smaller number of commands. Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com> Reviewed-by: Michael Margolin <mrgolin@amazon.com> Signed-off-by: Yonatan Nachum <ynachum@amazon.com> Link: https://patch.msgid.link/20251210130614.36460-2-ynachum@amazon.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Stable-dep-of: ef3b06742c8a ("RDMA/efa: Fix use of completion ctx after free") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12RDMA/ionic: Fix kernel stack leak in ionic_create_cq()Jason Gunthorpe1-1/+1
commit faa72102b178c7ae6c6afea23879e7c84fc59b4e upstream. struct ionic_cq_resp resp { __u32 cqid[2]; // offset 0 - PARTIALLY SET (see below) __u8 udma_mask; // offset 8 - SET (resp.udma_mask = vcq->udma_mask) __u8 rsvd[7]; // offset 9 - NEVER SET <- LEAK }; rsvd[7]: 7 bytes of stack memory leaked unconditionally. cqid[2]: The loop at line 1256 iterates over udma_idx but skips indices where !(vcq->udma_mask & BIT(udma_idx)). The array has 2 entries but udma_count could be 1, meaning cqid[1] might never be written via ionic_create_cq_common(). If udma_mask only has bit 0 set, cqid[1] (4 bytes) is also leaked. So potentially 11 bytes leaked. Cc: stable@vger.kernel.org Fixes: e8521822c733 ("RDMA/ionic: Register device ops for control path") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://patch.msgid.link/4-v1-83e918d69e73+a9-rdma_udata_rc_jgg@nvidia.com Acked-by: Abhijit Gangurde <abhijit.gangurde@amd.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12RDMA/irdma: Fix kernel stack leak in irdma_create_user_ah()Jason Gunthorpe1-1/+1
commit 74586c6da9ea222a61c98394f2fc0a604748438c upstream. struct irdma_create_ah_resp { // 8 bytes, no padding __u32 ah_id; // offset 0 - SET (uresp.ah_id = ah->sc_ah.ah_info.ah_idx) __u8 rsvd[4]; // offset 4 - NEVER SET <- LEAK }; rsvd[4]: 4 bytes of stack memory leaked unconditionally. Only ah_id is assigned before ib_respond_udata(). The reserved members of the structure were not zeroed. Cc: stable@vger.kernel.org Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://patch.msgid.link/3-v1-83e918d69e73+a9-rdma_udata_rc_jgg@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12IB/mthca: Add missed mthca_unmap_user_db() for mthca_create_srq()Jason Gunthorpe1-2/+3
commit 117942ca43e2e3c3d121faae530989931b7f67e1 upstream. Fix a user triggerable leak on the system call failure path. Cc: stable@vger.kernel.org Fixes: ec34a922d243 ("[PATCH] IB/mthca: Add SRQ implementation") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://patch.msgid.link/2-v1-83e918d69e73+a9-rdma_udata_rc_jgg@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04RDMA/umem: Fix double dma_buf_unpin in failure pathJacob Moroni1-3/+1
[ Upstream commit 104016eb671e19709721c1b0048dd912dc2e96be ] In ib_umem_dmabuf_get_pinned_with_dma_device(), the call to ib_umem_dmabuf_map_pages() can fail. If this occurs, the dmabuf is immediately unpinned but the umem_dmabuf->pinned flag is still set. Then, when ib_umem_release() is called, it calls ib_umem_dmabuf_revoke() which will call dma_buf_unpin() again. Fix this by removing the immediate unpin upon failure and just let the ib_umem_release/revoke path handle it. This also ensures the proper unmap-unpin unwind ordering if the dmabuf_map_pages call happened to fail due to dma_resv_wait_timeout (and therefore has a non-NULL umem_dmabuf->sgt). Fixes: 1e4df4a21c5a ("RDMA/umem: Allow pinned dmabuf umem usage") Signed-off-by: Jacob Moroni <jmoroni@google.com> Link: https://patch.msgid.link/20260224234153.1207849-1-jmoroni@google.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04RDMA/efa: Fix typo in efa_alloc_mr()Jason Gunthorpe1-1/+1
[ Upstream commit f22c77ce49db0589103d96487dca56f5b2136362 ] The pattern is to check the entire driver request space, not just sizeof something unrelated. Fixes: 40909f664d27 ("RDMA/efa: Add EFA verbs implementation") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://patch.msgid.link/1-v1-83e918d69e73+a9-rdma_udata_rc_jgg@nvidia.com Acked-by: Michael Margolin <mrgolin@amazon.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04RDMA/ionic: Fix potential NULL pointer dereference in ionic_query_portKamal Heib1-0/+2
[ Upstream commit fd80bd7105f88189f47d465ca8cb7d115570de30 ] The function ionic_query_port() calls ib_device_get_netdev() without checking the return value which could lead to NULL pointer dereference, Fix it by checking the return value and return -ENODEV if the 'ndev' is NULL. Fixes: 2075bbe8ef03 ("RDMA/ionic: Register device ops for miscellaneous functionality") Signed-off-by: Kamal Heib <kheib@redhat.com> Link: https://patch.msgid.link/20260220222125.16973-2-kheib@redhat.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04RDMA/core: Fix stale RoCE GIDs during netdev events at registrationJiri Pirko3-1/+49
[ Upstream commit 9af0feae8016ba58ad7ff784a903404986b395b1 ] RoCE GID entries become stale when netdev properties change during the IB device registration window. This is reproducible with a udev rule that sets a MAC address when a VF netdev appears: ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth4", \ RUN+="/sbin/ip link set eth4 address 88:22:33:44:55:66" After VF creation, show_gids displays GIDs derived from the original random MAC rather than the configured one. The root cause is a race between netdev event processing and device registration: CPU 0 (driver) CPU 1 (udev/workqueue) ────────────── ────────────────────── ib_register_device() ib_cache_setup_one() gid_table_setup_one() _gid_table_setup_one() ← GID table allocated rdma_roce_rescan_device() ← GIDs populated with OLD MAC ip link set eth4 addr NEW_MAC NETDEV_CHANGEADDR queued netdevice_event_work_handler() ib_enum_all_roce_netdevs() ← Iterates DEVICE_REGISTERED ← Device NOT marked yet, SKIP! enable_device_and_get() xa_set_mark(DEVICE_REGISTERED) ← Too late, event was lost The netdev event handler uses ib_enum_all_roce_netdevs() which only iterates devices marked DEVICE_REGISTERED. However, this mark is set late in the registration process, after the GID cache is already populated. Events arriving in this window are silently dropped. Fix this by introducing a new xarray mark DEVICE_GID_UPDATES that is set immediately after the GID table is allocated and initialized. Use the new mark in ib_enum_all_roce_netdevs() function to iterate devices instead of DEVICE_REGISTERED. This is safe because: - After _gid_table_setup_one(), all required structures exist (port_data, immutable, cache.gid) - The GID table mutex serializes concurrent access between the initial rescan and event handlers - Event handlers correctly update stale GIDs even when racing with rescan - The mark is cleared in ib_cache_cleanup_one() before teardown This also fixes similar races for IP address events (inetaddr_event, inet6addr_event) which use the same enumeration path. Fixes: 0df91bb67334 ("RDMA/devices: Use xarray to store the client_data") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20260127093839.126291-1-jiri@resnulli.us Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84 Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04RDMA/rtrs-clt: For conn rejection use actual err numberMd Haris Iqbal1-2/+2
[ Upstream commit fc290630702b530c2969061e7ef0d869a5b6dc4f ] When the connection establishment request is rejected from the server side, then the actual error number sent back should be used. Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Link: https://patch.msgid.link/20260107161517.56357-10-haris.iqbal@ionos.com Reviewed-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Reviewed-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmallocYi Liu1-1/+1
[ Upstream commit 58b604dfc7bb753f91bc0ccd3fa705e14e6edfb4 ] Since wqe_size in ib_uverbs_unmarshall_recv() is user-provided and already validated, but can still be large, add __GFP_NOWARN to suppress memory allocation warnings for large sizes, consistent with the similar fix in ib_uverbs_post_send(). Fixes: 67cdb40ca444 ("[IB] uverbs: Implement more commands") Signed-off-by: Yi Liu <liuy22@mails.tsinghua.edu.cn> Link: https://patch.msgid.link/20260129094900.3517706-1-liuy22@mails.tsinghua.edu.cn Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/core: add rdma_rw_max_sge() helper for SQ sizingChuck Lever1-15/+38
[ Upstream commit afcae7d7b8a278a6c29e064f99e5bafd4ac1fb37 ] svc_rdma_accept() computes sc_sq_depth as the sum of rq_depth and the number of rdma_rw contexts (ctxts). This value is used to allocate the Send CQ and to initialize the sc_sq_avail credit pool. However, when the device uses memory registration for RDMA operations, rdma_rw_init_qp() inflates the QP's max_send_wr by a factor of three per context to account for REG and INV work requests. The Send CQ and credit pool remain sized for only one work request per context, causing Send Queue exhaustion under heavy NFS WRITE workloads. Introduce rdma_rw_max_sge() to compute the actual number of Send Queue entries required for a given number of rdma_rw contexts. Upper layer protocols call this helper before creating a Queue Pair so that their Send CQs and credit accounting match the QP's true capacity. Update svc_rdma_accept() to use rdma_rw_max_sge() when computing sc_sq_depth, ensuring the credit pool reflects the work requests that rdma_rw_init_qp() will reserve. Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260128005400.25147-5-cel@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/rxe: Fix race condition in QP timer handlersLi Zhijian2-0/+6
[ Upstream commit 87bf646921430e303176edc4eb07c30160361b73 ] I encontered the following warning: WARNING: drivers/infiniband/sw/rxe/rxe_task.c:249 at rxe_sched_task+0x1c8/0x238 [rdma_rxe], CPU#0: swapper/0/0 ... libsha1 [last unloaded: ip6_udp_tunnel] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G C 6.19.0-rc5-64k-v8+ #37 PREEMPT Tainted: [C]=CRAP Hardware name: Raspberry Pi 4 Model B Rev 1.2 Call trace: rxe_sched_task+0x1c8/0x238 [rdma_rxe] (P) retransmit_timer+0x130/0x188 [rdma_rxe] call_timer_fn+0x68/0x4d0 __run_timers+0x630/0x888 ... WARNING: drivers/infiniband/sw/rxe/rxe_task.c:38 at rxe_sched_task+0x1c0/0x238 [rdma_rxe], CPU#0: swapper/0/0 ... WARNING: drivers/infiniband/sw/rxe/rxe_task.c:111 at do_work+0x488/0x5c8 [rdma_rxe], CPU#3: kworker/u17:4/93400 ... refcount_t: underflow; use-after-free. WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x138/0x1a0, CPU#3: kworker/u17:4/93400 The issue is caused by a race condition between retransmit_timer() and rxe_destroy_qp, leading to the Queue Pair's (QP) reference count dropping to zero during timer handler execution. It seems this warning is harmless because rxe_qp_do_cleanup() will flush all pending timers and requests. Example of flow causing the issue: CPU0 CPU1 retransmit_timer() { spin_lock_irqsave rxe_destroy_qp() __rxe_cleanup() __rxe_put() // qp->ref_count decrease to 0 rxe_qp_do_cleanup() { if (qp->valid) { rxe_sched_task() { WARN_ON(rxe_read(task->qp) <= 0); } } spin_unlock_irqrestore } spin_lock_irqsave qp->valid = 0 spin_unlock_irqrestore } Ensure the QP's reference count is maintained and its validity is checked within the timer callbacks by adding calls to rxe_get(qp) and corresponding rxe_put(qp) after use. Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Fixes: d94671632572 ("RDMA/rxe: Rewrite rxe_task.c") Link: https://patch.msgid.link/20260120074437.623018-1-lizhijian@fujitsu.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/mlx5: Fix memory leak in GET_DATA_DIRECT_SYSFS_PATH handlerZilin Guan1-2/+2
[ Upstream commit 9b9d253908478f504297ac283c514e5953ddafa6 ] The UVERBS_HANDLER(MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH) function allocates memory for the device path using kobject_get_path(). If the length of the device path exceeds the output buffer length, the function returns -ENOSPC but does not free the allocated memory, resulting in a memory leak. Add a kfree() call to the error path to ensure the allocated memory is properly freed. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: ec7ad6530909 ("RDMA/mlx5: Introduce GET_DATA_DIRECT_SYSFS_PATH ioctl") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Link: https://patch.msgid.link/20260126074801.627898-1-zilin@seu.edu.cn Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/uverbs: Validate wqe_size before using it in ib_uverbs_post_sendYi Liu1-1/+4
[ Upstream commit 1956f0a74ccf5dc9c3ef717f2985c3ed3400aab0 ] ib_uverbs_post_send() uses cmd.wqe_size from userspace without any validation before passing it to kmalloc() and using the allocated buffer as struct ib_uverbs_send_wr. If a user provides a small wqe_size value (e.g., 1), kmalloc() will succeed, but subsequent accesses to user_wr->opcode, user_wr->num_sge, and other fields will read beyond the allocated buffer, resulting in an out-of-bounds read from kernel heap memory. This could potentially leak sensitive kernel information to userspace. Additionally, providing an excessively large wqe_size can trigger a WARNING in the memory allocation path, as reported by syzkaller. This is inconsistent with ib_uverbs_unmarshall_recv() which properly validates that wqe_size >= sizeof(struct ib_uverbs_recv_wr) before proceeding. Add the same validation for ib_uverbs_post_send() to ensure wqe_size is at least sizeof(struct ib_uverbs_send_wr). Fixes: c3bea3d2dc53 ("RDMA/uverbs: Use the iterator for ib_uverbs_unmarshall_recv()") Signed-off-by: Yi Liu <liuy22@mails.tsinghua.edu.cn> Link: https://patch.msgid.link/20260122142900.2356276-2-liuy22@mails.tsinghua.edu.cn Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZELi Zhijian2-97/+194
[ Upstream commit 12985e5915a0b8354796efadaaeb201eed115377 ] The current implementation incorrectly handles memory regions (MRs) with page sizes different from the system PAGE_SIZE. The core issue is that rxe_set_page() is called with mr->page_size step increments, but the page_list stores individual struct page pointers, each representing PAGE_SIZE of memory. ib_sg_to_page() has ensured that when i>=1 either a) SG[i-1].dma_end and SG[i].dma_addr are contiguous or b) SG[i-1].dma_end and SG[i].dma_addr are mr->page_size aligned. This leads to incorrect iova-to-va conversion in scenarios: 1) page_size < PAGE_SIZE (e.g., MR: 4K, system: 64K): ibmr->iova = 0x181800 sg[0]: dma_addr=0x181800, len=0x800 sg[1]: dma_addr=0x173000, len=0x1000 Access iova = 0x181800 + 0x810 = 0x182010 Expected VA: 0x173010 (second SG, offset 0x10) Before fix: - index = (0x182010 >> 12) - (0x181800 >> 12) = 1 - page_offset = 0x182010 & 0xFFF = 0x10 - xarray[1] stores system page base 0x170000 - Resulting VA: 0x170000 + 0x10 = 0x170010 (wrong) 2) page_size > PAGE_SIZE (e.g., MR: 64K, system: 4K): ibmr->iova = 0x18f800 sg[0]: dma_addr=0x18f800, len=0x800 sg[1]: dma_addr=0x170000, len=0x1000 Access iova = 0x18f800 + 0x810 = 0x190010 Expected VA: 0x170010 (second SG, offset 0x10) Before fix: - index = (0x190010 >> 16) - (0x18f800 >> 16) = 1 - page_offset = 0x190010 & 0xFFFF = 0x10 - xarray[1] stores system page for dma_addr 0x170000 - Resulting VA: system page of 0x170000 + 0x10 = 0x170010 (wrong) Yi Zhang reported a kernel panic[1] years ago related to this defect. Solution: 1. Replace xarray with pre-allocated rxe_mr_page array for sequential indexing (all MR page indices are contiguous) 2. Each rxe_mr_page stores both struct page* and offset within the system page 3. Handle MR page_size != PAGE_SIZE relationships: - page_size > PAGE_SIZE: Split MR pages into multiple system pages - page_size <= PAGE_SIZE: Store offset within system page 4. Add boundary checks and compatibility validation This ensures correct iova-to-va conversion regardless of MR page size and system PAGE_SIZE relationship, while improving performance through array-based sequential access. Tests on 4K and 64K PAGE_SIZE hosts: - rdma-core/pytests $ ./build/bin/run_tests.py --dev eth0_rxe - blktest: $ TIMEOUT=30 QUICK_RUN=1 USE_RXE=1 NVMET_TRTYPES=rdma ./check nvme srp rnbd [1] https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/ Fixes: 592627ccbdff ("RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Link: https://patch.msgid.link/20260116032753.2574363-1-lizhijian@fujitsu.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27IB/mlx5: Fix port speed query for representorsOr Har-Toov1-6/+14
[ Upstream commit 18ea78e2ae83d1d86a72d21d9511927e57e2c0e1 ] When querying speed information for a representor in switchdev mode, the code previously used the first device in the eswitch, which may not match the device that actually owns the representor. In setups such as multi-port eswitch or LAG, this led to incorrect port attributes being reported. Fix this by retrieving the correct core device from the representor's eswitch before querying its port attributes. Fixes: 27f9e0ccb6da ("net/mlx5: Lag, Add single RDMA device in multiport mode") Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260115-port-speed-query-fix-v2-1-3bde6a3c78e7@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/mlx5: Fix UMR hang in LAG error state unloadChiara Meiohas2-9/+68
[ Upstream commit ebc2164a4cd4314503f1a0c8e7aaf76d7e5fa211 ] During firmware reset in LAG mode, a race condition causes the driver to hang indefinitely while waiting for UMR completion during device unload. See [1]. In LAG mode the bond device is only registered on the master, so it never sees sys_error events from the slave. During firmware reset this causes UMR waits to hang forever on unload as the slave is dead but the master hasn't entered error state yet, so UMR posts succeed but completions never arrive. Fix this by adding a sys_error notifier that gets registered before MLX5_IB_STAGE_IB_REG and stays alive until after ib_unregister_device(). This ensures error events reach the bond device throughout teardown. [1] Call Trace: __schedule+0x2bd/0x760 schedule+0x37/0xa0 schedule_preempt_disabled+0xa/0x10 __mutex_lock.isra.6+0x2b5/0x4a0 __mlx5_ib_dereg_mr+0x606/0x870 [mlx5_ib] ? __xa_erase+0x4a/0xa0 ? _cond_resched+0x15/0x30 ? wait_for_completion+0x31/0x100 ib_dereg_mr_user+0x48/0xc0 [ib_core] ? rdmacg_uncharge_hierarchy+0xa0/0x100 destroy_hw_idr_uobject+0x20/0x50 [ib_uverbs] uverbs_destroy_uobject+0x37/0x150 [ib_uverbs] __uverbs_cleanup_ufile+0xda/0x140 [ib_uverbs] uverbs_destroy_ufile_hw+0x3a/0xf0 [ib_uverbs] ib_uverbs_remove_one+0xc3/0x140 [ib_uverbs] remove_client_context+0x8b/0xd0 [ib_core] disable_device+0x8c/0x130 [ib_core] __ib_unregister_device+0x10d/0x180 [ib_core] ib_unregister_device+0x21/0x30 [ib_core] __mlx5_ib_remove+0x1e4/0x1f0 [mlx5_ib] auxiliary_bus_remove+0x1e/0x30 device_release_driver_internal+0x103/0x1f0 bus_remove_device+0xf7/0x170 device_del+0x181/0x410 mlx5_rescan_drivers_locked.part.10+0xa9/0x1d0 [mlx5_core] mlx5_disable_lag+0x253/0x260 [mlx5_core] mlx5_lag_disable_change+0x89/0xc0 [mlx5_core] mlx5_eswitch_disable+0x67/0xa0 [mlx5_core] mlx5_unload+0x15/0xd0 [mlx5_core] mlx5_unload_one+0x71/0xc0 [mlx5_core] mlx5_sync_reset_reload_work+0x83/0x100 [mlx5_core] process_one_work+0x1a7/0x360 worker_thread+0x30/0x390 ? create_worker+0x1a0/0x1a0 kthread+0x116/0x130 ? kthread_flush_work_fn+0x10/0x10 ret_from_fork+0x22/0x40 Fixes: ede132a5cf55 ("RDMA/mlx5: Move events notifier registration to be after device registration") Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260113-umr-hand-lag-fix-v1-1-3dc476e00cd9@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/iwcm: Fix workqueue list corruption by removing work_listJacob Moroni2-36/+21
[ Upstream commit 7874eeacfa42177565c01d5198726671acf7adf2 ] The commit e1168f0 ("RDMA/iwcm: Simplify cm_event_handler()") changed the work submission logic to unconditionally call queue_work() with the expectation that queue_work() would have no effect if work was already pending. The problem is that a free list of struct iwcm_work is used (for which struct work_struct is embedded), so each call to queue_work() is basically unique and therefore does indeed queue the work. This causes a problem in the work handler which walks the work_list until it's empty to process entries. This means that a single run of the work handler could process item N+1 and release it back to the free list while the actual workqueue entry is still queued. It could then get reused (INIT_WORK...) and lead to list corruption in the workqueue logic. Fix this by just removing the work_list. The workqueue already does this for us. This fixes the following error that was observed when stress testing with ucmatose on an Intel E830 in iWARP mode: [ 151.465780] list_del corruption. next->prev should be ffff9f0915c69c08, but was ffff9f0a1116be08. (next=ffff9f0a15b11c08) [ 151.466639] ------------[ cut here ]------------ [ 151.466986] kernel BUG at lib/list_debug.c:67! [ 151.467349] Oops: invalid opcode: 0000 [#1] SMP NOPTI [ 151.467753] CPU: 14 UID: 0 PID: 2306 Comm: kworker/u64:18 Not tainted 6.19.0-rc4+ #1 PREEMPT(voluntary) [ 151.468466] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 151.469192] Workqueue: 0x0 (iw_cm_wq) [ 151.469478] RIP: 0010:__list_del_entry_valid_or_report+0xf0/0x100 [ 151.469942] Code: c7 58 5f 4c b2 e8 10 50 aa ff 0f 0b 48 89 ef e8 36 57 cb ff 48 8b 55 08 48 89 e9 48 89 de 48 c7 c7 a8 5f 4c b2 e8 f0 4f aa ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 [ 151.471323] RSP: 0000:ffffb15644e7bd68 EFLAGS: 00010046 [ 151.471712] RAX: 000000000000006d RBX: ffff9f0915c69c08 RCX: 0000000000000027 [ 151.472243] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f0a37d9c600 [ 151.472768] RBP: ffff9f0a15b11c08 R08: 0000000000000000 R09: c0000000ffff7fff [ 151.473294] R10: 0000000000000001 R11: ffffb15644e7bba8 R12: ffff9f092339ee68 [ 151.473817] R13: ffff9f0900059c28 R14: ffff9f092339ee78 R15: 0000000000000000 [ 151.474344] FS: 0000000000000000(0000) GS:ffff9f0a847b5000(0000) knlGS:0000000000000000 [ 151.474934] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 151.475362] CR2: 0000559e233a9088 CR3: 000000020296b004 CR4: 0000000000770ef0 [ 151.475895] PKRU: 55555554 [ 151.476118] Call Trace: [ 151.476331] <TASK> [ 151.476497] move_linked_works+0x49/0xa0 [ 151.476792] __pwq_activate_work.isra.46+0x2f/0xa0 [ 151.477151] pwq_dec_nr_in_flight+0x1e0/0x2f0 [ 151.477479] process_scheduled_works+0x1c8/0x410 [ 151.477823] worker_thread+0x125/0x260 [ 151.478108] ? __pfx_worker_thread+0x10/0x10 [ 151.478430] kthread+0xfe/0x240 [ 151.478671] ? __pfx_kthread+0x10/0x10 [ 151.478955] ? __pfx_kthread+0x10/0x10 [ 151.479240] ret_from_fork+0x208/0x270 [ 151.479523] ? __pfx_kthread+0x10/0x10 [ 151.479806] ret_from_fork_asm+0x1a/0x30 [ 151.480103] </TASK> Fixes: e1168f09b331 ("RDMA/iwcm: Simplify cm_event_handler()") Signed-off-by: Jacob Moroni <jmoroni@google.com> Link: https://patch.msgid.link/20260112020006.1352438-1-jmoroni@google.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/rxe: Fix double free in rxe_srq_from_initJiasheng Jiang1-3/+3
[ Upstream commit 0beefd0e15d962f497aad750b2d5e9c3570b66d1 ] In rxe_srq_from_init(), the queue pointer 'q' is assigned to 'srq->rq.queue' before copying the SRQ number to user space. If copy_to_user() fails, the function calls rxe_queue_cleanup() to free the queue, but leaves the now-invalid pointer in 'srq->rq.queue'. The caller of rxe_srq_from_init() (rxe_create_srq) eventually calls rxe_srq_cleanup() upon receiving the error, which triggers a second rxe_queue_cleanup() on the same memory, leading to a double free. The call trace looks like this: kmem_cache_free+0x.../0x... rxe_queue_cleanup+0x1a/0x30 [rdma_rxe] rxe_srq_cleanup+0x42/0x60 [rdma_rxe] rxe_elem_release+0x31/0x70 [rdma_rxe] rxe_create_srq+0x12b/0x1a0 [rdma_rxe] ib_create_srq_user+0x9a/0x150 [ib_core] Fix this by moving 'srq->rq.queue = q' after copy_to_user. Fixes: aae0484e15f0 ("IB/rxe: avoid srq memory leak") Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Link: https://patch.msgid.link/20260112015412.29458-1-jiashengjiangcool@gmail.com Reviewed-by: Zhu Yanjun <yanjun.Zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/rtrs-srv: fix SG mappingRoman Penyaev1-5/+20
[ Upstream commit 83835f7c07b523c7ca2a5ad0a511670b5810539e ] This fixes the following error on the server side: RTRS server session allocation failed: -EINVAL caused by the caller of the `ib_dma_map_sg()`, which does not expect less mapped entries, than requested, which is in the order of things and can be easily reproduced on the machine with enabled IOMMU. The fix is to treat any positive number of mapped sg entries as a successful mapping and cache DMA addresses by traversing modified SG table. Fixes: 9cb837480424 ("RDMA/rtrs: server: main functionality") Signed-off-by: Roman Penyaev <r.peniaev@gmail.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Link: https://patch.msgid.link/20260107161517.56357-2-haris.iqbal@ionos.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/mlx5: Fix ucaps init error flowMaher Sanalla1-1/+5
[ Upstream commit 6dc78c53de99e4ed9868d4f0fc6da6e46f52fe4d ] In mlx5_ib_stage_caps_init(), if mlx5_ib_init_ucaps() fails after mlx5_ib_init_var_table() succeeds, the VAR bitmap is leaked since the function returns without cleanup. Thus, cleanup the var table bitmap in case of error of initializing ucaps before exiting, preventing the leak above. Fixes: cf7174e8982f ("RDMA/mlx5: Create UCAP char devices for supported device capabilities") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/20260104-ib-core-misc-v1-3-00367f77f3a8@nvidia.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/hns: Notify ULP of remaining soft-WCs during resetChengchang Tang1-0/+23
[ Upstream commit 0789f929900d85b80b343c5f04f8b9444e991384 ] During a reset, software-generated WCs cannot be reported via interrupts. This may cause the ULP to miss some WCs. To avoid this, add check in the CQ arm process: if a hardware reset has occurred and there are still unreported soft-WCs, notify the ULP to handle the remaining WCs, thereby preventing any loss of completions. Fixes: 626903e9355b ("RDMA/hns: Add support for reporting wc as software mode") Signed-off-by: Chengchang Tang <tangchengchang@huawei.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20260104064057.1582216-5-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/hns: Fix RoCEv1 failure due to DSCPJunxian Huang2-25/+26
[ Upstream commit 84bd5d60f0a2b9c763c5e6d0b3d8f4f61f6c5470 ] DSCP is not supported in RoCEv1, but get_dscp() is still called. If get_dscp() returns an error, it'll eventually cause create_ah to fail even when using RoCEv1. Correct the return value and avoid calling get_dscp() when using RoCEv1. Fixes: ee20cc17e9d8 ("RDMA/hns: Support DSCP") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20260104064057.1582216-4-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/hns: Fix WQ_MEM_RECLAIM warningChengchang Tang1-1/+2
[ Upstream commit c0a26bbd3f99b7b03f072e3409aff4e6ec8af6f6 ] When sunrpc is used, if a reset triggered, our wq may lead the following trace: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAIM hns_roce_irq_workq:flush_work_handle [hns_roce_hw_v2] WARNING: CPU: 0 PID: 8250 at kernel/workqueue.c:2644 check_flush_dependency+0xe0/0x144 Call trace: check_flush_dependency+0xe0/0x144 start_flush_work.constprop.0+0x1d0/0x2f0 __flush_work.isra.0+0x40/0xb0 flush_work+0x14/0x30 hns_roce_v2_destroy_qp+0xac/0x1e0 [hns_roce_hw_v2] ib_destroy_qp_user+0x9c/0x2b4 rdma_destroy_qp+0x34/0xb0 rpcrdma_ep_destroy+0x28/0xcc [rpcrdma] rpcrdma_ep_put+0x74/0xb4 [rpcrdma] rpcrdma_xprt_disconnect+0x1d8/0x260 [rpcrdma] xprt_rdma_connect_worker+0xc0/0x120 [rpcrdma] process_one_work+0x1cc/0x4d0 worker_thread+0x154/0x414 kthread+0x104/0x144 ret_from_fork+0x10/0x18 Since QP destruction frees memory, this wq should have the WQ_MEM_RECLAIM. Fixes: ffd541d45726 ("RDMA/hns: Add the workqueue framework for flush cqe handler") Signed-off-by: Chengchang Tang <tangchengchang@huawei.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20260104064057.1582216-2-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27IB/cache: update gid cache on client reregister eventEtienne AUJAMES1-1/+2
[ Upstream commit ddd6c8c873e912cb1ead79def54de5e24ff71c80 ] Some HCAs (e.g: ConnectX4) do not trigger a IB_EVENT_GID_CHANGE on subnet prefix update from SM (PortInfo). Since the commit d58c23c92548 ("IB/core: Only update PKEY and GID caches on respective events"), the GID cache is updated exclusively on IB_EVENT_GID_CHANGE. If this event is not emitted, the subnet prefix in the IPoIB interface’s hardware address remains set to its default value (0xfe80000000000000). Then rdma_bind_addr() failed because it relies on hardware address to find the port GID (subnet_prefix + port GUID). This patch fixes this issue by updating the GID cache on IB_EVENT_CLIENT_REREGISTER event (emitted on PortInfo::ClientReregister=1). Fixes: d58c23c92548 ("IB/core: Only update PKEY and GID caches on respective events") Signed-off-by: Etienne AUJAMES <eaujames@ddn.com> Link: https://patch.msgid.link/aVUfsO58QIDn5bGX@eaujamesFR0130 Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/rtrs: server: remove dead codeHonggang LI1-7/+1
[ Upstream commit a3572bdc3a028ca47f77d7166ac95b719cf77d50 ] As rkey had been initialized to zero, the WARN_ON_ONCE should never been triggered. Remove it. Fixes: 9cb837480424 ("RDMA/rtrs: server: main functionality") Signed-off-by: Honggang LI <honggangli@163.com> Link: https://patch.msgid.link/20251224023819.138846-1-honggangli@163.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27RDMA/umad: Reject negative data_len in ib_umad_writeYunJe Shin1-2/+6
commit 5551b02fdbfd85a325bb857f3a8f9c9f33397ed2 upstream. ib_umad_write computes data_len from user-controlled count and the MAD header sizes. With a mismatched user MAD header size and RMPP header length, data_len can become negative and reach ib_create_send_mad(). This can make the padding calculation exceed the segment size and trigger an out-of-bounds memset in alloc_send_rmpp_list(). Add an explicit check to reject negative data_len before creating the send buffer. KASAN splat: [ 211.363464] BUG: KASAN: slab-out-of-bounds in ib_create_send_mad+0xa01/0x11b0 [ 211.364077] Write of size 220 at addr ffff88800c3fa1f8 by task spray_thread/102 [ 211.365867] ib_create_send_mad+0xa01/0x11b0 [ 211.365887] ib_umad_write+0x853/0x1c80 Fixes: 2be8e3ee8efd ("IB/umad: Add P_Key index support") Signed-off-by: YunJe Shin <ioerts@kookmin.ac.kr> Link: https://patch.msgid.link/20260203100628.1215408-1-ioerts@kookmin.ac.kr Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27RDMA/siw: Fix potential NULL pointer dereference in header processingYunJe Shin1-1/+2
commit 14ab3da122bd18920ad57428f6cf4fade8385142 upstream. If siw_get_hdr() returns -EINVAL before set_rx_fpdu_context(), qp->rx_fpdu can be NULL. The error path in siw_tcp_rx_data() dereferences qp->rx_fpdu->more_ddp_segs without checking, which may lead to a NULL pointer deref. Only check more_ddp_segs when rx_fpdu is present. KASAN splat: [ 101.384271] KASAN: null-ptr-deref in range [0x00000000000000c0-0x00000000000000c7] [ 101.385869] RIP: 0010:siw_tcp_rx_data+0x13ad/0x1e50 Fixes: 8b6a361b8c48 ("rdma/siw: receive path") Signed-off-by: YunJe Shin <ioerts@kookmin.ac.kr> Link: https://patch.msgid.link/20260204092546.489842-1-ioerts@kookmin.ac.kr Acked-by: Bernard Metzler <bernard.metzler@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08RDMA/cm: Fix leaking the multicast GID table referenceJason Gunthorpe1-0/+3
commit 57f3cb6c84159d12ba343574df2115fb18dd83ca upstream. If the CM ID is destroyed while the CM event for multicast creating is still queued the cancel_work_sync() will prevent the work from running which also prevents destroying the ah_attr. This leaks a refcount and triggers a WARN: GID entry ref leak for dev syz1 index 2 ref=573 WARNING: CPU: 1 PID: 655 at drivers/infiniband/core/cache.c:809 release_gid_table drivers/infiniband/core/cache.c:806 [inline] WARNING: CPU: 1 PID: 655 at drivers/infiniband/core/cache.c:809 gid_table_release_one+0x284/0x3cc drivers/infiniband/core/cache.c:886 Destroy the ah_attr after canceling the work, it is safe to call this twice. Link: https://patch.msgid.link/r/0-v1-4285d070a6b2+20a-rdma_mc_gid_leak_syz_jgg@nvidia.com Cc: stable@vger.kernel.org Fixes: fe454dc31e84 ("RDMA/ucma: Fix use-after-free bug in ucma_create_uevent") Reported-by: syzbot+b0da83a6c0e2e2bddbd4@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68232e7b.050a0220.f2294.09f6.GAE@google.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08RDMA/core: Check for the presence of LS_NLA_TYPE_DGID correctlyJason Gunthorpe1-23/+10
commit a7b8e876e0ef0232b8076972c57ce9a7286b47ca upstream. The netlink response for RDMA_NL_LS_OP_IP_RESOLVE should always have a LS_NLA_TYPE_DGID attribute, it is invalid if it does not. Use the nl parsing logic properly and call nla_parse_deprecated() to fill the nlattrs array and then directly index that array to get the data for the DGID. Just fail if it is NULL. Remove the for loop searching for the nla, and squash the validation and parsing into one function. Fixes an uninitialized read from the stack triggered by userspace if it does not provide the DGID to a kernel initiated RDMA_NL_LS_OP_IP_RESOLVE query. BUG: KMSAN: uninit-value in hex_byte_pack include/linux/hex.h:13 [inline] BUG: KMSAN: uninit-value in ip6_string+0xef4/0x13a0 lib/vsprintf.c:1490 hex_byte_pack include/linux/hex.h:13 [inline] ip6_string+0xef4/0x13a0 lib/vsprintf.c:1490 ip6_addr_string+0x18a/0x3e0 lib/vsprintf.c:1509 ip_addr_string+0x245/0xee0 lib/vsprintf.c:1633 pointer+0xc09/0x1bd0 lib/vsprintf.c:2542 vsnprintf+0xf8a/0x1bd0 lib/vsprintf.c:2930 vprintk_store+0x3ae/0x1530 kernel/printk/printk.c:2279 vprintk_emit+0x307/0xcd0 kernel/printk/printk.c:2426 vprintk_default+0x3f/0x50 kernel/printk/printk.c:2465 vprintk+0x36/0x50 kernel/printk/printk_safe.c:82 _printk+0x17e/0x1b0 kernel/printk/printk.c:2475 ib_nl_process_good_ip_rsep drivers/infiniband/core/addr.c:128 [inline] ib_nl_handle_ip_res_resp+0x963/0x9d0 drivers/infiniband/core/addr.c:141 rdma_nl_rcv_msg drivers/infiniband/core/netlink.c:-1 [inline] rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline] rdma_nl_rcv+0xefa/0x11c0 drivers/infiniband/core/netlink.c:259 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0xf04/0x12b0 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x10b3/0x1250 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:714 [inline] __sock_sendmsg+0x333/0x3d0 net/socket.c:729 ____sys_sendmsg+0x7e0/0xd80 net/socket.c:2617 ___sys_sendmsg+0x271/0x3b0 net/socket.c:2671 __sys_sendmsg+0x1aa/0x300 net/socket.c:2703 __compat_sys_sendmsg net/compat.c:346 [inline] __do_compat_sys_sendmsg net/compat.c:353 [inline] __se_compat_sys_sendmsg net/compat.c:350 [inline] __ia32_compat_sys_sendmsg+0xa4/0x100 net/compat.c:350 ia32_sys_call+0x3f6c/0x4310 arch/x86/include/generated/asm/syscalls_32.h:371 do_syscall_32_irqs_on arch/x86/entry/syscall_32.c:83 [inline] __do_fast_syscall_32+0xb0/0x150 arch/x86/entry/syscall_32.c:306 do_fast_syscall_32+0x38/0x80 arch/x86/entry/syscall_32.c:331 do_SYSENTER_32+0x1f/0x30 arch/x86/entry/syscall_32.c:3 Link: https://patch.msgid.link/r/0-v1-3fbaef094271+2cf-rdma_op_ip_rslv_syz_jgg@nvidia.com Cc: stable@vger.kernel.org Fixes: ae43f8286730 ("IB/core: Add IP to GID netlink offload") Reported-by: syzbot+938fcd548c303fe33c1a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/68dc3dac.a00a0220.102ee.004f.GAE@google.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08RDMA/bnxt_re: fix dma_free_coherent() pointerThomas Fourier1-3/+1
[ Upstream commit fcd431a9627f272b4c0bec445eba365fe2232a94 ] The dma_alloc_coherent() allocates a dma-mapped buffer, pbl->pg_arr[i]. The dma_free_coherent() should pass the same buffer to dma_free_coherent() and not page-aligned. Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver") Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Link: https://patch.msgid.link/20251230085121.8023-2-fourier.thomas@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08RDMA/rtrs: Fix clt_path::max_pages_per_mr calculationHonggang LI1-0/+1
[ Upstream commit 43bd09d5b750f700499ae8ec45fd41a4c48673e6 ] If device max_mr_size bits in the range [mr_page_shift+31:mr_page_shift] are zero, the `min3` function will set clt_path::max_pages_per_mr to zero. `alloc_path_reqs` will pass zero, which is invalid, as the third parameter to `ib_alloc_mr`. Fixes: 6a98d71daea1 ("RDMA/rtrs: client: main functionality") Signed-off-by: Honggang LI <honggangli@163.com> Link: https://patch.msgid.link/20251229025617.13241-1-honggangli@163.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08IB/rxe: Fix missing umem_odp->umem_mutex unlock on error pathLi Zhijian1-1/+3
[ Upstream commit 3c68cf68233e556e0102f45b69f7448908dc1f44 ] rxe_odp_map_range_and_lock() must release umem_odp->umem_mutex when an error occurs, including cases where rxe_check_pagefault() fails. Fixes: 2fae67ab63db ("RDMA/rxe: Add support for Send/Recv/Write/Read with ODP") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Link: https://patch.msgid.link/20251226094112.3042583-1-lizhijian@fujitsu.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08RDMA/bnxt_re: Fix to use correct page size for PDE tableKalesh AP1-2/+2
[ Upstream commit 3d70e0fb0f289b0c778041c5bb04d099e1aa7c1c ] In bnxt_qplib_alloc_init_hwq(), while allocating memory for PDE table driver incorrectly is using the "pg_size" value passed to the function. Fixed to use the right value 4K. Also, fixed the allocation size for PBL table. Fixes: 0c4dcd602817 ("RDMA/bnxt_re: Refactor hardware queue memory allocation") Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20251223131855.145955-1-kalesh-anakkur.purayil@broadcom.com Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08RDMA/bnxt_re: Fix OOB write in bnxt_re_copy_err_stats()Ding Hui1-3/+3
[ Upstream commit 9b68a1cc966bc947d00e4c0df7722d118125aa37 ] Commit ef56081d1864 ("RDMA/bnxt_re: RoCE related hardware counters update") added three new counters and placed them after BNXT_RE_OUT_OF_SEQ_ERR. BNXT_RE_OUT_OF_SEQ_ERR acts as a boundary marker for allocating hardware statistics with different num_counters values on chip_gen_p5_p7 devices. As a result, BNXT_RE_NUM_STD_COUNTERS are used when allocating hw_stats, which leads to an out-of-bounds write in bnxt_re_copy_err_stats(). The counters BNXT_RE_REQ_CQE_ERROR, BNXT_RE_RESP_CQE_ERROR, and BNXT_RE_RESP_REMOTE_ACCESS_ERRS are applicable to generic hardware, not only p5/p7 devices. Fix this by moving these counters before BNXT_RE_OUT_OF_SEQ_ERR so they are included in the generic counter set. Fixes: ef56081d1864 ("RDMA/bnxt_re: RoCE related hardware counters update") Reported-by: Yingying Zheng <zhengyingying@sangfor.com.cn> Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Link: https://patch.msgid.link/20251208072110.28874-1-dinghui@sangfor.com.cn Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Tested-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08RDMA/bnxt_re: Fix IB_SEND_IP_CSUM handling in post_sendAlok Tiwari1-6/+1
[ Upstream commit f01765a2361323e78e3d91b1cb1d5527a83c5cf7 ] The bnxt_re SEND path checks wr->send_flags to enable features such as IP checksum offload. However, send_flags is a bitmask and may contain multiple flags (e.g. IB_SEND_SIGNALED | IB_SEND_IP_CSUM), while the existing code uses a switch() statement that only matches when send_flags is exactly IB_SEND_IP_CSUM. As a result, checksum offload is not enabled when additional SEND flags are present. Replace the switch() with a bitmask test: if (wr->send_flags & IB_SEND_IP_CSUM) This ensures IP checksum offload is enabled correctly when multiple SEND flags are used. Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Link: https://patch.msgid.link/20251219093308.2415620-1-alok.a.tiwari@oracle.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08RDMA/core: always drop device refcount in ib_del_sub_device_and_put()Tetsuo Handa1-1/+3
[ Upstream commit fa3c411d21ebc26ffd175c7256c37cefa35020aa ] Since nldev_deldev() (introduced by commit 060c642b2ab8 ("RDMA/nldev: Add support to add/delete a sub IB device through netlink") grabs a reference using ib_device_get_by_index() before calling ib_del_sub_device_and_put(), we need to drop that reference before returning -EOPNOTSUPP error. Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84 Fixes: bca51197620a ("RDMA/core: Support IB sub device with type "SMI"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://patch.msgid.link/80749a85-cbe2-460c-8451-42516013f9fa@I-love.SAKURA.ne.jp Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>