kernel/linux.git/drivers/infiniband/core, branch v6.12.94

RDMA/umem: Fix truncation for block sizes >= 4G

2026-06-19T11:42:37+00:00

[ Upstream commit 15fe76e23615f502d051ef0768f86babaf08746c ] When the iommu is used the linearization of the mapping can give a single block that is very large split across multiple SG entries. When __rdma_block_iter_next() reassembles the split SG entries it is overflowing the 32 bit stack values and computed the wrong DMA addresses for blocks after the truncation. Use the right types to hold DMA addresses. Link: https://patch.msgid.link/r/1-v1-88303e9e509f+f7-ib_umem_types_jgg@nvidia.com Cc: stable@vger.kernel.org Fixes: a808273a495c ("RDMA/verbs: Add a DMA iterator to return aligned contiguous memory blocks") Signed-off-by: Jason Gunthorpe Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

RDMA: Move DMA block iterator logic into dedicated files

2026-06-19T11:42:37+00:00

[ Upstream commit 6094ea64c69520ed1e770e7c79c43412de202bfa ] The DMA iterator logic was mixed into verbs and umem-specific code, forcing all users to include rdma/ib_umem.h. Move the block iterator logic into iter.c and rdma/iter.h so that rdma/ib_umem.h and rdma/ib_verbs.h can be separated in a follow-up patch. Link: https://patch.msgid.link/20260213-refactor-umem-v1-1-f3be85847922@nvidia.com Signed-off-by: Leon Romanovsky Stable-dep-of: 15fe76e23615 ("RDMA/umem: Fix truncation for block sizes >= 4G") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

RDMA: During rereg_mr ensure that REREG_ACCESS is compatible

2026-06-19T11:42:37+00:00

[ Upstream commit badad6fad60def1b9805559dd81dbab3d97b82aa ] If IB_MR_REREG_ACCESS changes from RO to RW then the umem has to be re-evaluated to ensure it is properly pinned as RW. Since the umem is hidden inside each driver's mr struct add a ib_umem_check_rereg() function that each driver has to call before processing IB_MR_REREG_ACCESS. mlx4 has to retain its duplicate ib_access_writable check because it implements IB_MR_REREG_ACCESS | IB_MR_REREG_TRANS by changing both items in place sequentially while the MR is live, so it will continue to not support this combination. Cc: stable@vger.kernel.org Fixes: b40656aa7d55 ("RDMA/umem: remove FOLL_FORCE usage") Link: https://patch.msgid.link/r/0-v1-06fb1a2d6cf5+107-rereg_access_jgg@nvidia.com Reported-by: Philip Tsukerman Signed-off-by: Jason Gunthorpe Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

RDMA/umem: Add helpers for umem dmabuf revoke lock

2026-06-19T11:42:37+00:00

[ Upstream commit 3a0b171302eea1732a168e26db3b8461f51cc1f9 ] Added helpers to acquire and release the umem dmabuf revoke lock. The intent is to avoid the need for drivers to peek into the ib_umem_dmabuf internals to get the dma_resv_lock and bring us one step closer to abstracting ib_umem_dmabuf away from drivers in general. Signed-off-by: Jacob Moroni Link: https://patch.msgid.link/20260305170826.3803155-5-jmoroni@google.com Signed-off-by: Leon Romanovsky Stable-dep-of: badad6fad60d ("RDMA: During rereg_mr ensure that REREG_ACCESS is compatible") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

RDMA/umem: Move umem dmabuf revoke logic into helper function

2026-06-19T11:42:37+00:00

[ Upstream commit 797291a66ce346c96114b72222fc290d402da005 ] This same logic will eventually be reused from within the invalidate_mappings callback which already has the dma_resv_lock held, so break it out into a separate function so it can be reused. Signed-off-by: Jacob Moroni Link: https://patch.msgid.link/20260305170826.3803155-3-jmoroni@google.com Signed-off-by: Leon Romanovsky Stable-dep-of: badad6fad60d ("RDMA: During rereg_mr ensure that REREG_ACCESS is compatible") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

RDMA/umem: Add ib_umem_dmabuf_get_pinned_and_lock helper

2026-06-19T11:42:37+00:00

[ Upstream commit 553dfa8cbd0c6d36adae042d9738ddf8f8765ac7 ] Move the inner logic of ib_umem_dmabuf_get_pinned_with_dma_device() to a new static function that returns with the lock held upon success. The intent is to allow reuse for the future get_pinned_revocable_and_lock function. Signed-off-by: Jacob Moroni Link: https://patch.msgid.link/20260305170826.3803155-2-jmoroni@google.com Signed-off-by: Leon Romanovsky Stable-dep-of: badad6fad60d ("RDMA: During rereg_mr ensure that REREG_ACCESS is compatible") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

RDMA/core: Prefer NLA_NUL_STRING

2026-05-23T11:04:45+00:00

[ Upstream commit 6ed3d14fc45d3da6025e7fe4a6a09066856698e2 ] These attributes are evaluated as c-string (passed to strcmp), but NLA_STRING doesn't check for the presence of a \0 terminator. Either this needs to switch to nla_strcmp() and needs to adjust printf fmt specifier to not use plain %s, or this needs to use NLA_NUL_STRING. As the code has been this way for long time, it seems to me that userspace does include the terminating nul, even tough its not enforced so far, and thus NLA_NUL_STRING use is the simpler solution. Fixes: 30dc5e63d6a5 ("RDMA/core: Add support for iWARP Port Mapper user space service") Link: https://patch.msgid.link/r/20260330122742.13315-1-fw@strlen.de Signed-off-by: Florian Westphal Signed-off-by: Jason Gunthorpe Signed-off-by: Sasha Levin

IB/core: Fix zero dmac race in neighbor resolution

2026-05-07T04:09:42+00:00

commit 5e6de34d82b49cab9d8a42063e9cd0f22a4f31e5 upstream. dst_fetch_ha() checks nud_state without holding the neighbor lock, then copies ha under the seqlock. A race in __neigh_update() where nud_state is set to NUD_REACHABLE before ha is written allows dst_fetch_ha() to read a zero MAC address while the seqlock reports no concurrent writer. netevent_callback amplifies this by waking ALL pending addr_req workers when ANY neighbor becomes NUD_VALID. At scale (N peers resolving ARP concurrently), the hit probability scales as N^2, making it near-certain for large RDMA workloads. N(A): neigh_update(A) W(A): addr_resolve(A) | [sleep] | write_lock_bh(&A->lock) | | A->nud_state = NUD_REACHABLE | | // A->ha is still 0 | | [woken by netevent_cb() of | another neighbour] | | dst_fetch_ha(A) | | A->nud_state & NUD_VALID | | read_seqbegin(&A->ha_lock) | | snapshot = A->ha /* 0 */ | | read_seqretry(&A->ha_lock) | | return snapshot | seqlock(&A->ha_lock) | A->ha = mac_A /* too late */ | sequnlock(&A->ha_lock) | write_unlock_bh(&A->lock) The incorrect/zero mac is read and programmed in the device QP while it was not yet updated. This causes silent packet loss and eventual RETRY_EXC_ERR. Fix by holding the neighbor read lock across the nud_state check and ha copy in dst_fetch_ha(), ensuring it synchronizes with __neigh_update() which is updating while holding the write lock. Cc: stable@vger.kernel.org Fixes: 92ebb6a0a13a ("IB/cm: Remove now useless rcu_lock in dst_fetch_ha") Link: https://patch.msgid.link/r/20260405-fix-dmac-race-v1-1-cfa1ec2ce54a@nvidia.com Signed-off-by: Chen Zhao Reviewed-by: Parav Pandit Signed-off-by: Leon Romanovsky Signed-off-by: Jason Gunthorpe Signed-off-by: Greg Kroah-Hartman

RDMA/rw: Fall back to direct SGE on MR pool exhaustion

2026-04-02T11:09:37+00:00

[ Upstream commit 00da250c21b074ea9494c375d0117b69e5b1d0a4 ] When IOMMU passthrough mode is active, ib_dma_map_sgtable_attrs() produces no coalescing: each scatterlist page maps 1:1 to a DMA entry, so sgt.nents equals the raw page count. A 1 MB transfer yields 256 DMA entries. If that count exceeds the device's max_sgl_rd threshold (an optimization hint from mlx5 firmware), rdma_rw_io_needs_mr() steers the operation into the MR registration path. Each such operation consumes one or more MRs from a pool sized at max_rdma_ctxs -- roughly one MR per concurrent context. Under write-intensive workloads that issue many concurrent RDMA READs, the pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns -EAGAIN. Upper layer protocols treat this as a fatal DMA mapping failure and tear down the connection. The max_sgl_rd check is a performance optimization, not a correctness requirement: the device can handle large SGE counts via direct posting, just less efficiently than with MR registration. When the MR pool cannot satisfy a request, falling back to the direct SGE (map_wrs) path avoids the connection reset while preserving the MR optimization for the common case where pool resources are available. Add a fallback in rdma_rw_ctx_init() so that -EAGAIN from rdma_rw_init_mr_wrs() triggers direct SGE posting instead of propagating the error. iWARP devices, which mandate MR registration for RDMA READs, and force_mr debug mode continue to treat -EAGAIN as terminal. Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages") Signed-off-by: Chuck Lever Reviewed-by: Christoph Hellwig Link: https://patch.msgid.link/20260313194201.5818-2-cel@kernel.org Signed-off-by: Leon Romanovsky Signed-off-by: Sasha Levin

RDMA/umem: Fix double dma_buf_unpin in failure path

2026-03-04T12:21:34+00:00

[ Upstream commit 104016eb671e19709721c1b0048dd912dc2e96be ] In ib_umem_dmabuf_get_pinned_with_dma_device(), the call to ib_umem_dmabuf_map_pages() can fail. If this occurs, the dmabuf is immediately unpinned but the umem_dmabuf->pinned flag is still set. Then, when ib_umem_release() is called, it calls ib_umem_dmabuf_revoke() which will call dma_buf_unpin() again. Fix this by removing the immediate unpin upon failure and just let the ib_umem_release/revoke path handle it. This also ensures the proper unmap-unpin unwind ordering if the dmabuf_map_pages call happened to fail due to dma_resv_wait_timeout (and therefore has a non-NULL umem_dmabuf->sgt). Fixes: 1e4df4a21c5a ("RDMA/umem: Allow pinned dmabuf umem usage") Signed-off-by: Jacob Moroni Link: https://patch.msgid.link/20260224234153.1207849-1-jmoroni@google.com Signed-off-by: Leon Romanovsky Signed-off-by: Sasha Levin