kernel/linux.git/drivers/infiniband/core/rw.c, branch linux-7.0.y

RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path

2026-03-17T19:00:56+00:00

When IOVA-based DMA mapping is unavailable (e.g., IOMMU passthrough mode), rdma_rw_ctx_init_bvec() falls back to checking rdma_rw_io_needs_mr() with the raw bvec count. Unlike the scatterlist path in rdma_rw_ctx_init(), which passes a post-DMA-mapping entry count that reflects coalescing of physically contiguous pages, the bvec path passes the pre-mapping page count. This overstates the number of DMA entries, causing every multi-bvec RDMA READ to consume an MR from the QP's pool. Under NFS WRITE workloads the server performs RDMA READs to pull data from the client. With the inflated MR demand, the pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as a DMA mapping failure, closes the connection, and the client reconnects -- producing a cycle of 71% RPC retransmissions and ~100 reconnections per test run. RDMA WRITEs (NFS READ direction) are unaffected because DMA_TO_DEVICE never triggers the max_sgl_rd check. Remove the rdma_rw_io_needs_mr() gate from the bvec path entirely, so that bvec RDMA operations always use the map_wrs path (direct WR posting without MR allocation). The bvec caller has no post-DMA-coalescing segment count available -- xdr_buf and svc_rqst hold pages as individual pointers, and physical contiguity is discovered only during DMA mapping -- so the raw page count cannot serve as a reliable input to rdma_rw_io_needs_mr(). iWARP devices, which require MRs unconditionally, are handled by an earlier check in rdma_rw_ctx_init_bvec() and are unaffected. Fixes: bea28ac14cab ("RDMA/core: add MR support for bvec-based RDMA operations") Signed-off-by: Chuck Lever Reviewed-by: Christoph Hellwig Link: https://patch.msgid.link/20260313194201.5818-3-cel@kernel.org Signed-off-by: Leon Romanovsky

RDMA/rw: Fall back to direct SGE on MR pool exhaustion

2026-03-17T19:00:38+00:00

When IOMMU passthrough mode is active, ib_dma_map_sgtable_attrs() produces no coalescing: each scatterlist page maps 1:1 to a DMA entry, so sgt.nents equals the raw page count. A 1 MB transfer yields 256 DMA entries. If that count exceeds the device's max_sgl_rd threshold (an optimization hint from mlx5 firmware), rdma_rw_io_needs_mr() steers the operation into the MR registration path. Each such operation consumes one or more MRs from a pool sized at max_rdma_ctxs -- roughly one MR per concurrent context. Under write-intensive workloads that issue many concurrent RDMA READs, the pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns -EAGAIN. Upper layer protocols treat this as a fatal DMA mapping failure and tear down the connection. The max_sgl_rd check is a performance optimization, not a correctness requirement: the device can handle large SGE counts via direct posting, just less efficiently than with MR registration. When the MR pool cannot satisfy a request, falling back to the direct SGE (map_wrs) path avoids the connection reset while preserving the MR optimization for the common case where pool resources are available. Add a fallback in rdma_rw_ctx_init() so that -EAGAIN from rdma_rw_init_mr_wrs() triggers direct SGE posting instead of propagating the error. iWARP devices, which mandate MR registration for RDMA READs, and force_mr debug mode continue to treat -EAGAIN as terminal. Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages") Signed-off-by: Chuck Lever Reviewed-by: Christoph Hellwig Link: https://patch.msgid.link/20260313194201.5818-2-cel@kernel.org Signed-off-by: Leon Romanovsky

Convert more 'alloc_obj' cases to default GFP_KERNEL arguments

2026-02-22T04:03:00+00:00

This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds

Convert 'alloc_obj' family to use the new default GFP_KERNEL argument

2026-02-22T01:09:51+00:00

This was done entirely with mindless brute force, using git grep -l '\

treewide: Replace kmalloc with kmalloc_obj for non-scalar types

2026-02-21T09:02:28+00:00

This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook

RDMA/core: add rdma_rw_max_sge() helper for SQ sizing

2026-01-28T10:54:53+00:00

svc_rdma_accept() computes sc_sq_depth as the sum of rq_depth and the number of rdma_rw contexts (ctxts). This value is used to allocate the Send CQ and to initialize the sc_sq_avail credit pool. However, when the device uses memory registration for RDMA operations, rdma_rw_init_qp() inflates the QP's max_send_wr by a factor of three per context to account for REG and INV work requests. The Send CQ and credit pool remain sized for only one work request per context, causing Send Queue exhaustion under heavy NFS WRITE workloads. Introduce rdma_rw_max_sge() to compute the actual number of Send Queue entries required for a given number of rdma_rw contexts. Upper layer protocols call this helper before creating a Queue Pair so that their Send CQs and credit accounting match the QP's true capacity. Update svc_rdma_accept() to use rdma_rw_max_sge() when computing sc_sq_depth, ensuring the credit pool reflects the work requests that rdma_rw_init_qp() will reserve. Reviewed-by: Christoph Hellwig Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages") Signed-off-by: Chuck Lever Link: https://patch.msgid.link/20260128005400.25147-5-cel@kernel.org Signed-off-by: Leon Romanovsky

RDMA/core: add MR support for bvec-based RDMA operations

2026-01-28T10:54:53+00:00

The bvec-based RDMA API currently returns -EOPNOTSUPP when Memory Region registration is required. This prevents iWARP devices from using the bvec path, since iWARP requires MR registration for RDMA READ operations. The force_mr debug parameter is also unusable with bvec input. Add rdma_rw_init_mr_wrs_bvec() to handle MR registration for bvec arrays. The approach creates a synthetic scatterlist populated with DMA addresses from the bvecs, then reuses the existing ib_map_mr_sg() infrastructure. This avoids driver changes while keeping the implementation small. The synthetic scatterlist is stored in the rdma_rw_ctx for cleanup. On destroy, the MRs are returned to the pool and the bvec DMA mappings are released using the stored addresses. Signed-off-by: Chuck Lever Link: https://patch.msgid.link/20260128005400.25147-4-cel@kernel.org Reviewed-by: Christoph Hellwig Signed-off-by: Leon Romanovsky

RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations

2026-01-28T10:54:53+00:00

The bvec RDMA API maps each bvec individually via dma_map_phys(), requiring an IOTLB sync for each mapping. For large I/O operations with many bvecs, this overhead becomes significant. The two-step IOVA API (dma_iova_try_alloc / dma_iova_link / dma_iova_sync) allocates a contiguous IOVA range upfront, links all physical pages without IOTLB syncs, then performs a single sync at the end. This reduces IOTLB flushes from O(n) to O(1). It also requires only a single output dma_addr_t compared to extra per-input element storage in struct scatterlist. Reviewed-by: Christoph Hellwig Signed-off-by: Chuck Lever Link: https://patch.msgid.link/20260128005400.25147-3-cel@kernel.org Signed-off-by: Leon Romanovsky

RDMA/core: add bio_vec based RDMA read/write API

2026-01-28T10:54:53+00:00

The existing rdma_rw_ctx_init() API requires callers to construct a scatterlist, which is then DMA-mapped page by page. Callers that already have data in bio_vec form (such as the NVMe-oF target) must first convert to scatterlist, adding overhead and complexity. Introduce rdma_rw_ctx_init_bvec() and rdma_rw_ctx_destroy_bvec() to accept bio_vec arrays directly. The new helpers use dma_map_phys() for hardware RDMA devices and virtual addressing for software RDMA devices (rxe, siw), avoiding intermediate scatterlist construction. Memory registration (MR) path support is deferred to a follow-up series; callers requiring MR-based transfers (iWARP devices or force_mr=1) receive -EOPNOTSUPP and should use the scatterlist API. Reviewed-by: Christoph Hellwig Signed-off-by: Chuck Lever Link: https://patch.msgid.link/20260128005400.25147-2-cel@kernel.org Signed-off-by: Leon Romanovsky

RDMA/core: Fix a couple of obvious typos in comments

2023-10-04T18:55:44+00:00

Fix typos. Signed-off-by: Chuck Lever Link: https://lore.kernel.org/r/169643338101.8035.6826446669479247727.stgit@manet.1015granger.net Signed-off-by: Leon Romanovsky