summaryrefslogtreecommitdiff
path: root/include/linux/sunrpc/svc_rdma.h
AgeCommit message (Collapse)AuthorFilesLines
2021-03-31svcrdma: Remove svc_rdma_recv_ctxt::rc_pages and ::rc_argChuck Lever1-3/+0
These fields are no longer used. The size of struct svc_rdma_recv_ctxt is now less than 300 bytes on x86_64, down from 2440 bytes. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-31svcrdma: Remove sc_read_complete_qChuck Lever1-2/+0
Now that svc_rdma_recvfrom() waits for Read completion, sc_read_complete_q is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-22svcrdma: Remove unused sc_pages fieldChuck Lever1-2/+1
Clean up. This significantly reduces the size of struct svc_rdma_send_ctxt. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-22svcrdma: Normalize Send page handlingChuck Lever1-0/+1
Currently svc_rdma_sendto() migrates xdr_buf pages into a separate page list and NULLs out a bunch of entries in rq_pages while the pages are under I/O. The Send completion handler then frees those pages later. Instead, let's wait for the Send completion, then handle page releasing in the nfsd thread. I'd like to avoid the cost of 250+ put_page() calls in the Send completion handler, which is single- threaded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-22svcrdma: Maintain a Receive water markChuck Lever1-0/+2
Post more Receives when the number of pending Receives drops below a water mark. The batch mechanism is disabled if the underlying device cannot support a reasonably-sized Receive Queue. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-11svcrdma: Revert "svcrdma: Reduce Receive doorbell rate"Chuck Lever1-1/+0
I tested commit 43042b90cae1 ("svcrdma: Reduce Receive doorbell rate") with mlx4 (IB) and software iWARP and didn't find any issues. However, I recently got my hardware iWARP setup back on line (FastLinQ) and it's crashing hard on this commit (confirmed via bisect). The failure mode is complex. - After a connection is established, the first Receive completes normally. - But the second and third Receives have garbage in their Receive buffers. The server responds with ERR_VERS as a result. - When the client tears down the connection to retry, a couple of posted Receives flush twice, and that corrupts the recv_ctxt free list. - __svc_rdma_free then faults or loops infinitely while destroying the xprt's recv_ctxts. Since 43042b90cae1 ("svcrdma: Reduce Receive doorbell rate") does not fix a bug but is a scalability enhancement, it's safe and appropriate to revert it while working on a replacement. Fixes: 43042b90cae1 ("svcrdma: Reduce Receive doorbell rate") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25svcrdma: Reduce Receive doorbell rateChuck Lever1-0/+1
This is similar to commit e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)") which added Receive batching to the client. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25svcrdma: Deprecate stat variables that are no longer usedChuck Lever1-5/+0
Clean up. We are not permitted to remove old proc files. Instead, convert these variables to stubs that are only ever allowed to display a value of zero. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25svcrdma: Restore read and write statsChuck Lever1-2/+2
Now that we have an efficient mechanism to update these two stats, let's start maintaining them again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25svcrdma: Convert rdma_stat_sq_starve to a per-CPU counterChuck Lever1-1/+1
Avoid the overhead of a memory bus lock cycle for counting a value that is hardly every used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25svcrdma: Convert rdma_stat_recv to a per-CPU counterChuck Lever1-1/+2
Receives are frequent events. Avoid the overhead of a memory bus lock cycle for counting a value that is hardly every used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use the new parsed chunk list when pulling Read chunksChuck Lever1-3/+3
As a pre-requisite for handling multiple Read chunks in each Read list, convert svc_rdma_recv_read_chunk() to use the new parsed Read chunk list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Remove chunk list pointersChuck Lever1-4/+0
Clean up: These pointers are no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Support multiple Write chunks in svc_rdma_send_reply_chunkChuck Lever1-1/+1
Refactor svc_rdma_send_reply_chunk() so that it Sends only the parts of rq_res that do not contain a result payload. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Support multiple Write chunks in svc_rdma_map_reply_msg()Chuck Lever1-1/+1
Refactor: svc_rdma_map_reply_msg() is restructured to DMA map only the parts of rq_res that do not contain a result payload. This change has been tested to confirm that it does not cause a regression in the no Write chunk and single Write chunk cases. Multiple Write chunks have not been tested. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Support multiple write chunks when pulling upChuck Lever1-0/+2
When counting the number of SGEs needed to construct a Send request, do not count result payloads. And, when copying the Reply message into the pull-up buffer, result payloads are not to be copied to the Send buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use parsed chunk lists to construct RDMA WritesChuck Lever1-3/+2
Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing RDMA Writes, use the new parsed chunk lists for the Write list and Reply chunk, which are version-agnostic and already XDR-decoded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use parsed chunk lists to detect reverse direction repliesChuck Lever1-0/+1
Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Note that the XID sanity test is also removed. The XID is already looked up by the cb handler, and is rejected if it's not recognized. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Add a "parsed chunk list" data structureChuck Lever1-0/+12
This simple data structure binds the location of each data payload inside of an RPC message to the chunk that will be used to push it to or pull it from the client. There are several benefits to this small additional overhead: * It enables support for more than one chunk in incoming Read and Write lists. * It translates the version-specific on-the-wire format into a generic in-memory structure, enabling support for multiple versions of the RPC/RDMA transport protocol. * It enables the server to re-organize a chunk list if it needs to adjust where Read chunk data lands in server memory without altering the contents of the XDR-encoded Receive buffer. Construction of these lists is done while sanity checking each incoming RPC/RDMA header. Subsequent patches will make use of the generated data structures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Post RDMA Writes while XDR encoding repliesChuck Lever1-3/+1
The only RPC/RDMA ordering requirement between RDMA Writes and RDMA Sends is that the responder must post the Writes on the Send queue before posting the Send that conveys the RPC Reply for that Write payload. The Linux NFS server implementation now has a transport method that can post result Payload Writes earlier than svc_rdma_sendto: ->xpo_result_payload() This gets RDMA Writes going earlier so they are more likely to be complete at the remote end before the Send completes. Some care must be taken with pulled-up Replies. We don't want to push the Write chunk and then send the same payload data via Send. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Rename svc_encode_read_payload()Chuck Lever1-2/+2
Clean up: "result payload" is a less confusing name for these payloads. "READ payload" reflects only the NFS usage. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-14svcrdma: Record send_ctxt completion ID in trace_svcrdma_post_send()Chuck Lever1-1/+2
First, refactor: Dereference the svc_rdma_send_ctxt inside svc_rdma_send() instead of at every call site. Then, it can be passed into trace_svcrdma_post_send() to get the proper completion ID. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-14svcrdma: Introduce Send completion IDsChuck Lever1-0/+2
Set up a completion ID in each svc_rdma_send_ctxt. The ID is used to match an incoming Send completion to a transport and to a previous ib_post_send(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-14svcrdma: Introduce Receive completion IDsChuck Lever1-0/+4
Set up a completion ID in each svc_rdma_recv_ctxt. The ID is used to match an incoming Receive completion to a transport and to a previous ib_post_recv(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-14svcrdma: Remove declarations for functions long removedChuck Lever1-4/+0
Pavane pour une infante défunte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-14svcrdma: Make svc_rdma_send_error_msg() a global functionChuck Lever1-0/+4
Prepare for svc_rdma_send_error_msg() to be invoked from another source file. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18svcrdma: Remove the SVCRDMA_DEBUG macroChuck Lever1-1/+0
Clean up: Commit d21b05f101ae ("rdma: SVCRMDA Header File") introduced the SVCRDMA_DEBUG macro, but it doesn't seem to have been used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18svcrdma: Fix backchannel return codeChuck Lever1-3/+2
Way back when I was writing the RPC/RDMA server-side backchannel code, I misread the TCP backchannel reply handler logic. When svc_tcp_recvfrom() successfully receives a backchannel reply, it does not return -EAGAIN. It sets XPT_DATA and returns zero. Update svc_rdma_recvfrom() to return zero. Here, XPT_DATA doesn't need to be set again: it is set whenever a new message is received, behind a spin lock in a single threaded context. Also, if handling the cb reply is not successful, the message is simply dropped. There's no special message framing to deal with as there is in the TCP case. Now that the handle_bc_reply() return value is ignored, I've removed the dprintk call sites in the error exit of handle_bc_reply() in favor of trace points in other areas that already report the error cases. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-17svcrdma: Fix leak of svc_rdma_recv_ctxt objectsChuck Lever1-0/+1
Utilize the xpo_release_rqst transport method to ensure that each rqstp's svc_rdma_recv_ctxt object is released even when the server cannot return a Reply for that rqstp. Without this fix, each RPC whose Reply cannot be sent leaks one svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped Receive buffer, and any pages that might be part of the Reply message. The leak is infrequent unless the network fabric is unreliable or Kerberos is in use, as GSS sequence window overruns, which result in connection loss, are more common on fast transports. Fixes: 3a88092ee319 ("svcrdma: Preserve Receive buffer until svc_rdma_sendto") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Avoid DMA mapping small RPC RepliesChuck Lever1-0/+1
On some platforms, DMA mapping part of a page is more costly than copying bytes. Indeed, not involving the I/O MMU can help the RPC/RDMA transport scale better for tiny I/Os across more RDMA devices. This is because interaction with the I/O MMU is eliminated for each of these small I/Os. Without the explicit unmapping, the NIC no longer needs to do a costly internal TLB shoot down for buffers that are just a handful of bytes. Since pull-up is now a more a frequent operation, I've introduced a trace point in the pull-up path. It can be used for debugging or user-space tools that count pull-up frequency. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Fix double sync of transport header bufferChuck Lever1-3/+0
Performance optimization: Avoid syncing the transport buffer twice when Reply buffer pull-up is necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Refactor chunk list encodersChuck Lever1-0/+2
Same idea as the receive-side changes I did a while back: use xdr_stream helpers rather than open-coding the XDR chunk list encoders. This builds the Reply transport header from beginning to end without backtracking. As additional clean-ups, fill in documenting comments for the XDR encoders and sprinkle some trace points in the new encoding functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Update synopsis of svc_rdma_map_reply_msg()Chuck Lever1-2/+3
Preparing for subsequent patches, no behavior change expected. Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto() path. This enables passing more information about Requester- provided Write and Reply chunks into those lower-level functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Update synopsis of svc_rdma_send_reply_chunk()Chuck Lever1-1/+1
Preparing for subsequent patches, no behavior change expected. Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto() path. This enables passing more information about Requester- provided Write and Reply chunks into the lower-level send functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: De-duplicate code that locates Write and Reply chunksChuck Lever1-0/+2
Cache the locations of the Requester-provided Write list and Reply chunk so that the Send path doesn't need to parse the Call header again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Use struct xdr_stream to decode ingress transport headersChuck Lever1-0/+1
The logic that checks incoming network headers has to be scrupulous. De-duplicate: replace open-coded buffer overflow checks with the use of xdr_stream helpers that are used most everywhere else XDR decoding is done. One minor change to the sanity checks: instead of checking the length of individual segments, cap the length of the whole chunk to be sure it can fit in the set of pages available in rq_pages. This should be a better test of whether the server can handle the chunks in each request. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16nfsd: Fix NFSv4 READ on RDMA when using readvChuck Lever1-1/+7
svcrdma expects that the payload falls precisely into the xdr_buf page vector. This does not seem to be the case for nfsd4_encode_readv(). This code is called only when fops->splice_read is missing or when RQ_SPLICE_OK is clear, so it's not a noticeable problem in many common cases. Add new transport method: ->xpo_read_payload so that when a READ payload does not fit exactly in rq_res's page vector, the XDR encoder can inform the RPC transport exactly where that payload is, without the payload's XDR pad. That way, when a Write chunk is present, the transport knows what byte range in the Reply message is supposed to be matched with the chunk. Note that the Linux NFS server implementation of NFS/RDMA can currently handle only one Write chunk per RPC-over-RDMA message. This simplifies the implementation of this fix. Fixes: b04209806384 ("nfsd4: allow exotic read compounds") Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053 Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2019-08-19svcrdma: Use llist for managing cache of recv_ctxtsChuck Lever1-2/+3
Use a wait-free mechanism for managing the svc_rdma_recv_ctxts free list. Subsequently, sc_recv_lock can be eliminated. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-08-19svcrdma: Remove svc_rdma_wqChuck Lever1-1/+0
Clean up: the system workqueue will work just as well. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-12-28sunrpc: remove unused xpo_prep_reply_hdr callbackVasily Averin1-1/+0
xpo_prep_reply_hdr are not used now. It was defined for tcp transport only, however it cannot be called indirectly, so let's move it to its caller and remove unused callback. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-11-29svcrdma: Optimize the logic that selects the R_key to invalidateChuck Lever1-0/+1
o Select the R_key to invalidate while the CPU cache still contains the received RPC Call transport header, rather than waiting until we're about to send the RPC Reply. o Choose Send With Invalidate if there is exactly one distinct R_key in the received transport header. If there's more than one, the client will have to perform local invalidation after it has already waited for remote invalidation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29svcrdma: Increase the default connection credit limitChuck Lever1-6/+7
Reduce queuing on clients by allowing more credits by default. 64 is the default NFSv4.1 slot table size on Linux clients. This size prevents the credit limit from putting RPC requests to sleep again after they have already slept waiting for a session slot. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Remove unused svc_rdma_op_ctxtChuck Lever1-21/+0
Clean up: Eliminate a structure that is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Persistently allocate and DMA-map Send buffersChuck Lever1-2/+6
While sending each RPC Reply, svc_rdma_sendto allocates and DMA- maps a separate buffer where the RPC/RDMA transport header is constructed. The buffer is unmapped and released in the Send completion handler. This is significant per-RPC overhead, especially for small RPCs. Instead, allocate and DMA-map a buffer, and cache it in each svc_rdma_send_ctxt. This buffer and its mapping can be re-used for each RPC, saving the cost of memory allocation and DMA mapping. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Remove post_send_wrChuck Lever1-3/+0
Clean up: Now that the send_wr is part of the svc_rdma_send_ctxt, svc_rdma_post_send_wr is nearly empty. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxtChuck Lever1-6/+3
Receive buffers are always the same size, but each Send WR has a variable number of SGEs, based on the contents of the xdr_buf being sent. While assembling a Send WR, keep track of the number of SGEs so that we don't exceed the device's maximum, or walk off the end of the Send SGE array. For now the Send path just fails if it exceeds the maximum. The current logic in svc_rdma_accept bases the maximum number of Send SGEs on the largest NFS request that can be sent or received. In the transport layer, the limit is actually based on the capabilities of the underlying device, not on properties of the Upper Layer Protocol. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Introduce svc_rdma_send_ctxtChuck Lever1-12/+23
svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt free list. This eliminates the overhead of calling kmalloc / kfree, both of which grab a globally shared lock that disables interrupts. Introduce a replacement to svc_rdma_op_ctxt's that is built especially for the svcrdma Send path. Subsequent patches will take advantage of this new structure by allocating real resources which are then cached in these objects. The allocations are freed when the transport is torn down. I've renamed the structure so that static type checking can be used to ensure that uses of op_ctxt and send_ctxt are not confused. As an additional clean up, structure fields are renamed to conform with kernel coding conventions. Additional clean ups: - Handle svc_rdma_send_ctxt_get allocation failure at each call site, rather than pre-allocating and hoping we guessed correctly - All send_ctxt_put call-sites request page freeing, so remove the @free_pages argument - All send_ctxt_put call-sites unmap SGEs, so fold that into svc_rdma_send_ctxt_put Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Clean up Send SGE accountingChuck Lever1-1/+1
Clean up: Since there's already a svc_rdma_op_ctxt being passed around with the running count of mapped SGEs, drop unneeded parameters to svc_rdma_post_send_wr(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Refactor svc_rdma_dma_map_bufChuck Lever1-7/+0
Clean up: svc_rdma_dma_map_buf does mostly the same thing as svc_rdma_dma_map_page, so let's fold these together. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Allocate recv_ctxt's on CPU handling ReceivesChuck Lever1-0/+1
There is a significant latency penalty when processing an ingress Receive if the Receive buffer resides in memory that is not on the same NUMA node as the the CPU handling completions for a CQ. The system administrator and the device driver determine which CPU handles completions. This CPU does not change during life of the CQ. Further the Upper Layer does not have any visibility of which CPU it is. Allocating Receive buffers in the Receive completion handler guarantees that Receive buffers are allocated on the preferred NUMA node for that CQ. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>