summaryrefslogtreecommitdiff
path: root/net/sunrpc/xprtrdma
AgeCommit message (Collapse)AuthorFilesLines
2017-07-13xprtrdma: On invalidation failure, remove MWs from rl_registeredChuck Lever1-0/+1
Callers assume the ro_unmap_sync and ro_unmap_safe methods empty the list of registered MRs. Ensure that all paths through fmr_op_unmap_sync() remove MWs from that list. Fixes: 9d6b04097882 ("xprtrdma: Place registered MWs on a ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13svcrdma: fix an incorrect check on -E2BIG and -EINVALColin Ian King1-1/+1
The current check will always be true and will always jump to err1, this looks dubious to me. I believe && should be used instead of ||. Detected by CoverityScan, CID#1450120 ("Logically Dead Code") Fixes: 107c1d0a991a ("svcrdma: Avoid Send Queue overflow") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12svcrdma: Remove svc_rdma_chunk_ctxt::cc_dir fieldChuck Lever1-10/+8
Clean up: No need to save the I/O direction. The functions that release svc_rdma_chunk_ctxt already know what direction to use. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12svcrdma: use offset_in_page() macroChuck Lever1-2/+3
Clean up: Use offset_in_page() macro instead of open-coding. Reported-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12svcrdma: Clean up after converting svc_rdma_recvfrom to rdma_rw APIChuck Lever1-35/+4
Clean up: Registration mode details are now handled by the rdma_rw API, and thus can be removed from svcrdma. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12svcrdma: Clean-up svc_rdma_unmap_dmaChuck Lever1-14/+5
There's no longer a need to compare each SGE's lkey with the PD's local_dma_lkey. Now that FRWR is gone, all DMA mappings are for pages that were registered with this key. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12svcrdma: Remove frmr cacheChuck Lever1-86/+0
Clean up: Now that the svc_rdma_recvfrom path uses the rdma_rw API, the details of Read sink buffer registration are dealt with by the kernel's RDMA core. This cache is no longer used, and can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12svcrdma: Remove unused Read completion handlersChuck Lever1-84/+9
Clean up: The generic RDMA R/W API conversion of svc_rdma_recvfrom replaced the Register, Read, and Invalidate completion handlers. Remove the old ones, which are no longer used. These handlers shared some helper code with svc_rdma_wc_send. Fold the wc_common helper back into the one remaining completion handler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12svcrdma: Properly compute .len and .buflen for received RPC CallsChuck Lever2-11/+5
When an RPC-over-RDMA request is received, the Receive buffer contains a Transport Header possibly followed by an RPC message. Even though rq_arg.head[0] (as passed to NFSD) does not contain the Transport Header header, currently rq_arg.len includes the size of the Transport Header. That violates the intent of the xdr_buf API contract. .buflen should include everything, but .len should be exactly the length of the RPC message in the buffer. The rq_arg fields are summed together at the end of svc_rdma_recvfrom to obtain the correct return value. rq_arg.len really ought to contain the correct number of bytes already, but it currently doesn't due to the above misbehavior. Let's instead ensure that .buflen includes the length of the transport header, and that .len is always equal to head.iov_len + .page_len + tail.iov_len . Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12svcrdma: Use generic RDMA R/W API in RPC Call pathChuck Lever2-454/+106
The current svcrdma recvfrom code path has a lot of detail about registration mode and the type of port (iWARP, IB, etc). Instead, use the RDMA core's generic R/W API. This shares code with other RDMA-enabled ULPs that manages the gory details of buffer registration and the posting of RDMA Read Work Requests. Since the Read list marshaling code is being replaced, I took the opportunity to replace C structure-based XDR encoding code with more portable code that uses pointer arithmetic. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12svcrdma: Add recvfrom helpers to svc_rdma_rw.cChuck Lever1-1/+424
svc_rdma_rw.c already contains helpers for the sendto path. Introduce helpers for the recvfrom path. The plan is to replace the local NFSD bespoke code that constructs and posts RDMA Read Work Requests with calls to the rdma_rw API. This shares code with other RDMA-enabled ULPs that manages the gory details of buffer registration and posting Work Requests. This new code also puts all RDMA_NOMSG-specific logic in one place. Lastly, the use of rqstp->rq_arg.pages is deprecated in favor of using rqstp->rq_pages directly, for clarity. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28svcrdma: Don't account for Receive queue "starvation"Chuck Lever1-15/+6
>From what I can tell, calling ->recvfrom when there is no work to do is a normal part of operation. This is the only way svc_recv can tell when there is no more data ready to receive on the transport. Neither the TCP nor the UDP transport implementations have a "starve" metric. The cost of receive starvation accounting is bumping an atomic, which results in extra (IMO unnecessary) bus traffic between CPU sockets, while holding a spin lock. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28svcrdma: Improve Reply chunk sanity checkingChuck Lever1-6/+11
Identify malformed transport headers and unsupported chunk combinations as early as possible. - Ensure that segment lengths are not crazy. - Ensure that the Reply chunk's segment count is not crazy. With a 1KB inline threshold, the largest number of Write segments that can be conveyed is about 60 (for a RDMA_NOMSG Reply message). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28svcrdma: Improve Write chunk sanity checkingChuck Lever1-5/+47
Identify malformed transport headers and unsupported chunk combinations as early as possible. - Reject RPC-over-RDMA messages that contain more than one Write chunk, since this implementation does not support more than one per message. - Ensure that segment lengths are not crazy. - Ensure that the chunk's segment count is not crazy. With a 1KB inline threshold, the largest number of Write segments that can be conveyed is about 60 (for a RDMA_NOMSG Reply message). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28svcrdma: Improve Read chunk sanity checkingChuck Lever1-18/+37
Identify malformed transport headers and unsupported chunk combinations as early as possible. - Reject RPC-over-RDMA messages that contain more than one Read chunk, since this implementation currently does not support more than one per RPC transaction. - Ensure that segment lengths are not crazy. - Remove the segment count check. With a 1KB inline threshold, the largest number of Read segments that can be conveyed is about 40 (for a RDMA_NOMSG Call message). This is nowhere near RPCSVC_MAXPAGES. As far as I can tell, that was just a sanity check and does not enforce an implementation limit. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28svcrdma: Remove svc_rdma_marshal.cChuck Lever3-170/+128
svc_rdma_marshal.c has one remaining exported function -- svc_rdma_xdr_decode_req -- and it has a single call site. Take the same approach as the sendto path, and move this function into the source file where it is called. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28svcrdma: Avoid Send Queue overflowChuck Lever2-1/+6
Sanity case: Catch the case where more Work Requests are being posted to the Send Queue than there are Send Queue Entries. This might happen if a client sends a chunk with more segments than there are SQEs for the transport. The server can't send that reply, so the transport will deadlock unless the client drops the RPC. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28svcrdma: Squelch disconnection messagesChuck Lever1-3/+10
The server displays "svcrdma: failed to post Send WR (-107)" in the kernel log when the client disconnects. This could flood the server's log, so remove the message. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-24xprtrdma: Delete an error message for a failed memory allocation in ↵Markus Elfring1-4/+2
xprt_rdma_bc_setup() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-10Merge tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linuxLinus Torvalds8-749/+1100
Pull nfsd updates from Bruce Fields: "Another RDMA update from Chuck Lever, and a bunch of miscellaneous bugfixes" * tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux: (26 commits) nfsd: Fix up the "supattr_exclcreat" attributes nfsd: encoders mustn't use unitialized values in error cases nfsd: fix undefined behavior in nfsd4_layout_verify lockd: fix lockd shutdown race NFSv4: Fix callback server shutdown SUNRPC: Refactor svc_set_num_threads() NFSv4.x/callback: Create the callback service through svc_create_pooled lockd: remove redundant check on block svcrdma: Clean out old XDR encoders svcrdma: Remove the req_map cache svcrdma: Remove unused RDMA Write completion handler svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt svcrdma: Clean up RPC-over-RDMA backchannel reply processing svcrdma: Report Write/Reply chunk overruns svcrdma: Clean up RDMA_ERROR path svcrdma: Use rdma_rw API in RPC reply path svcrdma: Introduce local rdma_rw API helpers svcrdma: Clean up svc_rdma_get_inv_rkey() svcrdma: Add helper to save pages under I/O svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT ...
2017-04-26svcrdma: Clean out old XDR encodersChuck Lever1-70/+0
Clean up: These have been replaced and are no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Remove the req_map cacheChuck Lever2-152/+0
req_maps are no longer used by the send path and can thus be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Remove unused RDMA Write completion handlerChuck Lever1-18/+0
Clean up. All RDMA Write completions are now handled by svc_rdma_wc_write_ctx. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxtChuck Lever1-3/+3
The sge array in struct svc_rdma_op_ctxt is no longer used for sending RDMA Write WRs. It need only accommodate the construction of Send and Receive WRs. The maximum inline size is the largest payload it needs to handle now. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Clean up RPC-over-RDMA backchannel reply processingChuck Lever2-16/+29
Replace C structure-based XDR decoding with pointer arithmetic. Pointer arithmetic is considered more portable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Report Write/Reply chunk overrunsChuck Lever1-2/+56
Observed at Connectathon 2017. If a client has underestimated the size of a Write or Reply chunk, the Linux server writes as much payload data as it can, then it recognizes there was a problem and closes the connection without sending the transport header. This creates a couple of problems: <> The client never receives indication of the server-side failure, so it continues to retransmit the bad RPC. Forward progress on the transport is blocked. <> The reply payload pages are not moved out of the svc_rqst, thus they can be released by the RPC server before the RDMA Writes have completed. The new rdma_rw-ized helpers return a distinct error code when a Write/Reply chunk overrun occurs, so it's now easy for the caller (svc_rdma_sendto) to recognize this case. Instead of dropping the connection, post an RDMA_ERROR message. The client now sees an RDMA_ERROR and can properly terminate the RPC transaction. As part of the new logic, set up the same delayed release for these payload pages as would have occurred in the normal case. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Clean up RDMA_ERROR pathChuck Lever3-63/+51
Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can be refactored to reduce code duplication and remove C structure- based XDR encoding. It is also relocated to the source file that contains its only caller. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Use rdma_rw API in RPC reply pathChuck Lever3-354/+350
The current svcrdma sendto code path posts one RDMA Write WR at a time. Each of these Writes typically carries a small number of pages (for instance, up to 30 pages for mlx4 devices). That means a 1MB NFS READ reply requires 9 ib_post_send() calls for the Write WRs, and one for the Send WR carrying the actual RPC Reply message. Instead, use the new rdma_rw API. The details of Write WR chain construction and memory registration are taken care of in the RDMA core. svcrdma can focus on the details of the RPC-over-RDMA protocol. This gives three main benefits: 1. All Write WRs for one RDMA segment are posted in a single chain. As few as one ib_post_send() for each Write chunk. 2. The Write path can now use FRWR to register the Write buffers. If the device's maximum page list depth is large, this means a single Write WR is needed for each RPC's Write chunk data. 3. The new code introduces support for RPCs that carry both a Write list and a Reply chunk. This combination can be used for an NFSv4 READ where the data payload is large, and thus is removed from the Payload Stream, but the Payload Stream is still larger than the inline threshold. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Introduce local rdma_rw API helpersChuck Lever3-1/+517
The plan is to replace the local bespoke code that constructs and posts RDMA Read and Write Work Requests with calls to the rdma_rw API. This shares code with other RDMA-enabled ULPs that manages the gory details of buffer registration and posting Work Requests. Some design notes: o The structure of RPC-over-RDMA transport headers is flexible, allowing multiple segments per Reply with arbitrary alignment, each with a unique R_key. Write and Send WRs continue to be built and posted in separate code paths. However, one whole chunk (with one or more RDMA segments apiece) gets exactly one ib_post_send and one work completion. o svc_xprt reference counting is modified, since a chain of rdma_rw_ctx structs generates one completion, no matter how many Write WRs are posted. o The current code builds the transport header as it is construct- ing Write WRs. I've replaced that with marshaling of transport header data items in a separate step. This is because the exact structure of client-provided segments may not align with the components of the server's reply xdr_buf, or the pages in the page list. Thus parts of each client-provided segment may be written at different points in the send path. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Clean up svc_rdma_get_inv_rkey()Chuck Lever1-22/+17
Replace C structure-based XDR decoding with more portable code that instead uses pointer arithmetic. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Add helper to save pages under I/OChuck Lever1-13/+18
Clean up: extract the logic to save pages under I/O into a helper to add a big documenting comment without adding clutter in the send path. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULTChuck Lever2-3/+1
The Send Queue depth is temporarily reduced to 1 SQE per credit. The new rdma_rw API does an internal computation, during QP creation, to increase the depth of the Send Queue to handle RDMA Read and Write operations. This change has to come before the NFSD code paths are updated to use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases the size of the SQ too much, resulting in memory allocation failures during QP creation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Add svc_rdma_map_reply_hdr()Chuck Lever2-38/+59
Introduce a helper to DMA-map a reply's transport header before sending it. This will in part replace the map vector cache. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-26svcrdma: Move send_wr to svc_rdma_op_ctxtChuck Lever2-35/+40
Clean up: Move the ib_send_wr off the stack, and move common code to post a Send Work Request into a helper. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25xprtrdma: Remove rpcrdma_buffer::rb_poolChuck Lever1-1/+0
Since commit 1e465fd4ff47 ("xprtrdma: Replace send and receive arrays"), this field is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25xprtrdma: Squelch ENOBUFS warningsChuck Lever1-3/+5
When ro_map is out of buffers, that's not a permanent error, so don't report a problem. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25xprtrdma: Annotate receive workqueueChuck Lever1-1/+1
Micro-optimize the receive workqueue by marking it's anchor "read- mostly." Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25xprtrdma: Revert commit d0f36c46deeaChuck Lever1-26/+7
Device removal is now adequately supported. Pinning the underlying device driver to prevent removal while an NFS mount is active is no longer necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25xprtrdma: Restore transport after device removalChuck Lever1-0/+48
After a device removal, enable the transport connect worker to restore normal operation if there is another device with connectivity to the server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25xprtrdma: Refactor rpcrdma_ep_connectChuck Lever1-46/+63
I'm about to add another arm to if (ep->rep_connected != 0) It will be cleaner to use a switch statement here. We'll be looking for a couple of specific errnos, or "anything else," basically to sort out the difference between a normal reconnect and recovery from device removal. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25xprtrdma: Support unplugging an HCA from under an NFS mountChuck Lever3-9/+101
The device driver for the underlying physical device associated with an RPC-over-RDMA transport can be removed while RPC-over-RDMA transports are still in use (ie, while NFS filesystems are still mounted and active). The IB core performs a connection event upcall to request that consumers free all RDMA resources associated with a transport. There may be pending RPCs when this occurs. Care must be taken to release associated resources without leaving references that can trigger a subsequent crash if a signal or soft timeout occurs. We rely on the caller of the transport's ->close method to ensure that the previous RPC task has invoked xprt_release but the transport remains write-locked. A DEVICE_REMOVE upcall forces a disconnect then sleeps. When ->close is invoked, it destroys the transport's H/W resources, then wakes the upcall, which completes and allows the core driver unload to continue. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=266 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25xprtrdma: Use same device when mapping or syncing DMA buffersChuck Lever3-9/+14
When the underlying device driver is reloaded, ia->ri_device will be replaced. All cached copies of that device pointer have to be updated as well. Commit 54cbd6b0c6b9 ("xprtrdma: Delay DMA mapping Send and Receive buffers") added the rg_device field to each regbuf. As part of handling a device removal, rpcrdma_dma_unmap_regbuf is invoked on all regbufs for a transport. Simply calling rpcrdma_dma_map_regbuf for each Receive buffer after the driver has been reloaded should reinitialize rg_device correctly for every case except rpcrdma_wc_receive, which still uses rpcrdma_rep::rr_device. Ensure the same device that was used to map a Receive buffer is also used to sync it in rpcrdma_wc_receive by using rg_device there instead of rr_device. This is the only use of rr_device, so it can be removed. The use of regbufs in the send path is also updated, for completeness. Fixes: 54cbd6b0c6b9 ("xprtrdma: Delay DMA mapping Send and ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25xprtrdma: Refactor rpcrdma_ia_open()Chuck Lever3-27/+32
In order to unload a device driver and reload it, xprtrdma will need to close a transport's interface adapter, and then call rpcrdma_ia_open again, possibly finding a different interface adapter. Make rpcrdma_ia_open safe to call on the same transport multiple times. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25xprtrdma: Detect unreachable NFS/RDMA servers more reliablyChuck Lever1-0/+22
Current NFS clients rely on connection loss to determine when to retransmit. In particular, for protocols like NFSv4, clients no longer rely on RPC timeouts to drive retransmission: NFSv4 servers are required to terminate a connection when they need a client to retransmit pending RPCs. When a server is no longer reachable, either because it has crashed or because the network path has broken, the server cannot actively terminate a connection. Thus NFS clients depend on transport-level keepalive to determine when a connection must be replaced and pending RPCs retransmitted. However, RDMA RC connections do not have a native keepalive mechanism. If an NFS/RDMA server crashes after a client has sent RPCs successfully (an RC ACK has been received for all OTW RDMA requests), there is no way for the client to know the connection is moribund. In addition, new RDMA requests are subject to the RPC-over-RDMA credit limit. If the client has consumed all granted credits with NFS traffic, it is not allowed to send another RDMA request until the server replies. Thus it has no way to send a true keepalive when the workload has already consumed all credits with pending RPCs. To address this, forcibly disconnect a transport when an RPC times out. This prevents moribund connections from stopping the detection of failover or other configuration changes on the server. Note that even if the connection is still good, retransmitting any RPC will trigger a disconnect thanks to this logic in xprt_rdma_send_request: /* Must suppress retransmit to maintain credits */ if (req->rl_connect_cookie == xprt->connect_cookie) goto drop_connection; req->rl_connect_cookie = xprt->connect_cookie; Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25xprtrdma: Cancel refresh worker during buffer shutdownChuck Lever1-0/+1
Trying to create MRs while the transport is being torn down can cause a crash. Fixes: e2ac236c0b65 ("xprtrdma: Allocate MRs on demand") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-01Merge tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-0/+1
Pull nfsd fixes from Bruce Fields: "The restriction of NFSv4 to TCP went overboard and also broke the backchannel; fix. Also some minor refinements to the nfsd version-setting interface that we'd like to get fixed before release" * tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux: svcrdma: set XPT_CONG_CTRL flag for bc xprt NFSD: fix nfsd_reset_versions for NFSv4. NFSD: fix nfsd_minorversion(.., NFSD_AVAIL) NFSD: further refinement of content of /proc/fs/nfsd/versions nfsd: map the ENOKEY to nfserr_perm for avoiding warning SUNRPC/backchanel: set XPT_CONG_CTRL flag for bc xprt
2017-03-29svcrdma: set XPT_CONG_CTRL flag for bc xprtChuck Lever1-0/+1
Same change as Kinglong Mee's fix for the TCP backchannel service. Fixes: 5283b03ee5cd ("nfs/nfsd/sunrpc: enforce transport...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-03-18Merge tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds1-1/+2
Pull NFS client fixes from Anna Schumaker: "We have a handful of stable fixes to fix kernel warnings and other bugs that have been around for a while. We've also found a few other reference counting bugs and memory leaks since the initial 4.11 pull. Stable Bugfixes: - Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings - Prevent a double free in async nfs4_exchange_id() - Squelch a kbuild sparse complaint for xprtrdma Other Bugfixes: - Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak - Fix a reference leak that causes kernel warnings - Make nfs4_cb_sv_ops static to fix a sparse warning - Respect a server's max size in CREATE_SESSION - Handle errors from nfs4_pnfs_ds_connect - Flexfiles layout shouldn't mark devices as unavailable" * tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: pNFS/flexfiles: never nfs4_mark_deviceid_unavailable pNFS: return status from nfs4_pnfs_ds_connect NFSv4.1 respect server's max size in CREATE_SESSION NFS prevent double free in async nfs4_exchange_id nfs: make nfs4_cb_sv_ops static xprtrdma: Squelch kbuild sparse complaint NFS: fix the fault nrequests decreasing for nfs_inode COPY NFSv4: fix a reference leak caused WARNING messages nfs4: fix a typo of NFS_ATTR_FATTR_GROUP_NAME
2017-03-17xprtrdma: Squelch kbuild sparse complaintChuck Lever1-1/+2
New complaint from kbuild for 4.9.y: net/sunrpc/xprtrdma/verbs.c:489:19: sparse: incompatible types in comparison expression (different type sizes) verbs.c: 489 max_sge = min(ia->ri_device->attrs.max_sge, RPCRDMA_MAX_SEND_SGES); I can't reproduce this running sparse here. Likewise, "make W=1 net/sunrpc/xprtrdma/verbs.o" never indicated any issue. A little poking suggests that because the range of its values is small, gcc can make the actual width of RPCRDMA_MAX_SEND_SGES smaller than the width of an unsigned integer. Fixes: 16f906d66cd7 ("xprtrdma: Reduce required number of send SGEs") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-02Merge tag 'nfs-for-4.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds6-112/+118
Pull NFS client updates from Anna Schumaker: "Highlights include: Stable bugfixes: - NFSv4: Fix memory and state leak in _nfs4_open_and_get_state - xprtrdma: Fix Read chunk padding - xprtrdma: Per-connection pad optimization - xprtrdma: Disable pad optimization by default - xprtrdma: Reduce required number of send SGEs - nlm: Ensure callback code also checks that the files match - pNFS/flexfiles: If the layout is invalid, it must be updated before retrying - NFSv4: Fix reboot recovery in copy offload - Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE" - NFSv4: fix getacl head length estimation - NFSv4: fix getacl ERANGE for sum ACL buffer sizes Features: - Add and use dprintk_cont macros - Various cleanups to NFS v4.x to reduce code duplication and complexity - Remove unused cr_magic related code - Improvements to sunrpc "read from buffer" code - Clean up sunrpc timeout code and allow changing TCP timeout parameters - Remove duplicate mw_list management code in xprtrdma - Add generic functions for encoding and decoding xdr streams Bugfixes: - Clean up nfs_show_mountd_netid - Make layoutreturn_ops static and use NULL instead of 0 to fix sparse warnings - Properly handle -ERESTARTSYS in nfs_rename() - Check if register_shrinker() failed during rpcauth_init() - Properly clean up procfs/pipefs entries - Various NFS over RDMA related fixes - Silence unititialized variable warning in sunrpc" * tag 'nfs-for-4.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (64 commits) NFSv4: fix getacl ERANGE for some ACL buffer sizes NFSv4: fix getacl head length estimation Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE" NFSv4: Fix reboot recovery in copy offload pNFS/flexfiles: If the layout is invalid, it must be updated before retrying NFSv4: Clean up owner/group attribute decode SUNRPC: Add a helper function xdr_stream_decode_string_dup() NFSv4: Remove bogus "struct nfs_client" argument from decode_ace() NFSv4: Fix the underestimation of delegation XDR space reservation NFSv4: Replace callback string decode function with a generic NFSv4: Replace the open coded decode_opaque_inline() with the new generic NFSv4: Replace ad-hoc xdr encode/decode helpers with xdr_stream_* generics SUNRPC: Add generic helpers for xdr_stream encode/decode sunrpc: silence uninitialized variable warning nlm: Ensure callback code also checks that the files match sunrpc: Allow xprt->ops->timer method to sleep xprtrdma: Refactor management of mw_list field xprtrdma: Handle stale connection rejection xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs xprtrdma: Shrink send SGEs array ...