summaryrefslogtreecommitdiff
path: root/net/sunrpc/xprtrdma
AgeCommit message (Collapse)AuthorFilesLines
2014-11-27Merge tag 'nfs-rdma-for-3.19' of ↵Trond Myklebust3-17/+107
git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next Pull NFS client RDMA changes for 3.19 from Anna Schumaker: "NFS: Client side changes for RDMA These patches various bugfixes and cleanups for using NFS over RDMA, including better error handling and performance improvements by using pad optimization. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>" * tag 'nfs-rdma-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma: xprtrdma: Display async errors xprtrdma: Enable pad optimization xprtrdma: Re-write rpcrdma_flush_cqs() xprtrdma: Refactor tasklet scheduling xprtrdma: unmap all FMRs during transport disconnect xprtrdma: Cap req_cqinit xprtrdma: Return an errno from rpcrdma_register_external()
2014-11-25xprtrdma: Display async errorsChuck Lever1-4/+32
An async error upcall is a hard error, and should be reported in the system log. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25xprtrdma: Enable pad optimizationChuck Lever1-1/+1
The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when pad optimization was enabled. That bug was fixed by commit e560e3b510d2 ("svcrdma: Add zero padding if the client doesn't send it"). We can now enable pad optimization on the client, which helps performance and is supported now by both Linux and Solaris servers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25xprtrdma: Re-write rpcrdma_flush_cqs()Chuck Lever1-2/+9
Currently rpcrdma_flush_cqs() attempts to avoid code duplication, and simply invokes rpcrdma_recvcq_upcall and rpcrdma_sendcq_upcall. 1. rpcrdma_flush_cqs() can run concurrently with provider upcalls. Both flush_cqs() and the upcalls were invoking ib_poll_cq() in different threads using the same wc buffers (ep->rep_recv_wcs and ep->rep_send_wcs), added by commit 1c00dd077654 ("xprtrmda: Reduce calls to ib_poll_cq() in completion handlers"). During transport disconnect processing, this sometimes resulted in the same reply getting added to the rpcrdma_tasklets_g list more than once, which corrupted the list. 2. The upcall functions drain only a limited number of CQEs, thanks to the poll budget added by commit 8301a2c047cc ("xprtrdma: Limit work done by completion handler"). Fixes: a7bc211ac926 ("xprtrdma: On disconnect, don't ignore ... ") BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=276 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25xprtrdma: Refactor tasklet schedulingChuck Lever1-5/+12
Restore the separate function that schedules the reply handling tasklet. I need to call it from two different paths. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25xprtrdma: unmap all FMRs during transport disconnectChuck Lever2-2/+42
When using RPCRDMA_MTHCAFMR memory registration, after a few transport disconnect / reconnect cycles, ib_map_phys_fmr() starts to return EINVAL because the provider has exhausted its map pool. Make sure that all FMRs are unmapped during transport disconnect, and that ->send_request remarshals them during an RPC retransmit. This resets the transport's MRs to ensure that none are leaked during a disconnect. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25xprtrdma: Cap req_cqinitChuck Lever2-1/+9
Recent work made FRMR registration and invalidation completions unsignaled. This greatly reduces the adapter interrupt rate. Every so often, however, a posted send Work Request is allowed to signal. Otherwise, the provider's Work Queue will wrap and the workload will hang. The number of Work Requests that are allowed to remain unsignaled is determined by the value of req_cqinit. Currently, this is set to the size of the send Work Queue divided by two, minus 1. For FRMR, the send Work Queue is the maximum number of concurrent RPCs (currently 32) times the maximum number of Work Requests an RPC might use (currently 7, though some adapters may need more). For mlx4, this is 224 entries. This leaves completion signaling disabled for 111 send Work Requests. Some providers hold back dispatching Work Requests until a CQE is generated. If completions are disabled, then no CQEs are generated for quite some time, and that can stall the Work Queue. I've seen this occur running xfstests generic/113 over NFSv4, where eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM because the Work Queue has overflowed. The connection is dropped and re-established. Cap the rep_cqinit setting so completions are not left turned off for too long. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25xprtrdma: Return an errno from rpcrdma_register_external()Chuck Lever1-2/+2
The RPC/RDMA send_request method and the chunk registration code expects an errno from the registration function. This allows the upper layers to distinguish between a recoverable failure (for example, temporary memory exhaustion) and a hard failure (for example, a bug in the registration logic). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25sunrpc: eliminate RPC_DEBUGJeff Layton3-10/+10
It's always set to whatever CONFIG_SUNRPC_DEBUG is, so just use that. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-10-08Merge branch 'for-3.18' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2-1/+8
Pull nfsd updates from Bruce Fields: "Highlights: - support the NFSv4.2 SEEK operation (allowing clients to support SEEK_HOLE/SEEK_DATA), thanks to Anna. - end the grace period early in a number of cases, mitigating a long-standing annoyance, thanks to Jeff - improve SMP scalability, thanks to Trond" * 'for-3.18' of git://linux-nfs.org/~bfields/linux: (55 commits) nfsd: eliminate "to_delegation" define NFSD: Implement SEEK NFSD: Add generic v4.2 infrastructure svcrdma: advertise the correct max payload nfsd: introduce nfsd4_callback_ops nfsd: split nfsd4_callback initialization and use nfsd: introduce a generic nfsd4_cb nfsd: remove nfsd4_callback.cb_op nfsd: do not clear rpc_resp in nfsd4_cb_done_sequence nfsd: fix nfsd4_cb_recall_done error handling nfsd4: clarify how grace period ends nfsd4: stop grace_time update at end of grace period nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls nfsd: serialize nfsdcltrack upcalls for a particular client nfsd: pass extra info in env vars to upcalls to allow for early grace period end nfsd: add a v4_end_grace file to /proc/fs/nfsd lockd: add a /proc/fs/lockd/nlm_end_grace file nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE nfsd: remove redundant boot_time parm from grace_done client tracking op ...
2014-09-29svcrdma: advertise the correct max payloadSteve Wise2-1/+8
Svcrdma currently advertises 1MB, which is too large. The correct value is the minimum of RPCSVC_MAXPAYLOAD and the max scatter-gather allowed in an NFSRDMA IO chunk * the host page size. This bug is usually benign because the Linux X64 NFSRDMA client correctly limits the payload size to the correct value (64*4096 = 256KB). But if the Linux client is PPC64 with a 64KB page size, then the client will indeed use a payload size that will overflow the server. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-25NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()NeilBrown1-2/+0
Now that nfs_release_page() doesn't block indefinitely, other deadlock avoidance mechanisms aren't needed. - it doesn't hurt for kswapd to block occasionally. If it doesn't want to block it would clear __GFP_WAIT. The current_is_kswapd() was only added to avoid deadlocks and we have a new approach for that. - memory allocation in the SUNRPC layer can very rarely try to ->releasepage() a page it is trying to handle. The deadlock is removed as nfs_release_page() doesn't block indefinitely. So we don't need to set PF_FSTRANS for sunrpc network operations any more. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-14Merge tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds4-296/+604
Pull NFS client updates from Trond Myklebust: "Highlights include: - stable fix for a bug in nfs3_list_one_acl() - speed up NFS path walks by supporting LOOKUP_RCU - more read/write code cleanups - pNFS fixes for layout return on close - fixes for the RCU handling in the rpcsec_gss code - more NFS/RDMA fixes" * tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits) nfs: reject changes to resvport and sharecache during remount NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred NFS: fix two problems in lookup_revalidate in RCU-walk NFS: allow lockless access to access_cache NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU NFS: support RCU_WALK in nfs_permission() sunrpc/auth: allow lockless (rcu) lookup of credential cache. NFS: prepare for RCU-walk support but pushing tests later in code. NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used. NFS: add checks for returned value of try_module_get() nfs: clear_request_commit while holding i_lock pnfs: add pnfs_put_lseg_async pnfs: find swapped pages on pnfs commit lists too nfs: fix comment and add warn_on for PG_INODE_REF nfs: check wait_on_bit_lock err in page_group_lock sunrpc: remove "ec" argument from encrypt_v2 operation sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c ...
2014-08-06svcrdma: remove rdma_create_qp() failure recovery logicSteve Wise1-17/+2
In svc_rdma_accept(), if rdma_create_qp() fails, there is useless logic to try and call rdma_create_qp() again with reduced sge depths. The assumption, I guess, was that perhaps the initial sge depths chosen were too big. However they initial depths are selected based on the rdma device attribute max_sge returned from ib_query_device(). If rdma_create_qp() fails, it would not be because the max_send_sge and max_recv_sge values passed in exceed the device's max. So just remove this code. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01xprtrdma: Handle additional connection eventsChuck Lever1-11/+14
Commit 38ca83a5 added RDMA_CM_EVENT_TIMEWAIT_EXIT. But that status is relevant only for consumers that re-use their QPs on new connections. xprtrdma creates a fresh QP on reconnection, so that event should be explicitly ignored. Squelch the alarming "unexpected CM event" message. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macroChuck Lever1-13/+0
Clean up. RPCRDMA_PERSISTENT_REGISTRATION was a compile-time switch between RPCRDMA_REGISTER mode and RPCRDMA_ALLPHYSICAL mode. Since RPCRDMA_REGISTER has been removed, there's no need for the extra conditional compilation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Make rpcrdma_ep_disconnect() return voidChuck Lever3-12/+6
Clean up: The return code is used only for dprintk's that are already redundant. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Schedule reply tasklet once per upcallChuck Lever1-16/+15
Minor optimization: grab rpcrdma_tk_lock_g and disable hard IRQs just once after clearing the receive completion queue. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Allocate each struct rpcrdma_mw separatelyChuck Lever1-99/+143
Currently rpcrdma_buffer_create() allocates struct rpcrdma_mw's as a single contiguous area of memory. It amounts to quite a bit of memory, and there's no requirement for these to be carved from a single piece of contiguous memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Rename frmr_wrChuck Lever1-13/+13
Clean up: Name frmr_wr after the opcode of the Work Request, consistent with the send and local invalidation paths. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Disable completions for LOCAL_INV Work RequestsChuck Lever1-9/+8
Instead of relying on a completion to change the state of an FRMR to FRMR_IS_INVALID, set it in advance. If an error occurs, a completion will fire anyway and mark the FRMR FRMR_IS_STALE. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Disable completions for FAST_REG_MR Work RequestsChuck Lever1-5/+4
Instead of relying on a completion to change the state of an FRMR to FRMR_IS_VALID, set it in advance. If an error occurs, a completion will fire anyway and mark the FRMR FRMR_IS_STALE. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Don't post a LOCAL_INV in rpcrdma_register_frmr_external()Chuck Lever1-20/+2
Any FRMR arriving in rpcrdma_register_frmr_external() is now guaranteed to be either invalid, or to be targeted by a queued LOCAL_INV that will invalidate it before the adapter processes the FAST_REG_MR being built here. The problem with current arrangement of chaining a LOCAL_INV to the FAST_REG_MR is that if the transport is not connected, the LOCAL_INV is flushed and the FAST_REG_MR is flushed. This leaves the FRMR valid with the old rkey. But rpcrdma_register_frmr_external() has already bumped the in-memory rkey. Next time through rpcrdma_register_frmr_external(), a LOCAL_INV and FAST_REG_MR is attempted again because the FRMR is still valid. But the rkey no longer matches the hardware's rkey, and a memory management operation error occurs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Reset FRMRs after a flushed LOCAL_INV Work RequestChuck Lever1-2/+92
When a LOCAL_INV Work Request is flushed, it leaves an FRMR in the VALID state. This FRMR can be returned by rpcrdma_buffer_get(), and must be knocked down in rpcrdma_register_frmr_external() before it can be re-used. Instead, capture these in rpcrdma_buffer_get(), and reset them. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Reset FRMRs when FAST_REG_MR is flushed by a disconnectChuck Lever2-2/+66
FAST_REG_MR Work Requests update a Memory Region's rkey. Rkey's are used to block unwanted access to the memory controlled by an MR. The rkey is passed to the receiver (the NFS server, in our case), and is also used by xprtrdma to invalidate the MR when the RPC is complete. When a FAST_REG_MR Work Request is flushed after a transport disconnect, xprtrdma cannot tell whether the WR actually hit the adapter or not. So it is indeterminant at that point whether the existing rkey is still valid. After the transport connection is re-established, the next FAST_REG_MR or LOCAL_INV Work Request against that MR can sometimes fail because the rkey value does not match what xprtrdma expects. The only reliable way to recover in this case is to deregister and register the MR before it is used again. These operations can be done only in a process context, so handle it in the transport connect worker. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Properly handle exhaustion of the rb_mws listChuck Lever1-32/+71
If the rb_mws list is exhausted, clean up and return NULL so that call_allocate() will delay and try again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Chain together all MWs in same buffer poolChuck Lever2-1/+7
During connection loss recovery, need to visit every MW in a buffer pool. Any MW that is in use by an RPC will not be on the rb_mws list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Back off rkey when FAST_REG_MR failsChuck Lever1-0/+1
If posting a FAST_REG_MR Work Reqeust fails, revert the rkey update to avoid subsequent IB_WC_MW_BIND_ERR completions. Suggested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Unclutter struct rpcrdma_mr_segChuck Lever2-28/+50
Clean ups: - make it obvious that the rl_mw field is a pointer -- allocated separately, not as part of struct rpcrdma_mr_seg - promote "struct {} frmr;" to a named type - promote the state enum to a named type - name the MW state field the same way other fields in rpcrdma_mw are named Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Don't invalidate FRMRs if registration failsChuck Lever1-3/+5
If FRMR registration fails, it's likely to transition the QP to the error state. Or, registration may have failed because the QP is _already_ in ERROR. Thus calling rpcrdma_deregister_external() in rpcrdma_create_chunks() is useless in FRMR mode: the LOCAL_INVs just get flushed. It is safe to leave existing registrations: when FRMR registration is tried again, rpcrdma_register_frmr_external() checks if each FRMR is already/still VALID, and knocks it down first if it is. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: On disconnect, don't ignore pending CQEsChuck Lever1-5/+9
xprtrdma is currently throwing away queued completions during a reconnect. RPC replies posted just before connection loss, or successful completions that change the state of an FRMR, can be missed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Update rkeys after transport reconnectChuck Lever3-42/+54
Various reports of: rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0 ep ffff8800bfd3e848 Ensure that rkeys in already-marshalled RPC/RDMA headers are refreshed after the QP has been replaced by a reconnect. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=249 Suggested-by: Selvin Xavier <Selvin.Xavier@Emulex.Com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Limit data payload size for ALLPHYSICALChuck Lever3-1/+45
When the client uses physical memory registration, each page in the payload gets its own array entry in the RPC/RDMA header's chunk list. Therefore, don't advertise a maximum payload size that would require more array entries than can fit in the RPC buffer where RPC/RDMA headers are built. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=248 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Protect ia->ri_id when unmapping/invalidating MRsChuck Lever2-6/+18
Ensure ia->ri_id remains valid while invoking dma_unmap_page() or posting LOCAL_INV during a transport reconnect. Otherwise, ia->ri_id->device or ia->ri_id->qp is NULL, which triggers a panic. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=259 Fixes: ec62f40 'xprtrdma: Ensure ia->ri_id->qp is not NULL when reconnecting' Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-08-01xprtrdma: Fix panic in rpcrdma_register_frmr_external()Chuck Lever1-5/+7
seg1->mr_nsegs is not yet initialized when it is used to unmap segments during an error exit. Use the same unmapping logic for all error exits. "if (frmr_wr.wr.fast_reg.length < len) {" used to be a BUG_ON check. The broken code will never be executed under normal operation. Fixes: c977dea (xprtrdma: Remove BUG_ON() call sites) Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-23svcrdma: Add zero padding if the client doesn't send itChuck Lever1-0/+28
See RFC 5666 section 3.7: clients don't have to send zero XDR padding. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=246 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-22xprtrdma: Fix DMA-API-DEBUG warning by checking dma_map resultYan Burman1-0/+3
Fix the following warning when DMA-API debug is enabled by checking ib_dma_map_single result: [ 1455.345548] ------------[ cut here ]------------ [ 1455.346863] WARNING: CPU: 3 PID: 3929 at /home/yanb/kernel/net-next/lib/dma-debug.c:1140 check_unmap+0x4e5/0x990() [ 1455.349350] mlx4_core 0000:00:07.0: DMA-API: device driver failed to check map error[device address=0x000000007c9f2090] [size=2656 bytes] [mapped as single] [ 1455.349350] Modules linked in: xprtrdma netconsole configfs nfsv3 nfs_acl ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm autofs4 auth_rpcgss oid_registry nfsv4 nfs fscache lockd sunrpc dm_mirror dm_region_hash dm_log microcode pcspkr mlx4_ib ib_sa ib_mad ib_core ib_addr mlx4_en ipv6 ptp pps_core vxlan mlx4_core virtio_balloon cirrus ttm drm_kms_helper drm sysimgblt sysfillrect syscopyarea i2c_piix4 i2c_core button ext3 jbd virtio_blk virtio_net virtio_pci virtio_ring virtio uhci_hcd ata_generic ata_piix libata [ 1455.349350] CPU: 3 PID: 3929 Comm: mount.nfs Not tainted 3.15.0-rc1-dbg+ #13 [ 1455.349350] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [ 1455.349350] 0000000000000474 ffff880069dcf628 ffffffff8151c341 ffffffff817b69d8 [ 1455.349350] ffff880069dcf678 ffff880069dcf668 ffffffff8105b5fc 0000000069dcf658 [ 1455.349350] ffff880069dcf778 ffff88007b0c9f00 ffffffff8255ec40 0000000000000a60 [ 1455.349350] Call Trace: [ 1455.349350] [<ffffffff8151c341>] dump_stack+0x52/0x81 [ 1455.349350] [<ffffffff8105b5fc>] warn_slowpath_common+0x8c/0xc0 [ 1455.349350] [<ffffffff8105b6e6>] warn_slowpath_fmt+0x46/0x50 [ 1455.349350] [<ffffffff812e6305>] check_unmap+0x4e5/0x990 [ 1455.349350] [<ffffffff81521fb0>] ? _raw_spin_unlock_irq+0x30/0x60 [ 1455.349350] [<ffffffff812e6a0a>] debug_dma_unmap_page+0x5a/0x60 [ 1455.349350] [<ffffffffa0389583>] rpcrdma_deregister_internal+0xb3/0xd0 [xprtrdma] [ 1455.349350] [<ffffffffa038a639>] rpcrdma_buffer_destroy+0x69/0x170 [xprtrdma] [ 1455.349350] [<ffffffffa03872ff>] xprt_rdma_destroy+0x3f/0xb0 [xprtrdma] [ 1455.349350] [<ffffffffa04a95ff>] xprt_destroy+0x6f/0x80 [sunrpc] [ 1455.349350] [<ffffffffa04a9625>] xprt_put+0x15/0x20 [sunrpc] [ 1455.349350] [<ffffffffa04a899a>] rpc_free_client+0x8a/0xe0 [sunrpc] [ 1455.349350] [<ffffffffa04a8a58>] rpc_release_client+0x68/0xa0 [sunrpc] [ 1455.349350] [<ffffffffa04a9060>] rpc_shutdown_client+0xb0/0xc0 [sunrpc] [ 1455.349350] [<ffffffffa04a8f5d>] ? rpc_ping+0x5d/0x70 [sunrpc] [ 1455.349350] [<ffffffffa04a91ab>] rpc_create_xprt+0xbb/0xd0 [sunrpc] [ 1455.349350] [<ffffffffa04a9273>] rpc_create+0xb3/0x160 [sunrpc] [ 1455.349350] [<ffffffff81129749>] ? __probe_kernel_read+0x69/0xb0 [ 1455.349350] [<ffffffffa053851c>] nfs_create_rpc_client+0xdc/0x100 [nfs] [ 1455.349350] [<ffffffffa0538cfa>] nfs_init_client+0x3a/0x90 [nfs] [ 1455.349350] [<ffffffffa05391c8>] nfs_get_client+0x478/0x5b0 [nfs] [ 1455.349350] [<ffffffffa0538e50>] ? nfs_get_client+0x100/0x5b0 [nfs] [ 1455.349350] [<ffffffff81172c6d>] ? kmem_cache_alloc_trace+0x24d/0x260 [ 1455.349350] [<ffffffffa05393f3>] nfs_create_server+0xf3/0x4c0 [nfs] [ 1455.349350] [<ffffffffa0545ff0>] ? nfs_request_mount+0xf0/0x1a0 [nfs] [ 1455.349350] [<ffffffffa031c0c3>] nfs3_create_server+0x13/0x30 [nfsv3] [ 1455.349350] [<ffffffffa0546293>] nfs_try_mount+0x1f3/0x230 [nfs] [ 1455.349350] [<ffffffff8108ea21>] ? get_parent_ip+0x11/0x50 [ 1455.349350] [<ffffffff812d6343>] ? __this_cpu_preempt_check+0x13/0x20 [ 1455.349350] [<ffffffff810d632b>] ? try_module_get+0x6b/0x190 [ 1455.349350] [<ffffffffa05449f7>] nfs_fs_mount+0x187/0x9d0 [nfs] [ 1455.349350] [<ffffffffa0545940>] ? nfs_clone_super+0x140/0x140 [nfs] [ 1455.349350] [<ffffffffa0543b20>] ? nfs_auth_info_match+0x40/0x40 [nfs] [ 1455.349350] [<ffffffff8117e360>] mount_fs+0x20/0xe0 [ 1455.349350] [<ffffffff811a1c16>] vfs_kern_mount+0x76/0x160 [ 1455.349350] [<ffffffff811a29a8>] do_mount+0x428/0xae0 [ 1455.349350] [<ffffffff811a30f0>] SyS_mount+0x90/0xe0 [ 1455.349350] [<ffffffff8152af52>] system_call_fastpath+0x16/0x1b [ 1455.349350] ---[ end trace f1f31572972e211d ]--- Signed-off-by: Yan Burman <yanb@mellanox.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-18svcrdma: Select NFSv4.1 backchannel transport based on forward channelChuck Lever1-0/+1
The current code always selects XPRT_TRANSPORT_BC_TCP for the back channel, even when the forward channel was not TCP (eg, RDMA). When a 4.1 mount is attempted with RDMA, the server panics in the TCP BC code when trying to send CB_NULL. Instead, construct the transport protocol number from the forward channel transport or'd with XPRT_TRANSPORT_BC. Transports that do not support bi-directional RPC will not have registered a "BC" transport, causing create_backchannel_client() to fail immediately. Fixes: https://bugzilla.linux-nfs.org/show_bug.cgi?id=265 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11svcrdma: send_write() must not overflow the device's max sgeSteve Wise1-24/+15
Function send_write() must stop creating sges when it reaches the device max and return the amount sent in the RDMA Write to the caller. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-11Merge tag 'nfs-for-3.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds4-583/+396
Pull NFS client updates from Trond Myklebust: "Highlights include: - massive cleanup of the NFS read/write code by Anna and Dros - support multiple NFS read/write requests per page in order to deal with non-page aligned pNFS striping. Also cleans up the r/wsize < page size code nicely. - stable fix for ensuring inode is declared uptodate only after all the attributes have been checked. - stable fix for a kernel Oops when remounting - NFS over RDMA client fixes - move the pNFS files layout driver into its own subdirectory" * tag 'nfs-for-3.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits) NFS: populate ->net in mount data when remounting pnfs: fix lockup caused by pnfs_generic_pg_test NFSv4.1: Fix typo in dprintk NFSv4.1: Comment is now wrong and redundant to code NFS: Use raw_write_seqcount_begin/end int nfs4_reclaim_open_state xprtrdma: Disconnect on registration failure xprtrdma: Remove BUG_ON() call sites xprtrdma: Avoid deadlock when credit window is reset SUNRPC: Move congestion window constants to header file xprtrdma: Reset connection timeout after successful reconnect xprtrdma: Use macros for reconnection timeout constants xprtrdma: Allocate missing pagelist xprtrdma: Remove Tavor MTU setting xprtrdma: Ensure ia->ri_id->qp is not NULL when reconnecting xprtrdma: Reduce the number of hardway buffer allocations xprtrdma: Limit work done by completion handler xprtrmda: Reduce calls to ib_poll_cq() in completion handlers xprtrmda: Reduce lock contention in completion handlers xprtrdma: Split the completion queue xprtrdma: Make rpcrdma_ep_destroy() return void ...
2014-06-07svcrdma: Fence LOCAL_INV work requestsSteve Wise1-1/+1
Fencing forces the invalidate to only happen after all prior send work requests have been completed. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Reported by : Devesh Sharma <Devesh.Sharma@Emulex.Com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-07svcrdma: refactor marshalling logicSteve Wise3-604/+331
This patch refactors the NFSRDMA server marshalling logic to remove the intermediary map structures. It also fixes an existing bug where the NFSRDMA server was not minding the device fast register page list length limitations. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
2014-06-04xprtrdma: Disconnect on registration failureChuck Lever2-23/+42
If rpcrdma_register_external() fails during request marshaling, the current RPC request is killed. Instead, this RPC should be retried after reconnecting the transport instance. The most likely reason for registration failure with FRMR is a failed post_send, which would be due to a remote transport disconnect or memory exhaustion. These issues can be recovered by a retry. Problems encountered in the marshaling logic itself will not be corrected by trying again, so these should still kill a request. Now that we've added a clean exit for marshaling errors, take the opportunity to defang some BUG_ON's. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-06-04xprtrdma: Remove BUG_ON() call sitesChuck Lever2-9/+12
If an error occurs in the marshaling logic, fail the RPC request being processed, but leave the client running. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-06-04xprtrdma: Avoid deadlock when credit window is resetChuck Lever3-19/+7
Update the cwnd while processing the server's reply. Otherwise the next task on the xprt_sending queue is still subject to the old credit window. Currently, no task is awoken if the old congestion window is still exceeded, even if the new window is larger, and a deadlock results. This is an issue during a transport reconnect. Servers don't normally shrink the credit window, but the client does reset it to 1 when reconnecting so the server can safely grow it again. As a minor optimization, remove the hack of grabbing the initial cwnd size (which happens to be RPC_CWNDSCALE) and using that value as the congestion scaling factor. The scaling value is invariant, and we are better off without the multiplication operation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-06-04xprtrdma: Reset connection timeout after successful reconnectChuck Lever1-0/+1
If the new connection is able to make forward progress, reset the re-establish timeout. Otherwise it keeps growing even if disconnect events are rare. The same behavior as TCP is adopted: reconnect immediately if the transport instance has been able to make some forward progress. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-06-04xprtrdma: Use macros for reconnection timeout constantsChuck Lever1-7/+12
Clean up: Ensure the same max and min constant values are used everywhere when setting reconnect timeouts. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-06-04xprtrdma: Allocate missing pagelistShirley Ma1-0/+6
GETACL relies on transport layer to alloc memory for reply buffer. However xprtrdma assumes that the reply buffer (pagelist) has been pre-allocated in upper layer. This problem was reported by IOL OFA lab test on PPC. Signed-off-by: Shirley Ma <shirley.ma@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Edward Mossman <emossman@iol.unh.edu> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-06-04xprtrdma: Remove Tavor MTU settingChuck Lever1-14/+0
Clean up. Remove HCA-specific clutter in xprtrdma, which is supposed to be device-independent. Hal Rosenstock <hal@dev.mellanox.co.il> observes: > Note that there is OpenSM option (enable_quirks) to return 1K MTU > in SA PathRecord responses for Tavor so that can be used for this. > The default setting for enable_quirks is FALSE so that would need > changing. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-06-04xprtrdma: Ensure ia->ri_id->qp is not NULL when reconnectingChuck Lever1-9/+20
Devesh Sharma <Devesh.Sharma@Emulex.Com> reports that after a disconnect, his HCA is failing to create a fresh QP, leaving ia_ri->ri_id->qp set to NULL. But xprtrdma still allows RPCs to wake up and post LOCAL_INV as they exit, causing an oops. rpcrdma_ep_connect() is allowing the wake-up by leaking the QP creation error code (-EPERM in this case) to the RPC client's generic layer. xprt_connect_status() does not recognize -EPERM, so it kills pending RPC tasks immediately rather than retrying the connect. Re-arrange the QP creation logic so that when it fails on reconnect, it leaves ->qp with the old QP rather than NULL. If pending RPC tasks wake and exit, LOCAL_INV work requests will flush rather than oops. On initial connect, leaving ->qp == NULL is OK, since there are no pending RPCs that might use ->qp. But be sure not to try to destroy a NULL QP when rpcrdma_ep_connect() is retried. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>