summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-04-24Revert "NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"Chuck Lever1-18/+2
The reverted commit attempted to enable NFSD to retransmit pending callback operations if an NFS client disconnects, but unintentionally introduces a hazardous behavior regression if the client becomes permanently unreachable while callback operations are still pending. A disconnect can occur due to network partition or if the NFS server needs to force the NFS client to retransmit (for example, if a GSS window under-run occurs). Reverting the commit will make NFSD behave the same as it did in v6.8 and before. Pending callback operations are permanently lost if the client connection is terminated before the client receives them. For some callback operations, this loss is not harmful. However, for CB_RECALL, the loss means a delegation might be revoked unnecessarily. For CB_OFFLOAD, pending COPY operations will never complete unless the NFS client subsequently sends an OFFLOAD_STATUS operation, which the Linux NFS client does not currently implement. These issues still need to be addressed somehow. Reported-by: Dai Ngo <dai.ngo@oracle.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218735 Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-20Revert "svcrdma: Add Write chunk WRs to the RPC's Send WR chain"Chuck Lever3-78/+26
Performance regression reported with NFS/RDMA using Omnipath, bisected to commit e084ee673c77 ("svcrdma: Add Write chunk WRs to the RPC's Send WR chain"). Tracing on the server reports: nfsd-7771 [060] 1758.891809: svcrdma_sq_post_err: cq.id=205 cid=226 sc_sq_avail=13643/851 status=-12 sq_post_err reports ENOMEM, and the rdma->sc_sq_avail (13643) is larger than rdma->sc_sq_depth (851). The number of available Send Queue entries is always supposed to be smaller than the Send Queue depth. That seems like a Send Queue accounting bug in svcrdma. As it's getting to be late in the 6.9-rc cycle, revert this commit. It can be revisited in a subsequent kernel release. Link: https://bugzilla.kernel.org/show_bug.cgi?id=218743 Fixes: e084ee673c77 ("svcrdma: Add Write chunk WRs to the RPC's Send WR chain") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-11NFSD: fix endianness issue in nfsd4_encode_fattr4Vasily Gorbik1-24/+23
The nfs4 mount fails with EIO on 64-bit big endian architectures since v6.7. The issue arises from employing a union in the nfsd4_encode_fattr4() function to overlay a 32-bit array with a 64-bit values based bitmap, which does not function as intended. Address the endianness issue by utilizing bitmap_from_arr32() to copy 32-bit attribute masks into a bitmap in an endianness-agnostic manner. Cc: stable@vger.kernel.org Fixes: fce7913b13d0 ("NFSD: Use a bitmask loop to encode FATTR4 results") Link: https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2060217 Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-11SUNRPC: Fix rpcgss_context trace event acceptor fieldSteven Rostedt (Google)1-2/+2
The rpcgss_context trace event acceptor field is a dynamically sized string that records the "data" parameter. But this parameter is also dependent on the "len" field to determine the size of the data. It needs to use __string_len() helper macro where the length can be passed in. It also incorrectly uses strncpy() to save it instead of __assign_str(). As these macros can change, it is not wise to open code them in trace events. As of commit c759e609030c ("tracing: Remove __assign_str_len()"), __assign_str() can be used for both __string() and __string_len() fields. Before that commit, __assign_str_len() is required to be used. This needs to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string") is the commit that makes __string_str_len() obsolete). Cc: stable@vger.kernel.org Fixes: 0c77668ddb4e ("SUNRPC: Introduce trace points in rpc_auth_gss.ko") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-05nfsd: hold a lighter-weight client reference over CB_RECALL_ANYJeff Layton1-5/+2
Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the client. While a callback job is technically an RPC that counter is really more for client-driven RPCs, and this has the effect of preventing the client from being unhashed until the callback completes. If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we can end up in a situation where the callback can't complete on the (now dead) callback channel, but the new client can't connect because the old client can't be unhashed. This usually manifests as a NFS4ERR_DELAY return on the CREATE_SESSION operation. The job is only holding a reference to the client so it can clear a flag after the RPC completes. Fix this by having CB_RECALL_ANY instead hold a reference to the cl_nfsdfs.cl_ref. Typically we only take that sort of reference when dealing with the nfsdfs info files, but it should work appropriately here to ensure that the nfs4_client doesn't disappear. Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition") Reported-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-04SUNRPC: Fix a slow server-side memory leak with RPC-over-TCPChuck Lever1-9/+1
Jan Schunk reports that his small NFS servers suffer from memory exhaustion after just a few days. A bisect shows that commit e18e157bb5c8 ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call") is the first bad commit. That commit assumed that sock_sendmsg() releases all the pages in the underlying bio_vec array, but the reality is that it doesn't. svc_xprt_release() releases the rqst's response pages, but the record marker page fragment isn't one of those, so it is never released. This is a narrow fix that can be applied to stable kernels. A more extensive fix is in the works. Reported-by: Jan Schunk <scpcom@gmx.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218671 Fixes: e18e157bb5c8 ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call") Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: David Howells <dhowells@redhat.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-27NFSD: CREATE_SESSION must never cache NFS4ERR_DELAY repliesChuck Lever1-11/+25
There are one or two cases where CREATE_SESSION returns NFS4ERR_DELAY in order to force the client to wait a bit and try CREATE_SESSION again. However, after commit e4469c6cc69b ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation"), NFSD caches that response in the CREATE_SESSION slot. Thus, when the client resends the CREATE_SESSION, the server always returns the cached NFS4ERR_DELAY response rather than actually executing the request and properly recording its outcome. This blocks the client from making further progress. RFC 8881 Section 15.1.1.3 says: > If NFS4ERR_DELAY is returned on an operation other than SEQUENCE > that validly appears as the first operation of a request ... [t]he > request can be retried in full without modification. In this case > as well, the replier MUST avoid returning a response containing > NFS4ERR_DELAY as the response to an initial operation of a request > solely on the basis of its presence in the reply cache. Neither the original NFSD code nor the discussion in section 18.36.4 refer explicitly to this important requirement, so I missed it. Note also that not only must the server not cache NFS4ERR_DELAY, but it has to not advance the CREATE_SESSION slot sequence number so that it can properly recognize and accept the client's retry. Reported-by: Dai Ngo <dai.ngo@oracle.com> Fixes: e4469c6cc69b ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation") Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-22SUNRPC: Revert 561141dd494382217bace4d1a51d08168420eaceChuck Lever1-6/+8
Scott reports an occasional scatterlist BUG that is triggered by the RFC 8009 Kunit test, then says: > Looking through the git history of the auth_gss code, there are various > places where static buffers were replaced by dynamically allocated ones > because they're being used with scatterlists. Reported-by: Scott Mayhew <smayhew@redhat.com> Fixes: 561141dd4943 ("SUNRPC: Use a static buffer for the checksum initialization vector") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-22nfsd: Fix error cleanup path in nfsd_rename()Jan Kara1-1/+2
Commit a8b0026847b8 ("rename(): avoid a deadlock in the case of parents having no common ancestor") added an error bail out path. However this path does not drop the remount protection that has been acquired. Fix the cleanup path to properly drop the remount protection. Fixes: a8b0026847b8 ("rename(): avoid a deadlock in the case of parents having no common ancestor") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-09NFSD: Clean up nfsd4_encode_replay()Chuck Lever2-16/+31
Replace open-coded encoding logic with the use of conventional XDR utility functions. Add a tracepoint to make replays observable in field troubleshooting situations. The WARN_ON is removed. A stack trace is of little use, as there is only one call site for nfsd4_encode_replay(), and a buffer length shortage here is unlikely. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-05NFSD: send OP_CB_RECALL_ANY to clients when number of delegations reaches ↵Dai Ngo1-0/+3
its limit The NFS server should ask clients to voluntarily return unused delegations when the number of granted delegations reaches the max_delegations. This is so that the server can continue to grant delegations for new requests. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-05NFSD: Document nfsd_setattr() fill-attributes behaviorChuck Lever1-0/+7
Add an explanation to prevent the future removal of the fill- attribute call sites in nfsd_setattr(). Some NFSv3 client implementations don't behave correctly if wcc data is not present in an NFSv3 SETATTR reply. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr()Trond Myklebust8-21/+25
The main point of the guarded SETATTR is to prevent races with other WRITE and SETATTR calls. That requires that the check of the guard time against the inode ctime be done after taking the inode lock. Furthermore, we need to take into account the 32-bit nature of timestamps in NFSv3, and the possibility that files may change at a faster rate than once a second. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Fix a regression in nfsd_setattr()Trond Myklebust2-2/+11
Commit bb4d53d66e4b ("NFSD: use (un)lock_inode instead of fh_(un)lock for file operations") broke the NFSv3 pre/post op attributes behaviour when doing a SETATTR rpc call by stripping out the calls to fh_fill_pre_attrs() and fh_fill_post_attrs(). Fixes: bb4d53d66e4b ("NFSD: use (un)lock_inode instead of fh_(un)lock for file operations") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Message-ID: <20240216012451.22725-1-trondmy@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: OP_CB_RECALL_ANY should recall both read and write delegationsDai Ngo1-0/+2
Add RCA4_TYPE_MASK_WDATA_DLG to ra_bmval bitmask of OP_CB_RECALL_ANY Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: handle GETATTR conflict with write delegationDai Ngo4-13/+123
If the GETATTR request on a file that has write delegation in effect and the request attributes include the change info and size attribute then the request is handled as below: Server sends CB_GETATTR to client to get the latest change info and file size. If these values are the same as the server's cached values then the GETATTR proceeds as normal. If either the change info or file size is different from the server's cached values, or the file was already marked as modified, then: . update time_modify and time_metadata into file's metadata with current time . encode GETATTR as normal except the file size is encoded with the value returned from CB_GETATTR . mark the file as modified If the CB_GETATTR fails for any reasons, the delegation is recalled and NFS4ERR_DELAY is returned for the GETATTR. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: add support for CB_GETATTR callbackDai Ngo3-1/+128
Includes: . CB_GETATTR proc for nfs4_cb_procedures[] . XDR encoding and decoding function for CB_GETATTR request/reply . add nfs4_cb_fattr to nfs4_delegation for sending CB_GETATTR and store file attributes from client's reply. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Document the phases of CREATE_SESSIONChuck Lever1-0/+6
As described in RFC 8881 Section 18.36.4, CREATE_SESSION can be split into four phases. NFSD's implementation now does it like that description. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Fix the NFSv4.1 CREATE_SESSION operationChuck Lever1-26/+31
RFC 8881 Section 18.36.4 discusses the implementation of the NFSv4.1 CREATE_SESSION operation. The section defines four phases of operation. Phase 2 processes the CREATE_SESSION sequence ID. As a separate step, Phase 3 evaluates the CREATE_SESSION arguments. The problem we are concerned with is when phase 2 is successful but phase 3 fails. The spec language in this case is "No changes are made to any client records on the server." RFC 8881 Section 18.35.4 defines a "client record", and it does /not/ contain any details related to the special CREATE_SESSION slot. Therefore NFSD is incorrect to skip incrementing the CREATE_SESSION sequence id when phase 3 (see Section 18.36.4) of CREATE_SESSION processing fails. In other words, even though NFSD happens to store the cs_slot in a client record, in terms of the protocol the slot is logically separate from the client record. Three complications: 1. The world has moved on since commit 86c3e16cc7aa ("nfsd4: confirm only on succesful create_session") broke this. So we can't simply revert that commit. 2. NFSD's CREATE_SESSION implementation does not cleanly delineate the logic of phases 2 and 3. So this won't be a surgical fix. 3. Because of the way it currently handles the CREATE_SESSION slot sequence number, nfsd4_create_session() isn't caching error responses in the CREATE_SESSION slot. Instead of replaying the response cache in those cases, it's executing the transaction again. Reorganize the CREATE_SESSION slot sequence number accounting. This requires that error responses are appropriately cached in the CREATE_SESSION slot (once it is found). Reported-by: Connor Smith <connor.smith@hitachivantara.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218382 Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: clean up comments over nfs4_client definitionChen Hanxiao1-3/+4
nfsd fault injection has been deprecated since commit 9d60d93198c6 ("Deprecate nfsd fault injection") and removed by commit e56dc9e2949e ("nfsd: remove fault injection code") So remove the outdated parts about fault injection. Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Add Write chunk WRs to the RPC's Send WR chainChuck Lever3-26/+78
Chain RDMA Writes that convey Write chunks onto the local Send chain. This means all WRs for an RPC Reply are now posted with a single ib_post_send() call, and there is a single Send completion when all of these are done. That reduces both the per-transport doorbell rate and completion rate. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Post WRs for Write chunks in svc_rdma_sendto()Chuck Lever3-41/+51
Refactor to eventually enable svcrdma to post the Write WRs for each RPC response using the same ib_post_send() as the Send WR (ie, as a single WR chain). svc_rdma_result_payload (originally svc_rdma_read_payload) was added so that the upper layer XDR encoder could identify a range of bytes to be possibly conveyed by RDMA (if a Write chunk was provided by the client). The purpose of commit f6ad77590a5d ("svcrdma: Post RDMA Writes while XDR encoding replies") was to post as much of the result payload outside of svc_rdma_sendto() as possible because svc_rdma_sendto() used to be called with the xpt_mutex held. However, since commit ca4faf543a33 ("SUNRPC: Move xpt_mutex into socket xpo_sendto methods"), the xpt_mutex is no longer held when calling svc_rdma_sendto(). Thus, that benefit is no longer an issue. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Post the Reply chunk and Send WR togetherChuck Lever3-39/+66
Reduce the doorbell and Send completion rates when sending RPC/RDMA replies that have Reply chunks. NFS READDIR procedures typically return their result in a Reply chunk, for example. Instead of calling ib_post_send() to post the Write WRs for the Reply chunk, and then calling it again to post the Send WR that conveys the transport header, chain the Write WRs to the Send WR and call ib_post_send() only once. Thanks to the Send Queue completion ordering rules, when the Send WR completes, that guarantees that Write WRs posted before it have also completed successfully. Thus all Write WRs for the Reply chunk can remain unsignaled. Instead of handling a Write completion and then a Send completion, only the Send completion is seen, and it handles clean up for both the Writes and the Send. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Move write_info for Reply chunks into struct svc_rdma_send_ctxtChuck Lever4-40/+82
Since the RPC transaction's svc_rdma_send_ctxt will stay around for the duration of the RDMA Write operation, the write_info structure for the Reply chunk can reside in the request's svc_rdma_send_ctxt instead of being allocated separately. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Post Send WR chainChuck Lever3-19/+38
Eventually I'd like the server to post the reply's Send WR along with any Write WRs using only a single call to ib_post_send(), in order to reduce the NIC's doorbell rate. To do this, add an anchor for a WR chain to svc_rdma_send_ctxt, and refactor svc_rdma_send() to post this WR chain to the Send Queue. For the moment, the posted chain will continue to contain a single Send WR. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Fix retry loop in svc_rdma_send()Chuck Lever1-11/+17
Don't call ib_post_send() at all if the transport is already shutting down. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Prevent a UAF in svc_rdma_send()Chuck Lever1-3/+8
In some error flow cases, svc_rdma_wc_send() releases @ctxt. Copy the sc_cid field in @ctxt to a stack variable in order to guarantee that the value is available after the ib_post_send() call. In case the new comment looks a little strange, this will be done with at least one more field in a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Fix SQ wake-upsChuck Lever1-3/+3
Ensure there is a wake-up when increasing sc_sq_avail. Likewise, if a wake-up is done, sc_sq_avail needs to be updated, otherwise the wait_event() conditional is never going to be met. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Increase the per-transport rw_ctx countChuck Lever1-2/+7
rdma_rw_mr_factor() returns the smallest number of MRs needed to move a particular number of pages. svcrdma currently asks for the number of MRs needed to move RPCSVC_MAXPAGES (a little over one megabyte), as that is the number of pages in the largest r/wsize the server supports. This call assumes that the client's NIC can bundle a full one megabyte payload in a single rdma_segment. In fact, most NICs cannot handle a full megabyte with a single rkey / rdma_segment. Clients will typically split even a single Read chunk into many segments. The server needs one MR to read each rdma_segment in a Read chunk, and thus each one needs an rw_ctx. svcrdma has been vastly underestimating the number of rw_ctxs needed to handle 64 RPC requests with large Read chunks using small rdma_segments. Unfortunately there doesn't seem to be a good way to estimate this number without knowing the client NIC's capabilities. Even then, the client RPC/RDMA implementation is still free to split a chunk into smaller segments (for example, it might be using physical registration, which needs an rdma_segment per page). The best we can do for now is choose a number that will guarantee forward progress in the worst case (one page per segment). At some later point, we could add some mechanisms to make this much less of a problem: - Add a core API to add more rw_ctxs to an already-established QP - svcrdma could treat rw_ctx exhaustion as a temporary error and try again - Limit the number of Reads in flight Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Update max_send_sges after QP is createdChuck Lever1-0/+1
rdma_create_qp() can modify cap.max_send_sges. Copy the new value to the svcrdma transport so it is bound by the new limit instead of the requested one. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Report CQ depths in debugging outputChuck Lever1-1/+2
Check that svc_rdma_accept() is allocating an appropriate number of CQEs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01svcrdma: Reserve an extra WQE for ib_drain_rq()Chuck Lever1-1/+1
Do as other ULPs already do: ensure there is an extra Receive WQE reserved for the tear-down drain WR. I haven't heard reports of problems but it can't hurt. Note that rq_depth is used to compute the Send Queue depth as well, so this fix should affect both the SQ and RQ. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01MAINTAINERS: add Alex Aring as Reviewer for file locking codeJeff Layton1-0/+1
Alex helps co-maintain the DLM code and did some recent work to fix up how lockd and GFS2 work together. Add him as a Reviewer for file locking changes. Acked-by: Alexander Aring <aahringo@redhat.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Simplify the allocation of slab caches in nfsd4_init_slabsKunwu Chan1-14/+7
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Make the code cleaner and more readable. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Simplify the allocation of slab caches in nfsd_drc_slab_createKunwu Chan1-2/+1
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. And change cache name from 'nfsd_drc' to 'nfsd_cacherep'. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Simplify the allocation of slab caches in nfsd_file_cache_initKunwu Chan1-4/+2
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Simplify the allocation of slab caches in nfsd4_init_pnfsKunwu Chan1-4/+2
commit 0a31bd5f2bbb ("KMEM_CACHE(): simplify slab cache creation") introduces a new macro. Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: don't call locks_release_private() twice concurrentlyNeilBrown1-1/+1
It is possible for free_blocked_lock() to be called twice concurrently, once from nfsd4_lock() and once from nfsd4_release_lockowner() calling remove_blocked_locks(). This is why a kref was added. It is perfectly safe for locks_delete_block() and kref_put() to be called in parallel as they use locking or atomicity respectively as protection. However locks_release_private() has no locking. It is safe for it to be called twice sequentially, but not concurrently. This patch moves that call from free_blocked_lock() where it could race with itself, to free_nbl() where it cannot. This will slightly delay the freeing of private info or release of the owner - but not by much. It is arguably more natural for this freeing to happen in free_nbl() where the structure itself is freed. This bug was found by code inspection - it has not been seen in practice. Fixes: 47446d74f170 ("nfsd4: add refcount for nfsd4_blocked_lock") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow layout state to be admin-revoked.NeilBrown4-16/+50
When there is layout state on a filesystem that is being "unlocked" that is now revoked, which involves closing the nfsd_file and releasing the vfs lease. To avoid races, ->ls_file can now be accessed either: - under ->fi_lock for the state's sc_file or - under rcu_read_lock() if nfsd_file_get() is used. To support this, ->fence_client and nfsd4_cb_layout_fail() now take a second argument being the nfsd_file. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow delegation state ids to be revoked and then freedNeilBrown1-3/+25
Revoking state through 'unlock_filesystem' now revokes any delegation states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. As there is already support for revoking delegations, we build on that for admin-revoking. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow open state ids to be revoked and then freedNeilBrown1-1/+24
Revoking state through 'unlock_filesystem' now revokes any open states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. Possibly the related lock states should be revoked too, but a subsequent patch will do that for all lock state on the superblock. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow lock state ids to be revoked and then freedNeilBrown1-1/+39
Revoking state through 'unlock_filesystem' now revokes any lock states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow admin-revoked NFSv4.0 state to be freed.NeilBrown2-1/+101
For NFSv4.1 and later the client easily discovers if there is any admin-revoked state and will then find and explicitly free it. For NFSv4.0 there is no such mechanism. The client can only find that state is admin-revoked if it tries to use that state, and there is no way for it to explicitly free the state. So the server must hold on to the stateid (at least) for an indefinite amount of time. A RELEASE_LOCKOWNER request might justify forgetting some of these stateids, as would the whole clients lease lapsing, but these are not reliable. This patch takes two approaches. Whenever a client uses an revoked stateid, that stateid is then discarded and will not be recognised again. This might confuse a client which expect to get NFS4ERR_ADMIN_REVOKED consistently once it get it at all, but should mostly work. Hopefully one error will lead to other resources being closed (e.g. process exits), which will result in more stateid being freed when a CLOSE attempt gets NFS4ERR_ADMIN_REVOKED. Also, any admin-revoked stateids that have been that way for more than one lease time are periodically revoke. No actual freeing of state happens in this patch. That will come in future patches which handle the different sorts of revoked state. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: report in /proc/fs/nfsd/clients/*/states when state is admin-revokeNeilBrown1-1/+9
Add "admin-revoked" to the status information for any states that have been admin-revoked. This can be useful for confirming correct behaviour. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow state with no file to appear in /proc/fs/nfsd/clients/*/statesNeilBrown1-60/+58
Change the "show" functions to show some content even if a file cannot be found. This is the case for admin-revoked state. This is primarily useful for debugging - to ensure states are being removed eventually. So change several seq_printf() to seq_puts(). Some of these are needed to keep checkpatch happy. Others were done for consistency. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: prepare for supporting admin-revocation of stateNeilBrown5-2/+98
The NFSv4 protocol allows state to be revoked by the admin and has error codes which allow this to be communicated to the client. This patch - introduces a new state-id status SC_STATUS_ADMIN_REVOKED which can be set on open, lock, or delegation state. - reports NFS4ERR_ADMIN_REVOKED when these are accessed - introduces a per-client counter of these states and returns SEQ4_STATUS_ADMIN_STATE_REVOKED when the counter is not zero. Decrements this when freeing any admin-revoked state. - introduces stub code to find all interesting states for a given superblock so they can be revoked via the 'unlock_filesystem' file in /proc/fs/nfsd/ No actual states are handled yet. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: split sc_status out of sc_typeNeilBrown4-141/+151
sc_type identifies the type of a state - open, lock, deleg, layout - and also the status of a state - closed or revoked. This is a bit untidy and could get worse when "admin-revoked" states are added. So clean it up. With this patch, the type is now all that is stored in sc_type. This is zero when the state is first added to ->cl_stateids (causing it to be ignored), and is then set appropriately once it is fully initialised. It is set under ->cl_lock to ensure atomicity w.r.t lookup. It is now never cleared. sc_type is still a bit-set even though at most one bit is set. This allows lookup functions to be given a bitmap of acceptable types. sc_type is now an unsigned short rather than char. There is no value in restricting to just 8 bits. All the constants now start SC_TYPE_ matching the field in which they are stored. Keeping the existing names and ensuring clear separation from non-type flags would have required something like NFS4_STID_TYPE_CLOSED which is cumbersome. The "NFS4" prefix is redundant was they only appear in NFS4 code, so remove that and change STID to SC to match the field. The status is stored in a separate unsigned short named "sc_status". It has two flags: SC_STATUS_CLOSED and SC_STATUS_REVOKED. CLOSED combines NFS4_CLOSED_STID, NFS4_CLOSED_DELEG_STID, and is used for SC_TYPE_LOCK and SC_TYPE_LAYOUT instead of setting the sc_type to zero. These flags are only ever set, never cleared. For deleg stateids they are set under the global state_lock. For open and lock stateids they are set under ->cl_lock. For layout stateids they are set under ->ls_lock nfs4_unhash_stid() has been removed, and we never set sc_type = 0. This was only used for LOCK and LAYOUT stids and they now use SC_STATUS_CLOSED. Also TRACE_DEFINE_NUM() calls for the various STID #define have been removed because these things are not enums, and so that call is incorrect. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: avoid race after unhash_delegation_locked()NeilBrown1-10/+10
NFS4_CLOSED_DELEG_STID and NFS4_REVOKED_DELEG_STID are similar in purpose. REVOKED is used for NFSv4.1 states which have been revoked because the lease has expired. CLOSED is used in other cases. The difference has two practical effects. 1/ REVOKED states are on the ->cl_revoked list 2/ REVOKED states result in nfserr_deleg_revoked from nfsd4_verify_open_stid() and nfsd4_validate_stateid while CLOSED states result in nfserr_bad_stid. Currently a state that is being revoked is first set to "CLOSED" in unhash_delegation_locked(), then possibly to "REVOKED" in revoke_delegation(), at which point it is added to the cl_revoked list. It is possible that a stateid test could see the CLOSED state which really should be REVOKED, and so return the wrong error code. So it is safest to remove this window of inconsistency. With this patch, unhash_delegation_locked() always sets the state correctly, and revoke_delegation() no longer changes the state. Also remove a redundant test on minorversion when NFS4_REVOKED_DELEG_STID is seen - it can only be seen when minorversion is non-zero. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: don't call functions with side-effecting inside WARN_ON()NeilBrown1-5/+5
Code like: WARN_ON(foo()) looks like an assertion and might not be expected to have any side effects. When testing if a function with side-effects fails a construct like if (foo()) WARN_ON(1); makes the intent more obvious. nfsd has several WARN_ON calls where the test has side effects, so it would be good to change them. These cases don't really need the WARN_ON. They have never failed in 8 years of usage so let's just remove the WARN_ON wrapper. Suggested-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: hold ->cl_lock for hash_delegation_locked()NeilBrown1-0/+3
The protocol for creating a new state in nfsd is to allocate the state leaving it largely uninitialised, add that state to the ->cl_stateids idr so as to reserve a state-id, then complete initialisation of the state and only set ->sc_type to non-zero once the state is fully initialised. If a state is found in the idr with ->sc_type == 0, it is ignored. The ->cl_lock lock is used to avoid races - it is held while checking sc_type during lookup, and held when a non-zero value is stored in ->sc_type. ... except... hash_delegation_locked() finalises the initialisation of a delegation state, but does NOT hold ->cl_lock. So this patch takes ->cl_lock at the appropriate time w.r.t other locks, and so ensures there are no races (which are extremely unlikely in any case). As ->fi_lock is often taken when ->cl_lock is held, we need to take ->cl_lock first of those two. Currently ->cl_lock and state_lock are never both taken at the same time. We need both for this patch so an arbitrary choice is needed concerning which to take first. As state_lock is more global, it might be more contended, so take it first. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>