summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2026-01-02nfsd: use correct loop termination in nfsd4_revoke_states()NeilBrown1-1/+1
The loop in nfsd4_revoke_states() stops one too early because the end value given is CLIENT_HASH_MASK where it should be CLIENT_HASH_SIZE. This means that an admin request to drop all locks for a filesystem will miss locks held by clients which hash to the maximum possible hash value. Fixes: 1ac3629bf012 ("nfsd: prepare for supporting admin-revocation of state") Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neil@brown.name> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-02nfsd: provide locking for v4_end_graceNeilBrown4-5/+44
Writing to v4_end_grace can race with server shutdown and result in memory being accessed after it was freed - reclaim_str_hashtbl in particularly. We cannot hold nfsd_mutex across the nfsd4_end_grace() call as that is held while client_tracking_op->init() is called and that can wait for an upcall to nfsdcltrack which can write to v4_end_grace, resulting in a deadlock. nfsd4_end_grace() is also called by the landromat work queue and this doesn't require locking as server shutdown will stop the work and wait for it before freeing anything that nfsd4_end_grace() might access. However, we must be sure that writing to v4_end_grace doesn't restart the work item after shutdown has already waited for it. For this we add a new flag protected with nn->client_lock. It is set only while it is safe to make client tracking calls, and v4_end_grace only schedules work while the flag is set with the spinlock held. So this patch adds a nfsd_net field "client_tracking_active" which is set as described. Another field "grace_end_forced", is set when v4_end_grace is written. After this is set, and providing client_tracking_active is set, the laundromat is scheduled. This "grace_end_forced" field bypasses other checks for whether the grace period has finished. This resolves a race which can result in use-after-free. Reported-by: Li Lingfeng <lilingfeng3@huawei.com> Closes: https://lore.kernel.org/linux-nfs/20250623030015.2353515-1-neil@brown.name/T/#t Fixes: 7f5ef2e900d9 ("nfsd: add a v4_end_grace file to /proc/fs/nfsd") Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neil@brown.name> Tested-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-02NFSD: Fix permission check for read access to executable-only filesScott Mayhew1-2/+2
Commit abc02e5602f7 ("NFSD: Support write delegations in LAYOUTGET") added NFSD_MAY_OWNER_OVERRIDE to the access flags passed from nfsd4_layoutget() to fh_verify(). This causes LAYOUTGET to fail for executable-only files, and causes xfstests generic/126 to fail on pNFS SCSI. To allow read access to executable-only files, what we really want is: 1. The "permissions" portion of the access flags (the lower 6 bits) must be exactly NFSD_MAY_READ 2. The "hints" portion of the access flags (the upper 26 bits) can contain any combination of NFSD_MAY_OWNER_OVERRIDE and NFSD_MAY_READ_IF_EXEC Fixes: abc02e5602f7 ("NFSD: Support write delegations in LAYOUTGET") Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-02NFSD: Remove NFSERR_EAGAINChuck Lever5-6/+1
I haven't found an NFSERR_EAGAIN in RFCs 1094, 1813, 7530, or 8881. None of these RFCs have an NFS status code that match the numeric value "11". Based on the meaning of the EAGAIN errno, I presume the use of this status in NFSD means NFS4ERR_DELAY. So replace the one usage of nfserr_eagain, and remove it from NFSD's NFS status conversion tables. As far as I can tell, NFSERR_EAGAIN has existed since the pre-git era, but was not actually used by any code until commit f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed."), at which time it become possible for NFSD to return a status code of 11 (which is not valid NFS protocol). Fixes: f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed.") Cc: stable@vger.kernel.org Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-25nfsd: Drop the client reference in client_states_open()Haoxiang Li1-1/+3
In error path, call drop_client() to drop the reference obtained by get_nfsdfs_clp(). Fixes: 78599c42ae3c ("nfsd4: add file to display list of client's opens") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Haoxiang Li <lihaoxiang@isrc.iscas.ac.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-25nfsd: use ATTR_DELEG in nfsd4_finalize_deleg_timestamps()Jeff Layton1-1/+1
When finalizing timestamps that have never been updated and preparing to release the delegation lease, the notify_change() call can trigger a delegation break, and fail to update the timestamps. When this happens, there will be messages like this in dmesg: [ 2709.375785] Unable to update timestamps on inode 00:39:263: -11 Since this code is going to release the lease just after updating the timestamps, breaking the delegation is undesirable. Fix this by setting ATTR_DELEG in ia_valid, in order to avoid the delegation break. Fixes: e5e9b24ab8fa ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation") Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-25nfsd: fix nfsd_file reference leak in nfsd4_add_rdaccess_to_wrdeleg()Chuck Lever1-4/+10
nfsd4_add_rdaccess_to_wrdeleg() unconditionally overwrites fp->fi_fds[O_RDONLY] with a newly acquired nfsd_file. However, if the client already has a SHARE_ACCESS_READ open from a previous OPEN operation, this action overwrites the existing pointer without releasing its reference, orphaning the previous reference. Additionally, the function originally stored the same nfsd_file pointer in both fp->fi_fds[O_RDONLY] and fp->fi_rdeleg_file with only a single reference. When put_deleg_file() runs, it clears fi_rdeleg_file and calls nfs4_file_put_access() to release the file. However, nfs4_file_put_access() only releases fi_fds[O_RDONLY] when the fi_access[O_RDONLY] counter drops to zero. If another READ open exists on the file, the counter remains elevated and the nfsd_file reference from the delegation is never released. This potentially causes open conflicts on that file. Then, on server shutdown, these leaks cause __nfsd_file_cache_purge() to encounter files with an elevated reference count that cannot be cleaned up, ultimately triggering a BUG() in kmem_cache_destroy() because there are still nfsd_file objects allocated in that cache. Fixes: e7a8ebc305f2 ("NFSD: Offer write delegation for OPEN with OPEN4_SHARE_ACCESS_WRITE") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-25lockd: fix vfs_test_lock() callsNeilBrown4-18/+24
Usage of vfs_test_lock() is somewhat confused. Documentation suggests it is given a "lock" but this is not the case. It is given a struct file_lock which contains some details of the sort of lock it should be looking for. In particular passing a "file_lock" containing fl_lmops or fl_ops is meaningless and possibly confusing. This is particularly problematic in lockd. nlmsvc_testlock() receives an initialised "file_lock" from xdr-decode, including manager ops and an owner. It then mistakenly passes this to vfs_test_lock() which might replace the owner and the ops. This can lead to confusion when freeing the lock. The primary role of the 'struct file_lock' passed to vfs_test_lock() is to report a conflicting lock that was found, so it makes more sense for nlmsvc_testlock() to pass "conflock", which it uses for returning the conflicting lock. With this change, freeing of the lock is not confused and code in __nlm4svc_proc_test() and __nlmsvc_proc_test() can be simplified. Documentation for vfs_test_lock() is improved to reflect its real purpose, and a WARN_ON_ONCE() is added to avoid a similar problem in the future. Reported-by: Olga Kornievskaia <okorniev@redhat.com> Closes: https://lore.kernel.org/all/20251021130506.45065-1-okorniev@redhat.com Signed-off-by: NeilBrown <neil@brown.name> Fixes: 20fa19027286 ("nfs: add export operations") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-18NFSD: NFSv4 file creation neglects setting ACLChuck Lever1-1/+2
An NFSv4 client that sets an ACL with a named principal during file creation retrieves the ACL afterwards, and finds that it is only a default ACL (based on the mode bits) and not the ACL that was requested during file creation. This violates RFC 8881 section 6.4.1.3: "the ACL attribute is set as given". The issue occurs in nfsd_create_setattr(), which calls nfsd_attrs_valid() to determine whether to call nfsd_setattr(). However, nfsd_attrs_valid() checks only for iattr changes and security labels, but not POSIX ACLs. When only an ACL is present, the function returns false, nfsd_setattr() is skipped, and the POSIX ACL is never applied to the inode. Subsequently, when the client retrieves the ACL, the server finds no POSIX ACL on the inode and returns one generated from the file's mode bits rather than returning the originally-specified ACL. Reported-by: Aurélien Couderc <aurelien.couderc2002@gmail.com> Fixes: c0cbe70742f4 ("NFSD: add posix ACLs to struct nfsd_attrs") Cc: Roland Mainz <roland.mainz@nrubsig.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-18NFSD: Clear TIME_DELEG in the suppattr_exclcreat bitmapChuck Lever1-1/+7
>From RFC 8881: 5.8.1.14. Attribute 75: suppattr_exclcreat > The bit vector that would set all REQUIRED and RECOMMENDED > attributes that are supported by the EXCLUSIVE4_1 method of file > creation via the OPEN operation. The scope of this attribute > applies to all objects with a matching fsid. There's nothing in RFC 8881 that states that suppattr_exclcreat is or is not allowed to contain bits for attributes that are clear in the reported supported_attrs bitmask. But it doesn't make sense for an NFS server to indicate that it /doesn't/ implement an attribute, but then also indicate that clients /are/ allowed to set that attribute using OPEN(create) with EXCLUSIVE4_1. The FATTR4_WORD2_TIME_DELEG attributes are also not to be allowed for OPEN(create) with EXCLUSIVE4_1. It doesn't make sense to set a delegated timestamp on a new file. Fixes: 7e13f4f8d27d ("nfsd: handle delegated timestamps in SETATTR") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-18NFSD: Clear SECLABEL in the suppattr_exclcreat bitmapChuck Lever1-0/+5
>From RFC 8881: 5.8.1.14. Attribute 75: suppattr_exclcreat > The bit vector that would set all REQUIRED and RECOMMENDED > attributes that are supported by the EXCLUSIVE4_1 method of file > creation via the OPEN operation. The scope of this attribute > applies to all objects with a matching fsid. There's nothing in RFC 8881 that states that suppattr_exclcreat is or is not allowed to contain bits for attributes that are clear in the reported supported_attrs bitmask. But it doesn't make sense for an NFS server to indicate that it /doesn't/ implement an attribute, but then also indicate that clients /are/ allowed to set that attribute using OPEN(create) with EXCLUSIVE4_1. Ensure that the SECURITY_LABEL and ACL bits are not set in the suppattr_exclcreat bitmask when they are also not set in the supported_attrs bitmask. Fixes: 8c18f2052e75 ("nfsd41: SUPPATTR_EXCLCREAT attribute") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-18nfsd: fix memory leak in nfsd_create_serv error pathsShardul Bankar1-1/+4
When nfsd_create_serv() calls percpu_ref_init() to initialize nn->nfsd_net_ref, it allocates both a percpu reference counter and a percpu_ref_data structure (64 bytes). However, if the function fails later due to svc_create_pooled() returning NULL or svc_bind() returning an error, these allocations are not cleaned up, resulting in a memory leak. The leak manifests as: - Unreferenced percpu allocation (8 bytes per CPU) - Unreferenced percpu_ref_data structure (64 bytes) Fix this by adding percpu_ref_exit() calls in both error paths to properly clean up the percpu_ref_init() allocations. This patch fixes the percpu_ref leak in nfsd_create_serv() seen as an auxiliary leak in syzbot report 099461f8558eb0a1f4f3; the prepare_creds() and vsock-related leaks in the same report remain to be addressed separately. Reported-by: syzbot+099461f8558eb0a1f4f3@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=099461f8558eb0a1f4f3 Fixes: 47e988147f40 ("nfsd: add nfsd_serv_try_get and nfsd_serv_put") Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-08nfsd: Mark variable __maybe_unused to avoid W=1 build breakAndy Shevchenko1-1/+1
Clang is not happy about set but (in some cases) unused variable: fs/nfsd/export.c:1027:17: error: variable 'inode' set but not used [-Werror,-Wunused-but-set-variable] since it's used as a parameter to dprintk() which might be configured a no-op. To avoid uglifying code with the specific ifdeffery just mark the variable __maybe_unused. The commit [1], which introduced this behaviour, is quite old and hence the Fixes tag points to the first of the Git era. Link: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=0431923fb7a1 [1] Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-08svcrdma: bound check rq_pages index in inline pathJoshua Rogers1-0/+3
svc_rdma_copy_inline_range indexed rqstp->rq_pages[rc_curpage] without verifying rc_curpage stays within the allocated page array. Add guards before the first use and after advancing to a new page. Fixes: d7cc73972661 ("svcrdma: support multiple Read chunks per RPC") Cc: stable@vger.kernel.org Signed-off-by: Joshua Rogers <linux@joshua.hu> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-08svcrdma: return 0 on success from svc_rdma_copy_inline_rangeJoshua Rogers1-1/+1
The function comment specifies 0 on success and -EINVAL on invalid parameters. Make the tail return 0 after a successful copy loop. Fixes: d7cc73972661 ("svcrdma: support multiple Read chunks per RPC") Cc: stable@vger.kernel.org Signed-off-by: Joshua Rogers <linux@joshua.hu> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-08svcrdma: use rc_pageoff for memcpy byte offsetJoshua Rogers1-1/+1
svc_rdma_copy_inline_range added rc_curpage (page index) to the page base instead of the byte offset rc_pageoff. Use rc_pageoff so copies land within the current page. Found by ZeroPath (https://zeropath.com) Fixes: 8e122582680c ("svcrdma: Move svc_rdma_read_info::ri_pageno to struct svc_rdma_recv_ctxt") Cc: stable@vger.kernel.org Signed-off-by: Joshua Rogers <linux@joshua.hu> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-08SUNRPC: svcauth_gss: avoid NULL deref on zero length gss_token in ↵Joshua Rogers1-1/+2
gss_read_proxy_verf A zero length gss_token results in pages == 0 and in_token->pages[0] is NULL. The code unconditionally evaluates page_address(in_token->pages[0]) for the initial memcpy, which can dereference NULL even when the copy length is 0. Guard the first memcpy so it only runs when length > 0. Fixes: 5866efa8cbfb ("SUNRPC: Fix svcauth_gss_proxy_init()") Cc: stable@vger.kernel.org Signed-off-by: Joshua Rogers <linux@joshua.hu> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-03NFSD: nfsd-io-modes: Separate listsBagas Sanjaya1-0/+5
Sphinx reports htmldocs indentation warnings: Documentation/filesystems/nfs/nfsd-io-modes.rst:58: ERROR: Unexpected indentation. [docutils] Documentation/filesystems/nfs/nfsd-io-modes.rst:59: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils] These caused the lists to be shown as long running paragraphs merged with their previous paragraphs. Fix these by separating the lists with a blank line. Fixes: fa8d4e6784d1b6 ("NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/linux-next/20251202152506.7a2d2d41@canb.auug.org.au/ Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-03NFSD: nfsd-io-modes: Wrap shell snippets in literal code blocksBagas Sanjaya1-12/+16
Sphinx reports htmldocs indentation warnings: Documentation/filesystems/nfs/nfsd-io-modes.rst:29: ERROR: Unexpected indentation. [docutils] Documentation/filesystems/nfs/nfsd-io-modes.rst:34: ERROR: Unexpected indentation. [docutils] Fix these by wrapping shell snippets in literal code blocks. Fixes: fa8d4e6784d1b6 ("NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/linux-next/20251202152506.7a2d2d41@canb.auug.org.au/ Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-03NFSD: Add toctree entry for NFSD IO modes docsBagas Sanjaya1-0/+1
Commit fa8d4e6784d1b6 ("NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst") adds documentation for NFSD I/O modes, but it forgets to add toctree entry for it. Hence, Sphinx reports: Documentation/filesystems/nfs/nfsd-io-modes.rst: WARNING: document isn't included in any toctree [toc.not_included] Add the entry. Fixes: fa8d4e6784d1b6 ("NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/linux-next/20251202152506.7a2d2d41@canb.auug.org.au/ Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-01NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rstMike Snitzer1-0/+144
This document details the NFSD IO modes that are configurable using NFSD's experimental debugfs interfaces: /sys/kernel/debug/nfsd/io_cache_read /sys/kernel/debug/nfsd/io_cache_write This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's debugfs interfaces are replaced with per-export controls). Future updates will provide more specific guidance and howto information to help others use and evaluate NFSD's IO modes: BUFFERED, DONTCACHE and DIRECT. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-01NFSD: Implement NFSD_IO_DIRECT for NFS WRITEMike Snitzer3-4/+144
When NFSD_IO_DIRECT is selected via the /sys/kernel/debug/nfsd/io_cache_write experimental tunable, split incoming unaligned NFS WRITE requests into a prefix, middle and suffix segment, as needed. The middle segment is now DIO-aligned and the prefix and/or suffix are unaligned. Synchronous buffered IO is used for the unaligned segments, and IOCB_DIRECT is used for the middle DIO-aligned extent. Although IOCB_DIRECT avoids the use of the page cache, by itself it doesn't guarantee data durability. For UNSTABLE WRITE requests, durability is obtained by a subsequent NFS COMMIT request. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Co-developed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-12-01NFSD: Make FILE_SYNC WRITEs comply with specChuck Lever1-2/+12
Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it does not also persist file time stamps. To wit, Section 18.32.3 of RFC 8881 mandates: > The client specifies with the stable parameter the method of how > the data is to be processed by the server. If stable is > FILE_SYNC4, the server MUST commit the data written plus all file > system metadata to stable storage before returning results. This > corresponds to the NFSv2 protocol semantics. Any other behavior > constitutes a protocol violation. If stable is DATA_SYNC4, then > the server MUST commit all of the data to stable storage and > enough of the metadata to retrieve the data before returning. Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") replaced: - flags |= RWF_SYNC; with: + kiocb.ki_flags |= IOCB_DSYNC; which appears to be correct given: if (flags & RWF_SYNC) kiocb_flags |= IOCB_DSYNC; in kiocb_set_rw_flags(). However the author of that commit did not appreciate that the previous line in kiocb_set_rw_flags() results in IOCB_SYNC also being set: kiocb_flags |= (__force int) (flags & RWF_SUPPORTED); RWF_SUPPORTED contains RWF_SYNC, and RWF_SYNC is the same bit as IOCB_SYNC. Reviewers at the time did not catch the omission. Reported-by: Mike Snitzer <snitzer@kernel.org> Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-25NFSD: Add trace point for SCSI fencing operation.Dai Ngo2-1/+42
Add trace point to print client IP address, net namespace number, device name and status of SCSI pr_preempt command. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-25NFSD: use correct reservation type in nfsd4_scsi_fence_clientDai Ngo1-1/+2
The reservation type argument for the pr_preempt call should match the one used in nfsd4_block_get_device_info_scsi. Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support") Cc: stable@vger.kernel.org Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-25xdrgen: Don't generate unnecessary semicolonChuck Lever18-18/+18
The Jinja2 templates add a semicolon at the end of every function. The C language does not require this punctuation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-25xdrgen: Fix union declarationsChuck Lever1-0/+4
Add a missing template file. This file is used when a union is defined as a public API (ie, "pragma public <union name>;"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-25NFSD: don't start nfsd if sv_permsocks is emptyOlga Kornievskaia1-23/+5
Previously, while trying to create a server instance, if no listening sockets were present then default parameter udp and tcp listeners were created. It's unclear what purpose was of starting these listeners were and how this could have been triggered by the userland setup. This patch proposed to ensure the reverse that we never end in a situation where no listener sockets are created and we are trying to create nfsd threads. The problem it solves is: when nfs.conf only has tcp=n (and nothing else for the choice of transports), nfsdctl would still start the server and create udp and tcp listeners. Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-25xdrgen: handle _XdrString in union encoder/decoderKhushal Chitturi2-9/+31
Running xdrgen on xdrgen/tests/test.x fails when generating encoder or decoder functions for union members of type _XdrString. It was because _XdrString does not have a spec attribute like _XdrBasic, leading to AttributeError. This patch updates emit_union_case_spec_definition and emit_union_case_spec_decoder/encoder to handle _XdrString by assigning type_name = "char *" and avoiding referencing to spec. Testing: Fixed xdrgen tool was run on originally failing test file (tools/net/sunrpc/xdrgen/tests/test.x) and now completes without AttributeError. Modified xdrgen tool was also run against nfs4_1.x (Documentation/sunrpc/xdr/nfs4_1.x). The output header file matches with nfs4_1.h (include/linux/sunrpc/xdrgen/nfs4_1.h). This validates the patch for all XDR input files currently within the kernel. Changes since v2: - Moved the shebang to the first line - Removed SPDX header to match style of current xdrgen files Changes since v1: - Corrected email address in Signed-off-by. - Wrapped patch description lines to 72 characters. Signed-off-by: Khushal Chitturi <kc9282016@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-25xdrgen: Fix the variable-length opaque field decoder templateChuck Lever1-1/+1
Ensure that variable-length opaques are decoded into the named field, and do not overwrite the structure itself. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-25xdrgen: Make the xdrgen script location-independentChuck Lever1-0/+5
The @pythondir@ placeholder is meant for build-time substitution, such as with autoconf. autoconf is not used in the kernel. Let's replace that mechanism with one that better enables the xdrgen script to be run from any directory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-25xdrgen: Generalize/harden pathname constructionChuck Lever1-5/+6
Use Python's built-in Path constructor to find the Jinja templates. This provides better error checking, proper use of path component separators, and more reliable location of the template files. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-21lockd: don't allow locking on reexported NFSv2/3Jeff Layton3-1/+26
Since commit 9254c8ae9b81 ("nfsd: disallow file locking and delegations for NFSv4 reexport"), file locking when reexporting an NFS mount via NFSv4 is expressly prohibited by nfsd. Do the same in lockd: Add a new nlmsvc_file_cannot_lock() helper that will test whether file locking is allowed for a given file, and return nlm_lck_denied_nolocks if it isn't. Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-21MAINTAINERS: add a nfsd blocklayout reviewerChristoph Hellwig1-0/+4
Add a minimal entry for the block layout driver to make sure Christoph who wrote the code gets Cced on all patches. The actual maintenance stays with the nfsd maintainer team. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17nfsd: Use MD5 library instead of crypto_shashEric Biggers2-69/+11
Update NFSD's support for "legacy client tracking" (which uses MD5) to use the MD5 library instead of crypto_shash. This has several benefits: - Simpler code. Notably, much of the error-handling code is no longer needed, since the library functions can't fail. - Improved performance due to reduced overhead. A microbenchmark of nfs4_make_rec_clidname() shows a speedup from 1455 cycles to 425. - The MD5 code can now safely be built as a loadable module when nfsd is built as a loadable module. (Previously, nfsd forced the MD5 code to built-in, presumably to work around the unreliability of the name-based loading.) Thus select MD5 from the tristate option NFSD if NFSD_LEGACY_CLIENT_TRACKING, instead of from the bool option NFSD_V4. - Fixes a bug where legacy client tracking was not supported on kernels booted with "fips=1", due to crypto_shash not allowing MD5 to be used. This particular use of MD5 is not for a cryptographic purpose, though, so it is acceptable even when fips=1 (see https://lore.kernel.org/r/dae495a93cbcc482f4ca23c3a0d9360a1fd8c3a8.camel@redhat.com/). Signed-off-by: Eric Biggers <ebiggers@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17nfsd: stop pretending that we cache the SEQUENCE reply.NeilBrown2-61/+18
nfsd does not cache the reply to a SEQUENCE. As the comment above nfsd4_replay_cache_entry() says: * The sequence operation is not cached because we can use the slot and * session values. The comment above nfsd4_cache_this() suggests otherwise. * The session reply cache only needs to cache replies that the client * actually asked us to. But it's almost free for us to cache compounds * consisting of only a SEQUENCE op, so we may as well cache those too. * Also, the protocol doesn't give us a convenient response in the case * of a replay of a solo SEQUENCE op that wasn't cached The code in nfsd4_store_cache_entry() makes it clear that only responses beyond 'cstate.data_offset' are actually cached, and data_offset is set at the end of nfsd4_encode_sequence() *after* the sequence response has been encoded. This patch simplifies code and removes the confusing comments. - nfsd4_is_solo_sequence() is discarded as not-useful. - nfsd4_cache_this() is now trivial so it too is discarded with the code placed in-line at the one call-site in nfsd4_store_cache_entry(). - nfsd4_enc_sequence_replay() is open-coded in to nfsd4_replay_cache_entry(), and then simplified to (hopefully) make the process of replaying a reply clearer. Signed-off-by: NeilBrown <neil@brown.name> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17NFS: nfsd-maintainer-entry-profile: Inline function name prefixesBagas Sanjaya1-3/+3
Sphinx reports htmldocs warnings: Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst:185: ERROR: Unknown target name: "nfsd". [docutils] Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst:188: ERROR: Unknown target name: "nfsdn". [docutils] Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst:192: ERROR: Unknown target name: "nfsd4m". [docutils] These are due to Sphinx confusing function name prefixes for external link syntax. Fix the warnings by inlining the prefixes. Fixes: 3a1ce35030e1e0 ("NFSD: Add a subsystem policy document") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/linux-next/20251117174218.29365f30@canb.auug.org.au/ Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17NFSD: Add a subsystem policy documentChuck Lever3-0/+549
Steer contributors to NFSD's patchworks instance, list our patch submission preferences, and more. The new document is based on the existing netdev and xfs subsystem policy documents. This is an attempt to add transparency to the process of accepting contributions to NFSD and getting them merged upstream. Suggested-by: "Darrick J. Wong" <djwong@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: NeilBrown <neil@brown.name> [ cel: Hand-edits to address review comments ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17sunrpc: allocate a separate bvec array for socket sendsJeff Layton2-7/+51
svc_tcp_sendmsg() calls xdr_buf_to_bvec() with the second slot of rq_bvec as the start, but doesn't reduce the array length by one, which could lead to an array overrun. Also, rq_bvec is always rq_maxpages in length, which can be too short in some cases, since the TCP record marker consumes a slot. Fix both problems by adding a separate bvec array to the svc_sock that is specifically for sending. For TCP, make this array one slot longer than rq_maxpages, to account for the record marker. For UDP, only allocate as large an array as we need since it's limited to 64k of payload. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17SUNRPC: Improve "fragment too large" warningChuck Lever1-3/+4
Including the client IP address that generated the overrun traffic seems like it would be helpful. The message now reads: kernel: svc: nfsd oversized RPC fragment (1064958 octets) from 100.64.0.11:45866 Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17NFSD: Implement NFSD_IO_DIRECT for NFS READChuck Lever4-0/+87
Add an experimental option that forces NFS READ operations to use direct I/O instead of reading through the NFS server's page cache. There is already at least one other layer of read caching: the page cache on NFS clients. The server's page cache, in many cases, is unlikely to provide additional benefit. Some benchmarks have demonstrated that the server's page cache is actively detrimental for workloads whose working set is larger than the server's available physical memory. For instance, on small NFS servers, cached NFS file content can squeeze out local memory consumers. For large sequential workloads, an enormous amount of data flows into and out of the page cache and is consumed by NFS clients exactly once -- caching that data is expensive to do and totally valueless. For now this is a hidden option that can be enabled on test systems for benchmarking. In the longer term, this option might be enabled persistently or per-export. When the exported file system does not support direct I/O, NFSD falls back to using either DONTCACHE or buffered I/O to fulfill NFS READ requests. Suggested-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17NFSD: Relocate the xdr_reserve_space_vec() call siteChuck Lever1-4/+16
In order to detect when a direct READ is possible, we need the send buffer's .page_len to be zero when there is nothing in the buffer's .pages array yet. However, when xdr_reserve_space_vec() extends the size of the xdr_stream to accommodate a READ payload, it adds to the send buffer's .page_len. It should be safe to reserve the stream space /after/ the VFS read operation completes. This is, for example, how an NFSv3 READ works: the VFS read goes into the rq_bvec, and is then added to the send xdr_stream later by svcxdr_encode_opaque_pages(). Now that xdr_reserve_space_vec() uses the number of bytes actually read, the xdr_truncate_encode() call is no longer necessary. Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17NFSD: pass nfsd_file to nfsd_iter_read()Mike Snitzer3-8/+9
Prepare for nfsd_iter_read() to use the DIO alignment stored in nfsd_file by passing the nfsd_file to nfsd_iter_read() rather than just the file which is associaed with the nfsd_file. This means nfsd4_encode_readv() now also needs the nfsd_file rather than the file. Instead of changing the file arg to be the nfsd_file, we discard the file arg as the nfsd_file (and indeed the file) is already available via the "read" argument. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17NFSD/blocklayout: Support multiple extents per LAYOUTGETSergey Bashirov1-13/+34
Allow the pNFS server to respond with multiple extents to a LAYOUTGET request, thereby avoiding unnecessary load on the server and improving performance for the client. The number of LAYOUTGET requests is significantly reduced for various file access patterns, including random and parallel writes. Additionally, this change allows the client to request layouts with the loga_minlength value greater than the minimum possible length of a single extent in XFS. We use this functionality to fix a livelock in the client. Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17NFSD/blocklayout: Introduce layout content structureSergey Bashirov3-13/+63
Add a layout content structure instead of a single extent. The ability to store and encode an array of extents is then used to implement support for multiple extents per LAYOUTGET. Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17NFSD/blocklayout: Extract extent mapping from proc_layoutgetSergey Bashirov1-49/+66
No changes in functionality. Split the proc_layoutget function to create a helper function that maps single extent to the requested range. This helper function is then used to implement support for multiple extents per LAYOUTGET. Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17NFSD/blocklayout: Fix minlength check in proc_layoutgetSergey Bashirov1-1/+3
The extent returned by the file system may have a smaller offset than the segment offset requested by the client. In this case, the minimum segment length must be checked against the requested range. Otherwise, the client may not be able to continue the read/write operation. Fixes: 8650b8a05850 ("nfsd: pNFS block layout driver") Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17svcrdma: Increase the server's default RPC/RDMA credit grantChuck Lever1-1/+1
The range of commits from commit e3274026e2ec ("SUNRPC: move all of xprt handling into svc_xprt_handle()") to commit 15d39883ee7d ("SUNRPC: change the back-channel queue to lwq") enabled NFSD performance to scale better as the number of nfsd threads is increased. These commits were merged in v6.7. Now that the nfsd thread count can scale to more threads, permit individual clients to make more use of those threads. Increase the RPC/RDMA per-connection credit grant from 64 to 128 -- same as the Linux NFS client. Simple single client fio-based benchmarking so far shows only improvement, no regression. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17NFSD: Update comment documenting unsupported fattr4 attributesChuck Lever1-2/+1
TIME_CREATE has been supported since commit e377a3e698fb ("nfsd: Add support for the birth time attribute"). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-11-17nfsd: delete unreachable confusing code in nfs4_open_delegation()Matvey Kovalev1-5/+0
op_delegate_type is assigned OPEN_DELEGATE_NONE just before the if-block where condition specifies it not be equal to OPEN_DELEGATE_NONE. Compiler treats the block as unreachable and optimizes it out from the resulting executable. In that aspect commit d08d32e6e5c0 ("nfsd4: return delegation immediately if lease fails") notably makes no difference. Seems it's better to just drop this code instead of fiddling with memory barriers or atomics. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Matvey Kovalev <matvey.kovalev@ispras.ru> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>