path: root/fs/nfs
Age | Commit message | Author | Files | Lines
7 days | Merge tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs | Linus Torvalds | 1 | -10/+10
Pull misc vfs updates from Christian Brauner:

"This contains the usual selections of misc updates for this cycle.

Features:

- Use folios for symlinks in the page cache. FUSE already uses folios for its symlinks. Mirror that conversion in the generic code and the NFS code. That lets us get rid of a few folio->page->folio conversions in this path, and some of the few remaining users of read_cache_page() / read_mapping_page()
- Try and make a few filesystem operations killable on the VFS inode->i_mutex level
- Add sysctl vfs_cache_pressure_denom for bulk file operations. Some workloads need to preserve more dentries than we currently allow through our sysctl interface. On an HDFS server with 12 HDDs, datanode startup involves scanning all files and caching their metadata (including dentries and inodes) in memory. Each HDD contains approximately 2 million files, resulting in a total of ~20 million cached dentries after initialization. To minimize dentry reclamation, they set vfs_cache_pressure to 1. Despite this configuration, memory pressure conditions can still trigger reclamation of up to 50% of cached dentries, reducing the cache from 20 million to approximately 10 million entries. During the subsequent cache rebuild period, any HDFS datanode restart operation incurs substantial latency penalties until full cache recovery completes. To maintain service stability, more dentries need to be preserved during memory reclamation. The current minimum reclaim ratio (1/100 of total dentries) remains too aggressive for such a workload. This patch introduces vfs_cache_pressure_denom for more granular cache pressure control. The configuration [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000] effectively maintains the full 20 million dentry cache under memory pressure, preventing datanode restart performance degradation
- Avoid some jumps in inode_permission() using likely()/unlikely()
- Avoid a memory access which is most likely a cache miss when descending into devcgroup_inode_permission()
- Add fastpath predicts for stat() and fdput()
- Anonymous inodes currently don't come with a proper mode, causing issues in the kernel when we want to add useful VFS debug asserts. Fix that by giving them a proper mode and masking it off when we report it to userspace which relies on them not having any mode
- Anonymous inodes currently allow inode attributes to be changed because the VFS falls back to simple_setattr() if i_op->setattr isn't implemented. This means the ownership and mode for every single user of anon_inode_inode can be changed. Block that as it's either useless or actively harmful. If specific ownership is needed the respective subsystem should allocate anonymous inodes from their own private superblock
- Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock
- Add proper tests for anonymous inode behavior
- Make it easy to detect proper anonymous inodes and to ensure that we can detect them in codepaths such as readahead()

Cleanups:

- Port pidfs to the new anon_inode_{g,s}etattr() helpers
- Try to remove the uselib() system call
- Add unlikely branch hint return path for poll
- Add unlikely branch hint on return path for core_sys_select
- Don't allow signals to interrupt getdents copying for fuse
- Provide a size hint to dir_context for use during readdir()
- Use writeback_iter directly in mpage_writepages
- Update compression and mtime descriptions in initramfs documentation
- Update main netfs API document
- Remove useless plus one in super_cache_scan()
- Remove unnecessary NULL-check guards during setns()
- Add separate {get,put}_cgroup_ns no-op cases

Fixes:

- Fix typo in root= kernel parameter description
- Use KERN_INFO for infof()|info_plog()|infofc()
- Correct comments of fs_validate_description()
- Mark an unlikely if condition with unlikely() in vfs_parse_monolithic_sep()
- Delete macro fsparam_u32hex()
- Remove unused and problematic validate_constant_table()
- Fix potential unsigned integer underflow in fs_name()
- Make file-nr output the total allocated file handles"

* tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits)
  fs: Pass a folio to page_put_link()
  nfs: Use a folio in nfs_get_link()
  fs: Convert __page_get_link() to use a folio
  fs/read_write: make default_llseek() killable
  fs/open: make do_truncate() killable
  fs/open: make chmod_common() and chown_common() killable
  include/linux/fs.h: add inode_lock_killable()
  readdir: supply dir_context.count as readdir buffer size hint
  vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations
  fuse: don't allow signals to interrupt getdents copying
  Documentation: fix typo in root= kernel parameter description
  include/cgroup: separate {get,put}_cgroup_ns no-op case
  kernel/nsproxy: remove unnecessary guards
  fs: use writeback_iter directly in mpage_writepages
  fs: remove useless plus one in super_cache_scan()
  fs: add S_ANON_INODE
  fs: remove uselib() system call
  device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission()
  fs/fs_parse: Remove unused and problematic validate_constant_table()
  fs: touch up predicts in inode_permission()
  ...
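The effect of the new denominator in the vfs_cache_pressure_denom item can be checked with a little arithmetic. The following is a userspace sketch, not kernel code: it assumes the number of reclaimable dentries is scaled by vfs_cache_pressure / vfs_cache_pressure_denom, with 100 as the previous fixed denominator (the "1/100" ratio quoted in the message). The helper name `mult_frac` is modeled on the kernel macro of that name; `reclaim_target` is a hypothetical name used only here.

```python
def mult_frac(x: int, numer: int, denom: int) -> int:
    """Return x * numer // denom using a quotient/remainder split.

    The split mirrors how a C implementation avoids overflowing the
    intermediate product; Python integers cannot overflow, so it is
    kept here only for fidelity.
    """
    quot, rem = divmod(x, denom)
    return quot * numer + (rem * numer) // denom


def reclaim_target(cached: int, pressure: int, denom: int = 100) -> int:
    """Hypothetical helper: dentries considered reclaimable (assumed formula)."""
    return mult_frac(cached, pressure, denom)


cached_dentries = 20_000_000

# Assumed old behaviour: denominator fixed at 100.
old_target = reclaim_target(cached_dentries, pressure=1)

# Configuration quoted in the commit message.
new_target = reclaim_target(cached_dentries, pressure=1, denom=10_000)

print(old_target, new_target)
```

Under these assumptions the reclaim target drops by a factor of 100, which matches the message's claim that the finer-grained ratio keeps almost the whole cache resident.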
7 days | Merge tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs | Linus Torvalds | 1 | -6/+5
Pull vfs directory lookup updates from Christian Brauner:

"This contains cleanups for the lookup_one*() family of helpers. We expose a set of functions with names containing "lookup_one_len" and others without the "_len". This difference has nothing to do with "len". It's rather a historical accident that can be confusing. The functions without "_len" take a "mnt_idmap" pointer. This is found in the "vfsmount" and that is an important question when choosing which to use: do you have a vfsmount, or are you "inside" the filesystem. A related question is "is permission checking relevant here?". nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len functions. They pass nop_mnt_idmap and refuse to work on filesystems which have any other idmap. This work changes nfsd and cachefiles to use the lookup_one family of functions and to explicitly pass &nop_mnt_idmap which is consistent with all other vfs interfaces used where &nop_mnt_idmap is explicitly passed. The remaining uses of the "_one" functions do not require permission checks so these are renamed to be "_noperm" and the permission checking is removed. This series also changes these lookup functions to take a qstr instead of separate name and len. In many cases this simplifies the call"

* tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  VFS: change lookup_one_common and lookup_noperm_common to take a qstr
  Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS
  VFS: rename lookup_one_len family to lookup_noperm and remove permission check
  cachefiles: Use lookup_one() rather than lookup_one_len()
  nfsd: Use lookup_one() rather than lookup_one_len()
  VFS: improve interface for lookup_one functions
2025-05-16 | NFS: Avoid flushing data while holding directory locks in nfs_rename() | Trond Myklebust | 2 | -1/+16
The Linux client assumes that all filehandles are non-volatile for renames within the same directory (otherwise sillyrename cannot work). However, the existence of the Linux 'subtree_check' export option has meant that nfs_rename() has always assumed it needs to flush writes before attempting to rename. Since NFSv4 does allow the client to query whether or not the server exhibits this behaviour, and since knfsd does actually set the appropriate flag when 'subtree_check' is enabled on an export, it should be OK to optimise away the write flushing behaviour in the cases where it is clearly not needed. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
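The decision described above can be modeled in a few lines. This is only a sketch under stated assumptions: the flag values are the fh_expire_type bits defined by the NFSv4 specification, but the policy function `must_flush_before_rename` is hypothetical, and the message does not say which bit the client tests or which bit knfsd sets for 'subtree_check'.

```python
# fh_expire_type bits as defined by the NFSv4 protocol.
FH4_PERSISTENT = 0x0
FH4_NOEXPIRE_WITH_OPEN = 0x1
FH4_VOLATILE_ANY = 0x2
FH4_VOL_MIGRATION = 0x4
FH4_VOL_RENAME = 0x8


def must_flush_before_rename(nfs_major: int, fh_expire_type: int) -> bool:
    """Hypothetical policy: flush dirty data unless the server is known
    to keep filehandles stable across a rename."""
    if nfs_major < 4:
        # Assumption: older protocol versions cannot report volatility,
        # so keep the conservative behaviour.
        return True
    # Assumption: either of these bits means a rename may invalidate
    # the filehandle that pending writes refer to.
    return bool(fh_expire_type & (FH4_VOLATILE_ANY | FH4_VOL_RENAME))


print(must_flush_before_rename(4, FH4_PERSISTENT))   # flush can be skipped
print(must_flush_before_rename(4, FH4_VOL_RENAME))   # flush still needed
print(must_flush_before_rename(3, FH4_PERSISTENT))   # cannot tell, so flush
```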
2025-05-16 | NFS/pnfs: Fix the error path in pnfs_layoutreturn_retry_later_locked() | Trond Myklebust | 1 | -17/+13
If there isn't a valid layout, or the layout stateid has changed, the cleanup after a layout return should clear out the old data. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-05-15 | NFSv4/pnfs: Reset the layout state after a layoutreturn | Trond Myklebust | 1 | -0/+9
If there are still layout segments in the layout plh_return_lsegs list after a layout return, we should be resetting the state to ensure they eventually get returned as well. Fixes: 68f744797edd ("pNFS: Do not free layout segments that are marked for return") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-05-15 | fs: Pass a folio to page_put_link() | Matthew Wilcox (Oracle) | 1 | -1/+1
All callers now have a folio. Pass it to page_put_link(), saving a hidden call to compound_head(). Also add kernel-doc for page_get_link() and page_put_link(). Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/20250514171316.3002934-4-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15 | nfs: Use a folio in nfs_get_link() | Matthew Wilcox (Oracle) | 1 | -10/+10
Mirror the changes to __page_get_link() by retrieving a folio from the page cache instead of a page. Removes two hidden calls to compound_head(). Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/20250514171316.3002934-3-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-28 | NFS/localio: Fix a race in nfs_local_open_fh() | Trond Myklebust | 1 | -1/+1
Once the clp->cl_uuid.lock has been dropped, another CPU could come in and free the struct nfsd_file that was just added. To prevent that from happening, take the RCU read lock before dropping the spin lock. Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
2025-04-28 | nfs: nfs3acl: drop useless assignment in nfs3_get_acl() | Sergey Shtylyov | 1 | -1/+1
In nfs3_get_acl(), the local variable status is assigned the result of nfs_refresh_inode() inside the *switch* statement, but that value gets overwritten in the next *if* statement's true branch and is completely ignored if that branch isn't taken... Found by Linux Verification Center (linuxtesting.org) with the Svace static analysis tool. Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/c32dced7-a4fa-43c0-aafe-ef6c819c2f91@omp.ru Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-04-28 | nfs: direct: drop useless initializer in nfs_direct_write_completion() | Sergey Shtylyov | 1 | -1/+1
In nfs_direct_write_completion(), the local variable req isn't used outside the *while* loop and is assigned to right at the start of that loop's body, so its initializer appears useless -- drop it; then move the declaration to the loop body (which happens to have a pointless empty line anyway)... Found by Linux Verification Center (linuxtesting.org) with the Svace static analysis tool. Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/416219f5-7983-484b-b5a7-5fb7da9561f7@omp.ru Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-04-28 | nfs: move the nfs4_data_server_cache into struct nfs_net | Jeff Layton | 3 | -15/+29
Since struct nfs4_pnfs_ds should not be shared between net namespaces, move from a global list of objects to a per-netns list and spinlock. Tested-by: Sargun Dillon <sargun@sargun.me> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/20250410-nfs-ds-netns-v2-2-f80b7979ba80@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-04-28 | nfs: don't share pNFS DS connections between net namespaces | Jeff Layton | 4 | -11/+14
Currently, different NFS clients can share the same DS connections, even when they are in different net namespaces. If a containerized client creates a DS connection, another container can find and use it. When the first client exits, the connection will close which can lead to stalls in other clients. Add a net namespace pointer to struct nfs4_pnfs_ds, and compare that value to the caller's netns in _data_server_lookup_locked() when searching for a nfs4_pnfs_ds to match. Reported-by: Omar Sandoval <osandov@osandov.com> Reported-by: Sargun Dillon <sargun@sargun.me> Closes: https://lore.kernel.org/linux-nfs/Z_ArpQC_vREh_hEA@telecaster/ Tested-by: Sargun Dillon <sargun@sargun.me> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/20250410-nfs-ds-netns-v2-1-f80b7979ba80@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
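The lookup change described above amounts to adding the namespace to the match key. The sketch below is a userspace model, not the kernel implementation: `DataServer`, its fields, and `lookup_data_server` are hypothetical names, and a plain string stands in for the net namespace pointer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataServer:
    address: str   # stand-in for the DS address list
    net: str       # stand-in for the net namespace pointer


def lookup_data_server(cache: List[DataServer], address: str,
                       net: str) -> Optional[DataServer]:
    """Return a cached entry only if both address and namespace match."""
    for ds in cache:
        if ds.address == address and ds.net == net:
            return ds
    return None


cache = [DataServer(address="192.0.2.10:2049", net="container-a")]

# The same address requested from another namespace must not be shared.
hit = lookup_data_server(cache, "192.0.2.10:2049", "container-a")
miss = lookup_data_server(cache, "192.0.2.10:2049", "container-b")
print(hit, miss)
```

With the namespace in the key, the second container gets no match and would create its own connection, so tearing down the first container no longer stalls it.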
2025-04-28 | nfs: handle failure of nfs_get_lock_context in unlock path | Li Lingfeng | 1 | -1/+8
When memory is insufficient, the allocation of nfs_lock_context in nfs_get_lock_context() fails and returns -ENOMEM. If we mistakenly treat an nfs4_unlockdata structure (whose l_ctx member has been set to -ENOMEM) as valid and proceed to execute rpc_run_task(), this will trigger a NULL pointer dereference in nfs4_locku_prepare. For example: BUG: kernel NULL pointer dereference, address: 000000000000000c PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 15 UID: 0 PID: 12 Comm: kworker/u64:0 Not tainted 6.15.0-rc2-dirty #60 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 Workqueue: rpciod rpc_async_schedule RIP: 0010:nfs4_locku_prepare+0x35/0xc2 Code: 89 f2 48 89 fd 48 c7 c7 68 69 ef b5 53 48 8b 8e 90 00 00 00 48 89 f3 RSP: 0018:ffffbbafc006bdb8 EFLAGS: 00010246 RAX: 000000000000004b RBX: ffff9b964fc1fa00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: fffffffffffffff4 RDI: ffff9ba53fddbf40 RBP: ffff9ba539934000 R08: 0000000000000000 R09: ffffbbafc006bc38 R10: ffffffffb6b689c8 R11: 0000000000000003 R12: ffff9ba539934030 R13: 0000000000000001 R14: 0000000004248060 R15: ffffffffb56d1c30 FS: 0000000000000000(0000) GS:ffff9ba5881f0000(0000) knlGS:00000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000000c CR3: 000000093f244000 CR4: 00000000000006f0 Call Trace: <TASK> __rpc_execute+0xbc/0x480 rpc_async_schedule+0x2f/0x40 process_one_work+0x232/0x5d0 worker_thread+0x1da/0x3d0 ? __pfx_worker_thread+0x10/0x10 kthread+0x10d/0x240 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: CR2: 000000000000000c ---[ end trace 0000000000000000 ]--- Free the allocated nfs4_unlockdata when nfs_get_lock_context() fails and return NULL to terminate subsequent rpc_run_task, preventing NULL pointer dereference. 
Fixes: f30cb757f680 ("NFS: Always wait for I/O completion before unlock") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20250417072508.3850532-1-lilingfeng3@huawei.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-04-28 | pNFS/flexfiles: Record the RPC errors in the I/O tracepoints | Trond Myklebust | 2 | -15/+25
When debugging I/O issues, we want to see not just the NFS level errors, but also the RPC level problems, so record both in the tracepoints. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-04-28 | NFSv4/pnfs: Layoutreturn on close must handle fatal networking errors | Trond Myklebust | 1 | -0/+12
If we have a fatal ENETDOWN or ENETUNREACH error, then the layoutreturn on close code should also handle that as fatal, and free the layouts. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2025-04-28 | NFSv4: Handle fatal ENETDOWN and ENETUNREACH errors | Trond Myklebust | 1 | -0/+9
Ensure that the NFSv4 error handling code recognises the RPC_TASK_NETUNREACH_FATAL flag, and handles the ENETDOWN and ENETUNREACH errors accordingly. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2025-04-19 | Merge tag 'nfsd-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux | Linus Torvalds | 3 | -12/+1
Pull nfsd fixes from Chuck Lever:

- v6.15 libcrc clean-up makes invalid configurations possible
- Fix a potential deadlock introduced during the v6.15 merge window

* tag 'nfsd-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: decrease sc_count directly if fail to queue dl_recall
  nfs: add missing selections of CONFIG_CRC32
2025-04-13 | nfs: add missing selections of CONFIG_CRC32 | Eric Biggers | 3 | -12/+1
nfs.ko, nfsd.ko, and lockd.ko all use crc32_le(), which is available only when CONFIG_CRC32 is enabled. But the only NFS kconfig option that selected CONFIG_CRC32 was CONFIG_NFS_DEBUG, which is client-specific and did not actually guard the use of crc32_le() even on the client. The code worked around this bug by only actually calling crc32_le() when CONFIG_CRC32 is built-in, instead hard-coding '0' in other cases. This avoided randconfig build errors, and in real kernels the fallback code was unlikely to be reached since CONFIG_CRC32 is 'default y'. But, this really needs to just be done properly, especially now that I'm planning to update CONFIG_CRC32 to not be 'default y'. Therefore, make CONFIG_NFS_FS, CONFIG_NFSD, and CONFIG_LOCKD select CONFIG_CRC32. Then remove the fallback code that becomes unnecessary, as well as the selection of CONFIG_CRC32 from CONFIG_NFS_DEBUG. Fixes: 1264a2f053a3 ("NFS: refactor code for calculating the crc32 hash of a filehandle") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
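The removed workaround and its replacement can be illustrated in userspace. This is a sketch, not the kernel code: it assumes the filehandle hash is the conventional CRC-32 (all-ones seed, final inversion), which is what Python's `zlib.crc32` computes; the message itself only says that crc32_le() is used. Both function names are hypothetical.

```python
import zlib


def fhandle_hash(fh: bytes) -> int:
    """Assumed hash: standard CRC-32 over the filehandle bytes."""
    return zlib.crc32(fh) & 0xFFFFFFFF


def fhandle_hash_with_fallback(fh: bytes, crc32_builtin: bool) -> int:
    """Model of the old workaround: hard-code 0 when CRC32 is not built in."""
    if not crc32_builtin:
        return 0
    return fhandle_hash(fh)


# b"123456789" is the customary CRC check input; its CRC-32 is 0xCBF43926.
print(hex(fhandle_hash(b"123456789")))
print(fhandle_hash_with_fallback(b"123456789", crc32_builtin=False))
```

The fallback makes every filehandle hash to the same value, which is why the message prefers selecting CONFIG_CRC32 unconditionally over keeping the workaround.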
2025-04-08 | VFS: rename lookup_one_len family to lookup_noperm and remove permission check | NeilBrown | 1 | -6/+5
The lookup_one_len family of functions is (now) only used internally by a filesystem on itself, either
- in a context where permission checking is irrelevant, such as by a virtual filesystem populating itself, or xfs accessing its ORPHANAGE or dquota accessing the quota file; or
- in a context where a permission check (MAY_EXEC on the parent) has just been performed, such as a network filesystem finding a "silly-rename" file in the same directory. This is also the context after the _parentat() functions where currently lookup_one_qstr_excl() is used.
So the permission check is pointless. The name "one_len" is unhelpful in understanding the purpose of these functions and should be changed. Most of the callers pass the len as "strlen()" so using a qstr and QSTR() can simplify the code. This patch renames these functions (including lookup_positive_unlocked() which is part of the family despite the name) to have a name based on "lookup_noperm". They are changed to receive a 'struct qstr' instead of separate name and len. In a few cases the use of QSTR() results in a new call to strlen(). try_lookup_noperm() takes a pointer to a qstr instead of the whole qstr. This is consistent with d_hash_and_lookup() (which is nearly identical) and useful for lookup_noperm_unlocked(). The new lookup_noperm_common() doesn't take a qstr yet. That will be tidied up in a subsequent patch. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/r/20250319031545.2999807-5-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-03 | Merge tag 'nfs-for-6.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs | Linus Torvalds | 19 | -58/+537
Pull NFS client updates from Trond Myklebust:

"Bugfixes:

- Three fixes for looping in the NFSv4 state manager delegation code
- Fix for the NFSv4 state XDR code (Neil Brown)
- Fix a leaked reference in nfs_lock_and_join_requests()
- Fix a use-after-free in the delegation return code

Features:

- Implement the NFSv4.2 copy offload OFFLOAD_STATUS operation to allow monitoring of an in-progress copy
- Add a mount option to force NFSv3/NFSv4 to use READDIRPLUS in a getdents() call
- SUNRPC now allows some basic management of an existing RPC client's connections using sysfs
- Improvements to the automated teardown of a NFS client when the container it was initiated from gets killed
- Improvements to prevent tasks from getting stuck in a killable wait state after calling exit_signals()"

* tag 'nfs-for-6.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (29 commits)
  nfs: Add missing release on error in nfs_lock_and_join_requests()
  NFSv4: Check for delegation validity in nfs_start_delegation_return_locked()
  NFS: Don't allow waiting for exiting tasks
  SUNRPC: Don't allow waiting for exiting tasks
  NFSv4: Treat ENETUNREACH errors as fatal for state recovery
  NFSv4: clp->cl_cons_state < 0 signifies an invalid nfs_client
  NFSv4: Further cleanups to shutdown loops
  NFS: Shut down the nfs_client only after all the superblocks
  SUNRPC: rpc_clnt_set_transport() must not change the autobind setting
  SUNRPC: rpcbind should never reset the port to the value '0'
  pNFS/flexfiles: Report ENETDOWN as a connection error
  pNFS/flexfiles: Treat ENETUNREACH errors as fatal in containers
  NFS: Treat ENETUNREACH errors as fatal in containers
  NFS: Add a mount option to make ENETUNREACH errors fatal
  sunrpc: Add a sysfs file for one-step xprt deletion
  sunrpc: Add a sysfs file for adding a new xprt
  sunrpc: Add a sysfs files for rpc_clnt information
  sunrpc: Add a sysfs attr for xprtsec
  NFS: Add implid to sysfs
  NFS: Extend rdirplus mount option with "force|none"
  ...
2025-04-02 | nfs: Add missing release on error in nfs_lock_and_join_requests() | Dan Carpenter | 1 | -1/+3
Call nfs_release_request() on this error path before returning. Fixes: c3f2235782c3 ("nfs: fold nfs_folio_find_and_lock_request into nfs_lock_and_join_requests") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/3aaaa3d5-1c8a-41e4-98c7-717801ddd171@stanley.mountain Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-04-01 | Merge tag 'nfsd-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux | Linus Torvalds | 1 | -1/+2
Pull nfsd updates from Chuck Lever:

"Neil Brown contributed more scalability improvements to NFSD's open file cache, and Jeff Layton contributed a menagerie of repairs to NFSD's NFSv4 callback / backchannel implementation.

Mike Snitzer contributed a change to NFS re-export support that disables support for file locking on a re-exported NFSv4 mount. This is because NFSv4 state recovery is currently difficult if not impossible for re-exported NFS mounts. The change aims to prevent data integrity exposures after the re-export server crashes.

Work continues on the evolving NFSD netlink administrative API.

Many thanks to the contributors, reviewers, testers, and bug reporters who participated during the v6.15 development cycle"

* tag 'nfsd-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (45 commits)
  NFSD: Add a Kconfig setting to enable delegated timestamps
  sysctl: Fixes nsm_local_state bounds
  nfsd: use a long for the count in nfsd4_state_shrinker_count()
  nfsd: remove obsolete comment from nfs4_alloc_stid
  nfsd: remove unneeded forward declaration of nfsd4_mark_cb_fault()
  nfsd: reorganize struct nfs4_delegation for better packing
  nfsd: handle errors from rpc_call_async()
  nfsd: move cb_need_restart flag into cb_flags
  nfsd: replace CB_GETATTR_BUSY with NFSD4_CALLBACK_RUNNING
  nfsd: eliminate cl_ra_cblist and NFSD4_CLIENT_CB_RECALL_ANY
  nfsd: prevent callback tasks running concurrently
  nfsd: disallow file locking and delegations for NFSv4 reexport
  nfsd: filecache: drop the list_lru lock during lock gc scans
  nfsd: filecache: don't repeatedly add/remove files on the lru list
  nfsd: filecache: introduce NFSD_FILE_RECENT
  nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc()
  nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync()
  NFSD: Re-organize nfsd_file_gc_worker()
  nfsd: filecache: remove race handling.
  fs: nfs: acl: Avoid -Wflex-array-member-not-at-end warning
  ...
2025-03-28 | NFSv4: Check for delegation validity in nfs_start_delegation_return_locked() | Trond Myklebust | 1 | -1/+2
Check that the delegation is still attached after taking the spin lock in nfs_start_delegation_return_locked(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-28 | NFS: Don't allow waiting for exiting tasks | Trond Myklebust | 4 | -3/+15
Once a task calls exit_signals() it can no longer be signalled. So do not allow it to do killable waits. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-26 | NFSv4: Treat ENETUNREACH errors as fatal for state recovery | Trond Myklebust | 1 | -1/+9
If a containerised process is killed and causes an ENETUNREACH or ENETDOWN error to be propagated to the state manager, then mark the nfs_client as being dead so that we don't loop in functions that are expecting recovery to succeed. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-26 | NFSv4: clp->cl_cons_state < 0 signifies an invalid nfs_client | Trond Myklebust | 1 | -1/+1
If someone calls nfs_mark_client_ready(clp, status) with a negative value for status, then that should signal that the nfs_client is no longer valid. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-26 | NFSv4: Further cleanups to shutdown loops | Trond Myklebust | 2 | -2/+2
Replace the tests for the RPC client being shut down with tests for whether the nfs_client is in an error state. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-26 | NFS: Shut down the nfs_client only after all the superblocks | Trond Myklebust | 1 | -1/+21
The nfs_client manages state for all the superblocks in the "cl_superblocks" list, so it must not be shut down until all of them are gone. Fixes: 7d3e26a054c8 ("NFS: Cancel all existing RPC tasks when shutdown") Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-24 | Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs | Linus Torvalds | 5 | -46/+66
Pull vfs async dir updates from Christian Brauner:

"This contains cleanups that fell out of the work from async directory handling:

- Change kern_path_locked() and user_path_locked_at() to never return a negative dentry. This simplifies the usability of these helpers in various places
- Drop d_exact_alias() from the remaining place in NFS where it is still used. This also allows us to drop the d_exact_alias() helper completely
- Drop an unnecessary call to fh_update() from nfsd_create_locked()
- Change i_op->mkdir() to return a struct dentry. Change vfs_mkdir() to return a dentry provided by the filesystems which is hashed and positive. This allows us to reduce the number of cases where the resulting dentry is not positive to very few cases. The code in these places becomes simpler and easier to understand.
- Repack DENTRY_* and LOOKUP_* flags"

* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: fix inline emphasis warning
  VFS: Change vfs_mkdir() to return the dentry.
  nfs: change mkdir inode_operation to return alternate dentry if needed.
  fuse: return correct dentry for ->mkdir
  ceph: return the correct dentry on mkdir
  hostfs: store inode in dentry after mkdir if possible.
  Change inode_operations.mkdir to return struct dentry *
  nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
  nfs/vfs: discard d_exact_alias()
  VFS: add common error checks to lookup_one_qstr_excl()
  VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  VFS: repack LOOKUP_ bit flags.
  VFS: repack DENTRY_ flags.
2025-03-21 | pNFS/flexfiles: Report ENETDOWN as a connection error | Trond Myklebust | 1 | -0/+1
If the client should see an ENETDOWN when trying to connect to the data server, it might still be able to talk to the metadata server through another NIC. If so, report the error. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-21 | pNFS/flexfiles: Treat ENETUNREACH errors as fatal in containers | Trond Myklebust | 3 | -2/+28
Propagate the NFS_MOUNT_NETUNREACH_FATAL flag to work with the pNFS flexfiles client. In these circumstances, the client needs to treat the ENETDOWN and ENETUNREACH errors as fatal, and should abandon the attempted I/O. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-21 | NFS: Treat ENETUNREACH errors as fatal in containers | Trond Myklebust | 3 | -0/+10
Propagate the NFS_MOUNT_NETUNREACH_FATAL flag to work with the generic NFS client. If the flag is set, the client will receive ENETDOWN and ENETUNREACH errors from the RPC layer, and is expected to treat them as being fatal. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-21 | NFS: Add a mount option to make ENETUNREACH errors fatal | Trond Myklebust | 2 | -0/+42
If the NFS client was initially created in a container, and that container is torn down, there is usually no possibility to go back and destroy any NFS clients that are hung because their virtual network devices have been unlinked. Add a flag that tells the NFS client that in these circumstances, it should treat ENETDOWN and ENETUNREACH errors as fatal to the NFS client. The option defaults to being on when the mount happens from inside a net namespace that is not "init_net". Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com>
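The default and the resulting error classification described above can be sketched as follows. This is a userspace model under assumptions: the function names are hypothetical, a string stands in for the net namespace, and positive errno values are used where the kernel passes negative ones.

```python
import errno

INIT_NET = "init_net"


def netunreach_fatal_default(mount_net: str) -> bool:
    """Per the message: on by default when mounting outside init_net."""
    return mount_net != INIT_NET


def is_fatal_network_error(err: int, netunreach_fatal: bool) -> bool:
    """Treat ENETDOWN/ENETUNREACH as fatal only when the flag is set."""
    return netunreach_fatal and err in (errno.ENETDOWN, errno.ENETUNREACH)


in_container = netunreach_fatal_default("container-netns")
on_host = netunreach_fatal_default(INIT_NET)

print(is_fatal_network_error(errno.ENETUNREACH, in_container))  # fatal
print(is_fatal_network_error(errno.ENETUNREACH, on_host))       # retried
print(is_fatal_network_error(errno.ETIMEDOUT, in_container))    # unaffected
```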
2025-03-21 | NFS: Add implid to sysfs | Anna Schumaker | 1 | -0/+60
The Linux NFS server added support for returning this information during an EXCHANGE_ID in Linux v6.13. This is something an admin might want to query, so let's add it to sysfs. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Link: https://lore.kernel.org/r/20250207204225.594002-2-anna@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-21 | NFS: Extend rdirplus mount option with "force|none" | Benjamin Coddington | 3 | -4/+31
There are certain users that wish to force the NFS client to choose READDIRPLUS over READDIR for a particular mount. Update the "rdirplus" mount option to optionally accept values. For "rdirplus=force", the NFS client will always attempt to use READDIRPLUS. The setting of "rdirplus=none" is aliased to the existing "nordirplus". Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/c4cf0de4c8be0930b91bc74bee310d289781cd3b.1741885071.git.bcodding@redhat.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
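The option values described above map onto three behaviours. The parser below is a userspace sketch, not the kernel's mount option code: `RdirPlus` and `parse_rdirplus` are hypothetical names, and rejecting unknown values with an exception is an assumption, since the message does not say how they are handled.

```python
from enum import Enum


class RdirPlus(Enum):
    AUTO = "auto"    # plain "rdirplus": client decides per directory
    FORCE = "force"  # "rdirplus=force": always attempt READDIRPLUS
    NONE = "none"    # "rdirplus=none" or "nordirplus": never use it


def parse_rdirplus(option: str) -> RdirPlus:
    """Map one mount option string to the behaviour it selects."""
    if option == "nordirplus":
        return RdirPlus.NONE
    if option == "rdirplus":
        return RdirPlus.AUTO
    key, sep, value = option.partition("=")
    if key == "rdirplus" and sep:
        if value == "force":
            return RdirPlus.FORCE
        if value == "none":
            return RdirPlus.NONE
    # Assumption: anything else is rejected.
    raise ValueError(f"unrecognised option: {option!r}")


for opt in ("rdirplus", "rdirplus=force", "rdirplus=none", "nordirplus"):
    print(opt, "->", parse_rdirplus(opt).name)
```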
2025-03-17 | NFS: Refactor trace_nfs4_offload_cancel | Chuck Lever | 2 | -1/+11
Add a trace_nfs4_offload_status trace point that looks just like trace_nfs4_offload_cancel. Promote that event to an event class to avoid duplicating code. An alternative approach would be to expand trace_nfs4_offload_status to report more of the actual OFFLOAD_STATUS result. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/20250113153235.48706-16-cel@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-17NFS: Use NFSv4.2's OFFLOAD_STATUS operationChuck Lever1-11/+59
We've found that there are cases where a transport disconnection results in the loss of callback RPCs. NFS servers typically do not retransmit callback operations after a disconnect. This can be a problem for the Linux NFS client's current implementation of asynchronous COPY, which waits indefinitely for a CB_OFFLOAD callback. If a transport disconnect occurs while an async COPY is running, there's a good chance the client will never get the completing CB_OFFLOAD. Fix this by implementing the OFFLOAD_STATUS operation so that the Linux NFS client can probe the NFS server if it doesn't see a CB_OFFLOAD in a reasonable amount of time. This patch implements a simplistic check. As future work, the client might also be able to detect whether there is no forward progress on the requested asynchronous COPY operation, and CANCEL it. Suggested-by: Olga Kornievskaia <kolga@netapp.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218735 Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/20250113153235.48706-15-cel@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
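The wait-then-probe idea can be sketched as a small user-space model. Everything here is an assumption made for illustration: `fake_server`, `fake_offload_status()` and `wait_for_copy()` are hypothetical, and each loop iteration merely stands for one timed wait on the CB_OFFLOAD callback.

```c
#include <assert.h>
#include <stdbool.h>

enum probe_result { PROBE_IN_PROGRESS, PROBE_COMPLETE };

/* Hypothetical server model: the copy finishes after a number of probes. */
struct fake_server {
    int probes_until_done;
    int probes_seen;
};

/* Stand-in for sending OFFLOAD_STATUS and reading the reply. */
static enum probe_result fake_offload_status(struct fake_server *srv)
{
    srv->probes_seen++;
    return srv->probes_seen >= srv->probes_until_done
        ? PROBE_COMPLETE : PROBE_IN_PROGRESS;
}

/*
 * Each iteration models one timed wait. If the callback was delivered,
 * finish at once; otherwise the wait timed out, so probe the server.
 * Returns true once the copy is known complete, false after max_waits.
 */
static bool wait_for_copy(struct fake_server *srv, bool callback_delivered,
                          int max_waits)
{
    assert(max_waits > 0);
    for (int i = 0; i < max_waits; i++) {
        if (callback_delivered)
            return true;
        if (fake_offload_status(srv) == PROBE_COMPLETE)
            return true;
    }
    return false;
}
```

The point of the design is that a lost callback no longer leaves the waiter stuck forever: a timeout turns into a status probe.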
2025-03-17NFS: Implement NFSv4.2's OFFLOAD_STATUS operationChuck Lever2-1/+105
Enable the Linux NFS client to observe the progress of an offloaded asynchronous COPY operation. This new operation will be put to use in a subsequent patch. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/20250113153235.48706-14-cel@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-17NFS: Implement NFSv4.2's OFFLOAD_STATUS XDRChuck Lever2-0/+87
Add XDR encoding and decoding functions for the NFSv4.2 OFFLOAD_STATUS operation. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/20250113153235.48706-13-cel@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-17NFS: fix open_owner_id_maxsz and related fields.NeilBrown1-9/+9
A recent change increased the size of an NFSv4 open owner, but didn't increase the corresponding max_sz defines. This is not known to have caused failure, but should be fixed. This patch also fixes some related _maxsz fields that are wrong. Note that the XXX_owner_id_maxsz values now are only the size of the id and do NOT include the len field that will always precede the id in xdr encoding. I think this is clearer. Reported-by: David Disseldorp <ddiss@suse.com> Fixes: d98f72272500 ("nfs: simplify and guarantee owner uniqueness.") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
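The distinction between "size of the id" and "size including the len field" can be made concrete with XDR arithmetic, where sizes are counted in 4-byte words and variable-length opaque data is padded to a word boundary. The helper names below are hypothetical and the byte counts in the usage note are arbitrary examples, not the kernel's actual owner sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Words needed for the id bytes alone, rounded up to a 4-byte boundary. */
static size_t id_words(size_t id_bytes)
{
    assert(id_bytes <= SIZE_MAX - 3); /* avoid overflow when rounding up */
    return (id_bytes + 3) / 4;
}

/* Words for the full opaque on the wire: one length word plus the id. */
static size_t opaque_words(size_t id_bytes)
{
    return 1 + id_words(id_bytes);
}
```

For example, a 28-byte id needs 7 words on its own and 8 words once the preceding length word is counted, so a buffer-size define has to be clear about which of the two it means.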
2025-03-17NFSv4: Avoid unnecessary scans of filesystems for delayed delegationsTrond Myklebust1-6/+12
The amount of looping through the list of delegations is occasionally leading to soft lockups. If the state manager was asked to manage the delayed return of delegations, then only scan those filesystems containing delegations that were marked as being delayed. Fixes: be20037725d1 ("NFSv4: Fix delegation return in cases where we have to retry") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-17NFSv4: Avoid unnecessary scans of filesystems for expired delegationsTrond Myklebust1-0/+7
The amount of looping through the list of delegations is occasionally leading to soft lockups. If the state manager was asked to reap the expired delegations, it should scan only those filesystems that hold delegations that need to be reaped. Fixes: 7f156ef0bf45 ("NFSv4: Clean up nfs_delegation_reap_expired()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-17NFSv4: Avoid unnecessary scans of filesystems for returning delegationsTrond Myklebust1-0/+5
The amount of looping through the list of delegations is occasionally leading to soft lockups. If the state manager was asked to return delegations asynchronously, it should only scan those filesystems that hold delegations that need to be returned. Fixes: af3b61bf6131 ("NFSv4: Clean up nfs_client_return_marked_delegations()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
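The common idea in these delegation fixes, skipping filesystems that have nothing flagged, can be sketched as follows. The structure and function are hypothetical simplifications: the real code keeps per-server state and delegation lists that are not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-filesystem summary: a flag plus a delegation count. */
struct server_sketch {
    bool has_marked;      /* set when some delegation needs attention */
    size_t ndelegations;  /* length of the list a full scan would walk */
};

/*
 * Walk only the filesystems whose flag is set and clear the flag once
 * they are processed. Returns how many delegations were visited, to
 * show the work saved compared with scanning every list.
 */
static size_t scan_marked(struct server_sketch *servers, size_t n)
{
    size_t visited = 0;

    assert(servers != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!servers[i].has_marked)
            continue; /* nothing to do here, skip the whole list */
        visited += servers[i].ndelegations;
        servers[i].has_marked = false;
    }
    return visited;
}
```

A single flag check per filesystem replaces a walk over every delegation it holds, which is where the soft lockups described above came from.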
2025-03-17NFSv4: Don't trigger uneccessary scans for return-on-close delegationsTrond Myklebust1-15/+18
The amount of looping through the list of delegations is occasionally leading to soft lockups. Avoid at least some loops by not requiring the NFSv4 state manager to scan for delegations that are marked for return-on-close. Instead, either mark them for immediate return (if possible) or else leave it up to nfs4_inode_return_delegation_on_close() to return them once the file is closed by the application. Fixes: b757144fd77c ("NFSv4: Be less aggressive about returning delegations for open files") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-10nfsd: disallow file locking and delegations for NFSv4 reexportMike Snitzer1-1/+2
We do not and cannot support file locking with NFS reexport over NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport server reboot cannot allow clients to recover locks because the source NFS server has not rebooted, and so it is not in grace. Since the source NFS server is not in grace, it cannot offer any guarantees that the file won't have been changed between the locks getting lost and any attempt to recover/reclaim them. The same applies to delegations and any associated locks, so disallow them too. Clients are no longer allowed to get file locks or delegations from a reexport server, any attempts will fail with operation not supported. Update the "Reboot recovery" section accordingly in Documentation/filesystems/nfs/reexport.rst Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-09Merge tag 'mm-hotfixes-stable-2025-03-08-16-27' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "33 hotfixes. 24 are cc:stable and the remainder address post-6.13 issues or aren't considered necessary for -stable kernels. 26 are for MM and 7 are for non-MM. - "mm: memory_failure: unmap poisoned folio during migrate properly" from Ma Wupeng fixes a couple of two year old bugs involving the migration of hwpoisoned folios. - "selftests/damon: three fixes for false results" from SeongJae Park fixes three one year old bugs in the DAMON selftest code. The remainder are singletons and doubletons. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2025-03-08-16-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (33 commits) mm/page_alloc: fix uninitialized variable rapidio: add check for rio_add_net() in rio_scan_alloc_net() rapidio: fix an API misues when rio_add_net() fails MAINTAINERS: .mailmap: update Sumit Garg's email address Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone" mm: fix finish_fault() handling for large folios mm: don't skip arch_sync_kernel_mappings() in error paths mm: shmem: remove unnecessary warning in shmem_writepage() userfaultfd: fix PTE unmapping stack-allocated PTE copies userfaultfd: do not block on locking a large folio with raised refcount mm: zswap: use ATOMIC_LONG_INIT to initialize zswap_stored_pages mm: shmem: fix potential data corruption during shmem swapin mm: fix kernel BUG when userfaultfd_move encounters swapcache selftests/damon/damon_nr_regions: sort collected regiosn before checking with min/max boundaries selftests/damon/damon_nr_regions: set ops update for merge results check to 100ms selftests/damon/damos_quota: make real expectation of quota exceeds include/linux/log2.h: mark is_power_of_2() with __always_inline NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback mm, swap: avoid BUG_ON in relocate_cluster() mm: swap: use correct step in loop to wait all clusters in wait_for_allocation() ...
2025-03-06NFS: fix nfs_release_folio() to not deadlock via kcompactd writebackMike Snitzer1-1/+2
Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check for it so nfs_release_folio() can skip calling nfs_wb_folio() from kcompactd. Otherwise NFS can deadlock waiting for kcompactd induced writeback which recurses back to NFS (which triggers writeback to NFSD via NFS loopback mount on the same host, NFSD blocks waiting for XFS's call to __filemap_get_folio): [6070.550357] INFO: task kcompactd0:58 blocked for more than 4435 seconds. {--- [58] "kcompactd0" [<0>] folio_wait_bit+0xe8/0x200 [<0>] folio_wait_writeback+0x2b/0x80 [<0>] nfs_wb_folio+0x80/0x1b0 [nfs] [<0>] nfs_release_folio+0x68/0x130 [nfs] [<0>] split_huge_page_to_list_to_order+0x362/0x840 [<0>] migrate_pages_batch+0x43d/0xb90 [<0>] migrate_pages_sync+0x9a/0x240 [<0>] migrate_pages+0x93c/0x9f0 [<0>] compact_zone+0x8e2/0x1030 [<0>] compact_node+0xdb/0x120 [<0>] kcompactd+0x121/0x2e0 [<0>] kthread+0xcf/0x100 [<0>] ret_from_fork+0x31/0x40 [<0>] ret_from_fork_asm+0x1a/0x30 ---} [akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20250225022002.26141-1-snitzer@kernel.org Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
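The shape of the fix, checking a per-task flag before starting a blocking writeback, can be modelled in user space. This is a loose sketch under assumptions: the flag value, `task_sketch`, and `release_folio_sketch()` are hypothetical, and the real nfs_release_folio() has further conditions that are not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit; the kernel's PF_KCOMPACTD value is not shown here. */
#define PF_KCOMPACTD_SKETCH 0x1u

struct task_sketch {
    unsigned int flags;
};

/* Modelled on the helper named in the commit message. */
static bool task_is_kcompactd(const struct task_sketch *t)
{
    return (t->flags & PF_KCOMPACTD_SKETCH) != 0;
}

/*
 * Decide whether a folio can be released. If it still has pending
 * writes and the caller is kcompactd, refuse without waiting, since
 * waiting is what recursed into the deadlock. *waited records whether
 * a blocking writeback wait would have happened.
 */
static bool release_folio_sketch(const struct task_sketch *t,
                                 bool has_pending_writes, bool *waited)
{
    assert(t != NULL && waited != NULL);
    *waited = false;
    if (!has_pending_writes)
        return true;
    if (task_is_kcompactd(t))
        return false; /* skip the writeback wait entirely */
    *waited = true;   /* other callers may block on writeback */
    return true;
}
```

Returning false simply tells the caller the folio cannot be released right now, which is a safe answer for compaction to receive.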
2025-03-05nfs: change mkdir inode_operation to return alternate dentry if needed.NeilBrown4-38/+59
mkdir now allows a different dentry to be returned which is sometimes relevant for nfs. This patch changes the nfs_rpc_ops mkdir op to return a dentry, and passes that back to the caller. The mkdir nfs_rpc_op will return NULL if the original dentry should be used. This matches the mkdir inode_operation. nfs4_do_create() is duplicated to nfs4_do_mkdir() which is changed to handle the specifics of directories. Consequently the current special handling for directories is removed from nfs4_do_create() Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-6-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27Change inode_operations.mkdir to return struct dentry *NeilBrown2-6/+6
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have complete control of sequencing on the actual filesystem (e.g. on a different server) and may find that the inode created for a mkdir request already exists in the icache and dcache by the time the mkdir request returns. For example, if the filesystem is mounted twice the directory could be visible on the other mount before it is on the original mount, and a pair of name_to_handle_at(), open_by_handle_at() calls could instantiate the directory inode with an IS_ROOT() dentry before the first mkdir returns. This means that the dentry passed to ->mkdir() may not be the one that is associated with the inode after the ->mkdir() completes. Some callers need to interact with the inode after the ->mkdir completes and they currently need to perform a lookup in the (rare) case that the dentry is no longer hashed. This lookup-after-mkdir requires that the directory remains locked to avoid races. Planned future patches to lock the dentry rather than the directory will mean that this lookup cannot be performed atomically with the mkdir. To remove this barrier, this patch changes ->mkdir to return the resulting dentry if it is different from the one passed in. Possible returns are: NULL - the directory was created and no other dentry was used ERR_PTR() - an error occurred non-NULL - this other dentry was spliced in This patch only changes file-systems to return "ERR_PTR(err)" instead of "err" or equivalent transformations. Subsequent patches will make further changes to some file-systems to return a correct dentry. Not all filesystems reliably result in a positive hashed dentry: - NFS, cifs, hostfs will sometimes need to perform a lookup of the name to get inode information. Races could result in this returning something different. Note that this lookup is non-atomic which is what we are trying to avoid. Placing the lookup in filesystem code means it only happens when the filesystem has no other option. 
- kernfs and tracefs leave the dentry negative and the ->revalidate operation ensures that lookup will be called to correctly populate the dentry. This could be fixed but I don't think it is important to any of the users of vfs_mkdir() which look at the dentry. The recommendation to use d_drop();d_splice_alias() is ugly but fits with current practice. A planned future patch will change this. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
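The three possible returns listed above (NULL, an error pointer, or a different dentry) can be illustrated from the caller's side. This user-space sketch re-creates error-pointer helpers modelled on the kernel's ERR_PTR family under local lowercase names; `dentry_sketch` and `resolve_mkdir_result()` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno in a pointer, as error pointers do. */
static void *err_ptr(intptr_t err)
{
    return (void *)err;
}

static intptr_t ptr_err(const void *p)
{
    return (intptr_t)p;
}

/* Error pointers occupy the top MAX_ERRNO_SKETCH addresses. */
static bool is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

struct dentry_sketch {
    const char *name;
};

/*
 * Pick the dentry to keep using after a mkdir-style call:
 * an error pointer is passed through, NULL means the dentry that
 * was passed in is still the right one, and anything else is the
 * dentry that was spliced in instead.
 */
static struct dentry_sketch *resolve_mkdir_result(struct dentry_sketch *passed_in,
                                                  struct dentry_sketch *ret)
{
    assert(passed_in != NULL);
    if (is_err(ret))
        return ret;
    return ret ? ret : passed_in;
}
```

A caller written this way never needs the separate lookup-after-mkdir step, because the filesystem has already said which dentry ended up attached to the new directory.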
2025-02-27Merge patch series "prep patches for my mkdir series"Christian Brauner1-3/+1
NeilBrown <neilb@suse.de> says: These two patches are cleanups that are dependencies for my mkdir changes and subsequent directory locking changes. * patches from https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de: (2 commits) nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked() nfs/vfs: discard d_exact_alias() Link: https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>