<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/nfs_fs.h, branch v6.1.175</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.175</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.175'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2024-11-14T12:15:10+00:00</updated>
<entry>
<title>NFSv3: handle out-of-order write replies.</title>
<updated>2024-11-14T12:15:10+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2023-03-21T22:27:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ba5634feb29f059eb71b12a3113a07338137d5cf'/>
<id>urn:sha1:ba5634feb29f059eb71b12a3113a07338137d5cf</id>
<content type='text'>
[ Upstream commit 3db63daabe210af32a09533fe7d8d47c711a103c ]

NFSv3 includes pre/post wcc attributes which allow the client to
determine if all changes to the file have been made by the client
itself, or if any might have been made by some other client.

If there are gaps in the pre/post ctime sequence it must be assumed that
some other client changed the file in that gap and the local cache must
be suspect.  The next time the file is opened the cache should be
invalidated.

Since Commit 1c341b777501 ("NFS: Add deferred cache invalidation for
close-to-open consistency violations") in linux 5.3 the Linux client has
been triggering this invalidation.  The chunk in nfs_update_inode() in
particularly triggers.

Unfortunately Linux NFS assumes that all replies will be processed in
the order sent, and will arrive in the order processed.  This is not
true in general.  Consequently Linux NFS might ignore the wcc info in a
WRITE reply because the reply is in response to a WRITE that was sent
before some other request for which a reply has already been seen.  This
is detected by Linux using the gencount tests in nfs_inode_attr_cmp().

Also, when the gencount tests pass it is still possible that the request
were processed on the server in a different order, and a gap seen in
the ctime sequence might be filled in by a subsequent reply, so gaps
should not immediately trigger delayed invalidation.

The net result is that writing to a server and then reading the file
back can result in going to the server for the read rather than serving
it from cache - all because a couple of replies arrived out-of-order.
This is a performance regression over kernels before 5.3, though the
change in 5.3 is a correctness improvement.

This has been seen with Linux writing to a Netapp server which
occasionally re-orders requests.  In testing the majority of requests
were in-order, but a few (maybe 2 or three at a time) could be
re-ordered.

This patch addresses the problem by recording any gaps seen in the
pre/post ctime sequence and not triggering invalidation until either
there are too many gaps to fit in the table, or until there are no more
active writes and the remaining gaps cannot be resolved.

We allocate a table of 16 gaps on demand.  If the allocation fails we
revert to current behaviour which is of little cost as we are unlikely
to be able to cache the writes anyway.

In the table we store "start-&gt;end" pair when iversion is updated and
"end&lt;-start" pairs pre/post pairs reported by the server.  Usually these
exactly cancel out and so nothing is stored.  When there are
out-of-order replies we do store gaps and these will eventually be
cancelled against later replies when this client is the only writer.

If the final write is out-of-order there may be one gap remaining when
the file is closed.  This will be noticed and if there is precisely on
gap and if the iversion can be advanced to match it, then we do so.

This patch makes no attempt to handle directories correctly.  The same
problem potentially exists in the out-of-order replies to create/unlink
requests can cause future lookup requires to be sent to the server
unnecessarily.  A similar scheme using the same primitives could be used
to notice and handle out-of-order replies.

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: Anna Schumaker &lt;Anna.Schumaker@Netapp.com&gt;
Stable-dep-of: 867da60d463b ("nfs: avoid i_lock contention in nfs_clear_invalid_mapping")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>nfs: fix UAF in direct writes</title>
<updated>2024-04-03T13:19:34+00:00</updated>
<author>
<name>Josef Bacik</name>
<email>josef@toxicpanda.com</email>
</author>
<published>2024-03-01T16:49:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3abc2d160ed8213948b147295d77d44a22c88fa3'/>
<id>urn:sha1:3abc2d160ed8213948b147295d77d44a22c88fa3</id>
<content type='text'>
[ Upstream commit 17f46b803d4f23c66cacce81db35fef3adb8f2af ]

In production we have been hitting the following warning consistently

------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0
Workqueue: nfsiod nfs_direct_write_schedule_work [nfs]
RIP: 0010:refcount_warn_saturate+0x9c/0xe0
PKRU: 55555554
Call Trace:
 &lt;TASK&gt;
 ? __warn+0x9f/0x130
 ? refcount_warn_saturate+0x9c/0xe0
 ? report_bug+0xcc/0x150
 ? handle_bug+0x3d/0x70
 ? exc_invalid_op+0x16/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? refcount_warn_saturate+0x9c/0xe0
 nfs_direct_write_schedule_work+0x237/0x250 [nfs]
 process_one_work+0x12f/0x4a0
 worker_thread+0x14e/0x3b0
 ? ZSTD_getCParams_internal+0x220/0x220
 kthread+0xdc/0x120
 ? __btf_name_valid+0xa0/0xa0
 ret_from_fork+0x1f/0x30

This is because we're completing the nfs_direct_request twice in a row.

The source of this is when we have our commit requests to submit, we
process them and send them off, and then in the completion path for the
commit requests we have

if (nfs_commit_end(cinfo.mds))
	nfs_direct_write_complete(dreq);

However since we're submitting asynchronous requests we sometimes have
one that completes before we submit the next one, so we end up calling
complete on the nfs_direct_request twice.

The only other place we use nfs_generic_commit_list() is in
__nfs_commit_inode, which wraps this call in a

nfs_commit_begin();
nfs_commit_end();

Which is a common pattern for this style of completion handling, one
that is also repeated in the direct code with get_dreq()/put_dreq()
calls around where we process events as well as in the completion paths.

Fix this by using the same pattern for the commit requests.

Before with my 200 node rocksdb stress running this warning would pop
every 10ish minutes.  With my patch the stress test has been running for
several hours without popping.

Signed-off-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust &lt;trond.myklebust@hammerspace.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>NFS: Cleanup to remove unused flag NFS_CONTEXT_RESEND_WRITES</title>
<updated>2022-08-13T17:02:14+00:00</updated>
<author>
<name>Trond Myklebust</name>
<email>trond.myklebust@hammerspace.com</email>
</author>
<published>2022-08-13T12:51:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5f6277a0c15e1ea54b6fd3d78c9fff7bfe42556c'/>
<id>urn:sha1:5f6277a0c15e1ea54b6fd3d78c9fff7bfe42556c</id>
<content type='text'>
Signed-off-by: Trond Myklebust &lt;trond.myklebust@hammerspace.com&gt;
</content>
</entry>
<entry>
<title>NFS: Fix another fsync() issue after a server reboot</title>
<updated>2022-08-13T17:02:13+00:00</updated>
<author>
<name>Trond Myklebust</name>
<email>trond.myklebust@hammerspace.com</email>
</author>
<published>2022-08-13T12:22:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=67f4b5dc49913abcdb5cc736e73674e2f352f81d'/>
<id>urn:sha1:67f4b5dc49913abcdb5cc736e73674e2f352f81d</id>
<content type='text'>
Currently, when the writeback code detects a server reboot, it redirties
any pages that were not committed to disk, and it sets the flag
NFS_CONTEXT_RESEND_WRITES in the nfs_open_context of the file descriptor
that dirtied the file. While this allows the file descriptor in question
to redrive its own writes, it violates the fsync() requirement that we
should be synchronising all writes to disk.
While the problem is infrequent, we do see corner cases where an
untimely server reboot causes the fsync() call to abandon its attempt to
sync data to disk and causing data corruption issues due to missed error
conditions or similar.

In order to tighted up the client's ability to deal with this situation
without introducing livelocks, add a counter that records the number of
times pages are redirtied due to a server reboot-like condition, and use
that in fsync() to redrive the sync to disk.

Fixes: 2197e9b06c22 ("NFS: Fix up fsync() when the server rebooted")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust &lt;trond.myklebust@hammerspace.com&gt;
</content>
</entry>
<entry>
<title>NFS: don't unhash dentry during unlink/rename</title>
<updated>2022-08-08T20:25:56+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2022-08-01T00:33:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3c59366c207e4c6c6569524af606baf017a55c61'/>
<id>urn:sha1:3c59366c207e4c6c6569524af606baf017a55c61</id>
<content type='text'>
NFS unlink() (and rename over existing target) must determine if the
file is open, and must perform a "silly rename" instead of an unlink (or
before rename) if it is.  Otherwise the client might hold a file open
which has been removed on the server.

Consequently if it determines that the file isn't open, it must block
any subsequent opens until the unlink/rename has been completed on the
server.

This is currently achieved by unhashing the dentry.  This forces any
open attempt to the slow-path for lookup which will block on i_rwsem on
the directory until the unlink/rename completes.  A future patch will
change the VFS to only get a shared lock on i_rwsem for unlink, so this
will no longer work.

Instead we introduce an explicit interlock.  A special value is stored
in dentry-&gt;d_fsdata while the unlink/rename is running and
-&gt;d_revalidate blocks while that value is present.  When -&gt;d_revalidate
unblocks, the dentry will be invalid.  This closes the race
without requiring exclusion on i_rwsem.

d_fsdata is already used in two different ways.
1/ an IS_ROOT directory dentry might have a "devname" stored in
   d_fsdata.  Such a dentry doesn't have a name and so cannot be the
   target of unlink or rename.  For safety we check if an old devname
   is still stored, and remove it if it is.
2/ a dentry with DCACHE_NFSFS_RENAMED set will have a 'struct
   nfs_unlinkdata' stored in d_fsdata.  While this is set maydelete()
   will fail, so an unlink or rename will never proceed on such
   a dentry.

Neither of these can be in effect when a dentry is the target of unlink
or rename.  So we can expect d_fsdata to be NULL, and store a special
value ((void*)1) which is given the name NFS_FSDATA_BLOCKED to indicate
that any lookup will be blocked.

The d_count() is incremented under d_lock() when a lookup finds the
dentry, so we check d_count() is low, and set NFS_FSDATA_BLOCKED under
the same lock to avoid any races.

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: Trond Myklebust &lt;trond.myklebust@hammerspace.com&gt;
</content>
</entry>
<entry>
<title>Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2022-05-26T19:32:41+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2022-05-26T19:32:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=98931dd95fd489fcbfa97da563505a6f071d7c77'/>
<id>urn:sha1:98931dd95fd489fcbfa97da563505a6f071d7c77</id>
<content type='text'>
Pull MM updates from Andrew Morton:
 "Almost all of MM here. A few things are still getting finished off,
  reviewed, etc.

   - Yang Shi has improved the behaviour of khugepaged collapsing of
     readonly file-backed transparent hugepages.

   - Johannes Weiner has arranged for zswap memory use to be tracked and
     managed on a per-cgroup basis.

   - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
     runtime enablement of the recent huge page vmemmap optimization
     feature.

   - Baolin Wang contributes a series to fix some issues around hugetlb
     pagetable invalidation.

   - Zhenwei Pi has fixed some interactions between hwpoisoned pages and
     virtualization.

   - Tong Tiangen has enabled the use of the presently x86-only
     page_table_check debugging feature on arm64 and riscv.

   - David Vernet has done some fixup work on the memcg selftests.

   - Peter Xu has taught userfaultfd to handle write protection faults
     against shmem- and hugetlbfs-backed files.

   - More DAMON development from SeongJae Park - adding online tuning of
     the feature and support for monitoring of fixed virtual address
     ranges. Also easier discovery of which monitoring operations are
     available.

   - Nadav Amit has done some optimization of TLB flushing during
     mprotect().

   - Neil Brown continues to labor away at improving our swap-over-NFS
     support.

   - David Hildenbrand has some fixes to anon page COWing versus
     get_user_pages().

   - Peng Liu fixed some errors in the core hugetlb code.

   - Joao Martins has reduced the amount of memory consumed by
     device-dax's compound devmaps.

   - Some cleanups of the arch-specific pagemap code from Anshuman
     Khandual.

   - Muchun Song has found and fixed some errors in the TLB flushing of
     transparent hugepages.

   - Roman Gushchin has done more work on the memcg selftests.

  ... and, of course, many smaller fixes and cleanups. Notably, the
  customary million cleanup serieses from Miaohe Lin"

* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
  mm: kfence: use PAGE_ALIGNED helper
  selftests: vm: add the "settings" file with timeout variable
  selftests: vm: add "test_hmm.sh" to TEST_FILES
  selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
  selftests: vm: add migration to the .gitignore
  selftests/vm/pkeys: fix typo in comment
  ksm: fix typo in comment
  selftests: vm: add process_mrelease tests
  Revert "mm/vmscan: never demote for memcg reclaim"
  mm/kfence: print disabling or re-enabling message
  include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
  include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
  mm: fix a potential infinite loop in start_isolate_page_range()
  MAINTAINERS: add Muchun as co-maintainer for HugeTLB
  zram: fix Kconfig dependency warning
  mm/shmem: fix shmem folio swapoff hang
  cgroup: fix an error handling path in alloc_pagecache_max_30M()
  mm: damon: use HPAGE_PMD_SIZE
  tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
  nodemask.h: fix compilation error with GCC12
  ...
</content>
</entry>
<entry>
<title>nfs: rename nfs_direct_IO and use as -&gt;swap_rw</title>
<updated>2022-05-10T01:20:48+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2022-05-10T01:20:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=eb79f3af9395fbe448e91b0940a6c395b7d06be4'/>
<id>urn:sha1:eb79f3af9395fbe448e91b0940a6c395b7d06be4</id>
<content type='text'>
The nfs_direct_IO() exists to support SWAP IO, but hasn't worked for a
while.  We now need a -&gt;swap_rw function which behaves slightly
differently, returning zero for success rather than a byte count.

So modify nfs_direct_IO accordingly, rename it, and use it as the
-&gt;swap_rw function.

Link: https://lkml.kernel.org/r/165119301493.15698.7491285551903597618.stgit@noble.brown
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Tested-by: Geert Uytterhoeven &lt;geert+renesas@glider.be&gt; (on Renesas RSK+RZA1 with 32 MiB of SDRAM)
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Trond Myklebust &lt;trond.myklebust@hammerspace.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>nfs: Convert nfs to read_folio</title>
<updated>2022-05-09T20:21:46+00:00</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2022-04-29T15:12:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=65d023af7f29eb1250a6105141a74776bae7e1f8'/>
<id>urn:sha1:65d023af7f29eb1250a6105141a74776bae7e1f8</id>
<content type='text'>
This is a "weak" conversion which converts straight back to using pages.
A full conversion should be performed at some point, hopefully by
someone familiar with the filesystem.

Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs</title>
<updated>2022-03-30T01:55:37+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2022-03-30T01:55:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=965181d7ef7e1a863477536dc328c23a7ebc8a1d'/>
<id>urn:sha1:965181d7ef7e1a863477536dc328c23a7ebc8a1d</id>
<content type='text'>
Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:

   - Switch NFS to use readahead instead of the obsolete readpages.

   - Readdir fixes to improve cacheability of large directories when
     there are multiple readers and writers.

   - Readdir performance improvements when doing a seekdir() immediately
     after opening the directory (common when re-exporting NFS).

   - NFS swap improvements from Neil Brown.

   - Loosen up memory allocation to permit direct reclaim and write back
     in cases where there is no danger of deadlocking the writeback code
     or NFS swap.

   - Avoid sillyrename when the NFSv4 server claims to support the
     necessary features to recover the unlinked but open file after
     reboot.

  Bugfixes:

   - Patch from Olga to add a mount option to control NFSv4.1 session
     trunking discovery, and default it to being off.

   - Fix a lockup in nfs_do_recoalesce().

   - Two fixes for list iterator variables being used when pointing to
     the list head.

   - Fix a kernel memory scribble when reading from a non-socket
     transport in /sys/kernel/sunrpc.

   - Fix a race where reconnecting to a server could leave the TCP
     socket stuck forever in the connecting state.

   - Patch from Neil to fix a shutdown race which can leave the SUNRPC
     transport timer primed after we free the struct xprt itself.

   - Patch from Xin Xiong to fix reference count leaks in the NFSv4.2
     copy offload.

   - Sunrpc patch from Olga to avoid resending a task on an offlined
     transport.

  Cleanups:

   - Patches from Dave Wysochanski to clean up the fscache code"

* tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (91 commits)
  NFSv4/pNFS: Fix another issue with a list iterator pointing to the head
  NFS: Don't loop forever in nfs_do_recoalesce()
  SUNRPC: Don't return error values in sysfs read of closed files
  SUNRPC: Do not dereference non-socket transports in sysfs
  NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error
  SUNRPC don't resend a task on an offlined transport
  NFS: replace usage of found with dedicated list iterator variable
  SUNRPC: avoid race between mod_timer() and del_timer_sync()
  pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod
  pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod
  NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod
  NFS: Avoid writeback threads getting stuck in mempool_alloc()
  NFS: nfsiod should not block forever in mempool_alloc()
  SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent
  SUNRPC: Fix unx_lookup_cred() allocation
  NFS: Fix memory allocation in rpc_alloc_task()
  NFS: Fix memory allocation in rpc_malloc()
  SUNRPC: Improve accuracy of socket ENOBUFS determination
  SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE
  SUNRPC: Fix socket waits for write buffer space
  ...
</content>
</entry>
<entry>
<title>NFS: nfsiod should not block forever in mempool_alloc()</title>
<updated>2022-03-22T19:52:56+00:00</updated>
<author>
<name>Trond Myklebust</name>
<email>trond.myklebust@hammerspace.com</email>
</author>
<published>2022-03-21T16:34:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=515dcdcd48736576c6f5c197814da6f81c60a21e'/>
<id>urn:sha1:515dcdcd48736576c6f5c197814da6f81c60a21e</id>
<content type='text'>
The concern is that since nfsiod is sometimes required to kick off a
commit, it can get locked up waiting forever in mempool_alloc() instead
of failing gracefully and leaving the commit until later.

Try to allocate from the slab first, with GFP_KERNEL | __GFP_NORETRY,
then fall back to a non-blocking attempt to allocate from the memory
pool.

Signed-off-by: Trond Myklebust &lt;trond.myklebust@hammerspace.com&gt;
</content>
</entry>
</feed>
