<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/namei.c, branch linux-6.0.y</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=linux-6.0.y</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=linux-6.0.y'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2022-11-26T08:27:56+00:00</updated>
<entry>
<title>mm: fs: initialize fsdata passed to write_begin/write_end interface</title>
<updated>2022-11-26T08:27:56+00:00</updated>
<author>
<name>Alexander Potapenko</name>
<email>glider@google.com</email>
</author>
<published>2022-09-15T15:04:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=80ec940ee52feec55200d116a4987511524dc928'/>
<id>urn:sha1:80ec940ee52feec55200d116a4987511524dc928</id>
<content type='text'>
commit 1468c6f4558b1bcd92aa0400f2920f9dc7588402 upstream.

Functions implementing the a_ops-&gt;write_end() interface accept the `void
*fsdata` parameter that is supposed to be initialized by the corresponding
a_ops-&gt;write_begin() (which accepts `void **fsdata`).

However not all a_ops-&gt;write_begin() implementations initialize `fsdata`
unconditionally, so it may get passed uninitialized to a_ops-&gt;write_end(),
resulting in undefined behavior.

Fix this by initializing fsdata with NULL before the call to
write_begin(), rather than doing so in all possible a_ops implementations.

This patch covers only the following cases found by running x86 KMSAN
under syzkaller:

 - generic_perform_write()
 - cont_expand_zero() and generic_cont_expand_simple()
 - page_symlink()

Other cases of passing uninitialized fsdata may persist in the codebase.

Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.com
Signed-off-by: Alexander Potapenko &lt;glider@google.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Andrey Konovalov &lt;andreyknvl@gmail.com&gt;
Cc: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Borislav Petkov &lt;bp@alien8.de&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Eric Biggers &lt;ebiggers@google.com&gt;
Cc: Eric Biggers &lt;ebiggers@kernel.org&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Cc: Ilya Leoshkevich &lt;iii@linux.ibm.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Marco Elver &lt;elver@google.com&gt;
Cc: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Michael S. Tsirkin &lt;mst@redhat.com&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Petr Mladek &lt;pmladek@suse.com&gt;
Cc: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vasily Gorbik &lt;gor@linux.ibm.com&gt;
Cc: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux</title>
<updated>2022-08-09T16:52:28+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2022-08-09T16:52:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=426b4ca2d6a5ab51f6b6175d06e4f8ddea434cdf'/>
<id>urn:sha1:426b4ca2d6a5ab51f6b6175d06e4f8ddea434cdf</id>
<content type='text'>
Pull setgid updates from Christian Brauner:
 "This contains the work to move setgid stripping out of individual
  filesystems and into the VFS itself.

  Creating files that have both the S_IXGRP and S_ISGID bit raised in
  directories that themselves have the S_ISGID bit set requires
  additional privileges to avoid security issues.

  When a filesystem creates a new inode it needs to take care that the
  caller is either in the group of the newly created inode or they have
  CAP_FSETID in their current user namespace and are privileged over the
  parent directory of the new inode. If any of these two conditions is
  true then the S_ISGID bit can be raised for an S_IXGRP file and if not
  it needs to be stripped.

  However, there are several key issues with the current implementation:

   - S_ISGID stripping logic is entangled with umask stripping.

     For example, if the umask removes the S_IXGRP bit from the file
     about to be created then the S_ISGID bit will be kept.

     The inode_init_owner() helper is responsible for S_ISGID stripping
     and is called before posix_acl_create(). So we can end up with two
     different orderings:

     1. FS without POSIX ACL support

        First strip umask then strip S_ISGID in inode_init_owner().

        In other words, if a filesystem doesn't support or enable POSIX
        ACLs then umask stripping is done directly in the vfs before
        calling into the filesystem:

     2. FS with POSIX ACL support

        First strip S_ISGID in inode_init_owner() then strip umask in
        posix_acl_create().

        In other words, if the filesystem does support POSIX ACLs then
        unmask stripping may be done in the filesystem itself when
        calling posix_acl_create().

     Note that technically filesystems are free to impose their own
     ordering between posix_acl_create() and inode_init_owner() meaning
     that there's additional ordering issues that influence S_ISGID
     inheritance.

     (Note that the commit message of commit 1639a49ccdce ("fs: move
     S_ISGID stripping into the vfs_*() helpers") gets the ordering
     between inode_init_owner() and posix_acl_create() the wrong way
     around. I realized this too late.)

   - Filesystems that don't rely on inode_init_owner() don't get S_ISGID
     stripping logic.

     While that may be intentional (e.g. network filesystems might just
     defer setgid stripping to a server) it is often just a security
     issue.

     Note that mandating the use of inode_init_owner() was proposed as
     an alternative solution but that wouldn't fix the ordering issues
     and there are examples such as afs where the use of
     inode_init_owner() isn't possible.

     In any case, we should also try the cleaner and generalized
     solution first before resorting to this approach.

   - We still have S_ISGID inheritance bugs years after the initial
     round of S_ISGID inheritance fixes:

       e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes")
       01ea173e103e ("xfs: fix up non-directory creation in SGID directories")
       fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories")

  All of this led us to conclude that the current state is too messy.
  While we won't be able to make it completely clean as
  posix_acl_create() is still a filesystem specific call we can improve
  the S_SIGD stripping situation quite a bit by hoisting it out of
  inode_init_owner() and into the respective vfs creation operations.

  The obvious advantage is that we don't need to rely on individual
  filesystems getting S_ISGID stripping right and instead can
  standardize the ordering between S_ISGID and umask stripping directly
  in the VFS.

  A few short implementation notes:

   - The stripping logic needs to happen in vfs_*() helpers for the sake
     of stacking filesystems such as overlayfs that rely on these
     helpers taking care of S_ISGID stripping.

   - Security hooks have never seen the mode as it is ultimately seen by
     the filesystem because of the ordering issue we mentioned. Nothing
     is changed for them. We simply continue to strip the umask before
     passing the mode down to the security hooks.

   - The following filesystems use inode_init_owner() and thus relied on
     S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs,
     hfsplus, hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs,
     overlayfs, ramfs, reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs,
     bpf, tmpfs.

     We've audited all callchains as best as we could. More details can
     be found in the commit message to 1639a49ccdce ("fs: move S_ISGID
     stripping into the vfs_*() helpers")"

* tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  ceph: rely on vfs for setgid stripping
  fs: move S_ISGID stripping into the vfs_*() helpers
  fs: Add missing umask strip in vfs_tmpfile
  fs: add mode_strip_sgid() helper
</content>
</entry>
<entry>
<title>fs: move S_ISGID stripping into the vfs_*() helpers</title>
<updated>2022-07-21T09:34:16+00:00</updated>
<author>
<name>Yang Xu</name>
<email>xuyang2018.jy@fujitsu.com</email>
</author>
<published>2022-07-14T06:11:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1639a49ccdce58ea248841ed9b23babcce6dbb0b'/>
<id>urn:sha1:1639a49ccdce58ea248841ed9b23babcce6dbb0b</id>
<content type='text'>
Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.

Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.

When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.

However, there are several key issues with the current implementation:

* S_ISGID stripping logic is entangled with umask stripping.

  If a filesystem doesn't support or enable POSIX ACLs then umask
  stripping is done directly in the vfs before calling into the
  filesystem.
  If the filesystem does support POSIX ACLs then unmask stripping may be
  done in the filesystem itself when calling posix_acl_create().

  Since umask stripping has an effect on S_ISGID inheritance, e.g., by
  stripping the S_IXGRP bit from the file to be created and all relevant
  filesystems have to call posix_acl_create() before inode_init_owner()
  where we currently take care of S_ISGID handling S_ISGID handling is
  order dependent. IOW, whether or not you get a setgid bit depends on
  POSIX ACLs and umask and in what order they are called.

  Note that technically filesystems are free to impose their own
  ordering between posix_acl_create() and inode_init_owner() meaning
  that there's additional ordering issues that influence S_SIGID
  inheritance.

* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
  stripping logic.

  While that may be intentional (e.g. network filesystems might just
  defer setgid stripping to a server) it is often just a security issue.

This is not just ugly it's unsustainably messy especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.

So the current state is quite messy and while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call we can improve the S_SIGD stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.

We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This has S_ISGID handling is
unaffected unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).

The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. We need to move them into there to make
sure that filesystems like overlayfs hat have callchains like:

sys_mknod()
-&gt; do_mknodat(mode)
   -&gt; .mknod = ovl_mknod(mode)
      -&gt; ovl_create(mode)
         -&gt; vfs_mknod(mode)

get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is VFS security requirement.

Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.

The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.

All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the -&gt;mkdir(), -&gt;mknod(),
-&gt;create(), -&gt;tmpfile(), -&gt;rename() inode operations.

Since directories always inherit the S_ISGID bit with the exception of
xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
apply. The -&gt;symlink() and -&gt;link() inode operations trivially inherit
the mode from the target and the -&gt;rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.

In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therfore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:

* btrfs allows the creation of filesystem objects through various
  ioctls(). Snapshot creation literally takes a snapshot and so the mode
  is fully preserved and S_ISGID stripping doesn't apply.

  Creating a new subvolum relies on inode_init_owner() in
  btrfs_new_subvol_inode() but only creates directories and doesn't
  raise S_ISGID.

* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
  xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
  the actual extents ocfs2 uses a separate ioctl() that also creates the
  target file.

  Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
  inode_init_owner() to strip the S_ISGID bit. This is the only place
  where a filesystem needs to call mode_strip_sgid() directly but this
  is self-inflicted pain.

* spufs doesn't go through the vfs at all and doesn't use ioctl()s
  either. Instead it has a dedicated system call spufs_create() which
  allows the creation of filesystem objects. But spufs only creates
  directories and doesn't allo S_SIGID bits, i.e. it specifically only
  allows 0777 bits.

* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.

The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directories group
unconditionally. In these cases non of the filesystems call
inode_init_owner() and thus did never strip the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.

Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.

Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
Suggested-by: Dave Chinner &lt;david@fromorbit.com&gt;
Suggested-by: Christian Brauner (Microsoft) &lt;brauner@kernel.org&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Reviewed-and-Tested-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Yang Xu &lt;xuyang2018.jy@fujitsu.com&gt;
[&lt;brauner@kernel.org&gt;: rewrote commit message]
Signed-off-by: Christian Brauner (Microsoft) &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>fs: Add missing umask strip in vfs_tmpfile</title>
<updated>2022-07-19T13:13:20+00:00</updated>
<author>
<name>Yang Xu</name>
<email>xuyang2018.jy@fujitsu.com</email>
</author>
<published>2022-07-14T06:11:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ac6800e279a22b28f4fc21439843025a0d5bf03e'/>
<id>urn:sha1:ac6800e279a22b28f4fc21439843025a0d5bf03e</id>
<content type='text'>
All creation paths except for O_TMPFILE handle umask in the vfs directly
if the filesystem doesn't support or enable POSIX ACLs. If the filesystem
does then umask handling is deferred until posix_acl_create().
Because, O_TMPFILE misses umask handling in the vfs it will not honor
umask settings. Fix this by adding the missing umask handling.

Link: https://lore.kernel.org/r/1657779088-2242-2-git-send-email-xuyang2018.jy@fujitsu.com
Fixes: 60545d0d4610 ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...")
Cc: &lt;stable@vger.kernel.org&gt; # 4.19+
Reported-by: Christian Brauner (Microsoft) &lt;brauner@kernel.org&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Reviewed-and-Tested-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Acked-by: Christian Brauner (Microsoft) &lt;brauner@kernel.org&gt;
Signed-off-by: Yang Xu &lt;xuyang2018.jy@fujitsu.com&gt;
Signed-off-by: Christian Brauner (Microsoft) &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>step_into(): move fetching -&gt;d_inode past handle_mounts()</title>
<updated>2022-07-06T17:16:07+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2022-07-04T02:35:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3bd8bc897161730042051cd5f9c6ed1e94cb5453'/>
<id>urn:sha1:3bd8bc897161730042051cd5f9c6ed1e94cb5453</id>
<content type='text'>
... and lose messing with it in __follow_mount_rcu()

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>lookup_fast(): don't bother with inode</title>
<updated>2022-07-06T17:16:03+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2022-07-04T02:20:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4cb640248041dab1c718a6140d758dad5a84b8ec'/>
<id>urn:sha1:4cb640248041dab1c718a6140d758dad5a84b8ec</id>
<content type='text'>
Note that validation of -&gt;d_seq after -&gt;d_inode fetch is gone, along
with fetching of -&gt;d_inode itself.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>follow_dotdot{,_rcu}(): don't bother with inode</title>
<updated>2022-07-06T17:15:59+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2022-07-04T02:18:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b16c001de0f66bc633aefe770a8b0a75c8c39a3b'/>
<id>urn:sha1:b16c001de0f66bc633aefe770a8b0a75c8c39a3b</id>
<content type='text'>
step_into() will fetch it, TYVM.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>step_into(): lose inode argument</title>
<updated>2022-07-06T17:15:52+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2022-07-04T02:07:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a4f5b52167a80edec74093fe6fef291a0318f4ba'/>
<id>urn:sha1:a4f5b52167a80edec74093fe6fef291a0318f4ba</id>
<content type='text'>
make handle_mounts() always fetch it.  This is just the first step -
the callers of step_into() will stop trying to calculate the sucker,
etc.

The passed value should be equal to dentry-&gt;d_inode in all cases;
in RCU mode - fetched after we'd sampled -&gt;d_seq.  Might as well
fetch it here.  We do need to validate -&gt;d_seq, which duplicates
the check currently done in lookup_fast(); that duplication will
go away shortly.

After that change handle_mounts() always ignores the initial value of
*inode and always sets it on success.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>namei: stash the sampled -&gt;d_seq into nameidata</title>
<updated>2022-07-06T17:14:42+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2022-07-04T22:12:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=03fa86e9f79d8b9a6aa28965829a4a8646139a0a'/>
<id>urn:sha1:03fa86e9f79d8b9a6aa28965829a4a8646139a0a</id>
<content type='text'>
New field: nd-&gt;next_seq.  Set to 0 outside of RCU mode, holds the sampled
value for the next dentry to be considered.  Used instead of an arseload
of local variables, arguments, etc.

step_into() has lost seq argument; nd-&gt;next_seq is used, so dentry passed
to it must be the one -&gt;next_seq is about.

There are two requirements for RCU pathwalk:
	1) it should not give a hard failure (other than -ECHILD) unless
non-RCU pathwalk might fail that way given suitable timings.
	2) it should not succeed unless non-RCU pathwalk might succeed
with the same end location given suitable timings.

The use of seq numbers is the way we achieve that.  Invariant we want
to maintain is:
	if RCU pathwalk can reach the state with given nd-&gt;path, nd-&gt;inode
and nd-&gt;seq after having traversed some part of pathname, it must be possible
for non-RCU pathwalk to reach the same nd-&gt;path and nd-&gt;inode after having
traversed the same part of pathname, and observe the nd-&gt;path.dentry-&gt;d_seq
equal to what RCU pathwalk has in nd-&gt;seq

	For transition from parent to child, we sample child's -&gt;d_seq
and verify that parent's -&gt;d_seq remains unchanged.  Anything that
disrupts parent-child relationship would've bumped -&gt;d_seq on both.
	For transitions from child to parent we sample parent's -&gt;d_seq
and verify that child's -&gt;d_seq has not changed.  Same reasoning as
for the previous case applies.
	For transition from mountpoint to root of mounted we sample
the -&gt;d_seq of root and verify that nobody has touched mount_lock since
the beginning of pathwalk.  That guarantees that mount we'd found had
been there all along, with these mountpoint and root of the mounted.
It would be possible for a non-RCU pathwalk to reach the previous state,
find the same mount and observe its root at the moment we'd sampled
-&gt;d_seq of that
	For transitions from root of mounted to mountpoint we sample
-&gt;d_seq of mountpoint and verify that mount_lock had not been touched
since the beginning of pathwalk.  The same reasoning as in the
previous case applies.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>namei: move clearing LOOKUP_RCU towards rcu_read_unlock()</title>
<updated>2022-07-06T16:40:31+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2022-07-06T16:40:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6e180327153071281dbbf6a16759e49862debdca'/>
<id>urn:sha1:6e180327153071281dbbf6a16759e49862debdca</id>
<content type='text'>
try_to_unlazy()/try_to_unlazy_next() drop LOOKUP_RCU in the
very beginning and do rcu_read_unlock() only at the very end.
However, nothing done in between even looks at the flag in
question; might as well clear it at the same time we unlock.

Note that try_to_unlazy_next() used to call legitimize_mnt(),
which might drop/regain rcu_read_lock() in some cases.  This
is no longer true, so we really have rcu_read_lock() held
all along until the end.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
</feed>
