<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/mount.h, branch v6.12.80</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2025-01-17T12:40:50+00:00</updated>
<entry>
<title>fs: kill MNT_ONRB</title>
<updated>2025-01-17T12:40:50+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2024-12-15T20:17:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1ca9de8867a93ca2b863ba4bf855e174f165ea26'/>
<id>urn:sha1:1ca9de8867a93ca2b863ba4bf855e174f165ea26</id>
<content type='text'>
commit 344bac8f0d73fe970cd9f5b2f132906317d29e8b upstream.

Move mnt-&gt;mnt_node into the union with mnt-&gt;mnt_rcu and mnt-&gt;mnt_llist
instead of keeping it with mnt-&gt;mnt_list. This allows us to use
RB_CLEAR_NODE(&amp;mnt-&gt;mnt_node) in umount_tree() as well as
list_empty(&amp;mnt-&gt;mnt_node). That in turn allows us to remove MNT_ONRB.

This also fixes the bug reported in [1] where seemingly MNT_ONRB wasn't
set in @mnt-&gt;mnt_flags even though the mount was present in the mount
rbtree of the mount namespace.

The root cause is the following race. When a btrfs subvolume is mounted
a temporary mount is created:

btrfs_get_tree_subvol()
{
        mnt = fc_mount()
        // Register the newly allocated mount with sb-&gt;mounts:
        lock_mount_hash();
        list_add_tail(&amp;mnt-&gt;mnt_instance, &amp;mnt-&gt;mnt.mnt_sb-&gt;s_mounts);
        unlock_mount_hash();
}

and registered on sb-&gt;s_mounts. Later it is added to an anonymous mount
namespace via mount_subvol():

-&gt; mount_subvol()
   -&gt; mount_subtree()
      -&gt; alloc_mnt_ns()
         mnt_add_to_ns()
         vfs_path_lookup()
         put_mnt_ns()

The mnt_add_to_ns() call raises MNT_ONRB in @mnt-&gt;mnt_flags. If someone
concurrently does a ro remount:

reconfigure_super()
-&gt; sb_prepare_remount_readonly()
   {
           list_for_each_entry(mnt, &amp;sb-&gt;s_mounts, mnt_instance) {
   }

all mounts registered in sb-&gt;s_mounts are visited and first
MNT_WRITE_HOLD is raised, then MNT_READONLY is raised, and finally
MNT_WRITE_HOLD is removed again.

The flag modification for MNT_WRITE_HOLD/MNT_READONLY and MNT_ONRB race
so MNT_ONRB might be lost.

Fixes: 2eea9ce4310d ("mounts: keep list of mounts in an rbtree")
Cc: &lt;stable@kernel.org&gt; # v6.8+
Link: https://lore.kernel.org/r/20241215-vfs-6-14-mount-work-v1-1-fd55922c4af8@kernel.org
Link: https://lore.kernel.org/r/ec6784ed-8722-4695-980a-4400d4e7bd1a@gmx.com [1]
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'vfs-6.12.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2024-09-16T09:15:26+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-09-16T09:15:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9020d0d844ad58a051f90b1e5b82ba34123925b9'/>
<id>urn:sha1:9020d0d844ad58a051f90b1e5b82ba34123925b9</id>
<content type='text'>
Pull vfs mount updates from Christian Brauner:
 "Recently, we added the ability to list mounts in other mount
  namespaces and the ability to retrieve namespace file descriptors
  without having to go through procfs by deriving them from pidfds.

  This extends nsfs in two ways:

   (1) Add the ability to retrieve information about a mount namespace
       via NS_MNT_GET_INFO.

       This will return the mount namespace id and the number of mounts
       currently in the mount namespace. The number of mounts can be
       used to size the buffer that needs to be used for listmount() and
       is in general useful without having to actually iterate through
       all the mounts.

      The structure is extensible.

   (2) Add the ability to iterate through all mount namespaces over
       which the caller holds privilege returning the file descriptor
       for the next or previous mount namespace.

       To retrieve a mount namespace the caller must be privileged wrt
       to it's owning user namespace. This means that PID 1 on the host
       can list all mounts in all mount namespaces or that a container
       can list all mounts of its nested containers.

       Optionally pass a structure for NS_MNT_GET_INFO with
       NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount
       namespace in one go.

  (1) and (2) can be implemented for other namespace types easily.

  Together with recent api additions this means one can iterate through
  all mounts in all mount namespaces without ever touching procfs.

  The commit message in 49224a345c48 ('Merge patch series "nsfs: iterate
  through mount namespaces"') contains example code how to do this"

* tag 'vfs-6.12.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nsfs: iterate through mount namespaces
  file: add fput() cleanup helper
  fs: add put_mnt_ns() cleanup helper
  fs: allow mount namespace fd
</content>
</entry>
<entry>
<title>fs: mounts: Remove unused declaration mnt_cursor_del()</title>
<updated>2024-08-30T06:22:32+00:00</updated>
<author>
<name>Yue Haibing</name>
<email>yuehaibing@huawei.com</email>
</author>
<published>2024-08-03T11:50:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2e91f69afa7eb4e0bbff2254bb4cc8f97b76c009'/>
<id>urn:sha1:2e91f69afa7eb4e0bbff2254bb4cc8f97b76c009</id>
<content type='text'>
Commit 2eea9ce4310d ("mounts: keep list of mounts in an rbtree")
removed the implementation but leave declaration.

Signed-off-by: Yue Haibing &lt;yuehaibing@huawei.com&gt;
Link: https://lore.kernel.org/r/20240803115000.589872-1-yuehaibing@huawei.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>nsfs: iterate through mount namespaces</title>
<updated>2024-08-09T10:46:59+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2024-07-19T11:41:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a1d220d9dafa8d76ba60a784a1016c3134e6a1e8'/>
<id>urn:sha1:a1d220d9dafa8d76ba60a784a1016c3134e6a1e8</id>
<content type='text'>
It is already possible to list mounts in other mount namespaces and to
retrieve namespace file descriptors without having to go through procfs
by deriving them from pidfds.

Augment these abilities by adding the ability to retrieve information
about a mount namespace via NS_MNT_GET_INFO. This will return the mount
namespace id and the number of mounts currently in the mount namespace.
The number of mounts can be used to size the buffer that needs to be
used for listmount() and is in general useful without having to actually
iterate through all the mounts. The structure is extensible.

And add the ability to iterate through all mount namespaces over which
the caller holds privilege returning the file descriptor for the next or
previous mount namespace.

To retrieve a mount namespace the caller must be privileged wrt to it's
owning user namespace. This means that PID 1 on the host can list all
mounts in all mount namespaces or that a container can list all mounts
of its nested containers.

Optionally pass a structure for NS_MNT_GET_INFO with
NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount namespace
in one go. Both ioctls can be implemented for other namespace types
easily.

Together with recent api additions this means one can iterate through
all mounts in all mount namespaces without ever touching procfs.

Link: https://lore.kernel.org/r/20240719-work-mount-namespace-v1-5-834113cab0d2@kernel.org
Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'vfs-6.11.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2024-07-15T18:54:04+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-07-15T18:54:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f608cabaeda472887c008e42398e8fca14e8f411'/>
<id>urn:sha1:f608cabaeda472887c008e42398e8fca14e8f411</id>
<content type='text'>
Pull vfs mount query updates from Christian Brauner:
 "This contains work to extend the abilities of listmount() and
  statmount() and various fixes and cleanups.

  Features:

   - Allow iterating through mounts via listmount() from newest to
     oldest. This makes it possible for mount(8) to keep iterating the
     mount table in reverse order so it gets newest mounts first.

   - Relax permissions on listmount() and statmount().

     It's not necessary to have capabilities in the initial namespace:
     it is sufficient to have capabilities in the owning namespace of
     the mount namespace we're located in to list unreachable mounts in
     that namespace.

   - Extend both listmount() and statmount() to list and stat mounts in
     foreign mount namespaces.

     Currently the only way to iterate over mount entries in mount
     namespaces that aren't in the caller's mount namespace is by
     crawling through /proc in order to find /proc/&lt;pid&gt;/mountinfo for
     the relevant mount namespace.

     This is both very clumsy and hugely inefficient. So extend struct
     mnt_id_req with a new member that allows to specify the mount
     namespace id of the mount namespace we want to look at.

     Luckily internally we already have most of the infrastructure for
     this so we just need to expose it to userspace. Give userspace a
     way to retrieve the id of a mount namespace via statmount() and
     through a new nsfs ioctl() on mount namespace file descriptor.

     This comes with appropriate selftests.

   - Expose mount options through statmount().

     Currently if userspace wants to get mount options for a mount and
     with statmount(), they still have to open /proc/&lt;pid&gt;/mountinfo to
     parse mount options. Simply the information through statmount()
     directly.

     Afterwards it's possible to only rely on statmount() and
     listmount() to retrieve all and more information than
     /proc/&lt;pid&gt;/mountinfo provides.

     This comes with appropriate selftests.

  Fixes:

   - Avoid copying to userspace under the namespace semaphore in
     listmount.

  Cleanups:

   - Simplify the error handling in listmount by relying on our newly
     added cleanup infrastructure.

   - Refuse invalid mount ids early for both listmount and statmount"

* tag 'vfs-6.11.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: reject invalid last mount id early
  fs: refuse mnt id requests with invalid ids early
  fs: find rootfs mount of the mount namespace
  fs: only copy to userspace on success in listmount()
  sefltests: extend the statmount test for mount options
  fs: use guard for namespace_sem in statmount()
  fs: export mount options via statmount()
  fs: rename show_mnt_opts -&gt; show_vfsmnt_opts
  selftests: add a test for the foreign mnt ns extensions
  fs: add an ioctl to get the mnt ns id from nsfs
  fs: Allow statmount() in foreign mount namespace
  fs: Allow listmount() in foreign mount namespace
  fs: export the mount ns id via statmount
  fs: keep an index of current mount namespaces
  fs: relax permissions for statmount()
  listmount: allow listing in reverse order
  fs: relax permissions for listmount()
  fs: simplify error handling
  fs: don't copy to userspace under namespace semaphore
  path: add cleanup helper
</content>
</entry>
<entry>
<title>fs: keep an index of current mount namespaces</title>
<updated>2024-06-28T07:53:30+00:00</updated>
<author>
<name>Josef Bacik</name>
<email>josef@toxicpanda.com</email>
</author>
<published>2024-06-24T15:49:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1901c92497bd90caf608a474f1bf4d8795b372a2'/>
<id>urn:sha1:1901c92497bd90caf608a474f1bf4d8795b372a2</id>
<content type='text'>
In order to allow for listmount() to be used on different namespaces we
need a way to lookup a mount ns by its id.  Keep a rbtree of the current
!anonymous mount name spaces indexed by ID that we can use to look up
the namespace.

Co-developed-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Link: https://lore.kernel.org/r/e5fdd78a90f5b00a75bd893962a70f52a2c015cd.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>fhandle: relax open_by_handle_at() permission checks</title>
<updated>2024-05-28T13:57:23+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2024-05-24T10:19:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=620c266f394932e5decc4b34683a75dfc59dc2f4'/>
<id>urn:sha1:620c266f394932e5decc4b34683a75dfc59dc2f4</id>
<content type='text'>
A current limitation of open_by_handle_at() is that it's currently not possible
to use it from within containers at all because we require CAP_DAC_READ_SEARCH
in the initial namespace. That's unfortunate because there are scenarios where
using open_by_handle_at() from within containers.

Two examples:

(1) cgroupfs allows to encode cgroups to file handles and reopen them with
    open_by_handle_at().
(2) Fanotify allows placing filesystem watches they currently aren't usable in
    containers because the returned file handles cannot be used.

Here's a proposal for relaxing the permission check for open_by_handle_at().

(1) Opening file handles when the caller has privileges over the filesystem
    (1.1) The caller has an unobstructed view of the filesystem.
    (1.2) The caller has permissions to follow a path to the file handle.

This doesn't address the problem of opening a file handle when only a portion
of a filesystem is exposed as is common in containers by e.g., bind-mounting a
subtree. The proposal to solve this use-case is:

(2) Opening file handles when the caller has privileges over a subtree
    (2.1) The caller is able to reach the file from the provided mount fd.
    (2.2) The caller has permissions to construct an unobstructed path to the
          file handle.
    (2.3) The caller has permissions to follow a path to the file handle.

The relaxed permission checks are currently restricted to directory file
handles which are what both cgroupfs and fanotify need. Handling disconnected
non-directory file handles would lead to a potentially non-deterministic api.
If a disconnected non-directory file handle is provided we may fail to decode
a valid path that we could use for permission checking. That in itself isn't a
problem as we would just return EACCES in that case. However, confusion may
arise if a non-disconnected dentry ends up in the cache later and those opening
the file handle would suddenly succeed.

* It's potentially possible to use timing information (side-channel) to infer
  whether a given inode exists. I don't think that's particularly
  problematic. Thanks to Jann for bringing this to my attention.

* An unrelated note (IOW, these are thoughts that apply to
  open_by_handle_at() generically and are unrelated to the changes here):
  Jann pointed out that we should verify whether deleted files could
  potentially be reopened through open_by_handle_at(). I don't think that's
  possible though.

  Another potential thing to check is whether open_by_handle_at() could be
  abused to open internal stuff like memfds or gpu stuff. I don't think so
  but I haven't had the time to completely verify this.

This dates back to discussions Amir and I had quite some time ago and thanks to
him for providing a lot of details around the export code and related patches!

Link: https://lore.kernel.org/r/20240524-vfs-open_by_handle_at-v1-1-3d4b7d22736b@kernel.org
Reviewed-by: Amir Goldstein &lt;amir73il@gmail.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>mounts: keep list of mounts in an rbtree</title>
<updated>2023-11-18T13:56:16+00:00</updated>
<author>
<name>Miklos Szeredi</name>
<email>mszeredi@redhat.com</email>
</author>
<published>2023-10-25T14:02:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2eea9ce4310d8c0f8ef1dbe7b0e7d9219ff02b97'/>
<id>urn:sha1:2eea9ce4310d8c0f8ef1dbe7b0e7d9219ff02b97</id>
<content type='text'>
When adding a mount to a namespace insert it into an rbtree rooted in the
mnt_namespace instead of a linear list.

The mnt.mnt_list is still used to set up the mount tree and for
propagation, but not after the mount has been added to a namespace.  Hence
mnt_list can live in union with rb_node.  Use MNT_ONRB mount flag to
validate that the mount is on the correct list.

This allows removing the cursor used for reading /proc/$PID/mountinfo.  The
mnt_id_unique of the next mount can be used as an index into the seq file.

Tested by inserting 100k bind mounts, unsharing the mount namespace, and
unmounting.  No performance regressions have been observed.

For the last mount in the 100k list the statmount() call was more than 100x
faster due to the mount ID lookup not having to do a linear search.  This
patch makes the overhead of mount ID lookup non-observable in this range.

Signed-off-by: Miklos Szeredi &lt;mszeredi@redhat.com&gt;
Link: https://lore.kernel.org/r/20231025140205.3586473-3-mszeredi@redhat.com
Reviewed-by: Ian Kent &lt;raven@themaw.net&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>add unique mount ID</title>
<updated>2023-11-18T13:56:16+00:00</updated>
<author>
<name>Miklos Szeredi</name>
<email>mszeredi@redhat.com</email>
</author>
<published>2023-10-25T14:01:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=98d2b43081972abeb5bb5a087bc3e3197531c46e'/>
<id>urn:sha1:98d2b43081972abeb5bb5a087bc3e3197531c46e</id>
<content type='text'>
If a mount is released then its mnt_id can immediately be reused.  This is
bad news for user interfaces that want to uniquely identify a mount.

Implementing a unique mount ID is trivial (use a 64bit counter).
Unfortunately userspace assumes 32bit size and would overflow after the
counter reaches 2^32.

Introduce a new 64bit ID alongside the old one.  Initialize the counter to
2^32, this guarantees that the old and new IDs are never mixed up.

Signed-off-by: Miklos Szeredi &lt;mszeredi@redhat.com&gt;
Link: https://lore.kernel.org/r/20231025140205.3586473-2-mszeredi@redhat.com
Reviewed-by: Ian Kent &lt;raven@themaw.net&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>switch try_to_unlazy_next() to __legitimize_mnt()</title>
<updated>2022-07-05T20:18:21+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2022-07-05T16:22:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7e4745a09426b3fe63e9fbea3190e0f8500820a4'/>
<id>urn:sha1:7e4745a09426b3fe63e9fbea3190e0f8500820a4</id>
<content type='text'>
The tricky case (__legitimize_mnt() failing after having grabbed
a reference) can be trivially dealt with by leaving nd-&gt;path.mnt
non-NULL, for terminate_walk() to drop it.

legitimize_mnt() becomes static after that.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
</feed>
