<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/pid.h, branch v6.19.11</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.19.11</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.19.11'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2025-06-19T12:28:24+00:00</updated>
<entry>
<title>pidfs: persist information</title>
<updated>2025-06-19T12:28:24+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2025-06-18T20:53:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8ec7c826d97b390879df2a03dfb035c70af86779'/>
<id>urn:sha1:8ec7c826d97b390879df2a03dfb035c70af86779</id>
<content type='text'>
Persist exit and coredump information independent of whether anyone
currently holds a pidfd for the struct pid.

The current scheme allocated pidfs dentries on-demand repeatedly.
This scheme is reaching its limits as it makes it impossible to pin
information that needs to be available after the task has exited or
coredumped and that should not be lost simply because the pidfd got
closed temporarily. The next opener should still see the stashed
information.

This is also a prerequisite for supporting extended attributes on
pidfds to allow attaching meta information to them.

If someone opens a pidfd for a struct pid a pidfs dentry is allocated
and stashed in pid-&gt;stashed. Once the last pidfd for the struct pid is
closed the pidfs dentry is released and removed from pid-&gt;stashed.

So if 10 callers create a pidfs dentry for the same struct pid
sequentially, i.e., each closing the pidfd before the next creates a
new one, then a new pidfs dentry is allocated every time.

Because multiple tasks acquiring and releasing a pidfd for the same
struct pid can race with each other, a task may still find a valid
pidfs entry from the previous task in pid-&gt;stashed and reuse it. Or it
might find a dead dentry in there and fail to reuse it and so stashes a
new pidfs dentry. Multiple tasks may race to stash a new pidfs dentry
but only one will succeed, the other ones will put their dentry.

The current scheme aims to ensure that a pidfs dentry for a struct pid
can only be created if the task is still alive or if a pidfs dentry
already existed before the task was reaped and so exit information has
been stashed in the pidfs inode.

That's great except that it's buggy. If a pidfs dentry is stashed in
pid-&gt;stashed after pidfs_exit() but before __unhash_process() is called
we will return a pidfd for a reaped task without exit information being
available.

The pidfds_pid_valid() check does not guard against this race as it
doesn't sync at all with pidfs_exit(). The pid_has_task() check might be
successful simply because we're before __unhash_process() but after
pidfs_exit().

Introduce a new scheme where the lifetime of information associated with
a pidfs entry (coredump and exit information) isn't bound to the
lifetime of the pidfs inode but the struct pid itself.

The first time a pidfs dentry is allocated for a struct pid a struct
pidfs_attr will be allocated which will be used to store exit and
coredump information.

If all pidfds for the pidfs dentry are closed the dentry and inode can be
cleaned up but the struct pidfs_attr will stick until the struct pid
itself is freed. This will ensure minimal memory usage while persisting
relevant information.

The new scheme has various advantages. First, it closes the
race where we end up handing out a pidfd for a reaped task for which no
exit information is available. Second, it minimizes memory usage.
Third, it removes complex lifetime tracking via dentries when
registering a struct pid with pidfs. There's no need to get or put a
reference. Instead, the lifetime of exit and coredump information
associated with a struct pid is bound to the lifetime of struct pid
itself.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-5-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn &lt;aleksandr.mikhalitsyn@canonical.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>pidfs: move to anonymous struct</title>
<updated>2025-06-19T12:28:24+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2025-06-18T20:53:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=75215c972581d3934e76a57690cf838d7ceab399'/>
<id>urn:sha1:75215c972581d3934e76a57690cf838d7ceab399</id>
<content type='text'>
Move the pidfs entries to an anonymous struct.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-4-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn &lt;aleksandr.mikhalitsyn@canonical.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>pidfs: get rid of __pidfd_prepare()</title>
<updated>2025-04-26T06:28:03+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2025-04-25T08:11:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a71f402acd71a942e59c16270ad61dee06de6e24'/>
<id>urn:sha1:a71f402acd71a942e59c16270ad61dee06de6e24</id>
<content type='text'>
Fold it into pidfd_prepare() and rename PIDFD_CLONE to PIDFD_STALE to
indicate that the passed pid might not have task linkage and no explicit
check for that should be performed.

Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-3-450a19461e75@kernel.org
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reviewed-by: David Rheinsberg &lt;david@readahead.eu&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>pid: perform free_pid() calls outside of tasklist_lock</title>
<updated>2025-02-07T10:22:43+00:00</updated>
<author>
<name>Mateusz Guzik</name>
<email>mjguzik@gmail.com</email>
</author>
<published>2025-02-06T16:44:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7903f907a226058ed99f86e9924e082aea57fc45'/>
<id>urn:sha1:7903f907a226058ed99f86e9924e082aea57fc45</id>
<content type='text'>
As the clone side already executes pid allocation with only pidmap_lock
held, issuing free_pid() while still holding tasklist_lock exacerbates
total hold time of the latter.

More things may show up later which require initial clean up with the
lock held and allow finishing without it. For that reason a struct to
collect such work is added instead of merely passing the pid array.

Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Link: https://lore.kernel.org/r/20250206164415.450051-5-mjguzik@gmail.com
Acked-by: "Liam R. Howlett" &lt;Liam.Howlett@Oracle.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2025-01-20T18:29:11+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-01-20T18:29:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1a89a6924b581884b1b54bcd3ea790b3668be2e0'/>
<id>urn:sha1:1a89a6924b581884b1b54bcd3ea790b3668be2e0</id>
<content type='text'>
Pull pid_max namespacing update from Christian Brauner:
 "The pid_max sysctl is a global value. For a long time the default
  value has been 65535 and during the pidfd discussions Linus proposed to
  bump pid_max by default. Based on this discussion systemd started
  bumping pid_max to 2^22. So all new systems now run with a very high
  pid_max limit with some distros having also backported that change.

  The decision to bump pid_max is obviously correct. It just doesn't
  make a lot of sense nowadays to enforce such a low pid number. There's
  sufficient tooling to make selecting specific processes without typing
  really large pid numbers available.

  In any case, there are workloads that have expectations about how large
  pid numbers they accept. Either for historical reasons or
  architectural reasons. One concrete example is the 32-bit version of
  Android's bionic libc which requires pid numbers less than 65536.
  There are workloads where it is run in a 32-bit container on a 64-bit
  kernel. If the host has a pid_max value greater than 65535 the libc
  will abort thread creation because of size assumptions of
  pthread_mutex_t.

  That's a fairly specific use-case however, in general specific
  workloads that are moved into containers running on a host with a new
  kernel and a new systemd can run into issues with large pid_max
  values. Obviously making assumptions about the size of the allocated
  pid is suboptimal but we have userspace that does it.

  Of course, giving containers the ability to restrict the number of
  processes in their respective pid namespace independent of the global
  limit through pid_max is something desirable in itself and comes in
  handy in general.

  Independent of motivating use-cases the existence of pid namespaces
  makes this also a good semantical extension and there have been prior
  proposals pushing in a similar direction. The trick here is to
  minimize the risk of regressions which I think is doable. The fact
  that pid namespaces are hierarchical will help us here.

  What we mostly care about is that when the host sets a low pid_max
  limit, say (crazy number) 100 that no descendant pid namespace can
  allocate a higher pid number in its namespace. Since pid allocation is
  hierarchical this can be ensured by checking each pid allocation
  against the pid namespace's pid_max limit. This means if the
  allocation in the descendant pid namespace succeeds, the ancestor pid
  namespace can reject it. If the ancestor pid namespace has a higher
  limit than the descendant pid namespace the descendant pid namespace
  will reject the pid allocation. The ancestor pid namespace will
  obviously not care about this.

  All in all this means pid_max continues to enforce a system wide limit
  on the number of processes but allows pid namespaces sufficient leeway
  in handling workloads with assumptions about pid values and allows
  containers to restrict the number of processes in a pid namespace
  through the pid_max interface"

* tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  tests/pid_namespace: add pid_max tests
  pid: allow pid_max to be set per pid namespace
</content>
</entry>
<entry>
<title>pidfs: lookup pid through rbtree</title>
<updated>2024-12-17T08:16:18+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2024-12-14T21:01:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=16ecd47cb0cd895c7c2f5dd5db50f6c005c51639'/>
<id>urn:sha1:16ecd47cb0cd895c7c2f5dd5db50f6c005c51639</id>
<content type='text'>
The new pid inode number allocation scheme is neat but I overlooked a
possible, even though unlikely, attack that can be used to trigger an
overflow on both 32bit and 64bit.

A unique 64 bit identifier was constructed for each struct pid by
combining a 32 bit idr with a 32 bit generation number. A 32bit number
was allocated using the idr_alloc_cyclic() infrastructure. When the idr
wrapped around a 32 bit wraparound counter was incremented. The 32 bit
wraparound counter served as the upper 32 bits and the allocated idr
number as the lower 32 bits.

Since the idr can only allocate up to INT_MAX entries every time a
wraparound happens INT_MAX - 1 entries are lost (Ignoring that numbering
always starts at 2 to avoid theoretical collisions with the root inode
number.).

If userspace fully populates the idr and puts itself into
control of two entries such that one entry is somewhere in the middle
and the other entry is the INT_MAX entry then it is possible to overflow
the wraparound counter. That is probably difficult to pull off but the
mere possibility is annoying.

The problem could be contained to 32 bit by switching to a data
structure such as the maple tree that allows allocating 64 bit numbers
on 64 bit machines. That would leave 32 bit in the lurch but that probably
doesn't matter that much. The other problem is that removing entries
from the maple tree is somewhat non-trivial because the removal code can
be called under the irq write lock of tasklist_lock and
irq{save,restore} code.

Instead, allocate unique identifiers for struct pid by simply
incrementing a 64 bit counter and insert each struct pid into the rbtree
so it can be looked up to decode file handles, which avoids leaking actual
pids across pid namespaces in file handles.

On both 64 bit and 32 bit the same 64 bit identifier is used to lookup
struct pid in the rbtree. On 64 bit the unique identifier for struct pid
simply becomes the inode number. Comparing two pidfds continues to be as
simple as comparing inode numbers.

On 32 bit the 64 bit number assigned to struct pid is split into two 32
bit numbers. The lower 32 bits are used as the inode number and the
upper 32 bits are used as the inode generation number. Whenever a
wraparound happens on 32 bit the 64 bit number will be incremented by 2
so inode numbering starts at 2 again.

When a wraparound happens on 32 bit multiple pidfds with the same inode
number are likely to exist. This isn't a problem since before pidfs
pidfds used the anonymous inode meaning all pidfds had the same inode
number. On 32 bit userspace can thus reconstruct the 64 bit identifier
by retrieving both the inode number and the inode generation number to
compare, or use file handles. This gives the same guarantees on both 32
bit and 64 bit.
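
A minimal sketch of the 32 bit split described above, written in Python
for illustration and not taken from the kernel. The helper names
split_ident() and join_ident() are hypothetical; only the layout (lower
32 bits as inode number, upper 32 bits as inode generation number) comes
from the text:

```python
TWO_POW_32 = 2 ** 32


def split_ident(ident):
    """Split a 64 bit identifier into (inode number, generation number)."""
    # The upper 32 bits are the quotient, the lower 32 bits the remainder.
    generation, ino = divmod(ident, TWO_POW_32)
    return ino, generation


def join_ident(ino, generation):
    """Rebuild the 64 bit identifier from its two 32 bit halves."""
    return generation * TWO_POW_32 + ino
```

Under this layout two identifiers can share an inode number after a
wraparound, but they are equal only if both halves match, which is the
comparison userspace on 32 bit would have to perform.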

Link: https://lore.kernel.org/r/20241214-gekoppelt-erdarbeiten-a1f9a982a5a6@brauner
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>pid: allow pid_max to be set per pid namespace</title>
<updated>2024-12-02T10:25:25+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2024-11-22T13:24:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7863dcc72d0f4b13a641065670426435448b3d80'/>
<id>urn:sha1:7863dcc72d0f4b13a641065670426435448b3d80</id>
<content type='text'>
The pid_max sysctl is a global value. For a long time the default value
has been 65535 and during the pidfd discussions Linus proposed to bump
pid_max by default (cf. [1]). Based on this discussion systemd started
bumping pid_max to 2^22. So all new systems now run with a very high
pid_max limit with some distros having also backported that change.
The decision to bump pid_max is obviously correct. It just doesn't make
a lot of sense nowadays to enforce such a low pid number. There's
sufficient tooling to make selecting specific processes without typing
really large pid numbers available.

In any case, there are workloads that have expectations about how large
pid numbers they accept. Either for historical reasons or architectural
reasons. One concrete example is the 32-bit version of Android's bionic
libc which requires pid numbers less than 65536. There are workloads
where it is run in a 32-bit container on a 64-bit kernel. If the host
has a pid_max value greater than 65535 the libc will abort thread
creation because of size assumptions of pthread_mutex_t.

That's a fairly specific use-case however, in general specific workloads
that are moved into containers running on a host with a new kernel and a
new systemd can run into issues with large pid_max values. Obviously
making assumptions about the size of the allocated pid is suboptimal but
we have userspace that does it.

Of course, giving containers the ability to restrict the number of
processes in their respective pid namespace independent of the global limit
through pid_max is something desirable in itself and comes in handy in
general.

Independent of motivating use-cases the existence of pid namespaces
makes this also a good semantical extension and there have been prior
proposals pushing in a similar direction.
The trick here is to minimize the risk of regressions which I think is
doable. The fact that pid namespaces are hierarchical will help us here.

What we mostly care about is that when the host sets a low pid_max
limit, say (crazy number) 100 that no descendant pid namespace can
allocate a higher pid number in its namespace. Since pid allocation is
hierarchical this can be ensured by checking each pid allocation against
the pid namespace's pid_max limit. This means if the allocation in the
descendant pid namespace succeeds, the ancestor pid namespace can reject
it. If the ancestor pid namespace has a higher limit than the descendant
pid namespace the descendant pid namespace will reject the pid
allocation. The ancestor pid namespace will obviously not care about
this.
All in all this means pid_max continues to enforce a system wide limit
on the number of processes but allows pid namespaces sufficient leeway
in handling workloads with assumptions about pid values and allows
containers to restrict the number of processes in a pid namespace
through the pid_max interface.
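
The hierarchical check can be illustrated with a small toy model. This
is a sketch in Python under simplifying assumptions, not the kernel
implementation: the PidNamespace class, its fields and the alloc_pid()
helper are hypothetical, and each level simply hands out the lowest free
number below its own pid_max:

```python
class PidNamespace:
    """Toy model of a pid namespace; not the kernel's data structure."""

    def __init__(self, pid_max, parent=None):
        self.pid_max = pid_max  # pids handed out here stay below this value
        self.parent = parent    # None for the initial (host) namespace
        self.used = set()       # pid numbers currently in use at this level

    def lowest_free_pid(self):
        # Return the smallest unused pid below pid_max, or None if exhausted.
        for candidate in range(1, self.pid_max):
            if candidate not in self.used:
                return candidate
        return None


def alloc_pid(ns):
    """Pick one pid number per level, walking from ns up to the root.

    Returns a list of (namespace, pid) pairs, or None if any level
    rejects the allocation because its own pid_max is exhausted.
    """
    chosen = []
    level = ns
    while level is not None:
        pid = level.lowest_free_pid()
        if pid is None:
            return None  # one level said no, so nothing is committed
        chosen.append((level, pid))
        level = level.parent
    for level, pid in chosen:
        level.used.add(pid)
    return chosen
```

In this model a host namespace created with PidNamespace(4) caps every
descendant at three live pids regardless of the descendant's own limit,
while a descendant with a lower limit than its ancestor is stopped by
its own pid_max and the ancestor is unaffected.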

[1]: https://lore.kernel.org/linux-api/CAHk-=wiZ40LVjnXSi9iHLE_-ZBsWFGCgdmNiYZUXn1-V5YBg2g@mail.gmail.com
- rebased from 5.14-rc1
- a few fixes (missing ns_free_inum on error path, missing initialization, etc)
- permission check changes in pid_table_root_permissions
- unsigned int pid_max -&gt; int pid_max (keep pid_max type as it was)
- add READ_ONCE in alloc_pid() as suggested by Christian
- rebased from 6.7 and take into account:
 * sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
 * sysctl: treewide: constify ctl_table_header::ctl_table_arg
 * pidfd: add pidfs
 * tracing: Move saved_cmdline code into trace_sched_switch.c

Signed-off-by: Alexander Mikhalitsyn &lt;aleksandr.mikhalitsyn@canonical.com&gt;
Link: https://lore.kernel.org/r/20241122132459.135120-2-aleksandr.mikhalitsyn@canonical.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>pidfs: remove config option</title>
<updated>2024-03-13T19:53:53+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2024-03-12T09:39:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9d9539db8638cfe053fcd1f441746f0e2c8c2d32'/>
<id>urn:sha1:9d9539db8638cfe053fcd1f441746f0e2c8c2d32</id>
<content type='text'>
As Linus suggested this enables pidfs unconditionally. A key property to
retain is the ability to compare pidfds by inode number (cf. [1]).
That's extremely helpful just as comparing namespace file descriptors by
inode number is. They are used in a variety of scenarios where they need
to be compared, e.g., when receiving a pidfd via SO_PEERPIDFD from a
socket to trivially authenticate the sender and various other
use-cases.

For 64bit systems this is pretty trivial to do. For 32bit it's slightly
more annoying as we discussed but we simply add a dumb ida based
allocator that gets used on 32bit. This gives the same guarantees about
inode numbers on 64bit without any overflow risk. Practically, we'll
never run into overflow issues because we're constrained by the number
of processes that can exist on 32bit and by the number of open files
that can exist on a 32bit system. On 64bit none of this matters and
things are very simple.

If 32bit also needs the uniqueness guarantee they can simply parse the
contents of /proc/&lt;pid&gt;/fd/&lt;nr&gt;. The uniqueness guarantees have a
variety of use-cases. One of the most obvious ones is that they will
make pidfiles (or "pidfdfiles", I guess) reliable as the unique
identifier can be placed into there that won't be recycled. Also a
frequent request.

Note, I took the chance and simplified path_from_stashed() even further.
Instead of passing the inode number explicitly to path_from_stashed() we
let the filesystem handle that internally. So path_from_stashed() ends
up even simpler than it is now. This is also a good solution allowing
the cleanup code to be clean and consistent between 32bit and 64bit. The
cleanup path in prepare_anon_dentry() is also switched around so we put
the inode before the dentry allocation. This means we only have to call
the cleanup handler for the filesystem's inode data once and can rely
on -&gt;evict_inode() otherwise.

Aside from having to have a bit of extra code for 32bit it actually ends
up a nice cleanup for path_from_stashed() imho.

Tested on both 32 and 64bit including error injection.

Link: https://github.com/systemd/systemd/pull/31713 [1]
Link: https://lore.kernel.org/r/20240312-dingo-sehnlich-b3ecc35c6de7@brauner
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pidfs: convert to path_from_stashed() helper</title>
<updated>2024-03-01T11:24:53+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2024-02-19T15:30:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b28ddcc32d8fa3e20745b3a47dff863fe0376d79'/>
<id>urn:sha1:b28ddcc32d8fa3e20745b3a47dff863fe0376d79</id>
<content type='text'>
Moving pidfds from the anonymous inode infrastructure to a separate tiny
in-kernel filesystem similar to sockfs, pipefs, and anon_inodefs causes
selinux denials and thus various userspace components that make heavy
use of pidfds to fail as pidfds used anon_inode_getfile() which isn't
subject to any LSM hooks. But dentry_open() is and that would cause
regressions.

The failures that are seen are selinux denials. But the core failure is
dbus-broker. That cascades into other services failing that depend on
dbus-broker. For example, when dbus-broker fails to start polkit and all
the others won't be able to work because they depend on dbus-broker.

The reason for dbus-broker failing is because it doesn't handle failures
for SO_PEERPIDFD correctly. Last kernel release we introduced
SO_PEERPIDFD (and SCM_PIDFD). SO_PEERPIDFD allows dbus-broker and polkit
and others to receive a pidfd for the peer of an AF_UNIX socket. This is
the first time in the history of Linux that we can safely authenticate
clients in a race-free manner.

dbus-broker immediately made use of this but messed up the error
checking. It only allowed EINVAL as a valid failure for SO_PEERPIDFD.
That's obviously problematic not just because of LSM denials but because
of seccomp denials that would prevent SO_PEERPIDFD from working; or any
other new error code from there.

So this is catching a flawed implementation in dbus-broker as well. It
has to fall back to the old pid-based authentication when SO_PEERPIDFD
doesn't work no matter the reasons otherwise it'll always risk such
failures. So overall that LSM denial should not have caused dbus-broker
to fail. It can never assume that a feature released one kernel ago like
SO_PEERPIDFD is available.

So, the next fix separate from the selinux policy update is to try and
fix dbus-broker at [3]. That should make it into Fedora as well. In
addition the selinux reference policy should also be updated. See [4]
for that. If Selinux is in enforcing mode in userspace and it encounters
anything that it doesn't know about it will deny it by default. And the
policy is entirely in userspace including declaring new types for stuff
like nsfs or pidfs to allow it.

For now we continue to raise S_PRIVATE on the inode if it's a pidfs
inode which means things behave exactly like before.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2265630
Link: https://github.com/fedora-selinux/selinux-policy/pull/2050
Link: https://github.com/bus1/dbus-broker/pull/343 [3]
Link: https://github.com/SELinuxProject/refpolicy/pull/762 [4]
Reported-by: Nathan Chancellor &lt;nathan@kernel.org&gt;
Link: https://lore.kernel.org/r/20240222190334.GA412503@dev-arch.thelio-3990X
Link: https://lore.kernel.org/r/20240218-neufahrzeuge-brauhaus-fb0eb6459771@brauner
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>pidfd: add pidfs</title>
<updated>2024-03-01T11:23:37+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2024-02-12T15:32:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b'/>
<id>urn:sha1:cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b</id>
<content type='text'>
This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't able to do simply because of the
very justified limitations of anonymous inodes. Moving pidfds to a tiny
pseudo filesystem allows:

* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() and then comparing inode
  numbers.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode-&gt;i_private instead of
  file-&gt;private_data. This means it is now possible to introduce
  concepts that operate on a process once all file descriptors have been
  closed. A concrete example is kill-on-last-close.
* file-&gt;private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct
  pid will refer to the same inode if it's opened multiple times. In
  contrast to now where each struct pid refers to the same inode. Even
  if we were to move to anon_inode_create_getfile() which creates new
  inodes we'd still be associating the same struct pid with multiple
  different inodes.
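
From userspace, the inode number comparison in the list above could look
roughly like the following. This is a sketch in Python, not part of the
patch. same_inode() is a hypothetical helper, and comparing st_dev in
addition to st_ino is an extra precaution added here. It assumes both
descriptors are pidfds from a 64 bit kernel with pidfs, obtained for
example via os.pidfd_open():

```python
import os


def same_inode(fd_a, fd_b):
    """Return True if both descriptors refer to the same inode."""
    st_a = os.fstat(fd_a)
    st_b = os.fstat(fd_b)
    # Same device and same inode number means the same underlying object.
    return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
```

With the old anonymous inode scheme this comparison would not
distinguish processes, since every pidfd shared the same inode.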

The tiny pseudo filesystem is not visible anywhere in userspace exactly
like e.g., pipefs and sockfs. There's no lookup, there's no complex
inode operations, nothing. Dentries and inodes are always deleted when
the last pidfd is closed.

We allocate a new inode for each struct pid and we reuse that inode for
all pidfds. We use iget_locked() to find that inode again based on the
inode number which isn't recycled. We allocate a new dentry for each
pidfd that uses the same inode. That is similar to anonymous inodes
which reuse the same inode for thousands of dentries. For pidfds we're
talking way less than that. There usually won't be a lot of concurrent
openers of the same struct pid. They can probably often be counted on
two hands. I know that systemd does use separate pidfds for the same
struct pid for various complex process tracking issues. So I think with
that things actually become way simpler. Especially because we don't
have to care about lookup. Dentries and inodes continue to be always
deleted.

The code is entirely optional and fairly small. If it's not selected we
fall back to anonymous inodes. Heavily inspired by nsfs which uses a
similar stashing mechanism just for namespaces.

Link: https://lore.kernel.org/r/20240213-vfs-pidfd_fs-v1-2-f863f58cfce1@kernel.org
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
</feed>
