<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/uapi/linux/fcntl.h, branch v6.12.80</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2024-09-05T09:39:17+00:00</updated>
<entry>
<title>fhandle: expose u64 mount id to name_to_handle_at(2)</title>
<updated>2024-09-05T09:39:17+00:00</updated>
<author>
<name>Aleksa Sarai</name>
<email>cyphar@cyphar.com</email>
</author>
<published>2024-08-28T10:19:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4356d575ef0f39a3e8e0ce0c40d84ce900ac3b61'/>
<id>urn:sha1:4356d575ef0f39a3e8e0ce0c40d84ce900ac3b61</id>
<content type='text'>
Now that we provide a unique 64-bit mount ID interface in statx(2), we
can now provide a race-free way for name_to_handle_at(2) to provide a
file handle and corresponding mount without needing to worry about
racing with /proc/mountinfo parsing or having to open a file just to do
statx(2).

While this is not necessary if you are using AT_EMPTY_PATH and don't
care about an extra statx(2) call, users that pass full paths into
name_to_handle_at(2) need to know which mount the file handle comes from
(to make sure they don't try to open_by_handle_at a file handle from a
different filesystem) and switching to AT_EMPTY_PATH would require
allocating a file for every name_to_handle_at(2) call, turning

  err = name_to_handle_at(-EBADF, "/foo/bar/baz", &amp;handle, &amp;mntid,
                          AT_HANDLE_MNT_ID_UNIQUE);

into

  int fd = openat(-EBADF, "/foo/bar/baz", O_PATH | O_CLOEXEC);
  err1 = name_to_handle_at(fd, "", &amp;handle, &amp;unused_mntid, AT_EMPTY_PATH);
  err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &amp;statxbuf);
  mntid = statxbuf.stx_mnt_id;
  close(fd);

Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Aleksa Sarai &lt;cyphar@cyphar.com&gt;
Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-2-10c2c4c16708@cyphar.com
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>uapi: explain how per-syscall AT_* flags should be allocated</title>
<updated>2024-09-05T09:37:45+00:00</updated>
<author>
<name>Aleksa Sarai</name>
<email>cyphar@cyphar.com</email>
</author>
<published>2024-08-28T10:19:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b4fef22c2fb97fa204f0c99c7c7f1c6b422ef0aa'/>
<id>urn:sha1:b4fef22c2fb97fa204f0c99c7c7f1c6b422ef0aa</id>
<content type='text'>
Unfortunately, the way we have gone about adding new AT_* flags has
been a little messy. In the beginning, all of the AT_* flags had generic
meanings and so it made sense to share the flag bits indiscriminately.
However, we inevitably ran into syscalls that needed their own
syscall-specific flags. Due to the lack of a planned out policy, we
ended up with the following situations:

 * Existing syscalls adding new features tended to use new AT_* bits,
   with some effort taken to try to re-use bits for flags that were so
   obviously syscall specific that they only make sense for a single
   syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet).

   Given the constraints of bitflags, this works well in practice, but
   ideally (to avoid future confusion) we would plan ahead and define a
   set of "per-syscall bits" ahead of time so that when allocating new
   bits we don't end up with a complete mish-mash of which bits are
   supposed to be per-syscall and which aren't.

 * New syscalls dealt with this in several ways:

   - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and
     fspick(2)) created their separate own flag spaces that have no
     overlap with the AT_* flags. Most of these ended up allocating
     their bits sequentually.

     In the case of move_mount(2) and fspick(2), several flags have
     identical meanings to AT_* flags but were allocated in their own
     flag space.

     This makes sense for syscalls that will never share AT_* flags, but
     for some syscalls this leads to duplication with AT_* flags in a
     way that could cause confusion (if renameat2(2) grew a
     RENAME_EMPTY_PATH it seems likely that users could mistake it for
     AT_EMPTY_PATH since it is an *at(2) syscall).

   - Some syscalls unfortunately ended up both creating their own flag
     space while also using bits from other flag spaces. The most
     obvious example is open_tree(2), where the standard usage ends up
     using flags from *THREE* separate flag spaces:

       open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE);

     (Note that O_CLOEXEC is also platform-specific, so several future
     OPEN_TREE_* bits are also made unusable in one fell swoop.)

It's not entirely clear to me what the "right" choice is for new
syscalls. Just saying that all future VFS syscalls should use AT_* flags
doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which
don't make much sense to burn generic AT_* flags for) and move_mount(2)
has separate AT_*-like flags for both the source and target so separate
flags are needed anyway (though it seems possible that renameat2(2)
could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame
they can't be reused).

But at least for syscalls that _do_ choose to use AT_* flags, we should
explicitly state the policy that 0x2ff is currently intended for
per-syscall flags and that new flags should err on the side of
overlapping with existing flag bits (so we can extend the scope of
generic flags in the future if necessary).

And add AT_* aliases for the RENAME_* flags to further cement that
renameat2(2) is an *at(2) flag, just with its own per-syscall flags.

Suggested-by: Amir Goldstein &lt;amir73il@gmail.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Aleksa Sarai &lt;cyphar@cyphar.com&gt;
Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-1-10c2c4c16708@cyphar.com
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>fcntl: add F_CREATED_QUERY</title>
<updated>2024-08-19T11:44:34+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2024-07-24T13:15:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=820a185896b77814557302b981b092a9e7b36814'/>
<id>urn:sha1:820a185896b77814557302b981b092a9e7b36814</id>
<content type='text'>
Systemd has a helper called openat_report_new() that returns whether a
file was created anew or it already existed before for cases where
O_CREAT has to be used without O_EXCL (cf. [1]). That apparently isn't
something that's specific to systemd but it's where I noticed it.

The current logic is that it first attempts to open the file without
O_CREAT | O_EXCL and if it gets ENOENT the helper tries again with both
flags. If that succeeds all is well. If it now reports EEXIST it
retries.

That works fairly well but some corner cases make this more involved. If
this operates on a dangling symlink the first openat() without O_CREAT |
O_EXCL will return ENOENT but the second openat() with O_CREAT | O_EXCL
will fail with EEXIST. The reason is that openat() without O_CREAT |
O_EXCL follows the symlink while O_CREAT | O_EXCL doesn't for security
reasons. So it's not something we can really change unless we add an
explicit opt-in via O_FOLLOW which seems really ugly.

The caller could try and use fanotify() to register to listen for
creation events in the directory before calling openat(). The caller
could then compare the returned tid to its own tid to ensure that even
in threaded environments it actually created the file. That might work
but is a lot of work for something that should be fairly simple and I'm
uncertain about it's reliability.

The caller could use a bpf lsm hook to hook into security_file_open() to
figure out whether they created the file. That also seems a bit wild.

So let's add F_CREATED_QUERY which allows the caller to check whether
they actually did create the file. That has caveats of course but I
don't think they are problematic:

* In multi-threaded environments a thread can only be sure that it did
  create the file if it calls openat() with O_CREAT. In other words,
  it's obviously not enough to just go through it's fdtable and check
  these fds because another thread could've created the file.

* If there's any codepaths where an openat() with O_CREAT would yield
  the same struct file as that of another thread it would obviously
  cause wrong results. I'm not aware of any such codepaths from openat()
  itself. Imho, that would be a bug.

* Related to the previous point, calling the new fcntl() on files created
  and opened via special-purpose system calls or ioctl()s would cause
  wrong results only if the affected subsystem a) raises FMODE_CREATED
  and b) may return the same struct file for two different calls. I'm
  not seeing anything outside of regular VFS code that raises
  FMODE_CREATED.

  There is code for b) in e.g., the drm layer where the same struct file
  is resurfaced but again FMODE_CREATED isn't used and it would be very
  misleading if it did.

Link: https://github.com/systemd/systemd/blob/11d5e2b5fbf9f6bfa5763fd45b56829ad4f0777f/src/basic/fs-util.c#L1078 [1]
Link: https://lore.kernel.org/r/20240724-work-fcntl-v1-1-e8153a2f1991@kernel.org
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>fcntl: add F_DUPFD_QUERY fcntl()</title>
<updated>2024-05-10T06:26:31+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-05-09T11:04:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c62b758bae6af16fee94f556091fa74883a96b1e'/>
<id>urn:sha1:c62b758bae6af16fee94f556091fa74883a96b1e</id>
<content type='text'>
Often userspace needs to know whether two file descriptors refer to the
same struct file. For example, systemd uses this to filter out duplicate
file descriptors in it's file descriptor store (cf. [1]) and vulkan uses
it to compare dma-buf fds (cf. [2]).

The only api we provided for this was kcmp() but that's not generally
available or might be disallowed because it is way more powerful (allows
ordering of file pointers, operates on non-current task) etc. So give
userspace a simple way of comparing two file descriptors for sameness
adding a new fcntl() F_DUDFD_QUERY.

Link: https://github.com/systemd/systemd/blob/a4f0e0da3573a10bc5404142be8799418760b1d1/src/basic/fd-util.c#L517 [1]
Link: https://gitlab.freedesktop.org/wlroots/wlroots/-/blob/master/render/vulkan/texture.c#L490 [2]
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
[brauner: commit message]
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>fs: Pass AT_GETATTR_NOSEC flag to getattr interface function</title>
<updated>2023-11-18T13:54:07+00:00</updated>
<author>
<name>Stefan Berger</name>
<email>stefanb@linux.ibm.com</email>
</author>
<published>2023-10-02T12:57:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8a924db2d7b5eb69ba08b1a0af46e9f1359a9bdf'/>
<id>urn:sha1:8a924db2d7b5eb69ba08b1a0af46e9f1359a9bdf</id>
<content type='text'>
When vfs_getattr_nosec() calls a filesystem's getattr interface function
then the 'nosec' should propagate into this function so that
vfs_getattr_nosec() can again be called from the filesystem's gettattr
rather than vfs_getattr(). The latter would add unnecessary security
checks that the initial vfs_getattr_nosec() call wanted to avoid.
Therefore, introduce the getattr flag GETATTR_NOSEC and allow to pass
with the new getattr_flags parameter to the getattr interface function.
In overlayfs and ecryptfs use this flag to determine which one of the
two functions to call.

In a recent code change introduced to IMA vfs_getattr_nosec() ended up
calling vfs_getattr() in overlayfs, which in turn called
security_inode_getattr() on an exiting process that did not have
current-&gt;fs set anymore, which then caused a kernel NULL pointer
dereference. With this change the call to security_inode_getattr() can
be avoided, thus avoiding the NULL pointer dereference.

Reported-by: &lt;syzbot+a67fc5321ffb4b311c98@syzkaller.appspotmail.com&gt;
Fixes: db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version")
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: &lt;linux-fsdevel@vger.kernel.org&gt;
Cc: Miklos Szeredi &lt;miklos@szeredi.hu&gt;
Cc: Amir Goldstein &lt;amir73il@gmail.com&gt;
Cc: Tyler Hicks &lt;code@tyhicks.com&gt;
Cc: Mimi Zohar &lt;zohar@linux.ibm.com&gt;
Suggested-by: Christian Brauner &lt;brauner@kernel.org&gt;
Co-developed-by: Amir Goldstein &lt;amir73il@gmail.com&gt;
Signed-off-by: Stefan Berger &lt;stefanb@linux.ibm.com&gt;
Link: https://lore.kernel.org/r/20231002125733.1251467-1-stefanb@linux.vnet.ibm.com
Reviewed-by: Amir Goldstein &lt;amir73il@gmail.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>exportfs: allow exporting non-decodeable file handles to userspace</title>
<updated>2023-05-25T11:16:57+00:00</updated>
<author>
<name>Amir Goldstein</name>
<email>amir73il@gmail.com</email>
</author>
<published>2023-05-02T12:48:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=96b2b072ee62be8ae68c8ecf14854c4d0505a8f8'/>
<id>urn:sha1:96b2b072ee62be8ae68c8ecf14854c4d0505a8f8</id>
<content type='text'>
Some userspace programs use st_ino as a unique object identifier, even
though inode numbers may be recycable.

This issue has been addressed for NFS export long ago using the exportfs
file handle API and the unique file handle identifiers are also exported
to userspace via name_to_handle_at(2).

fanotify also uses file handles to identify objects in events, but only
for filesystems that support NFS export.

Relax the requirement for NFS export support and allow more filesystems
to export a unique object identifier via name_to_handle_at(2) with the
flag AT_HANDLE_FID.

A file handle requested with the AT_HANDLE_FID flag, may or may not be
usable as an argument to open_by_handle_at(2).

To allow filesystems to opt-in to supporting AT_HANDLE_FID, a struct
export_operations is required, but even an empty struct is sufficient
for encoding FIDs.

Acked-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Acked-by: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Signed-off-by: Amir Goldstein &lt;amir73il@gmail.com&gt;
Acked-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Message-Id: &lt;20230502124817.3070545-4-amir73il@gmail.com&gt;
</content>
</entry>
<entry>
<title>mm/memfd: add F_SEAL_EXEC</title>
<updated>2023-01-19T01:12:37+00:00</updated>
<author>
<name>Daniel Verkamp</name>
<email>dverkamp@chromium.org</email>
</author>
<published>2022-12-15T00:12:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6fd7353829cafc4067aad9eea0dc95da67e7df16'/>
<id>urn:sha1:6fd7353829cafc4067aad9eea0dc95da67e7df16</id>
<content type='text'>
Patch series "mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC", v8.

Since Linux introduced the memfd feature, memfd have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting it
differently.

However, in a secure by default system, such as ChromeOS, (where all
executables should come from the rootfs, which is protected by Verified
boot), this executable nature of memfd opens a door for NoExec bypass and
enables “confused deputy attack”.  E.g, in VRP bug [1]: cros_vm
process created a memfd to share the content with an external process,
however the memfd is overwritten and used for executing arbitrary code and
root escalation.  [2] lists more VRP in this kind.

On the other hand, executable memfd has its legit use, runc uses memfd’s
seal and executable feature to copy the contents of the binary then
execute them, for such system, we need a solution to differentiate runc's
use of executable memfds and an attacker's [3].

To address those above, this set of patches add following:
1&gt; Let memfd_create() set X bit at creation time.
2&gt; Let memfd to be sealed for modifying X bit.
3&gt; A new pid namespace sysctl: vm.memfd_noexec to control the behavior of
   X bit.For example, if a container has vm.memfd_noexec=2, then
   memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4&gt; A new security hook in memfd_create(). This make it possible to a new
   LSM, which rejects or allows executable memfd based on its security policy.


This patch (of 5):

The new F_SEAL_EXEC flag will prevent modification of the exec bits:
written as traditional octal mask, 0111, or as named flags, S_IXUSR |
S_IXGRP | S_IXOTH.  Any chmod(2) or similar call that attempts to modify
any of these bits after the seal is applied will fail with errno EPERM.

This will preserve the execute bits as they are at the time of sealing, so
the memfd will become either permanently executable or permanently
un-executable.

Link: https://lkml.kernel.org/r/20221215001205.51969-1-jeffxu@google.com
Link: https://lkml.kernel.org/r/20221215001205.51969-2-jeffxu@google.com
Signed-off-by: Daniel Verkamp &lt;dverkamp@chromium.org&gt;
Co-developed-by: Jeff Xu &lt;jeffxu@google.com&gt;
Signed-off-by: Jeff Xu &lt;jeffxu@google.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Dmitry Torokhov &lt;dmitry.torokhov@gmail.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jorge Lucangeli Obes &lt;jorgelo@chromium.org&gt;
Cc: Shuah Khan &lt;skhan@linuxfoundation.org&gt;
Cc: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Cc: kernel test robot &lt;lkp@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vfs: add faccessat2 syscall</title>
<updated>2020-05-14T14:44:25+00:00</updated>
<author>
<name>Miklos Szeredi</name>
<email>mszeredi@redhat.com</email>
</author>
<published>2020-05-14T14:44:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c8ffd8bcdd28296a198f237cc595148a8d4adfbe'/>
<id>urn:sha1:c8ffd8bcdd28296a198f237cc595148a8d4adfbe</id>
<content type='text'>
POSIX defines faccessat() as having a fourth "flags" argument, while the
linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.

Add a new faccessat(2) syscall with the added flags argument and implement
both flags.

The value of AT_EACCESS is defined in glibc headers to be the same as
AT_REMOVEDIR.  Use this value for the kernel interface as well, together
with the explanatory comment.

Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
be useful and is trivial to implement.

Signed-off-by: Miklos Szeredi &lt;mszeredi@redhat.com&gt;
</content>
</entry>
<entry>
<title>open: introduce openat2(2) syscall</title>
<updated>2020-01-18T14:19:18+00:00</updated>
<author>
<name>Aleksa Sarai</name>
<email>cyphar@cyphar.com</email>
</author>
<published>2020-01-18T12:07:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179'/>
<id>urn:sha1:fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179</id>
<content type='text'>
/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.

In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.

Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).

/* Syscall Prototype. */
  /*
   * open_how is an extensible structure (similar in interface to
   * clone3(2) or sched_setattr(2)). The size parameter must be set to
   * sizeof(struct open_how), to allow for future extensions. All future
   * extensions will be appended to open_how, with their zero value
   * acting as a no-op default.
   */
  struct open_how { /* ... */ };

  int openat2(int dfd, const char *pathname,
              struct open_how *how, size_t size);

/* Description. */
The initial version of 'struct open_how' contains the following fields:

  flags
    Used to specify openat(2)-style flags. However, any unknown flag
    bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
    will result in -EINVAL. In addition, this field is 64-bits wide to
    allow for more O_ flags than currently permitted with openat(2).

  mode
    The file mode for O_CREAT or O_TMPFILE.

    Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

  resolve
    Restrict path resolution (in contrast to O_* flags they affect all
    path components). The current set of flags are as follows (at the
    moment, all of the RESOLVE_ flags are implemented as just passing
    the corresponding LOOKUP_ flag).

    RESOLVE_NO_XDEV       =&gt; LOOKUP_NO_XDEV
    RESOLVE_NO_SYMLINKS   =&gt; LOOKUP_NO_SYMLINKS
    RESOLVE_NO_MAGICLINKS =&gt; LOOKUP_NO_MAGICLINKS
    RESOLVE_BENEATH       =&gt; LOOKUP_BENEATH
    RESOLVE_IN_ROOT       =&gt; LOOKUP_IN_ROOT

open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it). It also only contains u64s (even
though -&gt;mode arguably should be a u16) to avoid having padding fields
which are never used in the future.

Note that as a result of the new how-&gt;flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).

After input from Florian Weimer, the new open_how and flag definitions
are inside a separate header from uapi/linux/fcntl.h, to avoid problems
that glibc has with importing that header.

/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.

In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).

/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).

Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.

Another possible avenue of future work would be some kind of
CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
which openat2(2) flags and fields are supported by the current kernel
(to avoid userspace having to go through several guesses to figure it
out).

[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
[6]: https://youtu.be/ggD-eb3yPVs

Suggested-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Signed-off-by: Aleksa Sarai &lt;cyphar@cyphar.com&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>fcntl: fix typo in RWH_WRITE_LIFE_NOT_SET r/w hint name</title>
<updated>2019-10-25T20:28:10+00:00</updated>
<author>
<name>Eugene Syromiatnikov</name>
<email>esyr@redhat.com</email>
</author>
<published>2019-09-20T15:58:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9a7f12edf8848a9b355a86fc0183d7be932ef11e'/>
<id>urn:sha1:9a7f12edf8848a9b355a86fc0183d7be932ef11e</id>
<content type='text'>
According to commit message in the original commit c75b1d9421f8 ("fs:
add fcntl() interface for setting/getting write life time hints"),
as well as userspace library[1] and man page update[2], R/W hint constants
are intended to have RWH_* prefix. However, RWF_WRITE_LIFE_NOT_SET retained
"RWF_*" prefix used in the early versions of the proposed patch set[3].
Rename it and provide the old name as a synonym for the new one
for backward compatibility.

[1] https://github.com/axboe/fio/commit/bd553af6c849
[2] https://github.com/mkerrisk/man-pages/commit/580082a186fd
[3] https://www.mail-archive.com/linux-block@vger.kernel.org/msg09638.html

Fixes: c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints")
Acked-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Eugene Syromiatnikov &lt;esyr@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
</feed>
