<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/uapi/linux/seccomp.h, branch v6.6.132</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.132</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.132'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2023-07-17T23:08:08+00:00</updated>
<entry>
<title>seccomp: add the synchronous mode for seccomp_unotify</title>
<updated>2023-07-17T23:08:08+00:00</updated>
<author>
<name>Andrei Vagin</name>
<email>avagin@google.com</email>
</author>
<published>2023-03-08T07:31:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=48a1084a8b7423642b5f17ca6202f6f277c5392b'/>
<id>urn:sha1:48a1084a8b7423642b5f17ca6202f6f277c5392b</id>
<content type='text'>
seccomp_unotify allows more privileged processes do actions on behalf
of less privileged processes.

In many cases, the workflow is fully synchronous. It means a target
process triggers a system call and passes controls to a supervisor
process that handles the system call and returns controls to the target
process. In this context, "synchronous" means that only one process is
running and another one is waiting.

There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
move the wakee to the current CPU. For such synchronous workflows, it
makes context switches a few times faster.

Right now, each interaction takes 12µs. With this patch, it takes about
3µs.

This change introduce the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
it used to enable the sync mode.

Signed-off-by: Andrei Vagin &lt;avagin@google.com&gt;
Acked-by: "Peter Zijlstra (Intel)" &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20230308073201.3102738-5-avagin@google.com
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: Add wait_killable semantic to seccomp user notifier</title>
<updated>2022-05-03T21:11:58+00:00</updated>
<author>
<name>Sargun Dhillon</name>
<email>sargun@sargun.me</email>
</author>
<published>2022-05-03T08:09:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c2aa2dfef243efe213a480a1ee8566507a5152f4'/>
<id>urn:sha1:c2aa2dfef243efe213a480a1ee8566507a5152f4</id>
<content type='text'>
This introduces a per-filter flag (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
that makes it so that when notifications are received by the supervisor the
notifying process will transition to wait killable semantics. Although wait
killable isn't a set of semantics formally exposed to userspace, the
concept is searchable. If the notifying process is signaled prior to the
notification being received by the userspace agent, it will be handled as
normal.

One quirk about how this is handled is that the notifying process
only switches to TASK_KILLABLE if it receives a wakeup from either
an addfd or a signal. This is to avoid an unnecessary wakeup of
the notifying task.

The reasons behind switching into wait_killable only after userspace
receives the notification are:
* Avoiding unncessary work - Often, workloads will perform work that they
  may abort (request racing comes to mind). This allows for syscalls to be
  aborted safely prior to the notification being received by the
  supervisor. In this, the supervisor doesn't end up doing work that the
  workload does not want to complete anyways.
* Avoiding side effects - We don't want the syscall to be interruptible
  once the supervisor starts doing work because it may not be trivial
  to reverse the operation. For example, unmounting a file system may
  take a long time, and it's hard to rollback, or treat that as
  reentrant.
* Avoid breaking runtimes - Various runtimes do not GC when they are
  during a syscall (or while running native code that subsequently
  calls a syscall). If many notifications are blocked, and not picked
  up by the supervisor, this can get the application into a bad state.

Signed-off-by: Sargun Dhillon &lt;sargun@sargun.me&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/20220503080958.20220-2-sargun@sargun.me
</content>
</entry>
<entry>
<title>seccomp: Support atomic "addfd + send reply"</title>
<updated>2021-06-28T19:49:52+00:00</updated>
<author>
<name>Rodrigo Campos</name>
<email>rodrigo@kinvolk.io</email>
</author>
<published>2021-05-17T19:39:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0ae71c7720e3ae3aabd2e8a072d27f7bd173d25c'/>
<id>urn:sha1:0ae71c7720e3ae3aabd2e8a072d27f7bd173d25c</id>
<content type='text'>
Alban Crequy reported a race condition userspace faces when we want to
add some fds and make the syscall return them[1] using seccomp notify.

The problem is that currently two different ioctl() calls are needed by
the process handling the syscalls (agent) for another userspace process
(target): SECCOMP_IOCTL_NOTIF_ADDFD to allocate the fd and
SECCOMP_IOCTL_NOTIF_SEND to return that value. Therefore, it is possible
for the agent to do the first ioctl to add a file descriptor but the
target is interrupted (EINTR) before the agent does the second ioctl()
call.

This patch adds a flag to the ADDFD ioctl() so it adds the fd and
returns that value atomically to the target program, as suggested by
Kees Cook[2]. This is done by simply allowing
seccomp_do_user_notification() to add the fd and return it in this case.
Therefore, in this case the target wakes up from the wait in
seccomp_do_user_notification() either to interrupt the syscall or to add
the fd and return it.

This "allocate an fd and return" functionality is useful for syscalls
that return a file descriptor only, like connect(2). Other syscalls that
return a file descriptor but not as return value (or return more than
one fd), like socketpair(), pipe(), recvmsg with SCM_RIGHTs, will not
work with this flag.

This effectively combines SECCOMP_IOCTL_NOTIF_ADDFD and
SECCOMP_IOCTL_NOTIF_SEND into an atomic opteration. The notification's
return value, nor error can be set by the user. Upon successful invocation
of the SECCOMP_IOCTL_NOTIF_ADDFD ioctl with the SECCOMP_ADDFD_FLAG_SEND
flag, the notifying process's errno will be 0, and the return value will
be the file descriptor number that was installed.

[1]: https://lore.kernel.org/lkml/CADZs7q4sw71iNHmV8EOOXhUKJMORPzF7thraxZYddTZsxta-KQ@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/

Signed-off-by: Rodrigo Campos &lt;rodrigo@kinvolk.io&gt;
Signed-off-by: Sargun Dhillon &lt;sargun@sargun.me&gt;
Acked-by: Tycho Andersen &lt;tycho@tycho.pizza&gt;
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/20210517193908.3113-4-sargun@sargun.me
</content>
</entry>
<entry>
<title>seccomp: Introduce addfd ioctl to seccomp user notifier</title>
<updated>2020-07-14T23:29:42+00:00</updated>
<author>
<name>Sargun Dhillon</name>
<email>sargun@sargun.me</email>
</author>
<published>2020-06-03T01:10:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7cf97b12545503992020796c74bd84078eb39299'/>
<id>urn:sha1:7cf97b12545503992020796c74bd84078eb39299</id>
<content type='text'>
The current SECCOMP_RET_USER_NOTIF API allows for syscall supervision over
an fd. It is often used in settings where a supervising task emulates
syscalls on behalf of a supervised task in userspace, either to further
restrict the supervisee's syscall abilities or to circumvent kernel
enforced restrictions the supervisor deems safe to lift (e.g. actually
performing a mount(2) for an unprivileged container).

While SECCOMP_RET_USER_NOTIF allows for the interception of any syscall,
only a certain subset of syscalls could be correctly emulated. Over the
last few development cycles, the set of syscalls which can't be emulated
has been reduced due to the addition of pidfd_getfd(2). With this we are
now able to, for example, intercept syscalls that require the supervisor
to operate on file descriptors of the supervisee such as connect(2).

However, syscalls that cause new file descriptors to be installed can not
currently be correctly emulated since there is no way for the supervisor
to inject file descriptors into the supervisee. This patch adds a
new addfd ioctl to remove this restriction by allowing the supervisor to
install file descriptors into the intercepted task. By implementing this
feature via seccomp the supervisor effectively instructs the supervisee
to install a set of file descriptors into its own file descriptor table
during the intercepted syscall. This way it is possible to intercept
syscalls such as open() or accept(), and install (or replace, like
dup2(2)) the supervisor's resulting fd into the supervisee. One
replacement use-case would be to redirect the stdout and stderr of a
supervisee into log file descriptors opened by the supervisor.

The ioctl handling is based on the discussions[1] of how Extensible
Arguments should interact with ioctls. Instead of building size into
the addfd structure, make it a function of the ioctl command (which
is how sizes are normally passed to ioctls). To support forward and
backward compatibility, just mask out the direction and size, and match
everything. The size (and any future direction) checks are done along
with copy_struct_from_user() logic.

As a note, the seccomp_notif_addfd structure is laid out based on 8-byte
alignment without requiring packing as there have been packing issues
with uapi highlighted before[2][3]. Although we could overload the
newfd field and use -1 to indicate that it is not to be used, doing
so requires changing the size of the fd field, and introduces struct
packing complexity.

[1]: https://lore.kernel.org/lkml/87o8w9bcaf.fsf@mid.deneb.enyo.de/
[2]: https://lore.kernel.org/lkml/a328b91d-fd8f-4f27-b3c2-91a9c45f18c0@rasmusvillemoes.dk/
[3]: https://lore.kernel.org/lkml/20200612104629.GA15814@ircssh-2.c.rugged-nimbus-611.internal

Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Cc: Tycho Andersen &lt;tycho@tycho.ws&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Robert Sesek &lt;rsesek@google.com&gt;
Cc: Chris Palmer &lt;palmer@google.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Suggested-by: Matt Denton &lt;mpdenton@google.com&gt;
Link: https://lore.kernel.org/r/20200603011044.7972-4-sargun@sargun.me
Signed-off-by: Sargun Dhillon &lt;sargun@sargun.me&gt;
Reviewed-by: Will Drewry &lt;wad@chromium.org&gt;
Co-developed-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID</title>
<updated>2020-07-10T23:01:52+00:00</updated>
<author>
<name>Kees Cook</name>
<email>keescook@chromium.org</email>
</author>
<published>2020-06-15T22:42:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=47e33c05f9f07cac3de833e531bcac9ae052c7ca'/>
<id>urn:sha1:47e33c05f9f07cac3de833e531bcac9ae052c7ca</id>
<content type='text'>
When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced it had the wrong
direction flag set. While this isn't a big deal as nothing currently
enforces these bits in the kernel, it should be defined correctly. Fix
the define and provide support for the old command until it is no longer
needed for backward compatibility.

Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: allow TSYNC and USER_NOTIF together</title>
<updated>2020-03-04T22:48:54+00:00</updated>
<author>
<name>Tycho Andersen</name>
<email>tycho@tycho.ws</email>
</author>
<published>2020-03-04T18:05:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=51891498f2da78ee64dfad88fa53c9e85fb50abf'/>
<id>urn:sha1:51891498f2da78ee64dfad88fa53c9e85fb50abf</id>
<content type='text'>
The restriction introduced in 7a0df7fbc145 ("seccomp: Make NEW_LISTENER and
TSYNC flags exclusive") is mostly artificial: there is enough information
in a seccomp user notification to tell which thread triggered a
notification. The reason it was introduced is because TSYNC makes the
syscall return a thread-id on failure, and NEW_LISTENER returns an fd, and
there's no way to distinguish between these two cases (well, I suppose the
caller could check all fds it has, then do the syscall, and if the return
value was an fd that already existed, then it must be a thread id, but
bleh).

Matthew would like to use these two flags together in the Chrome sandbox
which wants to use TSYNC for video drivers and NEW_LISTENER to proxy
syscalls.

So, let's fix this ugliness by adding another flag, TSYNC_ESRCH, which
tells the kernel to just return -ESRCH on a TSYNC error. This way,
NEW_LISTENER (and any subsequent seccomp() commands that want to return
positive values) don't conflict with each other.

Suggested-by: Matthew Denton &lt;mpdenton@google.com&gt;
Signed-off-by: Tycho Andersen &lt;tycho@tycho.ws&gt;
Link: https://lore.kernel.org/r/20200304180517.23867-1-tycho@tycho.ws
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: rework define for SECCOMP_USER_NOTIF_FLAG_CONTINUE</title>
<updated>2019-10-28T19:29:46+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2019-10-24T21:25:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=23b2c96fad21886c53f5e1a4ffedd45ddd2e85ba'/>
<id>urn:sha1:23b2c96fad21886c53f5e1a4ffedd45ddd2e85ba</id>
<content type='text'>
Switch from BIT(0) to (1UL &lt;&lt; 0).
First, there are already two different forms used in the header, so there's
no need to add a third. Second, the BIT() macros is kernel internal and
afaict not actually exposed to userspace. Maybe there's some magic there
I'm missing but it definitely causes issues when compiling a program that
tries to use SECCOMP_USER_NOTIF_FLAG_CONTINUE. It currently fails in the
following way:

	# github.com/lxc/lxd/lxd
	/usr/bin/ld: $WORK/b001/_x003.o: in function
	`__do_user_notification_continue':
	lxd/main_checkfeature.go:240: undefined reference to `BIT'
	collect2: error: ld returned 1 exit status

Switching to (1UL &lt;&lt; 0) should prevent that and is more in line what is
already done in the rest of the header.

Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Link: https://lore.kernel.org/r/20191024212539.4059-1-christian.brauner@ubuntu.com
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE</title>
<updated>2019-10-10T21:45:51+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2019-09-20T08:30:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fb3c5386b382d4097476ce9647260fc89b34afdb'/>
<id>urn:sha1:fb3c5386b382d4097476ce9647260fc89b34afdb</id>
<content type='text'>
This allows the seccomp notifier to continue a syscall. A positive
discussion about this feature was triggered by a post to the
ksummit-discuss mailing list (cf. [3]) and took place during KSummit
(cf. [1]) and again at the containers/checkpoint-restore
micro-conference at Linux Plumbers.

Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4])
which enables a process (watchee) to retrieve an fd for its seccomp
filter. This fd can then be handed to another (usually more privileged)
process (watcher). The watcher will then be able to receive seccomp
messages about the syscalls having been performed by the watchee.

This feature is heavily used in some userspace workloads. For example,
it is currently used to intercept mknod() syscalls in user namespaces
aka in containers.
The mknod() syscall can be easily filtered based on dev_t. This allows
us to only intercept a very specific subset of mknod() syscalls.
Furthermore, mknod() is not possible in user namespaces toto coelo and
so intercepting and denying syscalls that are not in the whitelist on
accident is not a big deal. The watchee won't notice a difference.

In contrast to mknod(), a lot of other syscall we intercept (e.g.
setxattr()) cannot be easily filtered like mknod() because they have
pointer arguments. Additionally, some of them might actually succeed in
user namespaces (e.g. setxattr() for all "user.*" xattrs). Since we
currently cannot tell seccomp to continue from a user notifier we are
stuck with performing all of the syscalls in lieu of the container. This
is a huge security liability since it is extremely difficult to
correctly assume all of the necessary privileges of the calling task
such that the syscall can be successfully emulated without escaping
other additional security restrictions (think missing CAP_MKNOD for
mknod(), or MS_NODEV on a filesystem etc.). This can be solved by
telling seccomp to resume the syscall.

One thing that came up in the discussion was the problem that another
thread could change the memory after userspace has decided to let the
syscall continue which is a well known TOCTOU with seccomp which is
present in other ways already.
The discussion showed that this feature is already very useful for any
syscall without pointer arguments. For any accidentally intercepted
non-pointer syscall it is safe to continue.
For syscalls with pointer arguments there is a race but for any cautious
userspace and the main usec cases the race doesn't matter. The notifier
is intended to be used in a scenario where a more privileged watcher
supervises the syscalls of lesser privileged watchee to allow it to get
around kernel-enforced limitations by performing the syscall for it
whenever deemed save by the watcher. Hence, if a user tricks the watcher
into allowing a syscall they will either get a deny based on
kernel-enforced restrictions later or they will have changed the
arguments in such a way that they manage to perform a syscall with
arguments that they would've been allowed to do anyway.
In general, it is good to point out again, that the notifier fd was not
intended to allow userspace to implement a security policy but rather to
work around kernel security mechanisms in cases where the watcher knows
that a given action is safe to perform.

/* References */
[1]: https://linuxplumbersconf.org/event/4/contributions/560
[2]: https://linuxplumbersconf.org/event/4/contributions/477
[3]: https://lore.kernel.org/r/20190719093538.dhyopljyr5ns33qx@brauner.io
[4]: commit 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")

Co-developed-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Reviewed-by: Tycho Andersen &lt;tycho@tycho.ws&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: Will Drewry &lt;wad@chromium.org&gt;
CC: Tyler Hicks &lt;tyhicks@canonical.com&gt;
Link: https://lore.kernel.org/r/20190920083007.11475-2-christian.brauner@ubuntu.com
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: add a return code to trap to userspace</title>
<updated>2018-12-12T00:28:41+00:00</updated>
<author>
<name>Tycho Andersen</name>
<email>tycho@tycho.ws</email>
</author>
<published>2018-12-09T18:24:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6a21cc50f0c7f87dae5259f6cfefe024412313f6'/>
<id>urn:sha1:6a21cc50f0c7f87dae5259f6cfefe024412313f6</id>
<content type='text'>
This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mount() in general since various
filesystems assume a trusted image. However, if an orchestrator knows that
e.g. a particular block device has not been exposed to a container for
writing, it want to allow the container to mount that block device (that
is, handle the mount for it).

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since upstart itself uses ptrace to monitor services while starting.

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.

Signed-off-by: Tycho Andersen &lt;tycho@tycho.ws&gt;
CC: Kees Cook &lt;keescook@chromium.org&gt;
CC: Andy Lutomirski &lt;luto@amacapital.net&gt;
CC: Oleg Nesterov &lt;oleg@redhat.com&gt;
CC: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
CC: "Serge E. Hallyn" &lt;serge@hallyn.com&gt;
Acked-by: Serge Hallyn &lt;serge@hallyn.com&gt;
CC: Christian Brauner &lt;christian@brauner.io&gt;
CC: Tyler Hicks &lt;tyhicks@canonical.com&gt;
CC: Akihiro Suda &lt;suda.akihiro@lab.ntt.co.jp&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: Add filter flag to opt-out of SSB mitigation</title>
<updated>2018-05-04T22:51:44+00:00</updated>
<author>
<name>Kees Cook</name>
<email>keescook@chromium.org</email>
</author>
<published>2018-05-03T21:56:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=00a02d0c502a06d15e07b857f8ff921e3e402675'/>
<id>urn:sha1:00a02d0c502a06d15e07b857f8ff921e3e402675</id>
<content type='text'>
If a seccomp user is not interested in Speculative Store Bypass mitigation
by default, it can set the new SECCOMP_FILTER_FLAG_SPEC_ALLOW flag when
adding filters.

Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
</entry>
</feed>
