seccomp: Add wait_killable semantic to seccomp user notifier

This introduces a per-filter flag (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) that makes it so that when notifications are received by the supervisor the notifying process will transition to wait killable semantics. Although wait killable isn't a set of semantics formally exposed to userspace, the concept is searchable. If the notifying process is signaled prior to the notification being received by the userspace agent, it will be handled as normal. One quirk about how this is handled is that the notifying process only switches to TASK_KILLABLE if it receives a wakeup from either an addfd or a signal. This is to avoid an unnecessary wakeup of the notifying task. The reasons behind switching into wait_killable only after userspace receives the notification are: * Avoiding unncessary work - Often, workloads will perform work that they may abort (request racing comes to mind). This allows for syscalls to be aborted safely prior to the notification being received by the supervisor. In this, the supervisor doesn't end up doing work that the workload does not want to complete anyways. * Avoiding side effects - We don't want the syscall to be interruptible once the supervisor starts doing work because it may not be trivial to reverse the operation. For example, unmounting a file system may take a long time, and it's hard to rollback, or treat that as reentrant. * Avoid breaking runtimes - Various runtimes do not GC when they are during a syscall (or while running native code that subsequently calls a syscall). If many notifications are blocked, and not picked up by the supervisor, this can get the application into a bad state. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220503080958.20220-2-sargun@sargun.me
author: Sargun Dhillon <sargun@sargun.me> 2022-05-03 11:09:56 +0300
committer: Kees Cook <keescook@chromium.org> 2022-05-04 00:11:58 +0300
commit: c2aa2dfef243efe213a480a1ee8566507a5152f4 (patch)
tree: 81b322a5a5f425bc485ed1ac0c7bcceac1d7f22c /Documentation/userspace-api
parent: 662340ef921828507c931da6db303fa3cb02228e (diff)
download: linux-c2aa2dfef243efe213a480a1ee8566507a5152f4.tar.xz
1 files changed, 10 insertions, 0 deletions
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index 539e9d4a4860..d1e2b9193f09 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -271,6 +271,16 @@ notifying process it will be replaced. The supervisor can also add an FD, and
 respond atomically by using the ``SECCOMP_ADDFD_FLAG_SEND`` flag and the return
 value will be the injected file descriptor number.
 
+The notifying process can be preempted, resulting in the notification being
+aborted. This can be problematic when trying to take actions on behalf of the
+notifying process that are long-running and typically retryable (mounting a
+filesytem). Alternatively, at filter installation time, the
+``SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV`` flag can be set. This flag makes it
+such that when a user notification is received by the supervisor, the notifying
+process will ignore non-fatal signals until the response is sent. Signals that
+are sent prior to the notification being received by userspace are handled
+normally.
+
 It is worth noting that ``struct seccomp_data`` contains the values of register
 arguments to the syscall, but does not contain pointers to memory. The task's
 memory is accessible to suitably privileged traces via ``ptrace()`` or
author	Sargun Dhillon <sargun@sargun.me>	2022-05-03 11:09:56 +0300
committer	Kees Cook <keescook@chromium.org>	2022-05-04 00:11:58 +0300
commit	c2aa2dfef243efe213a480a1ee8566507a5152f4 (patch)
tree	81b322a5a5f425bc485ed1ac0c7bcceac1d7f22c /Documentation/userspace-api
parent	662340ef921828507c931da6db303fa3cb02228e (diff)
download	linux-c2aa2dfef243efe213a480a1ee8566507a5152f4.tar.xz