kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
2025-07-13	um: Stop tracking stub's PID via userspace_pid[]	Tiwei Bie	1	-10/+6
	The PID of the stub process can be obtained from current_mm_id(). There is no need to track it via userspace_pid[]. Stop doing that to simplify the code. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250711065021.2535362-4-tiwei.bie@linux.dev Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-13	um: Remove the pid parameter of handle_trap()	Tiwei Bie	1	-2/+2
	It's no longer used. Remove it. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250711065021.2535362-3-tiwei.bie@linux.dev Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-13	um: Use err consistently in userspace()	Tiwei Bie	1	-7/+6
	Avoid declaring a new variable 'ret' inside the 'if (using_seccomp)' block, as the existing 'err' variable declared at the top of the function already serves the same purpose. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250711065021.2535362-2-tiwei.bie@linux.dev Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-11	um: Make unscheduled_userspace_iterations static	Tiwei Bie	1	-1/+1
	It's only used within process.c. Make it static. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250708090403.1067440-2-tiwei.bie@linux.dev Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-04	um: fix SECCOMP 32bit xstate register restore	Benjamin Berg	1	-4/+0
	There was a typo that caused the extended FP state to be copied into the wrong location on 32 bit. On 32 bit we only store the xstate internally as that already contains everything. However, for compatibility, the mcontext on 32 bit first contains the legacy FP state and then the xstate. The code copied the xstate on top of the legacy FP state instead of using the correct offset. This offset was already calculated in the xstate_* variables, so simply switch to those to fix the problem. With this SECCOMP mode works on 32 bit, so lift the restriction. Fixes: b1e1bd2e6943 ("um: Add helper functions to get/set state for SECCOMP") Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250604081705.934112-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-02	um: pass FD for memory operations when needed	Benjamin Berg	3	-41/+161
	Instead of always sharing the FDs with the userspace process, only hand over the FDs needed for mmap when required. The idea is that userspace might be able to force the stub into executing an mmap syscall, however, it will not be able to manipulate the control flow sufficiently to have access to an FD that would allow mapping arbitrary memory. Security wise, we need to be sure that only the expected syscalls are executed after the kernel sends FDs through the socket. This is currently not the case, as userspace can trivially jump to the rt_sigreturn syscall instruction to execute any syscall that the stub is permitted to do. With this, it can trick the kernel to send the FD, which in turn allows userspace to freely map any physical memory. As such, this is currently not secure. However, in principle the approach should be fine with a more strict SECCOMP filter and a careful review of the stub control flow (as userspace can prepare a stack). With some care, it is likely possible to extend the security model to SMP if desired. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250602130052.545733-8-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-02	um: Add SECCOMP support detection and initialization	Benjamin Berg	2	-4/+195
	This detects seccomp support, sets the global using_seccomp variable and initilizes the exec registers. The support is only enabled if the seccomp= kernel parameter is set to either "on" or "auto". With "auto" a fallback to ptrace mode will happen if initialization failed. Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250602130052.545733-7-benjamin@sipsolutions.net [extend help with Kconfig text from v2, use exit syscall instead of libc, remove unneeded mctx_offset assignment, disable on 32-bit for now] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-02	um: Implement kernel side of SECCOMP based process handling	Benjamin Berg	3	-117/+299
	This adds the kernel side of the seccomp based process handling. Co-authored-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250602130052.545733-6-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-02	um: Track userspace children dying in SECCOMP mode	Benjamin Berg	3	-1/+50
	When in seccomp mode, we would hang forever on the futex if a child has died unexpectedly. In contrast, ptrace mode will notice it and kill the corresponding thread when it fails to run it. Fix this issue using a new IRQ that is fired after a SIGCHLD and keeping an (internal) list of all MMs. In the IRQ handler, find the affected MM and set its PID to -1 as well as the futex variable to FUTEX_IN_KERN. This, together with futex returning -EINTR after the signal is sufficient to implement a race-free detection of a child dying. Note that this also enables IRQ handling while starting a userspace process. This should be safe and SECCOMP requires the IRQ in case the process does not come up properly. Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250602130052.545733-5-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-02	um: Move faultinfo extraction into userspace routine	Benjamin Berg	1	-11/+6
	The segv handler is called slightly differently depending on whether PTRACE_FULL_FAULTINFO is set or not (32bit vs. 64bit). The only difference is that we don't try to pass the registers and instruction pointer to the segv handler. It would be good to either document or remove the difference, but I do not know why this difference exists. And, passing NULL can even result in a crash. Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Link: https://patch.msgid.link/20250602130052.545733-2-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-02	um: Fix tgkill compile error on old host OSes	Yongting Lin	1	-1/+2
	tgkill is a quite old syscall since kernel 2.5.75, but unfortunately glibc doesn't support it before 2.30. Thus some systems fail to compile the latest UserMode Linux. Here is the compile error I encountered when I tried to compile UML in my system shipped with glibc-2.28. CALL scripts/checksyscalls.sh CC arch/um/os-Linux/sigio.o In file included from arch/um/os-Linux/sigio.c:17: arch/um/os-Linux/sigio.c: In function ‘write_sigio_thread’: arch/um/os-Linux/sigio.c:49:19: error: implicit declaration of function ‘tgkill’; did you mean ‘kill’? [-Werror=implicit-function-declaration] CATCH_EINTR(r = tgkill(pid, pid, SIGIO)); ^~~~~~ ./arch/um/include/shared/os.h:21:48: note: in definition of macro ‘CATCH_EINTR’ #define CATCH_EINTR(expr) while ((errno = 0, ((expr) < 0)) && (errno == EINTR)) ^~~~ cc1: some warnings being treated as errors Fix it by Replacing glibc call with raw syscall. Fixes: 33c9da5dfb18 ("um: Rewrite the sigio workaround based on epoll and tgkill") Signed-off-by: Yongting Lin <linyongting@gmail.com> Link: https://patch.msgid.link/20250527151222.40371-1-linyongting@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-05	um: Remove obsolete legacy network transports	Tiwei Bie	9	-720/+1
	These legacy network transports were marked as obsolete in commit 40814b98a570 ("um: Mark non-vector net transports as obsolete"). More than five years have passed since then. Remove these network transports to reduce the maintenance burden. Suggested-by: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Link: https://patch.msgid.link/20250503051710.3286595-2-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-20	um: Rewrite the sigio workaround based on epoll and tgkill	Tiwei Bie	1	-286/+44
	The existing sigio workaround implementation removes FDs from the poll when events are triggered, requiring users to re-add them via add_sigio_fd() after processing. This introduces a potential race condition between FD removal in write_sigio_thread() and next_poll update in __add_sigio_fd(), and is inefficient due to frequent FD removal and re-addition. Rewrite the implementation based on epoll and tgkill for improved efficiency and reliability. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250315161910.4082396-2-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-20	um: Prohibit the VM_CLONE flag in run_helper_thread()	Tiwei Bie	1	-0/+4
	Directly creating helper threads with VM_CLONE using clone can compromise the thread safety of errno. Since all these helper threads have been converted to use os_run_helper_thread(), let's prevent using this flag in run_helper_thread(). Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250319135523.97050-5-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-20	um: Switch to the pthread-based helper in sigio workaround	Tiwei Bie	1	-25/+19
	The write_sigio thread and UML kernel thread share the same errno, which can lead to conflicts when both call syscalls concurrently. Switch to the pthread-based helper to address this issue. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250319135523.97050-4-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-20	um: Add pthread-based helper support	Tiwei Bie	1	-0/+63
	Introduce a new set of utility functions that can be used to create pthread-based helpers. Helper threads created in this way will ensure thread safety for errno while sharing the same memory space. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250319135523.97050-2-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	um: remove copy_from_kernel_nofault_allowed	Benjamin Berg	1	-51/+0
	There is no need to override the default version of this function anymore as UML now has proper _nofault memory access functions. Doing this also fixes the fact that the implementation was incorrect as using mincore() will incorrectly flag pages as inaccessible if they were swapped out by the host. Fixes: f75b1b1bedfb ("um: Implement probe_kernel_read()") Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250210160926.420133-3-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	um: mark rodata read-only and implement _nofault accesses	Johannes Berg	2	-6/+6
	Mark read-only data actually read-only (simple mprotect), and to be able to test it also implement _nofault accesses. This works by setting up a new "segv_continue" pointer in current, and then when we hit a segfault we change the signal return context so that we continue at that address. The code using this sets it up so that it jumps to a label and then aborts the access that way, returning -EFAULT. It's possible to optimize the ___backtrack_faulted() thing by using asm goto (compiler version dependent) and/or gcc's (not sure if clang has it) &&label extension, but at least in one attempt I made the && caused the compiler to not load -EFAULT into the register in case of jumping to the &&label from the fault handler. So leave it like this for now. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Co-developed-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250210160926.420133-2-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-13	um: fix execve stub execution on old host OSs	Benjamin Berg	1	-3/+13
	The stub execution uses the somewhat new close_range and execveat syscalls. Of these two, the execveat call is essential, but the close_range call is more about stub process hygiene rather than safety (and its result is ignored). Replace both calls with a raw syscall as older machines might not have a recent enough kernel for close_range (with CLOSE_RANGE_CLOEXEC) or a libc that does not yet expose both of the syscalls. Fixes: 32e8eaf263d9 ("um: use execveat to create userspace MMs") Reported-by: Glenn Washburn <development@efficientek.com> Closes: https://lore.kernel.org/20250108022404.05e0de1e@crass-HP-ZBook-15-G2 Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250113094107.674738-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2025-01-10	um: Remove unused THREAD_NAME_LEN macro	Tiwei Bie	1	-1/+0
	It's no longer used since commit 42fda66387da ("uml: throw out CONFIG_MODE_TT"). Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241128083137.2219830-9-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-10	um: Remove unused PGD_BOUND macro	Tiwei Bie	1	-1/+0
	It's no longer used since commit 11100b1dfb6e ("uml: delete unused code"). Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241128083137.2219830-8-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-10	um: Mark setup_env_path as __init	Tiwei Bie	1	-1/+1
	It's only invoked during boot from main(). Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241128083137.2219830-7-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-10	um: Mark install_fatal_handler as __init	Tiwei Bie	1	-1/+1
	It's only invoked during boot from main(). Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241128083137.2219830-6-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-10	um: Mark set_stklim as __init	Tiwei Bie	1	-1/+1
	It's only invoked during boot from main(). Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241128083137.2219830-5-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-12	um: move thread info into task	Benjamin Berg	1	-36/+1
	This selects the THREAD_INFO_IN_TASK option for UM and changes the way that the current task is discovered. This is trivial though, as UML already tracks the current task in cpu_tasks[] and this can be used to retrieve it. Also remove the signal handler code that copies the thread information into the IRQ stack. It is obsolete now, which also means that the mentioned race condition cannot happen anymore. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Reviewed-by: Hajime Tazaki <thehajime@gmail.com> Link: https://patch.msgid.link/20241111102910.46512-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-07	um: remove broken double fault detection	Benjamin Berg	1	-8/+0
	The show_stack function had some code to detect double faults. However, the logic is wrong and it would e.g. trigger if a WARNING happened inside an IRQ. Remove it without trying to add a new logic. The current behaviour, which will just fault repeatedly until the IRQ stack is used up and the host kills UML, seems to be good enough. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241103150506.1367695-5-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-07	um: remove file sync for stub data	Benjamin Berg	1	-6/+0
	There is no need to sync the stub code to "disk" for the other process to see the correct memory. Drop the fsync there and remove the helper function. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241103150506.1367695-3-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-07	um: always include kconfig.h and compiler-version.h	Benjamin Berg	3	-8/+10
	Since commit a95b37e20db9 ("kbuild: get <linux/compiler_types.h> out of <linux/kconfig.h>") we can safely include these files in userspace code. Doing so simplifies matters as options do not need to be exported via asm-offsets.h anymore. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241103150506.1367695-2-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-07	um: set DONTDUMP and DONTFORK flags on KASAN shadow memory	Benjamin Berg	1	-0/+12
	There is no point in either dumping the KASAN shadow memory or doing copy-on-write after a fork on these memory regions. This considerably speeds up coredump generation. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241103150506.1367695-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-25	um: Set parent-death signal for write_sigio thread/process	Tiwei Bie	1	-0/+1
	The write_sigio thread is not really a traditional thread. Set the parent-death signal for it to ensure that it will be killed if the UML kernel dies unexpectedly without proper cleanup. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241024142828.2612828-4-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-25	um: Add os_set_pdeathsig helper function	Tiwei Bie	1	-0/+6
	This helper can be used to set the parent-death signal of the calling process to SIGKILL to ensure that the process will be killed if the UML kernel dies unexpectedly without proper cleanup. This helper will be used in the follow-up patches. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241024142828.2612828-2-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-24	um: remove PATH_MAX use	Johannes Berg	1	-2/+1
	Evidently, PATH_MAX isn't always defined, at least not via <limits.h>. Simply remove the use and replace it by a constant 4k. As stat::st_size is zero for /proc/self/exe we can't even size it automatically, and it seems unlikely someone's going to try to run UML with such a path. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202410240553.gYNIXN8i-lkp@intel.com/ Fixes: 031acdcfb566 ("um: restore process name") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-23	um: switch to regset API and depend on XSTATE	Benjamin Berg	1	-3/+8
	The PTRACE_GETREGSET API has now existed since Linux 2.6.33. The XSAVE CPU feature should also be sufficiently common to be able to rely on it. With this, define our internal FP state to be the hosts XSAVE data. Add discovery for the hosts XSAVE size and place the FP registers at the end of task_struct so that we can adjust the size at runtime. Next we can implement the regset API on top and update the signal handling as well as ptrace APIs to use them. Also switch coredump creation to use the regset API and finally set HAVE_ARCH_TRACEHOOK. This considerably improves the signal frames. Previously they might not have contained all the registers (i386) and also did not have the sizes and magic values set to the correct values to permit userspace to decode the frame. As a side effect, this will permit UML to run on hosts with newer CPU extensions (such as AMX) that need even more register state. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241023094120.4083426-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-23	um: insert scheduler ticks when userspace does not yield	Benjamin Berg	1	-0/+26
	In time-travel mode userspace can do a lot of work without any time passing. Unfortunately, this can result in OOM situations as the RCU core code will never be run. Work around this by keeping track of userspace processes that do not yield for a lot of operations. When this happens, insert a jiffie into the sched_clock clock to account time against the process and cause the bookkeeping to run. As sched_clock is used for tracing, it is useful to keep it in sync between the different VMs. As such, try to remove added ticks again when the actual clock ticks. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241010142537.1134685-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-23	um: Abandon the _PAGE_NEWPROT bit	Tiwei Bie	1	-21/+0
	When a PTE is updated in the page table, the _PAGE_NEWPAGE bit will always be set. And the corresponding page will always be mapped or unmapped depending on whether the PTE is present or not. The check on the _PAGE_NEWPROT bit is not really reachable. Abandoning it will allow us to simplify the code and remove the unreachable code. Reviewed-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241011102354.1682626-2-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-23	um: Do not propagate noreboot parameter to kernel	Tiwei Bie	1	-0/+1
	This parameter is UML specific and is unknown to kernel. It should not be propagated to kernel, otherwise it could be passed to user space as a command line option by kernel with a warning like: Unknown kernel command line parameters "noreboot", will be passed to user space. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241011040441.1586345-6-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-23	um: Do not propagate uml_dir parameter to kernel	Tiwei Bie	1	-0/+2
	This parameter is UML specific and is unknown to kernel. It should not be propagated to kernel, otherwise it will be passed to user space as an environment option by kernel with a warning like: Unknown kernel command line parameters "uml_dir=/foo", will be passed to user space. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241011040441.1586345-4-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-23	um: remove fault_catcher infrastructure	Johannes Berg	1	-1/+1
	This was perhaps intended to do _nofault copies, but the real reason is lost to history. Remove this, it's not needed, and using longjmp() out of the middle of the signal handler with all the state it has modified is not going to be a good idea anyway. Link: https://patch.msgid.link/20241010224513.901c4d390b3e.Ia74742668b44603c1ca23dd36f90e964e6e7ee55@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-23	um: restore process name	Johannes Berg	2	-2/+14
	After the execve() to disable ASLR, comm is now "exe", which is a bit confusing. Use readlink() to get this to the right name again. Disable stack frame size warnings on main.o since it's part of the initial userspace and can use larger stack. Fixes: 68b9883cc16e ("um: Discover host_task_size from envp") Link: https://patch.msgid.link/20241010161411.c576e2aeb3e5.I244d4f34b8a8555ee5bec0e1cf5027bce4cc491b@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-10	um: Discover host_task_size from envp	Benjamin Berg	1	-1/+8
	When loading the UML binary, the host kernel will place the stack at the highest possible address. It will then map the program name and environment variables onto the start of the stack. As such, an easy way to figure out the host_task_size is to use the highest pointer to an environment variable as a reference. Ensure that this works by disabling address layout randomization and re-executing UML in case it was enabled. This increases the available TASK_SIZE for 64 bit UML considerably. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240919124511.282088-9-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-10	um: use execveat to create userspace MMs	Benjamin Berg	2	-57/+122
	Using clone will not undo features that have been enabled by libc. An example of this already happening is rseq, which could cause the kernel to read/write memory of the userspace process. In the future the standard library might also use mseal by default to protect itself, which would also thwart our attempts at unmapping everything. Solve all this by taking a step back and doing an execve into a tiny static binary that sets up the minimal environment required for the stub without using any standard library. That way we have a clean execution environment that is fully under the control of UML. Note that this changes things a bit as the FDs are not anymore shared with the kernel. Instead, we explicitly share the FDs for the physical memory and all existing iomem regions. Doing this is fine, as iomem regions cannot be added at runtime. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240919124511.282088-3-benjamin@sipsolutions.net [use pipe() instead of pipe2(), remove unneeded close() calls] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-10	um: remove auxiliary FP registers	Benjamin Berg	1	-19/+6
	We do not need the extra save/restore of the FP registers when getting the fault information. This was originally added in commit 2f56debd77a8 ("uml: fix FP register corruption") but at that time the code was not saving/restoring the FP registers when switching to userspace. This was fixed in commit fbfe9c847edf ("um: Save FPU registers between task switches") and since then the auxiliary registers have not been useful. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241004233821.2130874-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-10	um: Remove the redundant declaration of high_physmem	Tiwei Bie	1	-4/+1
	high_physmem has already been declared in as-layout.h, so there is no need to declare it explicitly in the .c file again. While at it, group the declarations of __real_malloc and __real_free together to make the code slightly more readable. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20240916045950.508910-2-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-10	um: Remove unused os_getpgrp function	Benjamin Berg	1	-5/+0
	The function is not used anywhere. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240913134442.967599-5-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-10	um: Remove unused os_stop_process	Benjamin Berg	1	-5/+0
	The function is not used anywhere. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240913134442.967599-4-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-10	um: Remove unused os_process_parent	Benjamin Berg	1	-39/+0
	The function is not used anywhere. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240913134442.967599-3-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-10	um: Remove unused os_process_pc	Benjamin Berg	1	-33/+0
	The function is not used anywhere in the codebase. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240913134442.967599-2-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-09-12	um: Remove unused mm_fd field from mm_id	Tiwei Bie	2	-2/+2
	It's no longer used since the removal of the SKAS3/4 support. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-09-12	um: remove variable stack array in os_rcv_fd_msg()	Johannes Berg	1	-2/+6
	When generalizing this, I was in the mindset of this being "userspace" code, but even there we should not use variable arrays as the kernel is moving away from allowing that. Simply reserve (but not use) enough space for the maximum two descriptors we might need now, and return an error if attempting to receive more than that. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407041459.3SYg4TEi-lkp@intel.com/ Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-03	um: refactor TLB update handling	Benjamin Berg	1	-0/+2
	Conceptually, we want the memory mappings to always be up to date and represent whatever is in the TLB. To ensure that, we need to sync them over in the userspace case and for the kernel we need to process the mappings. The kernel will call flush_tlb_* if page table entries that were valid before become invalid. Unfortunately, this is not the case if entries are added. As such, change both flush_tlb_* and set_ptes to track the memory range that has to be synchronized. For the kernel, we need to execute a flush_tlb_kern_* immediately but we can wait for the first page fault in case of set_ptes. For userspace in contrast we only store that a range of memory needs to be synced and do so whenever we switch to that process. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240703134536.1161108-13-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>