summaryrefslogtreecommitdiff
path: root/arch/x86/entry
AgeCommit message (Collapse)AuthorFilesLines
2020-10-28Merge tag 'x86-urgent-2020-10-27' of ↵Linus Torvalds1-4/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A couple of x86 fixes which missed rc1 due to my stupidity: - Drop lazy TLB mode before switching to the temporary address space for text patching. text_poke() switches to the temporary mm which clears the lazy mode and restores the original mm afterwards. Due to clearing lazy mode this might restore a already dead mm if exit_mmap() runs in parallel on another CPU. - Document the x32 syscall design fail vs. syscall numbers 512-547 properly. - Fix the ORC unwinder to handle the inactive task frame correctly. This was unearthed due to the slightly different code generation of gcc-10. - Use an up to date screen_info for the boot params of kexec instead of the possibly stale and invalid version which happened to be valid when the kexec kernel was loaded" * tag 'x86-urgent-2020-10-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/alternative: Don't call text_poke() in lazy TLB mode x86/syscalls: Document the fact that syscalls 512-547 are a legacy mistake x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC 10 compiled kernels hyperv_fb: Update screen_info after removing old framebuffer x86/kexec: Use up-to-dated screen_info copy to fill boot params
2020-10-22Merge tag 'kbuild-v5.10' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Support 'make compile_commands.json' to generate the compilation database more easily, avoiding stale entries - Support 'make clang-analyzer' and 'make clang-tidy' for static checks using clang-tidy - Preprocess scripts/modules.lds.S to allow CONFIG options in the module linker script - Drop cc-option tests from compiler flags supported by our minimal GCC/Clang versions - Use always 12-digits commit hash for CONFIG_LOCALVERSION_AUTO=y - Use sha1 build id for both BFD linker and LLD - Improve deb-pkg for reproducible builds and rootless builds - Remove stale, useless scripts/namespace.pl - Turn -Wreturn-type warning into error - Fix build error of deb-pkg when CONFIG_MODULES=n - Replace 'hostname' command with more portable 'uname -n' - Various Makefile cleanups * tag 'kbuild-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits) kbuild: Use uname for LINUX_COMPILE_HOST detection kbuild: Only add -fno-var-tracking-assignments for old GCC versions kbuild: remove leftover comment for filechk utility treewide: remove DISABLE_LTO kbuild: deb-pkg: clean up package name variables kbuild: deb-pkg: do not build linux-headers package if CONFIG_MODULES=n kbuild: enforce -Werror=return-type scripts: remove namespace.pl builddeb: Add support for all required debian/rules targets builddeb: Enable rootless builds builddeb: Pass -n to gzip for reproducible packages kbuild: split the build log of kallsyms kbuild: explicitly specify the build id style scripts/setlocalversion: make git describe output more reliable kbuild: remove cc-option test of -Werror=date-time kbuild: remove cc-option test of -fno-stack-check kbuild: remove cc-option test of -fno-strict-overflow kbuild: move CFLAGS_{KASAN,UBSAN,KCSAN} exports to relevant Makefiles kbuild: remove redundant CONFIG_KASAN check from scripts/Makefile.kasan kbuild: do not create built-in objects for external module builds ...
2020-10-20treewide: remove DISABLE_LTOSami Tolvanen1-2/+0
This change removes all instances of DISABLE_LTO from Makefiles, as they are currently unused, and the preferred method of disabling LTO is to filter out the flags instead. Note added by Masahiro Yamada: DISABLE_LTO was added as preparation for GCC LTO, but GCC LTO was not pulled into the mainline. (https://lkml.org/lkml/2014/4/8/272) Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-10-18mm/madvise: introduce process_madvise() syscall: an external memory hinting APIMinchan Kim2-0/+2
There is usecase that System Management Software(SMS) want to give a memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the case of Android, it is the ActivityManagerService. The information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon(ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces a new syscall process_madvise(2). It uses pidfd of an external process to give the hint. It also supports vector address range because Android app has thousands of vmas due to zygote so it's totally waste of CPU and power if we should call the syscall one by one for each vma.(With testing 2000-vma syscall vs 1-vector syscall, it showed 15% performance improvement. I think it would be bigger in real practice because the testing ran very cache friendly environment). Another potential use case for the vector range is to amortize the cost ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could benefit users like TCP receive zerocopy and malloc implementations. In future, we could find more usecases for other advises so let's make it happens as API since we introduce a new syscall at this moment. With that, existing madvise(2) user could replace it with process_madvise(2) with their own pid if they want to have batch address ranges support feature. ince it could affect other process's address range, only privileged process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same UID) gives it the right to ptrace the process could use it successfully. The flag argument is reserved for future use if we need to extend the API. I think supporting all hints madvise has/will supported/support to process_madvise is rather risky. Because we are not sure all hints make sense from external process and implementation for the hint may rely on the caller being in the current context so it could be error-prone. Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. If someone want to add other hints, we could hear the usecase and review it for each hint. It's safer for maintenance rather than introducing a buggy syscall but hard to fix it later. So finally, the API is as follows, ssize_t process_madvise(int pidfd, const struct iovec *iovec, unsigned long vlen, int advice, unsigned int flags); DESCRIPTION The process_madvise() system call is used to give advice or directions to the kernel about the address ranges from external process as well as local process. It provides the advice to address ranges of process described by iovec and vlen. The goal of such advice is to improve system or application performance. The pidfd selects the process referred to by the PID file descriptor specified in pidfd. (See pidofd_open(2) for further information) The pointer iovec points to an array of iovec structures, defined in <sys/uio.h> as: struct iovec { void *iov_base; /* starting address */ size_t iov_len; /* number of bytes to be advised */ }; The iovec describes address ranges beginning at address(iov_base) and with size length of bytes(iov_len). The vlen represents the number of elements in iovec. The advice is indicated in the advice argument, which is one of the following at this moment if the target process specified by pidfd is external. MADV_COLD MADV_PAGEOUT Permission to provide a hint to external process is governed by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2). The process_madvise supports every advice madvise(2) has if target process is in same thread group with calling process so user could use process_madvise(2) to extend existing madvise(2) to support vector address ranges. RETURN VALUE On success, process_madvise() returns the number of bytes advised. This return value may be less than the total number of requested bytes, if an error occurred. The caller should check return value to determine whether a partial advice occurred. FAQ: Q.1 - Why does any external entity have better knowledge? Quote from Sandeep "For Android, every application (including the special SystemServer) are forked from Zygote. The reason of course is to share as many libraries and classes between the two as possible to benefit from the preloading during boot. After applications start, (almost) all of the APIs end up calling into this SystemServer process over IPC (binder) and back to the application. In a fully running system, the SystemServer monitors every single process periodically to calculate their PSS / RSS and also decides which process is "important" to the user for interactivity. So, because of how these processes start _and_ the fact that the SystemServer is looping to monitor each process, it does tend to *know* which address range of the application is not used / useful. Besides, we can never rely on applications to clean things up themselves. We've had the "hey app1, the system is low on memory, please trim your memory usage down" notifications for a long time[1]. They rely on applications honoring the broadcasts and very few do. So, if we want to avoid the inevitable killing of the application and restarting it, some way to be able to tell the OS about unimportant memory in these applications will be useful. - ssp Q.2 - How to guarantee the race(i.e., object validation) between when giving a hint from an external process and get the hint from the target process? process_madvise operates on the target process's address space as it exists at the instant that process_madvise is called. If the space target process can run between the time the process_madvise process inspects the target process address space and the time that process_madvise is actually called, process_madvise may operate on memory regions that the calling process does not expect. It's the responsibility of the process calling process_madvise to close this race condition. For example, the calling process can suspend the target process with ptrace, SIGSTOP, or the freezer cgroup so that it doesn't have an opportunity to change its own address space before process_madvise is called. Another option is to operate on memory regions that the caller knows a priori will be unchanged in the target process. Yet another option is to accept the race for certain process_madvise calls after reasoning that mistargeting will do no harm. The suggested API itself does not provide synchronization. It also apply other APIs like move_pages, process_vm_write. The race isn't really a problem though. Why is it so wrong to require that callers do their own synchronization in some manner? Nobody objects to write(2) merely because it's possible for two processes to open the same file and clobber each other's writes --- instead, we tell people to use flock or something. Think about mmap. It never guarantees newly allocated address space is still valid when the user tries to access it because other threads could unmap the memory right before. That's where we need synchronization by using other API or design from userside. It shouldn't be part of API itself. If someone needs more fine-grained synchronization rather than process level, there were two ideas suggested - cookie[2] and anon-fd[3]. Both are applicable via using last reserved argument of the API but I don't think it's necessary right now since we have already ways to prevent the race so don't want to add additional complexity with more fine-grained optimization model. To make the API extend, it reserved an unsigned long as last argument so we could support it in future if someone really needs it. Q.3 - Why doesn't ptrace work? Injecting an madvise in the target process using ptrace would not work for us because such injected madvise would have to be executed by the target process, which means that process would have to be runnable and that creates the risk of the abovementioned race and hinting a wrong VMA. Furthermore, we want to act the hint in caller's context, not the callee's, because the callee is usually limited in cpuset/cgroups or even freezed state so they can't act by themselves quick enough, which causes more thrashing/kill. It doesn't work if the target process are ptraced(e.g., strace, debugger, minidump) because a process can have at most one ptracer. [1] https://developer.android.com/topic/performance/memory" [2] process_getinfo for getting the cookie which is updated whenever vma of process address layout are changed - Daniel Colascione - https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224 [3] anonymous fd which is used for the object(i.e., address range) validation - Michal Hocko - https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/ [minchan@kernel.org: fix process_madvise build break for arm64] Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com [minchan@kernel.org: fix build error for mips of process_madvise] Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com [akpm@linux-foundation.org: fix patch ordering issue] [akpm@linux-foundation.org: fix arm64 whoops] [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian] [akpm@linux-foundation.org: fix i386 build] [sfr@canb.auug.org.au: fix syscall numbering] Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au [sfr@canb.auug.org.au: madvise.c needs compat.h] Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au [minchan@kernel.org: fix mips build] Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com [yuehaibing@huawei.com: remove duplicate header which is included twice] Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com [minchan@kernel.org: do not use helper functions for process_madvise] Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com [akpm@linux-foundation.org: pidfd_get_pid() gained an argument] [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"] Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <christian@brauner.io> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-14x86/syscalls: Document the fact that syscalls 512-547 are a legacy mistakeAndy Lutomirski1-4/+6
Since this commit: 6365b842aae4 ("x86/syscalls: Split the x32 syscalls into their own table") there is no need for special x32-specific syscall numbers. I forgot to update the comments in syscall_64.tbl. Add comments to make it clear to future contributors that this range is a legacy wart. Reported-by: Jessica Clarke <jrtc27@jrtc27.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/6c56fb4ddd18fc60a238eb4d867e4b3d97c6351e.1602471055.git.luto@kernel.org
2020-10-14Merge tag 'x86_seves_for_v5.10' of ↵Linus Torvalds1-0/+80
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV-ES support from Borislav Petkov: "SEV-ES enhances the current guest memory encryption support called SEV by also encrypting the guest register state, making the registers inaccessible to the hypervisor by en-/decrypting them on world switches. Thus, it adds additional protection to Linux guests against exfiltration, control flow and rollback attacks. With SEV-ES, the guest is in full control of what registers the hypervisor can access. This is provided by a guest-host exchange mechanism based on a new exception vector called VMM Communication Exception (#VC), a new instruction called VMGEXIT and a shared Guest-Host Communication Block which is a decrypted page shared between the guest and the hypervisor. Intercepts to the hypervisor become #VC exceptions in an SEV-ES guest so in order for that exception mechanism to work, the early x86 init code needed to be made able to handle exceptions, which, in itself, brings a bunch of very nice cleanups and improvements to the early boot code like an early page fault handler, allowing for on-demand building of the identity mapping. With that, !KASLR configurations do not use the EFI page table anymore but switch to a kernel-controlled one. The main part of this series adds the support for that new exchange mechanism. The goal has been to keep this as much as possibly separate from the core x86 code by concentrating the machinery in two SEV-ES-specific files: arch/x86/kernel/sev-es-shared.c arch/x86/kernel/sev-es.c Other interaction with core x86 code has been kept at minimum and behind static keys to minimize the performance impact on !SEV-ES setups. Work by Joerg Roedel and Thomas Lendacky and others" * tag 'x86_seves_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits) x86/sev-es: Use GHCB accessor for setting the MMIO scratch buffer x86/sev-es: Check required CPU features for SEV-ES x86/efi: Add GHCB mappings when SEV-ES is active x86/sev-es: Handle NMI State x86/sev-es: Support CPU offline/online x86/head/64: Don't call verify_cpu() on starting APs x86/smpboot: Load TSS and getcpu GDT entry before loading IDT x86/realmode: Setup AP jump table x86/realmode: Add SEV-ES specific trampoline entry point x86/vmware: Add VMware-specific handling for VMMCALL under SEV-ES x86/kvm: Add KVM-specific VMMCALL handling under SEV-ES x86/paravirt: Allow hypervisor-specific VMMCALL handling under SEV-ES x86/sev-es: Handle #DB Events x86/sev-es: Handle #AC Events x86/sev-es: Handle VMMCALL Events x86/sev-es: Handle MWAIT/MWAITX Events x86/sev-es: Handle MONITOR/MONITORX Events x86/sev-es: Handle INVD Events x86/sev-es: Handle RDPMC Events x86/sev-es: Handle RDTSC(P) Events ...
2020-10-13Merge branch 'compat.mount' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull compat mount cleanups from Al Viro: "The last remnants of mount(2) compat buried by Christoph. Buried into NFS, that is. Generally I'm less enthusiastic about "let's use in_compat_syscall() deep in call chain" kind of approach than Christoph seems to be, but in this case it's warranted - that had been an NFS-specific wart, hopefully not to be repeated in any other filesystems (read: any new filesystem introducing non-text mount options will get NAKed even if it doesn't mess the layout up). IOW, not worth trying to grow an infrastructure that would avoid that use of in_compat_syscall()..." * 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove compat_sys_mount fs,nfs: lift compat nfs4 mount data handling into the nfs code nfs: simplify nfs4_parse_monolithic
2020-10-13Merge branch 'work.quota-compat' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull compat quotactl cleanups from Al Viro: "More Christoph's compat cleanups: quotactl(2)" * 'work.quota-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: quota: simplify the quotactl compat handling compat: add a compat_need_64bit_alignment_fixup() helper compat: lift compat_s64 and compat_u64 to <asm-generic/compat.h>
2020-10-13Merge branch 'work.iov_iter' of ↵Linus Torvalds3-10/+15
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull compat iovec cleanups from Al Viro: "Christoph's series around import_iovec() and compat variant thereof" * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: security/keys: remove compat_keyctl_instantiate_key_iov mm: remove compat_process_vm_{readv,writev} fs: remove compat_sys_vmsplice fs: remove the compat readv/writev syscalls fs: remove various compat readv/writev helpers iov_iter: transparently handle compat iovecs in import_iovec iov_iter: refactor rw_copy_check_uvector and import_iovec iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c compat.h: fix a spelling error in <linux/compat.h>
2020-10-13Merge tag 'x86-paravirt-2020-10-12' of ↵Linus Torvalds2-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 paravirt cleanup from Ingo Molnar: "Clean up the paravirt code after the removal of 32-bit Xen PV support" * tag 'x86-paravirt-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/paravirt: Avoid needless paravirt step clearing page table entries x86/paravirt: Remove set_pte_at() pv-op x86/entry/32: Simplify CONFIG_XEN_PV build dependency x86/paravirt: Use CONFIG_PARAVIRT_XXL instead of CONFIG_PARAVIRT x86/paravirt: Clean up paravirt macros x86/paravirt: Remove 32-bit support from CONFIG_PARAVIRT_XXL
2020-10-12Merge tag 'x86_cleanups_for_v5.10' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: "Misc minor cleanups" * tag 'x86_cleanups_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry: Fix typo in comments for syscall_enter_from_user_mode() x86/resctrl: Fix spelling in user-visible warning messages x86/entry/64: Do not include inst.h in calling.h x86/mpparse: Remove duplicate io_apic.h include
2020-10-12Merge tag 'x86_fsgsbase_for_v5.10' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fsgsbase updates from Borislav Petkov: "Misc minor cleanups and corrections to the fsgsbase code and respective selftests" * tag 'x86_fsgsbase_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: selftests/x86/fsgsbase: Test PTRACE_PEEKUSER for GSBASE with invalid LDT GS selftests/x86/fsgsbase: Reap a forgotten child x86/fsgsbase: Replace static_cpu_has() with boot_cpu_has() x86/entry/64: Correct the comment over SAVE_AND_SET_GSBASE
2020-10-09kbuild: explicitly specify the build id styleBill Wendling1-1/+1
ld's --build-id defaults to "sha1" style, while lld defaults to "fast". The build IDs are very different between the two, which may confuse programs that reference them. Signed-off-by: Bill Wendling <morbo@google.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-10-03mm: remove compat_process_vm_{readv,writev}Christoph Hellwig3-4/+6
Now that import_iovec handles compat iovecs, the native syscalls can be used for the compat case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-03fs: remove compat_sys_vmspliceChristoph Hellwig3-2/+3
Now that import_iovec handles compat iovecs, the native vmsplice syscall can be used for the compat case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-03fs: remove the compat readv/writev syscallsChristoph Hellwig3-4/+6
Now that import_iovec handles compat iovecs, the native readv and writev syscalls can be used for the compat case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-23fs: remove compat_sys_mountChristoph Hellwig1-1/+1
compat_sys_mount is identical to the regular sys_mount now, so remove it and use the native version everywhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-22x86/irq: Make run_on_irqstack_cond() typesafeThomas Gleixner2-1/+3
Sami reported that run_on_irqstack_cond() requires the caller to cast functions to mismatching types, which trips indirect call Control-Flow Integrity (CFI) in Clang. Instead of disabling CFI on that function, provide proper helpers for the three call variants. The actual ASM code stays the same as that is out of reach. [ bp: Fix __run_on_irqstack() prototype to match. ] Fixes: 931b94145981 ("x86/entry: Provide helpers for executing on the irqstack") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Reported-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Sami Tolvanen <samitolvanen@google.com> Cc: <stable@vger.kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1052 Link: https://lkml.kernel.org/r/87pn6eb5tv.fsf@nanos.tec.linutronix.de
2020-09-17quota: simplify the quotactl compat handlingChristoph Hellwig1-1/+1
Fold the misaligned u64 workarounds into the main quotactl flow instead of implementing a separate compat syscall handler. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-09x86/entry/64: Add entry code for #VC handlerJoerg Roedel1-0/+80
The #VC handler needs special entry code because: 1. It runs on an IST stack 2. It needs to be able to handle nested #VC exceptions To make this work, the entry code is implemented to pretend it doesn't use an IST stack. When entered from user-mode or early SYSCALL entry path it switches to the task stack. If entered from kernel-mode it tries to switch back to the previous stack in the IRET frame. The stack found in the IRET frame is validated first, and if it is not safe to use it for the #VC handler, the code will switch to a fall-back stack (the #VC2 IST stack). From there, it can cause nested exceptions again. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200907131613.12703-46-joro@8bytes.org
2020-09-04x86/entry: Unbreak 32bit fast syscallThomas Gleixner1-9/+20
Andy reported that the syscall treacing for 32bit fast syscall fails: # ./tools/testing/selftests/x86/ptrace_syscall_32 ... [RUN] SYSEMU [FAIL] Initial args are wrong (nr=224, args=10 11 12 13 14 4289172732) ... [RUN] SYSCALL [FAIL] Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732) The eason is that the conversion to generic entry code moved the retrieval of the sixth argument (EBP) after the point where the syscall entry work runs, i.e. ptrace, seccomp, audit... Unbreak it by providing a split up version of syscall_enter_from_user_mode(). - syscall_enter_from_user_mode_prepare() establishes state and enables interrupts - syscall_enter_from_user_mode_work() runs the entry work Replace the call to syscall_enter_from_user_mode() in the 32bit fast syscall C-entry with the split functions and stick the EBP retrieval between them. Fixes: 27d6b4d14f5c ("x86/entry: Use generic syscall entry function") Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/87k0xdjbtt.fsf@nanos.tec.linutronix.de
2020-09-04x86/entry/64: Do not include inst.h in calling.hUros Bizjak1-1/+0
inst.h was included in calling.h solely to instantiate the RDPID macro. The usage of RDPID was removed in 6a3ea3e68b8a ("x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM") so remove the include. Fixes: 6a3ea3e68b8a ("x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200827171735.93825-1-ubizjak@gmail.com
2020-08-26x86/entry: Remove unused THUNKsPeter Zijlstra1-5/+0
Unused remnants Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20200821085348.487040689@infradead.org
2020-08-24x86/entry/64: Correct the comment over SAVE_AND_SET_GSBASEBorislav Petkov1-2/+3
Add the proper explanation why an LFENCE is not needed in the FSGSBASE case. Fixes: c82965f9e530 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit") Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200821090710.GE12181@zn.tnic
2020-08-21x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVMSean Christopherson1-4/+6
KVM has an optmization to avoid expensive MRS read/writes on VMENTER/EXIT. It caches the MSR values and restores them either when leaving the run loop, on preemption or when going out to user space. The affected MSRs are not required for kernel context operations. This changed with the recently introduced mechanism to handle FSGSBASE in the paranoid entry code which has to retrieve the kernel GSBASE value by accessing per CPU memory. The mechanism needs to retrieve the CPU number and uses either LSL or RDPID if the processor supports it. Unfortunately RDPID uses MSR_TSC_AUX which is in the list of cached and lazily restored MSRs, which means between the point where the guest value is written and the point of restore, MSR_TSC_AUX contains a random number. If an NMI or any other exception which uses the paranoid entry path happens in such a context, then RDPID returns the random guest MSR_TSC_AUX value. As a consequence this reads from the wrong memory location to retrieve the kernel GSBASE value. Kernel GS is used to for all regular this_cpu_*() operations. If the GSBASE in the exception handler points to the per CPU memory of a different CPU then this has the obvious consequences of data corruption and crashes. As the paranoid entry path is the only place which accesses MSR_TSX_AUX (via RDPID) and the fallback via LSL is not significantly slower, remove the RDPID alternative from the entry path and always use LSL. The alternative would be to write MSR_TSC_AUX on every VMENTER and VMEXIT which would be inflicting massive overhead on that code path. [ tglx: Rewrote changelog ] Fixes: eaad981291ee3 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro") Reported-by: Tom Lendacky <thomas.lendacky@amd.com> Debugged-by: Tom Lendacky <thomas.lendacky@amd.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200821105229.18938-1-pbonzini@redhat.com
2020-08-15x86/paravirt: Use CONFIG_PARAVIRT_XXL instead of CONFIG_PARAVIRTJuergen Gross1-2/+2
There are some code parts using CONFIG_PARAVIRT for Xen pvops related issues instead of the more stringent CONFIG_PARAVIRT_XXL. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200815100641.26362-4-jgross@suse.com
2020-08-15x86/paravirt: Remove 32-bit support from CONFIG_PARAVIRT_XXLJuergen Gross1-0/+1
The last 32-bit user of stuff under CONFIG_PARAVIRT_XXL is gone. Remove 32-bit specific parts. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200815100641.26362-2-jgross@suse.com
2020-08-15all arch: remove system call sys_sysctlXiaoming Ni2-2/+2
Since commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"), sys_sysctl is actually unavailable: any input can only return an error. We have been warning about people using the sysctl system call for years and believe there are no more users. Even if there are users of this interface if they have not complained or fixed their code by now they probably are not going to, so there is no point in warning them any longer. So completely remove sys_sysctl on all architectures. [nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu] Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Will Deacon <will@kernel.org> [arm/arm64] Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bin Meng <bin.meng@windriver.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: chenzefeng <chenzefeng2@huawei.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christian Brauner <christian@brauner.io> Cc: Chris Zankel <chris@zankel.net> Cc: David Howells <dhowells@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Diego Elio Pettenò <flameeyes@flameeyes.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kars de Jong <jongk@linux-m68k.org> Cc: Kees Cook <keescook@chromium.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Olof Johansson <olof@lixom.net> Cc: Paul Burton <paulburton@kernel.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Sven Schnelle <svens@stackframe.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zhou Yanjie <zhouyanjie@wanyeetech.com> Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14Merge tag 'for-linus-5.9-rc1b-tag' of ↵Linus Torvalds2-137/+2
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull more xen updates from Juergen Gross: - Remove support for running as 32-bit Xen PV-guest. 32-bit PV guests are rarely used, are lacking security fixes for Meltdown, and can be easily replaced by PVH mode. Another series for doing more cleanup will follow soon (removal of 32-bit-only pvops functionality). - Fixes and additional features for the Xen display frontend driver. * tag 'for-linus-5.9-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: drm/xen-front: Pass dumb buffer data offset to the backend xen: Sync up with the canonical protocol definition in Xen drm/xen-front: Add YUYV to supported formats drm/xen-front: Fix misused IS_ERR_OR_NULL checks xen/gntdev: Fix dmabuf import with non-zero sgt offset x86/xen: drop tests for highmem in pv code x86/xen: eliminate xen-asm_64.S x86/xen: remove 32-bit Xen PV guest support
2020-08-11x86/xen: remove 32-bit Xen PV guest supportJuergen Gross2-137/+2
Xen is requiring 64-bit machines today and since Xen 4.14 it can be built without 32-bit PV guest support. There is no need to carry the burden of 32-bit PV guest support in the kernel any longer, as new guests can be either HVM or PVH, or they can use a 64 bit kernel. Remove the 32-bit Xen PV support from the kernel. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2020-08-10Merge tag 'kbuild-v5.9' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - run the checker (e.g. sparse) after the compiler - remove unneeded cc-option tests for old compiler flags - fix tar-pkg to install dtbs - introduce ccflags-remove-y and asflags-remove-y syntax - allow to trace functions in sub-directories of lib/ - introduce hostprogs-always-y and userprogs-always-y syntax - various Makefile cleanups * tag 'kbuild-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: stop filtering out $(GCC_PLUGINS_CFLAGS) from cc-option base kbuild: include scripts/Makefile.* only when relevant CONFIG is enabled kbuild: introduce hostprogs-always-y and userprogs-always-y kbuild: sort hostprogs before passing it to ifneq kbuild: move host .so build rules to scripts/gcc-plugins/Makefile kbuild: Replace HTTP links with HTTPS ones kbuild: trace functions in subdirectories of lib/ kbuild: introduce ccflags-remove-y and asflags-remove-y kbuild: do not export LDFLAGS_vmlinux kbuild: always create directories of targets powerpc/boot: add DTB to 'targets' kbuild: buildtar: add dtbs support kbuild: remove cc-option test of -ffreestanding kbuild: remove cc-option test of -fno-stack-protector Revert "kbuild: Create directory for target DTB" kbuild: run the checker after the compiler
2020-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds3-4/+11
Pull networking updates from David Miller: 1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan. 2) Support UDP segmentation in code TSO code, from Eric Dumazet. 3) Allow flashing different flash images in cxgb4 driver, from Vishal Kulkarni. 4) Add drop frames counter and flow status to tc flower offloading, from Po Liu. 5) Support n-tuple filters in cxgb4, from Vishal Kulkarni. 6) Various new indirect call avoidance, from Eric Dumazet and Brian Vazquez. 7) Fix BPF verifier failures on 32-bit pointer arithmetic, from Yonghong Song. 8) Support querying and setting hardware address of a port function via devlink, use this in mlx5, from Parav Pandit. 9) Support hw ipsec offload on bonding slaves, from Jarod Wilson. 10) Switch qca8k driver over to phylink, from Jonathan McDowell. 11) In bpftool, show list of processes holding BPF FD references to maps, programs, links, and btf objects. From Andrii Nakryiko. 12) Several conversions over to generic power management, from Vaibhav Gupta. 13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry Yakunin. 14) Various https url conversions, from Alexander A. Klimov. 15) Timestamping and PHC support for mscc PHY driver, from Antoine Tenart. 16) Support bpf iterating over tcp and udp sockets, from Yonghong Song. 17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov. 18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan. 19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several drivers. From Luc Van Oostenryck. 20) XDP support for xen-netfront, from Denis Kirjanov. 21) Support receive buffer autotuning in MPTCP, from Florian Westphal. 22) Support EF100 chip in sfc driver, from Edward Cree. 23) Add XDP support to mvpp2 driver, from Matteo Croce. 24) Support MPTCP in sock_diag, from Paolo Abeni. 25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic infrastructure, from Jakub Kicinski. 26) Several pci_ --> dma_ API conversions, from Christophe JAILLET. 27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel. 28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki. 29) Refactor a lot of networking socket option handling code in order to avoid set_fs() calls, from Christoph Hellwig. 30) Add rfc4884 support to icmp code, from Willem de Bruijn. 31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei. 32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin. 33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin. 34) Support TCP syncookies in MPTCP, from Flowian Westphal. 35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano Brivio. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits) net: thunderx: initialize VF's mailbox mutex before first usage usb: hso: remove bogus check for EINPROGRESS usb: hso: no complaint about kmalloc failure hso: fix bailout in error case of probe ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM selftests/net: relax cpu affinity requirement in msg_zerocopy test mptcp: be careful on subflow creation selftests: rtnetlink: make kci_test_encap() return sub-test result selftests: rtnetlink: correct the final return value for the test net: dsa: sja1105: use detected device id instead of DT one on mismatch tipc: set ub->ifindex for local ipv6 address ipv6: add ipv6_dev_find() net: openvswitch: silence suspicious RCU usage warning Revert "vxlan: fix tos value before xmit" ptp: only allow phase values lower than 1 period farsync: switch from 'pci_' to 'dma_' API wan: wanxl: switch from 'pci_' to 'dma_' API hv_netvsc: do not use VF device if link is down dpaa2-eth: Fix passing zero to 'PTR_ERR' warning net: macb: Properly handle phylink on at91sam9x ...
2020-08-05Merge tag 'x86-fsgsbase-2020-08-04' of ↵Linus Torvalds2-32/+147
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fsgsbase from Thomas Gleixner: "Support for FSGSBASE. Almost 5 years after the first RFC to support it, this has been brought into a shape which is maintainable and actually works. This final version was done by Sasha Levin who took it up after Intel dropped the ball. Sasha discovered that the SGX (sic!) offerings out there ship rogue kernel modules enabling FSGSBASE behind the kernels back which opens an instantanious unpriviledged root hole. The FSGSBASE instructions provide a considerable speedup of the context switch path and enable user space to write GSBASE without kernel interaction. This enablement requires careful handling of the exception entries which go through the paranoid entry path as they can no longer rely on the assumption that user GSBASE is positive (as enforced via prctl() on non FSGSBASE enabled systemn). All other entries (syscalls, interrupts and exceptions) can still just utilize SWAPGS unconditionally when the entry comes from user space. Converting these entries to use FSGSBASE has no benefit as SWAPGS is only marginally slower than WRGSBASE and locating and retrieving the kernel GSBASE value is not a free operation either. The real benefit of RD/WRGSBASE is the avoidance of the MSR reads and writes. The changes come with appropriate selftests and have held up in field testing against the (sanitized) Graphene-SGX driver" * tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) x86/fsgsbase: Fix Xen PV support x86/ptrace: Fix 32-bit PTRACE_SETREGS vs fsbase and gsbase selftests/x86/fsgsbase: Add a missing memory constraint selftests/x86/fsgsbase: Fix a comment in the ptrace_write_gsbase test selftests/x86: Add a syscall_arg_fault_64 test for negative GSBASE selftests/x86/fsgsbase: Test ptracer-induced GS base write with FSGSBASE selftests/x86/fsgsbase: Test GS selector on ptracer-induced GS base write Documentation/x86/64: Add documentation for GS/FS addressing mode x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2 x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit x86/entry/64: Introduce the FIND_PERCPU_BASE macro x86/entry/64: Switch CR3 before SWAPGS in paranoid entry x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation x86/process/64: Use FSGSBASE instructions on thread copy and ptrace x86/process/64: Use FSBSBASE in switch_to() if available x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASE x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE ...
2020-08-05Merge tag 'x86-entry-2020-08-04' of ↵Linus Torvalds3-597/+39
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 conversion to generic entry code from Thomas Gleixner: "The conversion of X86 syscall, interrupt and exception entry/exit handling to the generic code. Pretty much a straight-forward 1:1 conversion plus the consolidation of the KVM handling of pending work before entering guest mode" * tag 'x86-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu() x86/kvm: Use generic xfer to guest work function x86/entry: Cleanup idtentry_enter/exit x86/entry: Use generic interrupt entry/exit code x86/entry: Cleanup idtentry_entry/exit_user x86/entry: Use generic syscall exit functionality x86/entry: Use generic syscall entry function x86/ptrace: Provide pt_regs helper for entry/exit x86/entry: Move user return notifier out of loop x86/entry: Consolidate 32/64 bit syscall entry x86/entry: Consolidate check_user_regs() x86: Correct noinstr qualifiers x86/idtentry: Remove stale comment
2020-08-05Merge tag 'close-range-v5.9' of ↵Linus Torvalds2-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull close_range() implementation from Christian Brauner: "This adds the close_range() syscall. It allows to efficiently close a range of file descriptors up to all file descriptors of a calling task. This is coordinated with the FreeBSD folks which have copied our version of this syscall and in the meantime have already merged it in April 2019: https://reviews.freebsd.org/D21627 https://svnweb.freebsd.org/base?view=revision&revision=359836 The syscall originally came up in a discussion around the new mount API and making new file descriptor types cloexec by default. During this discussion, Al suggested the close_range() syscall. First, it helps to close all file descriptors of an exec()ing task. This can be done safely via (quoting Al's example from [1] verbatim): /* that exec is sensitive */ unshare(CLONE_FILES); /* we don't want anything past stderr here */ close_range(3, ~0U); execve(....); The code snippet above is one way of working around the problem that file descriptors are not cloexec by default. This is aggravated by the fact that we can't just switch them over without massively regressing userspace. For a whole class of programs having an in-kernel method of closing all file descriptors is very helpful (e.g. demons, service managers, programming language standard libraries, container managers etc.). Second, it allows userspace to avoid implementing closing all file descriptors by parsing through /proc/<pid>/fd/* and calling close() on each file descriptor and other hacks. From looking at various large(ish) userspace code bases this or similar patterns are very common in service managers, container runtimes, and programming language runtimes/standard libraries such as Python or Rust. In addition, the syscall will also work for tasks that do not have procfs mounted and on kernels that do not have procfs support compiled in. In such situations the only way to make sure that all file descriptors are closed is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery. Based on Linus' suggestion close_range() also comes with a new flag CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping right before exec. This would usually be expressed in the sequence: unshare(CLONE_FILES); close_range(3, ~0U); as pointed out by Linus it might be desirable to have this be a part of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which gets especially handy when we're closing all file descriptors above a certain threshold. Test-suite as always included" * tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: tests: add CLOSE_RANGE_UNSHARE tests close_range: add CLOSE_RANGE_UNSHARE tests: add close_range() tests arch: wire-up close_range() open: add close_range()
2020-08-05Merge tag 'threads-v5.9' of ↵Linus Torvalds1-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread updates from Christian Brauner: "This contains the changes to add the missing support for attaching to time namespaces via pidfds. Last cycle setns() was changed to support attaching to multiple namespaces atomically. This requires all namespaces to have a point of no return where they can't fail anymore. Specifically, <namespace-type>_install() is allowed to perform permission checks and install the namespace into the new struct nsset that it has been given but it is not allowed to make visible changes to the affected task. Once <namespace-type>_install() returns, anything that the given namespace type additionally requires to be setup needs to ideally be done in a function that can't fail or if it fails the failure must be non-fatal. For time namespaces the relevant functions that fell into this category were timens_set_vvar_page() and vdso_join_timens(). The latter could still fail although it didn't need to. This function is only implemented for vdso_join_timens() in current mainline. As discussed on-list (cf. [1]), in order to make setns() support time namespaces when attaching to multiple namespaces at once properly we changed vdso_join_timens() to always succeed. So vdso_join_timens() replaces the mmap_write_lock_killable() with mmap_read_lock(). Please note that arm is about to grow vdso support for time namespaces (possibly this merge window). We've synced on this change and arm64 also uses mmap_read_lock(), i.e. makes vdso_join_timens() a function that can't fail. Once the changes here and the arm64 changes have landed, vdso_join_timens() should be turned into a void function so it's obvious to callers and implementers on other architectures that the expectation is that it can't fail. We didn't do this right away because it would've introduced unnecessary merge conflicts between the two trees for no major gain. As always, tests included" [1]: https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein * tag 'threads-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: tests: add CLONE_NEWTIME setns tests nsproxy: support CLONE_NEWTIME with setns() timens: add timens_commit() helper timens: make vdso_join_timens() always succeed
2020-08-05Merge branch 'exec-linus' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull execve updates from Eric Biederman: "During the development of v5.7 I ran into bugs and quality of implementation issues related to exec that could not be easily fixed because of the way exec is implemented. So I have been diggin into exec and cleaning up what I can. This cycle I have been looking at different ideas and different implementations to see what is possible to improve exec, and cleaning the way exec interfaces with in kernel users. Only cleaning up the interfaces of exec with rest of the kernel has managed to stabalize and make it through review in time for v5.9-rc1 resulting in 2 sets of changes this cycle. - Implement kernel_execve - Make the user mode driver code a better citizen With kernel_execve the code size got a little larger as the copying of parameters from userspace and copying of parameters from userspace is now separate. The good news is kernel threads no longer need to play games with set_fs to use exec. Which when combined with the rest of Christophs set_fs changes should security bugs with set_fs much more difficult" * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits) exec: Implement kernel_execve exec: Factor bprm_stack_limits out of prepare_arg_pages exec: Factor bprm_execve out of do_execve_common exec: Move bprm_mm_init into alloc_bprm exec: Move initialization of bprm->filename into alloc_bprm exec: Factor out alloc_bprm exec: Remove unnecessary spaces from binfmts.h umd: Stop using split_argv umd: Remove exit_umh bpfilter: Take advantage of the facilities of struct pid exit: Factor thread_group_exited out of pidfd_poll umd: Track user space drivers with struct pid bpfilter: Move bpfilter_umh back into init data exec: Remove do_execve_file umh: Stop calling do_execve_file umd: Transform fork_usermode_blob into fork_usermode_driver umd: Rename umd_info.cmdline umd_info.driver_name umd: For clarity rename umh_info umd_info umh: Separate the user mode driver and the user mode helper support umh: Remove call_usermodehelper_setup_file. ...
2020-07-26Merge branch 'locking/nmi' into x86/entryIngo Molnar1-0/+34
Resolve conflicts with ongoing lockdep work that fixed the NMI entry code. Conflicts: arch/x86/entry/common.c arch/x86/include/asm/idtentry.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2-5/+13
The UDP reuseport conflict was a little bit tricky. The net-next code, via bpf-next, extracted the reuseport handling into a helper so that the BPF sk lookup code could invoke it. At the same time, the logic for reuseport handling of unconnected sockets changed via commit efc6b6f6c3113e8b203b9debfb72d81e0f3dcace which changed the logic to carry on the reuseport result into the rest of the lookup loop if we do not return immediately. This requires moving the reuseport_has_conns() logic into the callers. While we are here, get rid of inline directives as they do not belong in foo.c files. The other changes were cases of more straightforward overlapping modifications. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-25Merge tag 'v5.8-rc6' into locking/core, to pick up fixesIngo Molnar2-5/+13
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-24x86/entry: Cleanup idtentry_enter/exitThomas Gleixner1-3/+3
Remove the temporary defines and fixup all references. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20200722220520.855839271@linutronix.de
2020-07-24x86/entry: Use generic interrupt entry/exit codeThomas Gleixner1-166/+1
Replace the x86 code with the generic variant. Use temporary defines for idtentry_* which will be cleaned up in the next step. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200722220520.711492752@linutronix.de
2020-07-24x86/entry: Use generic syscall exit functionalityThomas Gleixner3-218/+7
Replace the x86 variant with the generic version. Provide the relevant architecture specific helper functions and defines. Use a temporary define for idtentry_exit_user which will be cleaned up seperately. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20200722220520.494648601@linutronix.de
2020-07-24x86/entry: Use generic syscall entry functionThomas Gleixner1-173/+8
Replace the syscall entry work handling with the generic version. Provide the necessary helper inlines to handle the real architecture specific parts, e.g. ptrace. Use a temporary define for idtentry_enter_user which will be cleaned up seperately. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20200722220520.376213694@linutronix.de
2020-07-24x86/entry: Move user return notifier out of loopThomas Gleixner1-4/+4
Guests and user space share certain MSRs. KVM sets these MSRs to guest values once and does not set them back to user space values on every VM exit to spare the costly MSR operations. User return notifiers ensure that these MSRs are set back to the correct values before returning to user space in exit_to_usermode_loop(). There is no reason to evaluate the TIF flag indicating that user return notifiers need to be invoked in the loop. The important point is that they are invoked before returning to user space. Move the invocation out of the loop into the section which does the last preperatory steps before returning to user space. That section is not preemptible and runs with interrupts disabled until the actual return. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200722220520.159112003@linutronix.de
2020-07-24x86/entry: Consolidate 32/64 bit syscall entryThomas Gleixner1-52/+41
64bit and 32bit entry code have the same open coded syscall entry handling after the bitwidth specific bits. Move it to a helper function and share the code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200722220520.051234096@linutronix.de
2020-07-24x86/entry: Consolidate check_user_regs()Thomas Gleixner1-15/+9
The user register sanity check is sprinkled all over the place. Move it into enter_from_user_mode(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20200722220519.943016204@linutronix.de
2020-07-24Merge branch 'core/entry' into x86/entryThomas Gleixner1-3/+11
Pick up generic entry code to migrate x86 over.
2020-07-21exec: Implement kernel_execveEric W. Biederman2-2/+2
To allow the kernel not to play games with set_fs to call exec implement kernel_execve. The function kernel_execve takes pointers into kernel memory and copies the values pointed to onto the new userspace stack. The calls with arguments from kernel space of do_execve are replaced with calls to kernel_execve. The calls do_execve and do_execveat are made static as there are now no callers outside of exec. The comments that mention do_execve are updated to refer to kernel_execve or execve depending on the circumstances. In addition to correcting the comments, this makes it easy to grep for do_execve and verify it is not used. Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-20net: remove compat_sys_{get,set}sockoptChristoph Hellwig3-4/+11
Now that the ->compat_{get,set}sockopt proto_ops methods are gone there is no good reason left to keep the compat syscalls separate. This fixes the odd use of unsigned int for the compat_setsockopt optlen and the missing sock_use_custom_sol_socket. It would also easily allow running the eBPF hooks for the compat syscalls, but such a large change in behavior does not belong into a consolidation patch like this one. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>