path: root/arch/x86/kernel
Age | Commit message | Author | Files, Lines
2020-06-29 | x86/ftrace: Make non direct case the default in ftrace_regs_caller | Steven Rostedt (VMware) | 1 file, -16/+15
If a direct function is hooked along with one of the ftrace registered functions, then ftrace_regs_caller is attached to the function that shares the direct hook as well as the ftrace hook. ftrace_regs_caller calls ftrace_ops_list_func(), which iterates through all the registered ftrace callbacks, and if there is a direct callback attached to that function, the direct ftrace_ops callback is called to tell ftrace_regs_caller to return to the direct caller instead of going back to the function that called it. But this is a very uncommon case; currently, the code treats it as the default case. Modify ftrace_regs_caller so that the default (non-jump) case just returns normally, and the jump handles the direct caller. Link: http://lkml.kernel.org/r/20200422162750.350373278@goodmis.org Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-29 | x86/fpu: Reset MXCSR to default in kernel_fpu_begin() | Petteri Aimonen | 1 file, -0/+6
Previously, kernel floating point code would run with the MXCSR control register value last set by userland code by the thread that was active on the CPU core just before the kernel call. This could affect calculation results if the rounding mode had been changed, or cause a crash if an FPU/SIMD exception had been unmasked. Restore MXCSR to the kernel's default value. [ bp: Carve out from a bigger patch by Petteri, add feature check, add FNINIT call too (amluto). ] Signed-off-by: Petteri Aimonen <jpa@git.mail.kapsi.fi> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=207979 Link: https://lkml.kernel.org/r/20200624114646.28953-2-bp@alien8.de
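A minimal sketch of the idea, with an assumed helper name and placement (the actual arch/x86/kernel/fpu change differs in detail): reset MXCSR to its architectural default and reinitialize the x87 state when kernel FPU use begins.

    /* Sketch only: 0x1f80 is the architectural MXCSR default (all SIMD
     * exceptions masked, round-to-nearest); the feature checks avoid
     * touching state the CPU does not have. */
    static inline void kernel_fpu_reset_control_state(void)
    {
            unsigned int mxcsr = 0x1f80;

            if (boot_cpu_has(X86_FEATURE_XMM))
                    asm volatile("ldmxcsr %0" : : "m" (mxcsr));
            if (boot_cpu_has(X86_FEATURE_FPU))
                    asm volatile("fninit");
    }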
2020-06-28 | Merge tag 'x86_urgent_for_5.8_rc3' of … | Linus Torvalds | 7 files, -20/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - AMD Memory bandwidth counter width fix, by Babu Moger. - Use the proper length type in the 32-bit truncate() syscall variant, by Jiri Slaby. - Reinit IA32_FEAT_CTL during wakeup to fix the case where after resume, VMXON would #GP due to VMX not being properly enabled, by Sean Christopherson. - Fix a static checker warning in the resctrl code, by Dan Carpenter. - Add a CR4 pinning mask for bits which cannot change after boot, by Kees Cook. - Align the start of the loop of __clear_user() to 16 bytes, to improve performance on AMD zen1 and zen2 microarchitectures, by Matt Fleming. * tag 'x86_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm/64: Align start of __clear_user() loop to 16-bytes x86/cpu: Use pinning mask for CR4 bits needing to be 0 x86/resctrl: Fix a NULL vs IS_ERR() static checker warning in rdt_cdp_peer_get() x86/cpu: Reinitialize IA32_FEAT_CTL MSR on BSP during wakeup syscalls: Fix offset type of ksys_ftruncate() x86/resctrl: Fix memory bandwidth counter width for AMD
2020-06-26 | docs: fix references for DMA*.txt files | Mauro Carvalho Chehab | 1 file, -1/+1
As we moved those files to core-api, fix references to point to their newer locations. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/37b2fd159fbc7655dbf33b3eb1215396a25f6344.1592895969.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-06-26 | Merge branch 'linus' into x86/entry, to resolve conflicts | Ingo Molnar | 8 files, -44/+30
Conflicts: arch/x86/kernel/traps.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-06-26 | x86: kill dump_fpu() | Al Viro | 1 file, -16/+0
dead since the removal of aout coredump support... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-26 | x86: copy_fpstate_to_sigframe(): have fpregs_soft_get() use kernel buffer | Al Viro | 1 file, -6/+6
... then copy_to_user() the results Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-25 | Merge drm/drm-next into drm-intel-next-queued | Jani Nikula | 106 files, -1812/+2402
Catch up with upstream, in particular to get c1e8d7c6a7a6 ("mmap locking API: convert mmap_sem comments"). Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2020-06-25 | x86/entry: Fix #UD vs WARN more | Peter Zijlstra | 1 file, -34/+38
vmlinux.o: warning: objtool: exc_invalid_op()+0x47: call to probe_kernel_read() leaves .noinstr.text section Since we use UD2 as a short-cut for 'CALL __WARN', treat it as such. Have the bare exception handler do the report_bug() thing. Fixes: 15a416e8aaa7 ("x86/entry: Treat BUG/WARN as NMI-like entries") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20200622114713.GE577403@hirez.programming.kicks-ass.net
2020-06-25 | x86/entry: Fixup bad_iret vs noinstr | Peter Zijlstra | 1 file, -3/+3
vmlinux.o: warning: objtool: fixup_bad_iret()+0x8e: call to memcpy() leaves .noinstr.text section Worse, with KASAN enabled there is no telling what memcpy() actually is. Force the use of __memcpy(), which is our assembly implementation. Reported-by: Marco Elver <elver@google.com> Suggested-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20200618144801.760070502@infradead.org
2020-06-25 | x86/msr: Filter MSR writes | Borislav Petkov | 1 file, -0/+69
Add functionality to disable writing to MSRs from userspace. Writes can still be allowed by supplying the allow_writes=on module parameter. The kernel will be tainted so that it shows in oopses. Having unfettered access to all MSRs on a system is and has always been a disaster waiting to happen. Think performance counter MSRs, MSRs with sticky or locked bits, MSRs making major system changes like loading microcode, MTRRs, PAT configuration, TSC counter, security mitigations MSRs, you name it. This also destroys all the kernel's caching of MSR values for performance, as the recent case with MSR_AMD64_LS_CFG showed. Another example is writing MSRs by mistake by simply typing the wrong MSR address. System freezes have been experienced that way. In general, poking at MSRs under the kernel's feet is a bad bad idea. So log writing to MSRs by default. Longer term, such writes will be disabled by default. If userspace still wants to do that, then proper interfaces should be defined which are under the kernel's control and accesses to those MSRs can be synchronized and sanitized properly. [ Fix sparse warnings. ] Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Sean Christopherson <sean.j.christopherson@intel.com> Link: https://lkml.kernel.org/r/20200612105026.GA22660@zn.tnic
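A hedged sketch of the mechanism (parameter and function names are illustrative, and the policy is simplified relative to the real msr.c code): any userspace MSR write taints the kernel, it is logged by default, and allow_writes=on silences the log.

    static bool allow_writes;               /* allow_writes=on silences the log */
    module_param(allow_writes, bool, 0644);

    static void filter_write(u32 reg)
    {
            /* Any userspace MSR write leaves a mark that shows up in oopses. */
            add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);

            if (allow_writes)
                    return;

            pr_err_ratelimited("Write to MSR 0x%x by %s - this is discouraged and will eventually be disabled\n",
                               reg, current->comm);
    }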
2020-06-23 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds | 1 file, -6/+0
Pull kvm fixes from Paolo Bonzini: "All bugfixes except for a couple cleanup patches" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: VMX: Remove vcpu_vmx's defunct copy of host_pkru KVM: x86: allow TSC to differ by NTP correction bounds without TSC scaling KVM: X86: Fix MSR range of APIC registers in X2APIC mode KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL KVM: nVMX: Plumb L2 GPA through to PML emulation KVM: x86/mmu: Avoid mixing gpa_t with gfn_t in walk_addr_generic() KVM: LAPIC: ensure APIC map is up to date on concurrent update requests kvm: lapic: fix broken vcpu hotplug Revert "KVM: VMX: Micro-optimize vmexit time when not exposing PMU" KVM: VMX: Add helpers to identify interrupt type from intr_info kvm/svm: disable KCSAN for svm_vcpu_run() KVM: MIPS: Fix a build error for !CPU_LOONGSON64
2020-06-23 | x86/mce, EDAC/mce_amd: Print PPIN in machine check records | Smita Koralahalli | 1 file, -0/+2
Print the Protected Processor Identification Number (PPIN) on processors which support it. [ bp: Massage. ] Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200623130059.8870-1-Smita.KoralahalliChannabasappa@amd.com
2020-06-23 | KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL | Sean Christopherson | 1 file, -6/+0
Remove support for context switching between the guest's and host's desired UMWAIT_CONTROL. Propagating the guest's value to hardware isn't required for correct functionality, e.g. KVM intercepts reads and writes to the MSR, and the latency effects of the settings controlled by the MSR are not architecturally visible. As a general rule, KVM should not allow the guest to control power management settings unless explicitly enabled by userspace, e.g. see KVM_CAP_X86_DISABLE_EXITS. E.g. Intel's SDM explicitly states that C0.2 can improve the performance of SMT siblings. A devious guest could disable C0.2 so as to improve the performance of their workloads at the detriment to workloads running in the host or on other VMs. Wholesale removal of UMWAIT_CONTROL context switching also fixes a race condition where updates from the host may cause KVM to enter the guest with the incorrect value. Because updates are are propagated to all CPUs via IPI (SMP function callback), the value in hardware may be stale with respect to the cached value and KVM could enter the guest with the wrong value in hardware. As above, the guest can't observe the bad value, but it's a weird and confusing wart in the implementation. Removal also fixes the unnecessary usage of VMX's atomic load/store MSR lists. Using the lists is only necessary for MSRs that are required for correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on old hardware, or for MSRs that need to-the-uop precision, e.g. perf related MSRs. For UMWAIT_CONTROL, the effects are only visible in the kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in vcpu_vmx_run(). Using the atomic lists is undesirable as they are more expensive than direct RDMSR/WRMSR. Furthermore, even if giving the guest control of the MSR is legitimate, e.g. in pass-through scenarios, it's not clear that the benefits would outweigh the overhead. E.g. saving and restoring an MSR across a VMX roundtrip costs ~250 cycles, and if the guest diverged from the host that cost would be paid on every run of the guest. In other words, if there is a legitimate use case then it should be enabled by a new per-VM capability. Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can correctly expose other WAITPKG features to the guest, e.g. TPAUSE, UMWAIT and UMONITOR. Fixes: 6e3ba4abcea56 ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Cc: stable@vger.kernel.org Cc: Jingqi Liu <jingqi.liu@intel.com> Cc: Tao Xu <tao3.xu@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623005135.10414-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22 | fork: fold legacy_clone_args_valid() into _do_fork() | Christian Brauner | 1 file, -3/+0
This separate helper only existed to guarantee the mutual exclusivity of CLONE_PIDFD and CLONE_PARENT_SETTID for legacy clone, since CLONE_PIDFD abuses the parent_tid field to return the pidfd. But we can actually handle this uniformly, thus removing the helper. For legacy clone we can detect that CLONE_PIDFD is specified in conjunction with CLONE_PARENT_SETTID because they will share the same memory, which is invalid, and for clone3() setting the separate pidfd and parent_tid fields to the same memory is bogus as well. So fold that helper directly into _do_fork() by detecting this case. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: linux-m68k@lists.linux-m68k.org Cc: x86@kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
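A sketch of the folded check inside _do_fork(), operating on the struct kernel_clone_args argument (close to what the description implies, not guaranteed to match the final diff line for line):

    /*
     * Legacy clone() returns the pidfd through parent_tid when CLONE_PIDFD
     * is set, so combining it with CLONE_PARENT_SETTID only makes sense if
     * the two pointers differ; clone3() passing the same address for both
     * fields is equally bogus.
     */
    if ((args->flags & (CLONE_PIDFD | CLONE_PARENT_SETTID)) ==
        (CLONE_PIDFD | CLONE_PARENT_SETTID) &&
        args->pidfd == args->parent_tid)
            return -EINVAL;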
2020-06-20 | Merge tag 'trace-v5.8-rc1' of … | Linus Torvalds | 1 file, -13/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Have recordmcount work with > 64K sections (to support LTO) - kprobe RCU fixes - Correct a kprobe critical section with missing mutex - Remove redundant arch_disarm_kprobe() call - Fix lockup when kretprobe triggers within kprobe_flush_task() - Fix memory leak in fetch_op_data operations - Fix sleep in atomic in ftrace trace array sample code - Free up memory on failure in sample trace array code - Fix incorrect reporting of function_graph fields in format file - Fix quote within quote parsing in bootconfig - Fix return value of bootconfig tool - Add testcases for bootconfig tool - Fix maybe uninitialized warning in ftrace pid file code - Remove unused variable in tracing_iter_reset() - Fix some typos * tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Fix maybe-uninitialized compiler warning tools/bootconfig: Add testcase for show-command and quotes test tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig tools/bootconfig: Fix to use correct quotes for value proc/bootconfig: Fix to use correct quotes for value tracing: Remove unused event variable in tracing_iter_reset tracing/probe: Fix memleak in fetch_op_data operations trace: Fix typo in allocate_ftrace_ops()'s comment tracing: Make ftrace packed events have align of 1 sample-trace-array: Remove trace_array 'sample-instance' sample-trace-array: Fix sleeping function called from invalid context kretprobe: Prevent triggering kretprobe from within kprobe_flush_task kprobes: Remove redundant arch_disarm_kprobe() call kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex kprobes: Use non RCU traversal APIs on kprobe_tables if possible kprobes: Suppress the suspicious RCU warning on kprobes recordmcount: support >64k sections
2020-06-20 | x86/idt: Make idt_descr static | Jason Andryuk | 1 file, -1/+1
Commit 3e77abda65b1 ("x86/idt: Consolidate idt functionality") states that idt_descr could be made static, but it did not actually make the change. Make it static now. Fixes: 3e77abda65b1 ("x86/idt: Consolidate idt functionality") Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200619205103.30873-1-jandryuk@gmail.com
2020-06-18 | maccess: make get_kernel_nofault() check for minimal type compatibility | Linus Torvalds | 1 file, -2/+2
Now that we've renamed probe_kernel_address() to get_kernel_nofault() and made it look and behave more in line with get_user(), some of the subtle type behavior differences end up being more obvious and possibly dangerous. When you do get_user(val, user_ptr); the type of the access comes from the "user_ptr" part, and the above basically acts as val = *user_ptr; by design (except, of course, for the fact that the actual dereference is done with a user access). Note how in the above case, the type of the end result comes from the pointer argument, and then the value is cast to the type of 'val' as part of the assignment. So the type of the pointer is ultimately the more important type for the access itself. 'get_kernel_nofault()' may now _look_ similar, but it behaves very differently. When you do get_kernel_nofault(val, kernel_ptr); it behaves like val = *(typeof(val) *)kernel_ptr; except, of course, for the fact that the actual dereference is done with exception handling so that a faulting access is suppressed and returned as the error code. But note how different the casting behavior of the two superficially similar accesses are: one does the actual access in the size of the type the pointer points to, while the other does the access in the size of the target, and ignores the pointer type entirely. Actually changing get_kernel_nofault() to act like get_user() is almost certainly the right thing to do eventually, but in the meantime this patch adds logic to at least verify that the pointer type is compatible with the type of the result. In many cases, this involves just casting the pointer to 'void *' to make it obvious that the type of the pointer is not the important part. It's not how 'get_user()' acts, but at least the behavioral difference is now obvious and explicit. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
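A simplified sketch of how such a type check can be expressed (the in-tree macro may differ): assigning the pointer to a typeof(val)-typed local makes the compiler warn about incompatible pointer types, while the access size stays sizeof(val).

    #define get_kernel_nofault(val, ptr) ({                                  \
            const typeof(val) *__gk_ptr = (ptr);    /* type compatibility check */ \
            copy_from_kernel_nofault(&(val), __gk_ptr, sizeof(val));         \
    })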
2020-06-18 | maccess: rename probe_kernel_address to get_kernel_nofault | Christoph Hellwig | 2 files, -11/+11
Better describe what this helper does, and match the naming of copy_from_kernel_nofault. Also switch the argument order around, so that it acts and looks like get_user(). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-18 | x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2 | Andi Kleen | 1 file, -1/+3
The kernel needs to explicitly enable FSGSBASE. So, the application needs to know if it can safely use these instructions. Just looking at the CPUID bit is not enough because it may be running in a kernel that does not enable the instructions. One way for the application would be to just try and catch the SIGILL. But that is difficult to do in libraries which may not want to overwrite the signal handlers of the main application. Enumerate the enabled FSGSBASE capability in bit 1 of AT_HWCAP2 in the ELF aux vector. AT_HWCAP2 is already used by PPC for similar purposes. The application can access it open coded or by using the getauxval() function in newer versions of glibc. [ tglx: Massaged changelog ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1557309753-24073-18-git-send-email-chang.seok.bae@intel.com Link: https://lkml.kernel.org/r/20200528201402.1708239-14-sashal@kernel.org
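From userspace the check looks roughly like this; HWCAP2_FSGSBASE is assumed to be the bit-1 definition described above, guarded for older uapi headers:

    #include <elf.h>
    #include <stdio.h>
    #include <sys/auxv.h>

    #ifndef HWCAP2_FSGSBASE
    #define HWCAP2_FSGSBASE (1 << 1)        /* bit 1 of AT_HWCAP2 */
    #endif

    int main(void)
    {
            unsigned long hwcap2 = getauxval(AT_HWCAP2);

            if (hwcap2 & HWCAP2_FSGSBASE)
                    puts("kernel enabled FSGSBASE: RD/WR{FS,GS}BASE are safe to use");
            else
                    puts("FSGSBASE not enabled by the kernel: do not use the instructions");
            return 0;
    }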
2020-06-18 | x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit | Andy Lutomirski | 1 file, -18/+14
Now that FSGSBASE is fully supported, remove unsafe_fsgsbase, enable FSGSBASE by default, and add nofsgsbase to disable it. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1557309753-24073-17-git-send-email-chang.seok.bae@intel.com Link: https://lkml.kernel.org/r/20200528201402.1708239-13-sashal@kernel.org
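The chicken bit is a kernel command line switch; a sketch of the handler shape, modeled on the usual __setup() pattern in cpu/common.c (treat the details as illustrative):

    /* "nofsgsbase" keeps CR4.FSGSBASE clear so the instructions keep faulting. */
    static int __init x86_nofsgsbase_setup(char *arg)
    {
            /* Require an exact match without trailing characters. */
            if (strlen(arg))
                    return 0;

            if (!boot_cpu_has(X86_FEATURE_FSGSBASE))
                    return 1;

            setup_clear_cpu_cap(X86_FEATURE_FSGSBASE);
            pr_info("FSGSBASE disabled via kernel command line\n");
            return 1;
    }
    __setup("nofsgsbase", x86_nofsgsbase_setup);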
2020-06-18 | x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation | Tony Luck | 1 file, -4/+2
Before enabling FSGSBASE the kernel could safely assume that the content of GS base was a user address. Thus any speculative access as the result of a mispredicted branch controlling the execution of SWAPGS would be to a user address. So systems with speculation-proof SMAP did not need to add additional LFENCE instructions to mitigate. With FSGSBASE enabled a hostile user can set GS base to a kernel address. So they can make the kernel speculatively access data they wish to leak via a side channel. This means that SMAP provides no protection. Add FSGSBASE as an additional condition to enable the fence-based SWAPGS mitigation. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200528201402.1708239-9-sashal@kernel.org
2020-06-18 | x86/process/64: Use FSGSBASE instructions on thread copy and ptrace | Chang S. Bae | 2 files, -6/+10
When FSGSBASE is enabled, copying threads and reading fsbase and gsbase using ptrace must read the actual values. When copying a thread, use save_fsgs() and copy the saved values. For ptrace, the bases must be read from memory regardless of the selector if FSGSBASE is enabled. [ tglx: Invoke __rdgsbase_inactive() with interrupts disabled ] [ luto: Massage changelog ] Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1557309753-24073-9-git-send-email-chang.seok.bae@intel.com Link: https://lkml.kernel.org/r/20200528201402.1708239-8-sashal@kernel.org
2020-06-18 | x86/process/64: Use FSGSBASE in switch_to() if available | Andy Lutomirski | 1 file, -6/+28
With the new FSGSBASE instructions, FSBASE and GSBASE can be efficiently read and written in __switch_to(). Use that capability to preserve the full state. This will enable user code to do whatever it wants with the new instructions without any kernel-induced gotchas. (There can still be architectural gotchas: movl %gs,%eax; movl %eax,%gs may change GSBASE if WRGSBASE was used, but users are expected to read the CPU manual before doing things like that.) This is a considerable speedup. It seems to save about 100 cycles per context switch compared to the baseline 4.6-rc1 behavior on a Skylake laptop. This is mostly due to avoiding the WRMSR operation. [ chang: 5~10% performance improvements were seen with a context switch benchmark that ran threads with different FS/GSBASE values (to the baseline 4.16). Minor edit on the changelog. ] [ tglx: Massage changelog ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1557309753-24073-8-git-send-email-chang.seok.bae@intel.com Link: https://lkml.kernel.org/r/20200528201402.1708239-6-sashal@kernel.org
2020-06-18 | x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASE | Thomas Gleixner | 1 file, -6/+9
save_fsgs_for_kvm() is invoked via vcpu_enter_guest() -> kvm_x86_ops.prepare_guest_switch(vcpu) -> vmx_prepare_switch_to_guest() -> save_fsgs_for_kvm() with preemption disabled, but interrupts enabled. The upcoming FSGSBASE based GS save needs interrupts to be disabled. This could be done in the helper function, but that function is also called from switch_to() which has interrupts disabled already. Disable interrupts inside save_fsgs_for_kvm() and rename the function to current_save_fsgs() so it can be invoked from other places. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200528201402.1708239-7-sashal@kernel.org
2020-06-18 | x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions | Chang S. Bae | 1 file, -0/+68
Add cpu feature conditional FSGSBASE access to the relevant helper functions. That allows to accelerate certain FS/GS base operations in subsequent changes. Note, that while possible, the user space entry/exit GSBASE operations are not going to use the new FSGSBASE instructions. The reason is that it would require additional storage for the user space value which adds more complexity to the low level code and experiments have shown marginal benefit. This may be revisited later but for now the SWAPGS based handling in the entry code is preserved except for the paranoid entry/exit code. To preserve the SWAPGS entry mechanism introduce __[rd|wr]gsbase_inactive() helpers. Note, for Xen PV, paravirt hooks can be added later as they might allow a very efficient but different implementation. [ tglx: Massaged changelog, convert it to noinstr and force inline native_swapgs() ] Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1557309753-24073-7-git-send-email-chang.seok.bae@intel.com Link: https://lkml.kernel.org/r/20200528201402.1708239-5-sashal@kernel.org
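For reference, the new-instruction accessors are thin asm wrappers; a simplified sketch (the in-tree helpers add noinstr handling and the SWAPGS dance for the inactive GS base mentioned above):

    static __always_inline unsigned long rdfsbase(void)
    {
            unsigned long fsbase;

            asm volatile("rdfsbase %0" : "=r" (fsbase) :: "memory");
            return fsbase;
    }

    static __always_inline void wrgsbase(unsigned long gsbase)
    {
            asm volatile("wrgsbase %0" :: "r" (gsbase) : "memory");
    }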
2020-06-18 | x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE | Andy Lutomirski | 1 file, -0/+24
This is temporary. It will allow the next few patches to be tested incrementally. Setting unsafe_fsgsbase is a root hole. Don't do it. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/1557309753-24073-4-git-send-email-chang.seok.bae@intel.com Link: https://lkml.kernel.org/r/20200528201402.1708239-3-sashal@kernel.org
2020-06-18 | x86/ptrace: Prevent ptrace from clearing the FS/GS selector | Chang S. Bae | 1 file, -15/+2
When a ptracer writes a ptracee's FS/GSBASE with a different value, the selector is also cleared. This behavior is not correct as the selector should be preserved. Update only the base value and leave the selector intact. To simplify the code further remove the conditional checking for the same value as this code is not performance critical. The only recognizable downside of this change is when the selector is already nonzero on write. The base will be reloaded according to the selector. But the case is highly unexpected in real usages. [ tglx: Massage changelog ] Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/9040CFCD-74BD-4C17-9A01-B9B713CF6B10@intel.com Link: https://lkml.kernel.org/r/20200528201402.1708239-2-sashal@kernel.org
2020-06-18 | x86/mce/dev-mcelog: Use struct_size() helper in kzalloc() | Gustavo A. R. Silva | 1 file, -1/+1
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle, and audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200617211734.GA9636@embeddedor
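A generic before/after of the helper (illustrative struct, not the mcelog one); struct_size() comes from include/linux/overflow.h and saturates on overflow, so the allocation fails instead of being silently undersized:

    #include <linux/overflow.h>
    #include <linux/slab.h>

    struct log {                    /* illustrative trailing flexible array */
            unsigned int len;
            char entry[];
    };

    static struct log *alloc_log(size_t count)
    {
            struct log *l;

            /* before: kzalloc(sizeof(*l) + count * sizeof(l->entry[0]), GFP_KERNEL) */
            l = kzalloc(struct_size(l, entry, count), GFP_KERNEL);
            return l;
    }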
2020-06-18 | x86/stackprotector: Pre-initialize canary for secondary CPUs | Brian Gerst | 1 file, -12/+2
The idle tasks created for each secondary CPU already have a random stack canary generated by fork(). Copy the canary to the percpu variable before starting the secondary CPU which removes the need to call boot_init_stack_canary(). Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200617225624.799335-1-brgerst@gmail.com
2020-06-18 | x86/cpu: Use pinning mask for CR4 bits needing to be 0 | Kees Cook | 1 file, -12/+12
The X86_CR4_FSGSBASE bit of CR4 should not change after boot[1]. Older kernels should enforce this bit to zero, and newer kernels need to enforce it depending on boot-time configuration (e.g. "nofsgsbase"). To support a pinned bit being either 1 or 0, use an explicit mask in combination with the expected pinned bit values. [1] https://lore.kernel.org/lkml/20200527103147.GI325280@hirez.programming.kicks-ass.net Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/202006082013.71E29A42@keescook
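A hedged sketch of mask-based pinning (variable names illustrative): the written value is compared against the expected pinned values instead of just OR-ing pinned bits back in, so a bit pinned to 0, such as CR4.FSGSBASE, is enforced as well.

    static const unsigned long cr4_pinned_mask =
            X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP | X86_CR4_FSGSBASE;
    static unsigned long cr4_pinned_bits;   /* expected values, captured once at boot */

    static void cr4_enforce_pinning(unsigned long *val)
    {
            unsigned long bits_changed = (*val ^ cr4_pinned_bits) & cr4_pinned_mask;

            if (!bits_changed)
                    return;

            /* Restore the expected value of every pinned bit, whether 1 or 0. */
            *val = (*val & ~cr4_pinned_mask) | cr4_pinned_bits;
            WARN_ONCE(1, "pinned CR4 bits changed: 0x%lx\n", bits_changed);
    }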
2020-06-17 | maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault | Christoph Hellwig | 6 files, -13/+15
Better describe what these functions do. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-17 | x86/resctrl: Fix a NULL vs IS_ERR() static checker warning in rdt_cdp_peer_get() | Dan Carpenter | 1 file, -0/+1
The callers don't expect *d_cdp to be set to an error pointer, they only check for NULL. This leads to a static checker warning: arch/x86/kernel/cpu/resctrl/rdtgroup.c:2648 __init_one_rdt_domain() warn: 'd_cdp' could be an error pointer This would not trigger a bug in this specific case because __init_one_rdt_domain() calls it with a valid domain that would not have a negative id and thus not trigger the return of the ERR_PTR(). If this was a negative domain id then the call to rdt_find_domain() in domain_add_cpu() would have returned the ERR_PTR() much earlier and the creation of the domain with an invalid id would have been prevented. Even though a bug is not triggered currently the right and safe thing to do is to set the pointer to NULL because that is what can be checked for when the caller is handling the CDP and non-CDP cases. Fixes: 52eb74339a62 ("x86/resctrl: Fix rdt_find_domain() return value and checks") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Acked-by: Fenghua Yu <fenghua.yu@intel.com> Link: https://lkml.kernel.org/r/20200602193611.GA190851@mwanda
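The general shape of the fix (illustrative code, not the exact resctrl diff): when callers only test the output pointer for NULL, don't let an ERR_PTR() escape through it.

    static int peer_domain_get(struct rdt_resource *r_cdp, int id,
                               struct rdt_domain **d_cdp_out)
    {
            struct rdt_domain *d = rdt_find_domain(r_cdp, id, NULL);

            if (IS_ERR_OR_NULL(d)) {
                    *d_cdp_out = NULL;      /* not the ERR_PTR() */
                    return -EINVAL;
            }

            *d_cdp_out = d;
            return 0;
    }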
2020-06-17 | kretprobe: Prevent triggering kretprobe from within kprobe_flush_task | Jiri Olsa | 1 file, -13/+3
Ziqian reported lockup when adding retprobe on _raw_spin_lock_irqsave. My test was also able to trigger lockdep output: ============================================ WARNING: possible recursive locking detected 5.6.0-rc6+ #6 Not tainted -------------------------------------------- sched-messaging/2767 is trying to acquire lock: ffffffff9a492798 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_hash_lock+0x52/0xa0 but task is already holding lock: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(kretprobe_table_locks[i].lock)); lock(&(kretprobe_table_locks[i].lock)); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by sched-messaging/2767: #0: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50 stack backtrace: CPU: 3 PID: 2767 Comm: sched-messaging Not tainted 5.6.0-rc6+ #6 Call Trace: dump_stack+0x96/0xe0 __lock_acquire.cold.57+0x173/0x2b7 ? native_queued_spin_lock_slowpath+0x42b/0x9e0 ? lockdep_hardirqs_on+0x590/0x590 ? __lock_acquire+0xf63/0x4030 lock_acquire+0x15a/0x3d0 ? kretprobe_hash_lock+0x52/0xa0 _raw_spin_lock_irqsave+0x36/0x70 ? kretprobe_hash_lock+0x52/0xa0 kretprobe_hash_lock+0x52/0xa0 trampoline_handler+0xf8/0x940 ? kprobe_fault_handler+0x380/0x380 ? find_held_lock+0x3a/0x1c0 kretprobe_trampoline+0x25/0x50 ? lock_acquired+0x392/0xbc0 ? _raw_spin_lock_irqsave+0x50/0x70 ? __get_valid_kprobe+0x1f0/0x1f0 ? _raw_spin_unlock_irqrestore+0x3b/0x40 ? finish_task_switch+0x4b9/0x6d0 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 The code within the kretprobe handler checks for probe reentrancy, so we won't trigger any _raw_spin_lock_irqsave probe in there. The problem is in outside kprobe_flush_task, where we call: kprobe_flush_task kretprobe_table_lock raw_spin_lock_irqsave _raw_spin_lock_irqsave where _raw_spin_lock_irqsave triggers the kretprobe and installs kretprobe_trampoline handler on _raw_spin_lock_irqsave return. The kretprobe_trampoline handler is then executed with already locked kretprobe_table_locks, and first thing it does is to lock kretprobe_table_locks ;-) the whole lockup path like: kprobe_flush_task kretprobe_table_lock raw_spin_lock_irqsave _raw_spin_lock_irqsave ---> probe triggered, kretprobe_trampoline installed ---> kretprobe_table_locks locked kretprobe_trampoline trampoline_handler kretprobe_hash_lock(current, &head, &flags); <--- deadlock Adding kprobe_busy_begin/end helpers that mark code with fake probe installed to prevent triggering of another kprobe within this code. Using these helpers in kprobe_flush_task, so the probe recursion protection check is hit and the probe is never set to prevent above lockup. Link: http://lkml.kernel.org/r/158927059835.27680.7011202830041561604.stgit@devnote2 Fixes: ef53d9c5e4da ("kprobes: improve kretprobe scalability with hashed locking") Cc: Ingo Molnar <mingo@kernel.org> Cc: "Gustavo A . R . Silva" <gustavoars@kernel.org> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: "Naveen N . 
Rao" <naveen.n.rao@linux.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Reported-by: "Ziqian SUN (Zamir)" <zsun@redhat.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-17 | x86/speculation: Merge one test in spectre_v2_user_select_mitigation() | Borislav Petkov | 1 file, -9/+4
Merge the test whether the CPU supports STIBP into the test which determines whether STIBP is required. Thus try to simplify what is already an insane logic. Remove a superfluous newline in a comment, while at it. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Anthony Steinhauser <asteinhauser@google.com> Link: https://lkml.kernel.org/r/20200615065806.GB14668@zn.tnic
2020-06-16 | x86/alternatives: Add pr_fmt() to debug macros | Borislav Petkov | 1 file, -2/+2
... in order to have debug output prefixed with the pr_fmt text "SMP alternatives:" which allows easy grepping: $ dmesg | grep "SMP alternatives" [ 0.167783] SMP alternatives: alt table ffffffff8272c780, -> ffffffff8272fd6e [ 0.168620] SMP alternatives: feat: 3*32+16, old: (x86_64_start_kernel+0x37/0x73 \ (ffffffff826093f7) len: 5), repl: (ffffffff8272fd6e, len: 5), pad: 0 [ 0.170103] SMP alternatives: ffffffff826093f7: old_insn: e8 54 a8 da fe [ 0.171184] SMP alternatives: ffffffff8272fd6e: rpl_insn: e8 cd 3e c8 fe ... Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200615175315.17301-1-bp@alien8.de
2020-06-15 | x86/cpu: Reinitialize IA32_FEAT_CTL MSR on BSP during wakeup | Sean Christopherson | 3 files, -4/+2
Reinitialize IA32_FEAT_CTL on the BSP during wakeup to handle the case where firmware doesn't initialize or save/restore across S3. This fixes a bug where IA32_FEAT_CTL is left uninitialized and results in VMXON taking a #GP due to VMX not being fully enabled, i.e. breaks KVM. Use init_ia32_feat_ctl() to "restore" IA32_FEAT_CTL as it already deals with the case where the MSR is locked, and because APs already redo init_ia32_feat_ctl() during suspend by virtue of the SMP boot flow being used to reinitialize APs upon wakeup. Do the call in the early wakeup flow to avoid dependencies in the syscore_ops chain, e.g. simply adding a resume hook is not guaranteed to work, as KVM does VMXON in its own resume hook, kvm_resume(), when KVM has active guests. Fixes: 21bd3467a58e ("KVM: VMX: Drop initialization of IA32_FEAT_CTL MSR") Reported-by: Brad Campbell <lists2009@fnarfbargle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Brad Campbell <lists2009@fnarfbargle.com> Cc: stable@vger.kernel.org # v5.6 Link: https://lkml.kernel.org/r/20200608174134.11157-1-sean.j.christopherson@intel.com
2020-06-15 | x86/entry, cpumask: Provide non-instrumented variant of cpu_is_offline() | Peter Zijlstra | 2 files, -2/+2
vmlinux.o: warning: objtool: exc_nmi()+0x12: call to cpumask_test_cpu.constprop.0() leaves .noinstr.text section vmlinux.o: warning: objtool: mce_check_crashing_cpu()+0x12: call to cpumask_test_cpu.constprop.0() leaves .noinstr.text section The offending call chain is cpumask_test_cpu() -> test_bit() -> instrument_atomic_read() -> arch_test_bit(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-06-15 | x86, sched: Bail out of frequency invariance if turbo_freq/base_freq gives 0 | Giovanni Gherdovich | 1 file, -2/+9
Be defensive against the case where the processor reports a base_freq larger than turbo_freq (the ratio would be zero). Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance") Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/20200531182453.15254-4-ggherdovich@suse.cz
2020-06-15 | x86, sched: Bail out of frequency invariance if turbo frequency is unknown | Giovanni Gherdovich | 1 file, -2/+4
There may be CPUs that support turbo boost but don't declare any turbo ratio, i.e. their MSR_TURBO_RATIO_LIMIT is all zeroes. In that condition scale-invariant calculations can't be performed. Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance") Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Link: https://lkml.kernel.org/r/20200531182453.15254-3-ggherdovich@suse.cz
2020-06-15 | x86, sched: check for counters overflow in frequency invariant accounting | Giovanni Gherdovich | 1 file, -5/+28
The product mcnt * arch_max_freq_ratio can overflow u64. For context, a large value for arch_max_freq_ratio would be 5000, corresponding to a turbo_freq/base_freq ratio of 5 (normally it's more like 1500-2000). A large increment frequency for the MPERF counter would be 5GHz (the base clock of all CPUs on the market today is less than that). With these figures, a CPU would need to go without a scheduler tick for around 8 days for the u64 overflow to happen. It is unlikely, but the check is warranted. Under similar conditions, the difference acnt of two consecutive APERF readings can overflow as well. In these circumstances it is appropriate to disable frequency invariant accounting: the feature relies on measurements of the clock frequency done at every scheduler tick, which need to be "fresh" to be at all meaningful. A note on i386: prior to version 5.1, the GCC compiler didn't have the builtin function __builtin_mul_overflow. In these GCC versions the macro check_mul_overflow needs __udivdi3() to do (u64)a/b, which the kernel doesn't provide. For this reason this change fails to build on i386 if GCC<5.1, and we protect the entire frequency invariant code behind CONFIG_X86_64 (special thanks to "kbuild test robot" <lkp@intel.com>). Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance") Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/20200531182453.15254-2-ggherdovich@suse.cz
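A sketch of the guards using the helpers from include/linux/overflow.h (the function shape and the shift applied to acnt are illustrative of the tick-time math, not a copy of it):

    static bool freq_invariance_deltas_ok(u64 acnt, u64 mcnt, u64 max_freq_ratio,
                                          u64 *scaled_acnt, u64 *scaled_mcnt)
    {
            /* acnt is pre-scaled for the later division; both steps can overflow. */
            if (check_shl_overflow(acnt, 2 * SCHED_CAPACITY_SHIFT, scaled_acnt))
                    return false;
            if (check_mul_overflow(mcnt, max_freq_ratio, scaled_mcnt))
                    return false;

            return true;    /* caller disables frequency invariance on false */
    }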
2020-06-15 | perf/x86: Add perf text poke events for kprobes | Adrian Hunter | 2 files, -6/+47
Add perf text poke events for kprobes. That includes: - the replaced instruction(s) which are executed out-of-line i.e. arch_copy_kprobe() and arch_remove_kprobe() - the INT3 that activates the kprobe i.e. arch_arm_kprobe() and arch_disarm_kprobe() - optimised kprobe function i.e. arch_prepare_optimized_kprobe() and __arch_remove_optimized_kprobe() - optimised kprobe i.e. arch_optimize_kprobes() and arch_unoptimize_kprobe() Resulting in 8 possible text_poke events: 0: NULL -> probe.ainsn.insn (if ainsn.boostable && !kp.post_handler) arch_copy_kprobe() 1: old0 -> INT3 arch_arm_kprobe() // boosted kprobe active 2: NULL -> optprobe_trampoline arch_prepare_optimized_kprobe() 3: INT3,old1,old2,old3,old4 -> JMP32 arch_optimize_kprobes() // optprobe active 4: JMP32 -> INT3,old1,old2,old3,old4 // optprobe disabled and kprobe active (this sometimes goes back to 3) arch_unoptimize_kprobe() 5: optprobe_trampoline -> NULL arch_remove_optimized_kprobe() // boosted kprobe active 6: INT3 -> old0 arch_disarm_kprobe() 7: probe.ainsn.insn -> NULL (if ainsn.boostable && !kp.post_handler) arch_remove_kprobe() Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20200512121922.8997-6-adrian.hunter@intel.com
2020-06-15 | perf/x86: Add support for perf text poke event for text_poke_bp_batch() callers | Adrian Hunter | 1 file, -1/+36
Add support for perf text poke event for text_poke_bp_batch() callers. That includes jump labels. See comments for more details. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200512121922.8997-3-adrian.hunter@intel.com
2020-06-15 | KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery | Vitaly Kuznetsov | 1 file, -15/+33
KVM now supports using interrupt for 'page ready' APF event delivery and legacy mechanism was deprecated. Switch KVM guests to the new one. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-9-vkuznets@redhat.com> [Use HYPERVISOR_CALLBACK_VECTOR instead of a separate vector. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-15 | x86/mce/inject: Fix a wrong assignment of i_mce.status | Zhenzhong Duan | 1 file, -1/+1
The original code is a nop as i_mce.status is or'ed with part of itself, fix it. Fixes: a1300e505297 ("x86/ras/mce_amd_inj: Trigger deferred and thresholding errors interrupts") Signed-off-by: Zhenzhong Duan <zhenzhong.duan@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lkml.kernel.org/r/20200611023238.3830-1-zhenzhong.duan@gmail.com
2020-06-15 | x86/microcode: Do not select FW_LOADER | Herbert Xu | 1 file, -2/+0
The x86 microcode support works just fine without FW_LOADER. In fact, these days most people load microcode early during boot so FW_LOADER never gets into the picture anyway. As almost everyone on x86 needs to enable MICROCODE, this by extension means that FW_LOADER is always built into the kernel even if nothing uses it. The FW_LOADER system is about two thousand lines long and contains user-space facing interfaces that could potentially provide an entry point into the kernel (or beyond). Remove the unnecessary select of FW_LOADER by MICROCODE. People who need the FW_LOADER capability can still enable it. [ bp: Massage a bit. ] Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200610042911.GA20058@gondor.apana.org.au
2020-06-15 | x86/resctrl: Fix memory bandwidth counter width for AMD | Babu Moger | 2 files, -4/+5
Memory bandwidth is calculated reading the monitoring counter at two intervals and calculating the delta. It is the software’s responsibility to read the count often enough to avoid having the count roll over _twice_ between reads. The current code hardcodes the bandwidth monitoring counter's width to 24 bits for AMD. This is due to default base counter width which is 24. Currently, AMD does not implement the CPUID 0xF.[ECX=1]:EAX to adjust the counter width. But, the AMD hardware supports much wider bandwidth counter with the default width of 44 bits. Kernel reads these monitoring counters every 1 second and adjusts the counter value for overflow. With 24 bits and scale value of 64 for AMD, it can only measure up to 1GB/s without overflowing. For the rates above 1GB/s this will fail to measure the bandwidth. Fix the issue setting the default width to 44 bits by adjusting the offset. AMD future products will implement CPUID 0xF.[ECX=1]:EAX. [ bp: Let the line stick out and drop {}-brackets around a single statement. ] Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature") Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/159129975546.62538.5656031125604254041.stgit@naples-babu.amd.com
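The arithmetic behind the 1 GB/s limit, as a tiny standalone illustration (64 bytes per counter increment, one read per second):

    #include <stdio.h>

    int main(void)
    {
            const unsigned long long scale = 64;    /* bytes per count */
            const int widths[] = { 24, 44 };

            for (int i = 0; i < 2; i++) {
                    unsigned long long wrap = (1ULL << widths[i]) * scale;
                    printf("%d-bit counter wraps after %llu MiB -> ~%llu MB/s max with 1s sampling\n",
                           widths[i], wrap >> 20, wrap >> 20);
            }
            return 0;
    }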
2020-06-13 | Merge tag 'ras-core-2020-06-12' of … | Linus Torvalds | 6 files, -166/+198
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Thomas Gleixner: "RAS updates from Borislav Petkov: - Unmap a whole guest page if an MCE is encountered in it to avoid follow-on MCEs leading to the guest crashing, by Tony Luck. This change collided with the entry changes and the merge resolution would have been rather unpleasant. To avoid that the entry branch was merged in before applying this. The resulting code did not change over the rebase. - AMD MCE error thresholding machinery cleanup and hotplug sanitization, by Thomas Gleixner. - Change the MCE notifiers to denote whether they have handled the error and not break the chain early by returning NOTIFY_STOP, thus giving the opportunity for the later handlers in the chain to see it. By Tony Luck. - Add AMD family 0x17, models 0x60-6f support, by Alexander Monakov. - Last but not least, the usual round of fixes and improvements" * tag 'ras-core-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/mce/dev-mcelog: Fix -Wstringop-truncation warning about strncpy() x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned EDAC/amd64: Add AMD family 17h model 60h PCI IDs hwmon: (k10temp) Add AMD family 17h model 60h PCI match x86/amd_nb: Add AMD family 17h model 60h PCI IDs x86/mcelog: Add compat_ioctl for 32-bit mcelog support x86/mce: Drop bogus comment about mce.kflags x86/mce: Fixup exception only for the correct MCEs EDAC: Drop the EDAC report status checks x86/mce: Add mce=print_all option x86/mce: Change default MCE logger to check mce->kflags x86/mce: Fix all mce notifiers to update the mce->kflags bitmask x86/mce: Add a struct mce.kflags field x86/mce: Convert the CEC to use the MCE notifier x86/mce: Rename "first" function as "early" x86/mce/amd, edac: Remove report_gart_errors x86/mce/amd: Make threshold bank setting hotplug robust x86/mce/amd: Cleanup threshold device remove path x86/mce/amd: Straighten CPU hotplug path x86/mce/amd: Sanitize thresholding device creation hotplug path ...
2020-06-13 | Merge tag 'x86-entry-2020-06-12' of … | Linus Torvalds | 35 files, -615/+797
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry updates from Thomas Gleixner: "The x86 entry, exception and interrupt code rework This all started about 6 month ago with the attempt to move the Posix CPU timer heavy lifting out of the timer interrupt code and just have lockless quick checks in that code path. Trivial 5 patches. This unearthed an inconsistency in the KVM handling of task work and the review requested to move all of this into generic code so other architectures can share. Valid request and solved with another 25 patches but those unearthed inconsistencies vs. RCU and instrumentation. Digging into this made it obvious that there are quite some inconsistencies vs. instrumentation in general. The int3 text poke handling in particular was completely unprotected and with the batched update of trace events even more likely to expose to endless int3 recursion. In parallel the RCU implications of instrumenting fragile entry code came up in several discussions. The conclusion of the x86 maintainer team was to go all the way and make the protection against any form of instrumentation of fragile and dangerous code pathes enforcable and verifiable by tooling. A first batch of preparatory work hit mainline with commit d5f744f9a2ac ("Pull x86 entry code updates from Thomas Gleixner") That (almost) full solution introduced a new code section '.noinstr.text' into which all code which needs to be protected from instrumentation of all sorts goes into. Any call into instrumentable code out of this section has to be annotated. objtool has support to validate this. Kprobes now excludes this section fully which also prevents BPF from fiddling with it and all 'noinstr' annotated functions also keep ftrace off. The section, kprobes and objtool changes are already merged. The major changes coming with this are: - Preparatory cleanups - Annotating of relevant functions to move them into the noinstr.text section or enforcing inlining by marking them __always_inline so the compiler cannot misplace or instrument them. - Splitting and simplifying the idtentry macro maze so that it is now clearly separated into simple exception entries and the more interesting ones which use interrupt stacks and have the paranoid handling vs. CR3 and GS. - Move quite some of the low level ASM functionality into C code: - enter_from and exit to user space handling. The ASM code now calls into C after doing the really necessary ASM handling and the return path goes back out without bells and whistels in ASM. - exception entry/exit got the equivivalent treatment - move all IRQ tracepoints from ASM to C so they can be placed as appropriate which is especially important for the int3 recursion issue. - Consolidate the declaration and definition of entry points between 32 and 64 bit. They share a common header and macros now. - Remove the extra device interrupt entry maze and just use the regular exception entry code. - All ASM entry points except NMI are now generated from the shared header file and the corresponding macros in the 32 and 64 bit entry ASM. - The C code entry points are consolidated as well with the help of DEFINE_IDTENTRY*() macros. This allows to ensure at one central point that all corresponding entry points share the same semantics. The actual function body for most entry points is in an instrumentable and sane state. There are special macros for the more sensitive entry points, e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF. 
They allow to put the whole entry instrumentation and RCU handling into safe places instead of the previous pray that it is correct approach. - The INT3 text poke handling is now completely isolated and the recursion issue banned. Aside of the entry rework this required other isolation work, e.g. the ability to force inline bsearch. - Prevent #DB on fragile entry code, entry relevant memory and disable it on NMI, #MC entry, which allowed to get rid of the nested #DB IST stack shifting hackery. - A few other cleanups and enhancements which have been made possible through this and already merged changes, e.g. consolidating and further restricting the IDT code so the IDT table becomes RO after init which removes yet another popular attack vector - About 680 lines of ASM maze are gone. There are a few open issues: - An escape out of the noinstr section in the MCE handler which needs some more thought but under the aspect that MCE is a complete trainwreck by design and the propability to survive it is low, this was not high on the priority list. - Paravirtualization When PV is enabled then objtool complains about a bunch of indirect calls out of the noinstr section. There are a few straight forward ways to fix this, but the other issues vs. general correctness were more pressing than parawitz. - KVM KVM is inconsistent as well. Patches have been posted, but they have not yet been commented on or picked up by the KVM folks. - IDLE Pretty much the same problems can be found in the low level idle code especially the parts where RCU stopped watching. This was beyond the scope of the more obvious and exposable problems and is on the todo list. The lesson learned from this brain melting exercise to morph the evolved code base into something which can be validated and understood is that once again the violation of the most important engineering principle "correctness first" has caused quite a few people to spend valuable time on problems which could have been avoided in the first place. The "features first" tinkering mindset really has to stop. With that I want to say thanks to everyone involved in contributing to this effort. Special thanks go to the following people (alphabetical order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai Jiangshan, Macro Elver, Paolo Bonzin,i Paul McKenney, Peter Zijlstra, Vitaly Kuznetsov, and Will Deacon" * tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits) x86/entry: Force rcu_irq_enter() when in idle task x86/entry: Make NMI use IDTENTRY_RAW x86/entry: Treat BUG/WARN as NMI-like entries x86/entry: Unbreak __irqentry_text_start/end magic x86/entry: __always_inline CR2 for noinstr lockdep: __always_inline more for noinstr x86/entry: Re-order #DB handler to avoid *SAN instrumentation x86/entry: __always_inline arch_atomic_* for noinstr x86/entry: __always_inline irqflags for noinstr x86/entry: __always_inline debugreg for noinstr x86/idt: Consolidate idt functionality x86/idt: Cleanup trap_init() x86/idt: Use proper constants for table size x86/idt: Add comments about early #PF handling x86/idt: Mark init only functions __init x86/entry: Rename trace_hardirqs_off_prepare() x86/entry: Clarify irq_{enter,exit}_rcu() x86/entry: Remove DBn stacks x86/entry: Remove debug IDT frobbing x86/entry: Optimize local_db_save() for virt ...
2020-06-12 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds | 1 file, -1/+0
Pull more KVM updates from Paolo Bonzini: "The guest side of the asynchronous page fault work has been delayed to 5.9 in order to sync with Thomas's interrupt entry rework, but here's the rest of the KVM updates for this merge window. MIPS: - Loongson port PPC: - Fixes ARM: - Fixes x86: - KVM_SET_USER_MEMORY_REGION optimizations - Fixes - Selftest fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (62 commits) KVM: x86: do not pass poisoned hva to __kvm_set_memory_region KVM: selftests: fix sync_with_host() in smm_test KVM: async_pf: Inject 'page ready' event only if 'page not present' was previously injected KVM: async_pf: Cleanup kvm_setup_async_pf() kvm: i8254: remove redundant assignment to pointer s KVM: x86: respect singlestep when emulating instruction KVM: selftests: Don't probe KVM_CAP_HYPERV_ENLIGHTENED_VMCS when nested VMX is unsupported KVM: selftests: do not substitute SVM/VMX check with KVM_CAP_NESTED_STATE check KVM: nVMX: Consult only the "basic" exit reason when routing nested exit KVM: arm64: Move hyp_symbol_addr() to kvm_asm.h KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts KVM: arm64: Remove host_cpu_context member from vcpu structure KVM: arm64: Stop sparse from moaning at __hyp_this_cpu_ptr KVM: arm64: Handle PtrAuth traps early KVM: x86: Unexport x86_fpu_cache and make it static KVM: selftests: Ignore KVM 5-level paging support for VM_MODE_PXXV48_4K KVM: arm64: Save the host's PtrAuth keys in non-preemptible context KVM: arm64: Stop save/restoring ACTLR_EL1 KVM: arm64: Add emulation for 32bit guests accessing ACTLR2 ...