path: root/arch/x86
Age  Commit message  (Author, Files, Lines)
2022-03-16  x86/traps: Mark do_int3() NOKPROBE_SYMBOL  (Li Huafei, 1 file, -0/+1)
commit a365a65f9ca1ceb9cf1ac29db4a4f51df7c507ad upstream. Since kprobe_int3_handler() is called in do_int3(), probing do_int3() can cause a breakpoint recursion and crash the kernel. Therefore, do_int3() should be marked as NOKPROBE_SYMBOL. Fixes: 21e28290b317 ("x86/traps: Split int3 handler up") Signed-off-by: Li Huafei <lihuafei1@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220310120915.63349-1-lihuafei1@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
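As a hedged illustration of the annotation this fix adds (the handler body and signature below are simplified, not copied from the patch), NOKPROBE_SYMBOL() is the standard way to keep kprobes out of code the kprobe machinery itself depends on:

```c
#include <linux/kprobes.h>
#include <linux/ptrace.h>

/*
 * Simplified sketch, not the actual traps.c code: the int3 handler calls
 * into the kprobe machinery (kprobe_int3_handler()), so it must never be
 * probed itself or a probe hit would recurse into the handler.
 * NOKPROBE_SYMBOL() adds the function to the blacklist consulted by
 * register_kprobe().
 */
static bool do_int3(struct pt_regs *regs)
{
	/* ... #BP handling, including kprobe_int3_handler(regs) ... */
	return false;
}
NOKPROBE_SYMBOL(do_int3);
```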
2022-03-16  x86/sgx: Free backing memory after faulting the enclave page  (Jarkko Sakkinen, 1 file, -9/+48)
commit 08999b2489b4c9b939d7483dbd03702ee4576d96 upstream. There is a limited amount of SGX memory (EPC) on each system. When that memory is used up, SGX has its own swapping mechanism which is similar in concept but totally separate from the core mm/* code. Instead of swapping to disk, SGX swaps from EPC to normal RAM. That normal RAM comes from a shared memory pseudo-file and can itself be swapped by the core mm code. There is a hierarchy like this: EPC <-> shmem <-> disk After data is swapped back in from shmem to EPC, the shmem backing storage needs to be freed. Currently, the backing shmem is not freed. This effectively wastes the shmem while the enclave is running. The memory is recovered when the enclave is destroyed and the backing storage freed. Sort this out by freeing memory with shmem_truncate_range(), as soon as a page is faulted back to the EPC. In addition, free the memory for PCMD pages as soon as all PCMD's in a page have been marked as unused by zeroing its contents. Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220303223859.273187-1-jarkko@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16  x86/module: Fix the paravirt vs alternative order  (Peter Zijlstra, 1 file, -5/+8)
commit 5adf349439d29f92467e864f728dfc23180f3ef9 upstream. Ever since commit 4e6292114c74 ("x86/paravirt: Add new features for paravirt patching") there is an ordering dependency between patching paravirt ops and patching alternatives, the module loader still violates this. Fixes: 4e6292114c74 ("x86/paravirt: Add new features for paravirt patching") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220303112825.068773913@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16  x86/boot: Add setup_indirect support in early_memremap_is_setup_data()  (Ross Philipson, 1 file, -2/+31)
commit 445c1470b6ef96440e7cfc42dfc160f5004fd149 upstream. The x86 boot documentation describes the setup_indirect structures and how they are used. Only one of the two functions in ioremap.c that needed to be modified to be aware of the introduction of setup_indirect functionality was updated. Adds comparable support to the other function where it was missing. Fixes: b3c72fc9a78e ("x86/boot: Introduce setup_indirect") Signed-off-by: Ross Philipson <ross.philipson@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/1645668456-22036-3-git-send-email-ross.philipson@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16  x86/boot: Fix memremap of setup_indirect structures  (Ross Philipson, 5 files, -47/+166)
commit 7228918b34615ef6317edcd9a058a057bc54aa32 upstream. As documented, the setup_indirect structure is nested inside the setup_data structures in the setup_data list. The code currently accesses the fields inside the setup_indirect structure but only the sizeof(struct setup_data) is being memremapped. No crash occurred but this is just due to how the area is remapped under the covers. Properly memremap both the setup_data and setup_indirect structures in these cases before accessing them. Fixes: b3c72fc9a78e ("x86/boot: Introduce setup_indirect") Signed-off-by: Ross Philipson <ross.philipson@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/1645668456-22036-2-git-send-email-ross.philipson@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16  x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU  (Wanpeng Li, 1 file, -3/+6)
[ Upstream commit ec756e40e271866f951d77c5e923d8deb6002b15 ] Inspired by commit 3553ae5690a (x86/kvm: Don't use pvqspinlock code if only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1645171838-2855-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-16  kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode  (Anton Romanov, 1 file, -0/+7)
[ Upstream commit 3a55f729240a686aa8af00af436306c0cd532522 ] If the vcpu has tsc_always_catchup set, each request updates the pvclock data. KVM_HC_CLOCK_PAIRING consumers such as ptp_kvm_x86 rely on the tsc read on the host's side and do the hypercall inside a pvclock_read_retry loop, leading to an infinite loop in that situation. v3: Removed warn; changed return code to KVM_EFAULT. v2: Added warn. Signed-off-by: Anton Romanov <romanton@google.com> Message-Id: <20220216182653.506850-1-romanton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-11  x86/speculation: Warn about eIBRS + LFENCE + Unprivileged eBPF + SMT  (Josh Poimboeuf, 1 file, -2/+25)
commit 0de05d056afdb00eca8c7bbb0c79a3438daf700c upstream. The commit 44a3918c8245 ("x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting") added a warning for the "eIBRS + unprivileged eBPF" combination, which has been shown to be vulnerable against Spectre v2 BHB-based attacks. However, there's no warning about the "eIBRS + LFENCE retpoline + unprivileged eBPF" combo. The LFENCE adds more protection by shortening the speculation window after a mispredicted branch. That makes an attack significantly more difficult, even with unprivileged eBPF. So at least for now the logic doesn't warn about that combination. But if you then add SMT into the mix, the SMT attack angle weakens the effectiveness of the LFENCE considerably. So extend the "eIBRS + unprivileged eBPF" warning to also include the "eIBRS + LFENCE + unprivileged eBPF + SMT" case. [ bp: Massage commit message. ] Suggested-by: Alyssa Milburn <alyssa.milburn@linux.intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11  x86/speculation: Warn about Spectre v2 LFENCE mitigation  (Josh Poimboeuf, 1 file, -0/+5)
commit eafd987d4a82c7bb5aa12f0e3b4f8f3dea93e678 upstream. With: f8a66d608a3e ("x86,bugs: Unconditionally allow spectre_v2=retpoline,amd") it became possible to enable the LFENCE "retpoline" on Intel. However, Intel doesn't recommend it, as it has some weaknesses compared to retpoline. Now AMD doesn't recommend it either. It can still be left available as a cmdline option. It's faster than retpoline but is weaker in certain scenarios -- particularly SMT, but even non-SMT may be vulnerable in some cases. So just unconditionally warn if the user requests it on the cmdline. [ bp: Massage commit message. ] Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11  x86/speculation: Use generic retpoline by default on AMD  (Kim Phillips, 1 file, -9/+0)
commit 244d00b5dd4755f8df892c86cab35fb2cfd4f14b upstream. AMD retpoline may be susceptible to speculation. The speculation execution window for an incorrect indirect branch prediction using LFENCE/JMP sequence may potentially be large enough to allow exploitation using Spectre V2. By default, don't use retpoline,lfence on AMD. Instead, use the generic retpoline. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11  x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting  (Josh Poimboeuf, 1 file, -6/+29)
commit 44a3918c8245ab10c6c9719dd12e7a8d291980d8 upstream. With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable to Spectre v2 BHB-based attacks. When both are enabled, print a warning message and report it in the 'spectre_v2' sysfs vulnerabilities file. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11  x86/speculation: Add eIBRS + Retpoline options  (Peter Zijlstra, 2 files, -38/+99)
commit 1e19da8522c81bf46b335f84137165741e0d82b7 upstream. Thanks to the chaps at VUsec it is now clear that eIBRS is not sufficient, therefore allow enabling of retpolines along with eIBRS. Add spectre_v2=eibrs, spectre_v2=eibrs,lfence and spectre_v2=eibrs,retpoline options to explicitly pick your preferred means of mitigation. Since there's new mitigations there's also user visible changes in /sys/devices/system/cpu/vulnerabilities/spectre_v2 to reflect these new mitigations. [ bp: Massage commit message, trim error messages, do more precise eIBRS mode checking. ] Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11  x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCE  (Peter Zijlstra (Intel), 6 files, -24/+31)
commit d45476d9832409371537013ebdd8dc1a7781f97a upstream. The RETPOLINE_AMD name is unfortunate since it isn't necessarily AMD only, in fact Hygon also uses it. Furthermore it will likely be sufficient for some Intel processors. Therefore rename the thing to RETPOLINE_LFENCE to better describe what it is. Add the spectre_v2=retpoline,lfence option as an alias to spectre_v2=retpoline,amd to preserve existing setups. However, the output of /sys/devices/system/cpu/vulnerabilities/spectre_v2 will be changed. [ bp: Fix typos, massage. ] Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-08  KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()  (Like Xu, 1 file, -1/+1)
commit c6c937d673aaa1d603f62f134e1ca9c173eeeed3 upstream. Just like on the optional mmu_alloc_direct_roots() path, once shadow path reaches "r = -EIO" somewhere, the caller needs to know the actual state in order to enter error handling and avoid something worse. Fixes: 4a38162ee9f1 ("KVM: MMU: load PDPTRs outside mmu_lock") Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220301124941.48412-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-08  x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64  (Dexuan Cui, 1 file, -0/+3)
commit 92e68cc558774de01024c18e8b35cdce4731c910 upstream. When Linux runs as an Isolated VM on Hyper-V, it supports AMD SEV-SNP but it's partially enlightened, i.e. cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) is true but sev_active() is false. Commit 4d96f9109109 per se is good, but with it now kvm_setup_vsyscall_timeinfo() -> kvmclock_init_mem() calls set_memory_decrypted(), and later gets stuck when trying to zero out the pages pointed to by 'hvclock_mem', if Linux runs as an Isolated VM on Hyper-V. The cause is that here the Linux VM should no longer access the original guest physical address (GPA); instead the VM should do memremap() and access the original GPA + ms_hyperv.shared_gpa_boundary: see the example code in drivers/hv/connection.c: vmbus_connect() or drivers/hv/ring_buffer.c: hv_ringbuffer_init(). If the VM tries to access the original GPA, it keeps getting a fault injected by Hyper-V and gets stuck there. The issue happens only when the VM has >=65 vCPUs, because the global static array hv_clock_boot[] can hold 64 "struct pvclock_vsyscall_time_info" (the sizeof of the struct is 64 bytes), so kvmclock_init_mem() only allocates memory in the case of vCPUs > 64. Since the 'hvclock_mem' pages are only useful when the kvm clock is supported by the underlying hypervisor, fix the issue by returning early when the Linux VM runs on Hyper-V, which doesn't support kvm clock. Fixes: 4d96f9109109 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()") Tested-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Message-Id: <20220225084600.17817-1-decui@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-08  KVM: x86: Add KVM_CAP_ENABLE_CAP to x86  (Aaron Lewis, 1 file, -0/+1)
[ Upstream commit 127770ac0d043435375ab86434f31a93efa88215 ] Follow the precedent set by other architectures that support the VCPU ioctl, KVM_ENABLE_CAP, and advertise the VM extension, KVM_CAP_ENABLE_CAP. This way, userspace can ensure that KVM_ENABLE_CAP is available on a vcpu before using it. Fixes: 5c919412fe61 ("kvm/x86: Hyper-V synthetic interrupt controller") Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220214212950.1776943-1-aaronlewis@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-02  KVM: x86: nSVM: disallow userspace setting of MSR_AMD64_TSC_RATIO to non-default value when tsc scaling disabled  (Maxim Levitsky, 1 file, -2/+17)
commit e910a53fb4f20aa012e46371ffb4c32c8da259b4 upstream. If nested tsc scaling is disabled, MSR_AMD64_TSC_RATIO should never have a non-default value. Due to the way nested tsc scaling support was implemented in qemu, it would set this msr to 0 when nested tsc scaling was disabled. Ignore that value for now, as it causes no harm. Fixes: 5228eb96a487 ("KVM: x86: nSVM: implement nested TSC scaling") Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220223115649.319134-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-02  KVM: x86/mmu: make apf token non-zero to fix bug  (Liang Zhang, 1 file, -1/+12)
commit 6f3c1fc53d86d580d8d6d749c4af23705e4f6f79 upstream. In the current async pagefault logic, when a page is ready, KVM relies on kvm_arch_can_dequeue_async_page_present() to determine whether to deliver a READY event to the Guest. This function tests the token value of struct kvm_vcpu_pv_apf_data, which must be reset to zero by the Guest kernel when a READY event is finished by the Guest. If the value is zero, the previous READY event is done and KVM can deliver another. But kvm_arch_setup_async_pf() may produce a valid token whose value is zero, which clashes with that convention and may lead to the loss of this READY event. This bug may cause a task to be blocked forever in the Guest:
INFO: task stress:7532 blocked for more than 1254 seconds.
Not tainted 5.10.0 #16
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress state:D stack: 0 pid: 7532 ppid: 1409 flags:0x00000080
Call Trace:
 __schedule+0x1e7/0x650
 schedule+0x46/0xb0
 kvm_async_pf_task_wait_schedule+0xad/0xe0
 ? exit_to_user_mode_prepare+0x60/0x70
 __kvm_handle_async_pf+0x4f/0xb0
 ? asm_exc_page_fault+0x8/0x30
 exc_page_fault+0x6f/0x110
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x402d00
RSP: 002b:00007ffd31912500 EFLAGS: 00010206
RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0
RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0
RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086
R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000
R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000
Signed-off-by: Liang Zhang <zhangliang5@huawei.com> Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23  x86/bug: Merge annotate_reachable() into _BUG_FLAGS() asm  (Nick Desaulniers, 1 file, -9/+11)
[ Upstream commit bfb1a7c91fb7758273b4a8d735313d9cc388b502 ] In __WARN_FLAGS(), we had two asm statements (abbreviated): asm volatile("ud2"); asm volatile(".pushsection .discard.reachable"); These pair of statements are used to trigger an exception, but then help objtool understand that for warnings, control flow will be restored immediately afterwards. The problem is that volatile is not a compiler barrier. GCC explicitly documents this: > Note that the compiler can move even volatile asm instructions > relative to other code, including across jump instructions. Also, no clobbers are specified to prevent instructions from subsequent statements from being scheduled by compiler before the second asm statement. This can lead to instructions from subsequent statements being emitted by the compiler before the second asm statement. Providing a scheduling model such as via -march= options enables the compiler to better schedule instructions with known latencies to hide latencies from data hazards compared to inline asm statements in which latencies are not estimated. If an instruction gets scheduled by the compiler between the two asm statements, then objtool will think that it is not reachable, producing a warning. To prevent instructions from being scheduled in between the two asm statements, merge them. Also remove an unnecessary unreachable() asm annotation from BUG() in favor of __builtin_unreachable(). objtool is able to track that the ud2 from BUG() terminates control flow within the function. Link: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Volatile Link: https://github.com/ClangBuiltLinux/linux/issues/1483 Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20220202205557.2260694-1-ndesaulniers@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-23  KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW  (Jim Mattson, 1 file, -1/+1)
[ Upstream commit 710c476514313c74045c41c0571bb5178fd16e3d ] AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of a PerfEvtSeln MSR. Don't mask off the high nybble when configuring a RAW perf event. Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20220203014813.2130559-2-jmattson@google.com> Reviewed-by: David Dunn <daviddunn@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-23  KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event  (Jim Mattson, 1 file, -2/+3)
[ Upstream commit b8bfee85f1307426e0242d654f3a14c06ef639c5 ] AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of a PerfEvtSeln MSR. Don't drop the high nybble when setting up the config field of a perf_event_attr structure for a call to perf_event_create_kernel_counter(). Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20220203014813.2130559-1-jmattson@google.com> Reviewed-by: David Dunn <daviddunn@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
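For orientation, here is a hedged sketch of the bit layout the two PerfEvtSeln fixes above describe. The event number is invented and the mask constants are illustrative stand-ins, not the kernel's AMD64_RAW_EVENT_MASK definition (which also covers other control bits):

```c
#include <stdint.h>
#include <stdio.h>

/*
 * AMD's event select is 3 nybbles (12 bits): bits [7:0] of PerfEvtSeln hold
 * the low byte, bits [35:32] hold the high nybble, and bits [15:8] hold the
 * unit mask. A mask that stops at bit 31 truncates any event above 0xFF.
 */
int main(void)
{
	uint64_t evtsel_bits = (0xFULL << 32) | 0xFF;	/* event select bits [35:32] and [7:0] */
	uint64_t umask_bits  = 0xFFULL << 8;		/* unit mask bits [15:8] */

	uint64_t eventsel = (0x1ULL << 32) | 0x23;	/* made-up 12-bit event 0x123, umask 0 */

	uint64_t truncated = eventsel & 0xFFFFFFFFULL;		    /* high nybble lost: event 0x023 */
	uint64_t preserved = eventsel & (evtsel_bits | umask_bits); /* full event select kept */

	printf("truncated=%#llx preserved=%#llx\n",
	       (unsigned long long)truncated, (unsigned long long)preserved);
	return 0;
}
```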
2022-02-23  KVM: x86/pmu: Refactoring find_arch_event() to pmc_perf_hw_id()  (Like Xu, 4 files, -17/+11)
[ Upstream commit 7c174f305cbee6bdba5018aae02b84369e7ab995 ] The find_arch_event() returns a "unsigned int" value, which is used by the pmc_reprogram_counter() to program a PERF_TYPE_HARDWARE type perf_event. The returned value is actually the kernel defined generic perf_hw_id, let's rename it to pmc_perf_hw_id() with simpler incoming parameters for better self-explanation. Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20211130074221.93635-3-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-23  x86/ptrace: Fix xfpregs_set()'s incorrect xmm clearing  (Andy Lutomirski, 2 files, -7/+6)
commit 44cad52cc14ae10062f142ec16ede489bccf4469 upstream. xfpregs_set() handles 32-bit REGSET_XFP and 64-bit REGSET_FP. The actual code treats these regsets as modern FX state (i.e. the beginning part of XSTATE). The declarations of the regsets thought they were the legacy i387 format. The code thought they were the 32-bit (no xmm8..15) variant of XSTATE and, for good measure, made the high bits disappear by zeroing the wrong part of the buffer. The latter broke ptrace, and everything else confused anyone trying to understand the code. In particular, the nonsense definitions of the regsets confused me when I wrote this code. Clean this all up. Change the declarations to match reality (which shouldn't change the generated code, let alone the ABI) and fix xfpregs_set() to clear the correct bits and to only do so for 32-bit callers. Fixes: 6164331d15f7 ("x86/fpu: Rewrite xfpregs_set()") Reported-by: Luís Ferreira <contact@lsferreira.net> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215524 Link: https://lore.kernel.org/r/YgpFnZpF01WwR8wU@zn.tnic Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23  KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state  (Maxim Levitsky, 1 file, -0/+2)
commit e8efa4ff00374d2e6f47f6e4628ca3b541c001af upstream. While usually, restoring the smm state makes the KVM enter the nested guest thus a different vmcb (vmcb02 vs vmcb01), KVM should still mark it as dirty, since hardware can in theory cache multiple vmcbs. Failure to do so, combined with lack of setting the nested_run_pending (which is fixed in the next patch), might make KVM re-enter vmcb01, which was just exited from, with completely different set of guest state registers (SMM vs non SMM) and without proper dirty bits set, which results in the CPU reusing stale IDTR pointer which leads to a guest shutdown on any interrupt. On the real hardware this usually doesn't happen, but when running nested, L0's KVM does check and honour few dirty bits, causing this issue to happen. This patch fixes boot of hyperv and SMM enabled windows VM running nested on KVM. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Message-Id: <20220207155447.840194-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23  KVM: x86: nSVM: fix potential NULL dereference on nested migration  (Maxim Levitsky, 1 file, -12/+14)
commit e1779c2714c3023e4629825762bcbc43a3b943df upstream. Turns out that due to review feedback and/or rebases I accidentally moved the call to nested_svm_load_cr3 to be too early, before the NPT is enabled, which is very wrong to do. KVM can't even access guest memory at that point as nested NPT is needed for that, and of course it won't initialize the walk_mmu, which is the main issue the patch was addressing. Fix this for real. Fixes: 232f75d3b4b5 ("KVM: nSVM: call nested_svm_load_cr3 on nested state load") Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220207155447.840194-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23  KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case  (Maxim Levitsky, 1 file, -2/+10)
commit c53bbe2145f51d3bc0438c2db02e737b9b598bf3 upstream. When the guest doesn't enable paging, and NPT/EPT is disabled, we use the guest's paging CR3's as KVM's shadow paging pointer and we are technically in direct mode as if we were to use NPT/EPT. In direct mode we create SPTEs with user mode permissions because usually in the direct mode the NPT/EPT doesn't need to restrict access based on guest CPL (there are MBE/GMET extensions for that but KVM doesn't use them). In this special "use guest paging as direct" mode however, and if CR4.SMAP/CR4.SMEP are enabled, that will make the CPU fault on each access and KVM will enter an endless loop of page faults. Since page protection doesn't have any meaning in the !PG case, just don't passthrough these bits. The fix is the same as was done for VMX in commit: commit 656ec4a4928a ("KVM: VMX: fix SMEP and SMAP without EPT") This fixes the boot of windows 10 without NPT for good. (Without this patch, the BSP boots, but APs are stuck in an endless loop of page faults, causing the VM to boot with only 1 CPU.) Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Message-Id: <20220207155447.840194-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23  KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM  (Maxim Levitsky, 2 files, -0/+6)
commit 759cbd59674a6c0aec616a3f4f0740ebd3f5fbef upstream. While RSM induced VM entries are not full VM entries, they still need to be followed by actual VM entry to complete it, unlike setting the nested state. This patch fixes boot of hyperv and SMM enabled windows VM running nested on KVM, which fail due to this issue combined with lack of dirty bit setting. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Message-Id: <20220207155447.840194-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23  KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU  (David Woodhouse, 1 file, -30/+67)
commit fcb732d8f8cf6084f8480015ad41d25fb023a4dd upstream. There are circumstances when kvm_xen_update_runstate_guest() should not sleep because it ends up being called from __schedule() when the vCPU is preempted: [ 222.830825] kvm_xen_update_runstate_guest+0x24/0x100 [ 222.830878] kvm_arch_vcpu_put+0x14c/0x200 [ 222.830920] kvm_sched_out+0x30/0x40 [ 222.830960] __schedule+0x55c/0x9f0 To handle this, make it use the same trick as __kvm_xen_has_interrupt(), of using the hva from the gfn_to_hva_cache directly. Then it can use pagefault_disable() around the accesses and just bail out if the page is absent (which is unlikely). I almost switched to using a gfn_to_pfn_cache here and bailing out if kvm_map_gfn() fails, like kvm_steal_time_set_preempted() does — but on closer inspection it looks like kvm_map_gfn() will *always* fail in atomic context for a page in IOMEM, which means it will silently fail to make the update every single time for such guests, AFAICT. So I didn't do it that way after all. And will probably fix that one too. Cc: stable@vger.kernel.org Fixes: 30b5c851af79 ("KVM: x86/xen: Add support for vCPU runstate information") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <b17a93e5ff4561e57b1238e3e7ccd0b613eb827e.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23  x86/Xen: streamline (and fix) PV CPU enumeration  (Jan Beulich, 2 files, -24/+6)
[ Upstream commit e25a8d959992f61b64a58fc62fb7951dc6f31d1f ] This started out with me noticing that "dom0_max_vcpus=<N>" with <N> larger than the number of physical CPUs reported through ACPI tables would not bring up the "excess" vCPU-s. Addressing this is the primary purpose of the change; CPU maps handling is being tidied only as far as is necessary for the change here (with the effect of also avoiding the setting up of too much per-CPU infrastructure, i.e. for CPUs which can never come online). Noticing that xen_fill_possible_map() is called way too early, whereas xen_filter_cpu_maps() is called too late (after per-CPU areas were already set up), and further observing that each of the functions serves only one of Dom0 or DomU, it looked like it was better to simplify this. Use the .get_smp_config hook instead, uniformly for Dom0 and DomU. xen_fill_possible_map() can be dropped altogether, while xen_filter_cpu_maps() is re-purposed but not otherwise changed. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/2dbd5f0a-9859-ca2d-085e-a02f7166c610@suse.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-23  Revert "svm: Add warning message for AVIC IPI invalid target"  (Sean Christopherson, 1 file, -2/+0)
commit dd4589eee99db8f61f7b8f7df1531cad3f74a64d upstream. Remove a WARN on an "AVIC IPI invalid target" exit, the WARN is trivial to trigger from guest as it will fail on any destination APIC ID that doesn't exist from the guest's perspective. Don't bother recording anything in the kernel log, the common tracepoint for kvm_avic_incomplete_ipi() is sufficient for debugging. This reverts commit 37ef0c4414c9743ba7f1af4392f0a27a99649f2a. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-16  x86/sgx: Silence softlockup detection when releasing large enclaves  (Reinette Chatre, 1 file, -0/+2)
commit 8795359e35bc33bf86b6d0765aa7f37431db3b9c upstream. Vijay reported that the "unclobbered_vdso_oversubscribed" selftest triggers the softlockup detector. Actual SGX systems have 128GB of enclave memory or more. The "unclobbered_vdso_oversubscribed" selftest creates one enclave which consumes all of the enclave memory on the system. Tearing down such a large enclave takes around a minute, most of it in the loop where the EREMOVE instruction is applied to each individual 4k enclave page. Spending one minute in a loop triggers the softlockup detector. Add a cond_resched() to give other tasks a chance to run and placate the softlockup detector. Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Vijay Dhanraj <vijay.dhanraj@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> (kselftest as sanity check) Link: https://lkml.kernel.org/r/ced01cac1e75f900251b0a4ae1150aa8ebd295ec.1644345232.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
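A hedged sketch of the pattern the fix describes; the loop and helper names below are illustrative rather than the actual SGX reclaimer code:

```c
#include <linux/sched.h>
#include <linux/xarray.h>

/*
 * Illustrative teardown loop (helper names are hypothetical, not the actual
 * sgx/encl.c code): EREMOVE is issued for every 4k enclave page, so freeing
 * a huge enclave can run for a minute or more. A cond_resched() in the loop
 * lets other tasks run and keeps the softlockup detector quiet.
 */
static void sgx_release_all_pages(struct sgx_encl *encl)
{
	struct sgx_encl_page *page;
	unsigned long idx;

	xa_for_each(&encl->page_array, idx, page) {
		sgx_encl_free_page(page);	/* hypothetical: EREMOVE + free the page */
		cond_resched();			/* give other tasks a chance to run */
	}
}
```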
2022-02-16  KVM: x86: Report deprecated x87 features in supported CPUID  (Jim Mattson, 1 file, -6/+7)
[ Upstream commit e3bcfda012edd3564e12551b212afbd2521a1f68 ] CPUID.(EAX=7,ECX=0):EBX.FDP_EXCPTN_ONLY[bit 6] and CPUID.(EAX=7,ECX=0):EBX.ZERO_FCS_FDS[bit 13] are "defeature" bits. Unlike most of the other CPUID feature bits, these bits are clear if the features are present and set if the features are not present. These bits should be reported in KVM_GET_SUPPORTED_CPUID, because if these bits are set on hardware, they cannot be cleared in the guest CPUID. Doing so would claim guest support for a feature that the hardware doesn't support and that can't be efficiently emulated. Of course, any software (e.g WIN87EM.DLL) expecting these features to be present likely predates these CPUID feature bits and therefore doesn't know to check for them anyway. Aaron Lewis added the corresponding X86_FEATURE macros in commit cbb99c0f5887 ("x86/cpufeatures: Add FDP_EXCPTN_ONLY and ZERO_FCS_FDS"), with the intention of reporting these bits in KVM_GET_SUPPORTED_CPUID, but I was unable to find a proposed patch on the kvm list. Opportunistically reordered the CPUID_7_0_EBX capability bits from least to most significant. Cc: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20220204001348.2844660-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-16  KVM: VMX: Set vmcs.PENDING_DBG.BS on #DB in STI/MOVSS blocking shadow  (Sean Christopherson, 1 file, -0/+25)
[ Upstream commit b9bed78e2fa9571b7c983b20666efa0009030c71 ] Set vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS, a.k.a. the pending single-step breakpoint flag, when re-injecting a #DB with RFLAGS.TF=1, and STI or MOVSS blocking is active. Setting the flag is necessary to make VM-Entry consistency checks happy, as VMX has an invariant that if RFLAGS.TF is set and STI/MOVSS blocking is true, then the previous instruction must have been STI or MOV/POP, and therefore a single-step #DB must be pending since the RFLAGS.TF cannot have been set by the previous instruction, i.e. the one instruction delay after setting RFLAGS.TF must have already expired. Normally, the CPU sets vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS appropriately when recording guest state as part of a VM-Exit, but #DB VM-Exits intentionally do not treat the #DB as "guest state" as interception of the #DB effectively makes the #DB host-owned, thus KVM needs to manually set PENDING_DBG.BS when forwarding/re-injecting the #DB to the guest. Note, although this bug can be triggered by guest userspace, doing so requires IOPL=3, and guest userspace running with IOPL=3 has full access to all I/O ports (from the guest's perspective) and can crash/reboot the guest any number of ways. IOPL=3 is required because STI blocking kicks in if and only if RFLAGS.IF is toggled 0=>1, and if CPL>IOPL, STI either takes a #GP or modifies RFLAGS.VIF, not RFLAGS.IF. MOVSS blocking can be initiated by userspace, but can be coincident with a #DB if and only if DR7.GD=1 (General Detect enabled) and a MOV DR is executed in the MOVSS shadow. MOV DR #GPs at CPL>0, thus MOVSS blocking is problematic only for CPL0 (and only if the guest is crazy enough to access a DR in a MOVSS shadow). All other sources of #DBs are either suppressed by MOVSS blocking (single-step, code fetch, data, and I/O), are mutually exclusive with MOVSS blocking (T-bit task switch), or are already handled by KVM (ICEBP, a.k.a. INT1). This bug was originally found by running tests[1] created for XSA-308[2]. Note that Xen's userspace test emits ICEBP in the MOVSS shadow, which is presumably why the Xen bug was deemed to be an exploitable DOS from guest userspace. KVM already handles ICEBP by skipping the ICEBP instruction and thus clears MOVSS blocking as a side effect of its "emulation". [1] http://xenbits.xenproject.org/docs/xtf/xsa-308_2main_8c_source.html [2] https://xenbits.xen.org/xsa/advisory-308.html Reported-by: David Woodhouse <dwmw2@infradead.org> Reported-by: Alexander Graf <graf@amazon.de> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220120000624.655815-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-16  KVM: SVM: Don't kill SEV guest if SMAP erratum triggers in usermode  (Sean Christopherson, 1 file, -1/+15)
[ Upstream commit cdf85e0c5dc766fc7fc779466280e454a6d04f87 ] Inject a #GP instead of synthesizing triple fault to try to avoid killing the guest if emulation of an SEV guest fails due to encountering the SMAP erratum. The injected #GP may still be fatal to the guest, e.g. if the userspace process is providing critical functionality, but KVM should make every attempt to keep the guest alive. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Message-Id: <20220120010719.711476-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-16  KVM: nVMX: Also filter MSR_IA32_VMX_TRUE_PINBASED_CTLS when eVMCS  (Vitaly Kuznetsov, 1 file, -0/+1)
[ Upstream commit f80ae0ef089a09e8c18da43a382c3caac9a424a7 ] Similar to MSR_IA32_VMX_EXIT_CTLS/MSR_IA32_VMX_TRUE_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS/MSR_IA32_VMX_TRUE_ENTRY_CTLS pair, MSR_IA32_VMX_TRUE_PINBASED_CTLS needs to be filtered the same way MSR_IA32_VMX_PINBASED_CTLS is currently filtered as guests may solely rely on 'true' MSR data. Note, none of the currently existing Windows/Hyper-V versions are known to stumble upon the unfiltered MSR_IA32_VMX_TRUE_PINBASED_CTLS, the change is aimed at making the filtering future proof. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220112170134.1904308-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-16  KVM: nVMX: eVMCS: Filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER  (Vitaly Kuznetsov, 1 file, -1/+3)
[ Upstream commit 7a601e2cf61558dfd534a9ecaad09f5853ad8204 ] Enlightened VMCS v1 doesn't have VMX_PREEMPTION_TIMER_VALUE field, PIN_BASED_VMX_PREEMPTION_TIMER is also filtered out already so it makes sense to filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER too. Note, none of the currently existing Windows/Hyper-V versions are known to enable 'save VMX-preemption timer value' when eVMCS is in use, the change is aimed at making the filtering future proof. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220112170134.1904308-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-16  x86/perf: Avoid warning for Arch LBR without XSAVE  (Andi Kleen, 1 file, -0/+3)
[ Upstream commit 8c16dc047b5dd8f7b3bf4584fa75733ea0dde7dc ] Some hypervisors support Arch LBR, but without the LBR XSAVE support. The current Arch LBR init code prints a warning when the xsave size (0) is unexpected. Avoid printing the warning for the "no LBR XSAVE" case. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211215204029.150686-1-ak@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-16  perf/x86/rapl: fix AMD event handling  (Stephane Eranian, 1 file, -3/+6)
[ Upstream commit 0036fb00a756a2f6e360d44e2e3d2200a8afbc9b ] The RAPL events exposed under /sys/devices/power/events should only reflect what the underlying hardware actually supports. This is how it works on Intel RAPL and Intel core/uncore PMUs in general. But on AMD, this was not the case. All possible RAPL events were advertised. This is what it showed on an AMD Fam17h: $ ls /sys/devices/power/events/ energy-cores energy-gpu energy-pkg energy-psys energy-ram energy-cores.scale energy-gpu.scale energy-pkg.scale energy-psys.scale energy-ram.scale energy-cores.unit energy-gpu.unit energy-pkg.unit energy-psys.unit energy-ram.unit Yet, on AMD Fam17h, only energy-pkg is supported. This patch fixes the problem. Given the way perf_msr_probe() works, the amd_rapl_msrs[] table has to have all entries filled out and in particular the group field, otherwise perf_msr_probe() defaults to making the event visible. With the patch applied, the kernel now only shows what is actually supported: $ ls /sys/devices/power/events/ energy-pkg energy-pkg.scale energy-pkg.unit The patch also uses the RAPL_MSR_MASK because only the lower 32 bits of the RAPL counters are relevant when reading power consumption. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220105185659.643355-1-eranian@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-08  x86/perf: Default set FREEZE_ON_SMI for all  (Peter Zijlstra, 1 file, -0/+13)
commit a01994f5e5c79d3a35e5e8cf4252c7f2147323c3 upstream. Kyle reported that rr[0] has started to malfunction on Comet Lake and later CPUs due to EFI starting to make use of CPL3 [1] and the PMU event filtering not distinguishing between regular CPL3 and SMM CPL3. Since this is a privilege violation, default disable SMM visibility where possible. Administrators wanting to observe SMM cycles can easily change this using the sysfs attribute while regular users don't have access to this file. [0] https://rr-project.org/ [1] See the Intel white paper "Trustworthy SMM on the Intel vPro Platform" at https://bugzilla.kernel.org/attachment.cgi?id=300300, particularly the end of page 5. Reported-by: Kyle Huey <me@kylehuey.com> Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@kernel.org Link: https://lkml.kernel.org/r/YfKChjX61OW4CkYm@hirez.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-08  perf/x86/intel/pt: Fix crash with stop filters in single-range mode  (Tristan Hume, 1 file, -2/+3)
commit 1d9093457b243061a9bba23543c38726e864a643 upstream. Add a check for !buf->single before calling pt_buffer_region_size in a place where a missing check can cause a kernel crash. Fixes a bug introduced by commit 670638477aed ("perf/x86/intel/pt: Opportunistically use single range output mode"), which added a support for PT single-range output mode. Since that commit if a PT stop filter range is hit while tracing, the kernel will crash because of a null pointer dereference in pt_handle_status due to calling pt_buffer_region_size without a ToPA configured. The commit which introduced single-range mode guarded almost all uses of the ToPA buffer variables with checks of the buf->single variable, but missed the case where tracing was stopped by the PT hardware, which happens when execution hits a configured stop filter. Tested that hitting a stop filter while PT recording successfully records a trace with this patch but crashes without this patch. Fixes: 670638477aed ("perf/x86/intel/pt: Opportunistically use single range output mode") Signed-off-by: Tristan Hume <tristan@thume.ca> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@kernel.org Link: https://lkml.kernel.org/r/20220127220806.73664-1-tristan@thume.ca Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01  KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use  (Vitaly Kuznetsov, 2 files, -16/+51)
commit 6cbbaab60ff33f59355492c241318046befd9ffc upstream. The Hyper-V TLFS explicitly forbids VMREAD and VMWRITE instructions when the Enlightened VMCS interface is in use: "Any VMREAD or VMWRITE instructions while an enlightened VMCS is active is unsupported and can result in unexpected behavior." Windows 11 + WSL2 seems to ignore this; attempts to VMREAD VMCS field 0x4404 ("VM-exit interruption information") are observed. Failing these attempts with nested_vmx_failInvalid() makes such guests unbootable. Microsoft confirms this is a Hyper-V bug and claims that it'll get fixed eventually but for the time being we need a workaround. (Temporarily) allow VMREAD to get data from the currently loaded Enlightened VMCS. Note: VMWRITE instructions remain forbidden, it is not clear how to handle them properly and hopefully won't ever be needed. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220112170134.1904308-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01  KVM: nVMX: Implement evmcs_field_offset() suitable for handle_vmread()  (Vitaly Kuznetsov, 2 files, -10/+25)
commit 892a42c10ddb945d3a4dcf07dccdf9cb98b21548 upstream. In preparation to allowing reads from Enlightened VMCS from handle_vmread(), implement evmcs_field_offset() to get the correct read offset. get_evmcs_offset(), which is being used by KVM-on-Hyper-V, is almost what's needed but a few things need to be adjusted. First, WARN_ON() is unacceptable for handle_vmread() as any field can (in theory) be supplied by the guest and not all fields are defined in eVMCS v1. Second, we need to handle 'holes' in eVMCS (missing fields). It also sounds like a good idea to WARN_ON() if such fields are ever accessed by KVM-on-Hyper-V. Implement dedicated evmcs_field_offset() helper. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220112170134.1904308-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01  KVM: nVMX: Rename vmcs_to_field_offset{,_table}  (Vitaly Kuznetsov, 3 files, -8/+8)
commit 2423a4c0d17418eca1ba1e3f48684cb2ab7523d5 upstream. vmcs_to_field_offset{,_table} may sound misleading as VMCS is an opaque blob which is not supposed to be accessed directly. In fact, vmcs_to_field_offset{,_table} are related to KVM defined VMCS12 structure. Rename vmcs_field_to_offset() to get_vmcs12_field_offset() for clarity. No functional change intended. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220112170134.1904308-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01  x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPIN  (Tony Luck, 1 file, -0/+1)
commit e464121f2d40eabc7d11823fb26db807ce945df4 upstream. Missed adding the Icelake-D CPU to the list. It uses the same MSRs to control and read the inventory number as all the other models. Fixes: dc6b025de95b ("x86/mce: Add Xeon Icelake to list of CPUs that support PPIN") Reported-by: Ailin Xu <ailin.xu@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220121174743.1875294-2-tony.luck@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01  x86/MCE/AMD: Allow thresholding interface updates after init  (Yazen Ghannam, 1 file, -1/+1)
commit 1f52b0aba6fd37653416375cb8a1ca673acf8d5f upstream. Changes to the AMD Thresholding sysfs code prevents sysfs writes from updating the underlying registers once CPU init is completed, i.e. "threshold_banks" is set. Allow the registers to be updated if the thresholding interface is already initialized or if in the init path. Use the "set_lvt_off" value to indicate if running in the init path, since this value is only set during init. Fixes: a037f3ca0ea0 ("x86/mce/amd: Make threshold bank setting hotplug robust") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220117161328.19148-1-yazen.ghannam@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01  PCI/sysfs: Find shadow ROM before static attribute initialization  (Bjorn Helgaas, 1 file, -2/+2)
commit 66d28b21fe6b3da8d1e9f0a7ba38bc61b6c547e1 upstream. Ville reported that the sysfs "rom" file for VGA devices disappeared after 527139d738d7 ("PCI/sysfs: Convert "rom" to static attribute"). Prior to 527139d738d7, FINAL fixups, including pci_fixup_video() where we find shadow ROMs, were run before pci_create_sysfs_dev_files() created the sysfs "rom" file. After 527139d738d7, "rom" is a static attribute and is created before FINAL fixups are run, so we didn't create "rom" files for shadow ROMs: acpi_pci_root_add ... pci_scan_single_device pci_device_add pci_fixup_video # <-- new HEADER fixup device_add ... if (grp->is_visible()) pci_dev_rom_attr_is_visible # after 527139d738d7 pci_bus_add_devices pci_bus_add_device pci_fixup_device(pci_fixup_final) pci_fixup_video # <-- previous FINAL fixup pci_create_sysfs_dev_files if (pci_resource_len(pdev, PCI_ROM_RESOURCE)) sysfs_create_bin_file("rom") # before 527139d738d7 Change pci_fixup_video() to be a HEADER fixup so it runs before sysfs static attributes are initialized. Rename the Loongson pci_fixup_radeon() to pci_fixup_video() and make its dmesg logging identical to the others since it is doing the same job. Link: https://lore.kernel.org/r/YbxqIyrkv3GhZVxx@intel.com Fixes: 527139d738d7 ("PCI/sysfs: Convert "rom" to static attribute") Link: https://lore.kernel.org/r/20220126154001.16895-1-helgaas@kernel.org Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org # v5.13+ Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Krzysztof Wilczyński <kw@linux.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01  KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time  (Like Xu, 1 file, -2/+2)
commit 05a9e065059e566f218f8778c4d17ee75db56c55 upstream. XCR0 is reset to 1 by RESET but not INIT and IA32_XSS is zeroed by both RESET and INIT. The kvm_set_msr_common()'s handling of MSR_IA32_XSS also needs to update kvm_update_cpuid_runtime(). In the above cases, the size in bytes of the XSAVE area containing all states enabled by XCR0 or (XCRO | IA32_XSS) needs to be updated. For simplicity and consistency, existing helpers are used to write values and call kvm_update_cpuid_runtime(), and it's not exactly a fast path. Fixes: a554d207dc46 ("KVM: X86: Processor States following Reset or INIT") Cc: stable@vger.kernel.org Signed-off-by: Like Xu <likexu@tencent.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220126172226.2298529-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01  KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS  (Like Xu, 1 file, -0/+1)
commit 4c282e51e4450b94680d6ca3b10f830483b1f243 upstream. Do a runtime CPUID update for a vCPU if MSR_IA32_XSS is written, as the size in bytes of the XSAVE area is affected by the states enabled in XSS. Fixes: 203000993de5 ("kvm: vmx: add MSR logic for XSAVES") Cc: stable@vger.kernel.org Signed-off-by: Like Xu <likexu@tencent.com> [sean: split out as a separate patch, adjust Fixes tag] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220126172226.2298529-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01  KVM: x86: Keep MSR_IA32_XSS unchanged for INIT  (Xiaoyao Li, 1 file, -2/+1)
commit be4f3b3f82271c3193ce200a996dc70682c8e622 upstream. It has been corrected from SDM version 075 that MSR_IA32_XSS is reset to zero on Power up and Reset but keeps unchanged on INIT. Fixes: a554d207dc46 ("KVM: X86: Processor States following Reset or INIT") Cc: stable@vger.kernel.org Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220126172226.2298529-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01  KVM: x86: Check .flags in kvm_cpuid_check_equal() too  (Vitaly Kuznetsov, 1 file, -0/+1)
commit 033a3ea59a19df63edb4db6bfdbb357cd028258a upstream. kvm_cpuid_check_equal() checks for the (full) equality of the supplied CPUID data so .flags need to be checked too. Reported-by: Sean Christopherson <seanjc@google.com> Fixes: c6617c61e8fe ("KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220126131804.2839410-1-vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>