summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/vmx/vmx.c
AgeCommit message (Collapse)AuthorFilesLines
2024-08-03KVM: nVMX: Fold requested virtual interrupt check into has_nested_events()Sean Christopherson1-20/+0
commit 321ef62b0c5f6f57bb8500a2ca5986052675abbf upstream. Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is pending delivery, in vmx_has_nested_events() and drop the one-off kvm_x86_ops.guest_apic_has_interrupt() hook. In addition to dropping a superfluous hook, this fixes a bug where KVM would incorrectly treat virtual interrupts _for L2_ as always enabled due to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(), treating IRQs as enabled if L2 is active and vmcs12 is configured to exit on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake event based on L1's IRQ blocking status. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-03KVM: VMX: Split out the non-virtualization part of vmx_interrupt_blocked()Sean Christopherson1-3/+8
commit 322a569c4b4188a0da2812f9e952780ce09b74ba upstream. Move the non-VMX chunk of the "interrupt blocked" checks to a separate helper so that KVM can reuse the code to detect if interrupts are blocked for L2, e.g. to determine if a virtual interrupt _for L2_ is a valid wake event. If L1 disables HLT-exiting for L2, nested APICv is enabled, and L2 HLTs, then L2 virtual interrupts are valid wake events, but if and only if interrupts are unblocked for L2. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-03Merge branch 'kvm-fixes-6.10-1' into HEADPaolo Bonzini1-2/+9
* Fixes and debugging help for the #VE sanity check. Also disable it by default, even for CONFIG_DEBUG_KERNEL, because it was found to trigger spuriously (most likely a processor erratum as the exact symptoms vary by generation). * Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled situation (GIF=0 or interrupt shadow) when the processor supports virtual NMI. While generally KVM will not request an NMI window when virtual NMIs are supported, in this case it *does* have to single-step over the interrupt shadow or enable the STGI intercept, in order to deliver the latched second NMI. * Drop support for hand tuning APIC timer advancement from userspace. Since we have adaptive tuning, and it has proved to work well, drop the module parameter for manual configuration and with it a few stupid bugs that it had.
2024-05-23KVM: x86/mmu: Print SPTEs on unexpected #VESean Christopherson1-0/+5
Print the SPTEs that correspond to the faulting GPA on an unexpected EPT Violation #VE to help the user debug failures, e.g. to pinpoint which SPTE didn't have SUPPRESS_VE set. Opportunistically assert that the underlying exit reason was indeed an EPT Violation, as the CPU has *really* gone off the rails if a #VE occurs due to a completely unexpected exit reason. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: VMX: Dump VMCS on unexpected #VESean Christopherson1-1/+3
Dump the VMCS on an unexpected #VE, otherwise it's practically impossible to figure out why the #VE occurred. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: VMX: Don't kill the VM on an unexpected #VESean Christopherson1-2/+2
Don't terminate the VM on an unexpected #VE, as it's extremely unlikely the #VE is fatal to the guest, and even less likely that it presents a danger to the host. Simply resume the guest on "failure", as the #VE info page's BUSY field will prevent converting any more EPT Violations to #VEs for the vCPU (at least, that's what the BUSY field is supposed to do). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-16Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-281/+159
Pull KVM updates from Paolo Bonzini: "ARM: - Move a lot of state that was previously stored on a per vcpu basis into a per-CPU area, because it is only pertinent to the host while the vcpu is loaded. This results in better state tracking, and a smaller vcpu structure. - Add full handling of the ERET/ERETAA/ERETAB instructions in nested virtualisation. The last two instructions also require emulating part of the pointer authentication extension. As a result, the trap handling of pointer authentication has been greatly simplified. - Turn the global (and not very scalable) LPI translation cache into a per-ITS, scalable cache, making non directly injected LPIs much cheaper to make visible to the vcpu. - A batch of pKVM patches, mostly fixes and cleanups, as the upstreaming process seems to be resuming. Fingers crossed! - Allocate PPIs and SGIs outside of the vcpu structure, allowing for smaller EL2 mapping and some flexibility in implementing more or less than 32 private IRQs. - Purge stale mpidr_data if a vcpu is created after the MPIDR map has been created. - Preserve vcpu-specific ID registers across a vcpu reset. - Various minor cleanups and improvements. LoongArch: - Add ParaVirt IPI support - Add software breakpoint support - Add mmio trace events support RISC-V: - Support guest breakpoints using ebreak - Introduce per-VCPU mp_state_lock and reset_cntx_lock - Virtualize SBI PMU snapshot and counter overflow interrupts - New selftests for SBI PMU and Guest ebreak - Some preparatory work for both TDX and SNP page fault handling. This also cleans up the page fault path, so that the priorities of various kinds of fauls (private page, no memory, write to read-only slot, etc.) are easier to follow. x86: - Minimize amount of time that shadow PTEs remain in the special REMOVED_SPTE state. This is a state where the mmu_lock is held for reading but concurrent accesses to the PTE have to spin; shortening its use allows other vCPUs to repopulate the zapped region while the zapper finishes tearing down the old, defunct page tables. - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which is defined by hardware but left for software use. This lets KVM communicate its inability to map GPAs that set bits 51:48 on hosts without 5-level nested page tables. Guest firmware is expected to use the information when mapping BARs; this avoids that they end up at a legal, but unmappable, GPA. - Fixed a bug where KVM would not reject accesses to MSR that aren't supposed to exist given the vCPU model and/or KVM configuration. - As usual, a bunch of code cleanups. x86 (AMD): - Implement a new and improved API to initialize SEV and SEV-ES VMs, which will also be extendable to SEV-SNP. The new API specifies the desired encryption in KVM_CREATE_VM and then separately initializes the VM. The new API also allows customizing the desired set of VMSA features; the features affect the measurement of the VM's initial state, and therefore enabling them cannot be done tout court by the hypervisor. While at it, the new API includes two bugfixes that couldn't be applied to the old one without a flag day in userspace or without affecting the initial measurement. When a SEV-ES VM is created with the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected once the VMSA has been encrypted. Also, the FPU and AVX state will be synchronized and encrypted too. - Support for GHCB version 2 as applicable to SEV-ES guests. This, once more, is only accessible when using the new KVM_SEV_INIT2 flow for initialization of SEV-ES VMs. x86 (Intel): - An initial bunch of prerequisite patches for Intel TDX were merged. They generally don't do anything interesting. The only somewhat user visible change is a new debugging mode that checks that KVM's MMU never triggers a #VE virtualization exception in the guest. - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to L1, as per the SDM. Generic: - Use vfree() instead of kvfree() for allocations that always use vcalloc() or __vcalloc(). - Remove .change_pte() MMU notifier - the changes to non-KVM code are small and Andrew Morton asked that I also take those through the KVM tree. The callback was only ever implemented by KVM (which was also the original user of MMU notifiers) but it had been nonfunctional ever since calls to set_pte_at_notify were wrapped with invalidate_range_start and invalidate_range_end... in 2012. Selftests: - Enhance the demand paging test to allow for better reporting and stressing of UFFD performance. - Convert the steal time test to generate TAP-friendly output. - Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed time across two different clock domains. - Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT. - Avoid unnecessary use of "sudo" in the NX hugepage test wrapper shell script, to play nice with running in a minimal userspace environment. - Allow skipping the RSEQ test's sanity check that the vCPU was able to complete a reasonable number of KVM_RUNs, as the assert can fail on a completely valid setup. If the test is run on a large-ish system that is otherwise idle, and the test isn't affined to a low-ish number of CPUs, the vCPU task can be repeatedly migrated to CPUs that are in deep sleep states, which results in the vCPU having very little net runtime before the next migration due to high wakeup latencies. - Define _GNU_SOURCE for all selftests to fix a warning that was introduced by a change to kselftest_harness.h late in the 6.9 cycle, and because forcing every test to #define _GNU_SOURCE is painful. - Provide a global pseudo-RNG instance for all tests, so that library code can generate random, but determinstic numbers. - Use the global pRNG to randomly force emulation of select writes from guest code on x86, e.g. to help validate KVM's emulation of locked accesses. - Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception handlers at VM creation, instead of forcing tests to manually trigger the related setup. Documentation: - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits) selftests/kvm: remove dead file KVM: selftests: arm64: Test vCPU-scoped feature ID registers KVM: selftests: arm64: Test that feature ID regs survive a reset KVM: selftests: arm64: Store expected register value in set_id_regs KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope KVM: arm64: Only reset vCPU-scoped feature ID regs once KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs() KVM: arm64: Rename is_id_reg() to imply VM scope KVM: arm64: Destroy mpidr_data for 'late' vCPU creation KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support KVM: arm64: Fix hvhe/nvhe early alias parsing KVM: SEV: Allow per-guest configuration of GHCB protocol version KVM: SEV: Add GHCB handling for termination requests KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests KVM: SEV: Add support to handle AP reset MSR protocol KVM: x86: Explicitly zero kvm_caps during vendor module load KVM: x86: Fully re-initialize supported_mce_cap on vendor module load KVM: x86: Fully re-initialize supported_vm_types on vendor module load KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values ...
2024-05-12Merge tag 'kvm-x86-vmx-6.10' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-2/+0
KVM VMX changes for 6.10: - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to L1, as per the SDM. - Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is used only when synthesizing nested EPT violation, i.e. it's not the vCPU's "real" exit_qualification, which is tracked elsewhere. - Add a sanity check to assert that EPT Violations are the only sources of nested PML Full VM-Exits.
2024-05-10Merge tag 'loongarch-kvm-6.10' of ↵Paolo Bonzini1-6/+35
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.10 1. Add ParaVirt IPI support. 2. Add software breakpoint support. 3. Add mmio trace events support.
2024-04-30x86/irq: Remove bitfields in posted interrupt descriptorJacob Pan1-1/+1
Mixture of bitfields and types is weird and really not intuitive, remove bitfields and use typed data exclusively. Bitfields often result in inferior machine code. Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-4-jacob.jun.pan@linux.intel.com Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
2024-04-30KVM: VMX: Move posted interrupt descriptor out of VMX codeJacob Pan1-0/+1
To prepare native usage of posted interrupts, move the PID declarations out of VMX code such that they can be shared. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20240423174114.526704-2-jacob.jun.pan@linux.intel.com
2024-04-19KVM: VMX: Introduce test mode related to EPT violation VEIsaku Yamahata1-3/+50
To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the suppress #VE bit in EPT entries selectively, in order to be able to trap non-present conditions. However, #VE isn't used for VMX and it's a bug if it happens. To be defensive and test that VMX case isn't broken introduce an option ept_violation_ve_test and when it's set, BUG the vm. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <d6db6ba836605c0412e166359ba5c46a63c22f86.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19KVM, x86: add architectural support code for #VEPaolo Bonzini1-0/+4
Dump the contents of the #VE info data structure and assert that #VE does not happen, but do not yet do anything with it. No functional change intended, separated for clarity only. Extracted from a patch by Isaku Yamahata <isaku.yamahata@intel.com>. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argumentSean Christopherson1-9/+7
TDX uses different ABI to get information about VM exit. Pass intr_info to the NMI and INTR handlers instead of pulling it from vcpu_vmx in preparation for sharing the bulk of the handlers with TDX. When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds exit qualification etc rather than the VMCS fields because VMM doesn't have access to the VMCS. The eventual code will be VMX: - get exit reason, intr_info, exit_qualification, and etc from VMCS - call NMI/INTR handlers (common code) TDX: - get exit reason, intr_info, exit_qualification, and etc from guest registers - call NMI/INTR handlers (common code) Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDXPaolo Bonzini1-269/+100
KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions to operate on VM. TDX doesn't allow VMM to operate VMCS directly. Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to indirectly operate those data structures. This means we must have a TDX version of kvm_x86_ops. The existing global struct kvm_x86_ops already defines an interface which can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM structure. To allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or TDX at run time. To split the runtime switch, the VMX implementation, and the TDX implementation, add main.c, and move out the vmx_x86_ops hooks in preparation for adding TDX. Use 'vt' for the naming scheme as a nod to VT-x and as a concatenation of VmxTdx. The eventually converted code will look like this: vmx.c: vmx_op() { ... } VMX initialization tdx.c: tdx_op() { ... } TDX initialization x86_ops.h: vmx_op(); tdx_op(); main.c: static vt_op() { if (tdx) tdx_op() else vmx_op() } static struct kvm_x86_ops vt_x86_ops = { .op = vt_op, initialization functions call both VMX and TDX initialization Opportunistically, fix the name inconsistency from vmx_create_vcpu() and vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free(). Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacksSean Christopherson1-1/+9
Disable LBR virtualization if the CPU doesn't support callstacks, which were introduced in HSW (see commit e9d7f7cd97c4 ("perf/x86/intel: Add basic Haswell LBR call stack support"), as KVM unconditionally configures the perf LBR event with PERF_SAMPLE_BRANCH_CALL_STACK, i.e. LBR virtualization always fails on pre-HSW CPUs. Simply disable LBR support on such CPUs, as it has never worked, i.e. there is no risk of breaking an existing setup, and figuring out a way to performantly context switch LBRs on old CPUs is not worth the effort. Fixes: be635e34c284 ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES") Cc: Mingwei Zhang <mizhang@google.com> Cc: Jim Mattson <jmattson@google.com> Tested-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20240307011344.835640-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-11KVM: VMX: Snapshot LBR capabilities during module initializationSean Christopherson1-4/+5
Snapshot VMX's LBR capabilities once during module initialization instead of calling into perf every time a vCPU reconfigures its vPMU. This will allow massaging the LBR capabilities, e.g. if the CPU doesn't support callstacks, without having to remember to update multiple locations. Opportunistically tag vmx_get_perf_capabilities() with __init, as it's only called from vmx_set_cpu_caps(). Reviewed-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20240307011344.835640-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09KVM: x86: Move nEPT exit_qualification field from kvm_vcpu_arch to x86_exceptionSean Christopherson1-2/+0
Move the exit_qualification field that is used to track information about in-flight nEPT violations from "struct kvm_vcpu_arch" to "x86_exception", i.e. associate the information with the actual nEPT violation instead of the vCPU. To handle bits that are pulled from vmcs.EXIT_QUALIFICATION, i.e. that are propagated from the "original" EPT violation VM-Exit, simply grab them from the VMCS on-demand when injecting a nEPT Violation or a PML Full VM-exit. Aside from being ugly, having an exit_qualification field in kvm_vcpu_arch is outright dangerous, e.g. see commit d7f0a00e438d ("KVM: VMX: Report up-to-date exit qualification to userspace"). Opportunstically add a comment to call out that PML Full and EPT Violation VM-Exits use the same bit to report NMI blocking information. Link: https://lore.kernel.org/r/20240209221700.393189-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08KVM: x86/pmu: Disable support for adaptive PEBSSean Christopherson1-2/+22
Drop support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken without an obvious/easy path forward, and because exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak host kernel addresses to the guest. Bug #1 is that KVM doesn't account for the upper 32 bits of IA32_FIXED_CTR_CTRL when (re)programming fixed counters, e.g fixed_ctrl_field() drops the upper bits, reprogram_fixed_counters() stores local variables as u8s and truncates the upper bits too, etc. Bug #2 is that, because KVM _always_ sets precise_ip to a non-zero value for PEBS events, perf will _always_ generate an adaptive record, even if the guest requested a basic record. Note, KVM will also enable adaptive PEBS in individual *counter*, even if adaptive PEBS isn't exposed to the guest, but this is benign as MSR_PEBS_DATA_CFG is guaranteed to be zero, i.e. the guest will only ever see Basic records. Bug #3 is in perf. intel_pmu_disable_fixed() doesn't clear the upper bits either, i.e. leaves ICL_FIXED_0_ADAPTIVE set, and intel_pmu_enable_fixed() effectively doesn't clear ICL_FIXED_0_ADAPTIVE either. I.e. perf _always_ enables ADAPTIVE counters, regardless of what KVM requests. Bug #4 is that adaptive PEBS *might* effectively bypass event filters set by the host, as "Updated Memory Access Info Group" records information that might be disallowed by userspace via KVM_SET_PMU_EVENT_FILTER. Bug #5 is that KVM doesn't ensure LBR MSRs hold guest values (or at least zeros) when entering a vCPU with adaptive PEBS, which allows the guest to read host LBRs, i.e. host RIPs/addresses, by enabling "LBR Entries" records. Disable adaptive PEBS support as an immediate fix due to the severity of the LBR leak in particular, and because fixing all of the bugs will be non-trivial, e.g. not suitable for backporting to stable kernels. Note! This will break live migration, but trying to make KVM play nice with live migration would be quite complicated, wouldn't be guaranteed to work (i.e. KVM might still kill/confuse the guest), and it's not clear that there are any publicly available VMMs that support adaptive PEBS, let alone live migrate VMs that support adaptive PEBS, e.g. QEMU doesn't support PEBS in any capacity. Link: https://lore.kernel.org/all/20240306230153.786365-1-seanjc@google.com Link: https://lore.kernel.org/all/ZeepGjHCeSfadANM@google.com Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS") Cc: stable@vger.kernel.org Cc: Like Xu <like.xu.linux@gmail.com> Cc: Mingwei Zhang <mizhang@google.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhang Xiong <xiong.y.zhang@intel.com> Cc: Lv Zhiyuan <zhiyuan.lv@intel.com> Cc: Dapeng Mi <dapeng1.mi@intel.com> Cc: Jim Mattson <jmattson@google.com> Acked-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20240307005833.827147-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-73/+84
Pull kvm updates from Paolo Bonzini: "S390: - Changes to FPU handling came in via the main s390 pull request - Only deliver to the guest the SCLP events that userspace has requested - More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same) - Fix selftests undefined behavior x86: - Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec - Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests) - Zero out PMU state on AMD if the virtual PMU is disabled, it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized - Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot - Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels - Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives - Fix the debugregs ABI for 32-bit KVM - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work - Cleanup the logic for checking if the currently loaded vCPU is in-kernel - Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel x86 Xen emulation: - Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior) - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs RISC-V: - Support exception and interrupt handling in selftests - New self test for RISC-V architectural timer (Sstc extension) - New extension support (Ztso, Zacas) - Support userspace emulation of random number seed CSRs ARM: - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests LoongArch: - Set reserved bits as zero in CPUCFG - Start SW timer only when vcpu is blocking - Do not restart SW timer when it is expired - Remove unnecessary CSR register saving during enter guest - Misc cleanups and fixes as usual Generic: - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines the available depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else - Factor common "select" statements in common code instead of requiring each architecture to specify it - Remove thoroughly obsolete APIs from the uapi headers - Move architecture-dependent stuff to uapi/asm/kvm.h - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker Selftests: - Reduce boilerplate especially when utilize selftest TAP infrastructure - Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory - Fix benign bugs where tests neglect to close() guest_memfd files" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits) selftests: kvm: remove meaningless assignments in Makefiles KVM: riscv: selftests: Add Zacas extension to get-reg-list test RISC-V: KVM: Allow Zacas extension for Guest/VM KVM: riscv: selftests: Add Ztso extension to get-reg-list test RISC-V: KVM: Allow Ztso extension for Guest/VM RISC-V: KVM: Forward SEED CSR access to user space KVM: riscv: selftests: Add sstc timer test KVM: riscv: selftests: Change vcpu_has_ext to a common function KVM: riscv: selftests: Add guest helper to get vcpu id KVM: riscv: selftests: Add exception handling support LoongArch: KVM: Remove unnecessary CSR register saving during enter guest LoongArch: KVM: Do not restart SW timer when it is expired LoongArch: KVM: Start SW timer only when vcpu is blocking LoongArch: KVM: Set reserved bits as zero in CPUCFG KVM: selftests: Explicitly close guest_memfd files in some gmem tests KVM: x86/xen: fix recursive deadlock in timer injection KVM: pfncache: simplify locking and make more self-contained KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled KVM: x86/xen: improve accuracy of Xen timers ...
2024-03-12Merge tag 'x86-core-2024-03-11' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code - Propagate more RIP-relative addressing in assembly code, to generate slightly better code - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options - Rework the x86 idle code to cure RCU violations and to clean up the logic - Clean up the vDSO Makefile logic - Misc cleanups and fixes * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/idle: Select idle routine only once x86/idle: Let prefer_mwait_c1_over_halt() return bool x86/idle: Cleanup idle_setup() x86/idle: Clean up idle selection x86/idle: Sanitize X86_BUG_AMD_E400 handling sched/idle: Conditionally handle tick broadcast in default_idle_call() x86: Increase brk randomness entropy for 64-bit systems x86/vdso: Move vDSO to mmap region x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o x86/retpoline: Ensure default return thunk isn't used at runtime x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32 x86/vdso: Use $(addprefix ) instead of $(foreach ) x86/vdso: Simplify obj-y addition x86/vdso: Consolidate targets and clean-files x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS ...
2024-03-12Merge tag 'x86-fred-2024-03-10' of ↵Linus Torvalds1-3/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FRED support from Thomas Gleixner: "Support for x86 Fast Return and Event Delivery (FRED). FRED is a replacement for IDT event delivery on x86 and addresses most of the technical nightmares which IDT exposes: 1) Exception cause registers like CR2 need to be manually preserved in nested exception scenarios. 2) Hardware interrupt stack switching is suboptimal for nested exceptions as the interrupt stack mechanism rewinds the stack on each entry which requires a massive effort in the low level entry of #NMI code to handle this. 3) No hardware distinction between entry from kernel or from user which makes establishing kernel context more complex than it needs to be especially for unconditionally nestable exceptions like NMI. 4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a problem when the perf NMI takes a fault when collecting a stack trace. 5) Partial restore of ESP when returning to a 16-bit segment 6) Limitation of the vector space which can cause vector exhaustion on large systems. 7) Inability to differentiate NMI sources FRED addresses these shortcomings by: 1) An extended exception stack frame which the CPU uses to save exception cause registers. This ensures that the meta information for each exception is preserved on stack and avoids the extra complexity of preserving it in software. 2) Hardware interrupt stack switching is non-rewinding if a nested exception uses the currently interrupt stack. 3) The entry points for kernel and user context are separate and GS BASE handling which is required to establish kernel context for per CPU variable access is done in hardware. 4) NMIs are now nesting protected. They are only reenabled on the return from NMI. 5) FRED guarantees full restore of ESP 6) FRED does not put a limitation on the vector space by design because it uses a central entry points for kernel and user space and the CPUstores the entry type (exception, trap, interrupt, syscall) on the entry stack along with the vector number. The entry code has to demultiplex this information, but this removes the vector space restriction. The first hardware implementations will still have the current restricted vector space because lifting this limitation requires further changes to the local APIC. 7) FRED stores the vector number and meta information on stack which allows having more than one NMI vector in future hardware when the required local APIC changes are in place. The series implements the initial FRED support by: - Reworking the existing entry and IDT handling infrastructure to accomodate for the alternative entry mechanism. - Expanding the stack frame to accomodate for the extra 16 bytes FRED requires to store context and meta information - Providing FRED specific C entry points for events which have information pushed to the extended stack frame, e.g. #PF and #DB. - Providing FRED specific C entry points for #NMI and #MCE - Implementing the FRED specific ASM entry points and the C code to demultiplex the events - Providing detection and initialization mechanisms and the necessary tweaks in context switching, GS BASE handling etc. The FRED integration aims for maximum code reuse vs the existing IDT implementation to the extent possible and the deviation in hot paths like context switching are handled with alternatives to minimalize the impact. The low level entry and exit paths are seperate due to the extended stack frame and the hardware based GS BASE swichting and therefore have no impact on IDT based systems. It has been extensively tested on existing systems and on the FRED simulation and as of now there are no outstanding problems" * tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/fred: Fix init_task thread stack pointer initialization MAINTAINERS: Add a maintainer entry for FRED x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly x86/fred: Invoke FRED initialization code to enable FRED x86/fred: Add FRED initialization functions x86/syscall: Split IDT syscall setup code into idt_syscall_init() KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled x86/traps: Add sysvec_install() to install a system interrupt handler x86/fred: FRED entry/exit and dispatch code x86/fred: Add a machine check entry stub for FRED x86/fred: Add a NMI entry stub for FRED x86/fred: Add a debug fault entry stub for FRED x86/idtentry: Incorporate definitions/declarations of the FRED entries x86/fred: Make exc_page_fault() work for FRED x86/fred: Allow single-step trap and NMI when starting a new task x86/fred: No ESPFIX needed when FRED is enabled ...
2024-03-11Merge tag 'kvm-x86-vmx-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-41/+31
KVM VMX changes for 6.9: - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace due to an unexpected VM-Exit while the CPU was vectoring an exception. - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support. - Clean up the logic for massaging the passthrough MSR bitmaps when userspace changes its MSR filter.
2024-03-11Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-32/+53
KVM x86 misc changes for 6.9: - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives (though in fairness in KMSAN, it's comically difficult to see that the uninitialized memory is never truly consumed). - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading DR6 and DR7. - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit. This allows VMX to further optimize handling preemption timer exits, and allows SVM to avoid sending a duplicate IPI (SVM also has a need to force an exit). - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, and add WARN to guard against similar bugs. - Provide a dedicated arch hook for checking if a different vCPU was in-kernel (for directed yield), and simplify the logic for checking if the currently loaded vCPU is in-kernel. - Misc cleanups and fixes.
2024-02-27KVM: VMX: Combine "check" and "get" APIs for passthrough MSR lookupsSean Christopherson1-39/+26
Combine possible_passthrough_msr_slot() and is_valid_passthrough_msr() into a single function, vmx_get_passthrough_msr_slot(), and have the combined helper return the slot on success, using a negative value to indicate "failure". Combining the operations avoids iterating over the array of passthrough MSRs twice for relevant MSRs. Suggested-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20240223202104.3330974-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27KVM: VMX: return early if msr_bitmap is not supportedDongli Zhang1-0/+3
The vmx_msr_filter_changed() may directly/indirectly calls only vmx_enable_intercept_for_msr() or vmx_disable_intercept_for_msr(). Those two functions may exit immediately if !cpu_has_vmx_msr_bitmap(). vmx_msr_filter_changed() -> vmx_disable_intercept_for_msr() -> pt_update_intercept_for_msr() -> vmx_set_intercept_for_msr() -> vmx_enable_intercept_for_msr() -> vmx_disable_intercept_for_msr() Therefore, we exit early if !cpu_has_vmx_msr_bitmap(). Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20240223202104.3330974-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27KVM: VMX: fix comment to add LBR to passthrough MSRsDongli Zhang1-1/+1
According to the is_valid_passthrough_msr(), the LBR MSRs are also passthrough MSRs, since the commit 1b5ac3226a1a ("KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE"). Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20240223202104.3330974-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23KVM: x86: Fully defer to vendor code to decide how to force immediate exitSean Christopherson1-18/+14
Now that vmx->req_immediate_exit is used only in the scope of vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp the VMX preemption to force a VM-Exit and let vendor code fully handle forcing a VM-Exit. Opportunsitically drop __kvm_request_immediate_exit() and just have vendor code call smp_send_reschedule() directly. SVM already does this when injecting an event while also trying to single-step an IRET, i.e. it's not exactly secret knowledge that KVM uses a reschedule IPI to force an exit. Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23KVM: VMX: Handle KVM-induced preemption timer exits in fastpath for L2Sean Christopherson1-2/+20
Eat VMX treemption timer exits in the fastpath regardless of whether L1 or L2 is active. The VM-Exit is 100% KVM-induced, i.e. there is nothing directly related to the exit that KVM needs to do on behalf of the guest, thus there is no reason to wait until the slow path to do nothing. Opportunistically add comments explaining why preemption timer exits for emulating the guest's APIC timer need to go down the slow path. Link: https://lore.kernel.org/r/20240110012705.506918-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23KVM: x86: Move handling of is_guest_mode() into fastpath exit handlersSean Christopherson1-3/+3
Let the fastpath code decide which exits can/can't be handled in the fastpath when L2 is active, e.g. when KVM generates a VMX preemption timer exit to forcefully regain control, there is no "work" to be done and so such exits can be handled in the fastpath regardless of whether L1 or L2 is active. Moving the is_guest_mode() check into the fastpath code also makes it easier to see that L2 isn't allowed to use the fastpath in most cases, e.g. it's not immediately obvious why handle_fastpath_preemption_timer() is called from the fastpath and the normal path. Link: https://lore.kernel.org/r/20240110012705.506918-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23KVM: VMX: Handle forced exit due to preemption timer in fastpathSean Christopherson1-5/+8
Handle VMX preemption timer VM-Exits due to KVM forcing an exit in the exit fastpath, i.e. avoid calling back into handle_preemption_timer() for the same exit. There is no work to be done for forced exits, as the name suggests the goal is purely to get control back in KVM. In addition to shaving a few cycles, this will allow cleanly separating handle_fastpath_preemption_timer() from handle_preemption_timer(), e.g. it's not immediately obvious why _apparently_ calling handle_fastpath_preemption_timer() twice on a "slow" exit is necessary: the "slow" call is necessary to handle exits from L2, which are excluded from the fastpath by vmx_vcpu_run(). Link: https://lore.kernel.org/r/20240110012705.506918-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23KVM: VMX: Re-enter guest in fastpath for "spurious" preemption timer exitsSean Christopherson1-2/+9
Re-enter the guest in the fast path if VMX preeemption timer VM-Exit was "spurious", i.e. if KVM "soft disabled" the timer by writing -1u and by some miracle the timer expired before any other VM-Exit occurred. This is just an intermediate step to cleaning up the preemption timer handling, optimizing these types of spurious VM-Exits is not interesting as they are extremely rare/infrequent. Link: https://lore.kernel.org/r/20240110012705.506918-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23KVM: x86: Plumb "force_immediate_exit" into kvm_entry() tracepointSean Christopherson1-2/+2
Annotate the kvm_entry() tracepoint with "immediate exit" when KVM is forcing a VM-Exit immediately after VM-Enter, e.g. when KVM wants to inject an event but needs to first complete some other operation. Knowing that KVM is (or isn't) forcing an exit is useful information when debugging issues related to event injection. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240110012705.506918-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23KVM: x86: Make kvm_get_dr() return a value, not use an out parameterSean Christopherson1-4/+1
Convert kvm_get_dr()'s output parameter to a return value, and clean up most of the mess that was created by forcing callers to provide a pointer. No functional change intended. Acked-by: Mathias Krause <minipli@grsecurity.net> Reviewed-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240209220752.388160-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20KVM/VMX: Move VERW closer to VMentry for MDS mitigationPawan Gupta1-4/+16
During VMentry VERW is executed to mitigate MDS. After VERW, any memory access like register push onto stack may put host data in MDS affected CPU buffers. A guest can then use MDS to sample host data. Although likelihood of secrets surviving in registers at current VERW callsite is less, but it can't be ruled out. Harden the MDS mitigation by moving the VERW mitigation late in VMentry path. Note that VERW for MMIO Stale Data mitigation is unchanged because of the complexity of per-guest conditional VERW which is not easy to handle that late in asm with no GPRs available. If the CPU is also affected by MDS, VERW is unconditionally executed late in asm regardless of guest having MMIO access. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-6-a6216d83edb7%40linux.intel.com
2024-02-20x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static keyPawan Gupta1-1/+1
The VERW mitigation at exit-to-user is enabled via a static branch mds_user_clear. This static branch is never toggled after boot, and can be safely replaced with an ALTERNATIVE() which is convenient to use in asm. Switch to ALTERNATIVE() to use the VERW mitigation late in exit-to-user path. Also remove the now redundant VERW in exc_nmi() and arch_exit_to_user_mode(). Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-4-a6216d83edb7%40linux.intel.com
2024-02-14Merge branch 'x86/bugs' into x86/core, to pick up pending changes before ↵Ingo Molnar1-1/+1
dependent patches Merge in pending alternatives patching infrastructure changes, before applying more patches. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-10work around gcc bugs with 'asm goto' with outputsLinus Torvalds1-2/+2
We've had issues with gcc and 'asm goto' before, and we created a 'asm_volatile_goto()' macro for that in the past: see commits 3f0116c3238a ("compiler/gcc4: Add quirk for 'asm goto' miscompilation bug") and a9f180345f53 ("compiler/gcc4: Make quirk for asm_volatile_goto() unconditional"). Then, much later, we ended up removing the workaround in commit 43c249ea0b1e ("compiler-gcc.h: remove ancient workaround for gcc PR 58670") because we no longer supported building the kernel with the affected gcc versions, but we left the macro uses around. Now, Sean Christopherson reports a new version of a very similar problem, which is fixed by re-applying that ancient workaround. But the problem in question is limited to only the 'asm goto with outputs' cases, so instead of re-introducing the old workaround as-is, let's rename and limit the workaround to just that much less common case. It looks like there are at least two separate issues that all hit in this area: (a) some versions of gcc don't mark the asm goto as 'volatile' when it has outputs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98619 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110420 which is easy to work around by just adding the 'volatile' by hand. (b) Internal compiler errors: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110422 which are worked around by adding the extra empty 'asm' as a barrier, as in the original workaround. but the problem Sean sees may be a third thing since it involves bad code generation (not an ICE) even with the manually added 'volatile'. but the same old workaround works for this case, even if this feels a bit like voodoo programming and may only be hiding the issue. Reported-and-tested-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/all/20240208220604.140859-1-seanjc@google.com/ Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Andrew Pinski <quic_apinski@quicinc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-02-07KVM: VMX: Report up-to-date exit qualification to userspaceChao Gao1-1/+1
Use vmx_get_exit_qual() to read the exit qualification. vcpu->arch.exit_qualification is cached for EPT violation only and even for EPT violation, it is stale at this point because the up-to-date value is cached later in handle_ept_violation(). Fixes: 70bcd708dfd1 ("KVM: vmx: expose more information for KVM_INTERNAL_ERROR_DELIVERY_EV exits") Signed-off-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20231229022652.300095-1-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handlingXin Li1-3/+9
When FRED is enabled, call fred_entry_from_kvm() to handle IRQ/NMI in IRQ/NMI induced VM exits. Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231205105030.8698-33-xin3.li@intel.com
2024-01-18Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-19/+67
Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
2024-01-10x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINEBreno Leitao1-1/+1
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options. [ mingo: Converted a few more uses in comments/messages as well. ] Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ariel Miculas <amiculas@cisco.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
2024-01-08Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-2/+53
KVM x86 support for virtualizing Linear Address Masking (LAM) Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality checks for most virtual address usage in 64-bit mode, such that only the most significant bit of the untranslated address bits must match the polarity of the last translated address bit. This allows software to use ignored, untranslated address bits for metadata, e.g. to efficiently tag pointers for address sanitization. LAM can be enabled separately for user pointers and supervisor pointers, and for userspace LAM can be select between 48-bit and 57-bit masking - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15. - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6. For user pointers, LAM enabling utilizes two previously-reserved high bits from CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and 61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.: - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.: - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers The modified LAM canonicality checks: - LAM_S48 : [ 1 ][ metadata ][ 1 ] 63 47 - LAM_U48 : [ 0 ][ metadata ][ 0 ] 63 47 - LAM_S57 : [ 1 ][ metadata ][ 1 ] 63 56 - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ] 63 56 - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ] 63 56..47 The bulk of KVM support for LAM is to emulate LAM's modified canonicality checks. The approach taken by KVM is to "fill" the metadata bits using the highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value from the raw, untagged virtual address is kept for the canonicality check. This untagging allows Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26], enabling via CR3 and CR4 bits, etc.
2024-01-03arch/x86: Fix typosBjorn Helgaas1-1/+1
Fix typos, most reported by "codespell arch/x86". Only touches comments, no code changes. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2023-12-07KVM: nVMX: Hide more stuff under CONFIG_KVM_HYPERVVitaly Kuznetsov1-0/+3
'hv_evmcs_vmptr'/'hv_evmcs_map'/'hv_evmcs' fields in 'struct nested_vmx' should not be used when !CONFIG_KVM_HYPERV, hide them when !CONFIG_KVM_HYPERV. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-16-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: x86: Make Hyper-V emulation optionalVitaly Kuznetsov1-0/+2
Hyper-V emulation in KVM is a fairly big chunk and in some cases it may be desirable to not compile it in to reduce module sizes as well as the attack surface. Introduce CONFIG_KVM_HYPERV option to make it possible. Note, there's room for further nVMX/nSVM code optimizations when !CONFIG_KVM_HYPERV, this will be done in follow-up patches. Reorganize Makefile a bit so all CONFIG_HYPERV and CONFIG_KVM_HYPERV files are grouped together. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: VMX: Split off vmx_onhyperv.{ch} from hyperv.{ch}Vitaly Kuznetsov1-0/+1
hyperv.{ch} is currently a mix of stuff which is needed by both Hyper-V on KVM and KVM on Hyper-V. As a preparation to making Hyper-V emulation optional, put KVM-on-Hyper-V specific code into dedicated files. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-4-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: x86: Move Hyper-V partition assist page out of Hyper-V emulation contextVitaly Kuznetsov1-11/+3
Hyper-V partition assist page is used when KVM runs on top of Hyper-V and is not used for Windows/Hyper-V guests on KVM, this means that 'hv_pa_pg' placement in 'struct kvm_hv' is unfortunate. As a preparation to making Hyper-V emulation optional, move 'hv_pa_pg' to 'struct kvm_arch' and put it under CONFIG_HYPERV. While on it, introduce hv_get_partition_assist_page() helper to allocate partition assist page. Move the comment explaining why we use a single page for all vCPUs from VMX and expand it a bit. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-29KVM: x86: Use KVM-governed feature framework to track "LAM enabled"Binbin Wu1-0/+1
Use the governed feature framework to track if Linear Address Masking (LAM) is "enabled", i.e. if LAM can be used by the guest. Using the framework to avoid the relative expensive call guest_cpuid_has() during cr3 and vmexit handling paths for LAM. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-14-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-29KVM: x86: Virtualize LAM for user pointerRobert Hoo1-3/+9
Add support to allow guests to set the new CR3 control bits for Linear Address Masking (LAM) and add implementation to get untagged address for user pointers. LAM modifies the canonical check for 64-bit linear addresses, allowing software to use the masked/ignored address bits for metadata. Hardware masks off the metadata bits before using the linear addresses to access memory. LAM uses two new CR3 non-address bits, LAM_U48 (bit 62) and LAM_U57 (bit 61), to configure LAM for user pointers. LAM also changes VMENTER to allow both bits to be set in VMCS's HOST_CR3 and GUEST_CR3 for virtualization. When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set any of the two LAM control bits. However, when EPT is off, the actual CR3 used by the guest is generated from the shadow MMU root which is different from the CR3 that is *set* by the guest, and KVM needs to manually apply any active control bits to VMCS's GUEST_CR3 based on the cached CR3 *seen* by the guest. KVM manually checks guest's CR3 to make sure it points to a valid guest physical address (i.e. to support smaller MAXPHYSADDR in the guest). Extend this check to allow the two LAM control bits to be set. After check, LAM bits of guest CR3 will be stripped off to extract guest physical address. In case of nested, for a guest which supports LAM, both VMCS12's HOST_CR3 and GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when L0 enters L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2 directly. KVM also manually checks VMCS12's HOST_CR3 and GUEST_CR3 being valid physical address. Extend such check to allow the new LAM control bits too. Note, LAM doesn't have a global control bit to turn on/off LAM completely, but purely depends on hardware's CPUID to determine it can be enabled or not. That means, when EPT is on, even when KVM doesn't expose LAM to guest, the guest can still set LAM control bits in CR3 w/o causing problem. This is an unfortunate virtualization hole. KVM could choose to intercept CR3 in this case and inject fault but this would hurt performance when running a normal VM w/o LAM support. This is undesirable. Just choose to let the guest do such illegal thing as the worst case is guest being killed when KVM eventually find out such illegal behaviour and that the guest is misbehaving. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-12-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>