path: root/arch/x86/include/asm/kvm_host.h
2024-06-05  KVM: x86: Print names of apicv inhibit reasons in traces  (Alejandro Jimenez, 1 file, -0/+19)
Use the tracing infrastructure helper __print_flags() for printing flag bitfields, to enhance the trace output by displaying a string describing each of the inhibit reasons set. The kvm_apicv_inhibit_changed tracepoint currently shows the raw bitmap value, requiring the user to consult the source file where the inhibit reasons are defined to decode the trace output. Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Link: https://lore.kernel.org/r/20240506225321.3440701-2-alejandro.j.jimenez@oracle.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
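[Editor's illustration, not part of the patch: the rendering __print_flags() performs is essentially the following; this is a minimal standalone C sketch, and the inhibit-reason names are hypothetical stand-ins for the real list in kvm_host.h.]

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for KVM's APICv inhibit reason bits. */
    #define INHIBIT_DISABLED (1ULL << 0)
    #define INHIBIT_NESTED   (1ULL << 1)
    #define INHIBIT_IRQWIN   (1ULL << 2)

    static const struct { uint64_t bit; const char *name; } reasons[] = {
        { INHIBIT_DISABLED, "DISABLED" },
        { INHIBIT_NESTED,   "NESTED"   },
        { INHIBIT_IRQWIN,   "IRQWIN"   },
    };

    /* Decode a bitmap into "NESTED|IRQWIN", roughly what __print_flags() emits. */
    static void print_inhibits(uint64_t map)
    {
        const char *sep = "";
        for (unsigned i = 0; i < sizeof(reasons) / sizeof(reasons[0]); i++) {
            if (map & reasons[i].bit) {
                printf("%s%s", sep, reasons[i].name);
                sep = "|";
            }
        }
        printf("\n");
    }

    int main(void)
    {
        print_inhibits(INHIBIT_NESTED | INHIBIT_IRQWIN);  /* prints NESTED|IRQWIN */
        return 0;
    }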
2024-06-05  KVM: x86: Make nanoseconds per APIC bus cycle a VM variable  (Isaku Yamahata, 1 file, -0/+1)
Introduce the VM variable "nanoseconds per APIC bus cycle" in preparation to make the APIC bus frequency configurable. The TDX architecture hard-codes the core crystal clock frequency to 25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture does not allow the VMM to override the value. In addition, per Intel SDM: "The APIC timer frequency will be the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register." The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded APIC bus frequency of 1GHz. Introduce the VM variable "nanoseconds per APIC bus cycle" to prepare for allowing userspace to tell KVM to use the frequency that TDX mandates instead of the default 1GHz. Doing so ensures that the guest doesn't have a conflicting view of the APIC bus frequency. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> [reinette: rework changelog] Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/ae75ce37c6c38bb4efd10a0a41932984c40b24ac.1714081726.git.reinette.chatre@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
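[Editor's illustration: a quick sanity check on the arithmetic above, converting the two bus frequencies into the new per-VM variable; a standalone sketch, not the kernel code.]

    #include <stdio.h>
    #include <stdint.h>

    /* Nanoseconds per APIC bus cycle for a given bus frequency in Hz. */
    static uint64_t apic_bus_cycle_ns(uint64_t freq_hz)
    {
        return 1000000000ULL / freq_hz;
    }

    int main(void)
    {
        /* KVM's historical default: 1GHz -> 1 ns per cycle. */
        printf("1 GHz  -> %llu ns\n", (unsigned long long)apic_bus_cycle_ns(1000000000ULL));
        /* TDX-mandated core crystal clock: 25MHz -> 40 ns per cycle. */
        printf("25 MHz -> %llu ns\n", (unsigned long long)apic_bus_cycle_ns(25000000ULL));
        return 0;
    }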
2024-06-04  KVM: x86/pmu: Change ambiguous _mask suffix to _rsvd in kvm_pmu  (Dapeng Mi, 1 file, -5/+5)
Several '_mask' suffixed variables, such as global_ctrl_mask, are defined in the kvm_pmu structure. However, the _mask suffix is ambiguous and misleading since it's not a real mask with positive logic. On the contrary, it represents the reserved bits of the corresponding MSRs, and those bits should not be accessed. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240430005239.13527-2-dapeng1.mi@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
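[Editor's illustration of the naming distinction, with an invented field layout: "_rsvd" semantics reject writes that touch reserved bits, whereas a positive-logic "_mask" would select the valid bits.]

    #include <stdbool.h>
    #include <stdint.h>

    /* Invented layout: pretend only bits 7:0 and 34:32 are writable. */
    #define GLOBAL_CTRL_RSVD  (~(0xffULL | (0x7ULL << 32)))

    /* "_rsvd" semantics: reject a write that touches any reserved bit... */
    static bool write_ok_rsvd(uint64_t data)
    {
        return !(data & GLOBAL_CTRL_RSVD);
    }

    /* ...whereas a positive-logic "_mask" selects the valid bits. */
    static uint64_t keep_valid_mask(uint64_t data)
    {
        return data & ~GLOBAL_CTRL_RSVD;
    }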
2024-06-03  Merge branch 'kvm-6.11-sev-snp' into HEAD  (Paolo Bonzini, 1 file, -0/+5)
Pull base x86 KVM support for running SEV-SNP guests from Michael Roth:

* add some basic infrastructure and introduce a new KVM_X86_SNP_VM vm_type to handle differences versus the existing KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types.
* implement the KVM API to handle the creation of a cryptographic launch context, encrypt/measure the initial image into guest memory, and finalize it before launching it.
* implement handling for various guest-generated events such as page state changes, onlining of additional vCPUs, etc.
* implement the gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges, as well as cleaning them up prior to returning them to the host for use as normal memory. Because those cleanup hooks supplant certain activities like issuing WBINVDs during KVM MMU invalidations, avoid duplicating that work to avoid unnecessary overhead.

This merge leaves out support for attestation guest requests and for loading the signing keys to be used for attestation requests.
2024-06-03  KVM: x86: Add a struct to consolidate host values, e.g. EFER, XCR0, etc...  (Sean Christopherson, 1 file, -1/+0)
Add "struct kvm_host_values kvm_host" to hold the various host values that KVM snapshots during initialization. Bundling the host values into a single struct simplifies adding new MSRs and other features with host state/values that KVM cares about, and provides a one-stop shop. E.g. adding a new value requires one line, whereas tracking each value individual often requires three: declaration, definition, and export. No functional change intended. Link: https://lore.kernel.org/r/20240423221521.2923759-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-23  KVM: x86/mmu: Print SPTEs on unexpected #VE  (Sean Christopherson, 1 file, -0/+1)
Print the SPTEs that correspond to the faulting GPA on an unexpected EPT Violation #VE to help the user debug failures, e.g. to pinpoint which SPTE didn't have SUPPRESS_VE set. Opportunistically assert that the underlying exit reason was indeed an EPT Violation, as the CPU has *really* gone off the rails if a #VE occurs due to a completely unexpected exit reason. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12  KVM: SEV: Support SEV-SNP AP Creation NAE event  (Tom Lendacky, 1 file, -0/+1)
Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP guests to alter the register state of the APs on their own, giving the guest a way of simulating INIT-SIPI. A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used so as to avoid updating the VMSA pointer while the vCPU is running.

For CREATE: The guest supplies the GPA of the VMSA to be used for the vCPU with the specified APIC ID. The GPA is saved in the svm struct of the target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to the vCPU and then the vCPU is kicked.

For CREATE_ON_INIT: The guest supplies the GPA of the VMSA to be used for the vCPU with the specified APIC ID the next time an INIT is performed. The GPA is saved in the svm struct of the target vCPU.

For DESTROY: The guest indicates it wishes to stop the vCPU. The GPA is cleared from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to the vCPU and then the vCPU is kicked.

The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked as a result of the event or as a result of an INIT. If a new VMSA is to be installed, the VMSA guest page is set as the VMSA in the vCPU VMCB and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is cleared in the vCPU VMCB and the vCPU state is set to KVM_MP_STATE_HALTED to prevent it from being run.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Message-ID: <20240501085210.2213060-13-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12  KVM: SEV: Add support to handle RMP nested page faults  (Brijesh Singh, 1 file, -0/+1)
When SEV-SNP is enabled in the guest, the hardware places restrictions on all memory accesses based on the contents of the RMP table. When hardware encounters an RMP check failure caused by a guest memory access, it raises a #NPF. The error code contains additional information on the access type. See the APM volume 2 for additional information. When using gmem, RMP faults resulting from mismatches between the state in the RMP table vs. what the guest expects via its page table result in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This means the only expected case that needs to be handled in the kernel is when the page size of the entry in the RMP table is larger than the mapping in the nested page table, in which case a PSMASH instruction needs to be issued to split the large RMP entry into individual 4K entries so that subsequent accesses can succeed. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Message-ID: <20240501085210.2213060-12-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12  Merge branch 'kvm-coco-hooks' into HEAD  (Paolo Bonzini, 1 file, -0/+3)
Common patches for the target-independent functionality and hooks that are needed by SEV-SNP and TDX.
2024-05-12  Merge tag 'kvm-x86-vmx-6.10' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini, 1 file, -3/+0)
KVM VMX changes for 6.10:

- Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to L1, as per the SDM.
- Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is used only when synthesizing a nested EPT violation, i.e. it's not the vCPU's "real" exit_qualification, which is tracked elsewhere.
- Add a sanity check to assert that EPT Violations are the only sources of nested PML Full VM-Exits.
2024-05-10  Merge tag 'loongarch-kvm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD  (Paolo Bonzini, 1 file, -0/+1)
LoongArch KVM changes for v6.10:
1. Add ParaVirt IPI support.
2. Add software breakpoint support.
3. Add mmio trace events support.
2024-05-10  Merge branch 'kvm-coco-pagefault-prep' into HEAD  (Paolo Bonzini, 1 file, -21/+25)
A combination of prep work for TDX and SNP, and a cleanup of the page fault path to (hopefully) make it easier to follow the rules for private memory, noslot faults, writes to read-only slots, etc.
2024-05-10  KVM: x86: Add hook for determining max NPT mapping level  (Michael Roth, 1 file, -0/+1)
In the case of SEV-SNP, whether or not a 2MB page can be mapped via a 2MB mapping in the guest's nested page table depends on whether or not any subpages within the range have already been initialized as private in the RMP table. The existing mixed-attribute tracking in KVM is insufficient here, for instance:

- gmem allocates 2MB page
- guest issues PVALIDATE on 2MB page
- guest later converts a subpage to shared
- SNP host code issues PSMASH to split 2MB RMP mapping to 4K
- KVM MMU splits NPT mapping to 4K
- guest later converts that shared page back to private

At this point there are no mixed attributes, and KVM would normally allow for 2MB NPT mappings again, but this is actually not allowed because the RMP table mappings are 4K and cannot be promoted on the hypervisor side, so the NPT mappings must still be limited to 4K to match this. Add a hook to determine the max NPT mapping size in situations like this. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240501085210.2213060-3-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
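[Editor's illustration: shape-wise, the hook amounts to clamping the MMU's desired level by whatever the platform reports; a rough standalone sketch with invented names, not the upstream code.]

    /* Page mapping levels as used by the x86 MMU. */
    enum { PG_LEVEL_4K = 1, PG_LEVEL_2M, PG_LEVEL_1G };

    typedef unsigned long long gfn_t;

    /* Invented vendor hook: reports the largest level the RMP table allows. */
    struct vendor_ops {
        int (*private_max_mapping_level)(gfn_t gfn);
    };

    static int clamp_mapping_level(const struct vendor_ops *ops, gfn_t gfn,
                                   int req_level)
    {
        int lim = ops->private_max_mapping_level ?
                  ops->private_max_mapping_level(gfn) : req_level;

        return req_level < lim ? req_level : lim;
    }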
2024-05-10  KVM: guest_memfd: Add hook for invalidating memory  (Michael Roth, 1 file, -0/+1)
In some cases, like with SEV-SNP, guest memory needs to be updated in a platform-specific manner before it can be safely freed back to the host. Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to allow for special handling of this sort when freeing memory in response to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go ahead and define an arch-specific hook for x86 since it will be needed for handling memory used for SEV-SNP guests. Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231230172351.574091-6-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-10  KVM: guest_memfd: Add hook for initializing memory  (Paolo Bonzini, 1 file, -0/+1)
guest_memfd pages are generally expected to be in some arch-defined initial state prior to using them for guest memory. For SEV-SNP this initial state is 'private', or 'guest-owned', and requires additional operations to move these pages into a 'private' state by updating the corresponding entries in the RMP table. Allow for an arch-defined hook to handle updates of this sort, and go ahead and implement one for x86 so KVM implementations like AMD SVM can register a kvm_x86_ops callback to handle these updates for SEV-SNP guests. The preparation callback is always called when allocating/grabbing folios via gmem, and it is up to the architecture to keep track of whether or not the pages are already in the expected state (e.g. the RMP table in the case of SEV-SNP). In some cases, it is necessary to defer the preparation of the pages to handle things like in-place encryption of initial guest memory payloads before marking these pages as 'private'/'guest-owned'. Add an argument (always true for now) to kvm_gmem_get_folio() that allows for the preparation callback to be bypassed. To detect possible issues in the way userspace initializes memory, it is only possible to add an unprepared page if it is not already included in the filemap. Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/ Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231230172351.574091-5-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07  KVM: x86/mmu: check for invalid async page faults involving private memory  (Paolo Bonzini, 1 file, -0/+1)
Right now the error code is not used when an async page fault is completed. This is not a problem in the current code, but it is untidy. For protected VMs, we will also need to check that the page attributes match the current state of the page, because asynchronous page faults can only occur on shared pages (private pages go through kvm_faultin_pfn_private() instead of __gfn_to_pfn_memslot()). Start by piping the error code from kvm_arch_setup_async_pf() to kvm_arch_async_page_ready() via the architecture-specific async page fault data. For now, it can be used to assert that there are no async page faults on private memory. Extracted from a patch by Isaku Yamahata. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07  KVM: x86/mmu: Use synthetic page fault error code to indicate private faults  (Sean Christopherson, 1 file, -1/+6)
Add and use a synthetic, KVM-defined page fault error code to indicate whether a fault is to private vs. shared memory. TDX and SNP have different mechanisms for reporting private vs. shared, and KVM's software-protected VMs have no mechanism at all. Usurp an error code flag to avoid having to plumb another parameter to kvm_mmu_page_fault() and friends. Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it for TDX and software-protected VMs as appropriate, but that would require *clearing* the flag for SEV and SEV-ES VMs, which support encrypted memory at the hardware layer, but don't utilize private memory at the KVM layer. Opportunistically add a comment to call out that the logic for software- protected VMs is (and was before this commit) broken for nested MMUs, i.e. for nested TDP, as the GPA is an L2 GPA. Punt on trying to play nice with nested MMUs as there is a _lot_ of functionality that simply doesn't work for software-protected VMs, e.g. all of the paths where KVM accesses guest memory need to be updated to be aware of private vs. shared memory. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20240228024147.41573-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07  KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler  (Sean Christopherson, 1 file, -0/+6)
Move the sanity check that hardware never sets bits that collide with KVM-defined synthetic bits from kvm_mmu_page_fault() to npf_interception(), i.e. make the sanity check #NPF specific. The legacy #PF path already WARNs if _any_ of bits 63:32 are set, and the error code that comes from VMX's EPT Violation and Misconfig is 100% synthesized (KVM morphs VMX's EXIT_QUALIFICATION into error code flags). Add a compile-time assert in the legacy #PF handler to make sure that KVM-defined flags are covered by its existing sanity check on the upper bits. Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we are removing the comment that defined it. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20240228024147.41573-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07  KVM: x86: Define more SEV+ page fault error bits/flags for #NPF  (Sean Christopherson, 1 file, -0/+4)
Define more #NPF error code flags that are relevant to SEV+ (mostly SNP) guests, as specified by the APM:

* Bit 31 (RMP): Set to 1 if the fault was caused due to an RMP check or a VMPL check failure, 0 otherwise.
* Bit 34 (ENC): Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
* Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
* Bit 36 (VMPL): Set to 1 if the fault was caused by a VMPL permission check failure, 0 otherwise.

Note, the APM is *extremely* misleading, and strongly implies that the above flags can _only_ be set for #NPF exits from SNP guests. That is a lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_ flavor of SEV guest on SNP capable hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
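[Editor's illustration: the four flags expressed as mask-style defines, with bit positions from the APM as quoted above. PFERR_GUEST_ENC_MASK appears by name elsewhere in this log; the other three names are assumed to follow the same convention.]

    #define PFERR_GUEST_RMP_MASK    (1ULL << 31)  /* RMP or VMPL check failure */
    #define PFERR_GUEST_ENC_MASK    (1ULL << 34)  /* guest's effective C-bit was 1 */
    #define PFERR_GUEST_SIZEM_MASK  (1ULL << 35)  /* PVALIDATE/RMPADJUST size mismatch */
    #define PFERR_GUEST_VMPL_MASK   (1ULL << 36)  /* VMPL permission check failure */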
2024-05-07  KVM: x86: Remove separate "bit" defines for page fault error code masks  (Sean Christopherson, 1 file, -22/+10)
Open code the bit number directly in the PFERR_* masks and drop the intermediate PFERR_*_BIT defines, as having to bounce through two macros just to see which flag corresponds to which bit is quite annoying, as is having to define two macros just to add recognition of a new flag. Use ternary operator to derive the bit in permission_fault(), the one function that actually needs the bit number as part of clever shifting to avoid conditional branches. Generally the compiler is able to turn it into a conditional move, and if not it's not really a big deal. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240228024147.41573-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19  KVM: x86/mmu: Track shadow MMIO value on a per-VM basis  (Sean Christopherson, 1 file, -0/+2)
TDX will use a different shadow PTE entry value for MMIO from VMX. Add a member to kvm_arch and track the value for MMIO per-VM instead of in a global variable. By using the per-VM EPT entry value for MMIO, the existing VMX logic is kept working. Introduce a separate setter function so that a guest TD can use a different value later. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <229a18434e5d83f45b1fcd7bf1544d79db1becb6.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12  KVM: x86: Split core of hypercall emulation to helper function  (Sean Christopherson, 1 file, -0/+4)
By necessity, TDX will use a different register ABI for hypercalls. Break out the core functionality so that it may be reused for TDX. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11  KVM: x86: add fields to struct kvm_arch for CoCo features  (Paolo Bonzini, 1 file, -2/+5)
Some VM types have characteristics in common; in fact, the only use of VM types right now is kvm_arch_has_private_mem and it assumes that _all_ nonzero VM types have private memory. We will soon introduce a VM type for SEV and SEV-ES VMs, and at that point we will have two special characteristics of confidential VMs that depend on the VM type: not just whether memory is private, but also whether guest state is protected. For the latter we have kvm->arch.guest_state_protected, which is only set on a fully initialized VM. For VM types with protected guest state, we can actually fix a problem in the SEV-ES implementation, where ioctls to set registers do not cause an error even if the VM has been initialized and the guest state encrypted. Make sure that, when using these VM types, such ioctls become an error. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20240209183743.22030-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240404121327.3107131-8-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11  KVM: introduce new vendor op for KVM_GET_DEVICE_ATTR  (Paolo Bonzini, 1 file, -0/+1)
Allow vendor modules to provide their own attributes on /dev/kvm. To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR and KVM_GET_DEVICE_ATTR in terms of the same function. You're not supposed to use KVM_GET_DEVICE_ATTR to do complicated computations, especially on /dev/kvm. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240404121327.3107131-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11  KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible  (Sean Christopherson, 1 file, -0/+1)
Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with helpers to check if a vCPU is compatible AMD vs. Intel. To handle Intel vs. AMD behavior related to masking the LVTPC entry, KVM will need to check for vendor compatibility on every PMI injection, i.e. querying for AMD will soon be a moderately hot path. Note! This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's default behavior, both if userspace omits (or never sets) CPUID 0x0 and if userspace sets a completely unknown vendor. One could argue that KVM should treat such vCPUs as not being compatible with Intel *or* AMD, but that would add useless complexity to KVM. KVM needs to do *something* in the face of vendor specific behavior, and so unless KVM conjured up a magic third option, choosing to treat unknown vendors as neither Intel nor AMD means that checks on AMD compatibility would yield Intel behavior, and checks for Intel compatibility would yield AMD behavior. And that's far worse as it would effectively yield random behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs. !Intel. And practically speaking, all x86 CPUs follow either Intel or AMD architecture, i.e. "supporting" an unknown third architecture adds no value. Deliberately don't convert any of the existing guest_cpuid_is_intel() checks, as the Intel side of things is messier due to some flows explicitly checking for exactly vendor==Intel, versus some flows assuming anything that isn't "AMD compatible" gets Intel behavior. The Intel code will be cleaned up in the future. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240405235603.1173076-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
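[Editor's illustration: a minimal sketch of the snapshot. The vendor strings are the real CPUID vendor IDs for AMD and Hygon; the surrounding structure and function names are illustrative, not the upstream code.]

    #include <stdbool.h>
    #include <string.h>

    struct vcpu_arch {
        bool is_amd_compatible;
    };

    /*
     * Compute vendor compatibility once, when userspace sets CPUID leaf 0x0,
     * rather than re-deriving it on every PMI. An unknown or unset vendor
     * falls through to Intel-compatible behavior, per the reasoning above.
     */
    static void snapshot_vendor(struct vcpu_arch *arch, const char *vendor)
    {
        arch->is_amd_compatible = !strcmp(vendor, "AuthenticAMD") ||
                                  !strcmp(vendor, "HygonGenuine");
    }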
2024-04-09  KVM: x86: Move nEPT exit_qualification field from kvm_vcpu_arch to x86_exception  (Sean Christopherson, 1 file, -3/+0)
Move the exit_qualification field that is used to track information about in-flight nEPT violations from "struct kvm_vcpu_arch" to "x86_exception", i.e. associate the information with the actual nEPT violation instead of the vCPU. To handle bits that are pulled from vmcs.EXIT_QUALIFICATION, i.e. that are propagated from the "original" EPT violation VM-Exit, simply grab them from the VMCS on-demand when injecting a nEPT Violation or a PML Full VM-exit. Aside from being ugly, having an exit_qualification field in kvm_vcpu_arch is outright dangerous, e.g. see commit d7f0a00e438d ("KVM: VMX: Report up-to-date exit qualification to userspace"). Opportunistically add a comment to call out that PML Full and EPT Violation VM-Exits use the same bit to report NMI blocking information. Link: https://lore.kernel.org/r/20240209221700.393189-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-15  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds, 1 file, -6/+22)
Pull kvm updates from Paolo Bonzini:

"S390:

- Changes to FPU handling came in via the main s390 pull request
- Only deliver to the guest the SCLP events that userspace has requested
- More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same)
- Fix selftests undefined behavior

x86:

- Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support for the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec
- Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDPMC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests)
- Zero out PMU state on AMD if the virtual PMU is disabled; it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized
- Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest
- Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit
- Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code
- Add a VMX flag in /proc/cpuinfo to report 5-level EPT support
- Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot
- Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels
- Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization
- Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives
- Fix the debugregs ABI for 32-bit KVM
- Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD
- Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work
- Cleanup the logic for checking if the currently loaded vCPU is in-kernel
- Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsewhere in the kernel) are detected earlier and are less likely to hang the kernel

x86 Xen emulation:

- Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same
- When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation
- Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior)
- Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs

RISC-V:

- Support exception and interrupt handling in selftests
- New self test for RISC-V architectural timer (Sstc extension)
- New extension support (Ztso, Zacas)
- Support userspace emulation of random number seed CSRs

ARM:

- Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers
- Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it
- Conversion of KVM's representation of LPIs to an xarray, utilized to address some of the serialization on the LPI injection path
- Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register
- Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests

LoongArch:

- Set reserved bits as zero in CPUCFG
- Start SW timer only when vcpu is blocking
- Do not restart SW timer when it is expired
- Remove unnecessary CSR register saving during enter guest
- Misc cleanups and fixes as usual

Generic:

- Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines availability depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else
- Factor common "select" statements in common code instead of requiring each architecture to specify it
- Remove thoroughly obsolete APIs from the uapi headers
- Move architecture-dependent stuff to uapi/asm/kvm.h
- Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded
- Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker

Selftests:

- Reduce boilerplate especially when utilizing the selftest TAP infrastructure
- Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory
- Fix benign bugs where tests neglect to close() guest_memfd files"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
  selftests: kvm: remove meaningless assignments in Makefiles
  KVM: riscv: selftests: Add Zacas extension to get-reg-list test
  RISC-V: KVM: Allow Zacas extension for Guest/VM
  KVM: riscv: selftests: Add Ztso extension to get-reg-list test
  RISC-V: KVM: Allow Ztso extension for Guest/VM
  RISC-V: KVM: Forward SEED CSR access to user space
  KVM: riscv: selftests: Add sstc timer test
  KVM: riscv: selftests: Change vcpu_has_ext to a common function
  KVM: riscv: selftests: Add guest helper to get vcpu id
  KVM: riscv: selftests: Add exception handling support
  LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
  LoongArch: KVM: Do not restart SW timer when it is expired
  LoongArch: KVM: Start SW timer only when vcpu is blocking
  LoongArch: KVM: Set reserved bits as zero in CPUCFG
  KVM: selftests: Explicitly close guest_memfd files in some gmem tests
  KVM: x86/xen: fix recursive deadlock in timer injection
  KVM: pfncache: simplify locking and make more self-contained
  KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
  KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
  KVM: x86/xen: improve accuracy of Xen timers
  ...
2024-03-12  Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -0/+1)
Pull x86 SEV updates from Borislav Petkov:

- Add the x86 part of the SEV-SNP host support. This will allow the kernel to be used as a KVM hypervisor capable of running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP is the ultimate goal of the AMD confidential computing side, providing the most comprehensive confidential computing environment up to date. This is the x86 part and there is a KVM part which did not get ready in time for the merge window, so the latter will be forthcoming in the next cycle.
- Rework the early code's position-dependent SEV variable references in order to allow building the kernel with clang and -fPIE/-fPIC and -mcmodel=kernel
- The usual set of fixes, cleanups and improvements all over the place

* tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/sev: Disable KMSAN for memory encryption TUs
  x86/sev: Dump SEV_STATUS
  crypto: ccp - Have it depend on AMD_IOMMU
  iommu/amd: Fix failure return from snp_lookup_rmpentry()
  x86/sev: Fix position dependent variable references in startup code
  crypto: ccp: Make snp_range_list static
  x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
  Documentation: virt: Fix up pre-formatted text block for SEV ioctls
  crypto: ccp: Add the SNP_SET_CONFIG command
  crypto: ccp: Add the SNP_COMMIT command
  crypto: ccp: Add the SNP_PLATFORM_STATUS command
  x86/cpufeatures: Enable/unmask SEV-SNP CPU feature
  KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
  crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump
  iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown
  crypto: ccp: Handle legacy SEV commands when SNP is enabled
  crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled
  crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  x86/sev: Introduce an SNP leaked pages list
  crypto: ccp: Provide an API to issue SEV and SNP commands
  ...
2024-03-11  Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini, 1 file, -1/+10)
KVM x86 PMU changes for 6.9:

- Fix several bugs where KVM speciously prevents the guest from utilizing fixed counters and architectural event encodings based on whether or not guest CPUID reports support for the _architectural_ encoding.
- Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads, priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.
- Add a selftest to verify KVM correctly emulates RDPMC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID, i.e. are difficult to validate via KVM-Unit-Tests.
- Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting cycles, e.g. when checking if a PMC event needs to be synthesized when skipping an instruction.
- Optimize triggering of emulated events, e.g. for "count instructions" events when skipping an instruction, which yields a ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest.
- Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit.
2024-03-11  Merge tag 'kvm-x86-mmu-6.9' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini, 1 file, -0/+9)
KVM x86 MMU changes for 6.9:

- Clean up code related to unprotecting shadow pages when retrying a guest instruction after failed #PF-induced emulation.
- Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if a reschedule is needed, e.g. if a high priority task needs to run. Because KVM doesn't support yielding in the middle of processing a zapped non-leaf SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when attempting to schedule in a high priority task.
- Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
- Allocate write-tracking metadata on-demand to avoid the memory overhead when running kernels built with KVMGT support (external write-tracking enabled), but for workloads that don't use nested virtualization (shadow paging) or KVMGT.
2024-03-11  Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini, 1 file, -5/+3)
KVM x86 misc changes for 6.9:

- Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives (though in fairness to KMSAN, it's comically difficult to see that the uninitialized memory is never truly consumed).
- Fix the debugregs ABI for 32-bit KVM, and clean up code related to reading DR6 and DR7.
- Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit. This allows VMX to further optimize handling preemption timer exits, and allows SVM to avoid sending a duplicate IPI (SVM also has a need to force an exit).
- Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, and add WARN to guard against similar bugs.
- Provide a dedicated arch hook for checking if a different vCPU was in-kernel (for directed yield), and simplify the logic for checking if the currently loaded vCPU is in-kernel.
- Misc cleanups and fixes.
2024-02-27  kvm/x86: allocate the write-tracking metadata on-demand  (Andrei Vagin, 1 file, -0/+9)
The write-track is used externally only by the gpu/drm/i915 driver. Currently, it is always enabled if a kernel has been compiled with this driver. Enabling the write-track mechanism adds a two-byte overhead per page across all memory slots. It isn't significant for regular VMs. However, in gVisor, where the entire process virtual address space is mapped into the VM, even with a 39-bit address space, the overhead amounts to 256MB. Rework the write-tracking mechanism to enable it on-demand in kvm_page_track_register_notifier. Here is Sean's comment about the locking scheme: The only potential hiccup would be if taking slots_arch_lock would deadlock, but it should be impossible for slots_arch_lock to be taken in any other path that involves VFIO and/or KVMGT *and* can be coincident. Except for kvm_arch_destroy_vm() (which deletes KVM's internal memslots), slots_arch_lock is taken only through KVM ioctls(), and the caller of kvm_page_track_register_notifier() *must* hold a reference to the VM. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrei Vagin <avagin@google.com> Link: https://lore.kernel.org/r/20240213192340.2023366-1-avagin@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27  KVM: x86/pmu: Explicitly check NMI from guest to reduce false positives  (Like Xu, 1 file, -1/+9)
Explicitly check that the source of an external interrupt is indeed an NMI in kvm_arch_pmi_in_guest(), which reduces perf-kvm false positive samples (host samples labelled as guest samples) generated by perf/core NMI mode if an NMI arrives after VM-Exit, but before kvm_after_interrupt():

  # test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
  # Symbol                              Overhead  sys     usr     guest sys  guest usr
  # ..................................  ........  ......  ......  .........  .........
  #
  # Before:
    [g] entry_SYSCALL_64                  24.63%   0.00%   0.00%     24.63%      0.00%
    [g] syscall_return_via_sysret         23.23%   0.00%   0.00%     23.23%      0.00%
    [g] files_lookup_fd_raw                6.35%   0.00%   0.00%      6.35%      0.00%
  # After:
    [k] perf_adjust_freq_unthr_context    57.23%  57.23%   0.00%      0.00%      0.00%
    [k] __vmx_vcpu_run                     4.09%   4.09%   0.00%      0.00%      0.00%
    [k] vmx_update_host_rsp                3.17%   3.17%   0.00%      0.00%      0.00%

In the above case, perf records the samples labelled '[g]', the RIPs behind the weird samples are actually being queried by perf_instruction_pointer() after determining whether it's in GUEST state or not, and here's the issue: If VM-Exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and at least one PMU counter is enabled on the host, kvm_arch_pmi_in_guest() will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt(). During this window, if a PMI occurs on the host (since the KVM instructions on the host are being executed), the control flow, with the help of the host NMI context, will be transferred to perf/core to generate performance samples, and thus perf_instruction_pointer() and perf_guest_get_ip() are called. Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may cause perf/core to mistakenly assume that the source RIP of the host NMI belongs to the guest world and use perf_guest_get_ip() to get the RIP of a vCPU that has already exited by a non-NMI interrupt. Error samples are recorded and presented to the end-user via perf-report. Such false positive samples can be eliminated by explicitly determining if the exit reason is KVM_HANDLING_NMI. Note that when VM-Exit is indeed triggered by a PMI and before HANDLING_NMI is cleared, it's also still possible that another PMI is generated on the host. Also, for perf/core timer mode, the false positives are still possible since those non-NMI sources of interrupts are not always being used by perf/core. For events that are host-only, perf/core can and should eliminate false positives by checking event->attr.exclude_guest, i.e. events that are configured to exclude KVM guests should never fire in the guest. Events that are configured to count host and guest are trickier, perhaps impossible to handle with 100% accuracy? And regardless of what accuracy is provided by perf/core, improving KVM's accuracy is cheap and easy, with no real downsides. Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting") Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231206032054.55070-1-likexu@tencent.com [sean: massage changelog, squash !!in_nmi() fixup from Like] Signed-off-by: Sean Christopherson <seanjc@google.com>
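[Editor's illustration: the tightened predicate boils down to something like the following sketch; the state names mirror those in the changelog, but this is not the exact upstream code.]

    #include <stdbool.h>

    enum { KVM_HANDLING_NONE, KVM_HANDLING_IRQ, KVM_HANDLING_NMI };

    /*
     * Attribute a PMI to the guest only while KVM is specifically in its
     * NMI-handling window, not merely in any interrupt-handling window.
     */
    static bool pmi_in_guest(int handling_state, bool host_in_nmi)
    {
        return host_in_nmi && handling_state == KVM_HANDLING_NMI;
    }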
2024-02-23  KVM: x86: Fully defer to vendor code to decide how to force immediate exit  (Sean Christopherson, 1 file, -3/+0)
Now that vmx->req_immediate_exit is used only in the scope of vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp the VMX preemption to force a VM-Exit and let vendor code fully handle forcing a VM-Exit. Opportunistically drop __kvm_request_immediate_exit() and just have vendor code call smp_send_reschedule() directly. SVM already does this when injecting an event while also trying to single-step an IRET, i.e. it's not exactly secret knowledge that KVM uses a reschedule IPI to force an exit. Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23  KVM: x86: Plumb "force_immediate_exit" into kvm_entry() tracepoint  (Sean Christopherson, 1 file, -1/+2)
Annotate the kvm_entry() tracepoint with "immediate exit" when KVM is forcing a VM-Exit immediately after VM-Enter, e.g. when KVM wants to inject an event but needs to first complete some other operation. Knowing that KVM is (or isn't) forcing an exit is useful information when debugging issues related to event injection. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240110012705.506918-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23  KVM: x86: Make kvm_get_dr() return a value, not use an out parameter  (Sean Christopherson, 1 file, -1/+1)
Convert kvm_get_dr()'s output parameter to a return value, and clean up most of the mess that was created by forcing callers to provide a pointer. No functional change intended. Acked-by: Mathias Krause <minipli@grsecurity.net> Reviewed-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240209220752.388160-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
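[Editor's illustration of the conversion, with signatures paraphrased from the commit subject; exact upstream prototypes may differ slightly.]

    struct kvm_vcpu;

    /* Before: the value comes back through an out parameter. */
    void kvm_get_dr_old(struct kvm_vcpu *vcpu, int dr, unsigned long *val);

    /*
     * After: the value is simply returned, so callers no longer need to
     * declare a local just to take its address.
     */
    unsigned long kvm_get_dr(struct kvm_vcpu *vcpu, int dr);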
2024-02-01  KVM: x86/pmu: Snapshot and clear reprogramming bitmap before reprogramming  (Sean Christopherson, 1 file, -0/+1)
Refactor the handling of the reprogramming bitmap to snapshot and clear to-be-processed bits before doing the reprogramming, and then explicitly set bits for PMCs that need to be reprogrammed (again). This will allow adding a macro to iterate over all valid PMCs without having to add special handling for the reprogramming bit, which (a) can have bits set for non-existent PMCs and (b) needs to clear such bits to avoid wasting cycles in perpetuity. Note, the existing behavior of clearing bits after reprogramming does NOT have a race with kvm_vm_ioctl_set_pmu_event_filter(). Setting a new PMU filter synchronizes SRCU _before_ setting the bitmap, i.e. guarantees that the vCPU isn't in the middle of reprogramming with a stale filter prior to setting the bitmap. Link: https://lore.kernel.org/r/20231110022857.1273836-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
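[Editor's illustration: the snapshot-and-clear pattern in miniature; a standalone sketch using C11 atomics, not the kernel code, which uses KVM's own bitmap helpers.]

    #include <stdatomic.h>
    #include <stdint.h>

    /*
     * Atomically take ownership of the pending bits, then walk them; bits
     * set concurrently by other threads stay in *pending for the next pass.
     */
    static void reprogram_all(_Atomic uint64_t *pending)
    {
        uint64_t bitmap = atomic_exchange(pending, 0);  /* snapshot + clear */

        while (bitmap) {
            int pmc = __builtin_ctzll(bitmap);  /* lowest pending PMC index */
            bitmap &= bitmap - 1;               /* consume that bit */
            /* reprogram_counter(pmc); on failure, re-set the bit in *pending */
            (void)pmc;
        }
    }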
2024-02-01  KVM: x86: Give a hint when Win2016 might fail to boot due to XSAVES erratum  (Maciej S. Szmigiero, 1 file, -0/+2)
Since commit b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17") the kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs. Because KVM CPU caps are initialized from the kernel boot CPU features, this makes the XSAVES feature also unavailable for KVM guests in this case. At the same time the XSAVEC feature is left enabled. Unfortunately, having XSAVEC but no XSAVES in CPUID breaks Hyper-V enabled Windows Server 2016 VMs that have more than one vCPU. Let's at least give users a hint in the kernel log about what could be wrong, since these VMs currently simply hang at boot with a black screen - giving no clue what suddenly broke them and how to make them work again. Trigger the kernel message hint based on the particular guest ID written to the Guest OS Identity Hyper-V MSR implemented by KVM. Defer this check to when the L1 Hyper-V hypervisor enables SVM in EFER, since we want to limit this message to Hyper-V enabled Windows guests only (a Windows session running nested as L2) but the actual Guest OS Identity MSR write is done by L1 and happens before it enables SVM. Fixes: b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <b83ab45c5e239e5d148b0ae7750133a67ac9575c.1706127425.git.maciej.szmigiero@oracle.com> [Move some checks before mutex_lock(), rename function. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-29  KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe  (Brijesh Singh, 1 file, -0/+1)
Implement a workaround for an SNP erratum where the CPU will incorrectly signal an RMP violation #PF if a hugepage (2MB or 1GB) collides with the RMP entry of a VMCB, VMSA or AVIC backing page. When SEV-SNP is globally enabled, the CPU marks the VMCB, VMSA, and AVIC backing pages as "in-use" via a reserved bit in the corresponding RMP entry after a successful VMRUN. This is done for _all_ VMs, not just SNP-Active VMs. If the hypervisor accesses an in-use page through a writable translation, the CPU will throw an RMP violation #PF. On early SNP hardware, if an in-use page is 2MB-aligned and software accesses any part of the associated 2MB region with a hugepage, the CPU will incorrectly treat the entire 2MB region as in-use and signal an RMP violation #PF. To avoid this, the recommendation is to not use a 2MB-aligned page for the VMCB, VMSA or AVIC pages. Add a generic allocator that will ensure that the page returned is not 2MB-aligned and is safe to be used when SEV-SNP is enabled. Also implement similar handling for the VMCB/VMSA pages of nested guests. [ mdr: Squash in nested guest handling from Ashish, commit msg fixups. ] Reported-by: Alper Gun <alpergun@google.com> # for nested VMSA case Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Marc Orr <marcorr@google.com> Signed-off-by: Marc Orr <marcorr@google.com> Co-developed-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240126041126.1927228-22-michael.roth@amd.com
2024-01-18  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds, 1 file, -12/+63)
Pull kvm updates from Paolo Bonzini:

"Generic:

- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
- Clean up Kconfigs that all KVM architectures were selecting.
- New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in confidential VMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM).

x86:

- Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer.
- Let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV support.
- On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs.
- Support for virtualizing Linear Address Masking (LAM).
- Fix a variety of vPMU bugs where KVM fails to stop/reset counters and other state prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not inadvertently enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds.
- Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features".
- Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
- Use a RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time.

ARM64:

- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture.
- A small set of vgic fixes and associated cleanups.

Loongarch:

- Optimization for memslot hugepage checking.
- Cleanup and fix some HW/SW timer issues.
- Add LSX/LASX (128bit/256bit SIMD) support.

RISC-V:

- KVM_GET_REG_LIST improvement for vector registers.
- Generate ISA extension reg_list using macros in get-reg-list selftest.
- Support for reporting steal time along with selftest.

s390:

- Bugfixes.

Selftests:

- Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test.
- Fix build errors when a header is delete/moved due to a missing flag in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  KVM: x86: add missing "depends on KVM"
  KVM: fix direction of dependency on MMU notifiers
  KVM: introduce CONFIG_KVM_COMMON
  KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
  KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
  RISC-V: KVM: selftests: Add get-reg-list test for STA registers
  RISC-V: KVM: selftests: Add steal_time test support
  RISC-V: KVM: selftests: Add guest_sbi_probe_extension
  RISC-V: KVM: selftests: Move sbi_ecall to processor.c
  RISC-V: KVM: Implement SBI STA extension
  RISC-V: KVM: Add support for SBI STA registers
  RISC-V: KVM: Add support for SBI extension registers
  RISC-V: KVM: Add SBI STA info to vcpu_arch
  RISC-V: KVM: Add steal-update vcpu request
  RISC-V: KVM: Add SBI STA extension skeleton
  RISC-V: paravirt: Implement steal-time support
  RISC-V: Add SBI STA extension definitions
  RISC-V: paravirt: Add skeleton for pv-time support
  RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
  ...
2024-01-08  Merge tag 'kvm-x86-mmu-6.8' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini, 1 file, -5/+6)
KVM x86 MMU changes for 6.8:

- Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
- Relax the TDP MMU's lockdep assertions related to holding mmu_lock for read versus write so that KVM doesn't pass "bool shared" all over the place just to have precise assertions in paths that don't actually care about whether the caller is a reader or a writer.
2024-01-08  Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini, 1 file, -1/+4)
KVM x86 support for virtualizing Linear Address Masking (LAM)

Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality checks for most virtual address usage in 64-bit mode, such that only the most significant bit of the untranslated address bits must match the polarity of the last translated address bit. This allows software to use ignored, untranslated address bits for metadata, e.g. to efficiently tag pointers for address sanitization.

LAM can be enabled separately for user pointers and supervisor pointers, and for userspace LAM can select between 48-bit and 57-bit masking:

- 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
- 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.

For user pointers, LAM enabling utilizes two previously-reserved high bits from CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and 61 respectively. Note, if LAM_U57 is set, LAM_U48 is ignored, i.e.:

- CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
- CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
- CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers

For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:

- CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers

The modified LAM canonicality checks:

- LAM_S48               : [ 1 ][ metadata ][ 1 ]
                            63               47
- LAM_U48               : [ 0 ][ metadata ][ 0 ]
                            63               47
- LAM_S57               : [ 1 ][ metadata ][ 1 ]
                            63               56
- LAM_U57 + 5-lvl paging: [ 0 ][ metadata ][ 0 ]
                            63               56
- LAM_U57 + 4-lvl paging: [ 0 ][ metadata ][ 0...0 ]
                            63               56..47

The bulk of KVM support for LAM is to emulate LAM's modified canonicality checks. The approach taken by KVM is to "fill" the metadata bits using the highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value from the raw, untagged virtual address is kept for the canonicality check. This untagging allows KVM to reuse its existing canonicality checks on the filled-in address.

Aside from emulating LAM's modified canonicality checks, LAM has the usual KVM touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26]), enabling via CR3 and CR4 bits, etc.
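[Editor's illustration: the fill operation described above written as a small helper; a sketch under the stated semantics (metadata bits replaced by the top translated bit, raw bit 63 preserved), not KVM's actual implementation.]

    #include <stdint.h>

    /*
     * Sign-extend the top translated bit (47 for LAM-48, 56 for LAM-57)
     * into the metadata bits 62:top_bit+1, leaving raw bit 63 untouched
     * for the canonicality check itself.
     */
    static uint64_t lam_untag(uint64_t gva, unsigned int top_bit)
    {
        /* Mask covering bits 62 down to top_bit+1. */
        uint64_t meta = ((1ULL << 63) - 1) & ~((1ULL << (top_bit + 1)) - 1);
        uint64_t fill = ((gva >> top_bit) & 1) ? meta : 0;

        return (gva & ~meta) | fill;
    }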
2024-01-08Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-1/+16
KVM x86 PMU changes for 6.8:

- Fix a variety of bugs where KVM fails to stop/reset counters and
  other state prior to refreshing the vPMU model.

- Fix a double-overflow PMU bug by tracking emulated counter events
  using a dedicated field instead of snapshotting the "previous"
  counter. If a hardware PMC overflow is recognized in the same VM-Exit
  in which KVM manually bumps an event count, KVM would pend PMIs for
  both the hardware-triggered overflow and the KVM-triggered overflow.
2024-01-03arch/x86: Fix typosBjorn Helgaas1-1/+1
Fix typos, most reported by "codespell arch/x86". Only touches
comments, no code changes.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2023-12-07KVM: x86: Make Hyper-V emulation optionalVitaly Kuznetsov1-0/+6
Hyper-V emulation in KVM is a fairly big chunk, and in some cases it
may be desirable to not compile it in, to reduce module sizes as well
as the attack surface. Introduce a CONFIG_KVM_HYPERV option to make
that possible.

Note, there's room for further nVMX/nSVM code optimizations when
!CONFIG_KVM_HYPERV; this will be done in follow-up patches.

Reorganize the Makefile a bit so that all CONFIG_HYPERV and
CONFIG_KVM_HYPERV files are grouped together.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
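Concretely, the shape this takes in kvm_host.h is roughly the following
(an approximate sketch, not the literal diff):

  struct kvm_arch {
          /* ... */
  #ifdef CONFIG_KVM_HYPERV
          /* Per-VM Hyper-V emulation state, compiled out entirely when
           * the new option is disabled. */
          struct kvm_hv hyperv;
  #endif
          /* ... */
  };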
2023-12-07KVM: x86: Move Hyper-V partition assist page out of Hyper-V emulation contextVitaly Kuznetsov1-1/+1
The Hyper-V partition assist page is used when KVM runs on top of
Hyper-V and is not used for Windows/Hyper-V guests on KVM; this means
that the 'hv_pa_pg' placement in 'struct kvm_hv' is unfortunate. As
preparation for making Hyper-V emulation optional, move 'hv_pa_pg' to
'struct kvm_arch' and put it under CONFIG_HYPERV.

While at it, introduce a hv_get_partition_assist_page() helper to
allocate the partition assist page. Move the comment explaining why we
use a single page for all vCPUs from VMX and expand it a bit.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
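A plausible sketch of the helper, reconstructed from the description
above rather than quoted from the patch:

  static inline int hv_get_partition_assist_page(struct kvm_vcpu *vcpu)
  {
          /* A single partition assist page is shared by all vCPUs, so
           * allocate it on first use and reuse it afterwards. */
          struct page **p_hv_pa_pg = &vcpu->kvm->arch.hv_pa_pg;

          if (!*p_hv_pa_pg)
                  *p_hv_pa_pg = alloc_page(GFP_KERNEL | __GFP_ZERO);

          return *p_hv_pa_pg ? 0 : -ENOMEM;
  }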
2023-12-07KVM: x86/xen: Remove unneeded xen context from kvm_arch when !CONFIG_KVM_XENVitaly Kuznetsov1-0/+5
Saving a few bytes of memory per KVM VM is certainly great, but what's
more important is the ability to see where the code accesses Xen
emulation context while CONFIG_KVM_XEN is not enabled. Currently,
kvm_cpu_get_extint() is the only such place, and it is harmless:
kvm_xen_has_interrupt() always returns '0' when !CONFIG_KVM_XEN.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-2-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
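The pattern that makes this safe is the standard stub-when-disabled
idiom; a sketch of its shape:

  #ifdef CONFIG_KVM_XEN
  int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu);
  #else
  /* With CONFIG_KVM_XEN=n this stub always returns 0, so the lone
   * consumer, kvm_cpu_get_extint(), never dereferences Xen state and
   * the context can safely disappear from kvm_arch. */
  static inline int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu)
  {
          return 0;
  }
  #endif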
2023-12-01KVM: x86/mmu: always take tdp_mmu_pages_lockPaolo Bonzini1-5/+6
It is cheap to take tdp_mmu_pages_lock in all write-side critical
sections. We already do it all the time when zapping with read_lock(),
so it is not a problem to do it from the kvm_tdp_mmu_zap_all() path
(aka kvm_arch_flush_shadow_all(), aka VM destruction and MMU notifier
release).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-4-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
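A minimal sketch of the resulting pattern (the lock name comes from the
changelog; the surrounding body is purely illustrative):

  static void tdp_mmu_unlink_sp_sketch(struct kvm *kvm,
                                       struct kvm_mmu_page *sp)
  {
          /* Taken unconditionally now, whether mmu_lock is held for
           * read (zapping under read_lock()) or for write (VM
           * destruction and MMU notifier release), rather than only in
           * the shared case. */
          spin_lock(&kvm->arch.tdp_mmu_pages_lock);
          list_del(&sp->link);
          spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
  }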
2023-11-30KVM: x86/pmu: Track emulated counter events instead of previous counterSean Christopherson1-1/+16
Explicitly track emulated counter events instead of using the common
counter value that's shared with the hardware counter owned by perf.
Bumping the common counter requires snapshotting the pre-increment
value in order to detect overflow from emulation, and the snapshot
approach is inherently flawed.

Snapshotting the previous counter at every increment assumes that there
is at most one emulated counter event per emulated instruction (or
rather, between checks for KVM_REQ_PMU). That mostly holds true today
because KVM only emulates (branch) instructions retired, but the
approach will fall apart if KVM ever supports event types that don't
have a 1:1 relationship with instructions. And KVM already has a
relevant bug, as handle_invalid_guest_state() emulates multiple
instructions without checking KVM_REQ_PMU, i.e. could miss an overflow
event due to clobbering pmc->prev_counter. Not checking KVM_REQ_PMU is
problematic in both cases, but at least with the emulated counter
approach, the resulting behavior is delayed overflow detection, as
opposed to completely lost detection.

Tracking the emulated count fixes another bug where the snapshot
approach can signal spurious overflow due to incorporating both the
emulated count and perf's count in the check, i.e. if overflow is
detected by perf, then KVM's emulation will also incorrectly signal
overflow.

Add a comment in the related code to call out the need to process
emulated events *after* pausing the perf event (big kudos to Mingwei
for figuring out that particular wrinkle).

Cc: Mingwei Zhang <mizhang@google.com>
Cc: Roman Kagan <rkagan@amazon.de>
Cc: Jim Mattson <jmattson@google.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231103230541.352265-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
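A hypothetical sketch of the scheme in plain C (the dedicated emulated
counter field matches the changelog; the struct, helper, mask handling,
and PMI callback are made up for illustration):

  #include <stdint.h>

  struct pmc_sketch {
          uint64_t counter;          /* value shared with the perf event */
          uint64_t emulated_counter; /* dedicated count of emulated events */
  };

  /* Fold emulated events into the counter when KVM_REQ_PMU is
   * processed.  The perf event must be paused *before* calling this,
   * so a racing hardware increment can't be lost (the wrinkle called
   * out above). */
  static void fold_emulated_events(struct pmc_sketch *pmc,
                                   uint64_t width_mask,
                                   void (*pend_pmi)(struct pmc_sketch *))
  {
          uint64_t prev = pmc->counter & width_mask;

          pmc->counter = (prev + pmc->emulated_counter) & width_mask;
          pmc->emulated_counter = 0;

          /* A wrap relative to the pre-fold value means emulation
           * overflowed; hardware-detected overflow is reported by perf
           * separately, so the two are no longer conflated. */
          if (pmc->counter < prev)
                  pend_pmi(pmc);
  }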
2023-11-30KVM: x86/mmu: Declare flush_remote_tlbs{_range}() hooks iff HYPERV!=nSean Christopherson1-0/+12
Declare the kvm_x86_ops hooks used to wire up paravirt TLB flushes when
running under Hyper-V if and only if CONFIG_HYPERV!=n. Wrapping yet
more code with IS_ENABLED(CONFIG_HYPERV) eliminates a handful of
conditional branches, and makes it super obvious why the hooks *might*
be valid.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231018192325.1893896-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
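The declaration ends up roughly of this shape in kvm_host.h (a hedged
sketch, not the literal diff):

  struct kvm_x86_ops {
          /* ... */
  #if IS_ENABLED(CONFIG_HYPERV)
          /* Only declared when KVM itself may be running on Hyper-V,
           * which self-documents when these hooks can possibly be
           * non-NULL. */
          int (*flush_remote_tlbs)(struct kvm *kvm);
          int (*flush_remote_tlbs_range)(struct kvm *kvm, gfn_t gfn,
                                         gfn_t nr_pages);
  #endif
          /* ... */
  };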