summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-05-10Merge branch 'kvm-vmx-ve' into HEADPaolo Bonzini13-43/+167
Allow a non-zero value for non-present SPTE and removed SPTE, so that TDX can set the "suppress VE" bit.
2024-05-07KVM: x86: Explicitly zero kvm_caps during vendor module loadSean Christopherson1-0/+7
Zero out all of kvm_caps when loading a new vendor module to ensure that KVM can't inadvertently rely on global initialization of a field, and add a comment above the definition of kvm_caps to call out that all fields needs to be explicitly computed during vendor module load. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Fully re-initialize supported_mce_cap on vendor module loadSean Christopherson1-3/+2
Effectively reset supported_mce_cap on vendor module load to ensure that capabilities aren't unintentionally preserved across module reload, e.g. if kvm-intel.ko added a module param to control LMCE support, or if someone somehow managed to load a vendor module that doesn't support LMCE after loading and unloading kvm-intel.ko. Practically speaking, this bug is a non-issue as kvm-intel.ko doesn't have a module param for LMCE, and there is no system in the world that supports both kvm-intel.ko and kvm-amd.ko. Fixes: c45dcc71b794 ("KVM: VMX: enable guest access to LMCE related MSRs") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Fully re-initialize supported_vm_types on vendor module loadSean Christopherson1-1/+2
Recompute the entire set of supported VM types when a vendor module is loaded, as preserving supported_vm_types across vendor module unload and reload can result in VM types being incorrectly treated as supported. E.g. if a vendor module is loaded with TDP enabled, unloaded, and then reloaded with TDP disabled, KVM_X86_SW_PROTECTED_VM will be incorrectly retained. Ditto for SEV_VM and SEV_ES_VM and their respective module params in kvm-amd.ko. Fixes: 2a955c4db1dd ("KVM: x86: Add supported_vm_types to kvm_caps") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07Merge tag 'kvm-riscv-6.10-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini67-403/+2186
KVM/riscv changes for 6.10 - Support guest breakpoints using ebreak - Introduce per-VCPU mp_state_lock and reset_cntx_lock - Virtualize SBI PMU snapshot and counter overflow interrupts - New selftests for SBI PMU and Guest ebreak
2024-04-26KVM: riscv: selftests: Add commandline option for SBI PMU testAtish Patra1-9/+64
SBI PMU test comprises of multiple tests and user may want to run only a subset depending on the platform. The most common case would be to run all to validate all the tests. However, some platform may not support all events or all ISA extensions. The commandline option allows user to disable any set of tests if they want to. Suggested-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240420151741.962500-25-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26KVM: riscv: selftests: Add a test for counter overflowAtish Patra1-0/+113
Add a test for verifying overflow interrupt. Currently, it relies on overflow support on cycle/instret events. This test works for cycle/ instret events which support sampling via hpmcounters on the platform. There are no ISA extensions to detect if a platform supports that. Thus, this test will fail on platform with virtualization but doesn't support overflow on these two events. Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-24-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26KVM: riscv: selftests: Add a test for PMU snapshot functionalityAtish Patra3-0/+181
Verify PMU snapshot functionality by setting up the shared memory correctly and reading the counter values from the shared memory instead of the CSR. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-23-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26KVM: riscv: selftests: Add SBI PMU selftestAtish Patra2-0/+370
This test implements basic sanity test and cycle/instret event counting tests. Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-22-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26KVM: riscv: selftests: Add SBI PMU extension definitionsAtish Patra1-0/+66
The SBI PMU extension definition is required for upcoming SBI PMU selftests. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-21-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26KVM: riscv: selftests: Add Sscofpmf to get-reg-list testAtish Patra1-0/+4
The KVM RISC-V allows Sscofpmf extension for Guest/VM so let us add this extension to get-reg-list test. Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-20-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26KVM: riscv: selftests: Add helper functions for extension checksAtish Patra2-1/+11
__vcpu_has_ext can check both SBI and ISA extensions when the first argument is properly converted to SBI/ISA extension IDs. Introduce two helper functions to make life easier for developers so they don't have to worry about the conversions. Replace the current usages as well with new helpers. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240420151741.962500-19-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26KVM: riscv: selftests: Move sbi definitions to its own header fileAtish Patra4-40/+54
The SBI definitions will continue to grow. Move the sbi related definitions to its own header file from processor.h Suggested-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240420151741.962500-18-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: Improve firmware counter read functionAtish Patra3-3/+8
Rename the function to indicate that it is meant for firmware counter read. While at it, add a range sanity check for it as well. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240420151741.962500-17-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: Support 64 bit firmware counters on RV32Atish Patra3-2/+52
The SBI v2.0 introduced a fw_read_hi function to read 64 bit firmware counters for RV32 based systems. Add infrastructure to support that. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-16-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: Add perf sampling support for guestsAtish Patra7-8/+93
KVM enables perf for guest via counter virtualization. However, the sampling can not be supported as there is no mechanism to enabled trap/emulate scountovf in ISA yet. Rely on the SBI PMU snapshot to provide the counter overflow data via the shared memory. In case of sampling event, the host first sets the guest's LCOFI interrupt and injects to the guest via irq filtering mechanism defined in AIA specification. Thus, ssaia must be enabled in the host in order to use perf sampling in the guest. No other AIA dependency w.r.t kernel is required. Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-15-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: Implement SBI PMU Snapshot featureAtish Patra3-1/+130
PMU Snapshot function allows to minimize the number of traps when the guest access configures/access the hpmcounters. If the snapshot feature is enabled, the hypervisor updates the shared memory with counter data and state of overflown counters. The guest can just read the shared memory instead of trap & emulate done by the hypervisor. This patch doesn't implement the counter overflow yet. Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-14-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: No need to exit to the user space if perf event failedAtish Patra2-8/+12
Currently, we return a linux error code if creating a perf event failed in kvm. That shouldn't be necessary as guest can continue to operate without perf profiling or profiling with firmware counters. Return appropriate SBI error code to indicate that PMU configuration failed. An error message in kvm already describes the reason for failure. Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling") Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-13-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: No need to update the counter value during resetAtish Patra1-6/+2
The virtual counter value is updated during pmu_ctr_read. There is no need to update it in reset case. Otherwise, it will be counted twice which is incorrect. Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling") Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-12-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: Fix the initial sample period valueAtish Patra1-1/+1
The initial sample period value when counter value is not assigned should be set to maximum value supported by the counter width. Otherwise, it may result in spurious interrupts. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240420151741.962500-11-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26drivers/perf: riscv: Implement SBI PMU snapshot functionAtish Patra3-21/+264
SBI v2.0 SBI introduced PMU snapshot feature which adds the following features. 1. Read counter values directly from the shared memory instead of csr read. 2. Start multiple counters with initial values with one SBI call. These functionalities optimizes the number of traps to the higher privilege mode. If the kernel is in VS mode while the hypervisor deploy trap & emulate method, this would minimize all the hpmcounter CSR read traps. If the kernel is running in S-mode, the benefits reduced to CSR latency vs DRAM/cache latency as there is no trap involved while accessing the hpmcounter CSRs. In both modes, it does saves the number of ecalls while starting multiple counter together with an initial values. This is a likely scenario if multiple counters overflow at the same time. Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Samuel Holland <samuel.holland@sifive.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-10-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22drivers/perf: riscv: Fix counter mask iteration for RV32Atish Patra1-9/+12
For RV32, used_hw_ctrs can have more than 1 word if the firmware chooses to interleave firmware/hardware counters indicies. Even though it's a unlikely scenario, handle that case by iterating over all the words instead of just using the first word. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-9-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22RISC-V: Use the minor version mask while computing sbi versionAtish Patra1-2/+2
As per the SBI specification, minor version is encoded in the lower 24 bits only. Make sure that the SBI version is computed with the appropriate mask. Currently, there is no minor version in use. Thus, it doesn't change anything functionality but it is good to be compliant with the specification. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-8-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22RISC-V: KVM: Rename the SBI_STA_SHMEM_DISABLE to a generic nameAtish Patra3-6/+6
SBI_STA_SHMEM_DISABLE is a macro to invoke disable shared memory commands. As this can be invoked from other SBI extension context as well, rename it to more generic name as SBI_SHMEM_DISABLE. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-7-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22RISC-V: Add SBI PMU snapshot definitionsAtish Patra1-0/+11
SBI PMU Snapshot function optimizes the number of traps to higher privilege mode by leveraging a shared memory between the S/VS-mode and the M/HS mode. Add the definitions for that extension and new error codes. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-6-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22drivers/perf: riscv: Use BIT macro for shifting operationsAtish Patra2-11/+11
It is a good practice to use BIT() instead of (1 << x). Replace the current usages with BIT(). Take this opportunity to replace few (1UL << x) with BIT() as well for consistency. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-5-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22drivers/perf: riscv: Read upper bits of a firmware counterAtish Patra1-5/+20
SBI v2.0 introduced a explicit function to read the upper 32 bits for any firmware counter width that is longer than 32bits. This is only applicable for RV32 where firmware counter can be 64 bit. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-4-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22RISC-V: Add FIRMWARE_READ_HI definitionAtish Patra1-0/+1
SBI v2.0 added another function to SBI PMU extension to read the upper bits of a counter with width larger than XLEN. Add the definition for that function. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Clément Léger <cleger@rivosinc.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-3-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22RISC-V: Fix the typo in Scountovf CSR nameAtish Patra2-2/+2
The counter overflow CSR name is "scountovf" not "sscountovf". Fix the csr name. Fixes: 4905ec2fb7e6 ("RISC-V: Add sscofpmf extension support") Reviewed-by: Clément Léger <cleger@rivosinc.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-2-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22RISCV: KVM: Introduce vcpu->reset_cntx_lockYong-Xuan Wang3-0/+10
Originally, the use of kvm->lock in SBI_EXT_HSM_HART_START also avoids the simultaneous updates to the reset context of target VCPU. Since this lock has been replace with vcpu->mp_state_lock, and this new lock also protects the vcpu->mp_state. We have to add a separate lock for vcpu->reset_cntx. Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240417074528.16506-3-yongxuan.wang@sifive.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22RISCV: KVM: Introduce mp_state_lock to avoid lock inversionYong-Xuan Wang4-29/+73
Documentation/virt/kvm/locking.rst advises that kvm->lock should be acquired outside vcpu->mutex and kvm->srcu. However, when KVM/RISC-V handling SBI_EXT_HSM_HART_START, the lock ordering is vcpu->mutex, kvm->srcu then kvm->lock. Although the lockdep checking no longer complains about this after commit f0f44752f5f6 ("rcu: Annotate SRCU's update-side lockdep dependencies"), it's necessary to replace kvm->lock with a new dedicated lock to ensure only one hart can execute the SBI_EXT_HSM_HART_START call for the target hart simultaneously. Additionally, this patch also rename "power_off" to "mp_state" with two possible values. The vcpu->mp_state_lock also protects the access of vcpu->mp_state. Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240417074528.16506-2-yongxuan.wang@sifive.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-19KVM: VMX: Introduce test mode related to EPT violation VEIsaku Yamahata4-4/+73
To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the suppress #VE bit in EPT entries selectively, in order to be able to trap non-present conditions. However, #VE isn't used for VMX and it's a bug if it happens. To be defensive and test that VMX case isn't broken introduce an option ept_violation_ve_test and when it's set, BUG the vm. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <d6db6ba836605c0412e166359ba5c46a63c22f86.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19KVM, x86: add architectural support code for #VEPaolo Bonzini2-0/+16
Dump the contents of the #VE info data structure and assert that #VE does not happen, but do not yet do anything with it. No functional change intended, separated for clarity only. Extracted from a patch by Isaku Yamahata <isaku.yamahata@intel.com>. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19KVM: x86/mmu: Track shadow MMIO value on a per-VM basisSean Christopherson5-10/+13
TDX will use a different shadow PTE entry value for MMIO from VMX. Add a member to kvm_arch and track value for MMIO per-VM instead of a global variable. By using the per-VM EPT entry value for MMIO, the existing VMX logic is kept working. Introduce a separate setter function so that guest TD can use a different value later. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <229a18434e5d83f45b1fcd7bf1544d79db1becb6.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19KVM: x86/mmu: Add Suppress VE bit to EPT shadow_mmio_mask/shadow_present_maskIsaku Yamahata2-2/+5
To make use of the same value of shadow_mmio_mask and shadow_present_mask for TDX and VMX, add Suppress-VE bit to shadow_mmio_mask and shadow_present_mask so that they can be common for both VMX and TDX. TDX will require shadow_mmio_mask and shadow_present_mask to include VMX_SUPPRESS_VE for shared GPA so that EPT violation is triggered for shared GPA. For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO because the spte value is defined so as to cause EPT misconfig. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <97cc616b3563cd8277be91aaeb3e14bce23c3649.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTESean Christopherson3-14/+28
For TD guest, the current way to emulate MMIO doesn't work any more, as KVM is not able to access the private memory of TD guest and do the emulation. Instead, TD guest expects to receive #VE when it accesses the MMIO and then it can explicitly make hypercall to KVM to get the expected information. To achieve this, the TDX module always enables "EPT-violation #VE" in the VMCS control. And accordingly, for the MMIO spte for the shared GPA, 1. KVM needs to set "suppress #VE" bit for the non-present SPTE so that EPT violation happens on TD accessing MMIO range. 2. On EPT violation, KVM sets the MMIO spte to clear "suppress #VE" bit so the TD guest can receive the #VE instead of EPT misconfiguration unlike VMX case. For the shared GPA that is not populated yet, EPT violation need to be triggered when TD guest accesses such shared GPA. The non-present SPTE value for shared GPA should set "suppress #VE" bit. Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and REMOVED_SPTE. Unconditionally set the "suppress #VE" bit (which is bit 63) for both AMD and Intel as: 1) AMD hardware doesn't use this bit when present bit is off; 2) for normal VMX guest, KVM never enables the "EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <a99cb866897c7083430dce7f24c63b17d7121134.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTESean Christopherson4-13/+19
The TDX support will need the "suppress #VE" bit (bit 63) set as the initial value for SPTE. To reduce code change size, introduce a new macro SHADOW_NONPRESENT_VALUE for the initial value for the shadow page table entry (SPTE) and replace hard-coded value 0 for it. Initialize shadow page tables with their value. The plan is to unconditionally set the "suppress #VE" bit for both AMD and Intel as: 1) AMD hardware uses the bit 63 as NX for present SPTE and ignored for non-present SPTE; 2) for conventional VMX guests, KVM never enables the "EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by hardware. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <acdf09bf60cad12c495005bf3495c54f6b3069c9.1705965635.git.isaku.yamahata@intel.com> [Remove unnecessary CONFIG_X86_64 check. - Paolo] Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19KVM: Allow page-sized MMU caches to be initialized with custom 64-bit valuesSean Christopherson2-2/+15
Add support to MMU caches for initializing a page with a custom 64-bit value, e.g. to pre-fill an entire page table with non-zero PTE values. The functionality will be used by x86 to support Intel's TDX, which needs to set bit 63 in all non-present PTEs in order to prevent !PRESENT page faults from getting reflected into the guest (Intel's EPT Violation #VE architecture made the less than brilliant decision of having the per-PTE behavior be opt-out instead of opt-in). Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <5919f685f109a1b0ebc6bd8fc4536ee94bcc172d.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19Merge x86 bugfixes from Linux 6.9-rc3Paolo Bonzini773-4479/+9771
Pull fix for SEV-SNP late disable bugs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12KVM: SEV: use u64_to_user_ptr throughoutPaolo Bonzini1-22/+22
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argumentSean Christopherson1-9/+7
TDX uses different ABI to get information about VM exit. Pass intr_info to the NMI and INTR handlers instead of pulling it from vcpu_vmx in preparation for sharing the bulk of the handlers with TDX. When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds exit qualification etc rather than the VMCS fields because VMM doesn't have access to the VMCS. The eventual code will be VMX: - get exit reason, intr_info, exit_qualification, and etc from VMCS - call NMI/INTR handlers (common code) TDX: - get exit reason, intr_info, exit_qualification, and etc from guest registers - call NMI/INTR handlers (common code) Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDXPaolo Bonzini4-270/+391
KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions to operate on VM. TDX doesn't allow VMM to operate VMCS directly. Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to indirectly operate those data structures. This means we must have a TDX version of kvm_x86_ops. The existing global struct kvm_x86_ops already defines an interface which can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM structure. To allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or TDX at run time. To split the runtime switch, the VMX implementation, and the TDX implementation, add main.c, and move out the vmx_x86_ops hooks in preparation for adding TDX. Use 'vt' for the naming scheme as a nod to VT-x and as a concatenation of VmxTdx. The eventually converted code will look like this: vmx.c: vmx_op() { ... } VMX initialization tdx.c: tdx_op() { ... } TDX initialization x86_ops.h: vmx_op(); tdx_op(); main.c: static vt_op() { if (tdx) tdx_op() else vmx_op() } static struct kvm_x86_ops vt_x86_ops = { .op = vt_op, initialization functions call both VMX and TDX initialization Opportunistically, fix the name inconsistency from vmx_create_vcpu() and vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free(). Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12KVM: x86: Split core of hypercall emulation to helper functionSean Christopherson2-18/+42
By necessity, TDX will use a different register ABI for hypercalls. Break out the core functionality so that it may be reused for TDX. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12Merge branch 'kvm-sev-init2' into HEADPaolo Bonzini25-185/+703
The idea that no parameter would ever be necessary when enabling SEV or SEV-ES for a VM was decidedly optimistic. The first source of variability that was encountered is the desired set of VMSA features, as that affects the measurement of the VM's initial state and cannot be changed arbitrarily by the hypervisor. This series adds all the APIs that are needed to customize the features, with room for future enhancements: - a new /dev/kvm device attribute to retrieve the set of supported features (right now, only debug swap) - a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct, replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT It then puts the new op to work by including the VMSA features as a field of the The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of supported VMSA features for backwards compatibility; but I am considering also making them use zero as the feature mask, and will gladly adjust the patches if so requested. In order to avoid creating *two* new KVM_MEM_ENCRYPT_OPs, I decided that I could as well make SEV and SEV-ES use VM types. This allows SEV-SNP to reuse the KVM_SEV_INIT2 ioctl. And while at it, KVM_SEV_INIT2 also includes two bugfixes. First of all, SEV-ES VM, when created with the new VM type instead of KVM_SEV_ES_INIT, reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor once the VMSA has been encrypted... which is how the API should have always behaved. Second, they also synchronize the FPU and AVX state. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12Merge branch 'mm-delete-change-gpte' into HEADPaolo Bonzini26-419/+16
The .change_pte() MMU notifier callback was intended as an optimization and for this reason it was initially called without a surrounding mmu_notifier_invalidate_range_{start,end}() pair. It was only ever implemented by KVM (which was also the original user of MMU notifiers) and the rules on when to call set_pte_at_notify() rather than set_pte_at() have always been pretty obscure. It may seem a miracle that it has never caused any hard to trigger bugs, but there's a good reason for that: KVM's implementation has been nonfunctional for a good part of its existence. Already in 2012, commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end", 2012-10-09) changed the .change_pte() callback to occur within an invalidate_range_start/end() pair; and because KVM unmaps the sPTEs during .invalidate_range_start(), .change_pte() has no hope of finding a sPTE to change. Therefore, all the code for .change_pte() can be removed from both KVM and mm/, and set_pte_at_notify() can be replaced with just set_pte_at(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12mm: replace set_pte_at_notify() with just set_pte_at()Paolo Bonzini5-19/+8
With the demise of the .change_pte() MMU notifier callback, there is no notification happening in set_pte_at_notify(). It is a synonym of set_pte_at() and can be replaced with it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11mmu_notifier: remove the .change_pte() callbackPaolo Bonzini2-61/+2
The scope of set_pte_at_notify() has reduced more and more through the years. Initially, it was meant for when the change to the PTE was not bracketed by mmu_notifier_invalidate_range_{start,end}(). However, that has not been so for over ten years. During all this period the only implementation of .change_pte() was KVM and it had no actual functionality, because it was called after mmu_notifier_invalidate_range_start() zapped the secondary PTE. Now that this (nonfunctional) user of the .change_pte() callback is gone, the whole callback can be removed. For now, leave in place set_pte_at_notify() even though it is just a synonym for set_pte_at(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Message-ID: <20240405115815.3226315-4-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11KVM: remove unused argument of kvm_handle_hva_range()Paolo Bonzini1-6/+1
The only user was kvm_mmu_notifier_change_pte(), which is now gone. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20240405115815.3226315-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11KVM: delete .change_pte MMU notifier callbackPaolo Bonzini20-335/+7
The .change_pte() MMU notifier callback was intended as an optimization. The original point of it was that KSM could tell KVM to flip its secondary PTE to a new location without having to first zap it. At the time there was also an .invalidate_page() callback; both of them were *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), and .invalidate_page() also doubled as a fallback implementation of .change_pte(). Later on, however, both callbacks were changed to occur within an invalidate_range_start/end() block. In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end", 2012-10-09) did so to remove the fallback from .invalidate_page() to .change_pte() and allow sleepable .invalidate_page() hooks. This however made KVM's usage of the .change_pte() callback completely moot, because KVM unmaps the sPTEs during .invalidate_range_start() and therefore .change_pte() has no hope of finding a sPTE to change. Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as well as all the architecture specific implementations. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Bibo Mao <maobibo@loongson.cn> Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11selftests: kvm: add test for transferring FPU state into VMSAPaolo Bonzini1-0/+89
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-18-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>