summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)AuthorFilesLines
2024-02-01KVM: x86/pmu: Expand the comment about what bits are check emulating eventsSean Christopherson1-1/+14
Expand the comment about what bits are and aren't checked when emulating PMC events in software. As pointed out by Jim, AMD's mask includes bits 35:32, which on Intel overlap with the IN_TX and IN_TXCP bits (32 and 33) as well as reserved bits (34 and 45). Checking The IN_TX* bits is actually correct, as it's safe to assert that the vCPU can't be in an HLE/RTM transaction if KVM is emulating an instruction, i.e. KVM *shouldn't count if either of those bits is set. For the reserved bits, KVM is has equal odds of being right if Intel adds new behavior, i.e. ignoring them is just as likely to be correct as checking them. Opportunistically explain *why* the other flags aren't checked. Suggested-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20231110022857.1273836-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Snapshot event selectors that KVM emulates in softwareSean Christopherson4-14/+24
Snapshot the event selectors for the events that KVM emulates in software, which is currently instructions retired and branch instructions retired. The event selectors a tied to the underlying CPU, i.e. are constant for a given platform even though perf doesn't manage the mappings as such. Getting the event selectors from perf isn't exactly cheap, especially if mitigations are enabled, as at least one indirect call is involved. Snapshot the values in KVM instead of optimizing perf as working with the raw event selectors will be required if KVM ever wants to emulate events that aren't part of perf's uABI, i.e. that don't have an "enum perf_hw_id" entry. Link: https://lore.kernel.org/r/20231110022857.1273836-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Process only enabled PMCs when emulating events in softwareSean Christopherson1-1/+10
Mask off disabled counters based on PERF_GLOBAL_CTRL *before* iterating over PMCs to emulate (branch) instruction required events in software. In the common case where the guest isn't utilizing the PMU, pre-checking for enabled counters turns a relatively expensive search into a few AND uops and a Jcc. Sadly, PMUs without PERF_GLOBAL_CTRL, e.g. most existing AMD CPUs, are out of luck as there is no way to check that a PMC isn't being used without checking the PMC's event selector. Cc: Konstantin Khorenko <khorenko@virtuozzo.com> Link: https://lore.kernel.org/r/20231110022857.1273836-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Add macros to iterate over all PMCs given a bitmapSean Christopherson3-24/+15
Add and use kvm_for_each_pmc() to dedup a variety of open coded for-loops that iterate over valid PMCs given a bitmap (and because seeing checkpatch whine about bad macro style is always amusing). No functional change intended. Link: https://lore.kernel.org/r/20231110022857.1273836-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Snapshot and clear reprogramming bitmap before reprogrammingSean Christopherson1-23/+29
Refactor the handling of the reprogramming bitmap to snapshot and clear to-be-processed bits before doing the reprogramming, and then explicitly set bits for PMCs that need to be reprogrammed (again). This will allow adding a macro to iterate over all valid PMCs without having to add special handling for the reprogramming bit, which (a) can have bits set for non-existent PMCs and (b) needs to clear such bits to avoid wasting cycles in perpetuity. Note, the existing behavior of clearing bits after reprogramming does NOT have a race with kvm_vm_ioctl_set_pmu_event_filter(). Setting a new PMU filter synchronizes SRCU _before_ setting the bitmap, i.e. guarantees that the vCPU isn't in the middle of reprogramming with a stale filter prior to setting the bitmap. Link: https://lore.kernel.org/r/20231110022857.1273836-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Move pmc_idx => pmc translation helper to common codeSean Christopherson4-23/+36
Add a common helper for *internal* PMC lookups, and delete the ops hook and Intel's implementation. Keep AMD's implementation, but rename it to amd_pmu_get_pmc() to make it somewhat more obvious that it's suited for both KVM-internal and guest-initiated lookups. Because KVM tracks all counters in a single bitmap, getting a counter when iterating over a bitmap, e.g. of all valid PMCs, requires a small amount of math, that while simple, isn't super obvious and doesn't use the same semantics as PMC lookups from RDPMC! Although AMD doesn't support fixed counters, the common PMU code still behaves as if there a split, the high half of which just happens to always be empty. Opportunstically add a comment to explain both what is going on, and why KVM uses a single bitmap, e.g. the boilerplate for iterating over separate bitmaps could be done via macros, so it's not (just) about deduplicating code. Link: https://lore.kernel.org/r/20231110022857.1273836-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Add common define to capture fixed counters offsetSean Christopherson3-11/+13
Add a common define to "officially" solidify KVM's split of counters, i.e. to commit to using bits 31:0 to track general purpose counters and bits 63:32 to track fixed counters (which only Intel supports). KVM already bleeds this behavior all over common PMU code, and adding a KVM- defined macro allows clarifying that the value is a _base_, as oppposed to the _flag_ that is used to access fixed PMCs via RDPMC (which perf confusingly calls INTEL_PMC_FIXED_RDPMC_BASE). No functional change intended. Link: https://lore.kernel.org/r/20231110022857.1273836-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabledSean Christopherson2-16/+20
Move the purging of common PMU metadata from intel_pmu_refresh() to kvm_pmu_refresh(), and invoke the vendor refresh() hook if and only if the VM is supposed to have a vPMU. KVM already denies access to the PMU based on kvm->arch.enable_pmu, as get_gp_pmc_amd() returns NULL for all PMCs in that case, i.e. KVM already violates AMD's architecture by not virtualizing a PMU (kernels have long since learned to not panic when the PMU is unavailable). But configuring the PMU as if it were enabled causes unwanted side effects, e.g. calls to kvm_pmu_trigger_event() waste an absurd number of cycles due to the all_valid_pmc_idx bitmap being non-zero. Fixes: b1d66dad65dc ("KVM: x86/svm: Add module param to control PMU virtualization") Reported-by: Konstantin Khorenko <khorenko@virtuozzo.com> Closes: https://lore.kernel.org/all/20231109180646.2963718-2-khorenko@virtuozzo.com Link: https://lore.kernel.org/r/20231110022857.1273836-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86: Make gtod_is_based_on_tsc() return 'bool'Vitaly Kuznetsov1-1/+1
gtod_is_based_on_tsc() is boolean in nature, i.e. it returns '1' for good clocksources and '0' otherwise. Moreover, its result is used raw by kvm_get_time_and_clockread()/kvm_get_walltime_and_clockread() which are 'bool'. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240109141121.1619463-6-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/mmu: Use KMEM_CACHE instead of kmem_cache_create()Kunwu Chan1-3/+1
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Note, KMEM_CACHE() uses the required alignment of the struct, '8' as the alignment, whereas KVM's existing code passes '0'. In the end, the two values yield the same result as x86's minimum slab alignment is also '8' (which is not at all coincidental). Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240116100025.95702-1-chentao@kylinos.cn [sean: call out alignment behavior] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86: Give a hint when Win2016 might fail to boot due to XSAVES erratumMaciej S. Szmigiero3-0/+57
Since commit b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17") kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs. Because KVM CPU caps are initialized from the kernel boot CPU features this makes the XSAVES feature also unavailable for KVM guests in this case. At the same time the XSAVEC feature is left enabled. Unfortunately, having XSAVEC but no XSAVES in CPUID breaks Hyper-V enabled Windows Server 2016 VMs that have more than one vCPU. Let's at least give users hint in the kernel log what could be wrong since these VMs currently simply hang at boot with a black screen - giving no clue what suddenly broke them and how to make them work again. Trigger the kernel message hint based on the particular guest ID written to the Guest OS Identity Hyper-V MSR implemented by KVM. Defer this check to when the L1 Hyper-V hypervisor enables SVM in EFER since we want to limit this message to Hyper-V enabled Windows guests only (Windows session running nested as L2) but the actual Guest OS Identity MSR write is done by L1 and happens before it enables SVM. Fixes: b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <b83ab45c5e239e5d148b0ae7750133a67ac9575c.1706127425.git.maciej.szmigiero@oracle.com> [Move some checks before mutex_lock(), rename function. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-01KVM: x86: Check irqchip mode before create PITTengfei Yu1-0/+3
As the kvm api(https://docs.kernel.org/virt/kvm/api.html) reads, KVM_CREATE_PIT2 call is only valid after enabling in-kernel irqchip support via KVM_CREATE_IRQCHIP. Without this check, I can create PIT first and enable irqchip-split then, which may cause the PIT invalid because of lacking of in-kernel PIC to inject the interrupt. Signed-off-by: Tengfei Yu <moehanabichan@gmail.com> Message-Id: <20240125050823.4893-1-moehanabichan@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-01KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handlingXin Li1-3/+9
When FRED is enabled, call fred_entry_from_kvm() to handle IRQ/NMI in IRQ/NMI induced VM exits. Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231205105030.8698-33-xin3.li@intel.com
2024-01-31KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpuPrasad Pandit1-1/+2
kvm_vcpu_ioctl_x86_set_vcpu_events() routine makes 'KVM_REQ_NMI' request for a vcpu even when its 'events->nmi.pending' is zero. Ex: qemu_thread_start kvm_vcpu_thread_fn qemu_wait_io_event qemu_wait_io_event_common process_queued_cpu_work do_kvm_cpu_synchronize_post_init/_reset kvm_arch_put_registers kvm_put_vcpu_events (cpu, level=[2|3]) This leads vCPU threads in QEMU to constantly acquire & release the global mutex lock, delaying the guest boot due to lock contention. Add check to make KVM_REQ_NMI request only if vcpu has NMI pending. Fixes: bdedff263132 ("KVM: x86: Route pending NMIs from userspace through process_nmi()") Cc: stable@vger.kernel.org Signed-off-by: Prasad Pandit <pjp@fedoraproject.org> Link: https://lore.kernel.org/r/20240103075343.549293-1-ppandit@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: x86/pmu: Explicitly check for RDPMC of unsupported Intel PMC typesSean Christopherson1-6/+15
Explicitly check for attempts to read unsupported PMC types instead of letting the bounds check fail. Functionally, letting the check fail is ok, but it's unnecessarily subtle and does a poor job of documenting the architectural behavior that KVM is emulating. Reviewed-by: Dapeng Mi  <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: x86/pmu: Treat "fixed" PMU type in RDPMC as index as a value, not flagSean Christopherson1-3/+11
Refactor KVM's handling of ECX for RDPMC to treat the FIXED modifier as an explicit value, not a flag (minus one wart). While non-architectural PMUs do use bit 31 as a flag (for "fast" reads), architectural PMUs use the upper half of ECX to encode the type. From the SDM: ECX[31:16] specifies type of PMC while ECX[15:0] specifies the index of the PMC to be read within that type Note, that the known supported types are 4000H and 2000H, i.e. look a lot like flags, doesn't contradict the above statement that ECX[31:16] holds the type, at least not by any sane reading of the SDM. Keep the explicitly clearing of the FIXED "flag", as KVM subtly relies on that behavior to disallow unsupported types while allowing the correct indices for fixed counters. This wart will be cleaned up in short order. Opportunistically grab the per-type bitmask in the if-else blocks to eliminate the one-off usage of the local "fixed" bool. Reported-by: Jim Mattson <jmattson@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: x86/pmu: Disallow "fast" RDPMC for architectural Intel PMUsSean Christopherson1-4/+18
Inject #GP on RDPMC if the "fast" flag is set for architectural Intel PMUs, i.e. if the PMU version is non-zero. Per Intel's SDM, and confirmed on bare metal, the "fast" flag is supported only for non-architectural PMUs, and is reserved for architectural PMUs. If the processor does not support architectural performance monitoring (CPUID.0AH:EAX[7:0]=0), ECX[30:0] specifies the index of the PMC to be read. Setting ECX[31] selects “fast” read mode if supported. In this mode, RDPMC returns bits 31:0 of the PMC in EAX while clearing EDX to zero. If the processor does support architectural performance monitoring (CPUID.0AH:EAX[7:0] ≠ 0), ECX[31:16] specifies type of PMC while ECX[15:0] specifies the index of the PMC to be read within that type. The following PMC types are currently defined: — General-purpose counters use type 0. The index x (to read IA32_PMCx) must be less than the value enumerated by CPUID.0AH.EAX[15:8] (thus ECX[15:8] must be zero). — Fixed-function counters use type 4000H. The index x (to read IA32_FIXED_CTRx) can be used if either CPUID.0AH.EDX[4:0] > x or CPUID.0AH.ECX[x] = 1 (thus ECX[15:5] must be 0). — Performance metrics use type 2000H. This type can be used only if IA32_PERF_CAPABILITIES.PERF_METRICS_AVAILABLE[bit 15]=1. For this type, the index in ECX[15:0] is implementation specific. Opportunistically WARN if KVM ever actually tries to complete RDPMC for a non-architectural PMU, and drop the non-existent "support" for fast RDPMC, as KVM doesn't support such PMUs, i.e. kvm_pmu_rdpmc() should reject the RDPMC before getting to the Intel code. Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests") Fixes: 67f4d4288c35 ("KVM: x86: rdpmc emulation checks the counter incorrectly") Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: x86/pmu: Apply "fast" RDPMC only to Intel PMUsSean Christopherson2-4/+15
Move the handling of "fast" RDPMC instructions, which drop bits 63:32 of the count, to Intel. The "fast" flag, and all modifiers for that matter, are Intel-only and aren't supported by AMD. Opportunistically replace open coded bit crud with proper #defines, and add comments to try and disentangle the flags vs. values mess for non-architectural vs. architectural PMUs. Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: x86/pmu: Prioritize VMX interception over #GP on RDPMC due to bad indexSean Christopherson7-28/+26
Apply the pre-intercepts RDPMC validity check only to AMD, and rename all relevant functions to make it as clear as possible that the check is not a standard PMC index check. On Intel, the basic rule is that only invalid opcodes and privilege/permission/mode checks have priority over VM-Exit, i.e. RDPMC with an invalid index should VM-Exit, not #GP. While the SDM doesn't explicitly call out RDPMC, it _does_ explicitly use RDMSR of a non-existent MSR as an example where VM-Exit has priority over #GP, and RDPMC is effectively just a variation of RDMSR. Manually testing on various Intel CPUs confirms this behavior, and the inverted priority was introduced for SVM compatibility, i.e. was not an intentional change for Intel PMUs. On AMD, *all* exceptions on RDPMC have priority over VM-Exit. Check for a NULL kvm_pmu_ops.check_rdpmc_early instead of using a RET0 static call so as to provide a convenient location to document the difference between Intel and AMD, and to again try to make it as obvious as possible that the early check is a one-off thing, not a generic "is this PMC valid?" helper. Fixes: 8061252ee0d2 ("KVM: SVM: Add intercept checks for remaining twobyte instructions") Cc: Jim Mattson <jmattson@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: x86/pmu: Don't ignore bits 31:30 for RDPMC index on AMDSean Christopherson1-3/+1
Stop stripping bits 31:30 prior to validating/consuming the RDPMC index on AMD. Per the APM's documentation of RDPMC, *values* greater than 27 are reserved. The behavior of upper bits being flags is firmly Intel-only. Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: x86/pmu: Get eventsel for fixed counters from perfSean Christopherson1-13/+17
Get the event selectors used to effectively request fixed counters for perf events from perf itself instead of hardcoding them in KVM and hoping that they match the underlying hardware. While fixed counters 0 and 1 use architectural events, as of ffbe4ab0beda ("perf/x86/intel: Extend the ref-cycles event to GP counters") fixed counter 2 (reference TSC cycles) may use a software-defined pseudo-encoding or a real hardware-defined encoding. Reported-by: Kan Liang <kan.liang@linux.intel.com> Closes: https://lkml.kernel.org/r/4281eee7-6423-4ec8-bb18-c6aeee1faf2c%40linux.intel.com Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: x86/pmu: Setup fixed counters' eventsel during PMU initializationSean Christopherson1-11/+5
Set the eventsel for all fixed counters during PMU initialization, the eventsel is hardcoded and consumed if and only if the counter is supported, i.e. there is no reason to redo the setup every time the PMU is refreshed. Configuring all KVM-supported fixed counter also eliminates a potential pitfall if/when KVM supports discontiguous fixed counters, in which case configuring only nr_arch_fixed_counters will be insufficient (ignoring the fact that KVM will need many other changes to support discontiguous fixed counters). Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: x86/pmu: Remove KVM's enumeration of Intel's architectural encodingsSean Christopherson1-49/+23
Drop KVM's enumeration of Intel's architectural event encodings, and instead open code the three encodings (of which only two are real) that KVM uses to emulate fixed counters. Now that KVM doesn't incorrectly enforce the availability of architectural encodings, there is no reason for KVM to ever care about the encodings themselves, at least not in the current format of an array indexed by the encoding's position in CPUID. Opportunistically add a comment to explain why KVM cares about eventsel values for fixed counters. Suggested-by: Jim Mattson <jmattson@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: x86/pmu: Allow programming events that match unsupported arch eventsSean Christopherson4-46/+0
Remove KVM's bogus restriction that the guest can't program an event whose encoding matches an unsupported architectural event. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed using the architectural encoding. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the architectural event. Preventing the guest from counting events whose encoding happens to match an architectural event breaks existing functionality whenever Intel adds an architectural encoding that was *ever* used for a CPU that doesn't enumerate support for the architectural event, even if the encoding is for the exact same event! E.g. the architectural encoding for Top-Down Slots is 0x01a4. Broadwell CPUs, which do not support the Top-Down Slots architectural event, 0x01a4 is a valid, model-specific event. Denying guest usage of 0x01a4 if/when KVM adds support for Top-Down slots would break any Broadwell-based guest. Reported-by: Kan Liang <kan.liang@linux.intel.com> Closes: https://lore.kernel.org/all/2004baa6-b494-462c-a11f-8104ea152c6a@linux.intel.com Fixes: a21864486f7e ("KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event") Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: x86/pmu: Always treat Fixed counters as available when supportedSean Christopherson1-1/+14
Treat fixed counters as available when they are supported, i.e. don't silently ignore an enabled fixed counter just because guest CPUID says the associated general purpose architectural event is unavailable. KVM originally treated fixed counters as always available, but that got changed as part of a fix to avoid confusing REF_CPU_CYCLES, which does NOT map to an architectural event, with the actual architectural event used associated with bit 7, TOPDOWN_SLOTS. The commit justified the change with: If the event is marked as unavailable in the Intel guest CPUID 0AH.EBX leaf, we need to avoid any perf_event creation, whether it's a gp or fixed counter. but that justification doesn't mesh with reality. The Intel SDM uses "architectural events" to refer to both general purpose events (the ones with the reverse polarity mask in CPUID.0xA.EBX) and the events for fixed counters, e.g. the SDM makes statements like: Each of the fixed-function PMC can count only one architectural performance event. but the fact that fixed counter 2 (TSC reference cycles) doesn't have an associated general purpose architectural makes trying to apply the mask from CPUID.0xA.EBX impossible. Furthermore, the lack of enumeration for an architectural event in CPUID only means the CPU doesn't officially support the architectural encoding, i.e. it doesn't mean using the architectural encoding _won't_ work, it sipmly means there are no guarantees that it will work as expected. E.g. if KVM is running in a VM that advertises a fixed counters but not the corresponding architectural event encoding, and perf decides to use a general purpose counter instead of a fixed counter, odds are very good that the underlying hardware actually does support the architectrual encoding, and that programming the encoding will count the right thing. In other words, asking perf to count the event will probably work, whereas intentionally doing nothing is obviously guaranteed to fail. Note, at the time of the change, KVM didn't enforce hardware support, i.e. didn't prevent userspace from enumerating support in guest CPUID.0xA.EBX for architectural events that aren't supported in hardware. I.e. silently dropping the fixed counter didn't somehow protection against counting the wrong event, it just enforced guest CPUID. And practically speaking, this issue is almost certainly limited to running KVM on a funky virtual CPU model. No known real hardware has an asymmetric PMU where a fixed counter is supported but the associated architectural event is not. Fixes: a21864486f7e ("KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event") Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-29KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safeBrijesh Singh5-5/+52
Implement a workaround for an SNP erratum where the CPU will incorrectly signal an RMP violation #PF if a hugepage (2MB or 1GB) collides with the RMP entry of a VMCB, VMSA or AVIC backing page. When SEV-SNP is globally enabled, the CPU marks the VMCB, VMSA, and AVIC backing pages as "in-use" via a reserved bit in the corresponding RMP entry after a successful VMRUN. This is done for _all_ VMs, not just SNP-Active VMs. If the hypervisor accesses an in-use page through a writable translation, the CPU will throw an RMP violation #PF. On early SNP hardware, if an in-use page is 2MB-aligned and software accesses any part of the associated 2MB region with a hugepage, the CPU will incorrectly treat the entire 2MB region as in-use and signal a an RMP violation #PF. To avoid this, the recommendation is to not use a 2MB-aligned page for the VMCB, VMSA or AVIC pages. Add a generic allocator that will ensure that the page returned is not 2MB-aligned and is safe to be used when SEV-SNP is enabled. Also implement similar handling for the VMCB/VMSA pages of nested guests. [ mdr: Squash in nested guest handling from Ashish, commit msg fixups. ] Reported-by: Alper Gun <alpergun@google.com> # for nested VMSA case Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Marc Orr <marcorr@google.com> Signed-off-by: Marc Orr <marcorr@google.com> Co-developed-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240126041126.1927228-22-michael.roth@amd.com
2024-01-29crypto: ccp: Add support to initialize the AMD-SP for SEV-SNPBrijesh Singh1-1/+4
Before SNP VMs can be launched, the platform must be appropriately configured and initialized via the SNP_INIT command. During the execution of SNP_INIT command, the firmware configures and enables SNP security policy enforcement in many system components. Some system components write to regions of memory reserved by early x86 firmware (e.g. UEFI). Other system components write to regions provided by the operation system, hypervisor, or x86 firmware. Such system components can only write to HV-fixed pages or Default pages. They will error when attempting to write to pages in other page states after SNP_INIT enables their SNP enforcement. Starting in SNP firmware v1.52, the SNP_INIT_EX command takes a list of system physical address ranges to convert into the HV-fixed page states during the RMP initialization. If INIT_RMP is 1, hypervisors should provide all system physical address ranges that the hypervisor will never assign to a guest until the next RMP re-initialization. For instance, the memory that UEFI reserves should be included in the range list. This allows system components that occasionally write to memory (e.g. logging to UEFI reserved regions) to not fail due to RMP initialization and SNP enablement. Note that SNP_INIT(_EX) must not be executed while non-SEV guests are executing, otherwise it is possible that the system could reset or hang. The psp_init_on_probe module parameter was added for SEV/SEV-ES support and the init_ex_path module parameter to allow for time for the necessary file system to be mounted/available. SNP_INIT(_EX) does not use the file associated with init_ex_path. So, to avoid running into issues where SNP_INIT(_EX) is called while there are other running guests, issue it during module probe regardless of the psp_init_on_probe setting, but maintain the previous deferrable handling for SEV/SEV-ES initialization. [ mdr: Squash in psp_init_on_probe changes from Tom, reduce proliferation of 'probe' function parameter where possible. bp: Fix 32-bit allmodconfig build. ] Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Co-developed-by: Jarkko Sakkinen <jarkko@profian.com> Signed-off-by: Jarkko Sakkinen <jarkko@profian.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240126041126.1927228-14-michael.roth@amd.com
2024-01-29KVM: x86: Use mutex guards to eliminate __kvm_x86_vendor_init()Nikolay Borisov1-12/+3
Use the recently introduced guard(mutex) infrastructure acquire and automatically release vendor_module_lock when the guard goes out of scope. Drop the inner __kvm_x86_vendor_init(), its sole purpose was to simplify releasing vendor_module_lock in error paths. No functional change intended. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20231030141728.1406118-1-nik.borisov@suse.com [sean: rewrite changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-18Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds47-1032/+1747
Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
2024-01-10x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINEBreno Leitao5-6/+6
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options. [ mingo: Converted a few more uses in comments/messages as well. ] Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ariel Miculas <amiculas@cisco.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
2024-01-09Merge tag 'x86-cleanups-2024-01-08' of ↵Linus Torvalds7-10/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: - Change global variables to local - Add missing kernel-doc function parameter descriptions - Remove unused parameter from a macro - Remove obsolete Kconfig entry - Fix comments - Fix typos, mostly scripted, manually reviewed and a micro-optimization got misplaced as a cleanup: - Micro-optimize the asm code in secondary_startup_64_no_verify() * tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arch/x86: Fix typos x86/head_64: Use TESTB instead of TESTL in secondary_startup_64_no_verify() x86/docs: Remove reference to syscall trampoline in PTI x86/Kconfig: Remove obsolete config X86_32_SMP x86/io: Remove the unused 'bw' parameter from the BUILDIO() macro x86/mtrr: Document missing function parameters in kernel-doc x86/setup: Make relocated_ramdisk a local variable of relocate_initrd()
2024-01-08Merge tag 'vfs-6.8.misc' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ...
2024-01-08Merge tag 'kvm-x86-mmu-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini3-58/+48
KVM x86 MMU changes for 6.8: - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Relax the TDP MMU's lockdep assertions related to holding mmu_lock for read versus write so that KVM doesn't pass "bool shared" all over the place just to have precise assertions in paths that don't actually care about whether the caller is a reader or a writer.
2024-01-08Merge tag 'kvm-x86-xen-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2-6/+31
KVM Xen change for 6.8: To workaround Xen guests that don't expect Xen PV clocks to be marked as being based on a stable TSC, add a Xen config knob to allow userspace to opt out of KVM setting the "TSC stable" bit in Xen PV clocks. Note, the "TSC stable" bit was added to the PVCLOCK ABI by KVM without an ack from Xen, i.e. KVM isn't entirely blameless for the buggy guest behavior.
2024-01-08Merge tag 'kvm-x86-svm-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini3-19/+21
KVM SVM changes for 6.8: - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - Fix a benign NMI virtualization bug where KVM would unnecessarily intercept IRET when manually injecting an NMI, e.g. when KVM pends an NMI and injects a second, "simultaneous" NMI.
2024-01-08Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini16-29/+129
KVM x86 support for virtualizing Linear Address Masking (LAM) Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality checks for most virtual address usage in 64-bit mode, such that only the most significant bit of the untranslated address bits must match the polarity of the last translated address bit. This allows software to use ignored, untranslated address bits for metadata, e.g. to efficiently tag pointers for address sanitization. LAM can be enabled separately for user pointers and supervisor pointers, and for userspace LAM can be select between 48-bit and 57-bit masking - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15. - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6. For user pointers, LAM enabling utilizes two previously-reserved high bits from CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and 61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.: - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.: - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers The modified LAM canonicality checks: - LAM_S48 : [ 1 ][ metadata ][ 1 ] 63 47 - LAM_U48 : [ 0 ][ metadata ][ 0 ] 63 47 - LAM_S57 : [ 1 ][ metadata ][ 1 ] 63 56 - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ] 63 56 - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ] 63 56..47 The bulk of KVM support for LAM is to emulate LAM's modified canonicality checks. The approach taken by KVM is to "fill" the metadata bits using the highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value from the raw, untagged virtual address is kept for the canonicality check. This untagging allows Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26], enabling via CR3 and CR4 bits, etc.
2024-01-08Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini5-107/+120
KVM x86 PMU changes for 6.8: - Fix a variety of bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow.
2024-01-08Merge tag 'kvm-x86-misc-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini6-41/+70
KVM x86 misc changes for 6.8: - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds.
2024-01-08Merge tag 'kvm-x86-hyperv-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini25-741/+1024
KVM x86 Hyper-V changes for 6.8: - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time.
2024-01-08Merge tag 'kvm-x86-generic-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-2/+2
Common KVM changes for 6.8: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
2024-01-08KVM: x86: add missing "depends on KVM"Paolo Bonzini1-1/+1
Support for KVM software-protected VMs should not be configurable, if KVM is not available at all. Fixes: 89ea60c2c7b5 ("KVM: x86: Add support for "protected VMs" that can utilize private memory") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-08KVM: introduce CONFIG_KVM_COMMONPaolo Bonzini1-2/+1
CONFIG_HAVE_KVM is currently used by some architectures to either enabled the KVM config proper, or to enable host-side code that is not part of the KVM module. However, CONFIG_KVM's "select" statement in virt/kvm/Kconfig corresponds to a third meaning, namely to enable common Kconfigs required by all architectures that support KVM. These three meanings can be replaced respectively by an architecture-specific Kconfig, by IS_ENABLED(CONFIG_KVM), or by a new Kconfig symbol that is in turn selected by the architecture-specific "config KVM". Start by introducing such a new Kconfig symbol, CONFIG_KVM_COMMON. Unlike CONFIG_HAVE_KVM, it is selected by CONFIG_KVM, not by architecture code, and it brings in all dependencies of common KVM code. In particular, INTERVAL_TREE was missing in loongarch and riscv, so that is another thing that is fixed. Fixes: 8132d887a702 ("KVM: remove CONFIG_HAVE_KVM_EVENTFD", 2023-12-08) Reported-by: Randy Dunlap <rdunlap@infradead.org> Closes: https://lore.kernel.org/all/44907c6b-c5bd-4e4a-a921-e4d3825539d8@infradead.org/ Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-03arch/x86: Fix typosBjorn Helgaas7-10/+10
Fix typos, most reported by "codespell arch/x86". Only touches comments, no code changes. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2024-01-02Merge tag 'kvm-riscv-6.8-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini3-1/+21
KVM/riscv changes for 6.8 part #1 - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Steal time account support along with selftest
2024-01-02Merge tag 'loongarch-kvm-6.8' of ↵Paolo Bonzini3-6/+12
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.8 1. Optimization for memslot hugepage checking. 2. Cleanup and fix some HW/SW timer issues. 3. Add LSX/LASX (128bit/256bit SIMD) support.
2023-12-13KVM: SEV: Do not intercept accesses to MSR_IA32_XSS for SEV-ES guestsMichael Roth3-1/+21
When intercepts are enabled for MSR_IA32_XSS, the host will swap in/out the guest-defined values while context-switching to/from guest mode. However, in the case of SEV-ES, vcpu->arch.guest_state_protected is set, so the guest-defined value is effectively ignored when switching to guest mode with the understanding that the VMSA will handle swapping in/out this register state. However, SVM is still configured to intercept these accesses for SEV-ES guests, so the values in the initial MSR_IA32_XSS are effectively read-only, and a guest will experience undefined behavior if it actually tries to write to this MSR. Fortunately, only CET/shadowstack makes use of this register on SEV-ES-capable systems currently, which isn't yet widely used, but this may become more of an issue in the future. Additionally, enabling intercepts of MSR_IA32_XSS results in #VC exceptions in the guest in certain paths that can lead to unexpected #VC nesting levels. One example is SEV-SNP guests when handling #VC exceptions for CPUID instructions involving leaf 0xD, subleaf 0x1, since they will access MSR_IA32_XSS as part of servicing the CPUID #VC, then generate another #VC when accessing MSR_IA32_XSS, which can lead to guest crashes if an NMI occurs at that point in time. Running perf on a guest while it is issuing such a sequence is one example where these can be problematic. Address this by disabling intercepts of MSR_IA32_XSS for SEV-ES guests if the host/guest configuration allows it. If the host/guest configuration doesn't allow for MSR_IA32_XSS, leave it intercepted so that it can be caught by the existing checks in kvm_{set,get}_msr_common() if the guest still attempts to access it. Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Cc: Alexey Kardashevskiy <aik@amd.com> Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231016132819.1002933-4-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-08KVM: remove CONFIG_HAVE_KVM_IRQFDPaolo Bonzini1-1/+0
All platforms with a kernel irqchip have support for irqfd. Unify the two configuration items so that userspace can expect to use irqfd to inject interrupts into the irqchip. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-08KVM: remove CONFIG_HAVE_KVM_EVENTFDPaolo Bonzini1-1/+0
virt/kvm/eventfd.c is compiled unconditionally, meaning that the ioeventfds member of struct kvm is accessed unconditionally. CONFIG_HAVE_KVM_EVENTFD therefore must be defined for KVM common code to compile successfully, remove it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-08KVM: SVM: Update EFER software model on CR0 trap for SEV-ESSean Christopherson1-3/+5
In general, activating long mode involves setting the EFER_LME bit in the EFER register and then enabling the X86_CR0_PG bit in the CR0 register. At this point, the EFER_LMA bit will be set automatically by hardware. In the case of SVM/SEV guests where writes to CR0 are intercepted, it's necessary for the host to set EFER_LMA on behalf of the guest since hardware does not see the actual CR0 write. In the case of SEV-ES guests where writes to CR0 are trapped instead of intercepted, the hardware *does* see/record the write to CR0 before exiting and passing the value on to the host, so as part of enabling SEV-ES support commit f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts under SEV-ES") dropped special handling of the EFER_LMA bit with the understanding that it would be set automatically. However, since the guest never explicitly sets the EFER_LMA bit, the host never becomes aware that it has been set. This becomes problematic when userspace tries to get/set the EFER values via KVM_GET_SREGS/KVM_SET_SREGS, since the EFER contents tracked by the host will be missing the EFER_LMA bit, and when userspace attempts to pass the EFER value back via KVM_SET_SREGS it will fail a sanity check that asserts that EFER_LMA should always be set when X86_CR0_PG and EFER_LME are set. Fix this by always inferring the value of EFER_LMA based on X86_CR0_PG and EFER_LME, regardless of whether or not SEV-ES is enabled. Fixes: f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts under SEV-ES") Reported-by: Peter Gonda <pgonda@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210507165947.2502412-2-seanjc@google.com> [A two year old patch that was revived after we noticed the failure in KVM_SET_SREGS and a similar patch was posted by Michael Roth. This is Sean's patch, but with Michael's more complete commit message. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-08Merge tag 'kvm-x86-fixes-6.7-rcN' of https://github.com/kvm-x86/linux into ↵Paolo Bonzini2-3/+7
kvm-master KVM fixes for 6.7-rcN: - When checking if a _running_ vCPU is "in-kernel", i.e. running at CPL0, get the CPL directly instead of relying on preempted_in_kernel, which is valid if and only if the vCPU was preempted, i.e. NOT running. - Set .owner for various KVM file_operations so that files refcount the KVM module until KVM is done executing _all_ code, including the last few instructions of kvm_put_kvm(). And then revert the misguided attempt to rely on "struct kvm" refcounts to pin KVM-the-module. - Fix a benign "return void" that was recently introduced.