Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/gds fixes from Dave Hansen:
"Mitigate Gather Data Sampling issue:
- Add Base GDS mitigation
- Support GDS_NO under KVM
- Fix a documentation typo"
* tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/x86: Fix backwards on/off logic about YMM support
KVM: Add GDS_NO support to KVM
x86/speculation: Add Kconfig option for GDS
x86/speculation: Add force option to GDS mitigation
x86/speculation: Add Gather Data Sampling mitigation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/srso fixes from Borislav Petkov:
"Add a mitigation for the speculative RAS (Return Address Stack)
overflow vulnerability on AMD processors.
In short, this is yet another issue where userspace poisons a
microarchitectural structure which can then be used to leak privileged
information through a side channel"
* tag 'x86_bugs_srso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/srso: Tie SBPB bit setting to microcode patch detection
x86/srso: Add a forgotten NOENDBR annotation
x86/srso: Fix return thunks in generated code
x86/srso: Add IBPB on VMEXIT
x86/srso: Add IBPB
x86/srso: Add SRSO_NO support
x86/srso: Add IBPB_BRTYPE support
x86/srso: Add a Speculative RAS Overflow mitigation
x86/bugs: Increase the x86 bugs vector size to two u32s
|
|
To avoid possible time-of-check/time-of-use issues, the GHCB should
almost never be accessed outside dump_ghcb, sev_es_sync_to_ghcb
and sev_es_sync_from_ghcb. The only legitimate uses are to set the
exitinfo fields and to find the address of the scratch area embedded
in the ghcb. Accessing ghcb_usage also goes through svm->sev_es.ghcb
in sev_es_validate_vmgexit(), but that is because anyway the value is
not used.
Removing a shortcut variable that contains the value of svm->sev_es.ghcb
makes these cases a bit more verbose, but it limits the chance of someone
reading the ghcb by mistake.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
A KVM guest using SEV-ES or SEV-SNP with multiple vCPUs can trigger
a double fetch race condition vulnerability and invoke the VMGEXIT
handler recursively.
sev_handle_vmgexit() maps the GHCB page using kvm_vcpu_map() and then
fetches the exit code using ghcb_get_sw_exit_code(). Soon after,
sev_es_validate_vmgexit() fetches the exit code again. Since the GHCB
page is shared with the guest, the guest is able to quickly swap the
values with another vCPU and hence bypass the validation. One vmexit code
that can be rejected by sev_es_validate_vmgexit() is SVM_EXIT_VMGEXIT;
if sev_handle_vmgexit() observes it in the second fetch, the call
to svm_invoke_exit_handler() will invoke sev_handle_vmgexit() again
recursively.
To avoid the race, always fetch the GHCB data from the places where
sev_es_sync_from_ghcb stores it.
Exploiting recursions on linux kernel has been proven feasible
in the past, but the impact is mitigated by stack guard pages
(CONFIG_VMAP_STACK). Still, if an attacker manages to call the handler
multiple times, they can theoretically trigger a stack overflow and
cause a denial-of-service, or potentially guest-to-host escape in kernel
configurations without stack guard pages.
Note that winning the race reliably in every iteration is very tricky
due to the very tight window of the fetches; depending on the compiler
settings, they are often consecutive because of optimization and inlining.
Tested by booting an SEV-ES RHEL9 guest.
Fixes: CVE-2023-4155
Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Cc: stable@vger.kernel.org
Reported-by: Andy Nguyen <theflow@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Validation of the GHCB is susceptible to time-of-check/time-of-use vulnerabilities.
To avoid them, we would like to always snapshot the fields that are read in
sev_es_validate_vmgexit(), and not use the GHCB anymore after it returns.
This means:
- invoking sev_es_sync_from_ghcb() before any GHCB access, including before
sev_es_validate_vmgexit()
- snapshotting all fields including the valid bitmap and the sw_scratch field,
which are currently not caching anywhere.
The valid bitmap is the first thing to be copied out of the GHCB; then,
further accesses will use the copy in svm->sev_es.
Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Skip writes to MSR_AMD64_TSC_RATIO that are done in the context of a vCPU
if guest state isn't loaded, i.e. if KVM will update MSR_AMD64_TSC_RATIO
during svm_prepare_switch_to_guest() before entering the guest. Checking
guest_state_loaded may or may not be a net positive for performance as
the current_tsc_ratio cache will optimize away duplicate WRMSRs in the
vast majority of scenarios. However, the cost of the check is negligible,
and the real motivation is to document that KVM needs to load the vCPU's
value only when running the vCPU.
Link: https://lore.kernel.org/r/20230729011608.1065019-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the @offset and @multiplier params from the kvm_x86_ops hooks for
propagating TSC offsets/multipliers into hardware, and instead have the
vendor implementations pull the information directly from the vCPU
structure. The respective vCPU fields _must_ be written at the same
time in order to maintain consistent state, i.e. it's not random luck
that the value passed in by all callers is grabbed from the vCPU.
Explicitly grabbing the value from the vCPU field in SVM's implementation
in particular will allow for additional cleanup without introducing even
more subtle dependencies. Specifically, SVM can skip the WRMSR if guest
state isn't loaded, i.e. svm_prepare_switch_to_guest() will load the
correct value for the vCPU prior to entering the guest.
This also reconciles KVM's handling of related values that are stored in
the vCPU, as svm_write_tsc_offset() already assumes/requires the caller
to have updated l1_tsc_offset.
Link: https://lore.kernel.org/r/20230729011608.1065019-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly disable preemption when writing MSR_AMD64_TSC_RATIO only in the
"outer" helper, as all direct callers of the "inner" helper now run with
preemption already disabled. And that isn't a coincidence, as the outer
helper requires a vCPU and is intended to be used when modifying guest
state and/or emulating guest instructions, which are typically done with
preemption enabled.
Direct use of the inner helper should be extremely limited, as the only
time KVM should modify MSR_AMD64_TSC_RATIO without a vCPU is when
sanitizing the MSR for a specific pCPU (currently done when {en,dis}abling
disabling SVM). The other direct caller is svm_prepare_switch_to_guest(),
which does have a vCPU, but is a one-off special case: KVM is about to
enter the guest on a specific pCPU and thus must have preemption disabled.
Link: https://lore.kernel.org/r/20230729011608.1065019-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When emulating nested SVM transitions, use the outer helper for writing
the TSC multiplier for L2. Using the inner helper only for one-off cases,
i.e. for paths where KVM is NOT emulating or modifying vCPU state, will
allow for multiple cleanups:
- Explicitly disabling preemption only in the outer helper
- Getting the multiplier from the vCPU field in the outer helper
- Skipping the WRMSR in the outer helper if guest state isn't loaded
Opportunistically delete an extra newline.
No functional change intended.
Link: https://lore.kernel.org/r/20230729011608.1065019-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When emulating nested VM-Exit, load L1's TSC multiplier if L1's desired
ratio doesn't match the current ratio, not if the ratio L1 is using for
L2 diverges from the default. Functionally, the end result is the same
as KVM will run L2 with L1's multiplier if L2's multiplier is the default,
i.e. checking that L1's multiplier is loaded is equivalent to checking if
L2 has a non-default multiplier.
However, the assertion that TSC scaling is exposed to L1 is flawed, as
userspace can trigger the WARN at will by writing the MSR and then
updating guest CPUID to hide the feature (modifying guest CPUID is
allowed anytime before KVM_RUN). E.g. hacking KVM's state_test
selftest to do
vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0);
vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR);
after restoring state in a new VM+vCPU yields an endless supply of:
------------[ cut here ]------------
WARNING: CPU: 10 PID: 206939 at arch/x86/kvm/svm/nested.c:1105
nested_svm_vmexit+0x6af/0x720 [kvm_amd]
Call Trace:
nested_svm_exit_handled+0x102/0x1f0 [kvm_amd]
svm_handle_exit+0xb9/0x180 [kvm_amd]
kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm]
kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm]
? trace_hardirqs_off+0x4d/0xa0
__se_sys_ioctl+0x7a/0xc0
__x64_sys_ioctl+0x21/0x30
do_syscall_64+0x41/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Unlike the nested VMRUN path, hoisting the svm->tsc_scaling_enabled check
into the if-statement is wrong as KVM needs to ensure L1's multiplier is
loaded in the above scenario. Alternatively, the WARN_ON() could simply
be deleted, but that would make KVM's behavior even more subtle, e.g. it's
not immediately obvious why it's safe to write MSR_AMD64_TSC_RATIO when
checking only tsc_ratio_msr.
Fixes: 5228eb96a487 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230729011608.1065019-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Check for nested TSC scaling support on nested SVM VMRUN instead of
asserting that TSC scaling is exposed to L1 if L1's MSR_AMD64_TSC_RATIO
has diverged from KVM's default. Userspace can trigger the WARN at will
by writing the MSR and then updating guest CPUID to hide the feature
(modifying guest CPUID is allowed anytime before KVM_RUN). E.g. hacking
KVM's state_test selftest to do
vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0);
vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR);
after restoring state in a new VM+vCPU yields an endless supply of:
------------[ cut here ]------------
WARNING: CPU: 164 PID: 62565 at arch/x86/kvm/svm/nested.c:699
nested_vmcb02_prepare_control+0x3d6/0x3f0 [kvm_amd]
Call Trace:
<TASK>
enter_svm_guest_mode+0x114/0x560 [kvm_amd]
nested_svm_vmrun+0x260/0x330 [kvm_amd]
vmrun_interception+0x29/0x30 [kvm_amd]
svm_invoke_exit_handler+0x35/0x100 [kvm_amd]
svm_handle_exit+0xe7/0x180 [kvm_amd]
kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm]
kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm]
__se_sys_ioctl+0x7a/0xc0
__x64_sys_ioctl+0x21/0x30
do_syscall_64+0x41/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x45ca1b
Note, the nested #VMEXIT path has the same flaw, but needs a different
fix and will be handled separately.
Fixes: 5228eb96a487 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230729011608.1065019-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Latest Intel platform GraniteRapids-D introduces AMX-COMPLEX, which adds
two instructions to perform matrix multiplication of two tiles containing
complex elements and accumulate the results into a packed single precision
tile.
AMX-COMPLEX is enumerated via CPUID.(EAX=7,ECX=1):EDX[bit 8]
Advertise AMX_COMPLEX if it's supported in hardware. There are no VMX
controls for the feature, i.e. the instructions can't be interecepted, and
KVM advertises base AMX in CPUID if AMX is supported in hardware, even if
KVM doesn't advertise AMX as being supported in XCR0, e.g. because the
process didn't opt-in to allocating tile data.
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230802022954.193843-1-tao1.su@linux.intel.com
[sean: tweak last paragraph of changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Bail from vmx_emergency_disable() without processing the list of loaded
VMCSes if CR4.VMXE=0, i.e. if the CPU can't be post-VMXON. It should be
impossible for the list to have entries if VMX is already disabled, and
even if that invariant doesn't hold, VMCLEAR will #UD anyways, i.e.
processing the list is pointless even if it somehow isn't empty.
Assuming no existing KVM bugs, this should be a glorified nop. The
primary motivation for the change is to avoid having code that looks like
it does VMCLEAR, but then skips VMXON, which is nonsensical.
Suggested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that kvm_rebooting is guaranteed to be true prior to disabling SVM
in an emergency, use the existing stgi() helper instead of open coding
STGI. In effect, eat faults on STGI if and only if kvm_rebooting==true.
Link: https://lore.kernel.org/r/20230721201859.2307736-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Set kvm_rebooting when virtualization is disabled in an emergency so that
KVM eats faults on virtualization instructions even if kvm_reboot() isn't
reached.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move cpu_svm_disable() into KVM proper now that all hardware
virtualization management is routed through KVM. Remove the now-empty
virtext.h.
No functional change intended.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Disable migration when probing VMX support during module load to ensure
the CPU is stable, mostly to match similar SVM logic, where allowing
migration effective requires deliberately writing buggy code. As a bonus,
KVM won't report the wrong CPU to userspace if VMX is unsupported, but in
practice that is a very, very minor bonus as the only way that reporting
the wrong CPU would actually matter is if hardware is broken or if the
system is misconfigured, i.e. if KVM gets migrated from a CPU that _does_
support VMX to a CPU that does _not_ support VMX.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Check "this" CPU instead of the boot CPU when querying SVM support so that
the per-CPU checks done during hardware enabling actually function as
intended, i.e. will detect issues where SVM isn't support on all CPUs.
Disable migration for the use from svm_init() mostly so that the standard
accessors for the per-CPU data can be used without getting yelled at by
CONFIG_DEBUG_PREEMPT=y sanity checks. Preventing the "disabled by BIOS"
error message from reporting the wrong CPU is largely a bonus, as ensuring
a stable CPU during module load is a non-goal for KVM.
Link: https://lore.kernel.org/all/ZAdxNgv0M6P63odE@google.com
Cc: Kai Huang <kai.huang@intel.com>
Cc: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fold the guts of cpu_has_svm() into kvm_is_svm_supported(), its sole
remaining user.
No functional change intended.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Make building KVM SVM support depend on support for AMD or Hygon. KVM
already effectively restricts SVM support to AMD and Hygon by virtue of
the vendor string checks in cpu_has_svm(), and KVM VMX supports depends
on one of its three known vendors (Intel, Centaur, or Zhaoxin).
Add the CPU_SUP_HYGON clause even though CPU_SUP_HYGON selects CPU_SUP_AMD
to document that KVM SVM support isn't just for AMD CPUs, and to prevent
breakage should Hygon support ever become a standalone thing.
Link: https://lore.kernel.org/r/20230721201859.2307736-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that VMX is disabled in emergencies via the virt callbacks, move the
VMXOFF helpers into KVM, the only remaining user.
No functional change intended.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fold the raw CPUID check for VMX into kvm_is_vmx_supported(), its sole
user. Keep the check even though KVM also checks X86_FEATURE_VMX, as the
intent is to provide a unique error message if VMX is unsupported by
hardware, whereas X86_FEATURE_VMX may be clear due to firmware and/or
kernel actions.
No functional change intended.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use the virt callback to disable SVM (and set GIF=1) during an emergency
instead of blindly attempting to disable SVM. Like the VMX case, if a
hypervisor, i.e. KVM, isn't loaded/active, SVM can't be in use.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use KVM VMX's reboot/crash callback to do VMXOFF in an emergency instead
of manually and blindly doing VMXOFF. There's no need to attempt VMXOFF
if a hypervisor, i.e. KVM, isn't loaded/active, i.e. if the CPU can't
possibly be post-VMXON.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Provide dedicated helpers to (un)register virt hooks used during an
emergency crash/reboot, and WARN if there is an attempt to overwrite
the registered callback, or an attempt to do an unpaired unregister.
Opportunsitically use rcu_assign_pointer() instead of RCU_INIT_POINTER(),
mainly so that the set/unset paths are more symmetrical, but also because
any performance gains from using RCU_INIT_POINTER() are meaningless for
this code.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
VMCLEAR active VMCSes before any emergency reboot, not just if the kernel
may kexec into a new kernel after a crash. Per Intel's SDM, the VMX
architecture doesn't require the CPU to flush the VMCS cache on INIT. If
an emergency reboot doesn't RESET CPUs, cached VMCSes could theoretically
be kept and only be written back to memory after the new kernel is booted,
i.e. could effectively corrupt memory after reboot.
Opportunistically remove the setting of the global pointer to NULL to make
checkpatch happy.
Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Retry the optimized APIC map recalculation if an APIC-enabled vCPU shows
up between allocating the map and filling in the map data. Conditionally
reschedule before retrying even though the number of vCPUs that can be
created is bounded by KVM. Retrying a few thousand times isn't so slow
as to be hugely problematic, but it's not blazing fast either.
Reset xapic_id_mistach on each retry as a vCPU could change its xAPIC ID
between loops, but do NOT reset max_id. The map size also factors in
whether or not a vCPU's local APIC is hardware-enabled, i.e. userspace
and/or the guest can theoretically keep KVM retrying indefinitely. The
only downside is that KVM will allocate more memory than is strictly
necessary if the vCPU with the highest x2APIC ID disabled its APIC while
the recalculation was in-progress.
Refresh kvm->arch.apic_map_dirty to opportunistically change it from
DIRTY => UPDATE_IN_PROGRESS to avoid an unnecessary recalc from a
different task, i.e. if another task is waiting to attempt an update
(which is likely since a retry happens if and only if an update is
required).
Link: https://lore.kernel.org/r/20230602233250.1014316-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the call to kvm_x86_pmu.hw_event_available(), which has nothing to
with the userspace PMU filter, out of check_pmu_event_filter() and into
its sole caller pmc_event_is_allowed(). pmc_event_is_allowed() didn't
exist when commit 7aadaa988c5e ("KVM: x86/pmu: Drop amd_event_mapping[]
in the KVM context"), so presumably the motivation for invoking
.hw_event_available() from check_pmu_event_filter() was to avoid having
to add multiple call sites.
Link: https://lore.kernel.org/r/20230607010206.1425277-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Assert that the number of known fixed_pmc_events matches the max number of
fixed counters supported by KVM, and clean up related code.
Opportunistically extend setup_fixed_pmc_eventsel()'s use of
array_index_nospec() to cover fixed_counters, as nr_arch_fixed_counters is
set based on userspace input (but capped using KVM-controlled values).
Link: https://lore.kernel.org/r/20230607010206.1425277-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Walk only the "real", i.e. non-pseudo, architectural events when checking
if a hardware event is available, i.e. isn't disabled by guest CPUID.
Skipping pseudo-arch events in the loop body is unnecessarily convoluted,
especially now that KVM has enums that delineate between real and pseudo
events.
Link: https://lore.kernel.org/r/20230607010206.1425277-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add "enum intel_pmu_architectural_events" to replace the magic numbers for
the (pseudo-)architectural events, and to give a meaningful name to each
event so that new readers don't need psychic powers to understand what the
code is doing.
Cc: Aaron Lewis <aaronlewis@google.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230607010206.1425277-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use the recently introduced svm_get_lbr_vmcb() instead an open coded
equivalent to retrieve the target VMCB when emulating writes to
MSR_IA32_DEBUGCTLMSR.
No functional change intended.
Link: https://lore.kernel.org/r/20230607203519.1570167-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Clean up the enable_lbrv computation in svm_update_lbrv() to consolidate
the logic for computing enable_lbrv into a single statement, and to remove
the coding style violations (lack of curly braces on nested if).
No functional change intended.
Link: https://lore.kernel.org/r/20230607203519.1570167-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Refactor KVM's handling of LBR MSRs on SVM to avoid a second layer of
case statements, and thus eliminate a dead KVM_BUG() call, which (a) will
never be hit in the current code base and (b) if a future commit breaks
things, will never fire as KVM passes "false" instead "true" or '1' for
the KVM_BUG() condition.
Reported-by: Michal Luczaj <mhal@rbox.co>
Cc: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230607203519.1570167-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Remove the superfluous flush of the current TLB in VMX's handling of
migration of the APIC-access page, as a full TLB flush on all vCPUs will
have already been performed in response to kvm_unmap_gfn_range() *if*
there were SPTEs pointing at the APIC-access page. And if there were no
valid SPTEs, then there can't possibly be TLB entries to flush.
The extra flush was added by commit fb6c81984313 ("kvm: vmx: Flush TLB
when the APIC-access address changes"), with the justification of "because
the SDM says so". The SDM said, and still says:
As detailed in Section xx.x.x, an access to the APIC-access page might
not cause an APIC-access VM exit if software does not properly invalidate
information that may be cached from the EPT paging structures. If EPT was
in use on a logical processor at one time with EPTP X, it is recommended
that software use the INVEPT instruction with the “single-context” INVEPT
type and with EPTP X in the INVEPT descriptor before a VM entry on the
same logical processor that enables EPT with EPTP X and either (a) the
"virtualize APIC accesses" VM- execution control was changed from 0 to 1;
or (b) the value of the APIC-access address was changed.
But the "recommendation" for (b) is predicated on there actually being
a valid EPT translation *and* possible TLB entries for the GPA (or guest
VA when using shadow paging). It's possible that a different vCPU has
established a mapping for the new page, but the current vCPU can't have
entered the guest, i.e. can't have created a TLB entry, between flushing
the old mappings and changing its vmcs.APIC_ACCESS_ADDR.
kvm_unmap_gfn_range() waits for all vCPUs to ack KVM_REQ_APIC_PAGE_RELOAD,
and then flushes remote TLBs (which may or may not also pend a request).
Thus the vCPU is guaranteed to update vmcs.APIC_ACCESS_ADDR before
re-entering the guest and before it can possibly create new TLB entries.
In other words, KVM does flush in this case, it just does so earlier
on while handling the page migration.
Note, VMX also flushes if the vCPU is migrated to a new pCPU, i.e. if
the vCPU is migrated to a pCPU that entered the guest for a different
vCPU.
Suggested-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230721233858.2343941-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that KVM snapshots the host's MSR_IA32_ARCH_CAPABILITIES, drop the
similar snapshot/cache of whether or not KVM is allowed to manipulate
MSR_IA32_MCU_OPT_CTRL.FB_CLEAR_DIS. The motivation for the cache was
presumably to avoid the RDMSR, e.g. boot_cpu_has_bug() is quite cheap, and
modifying the vCPU's MSR_IA32_ARCH_CAPABILITIES is an infrequent option
and a relatively slow path.
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230607004311.1420507-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Snapshot the host's MSR_IA32_ARCH_CAPABILITIES, if it's supported, instead
of reading the MSR every time KVM wants to query the host state, e.g. when
initializing the default value during vCPU creation. The paths that query
ARCH_CAPABILITIES aren't particularly performance sensitive, but creating
vCPUs is a frequent enough operation that burning 8 bytes is a good
trade-off.
Alternatively, KVM could add a field in kvm_caps and thus skip the
on-demand calculations entirely, but a pure snapshot isn't possible due to
the way KVM handles the l1tf_vmx_mitigation module param. And unlike the
other "supported" fields in kvm_caps, KVM doesn't enforce the "supported"
value, i.e. KVM treats ARCH_CAPABILITIES like a CPUID leaf and lets
userspace advertise whatever it wants. Those problems are solvable, but
it's not clear there is real benefit versus snapshotting the host value,
and grabbing the host value will allow additional cleanup of KVM's
FB_CLEAR_CTRL code.
Link: https://lore.kernel.org/all/20230524061634.54141-2-chao.gao@intel.com
Cc: Chao Gao <chao.gao@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230607004311.1420507-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Advertise CPUID 0x80000005 (L1 cache and TLB info) to userspace so that
VMMs that reflect KVM_GET_SUPPORTED_CPUID into KVM_SET_CPUID2 will
enumerate sane cache/TLB information to the guest.
CPUID 0x80000006 (L2 cache and TLB and L3 cache info) has been returned
since commit 43d05de2bee7 ("KVM: pass through CPUID(0x80000006)").
Enumerating both 0x80000005 and 0x80000006 with KVM_GET_SUPPORTED_CPUID
is better than reporting one or the other, and 0x80000005 could be helpful
for VMM to pass it to KVM_SET_CPUID{,2} for the same reason with
0x80000006.
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Link: https://lore.kernel.org/all/ZK7NmfKI9xur%2FMop@google.com
Link: https://lore.kernel.org/r/20230712183136.85561-1-itazur@amazon.com
[sean: add link, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Remove x86_emulate_ops::guest_has_long_mode along with its implementation,
emulator_guest_has_long_mode(). It has been unused since commit
1d0da94cdafe ("KVM: x86: do not go through ctxt->ops when emulating rsm").
No functional change intended.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230718101809.1249769-1-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In a spirit of using a sledgehammer to crack a nut, make sync_regs() feed
__set_sregs() and kvm_vcpu_ioctl_x86_set_vcpu_events() with kernel's own
copy of data.
Both __set_sregs() and kvm_vcpu_ioctl_x86_set_vcpu_events() assume they
have exclusive rights to structs they operate on. While this is true when
coming from an ioctl handler (caller makes a local copy of user's data),
sync_regs() breaks this contract; a pointer to a user-modifiable memory
(vcpu->run->s.regs) is provided. This can lead to a situation when incoming
data is checked and/or sanitized only to be re-set by a user thread running
in parallel.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Fixes: 01643c51bfcf ("KVM: x86: KVM_CAP_SYNC_REGS")
Link: https://lore.kernel.org/r/20230728001606.2275586-2-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use sysfs_emit() instead of the sprintf() for sysfs entries. sysfs_emit()
knows the maximum of the temporary buffer used for outputting sysfs
content and avoids overrunning the buffer length.
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230625073438.57427-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Stuff CR0 and/or CR4 to be compliant with a restricted guest if and only
if KVM itself is not configured to utilize unrestricted guests, i.e. don't
stuff CR0/CR4 for a restricted L2 that is running as the guest of an
unrestricted L1. Any attempt to VM-Enter a restricted guest with invalid
CR0/CR4 values should fail, i.e. in a nested scenario, KVM (as L0) should
never observe a restricted L2 with incompatible CR0/CR4, since nested
VM-Enter from L1 should have failed.
And if KVM does observe an active, restricted L2 with incompatible state,
e.g. due to a KVM bug, fudging CR0/CR4 instead of letting VM-Enter fail
does more harm than good, as KVM will often neglect to undo the side
effects, e.g. won't clear rmode.vm86_active on nested VM-Exit, and thus
the damage can easily spill over to L1. On the other hand, letting
VM-Enter fail due to bad guest state is more likely to contain the damage
to L2 as KVM relies on hardware to perform most guest state consistency
checks, i.e. KVM needs to be able to reflect a failed nested VM-Enter into
L1 irrespective of (un)restricted guest behavior.
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Fixes: bddd82d19e2e ("KVM: nVMX: KVM needs to unset "unrestricted guest" VM-execution control in vmcs02 if vmcs12 doesn't set it")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid,
e.g. due to setting bits 63:32, illegal combinations, or to a value that
isn't allowed in VMX (non-)root mode. The VMX checks in particular are
"fun" as failure to disallow Real Mode for an L2 that is configured with
unrestricted guest disabled, when KVM itself has unrestricted guest
enabled, will result in KVM forcing VM86 mode to virtual Real Mode for
L2, but then fail to unwind the related metadata when synthesizing a
nested VM-Exit back to L1 (which has unrestricted guest enabled).
Opportunistically fix a benign typo in the prototype for is_valid_cr4().
Cc: stable@vger.kernel.org
Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that handle_fastpath_set_msr_irqoff() acquires kvm->srcu, i.e. allows
dereferencing memslots during WRMSR emulation, drop the requirement that
"next RIP" is valid. In hindsight, acquiring kvm->srcu would have been a
better fix than avoiding the pastpath, but at the time it was thought that
accessing SRCU-protected data in the fastpath was a one-off edge case.
This reverts commit 5c30e8101e8d5d020b1d7119117889756a6ed713.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721224337.2335137-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Temporarily acquire kvm->srcu for read when potentially emulating WRMSR in
the VM-Exit fastpath handler, as several of the common helpers used during
emulation expect the caller to provide SRCU protection. E.g. if the guest
is counting instructions retired, KVM will query the PMU event filter when
stepping over the WRMSR.
dump_stack+0x85/0xdf
lockdep_rcu_suspicious+0x109/0x120
pmc_event_is_allowed+0x165/0x170
kvm_pmu_trigger_event+0xa5/0x190
handle_fastpath_set_msr_irqoff+0xca/0x1e0
svm_vcpu_run+0x5c3/0x7b0 [kvm_amd]
vcpu_enter_guest+0x2108/0x2580
Alternatively, check_pmu_event_filter() could acquire kvm->srcu, but this
isn't the first bug of this nature, e.g. see commit 5c30e8101e8d ("KVM:
SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"). Providing
protection for the entirety of WRMSR emulation will allow reverting the
aforementioned commit, and will avoid having to play whack-a-mole when new
uses of SRCU-protected structures are inevitably added in common emulation
helpers.
Fixes: dfdeda67ea2d ("KVM: x86/pmu: Prevent the PMU from counting disallowed events")
Reported-by: Greg Thelen <gthelen@google.com>
Reported-by: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721224337.2335137-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use vmread_error() to report VM-Fail on VMREAD for the "asm goto" case,
now that trampoline case has yet another wrapper around vmread_error() to
play nice with instrumentation.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721235637.2345403-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Mark vmread_error_trampoline() as noinstr, and add a second trampoline
for the CONFIG_CC_HAS_ASM_GOTO_OUTPUT=n case to enable instrumentation
when handling VM-Fail on VMREAD. VMREAD is used in various noinstr
flows, e.g. immediately after VM-Exit, and objtool rightly complains that
the call to the error trampoline leaves a no-instrumentation section
without annotating that it's safe to do so.
vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0xc9:
call to vmread_error_trampoline() leaves .noinstr.text section
Note, strictly speaking, enabling instrumentation in the VM-Fail path
isn't exactly safe, but if VMREAD fails the kernel/system is likely hosed
anyways, and logging that there is a fatal error is more important than
*maybe* encountering slightly unsafe instrumentation.
Reported-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721235637.2345403-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As was attempted commit 14717e203186 ("kvm: Conditionally register IRQ
bypass consumer"): "if we don't support a mechanism for bypassing IRQs,
don't register as a consumer. Initially this applied to AMD processors,
but when AVIC support was implemented for assigned devices,
kvm_arch_has_irq_bypass() was always returning true.
We can still skip registering the consumer where enable_apicv
or posted-interrupts capability is unsupported or globally disabled.
This eliminates meaningless dev_info()s when the connect fails
between producer and consumer", such as on Linux hosts where enable_apicv
or posted-interrupts capability is unsupported or globally disabled.
Cc: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Yong He <alexyonghe@tencent.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217379
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20230724111236.76570-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The pid_table of ipiv is the persistent memory allocated by
per-vcpu, which should be counted into the memory cgroup.
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Message-Id: <CAPm50aLxCQ3TQP2Lhc0PX3y00iTRg+mniLBqNDOC=t9CLxMwwA@mail.gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The code was blindly assuming that kvm_cpu_get_interrupt never returns -1
when there is a pending interrupt.
While this should be true, a bug in KVM can still cause this.
If -1 is returned, the code before this patch was converting it to 0xFF,
and 0xFF interrupt was injected to the guest, which results in an issue
which was hard to debug.
Add WARN_ON_ONCE to catch this case and skip the injection
if this happens again.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230726135945.260841-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|