path: root/arch/x86/kvm
2022-07-28  Revert "KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"  (Paolo Bonzini, 1 file, -20/+1)
Since commit 5f76f6f5ff96 ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"), KVM has taken ownership of the "load IA32_BNDCFGS" and "clear IA32_BNDCFGS" VMX entry/exit controls, trying to set these bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS MSRs if the guest's CPUID supports MPX, and clearing them otherwise. The intent of the patch was to apply it to L0 in order to work around L1 kernels that lack the fix in commit 691bd4340bef ("kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS", 2017-07-04): by hiding the control bits from L0, L1 hides BNDCFGS from KVM_GET_MSR_INDEX_LIST, and the L1 bug is neutralized even in the absence of commit 691bd4340bef. This was perhaps a sensible kludge at the time, but a horrible idea in the long term, and in fact it has not been extended to other CPUID bits like these:
  X86_FEATURE_LM => VM_EXIT_HOST_ADDR_SPACE_SIZE, VM_ENTRY_IA32E_MODE, VMX_MISC_SAVE_EFER_LMA
  X86_FEATURE_TSC => CPU_BASED_RDTSC_EXITING, CPU_BASED_USE_TSC_OFFSETTING, SECONDARY_EXEC_TSC_SCALING
  X86_FEATURE_INVPCID_SINGLE => SECONDARY_EXEC_ENABLE_INVPCID
  X86_FEATURE_MWAIT => CPU_BASED_MONITOR_EXITING, CPU_BASED_MWAIT_EXITING
  X86_FEATURE_INTEL_PT => SECONDARY_EXEC_PT_CONCEAL_VMX, SECONDARY_EXEC_PT_USE_GPA, VM_EXIT_CLEAR_IA32_RTIT_CTL, VM_ENTRY_LOAD_IA32_RTIT_CTL
  X86_FEATURE_XSAVES => SECONDARY_EXEC_XSAVES
These days it's sort of common knowledge that any MSR in KVM_GET_MSR_INDEX_LIST must allow *at least* setting it with KVM_SET_MSR to a default value, so it is unlikely that something like commit 5f76f6f5ff96 will be needed again. So revert it, at the potential cost of breaking L1s with a 6-year-old kernel. While in principle the L0 owner doesn't control what runs on L1, such an old hypervisor would probably have many other bugs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: nVMX: Let userspace set nVMX MSR to any _host_ supported value  (Sean Christopherson, 1 file, -33/+37)
Restrict the nVMX MSRs based on KVM's config, not based on the guest's current config. Using the guest's config to audit the new config prevents userspace from restoring the original config (KVM's config) if at any point in the past the guest's config was restricted in any way. Fixes: 62cc6b9dc61e ("KVM: nVMX: support restore of VMX capability MSRs") Cc: stable@vger.kernel.org Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: nVMX: Rename handle_vm{on,off}() to handle_vmx{on,off}()  (Sean Christopherson, 1 file, -4/+4)
Rename the exit handlers for VMXON and VMXOFF to match the instruction names; the terms "vmon" and "vmoff" are not used anywhere in Intel's documentation, nor are they used elsewhere in KVM. Sadly, the exit reasons are exposed to userspace and so cannot be renamed without breaking userspace. :-( Fixes: ec378aeef9df ("KVM: nVMX: Implement VMXON and VMXOFF") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4  (Sean Christopherson, 1 file, -9/+14)
Inject a #UD if L1 attempts VMXON with a CR0 or CR4 that is disallowed per the associated nested VMX MSRs' fixed0/1 settings. KVM cannot rely on hardware to perform the checks, even for the few checks that have higher priority than VM-Exit, as (a) KVM may have forced CR0/CR4 bits in hardware while running the guest, (b) there may be incompatible CR0/CR4 bits that have lower priority than VM-Exit, e.g. CR0.NE, and (c) userspace may have further restricted the allowed CR0/CR4 values by manipulating the guest's nested VMX MSRs. Note, despite a very strong desire to throw shade at Jim, commit 70f3aac964ae ("kvm: nVMX: Remove superfluous VMX instruction fault checks") is not to blame for the buggy behavior (though the comment...). That commit only removed the CR0.PE, EFLAGS.VM, and COMPATIBILITY mode checks (though it did erroneously drop the CPL check, but that has already been remedied). KVM may force CR0.PE=1, but will do so only when also forcing EFLAGS.VM=1 to emulate Real Mode, i.e. hardware will still #UD. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216033 Fixes: ec378aeef9df ("KVM: nVMX: Implement VMXON and VMXOFF") Reported-by: Eric Li <ercli@ucdavis.edu> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
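To make the fixed-bits rule concrete, here is a minimal standalone C sketch of the kind of check described above, not the actual KVM code; the helper names, and the idea of feeding in values derived from the IA32_VMX_CR0/CR4_FIXED0/FIXED1 MSRs, are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    /* A CR value satisfies the VMX fixed-bit MSRs if every "fixed to 1" bit is set
     * and no bit outside the "allowed to be 1" mask is set. */
    static bool cr_satisfies_vmx_fixed_bits(uint64_t val, uint64_t fixed0, uint64_t fixed1)
    {
        return (val & fixed0) == fixed0 && (val & ~fixed1) == 0;
    }

    /* Hypothetical VMXON pre-check: if either CR0 or CR4 violates its fixed bits,
     * the emulated instruction should raise #UD instead of proceeding. */
    static bool vmxon_cr_state_ok(uint64_t cr0, uint64_t cr0_fixed0, uint64_t cr0_fixed1,
                                  uint64_t cr4, uint64_t cr4_fixed0, uint64_t cr4_fixed1)
    {
        return cr_satisfies_vmx_fixed_bits(cr0, cr0_fixed0, cr0_fixed1) &&
               cr_satisfies_vmx_fixed_bits(cr4, cr4_fixed0, cr4_fixed1);
    }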
2022-07-28  KVM: nVMX: Account for KVM reserved CR4 bits in consistency checks  (Sean Christopherson, 1 file, -1/+2)
Check that the guest (L2) and host (L1) CR4 values that would be loaded by nested VM-Enter and VM-Exit respectively are valid with respect to KVM's (L0 host) allowed CR4 bits. Failure to check KVM reserved bits would allow L1 to load an illegal CR4 (or trigger hardware VM-Fail or failed VM-Entry) by massaging guest CPUID to allow features that are not supported by KVM. Amusingly, KVM itself is an accomplice in its doom, as KVM adjusts L1's MSR_IA32_VMX_CR4_FIXED1 to allow L1 to enable bits for L2 based on L1's CPUID model. Note, although nested_{guest,host}_cr4_valid() are _currently_ used if and only if the vCPU is post-VMXON (nested.vmxon == true), that may not be true in the future, e.g. emulating VMXON has a bug where it doesn't check the allowed/required CR0/CR4 bits. Cc: stable@vger.kernel.org Fixes: 3899152ccbf4 ("KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86: Split kvm_is_valid_cr4() and export only the non-vendor bits  (Sean Christopherson, 4 files, -7/+14)
Split the common x86 parts of kvm_is_valid_cr4(), i.e. the reserved bits checks, into a separate helper, __kvm_is_valid_cr4(), and export only the inner helper to vendor code in order to prevent nested VMX from calling back into vmx_is_valid_cr4() via kvm_is_valid_cr4(). On SVM, this is a nop as SVM doesn't place any additional restrictions on CR4. On VMX, this is also currently a nop, but only because nested VMX is missing checks on reserved CR4 bits for nested VM-Enter. That bug will be fixed in a future patch, and could simply use kvm_is_valid_cr4() as-is, but nVMX has _another_ bug where VMXON emulation doesn't enforce VMX's restrictions on CR0/CR4. The cleanest and most intuitive way to fix the VMXON bug is to use nested_host_cr{0,4}_valid(). If the CR4 variant routes through kvm_is_valid_cr4(), using nested_host_cr4_valid() won't do the right thing for the VMXON case as vmx_is_valid_cr4() enforces VMX's restrictions if and only if the vCPU is post-VMXON. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
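A rough sketch of the factoring described above, in standalone C rather than KVM's real types; the helper names come from the commit message, while the reserved-bit mask and the vendor-callback plumbing are simplified assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    /* Stand-in for the per-vCPU reserved-CR4 mask; KVM derives this from CPUID/caps. */
    static uint64_t vcpu_cr4_reserved_bits;

    /* Stand-in for the vendor hook (vmx_is_valid_cr4() or the SVM equivalent). */
    static bool (*vendor_is_valid_cr4)(uint64_t cr4);

    /* Common, vendor-agnostic part: only the reserved-bits check. */
    static bool __kvm_is_valid_cr4(uint64_t cr4)
    {
        return (cr4 & vcpu_cr4_reserved_bits) == 0;
    }

    /* Full check: common part plus vendor-specific restrictions. Per the commit
     * message, nested VMX calls only __kvm_is_valid_cr4() so it never recurses
     * into the vendor hook. */
    static bool kvm_is_valid_cr4(uint64_t cr4)
    {
        return __kvm_is_valid_cr4(cr4) &&
               (!vendor_is_valid_cr4 || vendor_is_valid_cr4(cr4));
    }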
2022-07-28  KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs  (Sean Christopherson, 3 files, -37/+24)
When zapping collapsible SPTEs in the TDP MMU, don't bottom out on a leaf SPTE now that KVM doesn't require a PFN to compute the host mapping level, i.e. now that there's no need to first find a leaf SPTE and then step back up. Drop the now unused tdp_iter_step_up(), as it is not the safest of helpers (using any of the low level iterators requires some understanding of the various side effects). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715232107.3775620-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Document the "rules" for using host_pfn_mapping_level()  (Sean Christopherson, 1 file, -7/+35)
Add a comment to document how host_pfn_mapping_level() can be used safely, as the line between safe and dangerous is quite thin. E.g. if KVM were to ever support in-place promotion to create huge pages, consuming the level is safe if the caller holds mmu_lock and checks that there's an existing _leaf_ SPTE, but unsafe if the caller only checks that there's a non-leaf SPTE. Opportunistically tweak the existing comments to explicitly document why KVM needs to use READ_ONCE(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715232107.3775620-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs  (Sean Christopherson, 3 files, -26/+7)
Drop the requirement that a pfn be backed by a refcounted, compound or ZONE_DEVICE struct page, and instead rely solely on the host page tables to identify huge pages. The PageCompound() check is a remnant of an old implementation that identified (well, attempted to identify) huge pages without walking the host page tables. The ZONE_DEVICE check was added as an exception to the PageCompound() requirement. In other words, neither check is actually a hard requirement; if the primary MMU has a pfn backed with a huge page, then KVM can back the pfn with a huge page regardless of the backing store. Dropping the @pfn parameter will also allow KVM to query the max host mapping level without having to first get the pfn, which is advantageous for use outside of the page fault path where KVM wants to take action if and only if a page can be mapped huge, i.e. avoids the pfn lookup for gfns that can't be backed with a huge page. Cc: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220715232107.3775620-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Restrict mapping level based on guest MTRR iff they're used  (Sean Christopherson, 1 file, -7/+19)
Restrict the mapping level for SPTEs based on the guest MTRRs if and only if KVM may actually use the guest MTRRs to compute the "real" memtype. For all forms of paging, guest MTRRs are purely virtual in the sense that they are completely ignored by hardware, i.e. they affect the memtype only if software manually consumes them. The only scenario where KVM consumes the guest MTRRs is when shadow_memtype_mask is non-zero and the guest has non-coherent DMA, in all other cases KVM simply leaves the PAT field in SPTEs as '0' to encode WB memtype. Note, KVM may still ultimately ignore guest MTRRs, e.g. if the backing pfn is host MMIO, but false positives are ok as they only cause a slight performance blip (unless the guest is doing weird things with its MTRRs, which is extremely unlikely). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220715230016.3762909-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Add shadow mask for effective host MTRR memtype  (Sean Christopherson, 2 files, -3/+19)
Add shadow_memtype_mask to capture that EPT needs a non-zero memtype mask instead of relying on TDP being enabled, as NPT doesn't need a non-zero mask. This is a glorified nop as kvm_x86_ops.get_mt_mask() returns zero for NPT anyways. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220715230016.3762909-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86: Drop unnecessary goto+label in kvm_arch_init()  (Sean Christopherson, 1 file, -16/+8)
Return directly if kvm_arch_init() detects an error before doing any real work, jumping through a label obfuscates what's happening and carries the unnecessary risk of leaving 'r' uninitialized. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220715230016.3762909-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86: Reject loading KVM if host.PAT[0] != WB  (Sean Christopherson, 1 file, -0/+15)
Reject KVM if entry '0' in the host's IA32_PAT MSR is not programmed to writeback (WB) memtype. KVM subtly relies on IA32_PAT entry '0' to be programmed to WB by leaving the PAT bits in shadow paging and NPT SPTEs as '0'. If something other than WB is in PAT[0], at _best_ guests will suffer very poor performance, and at worst KVM will crash the system by breaking cache-coherency expectations (e.g. using WC for guest memory). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220715230016.3762909-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
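As an illustration of the shape of that check (the WB encoding value 0x06 is architectural; the function name and error convention below are assumptions, not the actual KVM code):

    #include <stdint.h>

    #define PAT_MEMTYPE_WB 0x06ULL /* architectural PAT encoding for writeback */

    /* PAT entry 0 occupies bits 2:0 of IA32_PAT; refuse to load if it is not WB. */
    static int host_pat0_is_wb(uint64_t host_pat)
    {
        return ((host_pat & 0x7) == PAT_MEMTYPE_WB) ? 0 : -1;
    }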
2022-07-28  KVM: SVM: Fix x2APIC MSRs interception  (Suravee Suthikulpanit, 1 file, -25/+27)
The index for svm_direct_access_msrs was incorrectly initialized with the APIC MMIO register macros. Fix by introducing a macro for calculating x2APIC MSRs. Fixes: 5c127c85472c ("KVM: SVM: Adding support for configuring x2APIC MSRs interception") Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <20220718083833.222117-1-suravee.suthikulpanit@amd.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Remove underscores from __pte_list_remove()  (Sean Christopherson, 1 file, -4/+4)
Remove the underscores from __pte_list_remove(); the function formerly known as pte_list_remove() is now named kvm_zap_one_rmap_spte() to show that it zaps rmaps/PTEs, i.e. doesn't just remove an entry from a list. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Rename pte_list_{destroy,remove}() to show they zap SPTEs  (Sean Christopherson, 1 file, -8/+9)
Rename pte_list_remove() and pte_list_destroy() to kvm_zap_one_rmap_spte() and kvm_zap_all_rmap_sptes() respectively to document that (a) they zap SPTEs and (b) to better document how they differ (remove vs. destroy does not exactly scream "one vs. all"). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Rename rmap zap helpers to eliminate "unmap" wrapper  (Sean Christopherson, 1 file, -8/+8)
Rename kvm_unmap_rmap() and kvm_zap_rmap() to kvm_zap_rmap() and __kvm_zap_rmap() respectively to show that what was the "unmap" helper is just a wrapper for the "zap" helper, i.e. that they do the exact same thing, one just exists to deal with its caller passing in more params. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Rename __kvm_zap_rmaps() to align with other nomenclature  (Sean Christopherson, 1 file, -2/+2)
Rename __kvm_zap_rmaps() to kvm_rmap_zap_gfn_range() to avoid future confusion with a soon-to-be-introduced __kvm_zap_rmap(). Using a plural "rmaps" is somewhat ambiguous without additional context, as it's not obvious whether it's referring to multiple rmap lists, versus multiple rmap entries within a single list. Use kvm_rmap_zap_gfn_range() to align with the pattern established by kvm_rmap_zap_collapsible_sptes(), without losing the information that it zaps only rmap-based MMUs, i.e. don't rename it to __kvm_zap_gfn_range(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Drop the "p is for pointer" from rmap helpers  (Sean Christopherson, 1 file, -22/+21)
Drop the trailing "p" from rmap helpers, i.e. rename functions to simply be kvm_<action>_rmap(). Declaring that a function takes a pointer is completely unnecessary and goes against kernel style. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Directly "destroy" PTE list when recycling rmaps  (Sean Christopherson, 1 file, -1/+1)
Use pte_list_destroy() directly when recycling rmaps instead of bouncing through kvm_unmap_rmapp() and kvm_zap_rmapp(). Calling kvm_unmap_rmapp() is unnecessary and odd as it requires passing dummy parameters; passing NULL for @slot when __rmap_add() already has a valid slot is especially weird and confusing. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Return a u64 (the old SPTE) from mmu_spte_clear_track_bits()  (Sean Christopherson, 1 file, -1/+1)
Return a u64, not an int, from mmu_spte_clear_track_bits(). The return value is the old SPTE value, which is very much a 64-bit value. The sole caller that consumes the return value, drop_spte(), already uses a u64. The only reason that truncating the SPTE value is not problematic is because drop_spte() only queries the shadow-present bit, which is in the lower 32 bits. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: nSVM: Pull CS.Base from actual VMCB12 for soft int/ex re-injection  (Maciej S. Szmigiero, 1 file, -4/+5)
enter_svm_guest_mode() first calls nested_vmcb02_prepare_control() to copy control fields from VMCB12 to the current VMCB, then nested_vmcb02_prepare_save() to perform a similar copy of the save area. This means that nested_vmcb02_prepare_control() still runs with the previous save area values in the current VMCB so it shouldn't take the L2 guest CS.Base from this area. Explicitly pull CS.Base from the actual VMCB12 instead in enter_svm_guest_mode(). Granted, having a non-zero CS.Base is a very rare thing (and even impossible in 64-bit mode), having it change between nested VMRUNs is probably even rarer, but if it happens it would create a really subtle bug so it's better to fix it upfront. Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <4caa0f67589ae3c22c311ee0e6139496902f2edc.1658159083.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-19  KVM: x86: Protect the unused bits in MSR exiting flags  (Aaron Lewis, 1 file, -0/+8)
The flags for KVM_CAP_X86_USER_SPACE_MSR and KVM_X86_SET_MSR_FILTER have no protection for their unused bits. Without protection, future development for these features will be difficult. Add the protection needed to make it possible to extend these features in the future. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220714161314.1715227-1-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
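The protection boils down to rejecting flag bits KVM does not yet define, so they remain available for future extensions. A minimal sketch with an assumed valid-bits mask and return convention, not the actual KVM code:

    #include <errno.h>
    #include <stdint.h>

    #define USERSPACE_MSR_VALID_FLAGS 0x7ULL /* assumed mask of currently defined flag bits */

    /* Unknown bits must be zero today so they can be given meaning later. */
    static int check_userspace_msr_flags(uint64_t flags)
    {
        if (flags & ~USERSPACE_MSR_VALID_FLAGS)
            return -EINVAL;
        return 0;
    }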
2022-07-15  KVM: emulate: do not adjust size of fastop and setcc subroutines  (Paolo Bonzini, 1 file, -10/+7)
Instead of doing complicated calculations to find the size of the subroutines (which are even more complicated because they need to be stringified into an asm statement), just hardcode to 16. It is less dense for a few combinations of IBT/SLS/retbleed, but it has the advantage of being really simple. Cc: stable@vger.kernel.org # 5.15.x: 84e7051c0bc1: x86/kvm: fix FASTOP_SIZE when return thunks are enabled Cc: stable@vger.kernel.org Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-15  KVM: x86: Remove unnecessary include  (Lu Baolu, 1 file, -1/+0)
intel-iommu.h is not needed in kvm/x86 anymore. Remove its include. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Link: https://lore.kernel.org/r/20220514014322.2927339-6-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-07-14  KVM: x86: Check target, not vCPU's x2APIC ID, when applying hotplug hack  (Sean Christopherson, 1 file, -9/+9)
When applying the hotplug hack to match x2APIC IDs for vCPUs in xAPIC mode, check the target APIC ID for being unaddressable in xAPIC mode instead of checking the vCPU's x2APIC ID, and in that case proceed as if apic_x2apic_mode(vcpu) were true. Functionally, it does not matter whether you compare kvm_x2apic_id(apic) or mda with 0xff, since the two values are then checked for equality. But in isolation, checking the x2APIC ID takes an unnecessary dependency on the x2APIC ID being read-only (which isn't strictly true on AMD CPUs, and is difficult to document as well); it also requires KVM to fall through and check the xAPIC ID as well to deal with a writable xAPIC ID, whereas the xAPIC ID _can't_ match a target ID greater than 0xff. Opportunistically reword the comment to call out the various subtleties, and to fix a typo reported by Zhang Jiaming. No functional change intended. Cc: Zhang Jiaming <jiaming@nfschina.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14  KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op()  (Vitaly Kuznetsov, 1 file, -8/+10)
'vector' and 'trig_mode' fields of 'struct kvm_lapic_irq' are left uninitialized in kvm_pv_kick_cpu_op(). While these fields are normally not needed for APIC_DM_REMRD, they're still referenced by __apic_accept_irq() for trace_kvm_apic_accept_irq(). Fully initialize the structure to avoid consuming random stack memory. Fixes: a183b638b61c ("KVM: x86: make apic_accept_irq tracepoint more generic") Reported-by: syzbot+d6caa905917d353f0d07@syzkaller.appspotmail.com Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220708125147.593975-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
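A sketch of the fix's shape: a designated initializer guarantees every unnamed field is zeroed, so nothing leaks stack garbage into tracepoints. The struct below is a trimmed stand-in for 'struct kvm_lapic_irq'; the APIC_DM_REMRD encoding (0x300) is the architectural delivery-mode value, and any field not named in the commit message is an assumption.

    #include <stdbool.h>
    #include <stdint.h>

    struct lapic_irq_sketch {
        uint32_t vector;
        uint32_t delivery_mode;
        uint32_t dest_mode;
        uint32_t trig_mode;
        uint32_t dest_id;
        bool     level;
        uint32_t shorthand;
    };

    /* Designated initializers zero every field not listed, including 'vector'
     * and 'trig_mode'. */
    static struct lapic_irq_sketch make_remrd_ipi(uint32_t apic_id)
    {
        return (struct lapic_irq_sketch){
            .delivery_mode = 0x300, /* APIC_DM_REMRD encoding */
            .dest_mode     = 0,     /* physical destination */
            .dest_id       = apic_id,
        };
    }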
2022-07-14  KVM: x86: Restrict get_mt_mask() to a u8, use KVM_X86_OP_OPTIONAL_RET0  (Sean Christopherson, 2 files, -7/+1)
Restrict get_mt_mask() to a u8 and reintroduce using a RET0 static_call for the SVM implementation. EPT stores the memtype information in the lower 8 bits (bits 6:3 to be precise), and even returns a shifted u8 without an explicit cast to a larger type; there's no need to return a full u64. Note, RET0 doesn't play nice with a u64 return on 32-bit kernels, see commit bf07be36cd88 ("KVM: x86: do not use KVM_X86_OP_OPTIONAL_RET0 for get_mt_mask"). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220714153707.3239119-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14  KVM: x86: Add dedicated helper to get CPUID entry with significant index  (Sean Christopherson, 8 files, -42/+84)
Add a second CPUID helper, kvm_find_cpuid_entry_index(), to handle KVM queries for CPUID leaves whose index _may_ be significant, and drop the index param from the existing kvm_find_cpuid_entry(). Add a WARN in the inner helper, cpuid_entry2_find(), to detect attempts to retrieve a CPUID entry whose index is significant without explicitly providing an index. Using an explicit magic number and letting callers omit the index avoids confusion by eliminating the myriad cases where KVM specifies '0' as a dummy value. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
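A simplified sketch of the two-helper scheme: a sentinel tells the inner lookup that the caller deliberately omitted the index. Only the helper names come from the commit message; the entry layout, the sentinel value, and returning NULL where the real code WARNs are assumptions kept simple for illustration.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct cpuid_entry_sketch {
        uint32_t function;
        uint32_t index;
        bool     index_significant;
    };

    #define INDEX_NOT_SIGNIFICANT 0xFFFFFFFFu /* assumed sentinel for "index omitted" */

    static const struct cpuid_entry_sketch *
    cpuid_entry2_find(const struct cpuid_entry_sketch *entries, size_t nent,
                      uint32_t function, uint32_t index)
    {
        for (size_t i = 0; i < nent; i++) {
            const struct cpuid_entry_sketch *e = &entries[i];

            if (e->function != function)
                continue;
            /* Omitting the index for a leaf whose index matters is a caller bug;
             * the real helper WARNs here, this sketch just fails the lookup. */
            if (index == INDEX_NOT_SIGNIFICANT)
                return e->index_significant ? NULL : e;
            if (!e->index_significant || e->index == index)
                return e;
        }
        return NULL;
    }

    /* For leaves whose index is irrelevant, e.g. CPUID.0x1. */
    static const struct cpuid_entry_sketch *
    kvm_find_cpuid_entry(const struct cpuid_entry_sketch *entries, size_t nent,
                         uint32_t function)
    {
        return cpuid_entry2_find(entries, nent, function, INDEX_NOT_SIGNIFICANT);
    }

    /* For leaves whose index may be significant, e.g. CPUID.0x7 subleaves. */
    static const struct cpuid_entry_sketch *
    kvm_find_cpuid_entry_index(const struct cpuid_entry_sketch *entries, size_t nent,
                               uint32_t function, uint32_t index)
    {
        return cpuid_entry2_find(entries, nent, function, index);
    }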
2022-07-14  KVM: SVM: fix task switch emulation on INTn instruction.  (Maxim Levitsky, 1 file, -0/+1)
Recently KVM's SVM code switched to re-injecting software interrupt events if something prevented their delivery. A task switch due to a task gate in the IDT, however, is an exception to this rule, because in this case the INTn instruction causes a task switch intercept and its emulation completes the INTn emulation as well. Add a missing case to task_switch_interception for that. This fixes the 32-bit kvm unit test taskswitch2. Fixes: 7e5b5ef8dca322 ("KVM: SVM: Re-inject INTn instead of retrying the insn on "failure"") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <20220714124453.188655-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14  KVM: x86/mmu: Fix typo and tweak comment for split_desc_cache capacity  (Sean Christopherson, 1 file, -7/+8)
Remove a spurious closing parenthesis and tweak the comment about the cache capacity for PTE descriptors (rmaps) used by eager page splitting to tone down the assertion slightly, and to call out that topup requires dropping mmu_lock, which is the real motivation for avoiding topup (as opposed to memory usage). Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220712020724.1262121-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14  KVM: x86/mmu: Expand quadrant comment for PG_LEVEL_4K shadow pages  (Sean Christopherson, 1 file, -4/+11)
Tweak the comment above the computation of the quadrant for PG_LEVEL_4K shadow pages to explicitly call out how and why KVM uses role.quadrant to consume gPTE bits. Opportunistically wrap an unnecessarily long line. No functional change intended. Link: https://lore.kernel.org/all/YqvWvBv27fYzOFdE@google.com Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220712020724.1262121-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14  KVM: x86/mmu: Add optimized helper to retrieve an SPTE's index  (Sean Christopherson, 3 files, -14/+18)
Add spte_index() to dedup all the code that calculates a SPTE's index into its parent's page table and/or spt array. Opportunistically tweak the calculation to avoid pointer arithmetic, which is subtle (subtract in 8-byte chunks) and less performant (requires the compiler to generate the subtraction). Suggested-by: David Matlack <dmatlack@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220712020724.1262121-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
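The described calculation is small enough to show directly; this is a sketch in the spirit of the change (512 eight-byte SPTEs per 4 KiB page table), not necessarily the exact KVM code.

    #include <stdint.h>

    #define SPTE_ENT_PER_PAGE 512 /* 4 KiB page table / 8-byte entries */

    /* Index of an SPTE within its parent page table, computed from the pointer's
     * address instead of pointer subtraction, which is cheaper and less subtle. */
    static inline unsigned int spte_index(const uint64_t *sptep)
    {
        return ((uintptr_t)sptep / sizeof(*sptep)) & (SPTE_ENT_PER_PAGE - 1);
    }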
2022-07-14  Merge commit 'kvm-vmx-nested-tsc-fix' into kvm-master  (Paolo Bonzini, 1 file, -1/+0)
Merge bugfix needed in both 5.19 (because it's bad) and 5.20 (because it is a prerequisite to test new features).
2022-07-14  kvm: stats: tell userspace which values are boolean  (Paolo Bonzini, 1 file, -1/+1)
Some of the statistics values exported by KVM are always only 0 or 1. It can be useful to export this fact to userspace so that it can track them specially (for example by polling the value every now and then to compute a % of time spent in a specific state). Therefore, add "boolean value" as a new "unit". While it is not exactly a unit, it walks and quacks like one. In particular, using the type would be wrong because boolean values could be instantaneous or peak values (e.g. "is the rmap allocated?") or even two-bucket histograms (e.g. "number of posted vs. non-posted interrupt injections"). Suggested-by: Amneesh Singh <natto@weirdnatto.in> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14  x86/kvm: fix FASTOP_SIZE when return thunks are enabled  (Thadeu Lima de Souza Cascardo, 1 file, -4/+6)
The return thunk call makes the fastop functions larger, just like IBT does. Consider a 16-byte FASTOP_SIZE when CONFIG_RETHUNK is enabled. Otherwise, functions will be incorrectly aligned and when computing their position for differently sized operators, they will executed in the middle or end of a function, which may as well be an int3, leading to a crash like: [ 36.091116] int3: 0000 [#1] SMP NOPTI [ 36.091119] CPU: 3 PID: 1371 Comm: qemu-system-x86 Not tainted 5.15.0-41-generic #44 [ 36.091120] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 [ 36.091121] RIP: 0010:xaddw_ax_dx+0x9/0x10 [kvm] [ 36.091185] Code: 00 0f bb d0 c3 cc cc cc cc 48 0f bb d0 c3 cc cc cc cc 0f 1f 80 00 00 00 00 0f c0 d0 c3 cc cc cc cc 66 0f c1 d0 c3 cc cc cc cc <0f> 1f 80 00 00 00 00 0f c1 d0 c3 cc cc cc cc 48 0f c1 d0 c3 cc cc [ 36.091186] RSP: 0018:ffffb1f541143c98 EFLAGS: 00000202 [ 36.091188] RAX: 0000000089abcdef RBX: 0000000000000001 RCX: 0000000000000000 [ 36.091188] RDX: 0000000076543210 RSI: ffffffffc073c6d0 RDI: 0000000000000200 [ 36.091189] RBP: ffffb1f541143ca0 R08: ffff9f1803350a70 R09: 0000000000000002 [ 36.091190] R10: ffff9f1803350a70 R11: 0000000000000000 R12: ffff9f1803350a70 [ 36.091190] R13: ffffffffc077fee0 R14: 0000000000000000 R15: 0000000000000000 [ 36.091191] FS: 00007efdfce8d640(0000) GS:ffff9f187dd80000(0000) knlGS:0000000000000000 [ 36.091192] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 36.091192] CR2: 0000000000000000 CR3: 0000000009b62002 CR4: 0000000000772ee0 [ 36.091195] PKRU: 55555554 [ 36.091195] Call Trace: [ 36.091197] <TASK> [ 36.091198] ? fastop+0x5a/0xa0 [kvm] [ 36.091222] x86_emulate_insn+0x7b8/0xe90 [kvm] [ 36.091244] x86_emulate_instruction+0x2f4/0x630 [kvm] [ 36.091263] ? kvm_arch_vcpu_load+0x7c/0x230 [kvm] [ 36.091283] ? vmx_prepare_switch_to_host+0xf7/0x190 [kvm_intel] [ 36.091290] complete_emulated_mmio+0x297/0x320 [kvm] [ 36.091310] kvm_arch_vcpu_ioctl_run+0x32f/0x550 [kvm] [ 36.091330] kvm_vcpu_ioctl+0x29e/0x6d0 [kvm] [ 36.091344] ? kvm_vcpu_ioctl+0x120/0x6d0 [kvm] [ 36.091357] ? __fget_files+0x86/0xc0 [ 36.091362] ? __fget_files+0x86/0xc0 [ 36.091363] __x64_sys_ioctl+0x92/0xd0 [ 36.091366] do_syscall_64+0x59/0xc0 [ 36.091369] ? syscall_exit_to_user_mode+0x27/0x50 [ 36.091370] ? do_syscall_64+0x69/0xc0 [ 36.091371] ? syscall_exit_to_user_mode+0x27/0x50 [ 36.091372] ? __x64_sys_writev+0x1c/0x30 [ 36.091374] ? do_syscall_64+0x69/0xc0 [ 36.091374] ? exit_to_user_mode_prepare+0x37/0xb0 [ 36.091378] ? syscall_exit_to_user_mode+0x27/0x50 [ 36.091379] ? do_syscall_64+0x69/0xc0 [ 36.091379] ? do_syscall_64+0x69/0xc0 [ 36.091380] ? do_syscall_64+0x69/0xc0 [ 36.091381] ? 
do_syscall_64+0x69/0xc0 [ 36.091381] entry_SYSCALL_64_after_hwframe+0x61/0xcb [ 36.091384] RIP: 0033:0x7efdfe6d1aff [ 36.091390] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00 [ 36.091391] RSP: 002b:00007efdfce8c460 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 36.091393] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007efdfe6d1aff [ 36.091393] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000c [ 36.091394] RBP: 0000558f1609e220 R08: 0000558f13fb8190 R09: 00000000ffffffff [ 36.091394] R10: 0000558f16b5e950 R11: 0000000000000246 R12: 0000000000000000 [ 36.091394] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 [ 36.091396] </TASK> [ 36.091397] Modules linked in: isofs nls_iso8859_1 kvm_intel joydev kvm input_leds serio_raw sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler drm msr ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net net_failover crypto_simd ahci xhci_pci cryptd psmouse virtio_blk libahci xhci_pci_renesas failover [ 36.123271] ---[ end trace db3c0ab5a48fabcc ]--- [ 36.123272] RIP: 0010:xaddw_ax_dx+0x9/0x10 [kvm] [ 36.123319] Code: 00 0f bb d0 c3 cc cc cc cc 48 0f bb d0 c3 cc cc cc cc 0f 1f 80 00 00 00 00 0f c0 d0 c3 cc cc cc cc 66 0f c1 d0 c3 cc cc cc cc <0f> 1f 80 00 00 00 00 0f c1 d0 c3 cc cc cc cc 48 0f c1 d0 c3 cc cc [ 36.123320] RSP: 0018:ffffb1f541143c98 EFLAGS: 00000202 [ 36.123321] RAX: 0000000089abcdef RBX: 0000000000000001 RCX: 0000000000000000 [ 36.123321] RDX: 0000000076543210 RSI: ffffffffc073c6d0 RDI: 0000000000000200 [ 36.123322] RBP: ffffb1f541143ca0 R08: ffff9f1803350a70 R09: 0000000000000002 [ 36.123322] R10: ffff9f1803350a70 R11: 0000000000000000 R12: ffff9f1803350a70 [ 36.123323] R13: ffffffffc077fee0 R14: 0000000000000000 R15: 0000000000000000 [ 36.123323] FS: 00007efdfce8d640(0000) GS:ffff9f187dd80000(0000) knlGS:0000000000000000 [ 36.123324] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 36.123325] CR2: 0000000000000000 CR3: 0000000009b62002 CR4: 0000000000772ee0 [ 36.123327] PKRU: 55555554 [ 36.123328] Kernel panic - not syncing: Fatal exception in interrupt [ 36.123410] Kernel Offset: 0x1400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 36.135305] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Fixes: aa3d480315ba ("x86: Use return-thunk in asm code") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@suse.de> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Message-Id: <20220713171241.184026-1-cascardo@canonical.com> Tested-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14  KVM: nVMX: Always enable TSC scaling for L2 when it was enabled for L1  (Vitaly Kuznetsov, 1 file, -1/+0)
Windows 10/11 guests with Hyper-V role (WSL2) enabled are observed to hang upon boot or shortly after when a non-default TSC frequency was set for L1. The issue is observed on a host where TSC scaling is supported. The problem appears to be that Windows doesn't use TSC scaling for its guests, even when the feature is advertised, and KVM filters SECONDARY_EXEC_TSC_SCALING out when creating L2 controls from L1's VMCS. This leads to L2 running with the default frequency (matching host's) while L1 is running with an altered one. Keep SECONDARY_EXEC_TSC_SCALING in secondary exec controls for L2 when it was set for L1. TSC_MULTIPLIER is already correctly computed and written by prepare_vmcs02(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220712135009.952805-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14  KVM: VMX: Update PT MSR intercepts during filter change iff PT in host+guest  (Sean Christopherson, 1 file, -1/+3)
Update the Processor Trace (PT) MSR intercepts during a filter change if and only if PT may be exposed to the guest, i.e. only if KVM is operating in the so called "host+guest" mode where PT can be used simultaneously by both the host and guest. If PT is in system mode, the host is the sole owner of PT and the MSRs should never be passed through to the guest. Luckily the missed check only results in unnecessary work, as select RTIT MSRs are passed through only when RTIT tracing is enabled "in" the guest, and tracing can't be enabled in the guest when KVM is in system mode (writes to guest.MSR_IA32_RTIT_CTL are disallowed). Cc: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20220712015838.1253995-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-14  KVM: x86: WARN only once if KVM leaves a dangling userspace I/O request  (Sean Christopherson, 1 file, -2/+4)
Change a WARN_ON() to separate WARN_ON_ONCE()s if KVM has an outstanding PIO or MMIO request without an associated callback, i.e. if KVM queued a userspace I/O exit but didn't actually exit to userspace before moving on to something else. Warning on every KVM_RUN risks spamming the kernel if KVM gets into a bad state. Opportunistically split the WARNs so that it's easier to triage failures when a WARN fires. Deliberately do not use KVM_BUG_ON(), i.e. don't kill the VM. While the WARN is all but guaranteed to fire if and only if there's a KVM bug, a dangling I/O request does not present a danger to KVM (that flag is truly consumed only in a single emulator path), and any such bug is unlikely to be fatal to the VM (KVM essentially failed to do something it shouldn't have tried to do in the first place). In other words, note the bug, but let the VM keep running. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220711232750.1092012-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-14  KVM: x86: Set error code to segment selector on LLDT/LTR non-canonical #GP  (Sean Christopherson, 1 file, -2/+2)
When injecting a #GP on LLDT/LTR due to a non-canonical LDT/TSS base, set the error code to the selector. Intel's SDM says nothing about the #GP, but AMD's APM explicitly states that both LLDT and LTR set the error code to the selector, not zero. Note, a non-canonical memory operand on LLDT/LTR does generate a #GP(0), but the KVM code in question is specific to the base from the descriptor. Fixes: e37a75a13cda ("KVM: x86: Emulator ignores LDTR/TR extended base on LLDT/LTR") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220711232750.1092012-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-14  KVM: x86: Mark TSS busy during LTR emulation _after_ all fault checks  (Sean Christopherson, 1 file, -10/+9)
Wait to mark the TSS as busy during LTR emulation until after all fault checks for the LTR have passed. Specifically, don't mark the TSS busy if the new TSS base is non-canonical. Opportunistically drop the one-off !seg_desc.PRESENT check for TR as the only reason for the early check was to avoid marking a !PRESENT TSS as busy, i.e. the common !PRESENT check is now done before setting the busy bit. Fixes: e37a75a13cda ("KVM: x86: Emulator ignores LDTR/TR extended base on LLDT/LTR") Reported-by: syzbot+760a73552f47a8cd0fd9@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Hou Wenlong <houwenlong.hwl@antgroup.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220711232750.1092012-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-14  KVM: x86: Tweak name of MONITOR/MWAIT #UD quirk to make it #UD specific  (Sean Christopherson, 1 file, -1/+1)
Add a "UD" clause to KVM_X86_QUIRK_MWAIT_NEVER_FAULTS to make it clear that the quirk only controls the #UD behavior of MONITOR/MWAIT. KVM doesn't currently enforce fault checks when MONITOR/MWAIT are supported, but that could change in the future. SVM also has a virtualization hole in that it checks all faults before intercepts, and so "never faults" is already a lie when running on SVM. Fixes: bfbcc81bb82c ("KVM: x86: Add a quirk for KVM's "MONITOR/MWAIT are NOPs!" behavior") Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220711225753.1073989-4-seanjc@google.com
2022-07-13  KVM: x86: Query vcpu->vcpu_idx directly and drop its accessor, again  (Sean Christopherson, 1 file, -5/+5)
Read vcpu->vcpu_idx directly instead of bouncing through the one-line wrapper, kvm_vcpu_get_idx(), and drop the wrapper. The wrapper is a remnant of the original implementation and serves no purpose; remove it (again) before it gains more users. kvm_vcpu_get_idx() was removed in the not-too-distant past by commit 4eeef2424153 ("KVM: x86: Query vcpu->vcpu_idx directly and drop its accessor"), but was unintentionally re-introduced by commit a54d806688fe ("KVM: Keep memslots in tree-based structures instead of array-based ones"), likely due to a rebase goof. The wrapper then managed to gain users in KVM's Xen code. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20220614225615.3843835-1-seanjc@google.com
2022-07-13  KVM: x86/mmu: Replace UNMAPPED_GVA with INVALID_GPA for gva_to_gpa()  (Hou Wenlong, 3 files, -13/+13)
The result of gva_to_gpa() is a physical address, not a virtual address, so it is odd that the UNMAPPED_GVA macro is used as the result for a physical address. Replace UNMAPPED_GVA with INVALID_GPA and drop the UNMAPPED_GVA macro. No functional change intended. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/6104978956449467d3c68f1ad7f2c2f6d771d0ee.1656667239.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-12  KVM: nVMX: Always enable TSC scaling for L2 when it was enabled for L1  (Vitaly Kuznetsov, 1 file, -1/+0)
Windows 10/11 guests with Hyper-V role (WSL2) enabled are observed to hang upon boot or shortly after when a non-default TSC frequency was set for L1. The issue is observed on a host where TSC scaling is supported. The problem appears to be that Windows doesn't use TSC scaling for its guests, even when the feature is advertised, and KVM filters SECONDARY_EXEC_TSC_SCALING out when creating L2 controls from L1's VMCS. This leads to L2 running with the default frequency (matching host's) while L1 is running with an altered one. Keep SECONDARY_EXEC_TSC_SCALING in secondary exec controls for L2 when it was set for L1. TSC_MULTIPLIER is already correctly computed and written by prepare_vmcs02(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Fixes: d041b5ea93352b ("KVM: nVMX: Enable nested TSC scaling") Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220712135009.952805-1-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-09  KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op()  (Vitaly Kuznetsov, 1 file, -8/+10)
'vector' and 'trig_mode' fields of 'struct kvm_lapic_irq' are left uninitialized in kvm_pv_kick_cpu_op(). While these fields are normally not needed for APIC_DM_REMRD, they're still referenced by __apic_accept_irq() for trace_kvm_apic_accept_irq(). Fully initialize the structure to avoid consuming random stack memory. Fixes: a183b638b61c ("KVM: x86: make apic_accept_irq tracepoint more generic") Reported-by: syzbot+d6caa905917d353f0d07@syzkaller.appspotmail.com Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220708125147.593975-1-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-09  KVM: x86: Fix handling of APIC LVT updates when userspace changes MCG_CAP  (Sean Christopherson, 3 files, -2/+22)
Add a helper to update KVM's in-kernel local APIC in response to MCG_CAP being changed by userspace to fix multiple bugs. First and foremost, KVM needs to check that there's an in-kernel APIC prior to dereferencing vcpu->arch.apic. Beyond that, any "new" LVT entries need to be masked, and the APIC version register needs to be updated as it reports out the number of LVT entries. Fixes: 4b903561ec49 ("KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic.") Reported-by: syzbot+8cdad6430c24f396f158@syzkaller.appspotmail.com Cc: Siddh Raman Pant <code@siddh.me> Cc: Jue Wang <juew@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-09  KVM: x86: Initialize number of APIC LVT entries during APIC creation  (Sean Christopherson, 1 file, -0/+7)
Initialize the number of LVT entries during APIC creation, else the field will be incorrectly left '0' if userspace never invokes KVM_X86_SETUP_MCE. Add and use a helper to calculate the number of entries even though MCG_CMCI_P is not set by default in vcpu->arch.mcg_cap. Relying on that to always be true is unnecessarily risky, and subtle/confusing as well. Fixes: 4b903561ec49 ("KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic.") Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Jue Wang <juew@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
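A sketch of the helper's logic: the CMCI LVT exists only when MCG_CAP.CMCI_P (bit 10) is set, so the count is the maximum minus one when CMCI is absent. The maximum of 7 (timer, thermal, perfmon, LINT0, LINT1, error, CMCI) and the names below are illustrative assumptions rather than the exact KVM definitions.

    #include <stdint.h>

    #define MCG_CMCI_P          (1ULL << 10) /* MCG_CAP: CMCI capability present */
    #define MAX_NR_LVT_ENTRIES  7            /* assumed: six classic LVTs plus LVT CMCI */

    /* The CMCI LVT is only exposed when CMCI is advertised via MCG_CAP. */
    static int calc_nr_lvt_entries(uint64_t mcg_cap)
    {
        return MAX_NR_LVT_ENTRIES - !(mcg_cap & MCG_CMCI_P);
    }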
2022-07-09  Merge branch 'kvm-5.20-msr-eperm'  (Sean Christopherson, 1 file, -3/+3)
Merge a bug fix and cleanups for {g,s}et_msr_mce() using a base that predates commit 281b52780b57 ("KVM: x86: Add emulation for MSR_IA32_MCx_CTL2 MSRs."), which was written with the intention that it be applied _after_ the bug fix and cleanups. The bug fix in particular needs to be sent to stable trees; give them a stable hash to use.
2022-07-09  KVM: x86: Add helpers to identify CTL and STATUS MCi MSRs  (Sean Christopherson, 1 file, -6/+19)
Add helpers to identify CTL (control) and STATUS MCi MSR types instead of open coding the checks using the offset. Using the offset is perfectly safe, but unintuitive, as understanding what the code does requires knowing that the offset calculation will not affect the lower three bits. Opportunistically comment the STATUS logic to save readers a trip to Intel's SDM or AMD's APM to understand the "data != 0" check. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20220512222716.4112548-4-seanjc@google.com
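The MCi bank registers repeat in groups of four (CTL, STATUS, ADDR, MISC) starting at MSR 0x400, so the register type is simply the offset modulo 4. A sketch under that layout; the helper names follow the commit subject, but the exact KVM implementation may differ.

    #include <stdbool.h>
    #include <stdint.h>

    #define MSR_IA32_MC0_CTL 0x400u /* MCi banks: CTL, STATUS, ADDR, MISC repeat every 4 MSRs */

    static bool is_mci_control_msr(uint32_t msr)
    {
        return ((msr - MSR_IA32_MC0_CTL) & 0x3) == 0;
    }

    static bool is_mci_status_msr(uint32_t msr)
    {
        return ((msr - MSR_IA32_MC0_CTL) & 0x3) == 1;
    }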