summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-12-08KVM: VMX: Introduce vmx_msr_bitmap_l01_changed() helperVitaly Kuznetsov1-4/+13
In preparation to enabling 'Enlightened MSR Bitmap' feature for Hyper-V guests move MSR bitmap update tracking to a dedicated helper. Note: vmx_msr_bitmap_l01_changed() is called when MSR bitmap might be updated. KVM doesn't check if the bit we're trying to set is already set (or the bit it's trying to clear is already cleared). Such situations should not be common and a few false positives should not be a problem. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211129094704.326635-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08Merge branch 'kvm-on-hv-msrbm-fix' into HEADPaolo Bonzini591-5393/+6943
Merge bugfix for enlightened MSR Bitmap, before adding support to KVM for exposing the feature to nested guests.
2021-12-08KVM: x86: Exit to userspace if emulation prepared a completion callbackHou Wenlong1-0/+3
em_rdmsr() and em_wrmsr() return X86EMUL_IO_NEEDED if MSR accesses required an exit to userspace. However, x86_emulate_insn() doesn't return X86EMUL_*, so x86_emulate_instruction() doesn't directly act on X86EMUL_IO_NEEDED; instead, it looks for other signals to differentiate between PIO, MMIO, etc. causing RDMSR/WRMSR emulation not to exit to userspace now. Nevertheless, if the userspace_msr_exit_test testcase in selftests is changed to test RDMSR/WRMSR with a forced emulation prefix, the test passes. What happens is that first userspace exit information is filled but the userspace exit does not happen. Because x86_emulate_instruction() returns 1, the guest retries the instruction---but this time RIP has already been adjusted past the forced emulation prefix, so the guest executes RDMSR/WRMSR and the userspace exit finally happens. Since the X86EMUL_IO_NEEDED path has provided a complete_userspace_io callback, x86_emulate_instruction() can just return 0 if the callback is not NULL. Then RDMSR/WRMSR instruction emulation will exit to userspace directly, without the RDMSR/WRMSR vmexit. Fixes: 1ae099540e8c7 ("KVM: x86: Allow deflecting unknown MSR accesses to user space") Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <56f9df2ee5c05a81155e2be366c9dc1f7adc8817.1635842679.git.houwenlong93@linux.alibaba.com>
2021-12-08KVM: nVMX: Don't use Enlightened MSR Bitmap for L3Vitaly Kuznetsov1-9/+13
When KVM runs as a nested hypervisor on top of Hyper-V it uses Enlightened VMCS and enables Enlightened MSR Bitmap feature for its L1s and L2s (which are actually L2s and L3s from Hyper-V's perspective). When MSR bitmap is updated, KVM has to reset HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP from clean fields to make Hyper-V aware of the change. For KVM's L1s, this is done in vmx_disable_intercept_for_msr()/vmx_enable_intercept_for_msr(). MSR bitmap for L2 is build in nested_vmx_prepare_msr_bitmap() by blending MSR bitmap for L1 and L1's idea of MSR bitmap for L2. KVM, however, doesn't check if the resulting bitmap is different and never cleans HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP in eVMCS02. This is incorrect and may result in Hyper-V missing the update. The issue could've been solved by calling evmcs_touch_msr_bitmap() for eVMCS02 from nested_vmx_prepare_msr_bitmap() unconditionally but doing so would not give any performance benefits (compared to not using Enlightened MSR Bitmap at all). 3-level nesting is also not a very common setup nowadays. Don't enable 'Enlightened MSR Bitmap' feature for KVM's L2s (real L3s) for now. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211129094704.326635-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: x86: Use different callback if msr access comes from the emulatorHou Wenlong1-36/+49
If msr access triggers an exit to userspace, the complete_userspace_io callback would skip instruction by vendor callback for kvm_skip_emulated_instruction(). However, when msr access comes from the emulator, e.g. if kvm.force_emulation_prefix is enabled and the guest uses rdmsr/wrmsr with kvm prefix, VM_EXIT_INSTRUCTION_LEN in vmcs is invalid and kvm_emulate_instruction() should be used to skip instruction instead. As Sean noted, unlike the previous case, there's no #UD if unrestricted guest is disabled and the guest accesses an MSR in Big RM. So the correct way to fix this is to attach a different callback when the msr access comes from the emulator. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <34208da8f51580a06e45afefac95afea0e3f96e3.1635842679.git.houwenlong93@linux.alibaba.com>
2021-12-08KVM: x86: Add an emulation type to handle completion of user exitsHou Wenlong2-4/+17
The next patch would use kvm_emulate_instruction() with EMULTYPE_SKIP in complete_userspace_io callback to fix a problem in msr access emulation. However, EMULTYPE_SKIP only updates RIP, more things like updating interruptibility state and injecting single-step #DBs would be done in the callback. Since the emulator also does those things after x86_emulate_insn(), add a new emulation type to pair with EMULTYPE_SKIP to do those things for completion of user exits within the emulator. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <8f8c8e268b65f31d55c2881a4b30670946ecfa0d.1635842679.git.houwenlong93@linux.alibaba.com>
2021-12-08KVM: x86: Handle 32-bit wrap of EIP for EMULTYPE_SKIP with flat code segSean Christopherson1-1/+6
Truncate the new EIP to a 32-bit value when handling EMULTYPE_SKIP as the decode phase does not truncate _eip. Wrapping the 32-bit boundary is legal if and only if CS is a flat code segment, but that check is implicitly handled in the form of limit checks in the decode phase. Opportunstically prepare for a future fix by storing the result of any truncation in "eip" instead of "_eip". Fixes: 1957aa63be53 ("KVM: VMX: Handle single-step #DB for EMULTYPE_SKIP on EPT misconfig") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <093eabb1eab2965201c9b018373baf26ff256d85.1635842679.git.houwenlong93@linux.alibaba.com>
2021-12-08KVM: Clear pv eoi pending bit only when it is setLi RongQing1-21/+19
merge pv_eoi_get_pending and pv_eoi_clr_pending into a single function pv_eoi_test_and_clear_pending, which returns and clear the value of the pending bit. This makes it possible to clear the pending bit only if the guest had set it, and otherwise skip the call to pv_eoi_put_user(). This can save up to 300 nsec on AMD EPYC processors. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-Id: <1636026974-50555-2-git-send-email-lirongqing@baidu.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: x86: don't print when fail to read/write pv eoi memoryLi RongQing1-12/+6
If guest gives MSR_KVM_PV_EOI_EN a wrong value, this printk() will be trigged, and kernel log is spammed with the useless message Fixes: 0d88800d5472 ("kvm: x86: ioapic and apic debug macros cleanup") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Cc: stable@kernel.org Message-Id: <1636026974-50555-1-git-send-email-lirongqing@baidu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Remove mmu parameter from load_pdptrs()Lai Jiangshan5-12/+12
It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs, and nested NPT treats them like all other non-leaf page table levels instead of caching them. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Rename gpte_is_8_bytes to has_4_byte_gpte and invert the directionLai Jiangshan5-16/+16
This bit is very close to mean "role.quadrant is not in use", except that it is false also when the MMU is mapping guest physical addresses directly. In that case, role.quadrant is indeed not in use, but there are no guest PTEs at all. Changing the name and direction of the bit removes the special case, since a guest with paging disabled, or not considering guest paging structures as is the case for two-dimensional paging, does not have to deal with 4-byte guest PTEs. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-10-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Use ept_caps_to_lpage_level() in hardware_setup()Lai Jiangshan1-10/+2
Using ept_caps_to_lpage_level is simpler. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Add parameter huge_page_level to kvm_init_shadow_ept_mmu()Lai Jiangshan4-6/+19
The level of supported large page on nEPT affects the rsvds_bits_mask. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Add huge_page_level to __reset_rsvds_bits_mask_ept()Lai Jiangshan1-10/+19
Bit 7 on pte depends on the level of supported large page. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Remove mmu->translate_gpaLai Jiangshan5-20/+19
Reduce an indirect function call (retpoline) and some intialization code. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Add parameter struct kvm_mmu *mmu into mmu->gva_to_gpa()Lai Jiangshan4-68/+41
The mmu->gva_to_gpa() has no "struct kvm_mmu *mmu", so an extra FNAME(gva_to_gpa_nested) is needed. Add the parameter can simplify the code. And it makes it explicit that the walk is upon vcpu->arch.walk_mmu for gva and vcpu->arch.mmu for L2 gpa in translate_nested_gpa() via the new parameter. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Calculate quadrant when !role.gpte_is_8_bytesLai Jiangshan1-1/+1
role.quadrant is only valid when gpte size is 4 bytes and only be calculated when gpte size is 4 bytes. Although "vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL" also means gpte size is 4 bytes, but using "!role.gpte_is_8_bytes" is clearer Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-15-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Remove useless code to set role.gpte_is_8_bytes when role.directLai Jiangshan1-2/+0
role.gpte_is_8_bytes is unused when role.direct; there is no point in changing a bit in the role, the value that was set when the MMU is initialized is just fine. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-14-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Remove unused declaration of __kvm_mmu_free_some_pages()Lai Jiangshan1-1/+0
The body of __kvm_mmu_free_some_pages() has been removed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-13-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Fix comment in __kvm_mmu_create()Lai Jiangshan1-1/+1
The allocation of special roots is moved to mmu_alloc_special_roots(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-12-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Skip allocating pae_root for vcpu->arch.guest_mmu when !tdp_enabledLai Jiangshan1-0/+4
It is never used. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: SVM: Allocate sd->save_area with __GFP_ZEROLai Jiangshan1-3/+1
And remove clear_page() on it. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-10-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: SVM: Rename get_max_npt_level() to get_npt_level()Lai Jiangshan1-4/+4
It returns the only proper NPT level, so the "max" in the name is not appropriate. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Change comments about vmx_get_msr()Lai Jiangshan1-1/+1
The variable name is changed in the code. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Use kvm_set_msr_common() for MSR_IA32_TSC_ADJUST in the default wayLai Jiangshan1-3/+0
MSR_IA32_TSC_ADJUST can be left to the default way which also uese kvm_set_msr_common(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()Lai Jiangshan2-14/+11
The host CR3 in the vcpu thread can only be changed when scheduling. Moving the code in vmx_prepare_switch_to_guest() makes the code simpler. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Update msr value after kvm_set_user_return_msr() succeedsLai Jiangshan1-5/+3
Aoid earlier modification. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Avoid to rdmsrl(MSR_IA32_SYSENTER_ESP)Lai Jiangshan1-3/+11
The value of host MSR_IA32_SYSENTER_ESP is known to be constant for each CPU: (cpu_entry_stack(cpu) + 1) when 32 bit syscall is enabled or NULL is 32 bit syscall is not enabled. So rdmsrl() can be avoided for the first case and both rdmsrl() and vmcs_writel() can be avoided for the second case. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Update mmu->pdptrs only when it is changedLai Jiangshan1-3/+6
It is unchanged in most cases. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211111144527.88852-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Remove kvm_register_clear_available()Lai Jiangshan1-7/+0
It has no user. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-15-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: vmx, svm: clean up mass updates to regs_avail/regs_dirty bitsPaolo Bonzini6-16/+43
Document the meaning of the three combinations of regs_avail and regs_dirty. Update regs_dirty just after writeback instead of doing it later after vmexit. After vmexit, instead, we clear the regs_avail bits corresponding to lazily-loaded registers. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirtyLai Jiangshan1-2/+2
When vcpu->arch.cr3 is changed, it is marked dirty, so vmcs.GUEST_CR3 can be updated only when kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-12-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Mark CR3 dirty when vcpu->arch.cr3 is changedLai Jiangshan2-3/+3
When vcpu->arch.cr3 is changed, it should be marked dirty unless it is being updated to the value of the architecture guest CR3 (i.e. VMX.GUEST_CR3 or vmcb->save.cr3 when tdp is enabled). This patch has no functionality changed because kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3) is superset of kvm_register_mark_available(vcpu, VCPU_EXREG_CR3) with additional change to vcpu->arch.regs_dirty, but no code uses regs_dirty for VCPU_EXREG_CR3. (vmx_load_mmu_pgd() uses vcpu->arch.regs_avail instead to test if VCPU_EXREG_CR3 dirty which means current code (ab)uses regs_avail for VCPU_EXREG_CR3 dirty information.) Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: SVM: Remove references to VCPU_EXREG_CR3Lai Jiangshan2-3/+0
VCPU_EXREG_CR3 is never cleared from vcpu->arch.regs_avail or vcpu->arch.regs_dirty in SVM; therefore, marking CR3 as available is merely a NOP, and testing it will likewise always succeed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: SVM: Remove outdated comment in svm_load_mmu_pgd()Lai Jiangshan1-1/+0
The comment had been added in the commit 689f3bf21628 ("KVM: x86: unify callbacks to load paging root") and its related code was removed later, and it has nothing to do with the next line of code. So the comment should be removed too. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Move CR0 pdptr_bits into header file as X86_CR0_PDPTR_BITSLai Jiangshan2-2/+4
Not functionality changed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Add and use X86_CR4_PDPTR_BITS when !enable_eptLai Jiangshan3-4/+5
In set_cr4_guest_host_mask(), all cr4 pdptr bits are already set to be intercepted in an unclear way. Add X86_CR4_PDPTR_BITS to make it clear and self-documented. No functionality changed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-6-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Add and use X86_CR4_TLBFLUSH_BITS when !enable_eptLai Jiangshan2-1/+3
In set_cr4_guest_host_mask(), X86_CR4_PGE is set to be intercepted when !enable_ept just because X86_CR4_PGE is the only bit that is responsible for flushing TLB but listed in KVM_POSSIBLE_CR4_GUEST_BITS. It is clearer and self-documented to use X86_CR4_TLBFLUSH_BITS instead. No functionality changed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: SVM: Track dirtiness of PDPTRs even if NPT is disabledLai Jiangshan1-4/+9
Use the same logic to handle the availability of VCPU_EXREG_PDPTR as VMX, also removing a branch in svm_vcpu_run(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Mark VCPU_EXREG_PDPTR available in ept_save_pdptrs()Lai Jiangshan1-1/+1
mmu->pdptrs[] and vmcs.GUEST_PDPTR[0-3] are synced, so mmu->pdptrs is available and GUEST_PDPTR[0-3] is not dirty. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Ensure that dirty PDPTRs are loadedLai Jiangshan1-0/+1
For VMX with EPT, dirty PDPTRs need to be loaded before the next vmentry via vmx_load_mmu_pgd() But not all paths that call load_pdptrs() will cause vmx_load_mmu_pgd() to be invoked. Normally, kvm_mmu_reset_context() is used to cause KVM_REQ_LOAD_MMU_PGD, but sometimes it is skipped: * commit d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs() when changing CR0.CD and CR0.NW. * commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush") skips KVM_REQ_LOAD_MMU_PGD after load_pdptrs() when rewriting the CR3 with the same value. * commit a91a7c709600("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after load_pdptrs() when changing CR4.PGE. Fixes: d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed") Fixes: 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush") Fixes: a91a7c709600 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: x86/svm: Add module param to control PMU virtualizationLike Xu4-1/+17
For Intel, the guest PMU can be disabled via clearing the PMU CPUID. For AMD, all hw implementations support the base set of four performance counters, with current mainstream hardware indicating the presence of two additional counters via X86_FEATURE_PERFCTR_CORE. In the virtualized world, the AMD guest driver may detect the presence of at least one counter MSR. Most hypervisor vendors would introduce a module param (like lbrv for svm) to disable PMU for all guests. Another control proposal per-VM is to pass PMU disable information via MSR_IA32_PERF_CAPABILITIES or one bit in CPUID Fn4000_00[FF:00]. Both of methods require some guest-side changes, so a module parameter may not be sufficiently granular, but practical enough. Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20211117080304.38989-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Remove vCPU from PI wakeup list before updating PID.NVSean Christopherson1-5/+22
Remove the vCPU from the wakeup list before updating the notification vector in the posted interrupt post-block helper. There is no need to wake the current vCPU as it is by definition not blocking. Practically speaking this is a nop as it only shaves a few meager cycles in the unlikely case that the vCPU was migrated and the previous pCPU gets a wakeup IRQ right before PID.NV is updated. The real motivation is to allow for more readable code in the future, when post-block is merged with vmx_vcpu_pi_load(), at which point removal from the list will be conditional on the old notification vector. Opportunistically add comments to document why KVM has a per-CPU spinlock that, at first glance, appears to be taken only on the owning CPU. Explicitly call out that the spinlock must be taken with IRQs disabled, a detail that was "lost" when KVM switched from spin_lock_irqsave() to spin_lock(), with IRQs disabled for the entirety of the relevant path. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-29-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Move Posted Interrupt ndst computation out of write loopSean Christopherson1-14/+11
Hoist the CPU => APIC ID conversion for the Posted Interrupt descriptor out of the loop to write the descriptor, preemption is disabled so the CPU won't change, and if the APIC ID changes KVM has bigger problems. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-28-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Read Posted Interrupt "control" exactly once per loop iterationSean Christopherson1-3/+3
Use READ_ONCE() when loading the posted interrupt descriptor control field to ensure "old" and "new" have the same base value. If the compiler emits separate loads, and loads into "new" before "old", KVM could theoretically drop the ON bit if it were set between the loads. Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-27-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Save/restore IRQs (instead of CLI/STI) during PI pre/post blockSean Christopherson1-6/+7
Save/restore IRQs when disabling IRQs in posted interrupt pre/post block in preparation for moving the code into vcpu_put/load(), where it would be called with IRQs already disabled. No functional changed intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-26-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Drop pointless PI.NDST update when blockingSean Christopherson1-20/+3
Don't update Posted Interrupt's NDST, a.k.a. the target pCPU, in the pre-block path, as NDST is guaranteed to be up-to-date. The comment about the vCPU being preempted during the update is simply wrong, as the update path runs with IRQs disabled (from before snapshotting vcpu->cpu, until after the update completes). Since commit 8b306e2f3c41 ("KVM: VMX: avoid double list add with VT-d posted interrupts", 2017-09-27) The vCPU can get preempted _before_ the update starts, but not during. And if the vCPU is preempted before, vmx_vcpu_pi_load() is responsible for updating NDST when the vCPU is scheduled back in. In that case, the check against the wakeup vector in vmx_vcpu_pi_load() cannot be true as that would require the notification vector to have been set to the wakeup vector _before_ blocking. Opportunistically switch to using vcpu->cpu for the list/lock lookups, which do not need pre_pcpu since the same commit. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Use boolean returns for Posted Interrupt "test" helpersSean Christopherson2-5/+5
Return bools instead of ints for the posted interrupt "test" helpers. The bit position of the flag being test does not matter to the callers, and is in fact lost by virtue of test_bit() itself returning a bool. Returning ints is potentially dangerous, e.g. "pi_test_on(pi_desc) == 1" is safe-ish because ON is bit 0 and thus any sane implementation of pi_test_on() will work, but for SN (bit 1), checking "== 1" would rely on pi_test_on() to return 0 or 1, a.k.a. bools, as opposed to 0 or 2 (the positive bit position). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Drop unnecessary PI logic to handle impossible conditionsSean Christopherson1-14/+10
Drop sanity checks on the validity of the previous pCPU when handling vCPU block/unlock for posted interrupts. The intention behind the sanity checks is to avoid memory corruption in case of a race or incorrect locking, but the code has been stable for a few years now and the checks get in the way of eliminating kvm_vcpu.pre_cpu. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: VMX: Skip Posted Interrupt updates if APICv is hard disabledSean Christopherson1-4/+7
Explicitly skip posted interrupt updates if APICv is disabled in all of KVM, or if the guest doesn't have an in-kernel APIC. The PI descriptor is kept up-to-date if APICv is inhibited, e.g. so that re-enabling APICv doesn't require a bunch of updates, but neither the module param nor the APIC type can be changed on-the-fly. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>