path: root/arch/x86/kvm
Age    Commit message    Author    Files    Lines
2021-06-25KVM: x86/mmu: Use MMU's role to compute PKRU bitmaskSean Christopherson1-14/+7
Use the MMU's role to calculate the Protection Keys (Restrict Userspace) bitmask instead of pulling bits from current vCPU state. For some flows, the vCPU state may not be correct (or relevant), e.g. EPT doesn't interact with PKRU. Case in point, the "ept" param simply disappears. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-34-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Use MMU's role to compute permission bitmaskSean Christopherson1-9/+8
Use the MMU's role to generate the permission bitmasks for the MMU. For some flows, the vCPU state may not be correct (or relevant), e.g. the nested NPT MMU can be initialized with incoherent vCPU state. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-33-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Drop vCPU param from reserved bits calculatorSean Christopherson1-7/+4
Drop the vCPU param from __reset_rsvds_bits_mask() as it's now unused, and ideally will remain unused in the future. Any information that's needed by the low level helper should be explicitly provided as it's used for both shadow/host MMUs and guest MMUs, i.e. vCPU state may be meaningless or simply wrong. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-32-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Use MMU's role to get CR4.PSE for computing rsvd bitsSean Christopherson1-1/+1
Use the MMU's role to get CR4.PSE when calculating reserved bits for the guest's PTEs. Practically speaking, this is a glorified nop as the role always comes from vCPU state for the relevant flows, but converting to the roles will provide consistency once everything else is converted, and will Just Work if the "always comes from vCPU" behavior were ever to change (unlikely). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-31-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Don't grab CR4.PSE for calculating shadow reserved bitsSean Christopherson1-6/+9
Unconditionally pass pse=false when calculating reserved bits for shadow PTEs. CR4.PSE is only relevant for 32-bit non-PAE paging, which KVM does not use for shadow paging (including nested NPT). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-30-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Always set new mmu_role immediately after checking old roleSean Christopherson1-6/+9
Refactor shadow MMU initialization to immediately set its new mmu_role after verifying it differs from the old role, and so that all flavors of MMU initialization share the same check-and-set pattern. Immediately setting the role will allow future commits to use mmu_role to configure the MMU without consuming stale state. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-29-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
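The check-and-set pattern described above looks roughly like this (a hedged sketch; the function body is elided and the names follow the surrounding series):

    static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
                                        union kvm_mmu_role new_role)
    {
            if (new_role.as_u64 == context->mmu_role.as_u64)
                    return;

            /* Commit the role up front so that all subsequent configuration
             * reads new_role instead of stale vCPU state. */
            context->mmu_role.as_u64 = new_role.as_u64;

            /* ... remaining MMU configuration derives from context->mmu_role ... */
    }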
2021-06-25KVM: x86/mmu: Set CR4.PKE/LA57 in MMU role iff long mode is activeSean Christopherson1-2/+4
Don't set cr4_pke or cr4_la57 in the MMU role if long mode isn't active, which is required for protection keys and 5-level paging to be fully enabled. Ignoring the bit avoids unnecessary reconfiguration on reuse, and also means consumers of mmu_role don't need to manually check for long mode. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-28-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
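A sketch of the gating, assuming the four-underscore role_regs helpers from the "kvm_mmu_role_regs" entry below:

    /* Protection keys and 5-level paging are fully enabled iff long mode is
     * active; leave the role bits clear otherwise to maximize role reuse. */
    role.ext.cr4_pke  = ____is_efer_lma(regs) && ____is_cr4_pke(regs);
    role.ext.cr4_la57 = ____is_efer_lma(regs) && ____is_cr4_la57(regs);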
2021-06-25KVM: x86/mmu: Do not set paging-related bits in MMU role if CR0.PG=0Sean Christopherson1-10/+14
Don't set CR0/CR4/EFER bits in the MMU role if paging is disabled; the paging modifiers are irrelevant if there is no paging in the first place. Somewhat arbitrarily clear gpte_is_8_bytes for shadow paging if paging is disabled in the guest. Again, there are no guest PTEs to process, so the size is meaningless. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-27-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Add accessors to query mmu_role bitsSean Christopherson2-1/+22
Add accessors via a builder macro for all mmu_role bits that track a CR0, CR4, or EFER bit, abstracting whether the bits are in the base or the extended role. Future commits will switch to using mmu_role instead of vCPU state to configure the MMU, i.e. there are about to be a large number of users. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-26-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
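The builder macro takes roughly this shape (abridged sketch; the upstream patch instantiates accessors for every role-tracked bit):

    #define BUILD_MMU_ROLE_ACCESSOR(base_or_ext, reg, name)                 \
    static inline bool is_##reg##_##name(struct kvm_mmu *mmu)               \
    {                                                                       \
            return !!(mmu->mmu_role.base_or_ext.reg##_##name);              \
    }

    BUILD_MMU_ROLE_ACCESSOR(ext,  cr0, pg);   /* lives in the extended role */
    BUILD_MMU_ROLE_ACCESSOR(base, cr0, wp);   /* lives in the base role */
    BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smep);
    BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pke);
    BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);  /* enabled by the "efer_nx" rename below */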
2021-06-25KVM: x86/mmu: Rename "nxe" role bit to "efer_nx" for macro shenanigansSean Christopherson2-2/+2
Rename "nxe" to "efer_nx" so that future macro magic can use the pattern <reg>_<bit> for all CR0, CR4, and EFER bits that included in the role. Using "efer_nx" also makes it clear that the role bit reflects EFER.NX, not the NX bit in the corresponding PTE. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Use MMU's role_regs, not vCPU state, to compute mmu_roleSean Christopherson1-40/+52
Use the provided role_regs to calculate the mmu_role instead of pulling bits from current vCPU state. For some flows, e.g. nested TDP, the vCPU state may not be correct (or relevant). Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Ignore CR0 and CR4 bits in nested EPT MMU roleSean Christopherson1-1/+3
Do not incorporate CR0/CR4 bits into the role for the nested EPT MMU, as EPT behavior is not influenced by CR0/CR4. Note, this is the guest_mmu (L1's EPT), not the nested_mmu (L2's IA32 paging); the nested_mmu does need CR0/CR4, and is initialized in a separate flow. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Consolidate misc updates into shadow_mmu_init_context()Sean Christopherson1-11/+6
Consolidate the MMU metadata update calls to deduplicate code, and to prep for future cleanup. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Add struct and helpers to retrieve MMU role bits from regsSean Christopherson1-13/+53
Introduce "struct kvm_mmu_role_regs" to hold the register state that is incorporated into the mmu_role. For nested TDP, the register state that is factored into the MMU isn't vCPU state; the dedicated struct will be used to propagate the correct state throughout the flows without having to pass multiple params, and also provides helpers for the various flag accessors. Intentionally make the new helpers cumbersome/ugly by prepending four underscores. In the not-too-distant future, it will be preferable to use the mmu_role to query bits as the mmu_role can drop irrelevant bits without creating contradictions, e.g. clearing CR4 bits when CR0.PG=0. Reserve the clean helper names (no underscores) for the mmu_role. Add a helper for vCPU conversion, which is the common case. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Grab shadow root level from mmu_role for shadow MMUsSean Christopherson1-13/+5
Use the mmu_role to initialize shadow root level instead of assuming the level of KVM's shadow root (host) is the same as that of the guest root, or in the case of 32-bit non-PAE paging where KVM forces PAE paging. For nested NPT, the shadow root level cannot be adapted to L1's NPT root level and is instead always the TDP root level because NPT uses the current host CR0/CR4/EFER, e.g. 64-bit KVM can't drop into 32-bit PAE to shadow L1's NPT. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Move nested NPT reserved bit calculation into MMU properSean Christopherson3-7/+8
Move nested NPT's invocation of reset_shadow_zero_bits_mask() into the MMU proper and unexport said function. Aside from dropping an export, this is a baby step toward eliminating the call entirely by fixing the shadow_root_level confusion. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86: Read and pass all CR0/CR4 role bits to shadow MMU helperSean Christopherson3-9/+10
Grab all CR0/CR4 MMU role bits from current vCPU state when initializing a non-nested shadow MMU. Extract the masks from kvm_post_set_cr{0,4}(), as the CR0/CR4 update masks must exactly match the mmu_role bits, with one exception (see below). The "full" CR0/CR4 will be used by future commits to initialize the MMU and its role, as opposed to the current approach of pulling everything from the vCPU, which is incorrect for certain flows, e.g. nested NPT. CR4.LA57 is an exception, as it can be toggled on VM-Exit (for L1's MMU) but can't be toggled via MOV CR4 while long mode is active. I.e. LA57 needs to be in the mmu_role, but technically doesn't need to be checked by kvm_post_set_cr4(). However, the extra check is completely benign as the hardware restrictions simply mean LA57 will never be _the_ cause of an MMU reset during MOV CR4. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
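The extracted masks look roughly like this (bit lists per the description above; treat the exact list as a sketch):

    /* CR0/CR4 bits that factor into the MMU role; a MOV to CR0/CR4 that
     * changes any of these must trigger an MMU reset. LA57 is included even
     * though hardware prevents toggling it via MOV CR4 in long mode. */
    #define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)
    #define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PSE | X86_CR4_PAE | X86_CR4_LA57 | \
                                   X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE)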
2021-06-25KVM: x86/mmu: Drop smep_andnot_wp check from "uses NX" for shadow MMUsSean Christopherson1-2/+1
Drop the smep_andnot_wp role check from the "uses NX" calculation now that all non-nested shadow MMUs treat NX as used via the !TDP check. The shadow MMU for nested NPT, which shares the helper, does not need to deal with SMEP (or WP) as NPT walks are always "user" accesses and WP is explicitly noted as being ignored: "Table walks for guest page tables are always treated as user writes at the nested page table level. A table walk for the guest page itself is always treated as a user access at the nested page table level. The host hCR0.WP bit is ignored under nested paging." Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
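After this change the calculation reduces to roughly:

    /* NX is treated as in-use for all shadow MMUs, see the "Treat NX as
     * used (not reserved) for all !TDP shadow MMUs" entry below. */
    bool uses_nx = context->nx || !tdp_enabled;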
2021-06-25KVM: nSVM: Add a comment to document why nNPT uses vmcb01, not vCPU stateSean Christopherson1-0/+6
Add a comment in the nested NPT initialization flow to call out that it intentionally uses vmcb01 instead of current vCPU state to get the effective hCR4 and hEFER for L1's NPT context. Note, despite nSVM's efforts to handle the case where vCPU state doesn't reflect L1 state, the MMU may still do the wrong thing due to pulling state from the vCPU instead of the passed in CR0/CR4/EFER values. This will be addressed in future commits. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86: Fix sizes used to pass around CR0, CR4, and EFERSean Christopherson4-9/+10
When configuring KVM's MMU, pass CR0 and CR4 as unsigned longs, and EFER as a u64 in various flows (mostly MMU). Passing the params as u32s is functionally ok since all of the affected registers reserve bits 63:32 to zero (enforced by KVM), but it's technically wrong. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
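For example, the nested NPT entry point ends up with full-width parameters (a sketch of the corrected prototype):

    void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
                                 unsigned long cr4, u64 efer, gpa_t nested_cr3);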
2021-06-25KVM: x86/mmu: Rename unsync helper and update related commentsSean Christopherson3-13/+34
Rename mmu_need_write_protect() to mmu_try_to_unsync_pages() and update a variety of related, stale comments. Add several new comments to call out subtle details, e.g. that upper-level shadow pages are write-tracked, and that can_unsync is false iff KVM is in the process of synchronizing pages. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Drop the intermediate "transient" __kvm_sync_page()Sean Christopherson1-12/+5
Move the kvm_unlink_unsync_page() call out of kvm_sync_page() and into its sole caller, and fold __kvm_sync_page() into kvm_sync_page() since the latter becomes a pure pass-through. There really should be no reason for code to do a complete sync of a shadow page outside of the full kvm_mmu_sync_roots(), e.g. the one use case that crept in turned out to be flawed and counter-productive. Drop the stale comment about @sp->gfn needing to be write-protected, as it directly contradicts the kvm_mmu_get_page() usage. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: comment on kvm_mmu_get_page's syncing of pagesSean Christopherson1-2/+11
Explain the usage of sync_page() in kvm_mmu_get_page(), which is subtle in how and why it differs from mmu_sync_children(). Signed-off-by: Sean Christopherson <seanjc@google.com> [Split out of a different patch by Sean. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: WARN and zap SP when sync'ing if MMU role mismatchesSean Christopherson2-6/+26
When synchronizing a shadow page, WARN and zap the page if its mmu role isn't compatible with the current MMU context, where "compatible" is an exact match sans the bits that have no meaning in the overall MMU context or will be explicitly overwritten during the sync. Many of the helpers used by sync_page() are specific to the current context, updating an SMM vs. non-SMM shadow page would use the wrong memslots, updating L1 vs. L2 PTEs might work but would be extremely bizarre, and so on and so forth. Drop the guard with respect to 8-byte vs. 4-byte PTEs in __kvm_sync_page(), it was made useless when kvm_mmu_get_page() stopped trying to sync shadow pages irrespective of the current MMU context. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Use MMU role to check for matching guest page sizesSean Christopherson1-13/+3
Originally, __kvm_sync_page used to check the cr4_pae bit in the role to avoid zapping 4-byte kvm_mmu_pages when the guest's page tables use 8-byte PTEs, or the other way round. However, in commit 47c42e6b4192 ("KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'", 2019-03-28) it was observed that this did not work for nested EPT, where the page table size would be 8 bytes even if CR4.PAE=0. (Note that the check still has to be done for nested *NPT*, so it is not possible to use tdp_enabled or similar). Therefore, a hack was introduced to identify nested EPT shadow pages and unconditionally call __kvm_sync_page() on them. However, it is possible to do without the hack to identify nested EPT shadow pages: if EPT is active, there will be no shadow pages in non-EPT format, and all of them will have gpte_is_8_bytes set to true; we can just check the MMU role directly, and the test will always be true. Even for non-EPT shadow MMUs, this test should really always be true now that __kvm_sync_page() is called if and only if the role is an exact match (kvm_mmu_get_page()) or is part of the current MMU context (kvm_mmu_sync_roots()). A future commit will convert the likely-pointless check into a meaningful WARN to enforce that the mmu_roles of the current context and the shadow page are compatible. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Unconditionally zap unsync SPs when creating >4k SP at GFNSean Christopherson1-34/+16
When creating a new upper-level shadow page, zap unsync shadow pages at the same target gfn instead of attempting to sync the pages. This fixes a bug where an unsync shadow page could be sync'd with an incompatible context, e.g. wrong smm, is_guest, etc... flags. In practice, the bug is relatively benign as sync_page() is all but guaranteed to fail its check that the guest's desired gfn (for the to-be-sync'd page) matches the current gfn associated with the shadow page. I.e. kvm_sync_page() would end up zapping the page anyways. Alternatively, __kvm_sync_page() could be modified to explicitly verify the mmu_role of the unsync shadow page is compatible with the current MMU context. But, except for this specific case, __kvm_sync_page() is called iff the page is compatible, e.g. the transient sync in kvm_mmu_get_page() requires an exact role match, and the call from kvm_mmu_sync_roots() is only synchronizing shadow pages from the current MMU (which better be compatible or KVM has problems). And as described above, attempting to sync shadow pages when creating an upper-level shadow page is unlikely to succeed, e.g. zero successful syncs were observed when running Linux guests despite over a million attempts. Fixes: 9f1a122f970d ("KVM: MMU: allow more page become unsync at getting sp time") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-10-seanjc@google.com> [Remove WARN_ON after __kvm_sync_page. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25Revert "KVM: MMU: record maximum physical address width in ↵Sean Christopherson1-1/+0
kvm_mmu_extended_role" Drop MAXPHYADDR from mmu_role now that all MMUs have their role invalidated after a CPUID update. Invalidating the role forces all MMUs to re-evaluate the guest's MAXPHYADDR, and the guest's MAXPHYADDR can only be changed only through a CPUID update. This reverts commit de3ccd26fafc707b09792d9b633c8b5b48865315. Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is brokenSean Christopherson2-0/+22
Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest instability. Initialize last_vmentry_cpu to -1 and use it to detect if the vCPU has been run at least once when its CPUID model is changed. KVM does not correctly handle changes to paging related settings in the guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc... KVM could theoretically zap all shadow pages, but actually making that happen is a mess due to lock inversion (vcpu->mutex is held). And even then, updating paging settings on the fly would only work if all vCPUs are stopped, updated in concert with identical settings, then restarted. To support running vCPUs with different vCPU models (that affect paging), KVM would need to track all relevant information in kvm_mmu_page_role. Note, that's the _page_ role, not the full mmu_role. Updating mmu_role isn't sufficient as a vCPU can reuse a shadow page translation that was created by a vCPU with different settings and thus completely skip the reserved bit checks (that are tied to CPUID). Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as it would require doubling gfn_track from a u16 to a u32, i.e. would increase KVM's memory footprint by 2 bytes for every 4kb of guest memory. E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT would all need to be tracked. In practice, there is no remotely sane use case for changing any paging related CPUID entries on the fly, so just sweep it under the rug (after yelling at userspace). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
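A sketch of the detection in the KVM_SET_CPUID{,2} path (the exact warning text is illustrative):

    /* last_vmentry_cpu is initialized to -1, so any other value means this
     * vCPU has done KVM_RUN at least once. */
    if (vcpu->arch.last_vmentry_cpu != -1)
            pr_warn_ratelimited("KVM: KVM_SET_CPUID{,2} after KVM_RUN may cause guest instability\n");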
2021-06-25KVM: x86: Force all MMUs to reinitialize if guest CPUID is modifiedSean Christopherson2-3/+15
Invalidate all MMUs' roles after a CPUID update to force reinitialization of the MMU context/helpers. Despite the efforts of commit de3ccd26fafc ("KVM: MMU: record maximum physical address width in kvm_mmu_extended_role"), there are still a handful of CPUID-based properties that affect MMU behavior but are not incorporated into mmu_role. E.g. 1gb hugepage support, AMD vs. Intel handling of bit 8, and SEV's C-Bit location all factor into the guest's reserved PTE bits. The obvious alternative would be to add all such properties to mmu_role, but doing so provides no benefit over simply forcing a reinitialization on every CPUID update, as setting guest CPUID is a rare operation. Note, reinitializing all MMUs after a CPUID update does not fix all of KVM's woes. Specifically, kvm_mmu_page_role doesn't track the CPUID properties, which means that a vCPU can reuse shadow pages that should not exist for the new vCPU model, e.g. that map GPAs that are now illegal (due to MAXPHYADDR changes) or that set bits that are now reserved (PAGE_SIZE for 1gb pages), etc... Tracking the relevant CPUID properties in kvm_mmu_page_role would address the majority of problems, but fully tracking that much state in the shadow page role comes with an unpalatable cost as it would require a non-trivial increase in KVM's memory footprint. The GBPAGES case is even worse, as neither Intel nor AMD provides a way to disable 1gb hugepage support in the hardware page walker, i.e. it's a virtualization hole that can't be closed when using TDP. In other words, resetting the MMU after a CPUID update is largely a superficial fix. But, it will allow reverting the tracking of MAXPHYADDR in the mmu_role, and that case in particular needs to mostly work because KVM's shadow_root_level depends on guest MAXPHYADDR when 5-level paging is supported. For cases where KVM botches guest behavior, the damage is limited to that guest. But for the shadow_root_level, a misconfigured MMU can cause KVM to incorrectly access memory, e.g. due to walking off the end of its shadow page tables. Fixes: 7dcd57552008 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed") Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
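A hedged sketch of the invalidation; clearing the extended role's 'valid' bit guarantees the next role comparison mismatches and forces reinitialization:

    void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
    {
            /* Invalidate all cached MMU roles, forcing reinitialization on
             * next use; CPUID updates are rare, so the cost is irrelevant. */
            vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
            vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
            vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
            kvm_mmu_reset_context(vcpu);
    }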
2021-06-25Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack"Sean Christopherson1-0/+1
Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested virtualization. When KVM (L0) is using TDP, CR4.LA57 is not reflected in mmu_role.base.level because that tracks the shadow root level, i.e. TDP level. Normally, this is not an issue because LA57 can't be toggled while long mode is active, i.e. the guest has to first disable paging, then toggle LA57, then re-enable paging, thus ensuring an MMU reinitialization. But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57 without having to bounce through an unpaged section. L1 can also load a new CR3 on exit, i.e. it doesn't even need to play crazy paging games, a single entry PML5 is sufficient. Such shenanigans are only problematic if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode. Note, in the L2 case with nested TDP, even though L1 can switch between L2s with different LA57 settings, thus bypassing the paging requirement, in that case KVM's nested_mmu will track LA57 in base.level. This reverts commit 8053f924cad30bf9f9a24e02b6c8ddfabf5202ea. Fixes: 8053f924cad3 ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Use MMU's role to detect CR4.SMEP value in nested NPT walkSean Christopherson1-2/+1
Use the MMU's role to get its effective SMEP value when injecting a fault into the guest. When walking L1's (nested) NPT while L2 is active, vCPU state will reflect L2, whereas NPT uses the host's (L1 in this case) CR0, CR4, EFER, etc... If L1 and L2 have different settings for SMEP and L1 does not have EFER.NX=1, this can result in an incorrect PFEC.FETCH when injecting #NPF. Fixes: e57d4a356ad3 ("KVM: Add instruction fetch checking when walking guest page table") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
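A sketch of the corrected PFEC.FETCH derivation in the guest page-table walker (field access shown directly; the accessor macros added later in the series clean this up):

    /* Consult the walker MMU's effective NX/SMEP, not vCPU CR4/EFER, so a
     * nested NPT walk reflects L1's paging state even while L2 is active. */
    if (fetch_fault && (mmu->nx || mmu->mmu_role.ext.cr4_smep))
            errcode |= PFERR_FETCH_MASK;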
2021-06-25KVM: x86: Properly reset MMU context at vCPU RESET/INITSean Christopherson1-0/+13
Reset the MMU context at vCPU INIT (and RESET for good measure) if CR0.PG was set prior to INIT. Simply re-initializing the current MMU is not sufficient as the current root HPA may not be usable in the new context. E.g. if TDP is disabled and INIT arrives while the vCPU is in long mode, KVM will fail to switch to the 32-bit pae_root and bomb on the next VM-Enter due to running with a 64-bit CR3 in 32-bit mode. This bug was papered over in both VMX and SVM, but still managed to rear its head in the MMU role on VMX. Because EFER.LMA=1 requires CR0.PG=1, kvm_calc_shadow_mmu_root_page_role() checks for EFER.LMA without first checking CR0.PG. VMX's RESET/INIT flow writes CR0 before EFER, and so an INIT with the vCPU in 64-bit mode will cause the hack-a-fix to generate the wrong MMU role. In VMX, the INIT issue is specific to running without unrestricted guest since unrestricted guest is available if and only if EPT is enabled. Commit 8668a3c468ed ("KVM: VMX: Reset mmu context when entering real mode") resolved the issue by forcing a reset when entering emulated real mode. In SVM, commit ebae871a509d ("kvm: svm: reset mmu on VCPU reset") forced a MMU reset on every INIT to workaround the flaw in common x86. Note, at the time the bug was fixed, the SVM problem was exacerbated by a complete lack of a CR4 update. The vendor resets will be reverted in future patches, primarily to aid bisection in case there are non-INIT flows that rely on the existing VMX logic. Because CR0.PG is unconditionally cleared on INIT, and because CR0.WP and all CR4/EFER paging bits are ignored if CR0.PG=0, simply checking that CR0.PG was '1' prior to INIT/RESET is sufficient to detect a required MMU context reset. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
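The shape of the check in kvm_vcpu_reset(), per the description above (a sketch):

    unsigned long old_cr0 = kvm_read_cr0(vcpu);

    /* ... INIT/RESET processing, which unconditionally clears CR0.PG ... */

    /* CR0.WP and all CR4/EFER paging bits are ignored when CR0.PG=0, so a
     * previously set CR0.PG suffices to detect a required MMU reset. */
    if (old_cr0 & X86_CR0_PG)
            kvm_mmu_reset_context(vcpu);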
2021-06-25KVM: x86/mmu: Treat NX as used (not reserved) for all !TDP shadow MMUsSean Christopherson1-1/+9
Mark NX as being used for all non-nested shadow MMUs, as KVM will set the NX bit for huge SPTEs if the iTLB multi-hit mitigation is enabled. Checking the mitigation itself is not sufficient as it can be toggled on at any time and KVM doesn't reset MMU contexts when that happens. KVM could reset the contexts, but that would require purging all SPTEs in all MMUs, for no real benefit. And, KVM already forces EFER.NX=1 when TDP is disabled (for WP=0, SMEP=1, NX=0), so technically NX is never reserved for shadow MMUs. Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: x86/mmu: Remove broken WARN that fires on 32-bit KVM w/ nested EPTSean Christopherson1-7/+0
Remove a misguided WARN that attempts to detect the scenario where using a special A/D tracking flag will set reserved bits on a non-MMIO spte. The WARN triggers false positives when using EPT with 32-bit KVM because of the !64-bit clause, which is just flat out wrong. The whole A/D tracking goo is specific to EPT, and one of the big selling points of EPT is that EPT is decoupled from the host's native paging mode. Drop the WARN instead of trying to salvage the check. Keeping a check specific to A/D tracking bits would essentially regurgitate the same code that led to KVM needing the tracking bits in the first place. A better approach would be to add a generic WARN on reserved bits being set, which would naturally cover the A/D tracking bits, work for all flavors of paging, and be self-documenting to some extent. Fixes: 8a406c89532c ("KVM: x86/mmu: Rename and document A/D scheme for TDP SPTEs") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: debugfs: Reuse binary stats descriptorsJing Zhang1-48/+1
To remove code duplication, use the binary stats descriptors in the implementation of the debugfs interface for statistics. This unifies the definition of statistics for the binary and debugfs interfaces. Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-8-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: stats: Support binary stats retrieval for a VCPUJing Zhang1-0/+41
Add a VCPU ioctl to get a statistics file descriptor from which userspace can read out the VCPU stats header, descriptors and data. Define VCPU statistics descriptors and header for all architectures. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-5-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25KVM: stats: Support binary stats retrieval for a VMJing Zhang1-0/+25
Add a VM ioctl to get a statistics file descriptor from which userspace can read out the VM stats header, descriptors and data. Define VM statistics descriptors and header for all architectures. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-4-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: stats: Add fd-based API to read binary stats dataJing Zhang1-1/+1
This commit defines the API for userspace and prepares the common functionality to support per VM/VCPU binary stats data readings. KVM stats are currently only accessible via debugfs, which has some shortcomings this change series is supposed to fix: 1. The current debugfs stats solution in KVM can be disabled when kernel Lockdown mode is enabled, which is a potential risk for production. 2. The current debugfs stats solution in KVM is organized as "one stats per file"; it is good for debugging, but not efficient for production. 3. The stats read/clear in the current debugfs solution in KVM are protected by the global kvm_lock. Besides that, there are some other benefits with this change: 1. All KVM VM/VCPU stats can be read out in bulk by one copy to userspace. 2. A schema is used to describe KVM statistics. From userspace's perspective, the KVM statistics are self-describing. 3. With the fd-based solution, a separate telemetry tool would be able to read KVM stats in a less privileged environment. 4. After the initial setup by reading in stats descriptors, a telemetry tool only needs to read the stats data itself; no more parsing or setup is needed. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-3-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
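A hedged userspace sketch of consuming the new API (error handling omitted):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* vcpu_fd is an existing vCPU file descriptor. */
    int stats_fd = ioctl(vcpu_fd, KVM_GET_STATS_FD, NULL);

    struct kvm_stats_header header;
    pread(stats_fd, &header, sizeof(header), 0);

    /* Descriptors start at header.desc_offset, data at header.data_offset; a
     * telemetry tool parses descriptors once, then re-reads only the data. */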
2021-06-24KVM: stats: Separate generic stats from architecture specific onesJing Zhang1-7/+7
Generic KVM stats are those collected in architecture independent code or those supported by all architectures; put all generic statistics in a separate structure. This ensures that they are defined the same way in the statistics API which is being added, removing duplication among different architectures in the declaration of the descriptors. No functional change intended. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-2-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Don't WARN on a NULL shadow page in TDP MMU checkSean Christopherson1-4/+6
Treat a NULL shadow page in the "is a TDP MMU" check as valid, non-TDP root. KVM uses a "direct" PAE paging MMU when TDP is disabled and the guest is running with paging disabled. In that case, root_hpa points at the pae_root page (of which only 32 bytes are used), not a standard shadow page, and the WARN fires (a lot). Fixes: 0b873fd7fb53 ("KVM: x86/mmu: Remove redundant is_tdp_mmu_enabled check") Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622072454.3449146-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
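The fixed check, roughly (a sketch):

    static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
    {
            struct kvm_mmu_page *sp = to_shadow_page(mmu->root_hpa);

            /* A NULL shadow page is legal when using the special pae_root as
             * the root; such a root is by definition not a TDP MMU root. */
            if (!sp)
                    return false;

            return is_tdp_mmu_page(sp) && sp->root_count;
    }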
2021-06-24KVM: nVMX: Handle split-lock #AC exceptions that happen in L2Sean Christopherson4-2/+11
Mark #ACs that won't be reinjected to the guest as wanted by L0 so that KVM handles split-lock #AC from L2 instead of forwarding the exception to L1. Split-lock #AC isn't yet virtualized, i.e. L1 will treat it like a regular #AC and do the wrong thing, e.g. reinject it into L2. Fixes: e6f8b6c12f03 ("KVM: VMX: Extend VMXs #AC interceptor to handle split lock #AC in guest") Cc: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622172244.3561540-1-seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Fix uninitialized boolean variable flushColin Ian King1-1/+1
In the case where kvm_memslots_have_rmaps(kvm) is false, the boolean variable flush is never set and is uninitialized. If is_tdp_mmu_enabled(kvm) is true, the uninitialized value of flush is then passed to kvm_tdp_mmu_zap_collapsible_sptes(). Fix this by initializing flush to false. Addresses-Coverity: ("Uninitialized scalar variable") Fixes: e2209710ccc5 ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622150912.23429-1-colin.king@canonical.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
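The fix is the one-word initialization (surrounding flow sketched):

    bool flush = false;     /* previously uninitialized when rmaps are absent */

    if (kvm_memslots_have_rmaps(kvm)) {
            /* ... rmap-based zapping sets flush as needed ... */
    }

    if (is_tdp_mmu_enabled(kvm))
            flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush);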
2021-06-24KVM: x86: Print CPU of last attempted VM-entry when dumping VMCS/VMCBJim Mattson2-0/+4
Failed VM-entry is often due to a faulty core. To help identify bad cores, print the id of the last logical processor that attempted VM-entry whenever dumping a VMCS or VMCB. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210621221648.1833148-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-23x86/pkru: Remove xstate fiddling from write_pkru()Thomas Gleixner1-2/+2
The PKRU value of a task is stored in task->thread.pkru when the task is scheduled out. PKRU is restored on schedule in from there. So keeping the XSAVE buffer up to date is a pointless exercise. Remove the xstate fiddling and cleanup all related functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121456.897372712@linutronix.de
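A sketch of the simplified helper (assuming the asm/pkru.h placement from the move below):

    static inline void write_pkru(u32 pkru)
    {
            if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
                    return;

            current->thread.pkru = pkru;  /* restored on schedule-in */
            wrpkru(pkru);                 /* live register; no XSAVE buffer update */
    }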
2021-06-23x86/pkeys: Move read_pkru() and write_pkru()Dave Hansen2-0/+2
write_pkru() was originally used just to write to the PKRU register. It was mercifully short and sweet and was not out of place in pgtable.h with some other pkey-related code. But, later work included a requirement to also modify the task XSAVE buffer when updating the register. This really is more related to the XSAVE architecture than to paging. Move the read/write_pkru() to asm/pkru.h. pgtable.h won't miss them. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121455.102647114@linutronix.de
2021-06-23x86/fpu: Rename copy_kernel_to_fpregs() to restore_fpregs_from_fpstate()Thomas Gleixner1-2/+2
This is not a copy operation; it restores the register state from the supplied kernel buffer. No functional changes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121454.716058365@linutronix.de
2021-06-23x86/fpu: Rename copy_fpregs_to_fpstate() to save_fpregs_to_fpstate()Thomas Gleixner1-1/+1
A copy is guaranteed to leave the source intact, which is not the case when FNSAVE is used as that reinitializes the registers. Save does not make such guarantees and it matches what this is about, i.e. to save the state for a later restore. Rename it to save_fpregs_to_fpstate(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121454.508853062@linutronix.de
2021-06-23x86/kvm: Avoid looking up PKRU in XSAVE bufferDave Hansen1-21/+24
PKRU is being removed from the kernel XSAVE/FPU buffers. This removal will probably include warnings for code that looks up PKRU in those buffers. KVM currently looks up the location of PKRU but doesn't even use the pointer that it gets back. Rework the code to avoid calling get_xsave_addr() except in cases where its result is actually used. This makes the code more clear and also avoids the inevitable PKRU warnings. This is probably a good cleanup and could go upstream independently of any PKRU rework. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121453.541037562@linutronix.de
2021-06-23Merge branch 'topic/ppc-kvm' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux into HEADPaolo Bonzini3-37/+7
- Support for the H_RPT_INVALIDATE hypercall - Conversion of Book3S entry/exit to C - Bug fixes
2021-06-21KVM: nVMX: Dynamically compute max VMCS index for vmcs12Sean Christopherson3-8/+43
Calculate the max VMCS index for vmcs12 by walking the array to find the actual max index. Hardcoding the index is prone to bitrot, and the calculation is only done on KVM bringup (albeit on every CPU, but there aren't _that_ many null entries in the array). Fixes: 3c0f99366e34 ("KVM: nVMX: Add a TSC multiplier field in VMCS12") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210618214658.2700765-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
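A hedged sketch of the walk (identifiers approximate the vmcs12 offset-table helpers):

    static u64 nested_vmx_calc_vmcs_enum_msr(void)
    {
            unsigned int max_idx = 0, idx;
            int i;

            /* Find the highest field index present in vmcs12; zero entries in
             * the offset table are unsupported fields. */
            for (i = 0; i < nr_vmcs12_fields; i++) {
                    if (!vmcs_field_to_offset_table[i])
                            continue;

                    idx = vmcs_field_index(VMCS12_IDX_TO_ENC(i));
                    if (idx > max_idx)
                            max_idx = idx;
            }

            return (u64)max_idx << VMCS_FIELD_INDEX_SHIFT;
    }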