|
When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest
mode prior to initializing the MMU for nested NPT so that guest_mode is
set in the MMU's role. KVM's model is that all L2 MMUs are tagged with
guest_mode, as the behavior of hypervisor MMUs tends to be significantly
different than kernel MMUs.
Practically speaking, the bug is relatively benign, as KVM only directly
queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and
kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths
that are optimizations (mmu_page_zap_pte() and
shadow_mmu_try_split_huge_pages()).
And while the role is incorporated into shadow page usage, because nested
NPT requires KVM to be using NPT for L1, reusing shadow pages across L1
and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs
will have direct=0.
Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning
of the flow instead of forcing guest_mode in the MMU, as nothing in
nested_vmcb02_prepare_control() between the old and new locations touches
TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present
inconsistent vCPU state to the MMU.
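A minimal sketch of the intended ordering in nested_vmcb02_prepare_control()
(helper names taken from the existing SVM nested code; the surrounding
control-field setup is elided):

    /* Handle TLB bookkeeping and enter guest mode up front... */
    nested_svm_transition_tlb_flush(vcpu);
    enter_guest_mode(vcpu);

    /* ...so guest_mode is already set when the nested NPT MMU role is computed. */
    if (nested_npt_enabled(svm))
        nested_svm_init_mmu_context(vcpu);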
Fixes: 69cb877487de ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control")
Cc: stable@vger.kernel.org
Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When waking a VM's NX huge page recovery thread, ensure the thread is
actually alive before trying to wake it. Now that the thread is spawned
on-demand during KVM_RUN, a VM without a recovery thread is reachable via
the related module params.
BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:vhost_task_wake+0x5/0x10
Call Trace:
<TASK>
set_nx_huge_pages+0xcc/0x1e0 [kvm]
param_attr_store+0x8a/0xd0
module_attr_store+0x1a/0x30
kernfs_fop_write_iter+0x12f/0x1e0
vfs_write+0x233/0x3e0
ksys_write+0x60/0xd0
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f3b52710104
</TASK>
Modules linked in: kvm_intel kvm
CR2: 0000000000000040
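The guard boils down to something like the following sketch (field and
helper names follow the existing on-demand recovery-thread code):

    struct vhost_task *nx_thread = READ_ONCE(kvm->arch.nx_huge_page_recovery_thread);

    /* The thread is spawned on-demand at the first KVM_RUN, so it may not exist yet. */
    if (nx_thread)
        vhost_task_wake(nx_thread);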
Fixes: 931656b9e2ff ("kvm: defer huge page recovery vhost task to later")
Cc: stable@vger.kernel.org
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250124234623.3609069-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Some libraries want to ensure they are single threaded before forking,
so making the kernel's kvm huge page recovery process a vhost task of
the user process breaks those. The minijail library used by crosvm is
one such affected application.
Defer the task to after the first VM_RUN call, which occurs after the
parent process has forked all its jailed processes. This needs to happen
only once for the kvm instance, so introduce some general-purpose
infrastructure for that, too. It's similar in concept to pthread_once,
except it is actually usable because the callback takes a parameter.
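Conceptually, the call_once() infrastructure is shaped roughly like the
sketch below (the state constant and exact locking are assumptions based on
this description, not the final API):

    struct once {
        atomic_t state;
        struct mutex lock;
    };

    static inline void call_once(struct once *once, void (*cb)(struct once *))
    {
        /* Fast path: a previous caller already ran the callback. */
        if (atomic_read(&once->state) == ONCE_COMPLETED)
            return;

        guard(mutex)(&once->lock);
        if (atomic_read(&once->state) != ONCE_COMPLETED) {
            cb(once);
            atomic_set(&once->state, ONCE_COMPLETED);
        }
    }

KVM then invokes it from the first KVM_RUN, e.g. via something like
call_once(&kvm->arch.nx_once, kvm_mmu_start_lpage_recovery); the field and
callback names here are illustrative.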
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250123153543.2769928-1-kbusch@meta.com>
[Move call_once API to include/linux. - Paolo]
Cc: stable@vger.kernel.org
Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As part of enabling TDX virtual machines, support separation of
private/shared EPT into separate roots.
Confidential computing solutions almost invariably have concepts of
private and shared memory, but they may differ a lot in the details.
In SEV, for example, the bit is handled more like a permission bit as
far as the page tables are concerned: the private/shared bit is not
included in the physical address.
For TDX, instead, the bit is more like a physical address bit, with
the host mapping private memory in one half of the address space and
shared in another. Furthermore, the two halves are mapped by different
EPT roots and only the shared half is managed by KVM; the private half
(also called Secure EPT in Intel documentation) gets managed by the
privileged TDX Module via SEAMCALLs.
As a result, the operations that actually change the private half of
the EPT are limited and relatively slow compared to reading a PTE. For
this reason the design for KVM is to keep a mirror of the private EPT in
host memory. This allows KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.
There are thus three sets of EPT page tables: external, mirror and
direct. In the case of TDX (the only user of this framework) the
first two cover private memory, whereas the third manages shared
memory:
external EPT - Hidden within the TDX module, modified via TDX module
               calls.
mirror EPT   - Bookkeeping tree used as an optimization by KVM, not
               used by the processor.
direct EPT   - Normal EPT that maps unencrypted shared memory.
               Managed like the EPT of a normal VM.
Modifying external EPT
----------------------
Modifications to the mirrored page tables need to also perform the
same operations to the private page tables, which will be handled via
kvm_x86_ops. Although this prep series does not interact with the TDX
module at all to actually configure the private EPT, it does lay the
ground work for doing this.
In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module; however, the
locking is more complicated because inserting a single PTE can no longer be
done atomically with a single CMPXCHG. For this reason, the existing
FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the
private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides
protecting operation of KVM, it limits the set of cases in which the
TDX module will encounter contention on its own PTE locks.
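The pattern amounts to roughly the following sketch (the hook invocation is
simplified; the real code goes through kvm_x86_ops from the TDP MMU's atomic
SPTE update path):

    /* "Lock" the PTE: other walkers back off when they see a frozen SPTE. */
    if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
        return -EBUSY;

    /* Slow part: update the real private EPT via the TDX module. */
    ret = set_external_spte(kvm, gfn, level, pfn);

    /* "Unlock" by publishing the final value (or restoring the old one on error). */
    __kvm_tdp_mmu_write_spte(sptep, ret ? old_spte : new_spte);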
Zapping external EPT
--------------------
While the framework tries to be relatively generic, and to be
understandable without knowing TDX much in detail, some requirements of
TDX sometimes leak; for example the private page tables also cannot be
zapped while the range has anything mapped, so the mirrored/private page
tables need to be protected from KVM operations that zap any non-leaf
PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().
For normal VMs, guest memory is zapped for several reasons: user
memory getting paged out by the guest, memslots getting deleted,
passthrough of devices with non-coherent DMA. Confidential computing
adds to these the conversion of memory between shared and private. These
operations must not zap any private memory that is in use by the guest.
This is possible because the only zapping that is out of the control
of KVM/userspace is paging out userspace memory, which cannot apply to
guestmemfd operations. Thus a TDX VM will only zap private memory from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.
To avoid zapping too much memory, enums are introduced so that operations
can choose to target only private or shared memory, and thus only
direct or mirror EPT. For example:
Memslot deletion - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared - Private only
Conversion to private - Shared only
Other cases of zapping will not be supported for KVM, for example
APICv update or non-coherent DMA status update; for the latter, TDX will
simply require that the CPU supports self-snoop and honor guest PAT
unconditionally for shared memory.
|
|
KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
instead of just those where KVM needs to manage state and/or explicitly
enable the feature in hardware. Along the way, refactor the code to make
it easier to add features, and to make it more self-documenting how KVM
is handling each feature.
- Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
where KVM unintentionally puts the vCPU into infinite loops in some scenarios
(e.g. if emulation is triggered by the exit), and brings parity between VMX
and SVM.
- Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU, due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
|
|
KVM x86 MMU changes for 6.14:
- Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a
direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
|
|
Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID)
instead of 1 in kvm_mmu_page_fault().
The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e.,
npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(),
kvm_handle_page_fault()). They either check if the return value is > 0 (as
in npf_interception()) or pass it further to vcpu_run() to decide whether
to break out of the kernel loop and return to the user when r <= 0.
Therefore, returning any positive value is equivalent to returning 1.
Warn if r == RET_PF_CONTINUE (which should not be a valid value) to ensure
a positive return value.
This is a preparation to allow TDX's EPT violation handler to check the
RET_PF* value and retry internally for RET_PF_RETRY.
No functional changes are intended.
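The tail of kvm_mmu_page_fault() then ends up along these lines (a sketch):

    if (r < 0)
        return r;

    /* RET_PF_CONTINUE should never escape the fault handlers. */
    WARN_ON_ONCE(r == RET_PF_CONTINUE);

    /* Return the positive RET_PF_* value instead of collapsing it to 1. */
    return r;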
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250113021138.18875-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add a few sanity checks to prevent memslot GFNs from ever having alias bits
set.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.
For TDX, the shared half will be mapped in the higher alias, with a "shared
bit" set in the GPA. However, KVM will still manage it with the same
memslots as the private half. This means memslot lookups and zapping
operations will be provided with a GFN without the shared bit set.
If these memslot GFNs ever had the bit that selects between the two aliases
it could lead to unexpected behavior in the complicated code that directs
faulting or zapping operations between the roots that map the two aliases.
As a safety measure, prevent memslots from being set at a GFN range that
contains the alias bit.
Also, add a check in kvm_faultin_pfn() for the fault path. This latter check
does less today, as the alias bits are specifically stripped from the GFN
being checked; however, future code could possibly call into the fault
handler in a way that skips this stripping. Since kvm_faultin_pfn() now
has many references to vcpu->kvm, extract it to a local variable.
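The memslot-side check is conceptually just the following (a sketch; the
helper name matches the alias/direct-bits helpers used elsewhere in this
series):

    /* Reject memslots whose GFN range would overlap an alias (e.g. TDX shared) bit. */
    if (kvm_is_gfn_alias(kvm, new->base_gfn + new->npages - 1))
        return -EINVAL;

kvm_faultin_pfn() gains a similar sanity check on fault->gfn.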
Link: https://lore.kernel.org/kvm/ZpbKqG_ZhCWxl-Fc@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-19-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Don't zap valid mirror roots in kvm_tdp_mmu_zap_all(), which in effect
means zapping only direct roots (invalid and valid).
For TDX, kvm_tdp_mmu_zap_all() is only called during MMU notifier
release. Since mirrored EPT comes from guest mem, it will never be
mapped to userspace, so that zap doesn't apply. But in addition to being
unnecessary, mirrored EPT is cleaned up in a special way during VM
destruction.
Pass the KVM_INVALID_ROOTS bit into __for_each_tdp_mmu_root_yield_safe()
as well, to clean up invalid direct roots, as is the current behavior.
While at it, remove an obsolete reference to work item-based zapping.
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-18-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Rename kvm_tdp_mmu_invalidate_all_roots() to
kvm_tdp_mmu_invalidate_roots(), and make it take an enum
kvm_tdp_mmu_root_types argument.
kvm_tdp_mmu_invalidate_roots() is called with different root types. For
kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing
down a VM it needs to invalidate all roots. Have the callers only
invalidate the required roots instead of all roots.
Within kvm_tdp_mmu_invalidate_roots(), respect the root type passed by
checking the root type in the root iterator.
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-17-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Integrate hooks for mirroring page table operations for cases where TDX
will zap PTEs or free page tables.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.
Since calls into the TDX module are relatively slow, walking private page
tables by making calls into the TDX module would not be efficient. Because
of this, previous changes have taught the TDP MMU to keep a mirror root,
which is a separate, unmapped TDP root that private operations can be
directed to. Currently this root is disconnected from the guest. Now add
plumbing to propagate changes to the "external" page tables being
mirrored. Just create the x86_ops for now, leave plumbing the operations
into the TDX module for future patches.
Add two operations for tearing down page tables, one for freeing page
tables (free_external_spt) and one for zapping PTEs (remove_external_spte).
Define them such that remove_external_spte will perform a TLB flush as
well (in TDX terms, "ensure there are no active translations").
TDX MMU support will exclude certain MMU operations, so only plug in the
mirroring x86 ops where they will be needed. For zapping/freeing, only
hook tdp_mmu_iter_set_spte() which is used for mapping and linking PTs.
Don't bother hooking tdp_mmu_set_spte_atomic() as it is only used for
zapping PTEs in operations unsupported by TDX: zapping collapsible PTEs and
kvm_mmu_zap_all_fast().
In previous changes to address races around concurrent populating using
tdp_mmu_set_spte_atomic(), a solution was introduced to temporarily set
FROZEN_SPTE in the mirrored page tables while performing the external
operations. Such a solution is not needed for the tear down paths in TDX
as these will always be performed with the mmu_lock held for write.
Sprinkle some KVM_BUG_ON()s to reflect this.
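For reference, the two new hooks have roughly this shape in kvm_x86_ops (a
sketch; see the header for the final signatures):

    /* Free a page table that was backing an external (S-EPT) page table. */
    int (*free_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
                             void *external_spt);

    /* Zap a leaf SPTE in the external page table, ensuring no stale translations. */
    int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
                                kvm_pfn_t pfn_for_gfn);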
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-16-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Integrate hooks for mirroring page table operations for cases where TDX
will set PTEs or link page tables.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.
Since calls into the TDX module are relatively slow, walking private page
tables by making calls into the TDX module would not be efficient. Because
of this, previous changes have taught the TDP MMU to keep a mirror root,
which is a separate, unmapped TDP root that private operations can be
directed to. Currently this root is disconnected from any actual guest
mapping. Now add plumbing to propagate changes to the "external" page
tables being mirrored. Just create the x86_ops for now, leave plumbing the
operations into the TDX module for future patches.
Add two operations for setting up external page tables, one for linking
new page tables and one for setting leaf PTEs. Don't add any op for
configuring the root PFN, as TDX handles this itself. Don't provide a
way to set permissions on the PTEs also, as TDX doesn't support it.
This results in MMU "mirroring" support that is very targeted towards TDX.
Since it is likely there will be no other user, the main benefit of making
the support generic is to keep TDX specific *looking* code outside of the
MMU. As a generic feature it still makes enough sense from TDX's
perspective, and for developers unfamiliar with the TDX architecture it
expresses the general concepts well enough that they can continue to work
in the code.
TDX MMU support will exclude certain MMU operations, so only plug in the
mirroring x86 ops where they will be needed. For setting/linking, only
hook tdp_mmu_set_spte_atomic() which is used for mapping and linking
PTs. Don't bother hooking tdp_mmu_iter_set_spte() as it is only used for
setting PTEs in operations unsupported by TDX: splitting huge pages and
write protecting. Sprinkle KVM_BUG_ON()s to document as code that these
paths are not supported for mirrored page tables. For zapping operations,
leave those for near future changes.
Many operations in the TDP MMU depend on atomicity of the PTE update.
While the mirror PTE on KVM's side can be updated atomically, the update
that happens inside the external operations (S-EPT updates via TDX module
call) can't happen atomically with the mirror update. The following race
could result during two vCPU's populating private memory:
* vcpu 1: atomically update 2M level mirror EPT entry to be present
* vcpu 2: read 2M level EPT entry that is present
* vcpu 2: walk down into 4K level EPT
* vcpu 2: atomically update 4K level mirror EPT entry to be present
* vcpu 2: set_external_spte() to update 4K secure EPT entry => error
because 2M secure EPT entry is not populated yet
* vcpu 1: link_external_spt() to update 2M secure EPT entry
Prevent this by setting the mirror PTE to FROZEN_SPTE while the reflect
operations are performed. Only write the actual mirror PTE value once the
reflect operations have completed. When trying to set a PTE to present and
encountering a frozen SPTE, retry the fault.
By doing this the race is prevented as follows:
* vcpu 1: atomically update 2M level EPT entry to be FROZEN_SPTE
* vcpu 2: read 2M level EPT entry that is FROZEN_SPTE
* vcpu 2: find that the EPT entry is frozen
abandon page table walk to resume guest execution
* vcpu 1: link_external_spt() to update 2M secure EPT entry
* vcpu 1: atomically update 2M level EPT entry to be present (unfreeze)
* vcpu 2: resume guest execution
Depending on vcpu 1 state, vcpu 2 may result in EPT violation
again or make progress on guest execution
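On the fault side, the retry boils down to a check along these lines (sketch):

    /* Another vCPU is mid-update of the external page tables; let the guest retry. */
    if (is_frozen_spte(iter->old_spte))
        return RET_PF_RETRY;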
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-15-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Teach the MMU notifier callbacks how to check kvm_gfn_range.process to
filter which KVM MMU root types to operate on.
The private GPAs are backed by guest memfd. Such memory is not subjected
to MMU notifier callbacks because it can't be mapped into the host user
address space. Now kvm_gfn_range conveys info about which root to operate
on. Enhance the callback to filter the root page table type.
The KVM MMU notifier comes down to two functions.
kvm_tdp_mmu_unmap_gfn_range() and __kvm_tdp_mmu_age_gfn_range():
- invalidate_range_start() calls kvm_tdp_mmu_unmap_gfn_range()
- invalidate_range_end() doesn't call into arch code
- the other callbacks call __kvm_tdp_mmu_age_gfn_range()
For VMs without a private/shared split in the EPT, all operations
should target the normal (direct) root.
With the switch from for_each_tdp_mmu_root() to
__for_each_tdp_mmu_root() in kvm_tdp_mmu_handle_gfn(), there are no
longer any users of for_each_tdp_mmu_root(). Remove it.
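The filtering itself reduces to a small translation from the range's filter
to the root types to iterate (a sketch; the filter flag names follow the
kvm_gfn_range patch earlier in this series):

    enum kvm_tdp_mmu_root_types types = 0;

    if (!kvm_has_mirrored_tdp(kvm))
        return KVM_DIRECT_ROOTS;

    if (range->attr_filter & KVM_FILTER_SHARED)
        types |= KVM_DIRECT_ROOTS;
    if (range->attr_filter & KVM_FILTER_PRIVATE)
        types |= KVM_MIRROR_ROOTS;

    return types;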
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-14-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add the ability for the TDP MMU to maintain a mirror of a separate
mapping.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.
In order to handle both shared and private memory, KVM needs to learn to
handle faults and other operations on the correct root for the operation.
KVM could learn the concept of private roots, and operate on them by
calling out to operations that call into the TDX module. But there are two
problems with that:
1. Calls into the TDX module are relatively slow compared to the simple
accesses required to read a PTE managed directly by KVM.
2. Other Coco technologies deal with private memory completely differently
and it will make the code confusing when being read from their
perspective. Special operations added for TDX that set private or zap
private memory will have nothing to do with these other private memory
technologies. (SEV, etc).
To handle these, instead teach the TDP MMU about a new concept "mirror
roots". Such roots maintain page tables that are not actually mapped,
and are just used to traverse quickly to determine if the mid level page
tables need to be installed. When the memory being mirrored needs to
actually be changed, calls can be made via x86_ops.
    private KVM page fault     |
              |                |
              V                |
         private GPA           |      CPU protected EPTP
              |                |              |
              V                |              V
       mirror PT root          |      external PT root
              |                |              |
              V                |              V
          mirror PT  --hook to propagate-->  external PT
              |                |              |
              \----------------+------\       |
                               |      |       |
                               |      V       V
                               |   private guest page
                               |
                               |
     non-encrypted memory      |    encrypted memory
                               |
Leave calling out to actually update the private page tables that are being
mirrored for later changes. Just implement the handling of MMU operations
on mirrored roots.
In order to direct operations to correct root, add root types
KVM_DIRECT_ROOTS and KVM_MIRROR_ROOTS. Tie the usage of mirrored/direct
roots to private/shared with conditionals. It could also be implemented by
making the kvm_tdp_mmu_root_types and kvm_gfn_range_filter enum bits line
up such that conversion could be a direct assignment with a cast. Don't do
this because the mapping of private to mirrored is confusing enough. So it
is worth not hiding the logic in type casting.
Cleanup the mirror root in kvm_mmu_destroy() instead of the normal place
in kvm_mmu_free_roots(), because the private root that is being mirrored
cannot be rebuilt like a normal root. It needs to persist for the lifetime
of the VM.
The TDX module will also need to be provided with page tables to use for
the actual mapping being mirrored by the mirrored page tables. Allocate
these in the mapping path using the recently added
kvm_mmu_alloc_external_spt().
Don't support 2M pages for now. This is avoided by forcing 4K pages in the
fault path. Add a KVM_BUG_ON() to verify.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-13-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Take the root as an argument of tdp_mmu_for_each_pte() instead of looking
it up in the mmu. With no remaining purpose for passing the mmu, drop it.
Future changes will want to change which root is used based on the context
of the MMU operation. So change the callers to pass in the root currently
used, mmu->root.hpa in a preparatory patch to make the later one smaller
and easier to review.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-12-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Define an enum kvm_tdp_mmu_root_types to specify the KVM MMU root type [1]
so that the iterator on the root page table can consistently filter the
root page table type instead of only_valid.
TDX KVM will operate on KVM page tables with specified types. Shared page
table, private page table, or both. Introduce an enum instead of bool
only_valid so that we can easily enhance page table types applicable to
shared, private, or both in addition to valid or not. Replace
only_valid=false with KVM_ANY_ROOTS and only_valid=true with
KVM_ANY_VALID_ROOTS. Use KVM_ANY_ROOTS and KVM_ANY_VALID_ROOTS to wrap
KVM_VALID_ROOTS to avoid further code churn when direct vs mirror root
concepts are introduced in future patches.
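One plausible shape for the enum (bit values are illustrative, not taken
from the patch):

    enum kvm_tdp_mmu_root_types {
        KVM_INVALID_ROOTS = BIT(0),
        KVM_VALID_ROOTS   = BIT(1),

        KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS,
        KVM_ANY_ROOTS       = KVM_VALID_ROOTS | KVM_INVALID_ROOTS,
    };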
Link: https://lore.kernel.org/kvm/ZivazWQw1oCU8VBC@google.com/ [1]
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-11-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Extract tdp_mmu_root_match() to check if the root has given types and use
it for the root page table iterator. It checks only_valid now.
TDX KVM operates on a shared page table only (Shared-EPT), a mirrored page
table only (Secure-EPT), or both, based on the operation. KVM MMU notifier
operations operate only on the shared page table, KVM guest_memfd
invalidation operations operate only on the mirrored page table, and so on.
Introduce a centralized matching function instead of open coding the
matching logic in the iterator.
The next step is to extend the function to check whether the page is shared
or private.
Link: https://lore.kernel.org/kvm/ZivazWQw1oCU8VBC@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-10-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Teach the MMU to map guest GFNs at a massaged position on the TDP, to aid
in implementing TDX shared memory.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.
For TDX, the shared half will be mapped in the higher alias, with a "shared
bit" set in the GPA. However, KVM will still manage it with the same
memslots as the private half. This means memslot lookups and zapping
operations will be provided with a GFN without the shared bit set.
So KVM will either need to apply or strip the shared bit before mapping or
zapping the shared EPT. Having GFNs sometimes have the shared bit and
sometimes not would make the code confusing.
So instead arrange the code such that GFNs never have shared bit set.
Create a concept of "direct bits", that is stripped from the fault
address when setting fault->gfn, and applied within the TDP MMU iterator.
Calling code will behave as if it is operating on the PTE mapping the GFN
(without shared bits) but within the iterator, the actual mappings will be
shifted using bits specific for the root. SPs will have the GFN set
without the shared bit. In the end the TDP MMU will behave like it is
mapping things at the GFN without the shared bit but with a strange page
table format where everything is offset by the shared bit.
Since TDX only needs to shift the mapping like this for the shared bit,
which is mapped as the normal TDP root, add a "gfn_direct_bits" field to
the kvm_arch structure for each VM with a default value of 0. It will
have the bit set at the position of the GPA shared bit in GFN through TD
specific initialization code. Keep TDX specific concepts out of the MMU
code by not naming it "shared".
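In code terms this works out to stripping the bits when recording the fault
and re-applying them per root in the iterator, roughly (helper and field
names here are assumptions based on the description above):

    /* Fault path: record the GFN with the VM's direct bits (e.g. TDX shared bit) stripped. */
    fault->gfn = gpa_to_gfn(fault->addr) & ~kvm_gfn_direct_bits(vcpu->kvm);

    /* TDP MMU iterator: shift the walk by the bits attached to the root being used. */
    iter->gfn_bits = kvm_gfn_root_bits(kvm, root);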
Ranged TLB flushes (i.e. flush_remote_tlbs_range()) target specific GFN
ranges. In the convention established above, these would need to target the
shifted GFN range. It won't matter functionally, since the actual
implementation will always result in a full flush for the only planned
user (TDX). For correctness reasons, future changes can provide a TDX
x86_ops.flush_remote_tlbs_range implementation to return -EOPNOTSUPP and
force the full flush for TDs.
This leaves one problem. Some operations use a concept of max GFN (i.e.
kvm_mmu_max_gfn()), to iterate over the whole TDP range. When applying the
direct mask to the start of the range, the iterator would end up skipping
iterating over the range not covered by the direct mask bit. For safety,
make sure the __tdp_mmu_zap_root() operation iterates over the full GFN
range supported by the underlying TDP format. Add a new iterator helper,
for_each_tdp_pte_min_level_all(), that iterates the entire TDP GFN range,
regardless of root.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-9-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add a struct kvm argument to the TDP MMU iterators.
Future changes will want to change how the iterator behaves based on a
member of struct kvm. Change the signature and callers of the iterator
loop helpers in a separate patch to make the future one easier to review.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-8-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The kvm_tdp_mmu_alloc_root() function currently always returns 0. This
allows for the caller, mmu_alloc_direct_roots(), to call
kvm_tdp_mmu_alloc_root() and also return 0 in one line:
return kvm_tdp_mmu_alloc_root(vcpu);
So it is useful even though the return value of kvm_tdp_mmu_alloc_root()
is always the same. However, in future changes, kvm_tdp_mmu_alloc_root()
will be called twice in mmu_alloc_direct_roots(). This will force the
first call to either awkwardly handle the return value that will always
be zero or ignore it. So change kvm_tdp_mmu_alloc_root() to return void.
Do it in a separate change so the future change will be cleaner.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240718211230.1492011-7-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Introduce a "is_mirror" member to the kvm_mmu_page_role union to identify
SPTEs associated with the mirrored EPT.
The TDX module maintains the private half of the EPT mapped in the TD in
its protected memory. KVM keeps a copy of the private GPAs in a mirrored
EPT tree within host memory. This "is_mirror" attribute enables vCPUs to
find and get the root page of mirrored EPT from the MMU root list for a
guest TD. This also allows KVM MMU code to detect changes in mirrored EPT
according to the "is_mirror" mmu page role and propagate the changes to
the private EPT managed by TDX module.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-6-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add an external pointer to struct kvm_mmu_page for TDX's private page table
and add helper functions to allocate/initialize/free a private page table
page. TDX will only be supported with the TDP MMU. Because KVM TDP MMU
doesn't use unsync_children and write_flooding_count, pack them to have
room for a pointer and use a union to avoid memory overhead.
For private GPA, CPU refers to a private page table whose contents are
encrypted. The dedicated APIs to operate on it (e.g. updating/reading its
PTE entry) are used, and they are expensive.
When KVM resolves the KVM page fault, it walks the page tables. To reuse
the existing KVM MMU code and mitigate the heavy cost of directly walking
the private page table, allocate two sets of page tables for the private
half of the GPA space.
For the page tables that KVM will walk, allocate them like normal and refer
to them as mirror page tables. Additionally allocate one more page for the
page tables the CPU will walk, and call them external page tables. Resolve
the KVM page fault with the existing code, and do additional operations
necessary for modifying the external page table in future patches.
The relationship of the types of page tables in this scheme is depicted
below:
              KVM page fault                 |
                    |                        |
                    V                        |
        ------------+-----------             |
        |                      |             |
        V                      V             |
    shared GPA            private GPA        |
        |                      |             |
        V                      V             |
  shared PT root        mirror PT root       |   private PT root
        |                      |             |          |
        V                      V             |          V
    shared PT             mirror PT  --propagate-->  external PT
        |                      |             |          |
        |                      \-------------+------\   |
        |                                    |      |   |
        V                                    |      V   V
 shared guest page                           |  private guest page
                                             |
      non-encrypted memory                   |   encrypted memory
                                             |
PT - Page table
Shared PT - Visible to KVM, and the CPU uses it for shared mappings.
External PT - The CPU uses it, but it is invisible to KVM. TDX module
updates this table to map private guest pages.
Mirror PT - It is visible to KVM, but the CPU doesn't use it. KVM uses
it to propagate PT change to the actual private PT.
Add a helper kvm_has_mirrored_tdp() to trigger this behavior and wire it
to the TDX vm type.
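Roughly, the new state and helper look like this (a sketch following the
description above):

    /* In struct kvm_mmu_page: reuse space the TDP MMU doesn't need. */
    union {
        /* These two members aren't used for TDP MMU. */
        struct {
            unsigned int unsync_children;
            atomic_t write_flooding_count;
        };

        /* Page table the CPU walks for the private half (TDX Secure-EPT). */
        void *external_spt;
    };

    static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
    {
        return kvm->arch.vm_type == KVM_X86_TDX_VM;
    }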
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20240718211230.1492011-5-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add new members to struct kvm_gfn_range to indicate which mapping
(private-vs-shared) to operate on: enum kvm_gfn_range_filter
attr_filter. Update the core zapping operations to set them appropriately.
TDX utilizes two GPA aliases for the same memslots, one for private memory
and one for shared memory. For private memory, KVM
cannot always perform the same operations it does on memory for default
VMs, such as zapping pages and having them be faulted back in, as this
requires guest coordination. However, some operations such as guest driven
conversion of memory between private and shared should zap private memory.
Internally to the MMU, private and shared mappings are tracked on separate
roots. Mapping and zapping operations will operate on the respective GFN
alias for each root (private or shared). So zapping operations will by
default zap both aliases. Add fields in struct kvm_gfn_range to allow
callers to specify which aliases so they can only target the aliases
appropriate for their specific operation.
There was feedback that target aliases should be specified such that the
default value (0) is to operate on both aliases. Several options were
considered, including variations of having separate bools defined such
that the default behavior was to process both aliases; they either allowed
nonsensical configurations or were confusing for the caller. A simple
enum was also explored and was close, but was hard to process in the
caller. Instead, use an enum with the default value (0) reserved as a
disallowed value, and catch ranges that didn't have the target aliases
specified by looking for that specific value (see the sketch after the
list of operations below).
Set target alias with enum appropriately for these MMU operations:
- For KVM's mmu notifier callbacks, zap shared pages only because private
pages won't have a userspace mapping
- For setting memory attributes, kvm_arch_pre_set_memory_attributes()
chooses the aliases based on the attribute.
- For guest_memfd invalidations, zap private only.
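A sketch of the new filter; 0, i.e. no alias specified, is deliberately
invalid (the flag names are assumptions based on the description):

    enum kvm_gfn_range_filter {
        KVM_FILTER_SHARED  = BIT(0),
        KVM_FILTER_PRIVATE = BIT(1),
    };

    struct kvm_gfn_range {
        /* ... existing fields ... */
        enum kvm_gfn_range_filter attr_filter;
    };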
Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Prepare for a future TDX patch which asserts that atomic zapping
(i.e. zapping with mmu_lock taken for read) doesn't operate on mirror roots.
When tearing down a VM, all roots have to be zapped (including mirror
roots once they're in place) so do that with the mmu_lock taken for write.
kvm_mmu_uninit_tdp_mmu() is invoked either before or after executing any
atomic operations on SPTEs by vCPU threads. Therefore, it will not impact
vCPU threads performance if kvm_tdp_mmu_zap_invalidated_roots() acquires
mmu_lock for write to zap invalid roots.
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-2-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Treat slow-path TDP MMU faults as spurious if the access is allowed given
the existing SPTE to fix a benign warning (other than the WARN itself)
due to replacing a writable SPTE with a read-only SPTE, and to avoid the
unnecessary LOCK CMPXCHG and subsequent TLB flush.
If a read fault races with a write fault, fast GUP fails for any reason
when trying to "promote" the read fault to a writable mapping, and KVM
resolves the write fault first, then KVM will end up trying to install a
read-only SPTE (for a !map_writable fault) overtop a writable SPTE.
Note, it's not entirely clear why fast GUP fails, or if that's even how
KVM ends up with a !map_writable fault with a writable SPTE. If something
else is going awry, e.g. due to a bug in mmu_notifiers, then treating read
faults as spurious in this scenario could effectively mask the underlying
problem.
However, retrying the faulting access instead of overwriting an existing
SPTE is functionally correct and desirable irrespective of the WARN, and
fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed
bit in primary MMU's PTE is toggled and causes a PTE value mismatch. The
WARN was also recently added, specifically to track down scenarios where
KVM unnecessarily overwrites SPTEs, i.e. treating the fault as spurious
doesn't regress KVM's bug-finding capabilities in any way. In short,
letting the WARN linger because there's a tiny chance it's due to a bug
elsewhere would be excessively paranoid.
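The fix reduces to treating the fault as spurious when the existing SPTE
already allows the access, roughly (a sketch of the check in the TDP MMU map
path):

    if (is_shadow_present_pte(iter->old_spte) &&
        is_access_allowed(fault, iter->old_spte) &&
        is_last_spte(iter->old_spte, iter->level))
            return RET_PF_SPURIOUS;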
Fixes: 1a175082b190 ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable")
Reported-by: Lei Yang <leiyang@redhat.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588
Tested-by: Lei Yang <leiyang@redhat.com>
Link: https://lore.kernel.org/r/20241218213611.3181643-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
As the first step toward replacing KVM's so-called "governed features"
framework with a more comprehensive, less poorly named implementation,
replace the "kvm_governed_feature" function prefix with "guest_cpu_cap"
and rename guest_can_use() to guest_cpu_cap_has().
The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and
provides a more clear distinction between guest capabilities, which are
KVM controlled (heh, or one might say "governed"), and guest CPUID, which
with few exceptions is fully userspace controlled.
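For example, an illustrative call site:

    -   if (guest_can_use(vcpu, X86_FEATURE_VGIF))
    +   if (guest_cpu_cap_has(vcpu, X86_FEATURE_VGIF))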
Opportunistically rewrite the comment about XSS passthrough for SEV-ES
guests to avoid referencing so many functions, as such comments are prone
to becoming stale (case in point...).
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
At first glance it isn't obvious why calling kvm_tdp_page_fault() in
kvm_mmu_do_page_fault() is special cased, as the general case of using an
indirect call would result in calling kvm_tdp_page_fault() anyway.
Add a comment to explain the reason.
Signed-off-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20241108161416.28552-1-jgross@suse.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
kvm_vm_create_worker_thread() is meant to be used for kthreads that
can consume significant amounts of CPU time on behalf of a VM or in
response to how the VM behaves (for example how it accesses its memory).
Therefore it wants to charge the CPU time consumed by that work to
the VM's container.
However, because of these threads, cgroups which have kvm instances
inside never complete freezing. This can be trivially reproduced:
root@test ~# mkdir /sys/fs/cgroup/test
root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs
root@test ~# qemu-system-x86_64 -nographic -enable-kvm
and in another terminal:
root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
root@test ~# cat /sys/fs/cgroup/test/cgroup.events
populated 1
frozen 0
The cgroup freezing happens in the signal delivery path but
kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never
calls into the signal delivery path and thus never gets frozen. Because
the cgroup freezer determines whether a given cgroup is frozen by
comparing the number of frozen threads to the total number of threads
in the cgroup, the cgroup never becomes frozen and users waiting for
the state transition may hang indefinitely.
Since the worker kthread is tied to a user process, it's better if
it behaves similarly to user tasks as much as possible, including
being able to send SIGSTOP and SIGCONT. In fact, vhost_task is all
that kvm_vm_create_worker_thread() wanted to be and more: not only does it
inherit the userspace process's cgroups, it has other niceties like
being parented properly in the process tree. Use it instead of the
homegrown alternative.
Incidentally, the new code is also better behaved when you flip recovery
back and forth to disabled and back to enabled. If your recovery period
is 1 minute, it will run the next recovery after 1 minute independent
of how many times you flipped the parameter.
(Commit message based on emails from Tejun).
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Luca Boccassi <bluca@debian.org>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Luca Boccassi <bluca@debian.org>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM x86 misc changes for 6.13
- Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
- Quirk KVM's misguided behavior of initializing certain feature MSRs to
their maximum supported feature set, which can result in KVM creating
invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero
value results in the vCPU having invalid state if userspace hides PDCM
from the guest, which can lead to save/restore failures.
- Fix KVM's handling of non-canonical checks for vCPUs that support LA57
to better follow the "architecture", in quotes because the actual
behavior is poorly documented. E.g. most MSR writes and descriptor
table loads ignore CR4.LA57 and operate purely on whether the CPU
supports LA57.
- Bypass the register cache when querying CPL from kvm_sched_out(), as
filling the cache from IRQ context is generally unsafe, and harden the
cache accessors to try to prevent similar issues from occurring in the
future.
- Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
- Minor cleanups
|
|
Drop the per-VM zapped_obsolete_pages list now that the usage from the
defunct mmu_shrinker is gone, and instead use a local list to track pages
in kvm_zap_obsolete_pages(), the sole remaining user of
zapped_obsolete_pages.
Opportunistically add an assertion to verify and document that slots_lock
must be held, i.e. that there can only be one active instance of
kvm_zap_obsolete_pages() at any given time, and by doing so also prove
that using a local list instead of a per-VM list doesn't change any
functionality (beyond trivialities like list initialization).
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com
[sean: split to separate patch, write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Remove KVM's MMU shrinker and (almost) all of its related code, as the
current implementation is very disruptive to VMs (if it ever runs),
without providing any meaningful benefit[1].
Alternatively, KVM could repurpose its shrinker, e.g. to reclaim pages
from the per-vCPU caches[2], but given that no one has complained about
lack of TDP MMU support for the shrinker in the 3+ years since the TDP MMU
was enabled by default, it's safe to say that there is likely no real use
case for initiating reclaim of KVM's page tables from the shrinker.
And while clever/cute, reclaiming the per-vCPU caches doesn't scale the
same way that reclaiming in-use page table pages does. E.g. the amount of
memory being used by a VM doesn't always directly correlate with the
number of vCPUs, and even when it does, reclaiming a few pages from per-vCPU
caches likely won't make much of a dent in the VM's total memory usage,
especially for VMs with huge amounts of memory.
Lastly, if it turns out that there is a strong use case for dropping the
per-vCPU caches, re-introducing the shrinker registration is trivial
compared to the complexity of actually reclaiming pages from the caches.
[1] https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com
[2] https://lore.kernel.org/kvm/20241004195540.210396-3-vipinsh@google.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com
[sean: keep zapped_obsolete_pages for now, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN and bail out of recover_huge_pages_range() if dirty logging is
enabled. KVM shouldn't be recovering huge pages during dirty logging
anyway, since KVM needs to track writes at 4KiB granularity. However, it's
not outside the realm of possibility that this changes in the future.
If KVM wants to recover huge pages during dirty logging, make_huge_spte()
must be updated to write-protect the new huge page mapping. Otherwise,
writes through the newly recovered huge page mapping will not be tracked.
Note that this potential risk did not exist back when KVM zapped to
recover huge page mappings, since subsequent accesses would just be
faulted in at PG_LEVEL_4K if dirty logging was enabled.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-7-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename make_huge_page_split_spte() to make_small_spte(). This ensures
that the usage of "small_spte" and "huge_spte" are consistent between
make_huge_spte() and make_small_spte().
This should also reduce some confusion as make_huge_page_split_spte()
almost reads like it will create a huge SPTE, when in fact it is
creating a small SPTE to split the huge SPTE.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-6-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Recover TDP MMU huge page mappings in-place instead of zapping them when
dirty logging is disabled, and rename functions that recover huge page
mappings when dirty logging is disabled to move away from the "zap
collapsible spte" terminology.
Before KVM flushes TLBs, guest accesses may be translated through either
the (stale) small SPTE or the (new) huge SPTE. This is already possible
when KVM is doing eager page splitting (where TLB flushes are also
batched), and when vCPUs are faulting in huge mappings (where TLBs are
flushed after the new huge SPTE is installed).
Recovering huge pages reduces the number of page faults when dirty
logging is disabled:
$ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g
Before: 393,599 kvm:kvm_page_fault
After: 262,575 kvm:kvm_page_fault
vCPU throughput and the latency of disabling dirty-logging are about
equal compared to zapping, but avoiding faults can be beneficial to
remove vCPU jitter in extreme scenarios.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-5-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Refactor the TDP MMU iterator "need resched" checks into a helper
function so they can be called from a different code path in a
subsequent commit.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-4-dmatlack@google.com
[sean: rebase on a swapped order of checks]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Convert the WARN in tdp_mmu_iter_cond_resched() that the iterator hasn't
already yielded to a KVM_MMU_WARN_ON() so the code is compiled out for
production kernels (assuming production kernels disable KVM_PROVE_MMU).
Checking for a needed reschedule is a hot path, and KVM sanity checks
iter->yielded in several other less-hot paths, i.e. the odds of KVM not
flagging that something went sideways are quite low. Furthermore, the
odds of KVM not noticing *and* the WARN detecting something worth
investigating are even lower.
Link: https://lore.kernel.org/r/20241031170633.1502783-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Swap the order of the checks in tdp_mmu_iter_cond_resched() so that KVM
checks to see if a resched is needed _before_ checking to see if yielding
must be disallowed to guarantee forward progress. Iterating over TDP MMU
SPTEs is a hot path, e.g. tearing down a root can touch millions of SPTEs,
and not needing to reschedule is by far the common case. On the other
hand, disallowing yielding because forward progress has not been made is a
very rare case.
Returning early for the common case (no resched), effectively reduces the
number of checks from 2 to 1 for the common case, and should make the code
slightly more predictable for the CPU.
To resolve a weird conundrum where the forward progress check currently
returns false, but the need resched check subtly returns iter->yielded,
which _should_ be false (enforced by a WARN), return false unconditionally
(which might also help make the sequence more predictable). If KVM has a
bug where iter->yielded is left dangling, continuing to yield is neither
right nor wrong, it was simply an artifact of how the original code was
written.
Unconditionally returning false when yielding is unnecessary or unwanted
will also allow extracting the "should resched" logic to a separate helper
in a future patch.
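After the swap, the pair of checks is shaped roughly like this (sketch):

    /* Return early for the common case: no resched needed, no lock contention. */
    if (!need_resched() && !rwlock_needbreak(&kvm->mmu_lock))
        return false;

    /* Don't yield (and don't return iter->yielded) if no forward progress was made. */
    if (iter->next_last_level_gfn == iter->yielded_gfn)
        return false;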
Cc: David Matlack <dmatlack@google.com>
Reviewed-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20241031170633.1502783-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
As a result of a recent investigation, it was determined that x86 CPUs
which support 5-level paging don't always respect CR4.LA57 when doing
canonical checks.
In particular:
1. MSRs which contain a linear address allow a full 57-bit canonical address
regardless of CR4.LA57 state. For example: MSR_KERNEL_GS_BASE.
2. All hidden segment bases and GDT/IDT bases also behave like MSRs.
This means that a full 57-bit canonical address can be loaded to them
regardless of CR4.LA57, both using MSRs (e.g. GS_BASE) and instructions
(e.g. LGDT).
3. TLB invalidation instructions also allow the user to use a full 57-bit
address regardless of CR4.LA57.
Finally, it must be noted that the CPU doesn't prevent the user from
disabling 5-level paging, even when the full 57-bit canonical address is
present in one of the registers mentioned above (e.g. GDT base).
In fact, this can happen without any userspace help when the CPU enters
SMM mode - some MSRs, for example MSR_KERNEL_GS_BASE, are left containing
a non-canonical address with regard to the new mode.
Since most of the affected MSRs and all segment bases can be read and
written freely by the guest without any KVM intervention, this patch makes
the emulator closely follow hardware behavior, which means that the
emulator doesn't take in the account the guest CPUID support for 5-level
paging, and only takes in the account the host CPU support.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-4-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Set SPTEs directly to SHADOW_NONPRESENT_VALUE and batch up TLB flushes
when zapping collapsible SPTEs, rather than freezing them first.
Freezing the SPTE first is not required. It is fine for another thread
holding mmu_lock for read to immediately install a present entry before
TLBs are flushed because the underlying mapping is not changing. vCPUs
that translate through the stale 4K mappings or a new huge page mapping
will still observe the same GPA->HPA translations.
KVM must only flush TLBs before dropping RCU (to avoid use-after-free of
the zapped page tables) and before dropping mmu_lock (to synchronize
with mmu_notifiers invalidating mappings).
In VMs backed with 2MiB pages, batching TLB flushes improves the time it
takes to zap collapsible SPTEs to disable dirty logging:
$ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g
Before: Disabling dirty logging time: 14.334453428s (131072 flushes)
After: Disabling dirty logging time: 4.794969689s (76 flushes)
Skipping freezing SPTEs also avoids stalling vCPU threads on the frozen
SPTE for the time it takes to perform a remote TLB flush. vCPUs faulting
on the zapped mapping can now immediately install a new huge mapping and
proceed with guest execution.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-3-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the @max_level parameter from kvm_mmu_max_mapping_level(). All
callers pass in PG_LEVEL_NUM, so @max_level can be replaced with
PG_LEVEL_NUM in the function body.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Follow x86's primary MMU, which hasn't flushed TLBs when clearing Accessed
bits for 10+ years, and skip all TLB flushes when aging SPTEs in response
to a clear_flush_young() mmu_notifier event. As documented in x86's
ptep_clear_flush_young(), the probability and impact of "bad" reclaim due
to stale A-bit information is relatively low, whereas the performance cost
of TLB flushes is relatively high. I.e. the cost of flushing TLBs
outweighs the benefits.
On KVM x86, the cost of TLB flushes is even higher, as KVM doesn't batch
TLB flushes for mmu_notifier events (KVM's mmu_notifier contract with MM
makes it all but impossible), and sending IPIs forces all running vCPUs to
go through a VM-Exit => VM-Enter roundtrip.
Furthermore, MGLRU aging of secondary MMUs is expected to use flush-less
mmu_notifiers, i.e. flushing for the !MGLRU case will make even less sense, and
will be actively confusing as it wouldn't be clear why KVM "needs" to
flush TLBs for legacy LRU aging, but not for MGLRU aging.
Cc: James Houghton <jthoughton@google.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/20240926013506.860253-18-jthoughton@google.com
Link: https://lore.kernel.org/r/20241011021051.1557902-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When making a SPTE, set the Dirty bit in the SPTE as appropriate, even if
hardware A/D bits are disabled. Only EPT allows A/D bits to be disabled,
and for EPT, the bits are software-available (ignored by hardware) when
A/D bits are disabled, i.e. it is perfectly legal for KVM to use the Dirty
bit to track dirty pages in software.
Link: https://lore.kernel.org/r/20241011021051.1557902-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that the shadow MMU and TDP MMU have identical logic for detecting
required TLB flushes when updating SPTEs, move said logic to a helper so
that the TDP MMU code can benefit from the comments that are currently
exclusive to the shadow MMU.
No functional change intended.
Link: https://lore.kernel.org/r/20241011021051.1557902-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Return immediately if a young SPTE is found when testing, but not updating,
SPTEs. The return value is a boolean, i.e. whether there is one young SPTE
or fifty is irrelevant (ignoring the fact that it's impossible for there to
be fifty SPTEs, as KVM has a hard limit on the number of valid TDP MMU
roots).
Link: https://lore.kernel.org/r/20241011021051.1557902-15-seanjc@google.com
[sean: use guard(rcu)(), as suggested by Paolo]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Skip invalid TDP MMU roots when aging a gfn range. There is zero reason
to process invalid roots, as they by definition hold stale information.
E.g. if a root is invalid because its from a previous memslot generation,
in the unlikely event the root has a SPTE for the gfn, then odds are good
that the gfn=>hva mapping is different, i.e. doesn't map to the hva that
is being aged by the primary MMU.
Link: https://lore.kernel.org/r/20241011021051.1557902-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use the Accessed bit in SPTEs even when A/D bits are disabled in hardware,
i.e. propagate accessed information to SPTE.Accessed even when KVM is
doing manual tracking by making SPTEs not-present. In addition to
eliminating a small amount of code in is_accessed_spte(), this also paves
the way for preserving Accessed information when a SPTE is zapped in
response to a mmu_notifier PROTECTION event, e.g. if a SPTE is zapped
because NUMA balancing kicks in.
Note, EPT is the only flavor of paging in which A/D bits are conditionally
enabled, and the Accessed (and Dirty) bit is software-available when A/D
bits are disabled.
Note #2, there are currently no concrete plans to preserve Accessed
information. Explorations on that front were the initial catalyst, but
the cleanup is the motivation for the actual commit.
Link: https://lore.kernel.org/r/20241011021051.1557902-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Set shadow_dirty_mask to the architectural EPT Dirty bit value even if
A/D bits are disabled at the module level, i.e. even if KVM will never
enable A/D bits in hardware. Doing so provides consistent behavior for
Accessed and Dirty bits, i.e. doesn't leave KVM in a state where it sets
shadow_accessed_mask but not shadow_dirty_mask.
Functionally, this should be one big nop, as consumption of
shadow_dirty_mask is always guarded by a check that hardware A/D bits are
enabled.
Link: https://lore.kernel.org/r/20241011021051.1557902-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that KVM doesn't use shadow_accessed_mask to detect if hardware A/D
bits are enabled, set shadow_accessed_mask for EPT even when A/D bits
are disabled in hardware. This will allow using shadow_accessed_mask for
software purposes, e.g. to preserve accessed status in a non-present SPTE
across NUMA balancing, if something like that is ever desirable.
Link: https://lore.kernel.org/r/20241011021051.1557902-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a dedicated flag to track if KVM has enabled A/D bits at the module
level, instead of inferring the state based on whether or not the MMU's
shadow_accessed_mask is non-zero. This will allow defining and using
shadow_accessed_mask even when A/D bits aren't used by hardware.
Link: https://lore.kernel.org/r/20241011021051.1557902-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Do a remote TLB flush if installing a leaf SPTE overwrites an existing
leaf SPTE (with the same target pfn, which is enforced by a BUG() in
handle_changed_spte()) and clears the MMU-Writable bit. Since the TDP MMU
passes ACC_ALL to make_spte(), i.e. always requests a Writable SPTE, the
only scenario in which make_spte() should create a !MMU-Writable SPTE is
if the gfn is write-tracked or if KVM is prefetching a SPTE.
When write-protecting for write-tracking, KVM must hold mmu_lock for write,
i.e. can't race with a vCPU faulting in the SPTE. And when prefetching a
SPTE, the TDP MMU takes care to avoid clobbering a shadow-present SPTE,
i.e. it should be impossible to replace a MMU-writable SPTE with a
!MMU-writable SPTE when handling a TDP MMU fault.
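The added check amounts to roughly the following (a sketch of the condition
in the TDP MMU's SPTE-change handling):

    if (is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level) &&
        is_mmu_writable_spte(old_spte) && !is_mmu_writable_spte(new_spte))
            kvm_flush_remote_tlbs_gfn(kvm, gfn, level);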
Cc: David Matlack <dmatlack@google.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20241011021051.1557902-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|