Age | Commit message | Author | Files | Lines
2024-10-31 | KVM: x86/mmu: Batch TLB flushes when zapping collapsible TDP MMU SPTEs | David Matlack | 1 | -45/+10
Set SPTEs directly to SHADOW_NONPRESENT_VALUE and batch up TLB flushes when zapping collapsible SPTEs, rather than freezing them first. Freezing the SPTE first is not required. It is fine for another thread holding mmu_lock for read to immediately install a present entry before TLBs are flushed because the underlying mapping is not changing. vCPUs that translate through the stale 4K mappings or a new huge page mapping will still observe the same GPA->HPA translations. KVM must only flush TLBs before dropping RCU (to avoid use-after-free of the zapped page tables) and before dropping mmu_lock (to synchronize with mmu_notifiers invalidating mappings). In VMs backed with 2MiB pages, batching TLB flushes improves the time it takes to zap collapsible SPTEs to disable dirty logging: $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g Before: Disabling dirty logging time: 14.334453428s (131072 flushes) After: Disabling dirty logging time: 4.794969689s (76 flushes) Skipping freezing SPTEs also avoids stalling vCPU threads on the frozen SPTE for the time it takes to perform a remote TLB flush. vCPUs faulting on the zapped mapping can now immediately install a new huge mapping and proceed with guest execution. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-3-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
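For illustration, a minimal user-space sketch of the batching pattern described above; the SPTE array, the sentinel value, and the helper names are simplified stand-ins, not KVM's actual types or functions.

/* Simplified model of batching TLB flushes while zapping SPTEs.
 * The SPTE layout, helper names, and flush hook are stand-ins. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SHADOW_NONPRESENT_VALUE 0ULL   /* placeholder sentinel */

static unsigned long nr_flushes;

static void flush_remote_tlbs(void)
{
    nr_flushes++;                      /* stand-in for the real IPI-based flush */
}

/* Zap every present SPTE in @sptes, flushing TLBs once at the end
 * instead of freezing each SPTE and flushing per entry. */
static void zap_collapsible_sptes(uint64_t *sptes, size_t n)
{
    bool flush = false;

    for (size_t i = 0; i < n; i++) {
        if (sptes[i] == SHADOW_NONPRESENT_VALUE)
            continue;
        sptes[i] = SHADOW_NONPRESENT_VALUE;
        flush = true;
    }

    if (flush)
        flush_remote_tlbs();           /* one batched flush for the whole range */
}

int main(void)
{
    uint64_t sptes[512];

    for (size_t i = 0; i < 512; i++)
        sptes[i] = 0x1000 + i;         /* pretend-present entries */

    zap_collapsible_sptes(sptes, 512);
    printf("flushes: %lu\n", nr_flushes);
    return 0;
}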
2024-10-31 | KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level() | David Matlack | 3 | -9/+5
Drop the @max_level parameter from kvm_mmu_max_mapping_level(). All callers pass in PG_LEVEL_NUM, so @max_level can be replaced with PG_LEVEL_NUM in the function body. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: x86: Don't emit TLB flushes when aging SPTEs for mmu_notifiers | Sean Christopherson | 2 | -3/+3
Follow x86's primary MMU, which hasn't flushed TLBs when clearing Accessed bits for 10+ years, and skip all TLB flushes when aging SPTEs in response to a clear_flush_young() mmu_notifier event. As documented in x86's ptep_clear_flush_young(), the probability and impact of "bad" reclaim due to stale A-bit information is relatively low, whereas the performance cost of TLB flushes is relatively high. I.e. the cost of flushing TLBs outweighs the benefits. On KVM x86, the cost of TLB flushes is even higher, as KVM doesn't batch TLB flushes for mmu_notifier events (KVM's mmu_notifier contract with MM makes it all but impossible), and sending IPIs forces all running vCPUs to go through a VM-Exit => VM-Enter roundtrip. Furthermore, MGLRU aging of secondary MMUs is expected to use flush-less mmu_notifiers, i.e. flushing for the !MGLRU case will make even less sense, and will be actively confusing as it wouldn't be clear why KVM "needs" to flush TLBs for legacy LRU aging, but not for MGLRU aging. Cc: James Houghton <jthoughton@google.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/all/20240926013506.860253-18-jthoughton@google.com Link: https://lore.kernel.org/r/20241011021051.1557902-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: Allow arch code to elide TLB flushes when aging a young page | Sean Christopherson | 2 | -14/+10
Add a Kconfig to allow architectures to opt out of a TLB flush when a young page is aged, as invalidating TLB entries is not functionally required on most KVM-supported architectures. Stale TLB entries can result in false negatives and theoretically lead to suboptimal reclaim, but in practice all observations have been that the performance gained by skipping TLB flushes outweighs any performance lost by reclaiming hot pages. E.g. the primary MMUs for x86, RISC-V, s390, and PPC Book3S elide the TLB flush for ptep_clear_flush_young(), and arm64's MMU skips the trailing DSB that's required for ordering (presumably because there are optimizations related to eliding other TLB flushes when doing make-before-break). Link: https://lore.kernel.org/r/20241011021051.1557902-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
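For illustration, a user-space sketch of the idea: aging clears the Accessed bit and only flushes if a compile-time knob (standing in for the per-arch Kconfig opt-out) demands it. The names and bit positions are assumptions, not KVM's.

/* Sketch of eliding the TLB flush when aging a page, gated by a
 * compile-time knob that models the per-arch Kconfig opt-out. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SPTE_ACCESSED_BIT  (1ULL << 8)   /* illustrative bit position */

/* Define to 1 for architectures that still require a flush when aging. */
#ifndef ARCH_FLUSH_TLB_ON_AGE
#define ARCH_FLUSH_TLB_ON_AGE 0
#endif

static void flush_remote_tlbs(void)
{
    puts("TLB flush");
}

/* Clear the Accessed bit and report whether the page was young.  A
 * stale TLB entry only risks a false-negative age report, so the
 * flush is skipped unless the architecture opted in. */
static bool age_spte(uint64_t *spte)
{
    bool young = *spte & SPTE_ACCESSED_BIT;

    if (young) {
        *spte &= ~SPTE_ACCESSED_BIT;
        if (ARCH_FLUSH_TLB_ON_AGE)
            flush_remote_tlbs();
    }
    return young;
}

int main(void)
{
    uint64_t spte = SPTE_ACCESSED_BIT | 0x1000;

    printf("young: %d\n", age_spte(&spte));   /* 1, no flush */
    printf("young: %d\n", age_spte(&spte));   /* 0 */
    return 0;
}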
2024-10-31 | KVM: x86/mmu: Set Dirty bit for new SPTEs, even if _hardware_ A/D bits are disabled | Sean Christopherson | 2 | -7/+1
When making a SPTE, set the Dirty bit in the SPTE as appropriate, even if hardware A/D bits are disabled. Only EPT allows A/D bits to be disabled, and for EPT, the bits are software-available (ignored by hardware) when A/D bits are disabled, i.e. it is perfectly legal for KVM to use the Dirty bit to track dirty pages in software. Link: https://lore.kernel.org/r/20241011021051.1557902-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: x86/mmu: Dedup logic for detecting TLB flushes on leaf SPTE changes | Sean Christopherson | 3 | -20/+30
Now that the shadow MMU and TDP MMU have identical logic for detecting required TLB flushes when updating SPTEs, move said logic to a helper so that the TDP MMU code can benefit from the comments that are currently exclusive to the shadow MMU. No functional change intended. Link: https://lore.kernel.org/r/20241011021051.1557902-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
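A minimal sketch of the "shared helper" shape, usable from both MMU paths. The bit layout and the specific flush condition shown (a present SPTE losing MMU-writability) are illustrative assumptions, not the kernel's exact rules.

/* Sketch of a shared "does this leaf SPTE change require a TLB flush?"
 * helper; bit positions and conditions are simplified for illustration. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SPTE_PRESENT       (1ULL << 0)
#define SPTE_MMU_WRITABLE  (1ULL << 1)

static bool is_present(uint64_t spte)      { return spte & SPTE_PRESENT; }
static bool is_mmu_writable(uint64_t spte) { return spte & SPTE_MMU_WRITABLE; }

/* One helper callable from both the shadow MMU and TDP MMU update
 * paths, so the explanatory comments live in exactly one place. */
static bool leaf_spte_change_needs_flush(uint64_t old_spte, uint64_t new_spte)
{
    if (!is_present(old_spte))
        return false;

    /* Illustrative condition: flush when write access is being removed. */
    return is_mmu_writable(old_spte) && !is_mmu_writable(new_spte);
}

int main(void)
{
    uint64_t old = SPTE_PRESENT | SPTE_MMU_WRITABLE;
    uint64_t new = SPTE_PRESENT;

    printf("flush: %d\n", leaf_spte_change_needs_flush(old, new));  /* 1 */
    printf("flush: %d\n", leaf_spte_change_needs_flush(new, old));  /* 0 */
    return 0;
}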
2024-10-31 | KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE found | Sean Christopherson | 1 | -42/+33
Return immediately if a young SPTE is found when testing, but not updating, SPTEs. The return value is a boolean, i.e. whether there is one young SPTE or fifty is irrelevant (ignoring the fact that it's impossible for there to be fifty SPTEs, as KVM has a hard limit on the number of valid TDP MMU roots). Link: https://lore.kernel.org/r/20241011021051.1557902-15-seanjc@google.com [sean: use guard(rcu)(), as suggested by Paolo] Signed-off-by: Sean Christopherson <seanjc@google.com>
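A sketch of the early-return shape, with the root/SPTE walk reduced to plain arrays; the structures and names are stand-ins for the TDP MMU walk, not the real code.

/* Sketch of "stop as soon as one young SPTE is found" when only
 * testing (not clearing) the Accessed bit. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SPTE_ACCESSED_BIT  (1ULL << 8)   /* illustrative */

struct root {
    const uint64_t *sptes;
    size_t nr;
};

/* The caller only needs a yes/no answer, so return on the first young
 * SPTE; walking the remaining roots would be wasted work. */
static bool test_age_gfn_range(const struct root *roots, size_t nr_roots)
{
    for (size_t r = 0; r < nr_roots; r++)
        for (size_t i = 0; i < roots[r].nr; i++)
            if (roots[r].sptes[i] & SPTE_ACCESSED_BIT)
                return true;
    return false;
}

int main(void)
{
    uint64_t a[] = { 0x1000, 0x2000 | SPTE_ACCESSED_BIT };
    uint64_t b[] = { 0x3000 };
    struct root roots[] = { { a, 2 }, { b, 1 } };

    printf("young: %d\n", test_age_gfn_range(roots, 2));
    return 0;
}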
2024-10-31 | KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn range | Sean Christopherson | 1 | -2/+4
Skip invalid TDP MMU roots when aging a gfn range. There is zero reason to process invalid roots, as they by definition hold stale information. E.g. if a root is invalid because it's from a previous memslot generation, in the unlikely event the root has a SPTE for the gfn, then odds are good that the gfn=>hva mapping is different, i.e. doesn't map to the hva that is being aged by the primary MMU. Link: https://lore.kernel.org/r/20241011021051.1557902-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabled | Sean Christopherson | 3 | -13/+5
Use the Accessed bit in SPTEs even when A/D bits are disabled in hardware, i.e. propagate accessed information to SPTE.Accessed even when KVM is doing manual tracking by making SPTEs not-present. In addition to eliminating a small amount of code in is_accessed_spte(), this also paves the way for preserving Accessed information when a SPTE is zapped in response to a mmu_notifier PROTECTION event, e.g. if a SPTE is zapped because NUMA balancing kicks in. Note, EPT is the only flavor of paging in which A/D bits are conditionally enabled, and the Accessed (and Dirty) bit is software-available when A/D bits are disabled. Note #2, there are currently no concrete plans to preserve Accessed information. Explorations on that front were the initial catalyst, but the cleanup is the motivation for the actual commit. Link: https://lore.kernel.org/r/20241011021051.1557902-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: x86/mmu: Set shadow_dirty_mask for EPT even if A/D bits disabled | Sean Christopherson | 1 | -1/+1
Set shadow_dirty_mask to the architectural EPT Dirty bit value even if A/D bits are disabled at the module level, i.e. even if KVM will never enable A/D bits in hardware. Doing so provides consistent behavior for Accessed and Dirty bits, i.e. doesn't leave KVM in a state where it sets shadow_accessed_mask but not shadow_dirty_mask. Functionally, this should be one big nop, as consumption of shadow_dirty_mask is always guarded by a check that hardware A/D bits are enabled. Link: https://lore.kernel.org/r/20241011021051.1557902-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits disabled | Sean Christopherson | 1 | -1/+1
Now that KVM doesn't use shadow_accessed_mask to detect if hardware A/D bits are enabled, set shadow_accessed_mask for EPT even when A/D bits are disabled in hardware. This will allow using shadow_accessed_mask for software purposes, e.g. to preserve accessed status in a non-present SPTE across NUMA balancing, if something like that is ever desirable. Link: https://lore.kernel.org/r/20241011021051.1557902-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally enabled | Sean Christopherson | 4 | -16/+20
Add a dedicated flag to track if KVM has enabled A/D bits at the module level, instead of inferring the state based on whether or not the MMU's shadow_accessed_mask is non-zero. This will allow defining and using shadow_accessed_mask even when A/D bits aren't used by hardware. Link: https://lore.kernel.org/r/20241011021051.1557902-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
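A sketch of the distinction: the mask stays defined for software use while an explicit flag answers "are hardware A/D bits on?". Variable names, the bit position, and the init hook are assumptions for illustration.

/* Sketch of tracking A/D enablement with a dedicated flag instead of
 * inferring it from a non-zero accessed mask. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t shadow_accessed_mask;
static bool     have_ad_bits;            /* the new, explicit flag */

static void mmu_spte_module_init(bool enable_ad_bits)
{
    /* The mask is always defined so software can use the bit ... */
    shadow_accessed_mask = 1ULL << 8;    /* illustrative bit position */
    /* ... while hardware usage is tracked separately. */
    have_ad_bits = enable_ad_bits;
}

static bool ad_bits_enabled(void)
{
    return have_ad_bits;                 /* no longer "mask != 0" */
}

int main(void)
{
    mmu_spte_module_init(false);
    printf("mask=%#llx ad_enabled=%d\n",
           (unsigned long long)shadow_accessed_mask, ad_bits_enabled());
    return 0;
}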
2024-10-31 | KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable | Sean Christopherson | 1 | -1/+3
Do a remote TLB flush if installing a leaf SPTE overwrites an existing leaf SPTE (with the same target pfn, which is enforced by a BUG() in handle_changed_spte()) and clears the MMU-Writable bit. Since the TDP MMU passes ACC_ALL to make_spte(), i.e. always requests a Writable SPTE, the only scenario in which make_spte() should create a !MMU-Writable SPTE is if the gfn is write-tracked or if KVM is prefetching a SPTE. When write-protecting for write-tracking, KVM must hold mmu_lock for write, i.e. can't race with a vCPU faulting in the SPTE. And when prefetching a SPTE, the TDP MMU takes care to avoid clobbering a shadow-present SPTE, i.e. it should be impossible to replace a MMU-writable SPTE with a !MMU-writable SPTE when handling a TDP MMU fault. Cc: David Matlack <dmatlack@google.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20241011021051.1557902-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: x86/mmu: Fold mmu_spte_update_no_track() into mmu_spte_update() | Sean Christopherson | 1 | -29/+21
Fold the guts of mmu_spte_update_no_track() into mmu_spte_update() now that the latter doesn't flush when clearing A/D bits, i.e. now that there is no need to explicitly avoid TLB flushes when aging SPTEs. Opportunistically WARN if mmu_spte_update() requests a TLB flush when aging SPTEs, as aging should never modify a SPTE in such a way that KVM thinks a TLB flush is needed. Link: https://lore.kernel.org/r/20241011021051.1557902-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
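A sketch of the folded shape: a single update helper reports whether a flush is needed, and the aging caller asserts that it never does (the assert models the WARN). The update rule shown is a simplified assumption.

/* Sketch of one update helper plus an aging caller that asserts no
 * flush is ever requested when only the Accessed bit is cleared. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SPTE_ACCESSED_BIT   (1ULL << 8)   /* illustrative */
#define SPTE_WRITABLE_BIT   (1ULL << 1)

/* Returns true if the change requires a TLB flush (illustrative rule:
 * only removing write access does). */
static bool spte_update(uint64_t *spte, uint64_t new_spte)
{
    bool flush = (*spte & SPTE_WRITABLE_BIT) && !(new_spte & SPTE_WRITABLE_BIT);

    *spte = new_spte;
    return flush;
}

static bool age_spte(uint64_t *spte)
{
    bool young = *spte & SPTE_ACCESSED_BIT;
    bool flush = spte_update(spte, *spte & ~SPTE_ACCESSED_BIT);

    /* Aging only clears the Accessed bit, which must never require a
     * TLB flush; warn (modeled here with assert) if it ever does. */
    assert(!flush);
    (void)flush;
    return young;
}

int main(void)
{
    uint64_t spte = SPTE_ACCESSED_BIT | SPTE_WRITABLE_BIT;

    printf("young: %d\n", age_spte(&spte));
    return 0;
}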
2024-10-31 | KVM: x86/mmu: Drop ignored return value from kvm_tdp_mmu_clear_dirty_slot() | Sean Christopherson | 2 | -15/+7
Drop the return value from kvm_tdp_mmu_clear_dirty_slot() as its sole caller ignores the result (KVM flushes after clearing dirty logs based on the logs themselves, not based on SPTEs). Cc: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20241011021051.1557902-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: x86/mmu: Don't flush TLBs when clearing Dirty bit in shadow MMU | Sean Christopherson | 2 | -13/+8
Don't force a TLB flush when an SPTE update in the shadow MMU happens to clear the Dirty bit, as KVM unconditionally flushes TLBs when enabling dirty logging, and when clearing dirty logs, KVM flushes based on its software structures, not the SPTEs. I.e. the flows that care about accurate Dirty bit information already ensure there are no stale TLB entries. Opportunistically drop is_dirty_spte() as mmu_spte_update() was the sole caller. Link: https://lore.kernel.org/r/20241011021051.1557902-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: x86/mmu: Don't force flush if SPTE update clears Accessed bit | Sean Christopherson | 1 | -21/+9
Don't force a TLB flush if mmu_spte_update() clears the Accessed bit, as access tracking tolerates false negatives, as evidenced by the mmu_notifier hooks that explicitly test and age SPTEs without doing a TLB flush. In practice, this is very nearly a nop. spte_write_protect() and spte_clear_dirty() never clear the Accessed bit. make_spte() always sets the Accessed bit for !prefetch scenarios. FNAME(sync_spte) only sets the SPTE if the protection bits are changing, i.e. if a flush will be needed regardless of the Accessed bits. And FNAME(pte_prefetch) sets the SPTE if and only if the old SPTE is !PRESENT. That leaves kvm_arch_async_page_ready() as the one path that will generate a !ACCESSED SPTE *and* overwrite a PRESENT SPTE. And that's very arguably a bug, as clobbering a valid SPTE in that case is nonsensical. Tested-by: Alex Bennée <alex.bennee@linaro.org> Link: https://lore.kernel.org/r/20241011021051.1557902-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: x86/mmu: Fold all of make_spte()'s writable handling into one if-else | Sean Christopherson | 1 | -9/+4
Now that make_spte() no longer uses a funky goto to bail out for a special case of its unsync handling, combine all of the unsync vs. writable logic into a single if-else statement. No functional change intended. Link: https://lore.kernel.org/r/20241011021051.1557902-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-31 | KVM: x86/mmu: Always set SPTE's dirty bit if it's created as writable | Sean Christopherson | 1 | -19/+9
When creating a SPTE, always set the Dirty bit if the Writable bit is set, i.e. if KVM is creating a writable mapping. If two (or more) vCPUs are racing to install a writable SPTE on a !PRESENT fault, only the "winning" vCPU will create a SPTE with W=1 and D=1, all "losers" will generate a SPTE with W=1 && D=0. As a result, tdp_mmu_map_handle_target_level() will fail to detect that the losing faults are effectively spurious, and will overwrite the D=1 SPTE with a D=0 SPTE. For normal VMs, overwriting a present SPTE is a small performance blip; KVM blasts a remote TLB flush, but otherwise life goes on. For upcoming TDX VMs, overwriting a present SPTE is much more costly, and can even lead to the VM being terminated if KVM isn't careful, e.g. if KVM attempts TDH.MEM.PAGE.AUG because the TDX code doesn't detect that the new SPTE is actually the same as the old SPTE (which would be a bug in its own right). Suggested-by: Sagi Shahar <sagis@google.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20241011021051.1557902-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
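A sketch of why W=1 implying D=1 makes the racing fault spurious: both racing vCPUs now compute byte-identical SPTEs, so the loser can detect that and do nothing. Bit positions and helper names are illustrative stand-ins.

/* Sketch: setting Dirty whenever Writable is set makes the losing
 * vCPU's fault trivially detectable as spurious. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SPTE_PRESENT   (1ULL << 0)
#define SPTE_WRITABLE  (1ULL << 1)
#define SPTE_DIRTY     (1ULL << 9)

static uint64_t make_spte(uint64_t pfn_bits, bool writable)
{
    uint64_t spte = pfn_bits | SPTE_PRESENT;

    if (writable)
        spte |= SPTE_WRITABLE | SPTE_DIRTY;   /* W=1 always implies D=1 */
    return spte;
}

/* The fault is spurious if the desired SPTE is already installed. */
static bool try_install(uint64_t *slot, uint64_t new_spte)
{
    if (*slot == new_spte)
        return false;          /* spurious: nothing to do, no flush */
    *slot = new_spte;
    return true;
}

int main(void)
{
    uint64_t slot = 0;
    uint64_t spte = make_spte(0x1000, true);

    printf("installed: %d\n", try_install(&slot, spte));  /* winner */
    printf("installed: %d\n", try_install(&slot, spte));  /* loser: spurious */
    return 0;
}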
2024-10-31 | KVM: x86/mmu: Flush remote TLBs iff MMU-writable flag is cleared from RO SPTE | Sean Christopherson | 2 | -11/+7
Don't force a remote TLB flush if KVM happens to effectively "refresh" a read-only SPTE that is still MMU-Writable, as KVM allows MMU-Writable SPTEs to have Writable TLB entries, even if the SPTE is !Writable. Remote TLBs need to be flushed only when creating a read-only SPTE for write-tracking, i.e. when installing a !MMU-Writable SPTE. In practice, especially now that KVM doesn't overwrite existing SPTEs when prefetching, KVM will rarely "refresh" a read-only, MMU-Writable SPTE, i.e. this is unlikely to eliminate many, if any, TLB flushes. But the more precise flushing makes it easier to understand exactly when KVM does and doesn't need to flush. Note, x86 architecturally requires relevant TLB entries to be invalidated on a page fault, i.e. there is no risk of putting a vCPU into an infinite loop of read-only page faults. Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20241011021051.1557902-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
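A sketch of the narrowed flush condition: only a transition from MMU-Writable to !MMU-Writable requires a remote flush, since writable TLB entries are tolerated as long as MMU-Writable is set. Bit positions are illustrative.

/* Sketch: flush iff the MMU-Writable flag is being cleared. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SPTE_WRITABLE      (1ULL << 1)
#define SPTE_MMU_WRITABLE  (1ULL << 5)

static bool needs_remote_flush(uint64_t old_spte, uint64_t new_spte)
{
    /* Dropping only the Writable bit while MMU-Writable stays set is
     * tolerated: the fault path fixes it up without a stale-TLB risk. */
    return (old_spte & SPTE_MMU_WRITABLE) && !(new_spte & SPTE_MMU_WRITABLE);
}

int main(void)
{
    uint64_t rw = SPTE_WRITABLE | SPTE_MMU_WRITABLE;
    uint64_t ro = SPTE_MMU_WRITABLE;             /* read-only, still MMU-Writable */
    uint64_t wt = 0;                             /* write-tracked */

    printf("refresh RO SPTE:  flush=%d\n", needs_remote_flush(ro, ro));  /* 0 */
    printf("write-track SPTE: flush=%d\n", needs_remote_flush(rw, wt));  /* 1 */
    return 0;
}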
2024-10-25 | Merge branch 'kvm-no-struct-page' into HEAD | Paolo Bonzini | 38 | -845/+679
TL;DR: Eliminate KVM's long-standing (and heinous) behavior of essentially guessing which pfns are refcounted pages (see kvm_pfn_to_refcounted_page()). Getting there requires "fixing" arch code that isn't obviously broken. Specifically, to get rid of kvm_pfn_to_refcounted_page(), KVM needs to stop marking pages/folios dirty/accessed based solely on the pfn that's stored in KVM's stage-2 page tables. Instead of tracking which SPTEs correspond to refcounted pages, simply remove all of the code that operates on "struct page" based on the pfn in stage-2 PTEs. This is the back ~40-50% of the series. For x86 in particular, which sets accessed/dirty status when that info would be "lost", e.g. when SPTEs are zapped or KVM clears the dirty flag in a SPTE, foregoing the updates provides very measurable performance improvements for related operations. E.g. when clearing dirty bits as part of dirty logging, and zapping SPTEs to reconstitute huge pages when disabling dirty logging. The front ~40% of the series is cleanups and prep work, and most of it is x86 focused (purely because x86 added the most special cases, *sigh*). E.g. several of the inputs to hva_to_pfn() (and its myriad wrappers) can be removed by cleaning up and deduplicating x86 code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-25 | KVM: Don't grab reference on VM_MIXEDMAP pfns that have a "struct page" | Sean Christopherson | 2 | -76/+2
Now that KVM no longer relies on an ugly heuristic to find its struct page references, i.e. now that KVM can't get false positives on VM_MIXEDMAP pfns, remove KVM's hack to elevate the refcount for pfns that happen to have a valid struct page. In addition to removing a long-standing wart in KVM, this allows KVM to map non-refcounted struct page memory into the guest, e.g. for exposing GPU TTM buffers to KVM guests. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-86-seanjc@google.com>
2024-10-25 | KVM: Drop APIs that manipulate "struct page" via pfns | Sean Christopherson | 2 | -60/+0
Remove all kvm_{release,set}_pfn_*() APIs now that all users are gone. No functional change intended. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-85-seanjc@google.com>
2024-10-25 | KVM: arm64: Don't mark "struct page" accessed when making SPTE young | Sean Christopherson | 3 | -13/+4
Don't mark pages/folios as accessed in the primary MMU when making a SPTE young in KVM's secondary MMU, as doing so relies on kvm_pfn_to_refcounted_page(), and generally speaking is unnecessary and wasteful. KVM participates in page aging via mmu_notifiers, so there's no need to push "accessed" updates to the primary MMU. Dropping use of kvm_set_pfn_accessed() also paves the way for removing kvm_pfn_to_refcounted_page() and all its users. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-84-seanjc@google.com>
2024-10-25 | KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs | Sean Christopherson | 2 | -20/+0
Don't mark pages/folios as accessed in the primary MMU when zapping SPTEs, as doing so relies on kvm_pfn_to_refcounted_page(), and generally speaking is unnecessary and wasteful. KVM participates in page aging via mmu_notifiers, so there's no need to push "accessed" updates to the primary MMU. And if KVM zaps a SPTE in response to an mmu_notifier, marking it accessed _after_ the primary MMU has decided to zap the page is likely to go unnoticed, i.e. odds are good that, if the page is being zapped for reclaim, the page will be swapped out regardless of whether or not KVM marks the page accessed. Dropping x86's use of kvm_set_pfn_accessed() also paves the way for removing kvm_pfn_to_refcounted_page() and all its users. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-83-seanjc@google.com>
2024-10-25 | KVM: Make kvm_follow_pfn.refcounted_page a required field | Sean Christopherson | 1 | -2/+4
Now that the legacy gfn_to_pfn() APIs are gone, and all callers of hva_to_pfn() pass in a refcounted_page pointer, make it a required field to ensure all future usage in KVM plays nice. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-82-seanjc@google.com>
2024-10-25 | KVM: s390: Use kvm_release_page_dirty() to unpin "struct page" memory | Sean Christopherson | 1 | -1/+1
Use kvm_release_page_dirty() when unpinning guest pages, as the pfn was retrieved via pin_guest_page(), i.e. is guaranteed to be backed by struct page memory. This will allow dropping kvm_release_pfn_dirty() and friends. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-81-seanjc@google.com>
2024-10-25 | KVM: Drop gfn_to_pfn() APIs now that all users are gone | Sean Christopherson | 2 | -61/+0
Drop gfn_to_pfn() and all its variants now that all users are gone. No functional change intended. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-80-seanjc@google.com>
2024-10-25 | KVM: PPC: Explicitly require struct page memory for Ultravisor sharing | Sean Christopherson | 1 | -13/+12
Explicitly require "struct page" memory when sharing memory between guest and host via an Ultravisor. Given the number of pfn_to_page() calls in the code, it's safe to assume that KVM already requires that the pfn returned by gfn_to_pfn() is backed by struct page, i.e. this is likely a bug fix, not a reduction in KVM capabilities. Switching to gfn_to_page() will eventually allow removing gfn_to_pfn() and kvm_pfn_to_refcounted_page(). Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-79-seanjc@google.com>
2024-10-25 | KVM: arm64: Use __gfn_to_page() when copying MTE tags to/from userspace | Sean Christopherson | 1 | -9/+6
Use __gfn_to_page() instead when copying MTE tags between guest and userspace. This will eventually allow removing gfn_to_pfn_prot(), gfn_to_pfn(), kvm_pfn_to_refcounted_page(), and related APIs. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-78-seanjc@google.com>
2024-10-25 | KVM: Add support for read-only usage of gfn_to_page() | Sean Christopherson | 2 | -8/+14
Rework gfn_to_page() to support read-only accesses so that it can be used by arm64 to get MTE tags out of guest memory. Opportunistically rewrite the comment to be even more stern about using gfn_to_page(), as there are very few scenarios where requiring a struct page is actually the right thing to do (though there are such scenarios). Add a FIXME to call out that KVM probably should be pinning pages, not just getting pages. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-77-seanjc@google.com>
2024-10-25 | KVM: Convert gfn_to_page() to use kvm_follow_pfn() | Sean Christopherson | 1 | -7/+9
Convert gfn_to_page() to the new kvm_follow_pfn() internal API, which will eventually allow removing gfn_to_pfn() and kvm_pfn_to_refcounted_page(). Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-76-seanjc@google.com>
2024-10-25 | KVM: PPC: Use kvm_vcpu_map() to map guest memory to patch dcbz instructions | Sean Christopherson | 1 | -7/+6
Use kvm_vcpu_map() when patching dcbz in guest memory, as a regular GUP isn't technically sufficient when writing to data in the target pages. As per Documentation/core-api/pin_user_pages.rst: Correct (uses FOLL_PIN calls): pin_user_pages(); write to the data within the pages; unpin_user_pages(). INCORRECT (uses FOLL_GET calls): get_user_pages(); write to the data within the pages; put_page(). As a happy bonus, using kvm_vcpu_{,un}map() takes care of creating a mapping and marking the page dirty. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-75-seanjc@google.com>
2024-10-25 | KVM: PPC: Remove extra get_page() to fix page refcount leak | Sean Christopherson | 1 | -1/+0
Don't manually do get_page() when patching dcbz, as gfn_to_page() gifts the caller a reference. I.e. doing get_page() will leak the page due to not putting all references. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-74-seanjc@google.com>
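A user-space sketch of the leak: when a lookup helper already gifts the caller a reference, taking another one without a matching extra put leaves the count non-zero forever. The page/refcount model and the lookup_page() helper are stand-ins, not kernel APIs.

/* Sketch of the reference leak caused by a redundant get_page(). */
#include <stdio.h>

struct page { int refcount; };

static struct page guest_page;

static struct page *lookup_page(void)   /* gifts the caller a reference */
{
    guest_page.refcount++;
    return &guest_page;
}

static void get_page(struct page *p) { p->refcount++; }
static void put_page(struct page *p) { p->refcount--; }

int main(void)
{
    struct page *p = lookup_page();

    get_page(p);        /* the redundant extra reference */
    /* ... patch the instruction in the mapped page ... */
    put_page(p);        /* only one put: the lookup's reference leaks */

    printf("leaked references: %d\n", p->refcount);   /* prints 1, not 0 */
    return 0;
}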
2024-10-25 | KVM: MIPS: Use kvm_faultin_pfn() to map pfns into the guest | Sean Christopherson | 1 | -8/+6
Convert MIPS to kvm_faultin_pfn()+kvm_release_faultin_page(), which are new APIs to consolidate arch code and provide consistent behavior across all KVM architectures. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-73-seanjc@google.com>
2024-10-25KVM: MIPS: Mark "struct page" pfns accessed prior to dropping mmu_lockSean Christopherson1-2/+1
Mark pages accessed before dropping mmu_lock when faulting in guest memory so that MIPS can convert to kvm_release_faultin_page() without tripping its lockdep assertion on mmu_lock being held. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-72-seanjc@google.com>
2024-10-25KVM: MIPS: Mark "struct page" pfns accessed only in "slow" page fault pathSean Christopherson1-10/+2
Mark pages accessed only in the slow page fault path in order to remove an unnecessary user of kvm_pfn_to_refcounted_page(). Marking pages accessed in the primary MMU during KVM page fault handling isn't harmful, but it's largely pointless and likely a waste of cycles since the primary MMU will call into KVM via mmu_notifiers when aging pages. I.e. KVM participates in a "pull" model, so there's no need to also "push" updates. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-71-seanjc@google.com>
2024-10-25KVM: MIPS: Mark "struct page" pfns dirty only in "slow" page fault pathSean Christopherson1-2/+3
Mark pages/folios dirty only in the slow page fault path, i.e. only when mmu_lock is held and the operation is mmu_notifier-protected, as marking a page/folio dirty after it has been written back can make some filesystems unhappy (backing KVM guests with such filesystem files is uncommon, and the race is minuscule, hence the lack of complaints). See the link below for details. Link: https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-70-seanjc@google.com>
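A sketch of the fast-path/slow-path split described above (also followed by the LoongArch, RISC-V, and Book3S patches below): dirty state is only touched under the lock, where the mmu_notifier invalidation check happens. The lock and helper names are simplified stand-ins for mmu_lock and the kernel APIs.

/* Sketch: defer dirty-marking to the locked slow path. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;

static void mark_folio_dirty(void) { puts("folio marked dirty"); }

static bool fast_path_fault(void)
{
    /* Fast path: no lock held, so do NOT touch dirty state here. */
    return false;                       /* pretend the fast path missed */
}

static void slow_path_fault(bool write_fault)
{
    pthread_mutex_lock(&mmu_lock);
    /* ... check for an mmu_notifier invalidation, install the mapping ... */
    if (write_fault)
        mark_folio_dirty();             /* only under the lock */
    pthread_mutex_unlock(&mmu_lock);
}

int main(void)
{
    if (!fast_path_fault())
        slow_path_fault(true);
    return 0;
}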
2024-10-25 | KVM: LoongArch: Use kvm_faultin_pfn() to map pfns into the guest | Sean Christopherson | 1 | -8/+6
Convert LoongArch to kvm_faultin_pfn()+kvm_release_faultin_page(), which are new APIs to consolidate arch code and provide consistent behavior across all KVM architectures. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-69-seanjc@google.com>
2024-10-25KVM: LoongArch: Mark "struct page" pfn accessed before dropping mmu_lockSean Christopherson1-1/+1
Mark pages accessed before dropping mmu_lock when faulting in guest memory so that LoongArch can convert to kvm_release_faultin_page() without tripping its lockdep assertion on mmu_lock being held. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-68-seanjc@google.com>
2024-10-25KVM: LoongArch: Mark "struct page" pfns accessed only in "slow" page fault pathSean Christopherson1-18/+2
Mark pages accessed only in the slow path, before dropping mmu_lock when faulting in guest memory so that LoongArch can convert to kvm_release_faultin_page() without tripping its lockdep assertion on mmu_lock being held. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-67-seanjc@google.com>
2024-10-25KVM: LoongArch: Mark "struct page" pfns dirty only in "slow" page fault pathSean Christopherson1-7/+9
Mark pages/folios dirty only in the slow page fault path, i.e. only when mmu_lock is held and the operation is mmu_notifier-protected, as marking a page/folio dirty after it has been written back can make some filesystems unhappy (backing KVM guests with such filesystem files is uncommon, and the race is minuscule, hence the lack of complaints). See the link below for details. Link: https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com Reviewed-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-66-seanjc@google.com>
2024-10-25 | KVM: PPC: Use kvm_faultin_pfn() to handle page faults on Book3s PR | Sean Christopherson | 4 | -12/+14
Convert Book3S PR to __kvm_faultin_pfn()+kvm_release_faultin_page(), which are new APIs to consolidate arch code and provide consistent behavior across all KVM architectures. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-65-seanjc@google.com>
2024-10-25 | KVM: PPC: Book3S: Mark "struct page" pfns dirty/accessed after installing PTE | Sean Christopherson | 1 | -5/+5
Mark pages/folios dirty/accessed after installing a PTE, and more specifically after acquiring mmu_lock and checking for an mmu_notifier invalidation. Marking a page/folio dirty after it has been written back can make some filesystems unhappy (backing KVM guests with such filesystem files is uncommon, and the race is minuscule, hence the lack of complaints). See the link below for details. This will also allow converting Book3S to kvm_release_faultin_page(), which requires that mmu_lock be held (for the aforementioned reason). Link: https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-64-seanjc@google.com>
2024-10-25 | KVM: PPC: Drop unused @kvm_ro param from kvmppc_book3s_instantiate_page() | Sean Christopherson | 3 | -8/+4
Drop @kvm_ro from kvmppc_book3s_instantiate_page() as it is now only written, and never read. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-63-seanjc@google.com>
2024-10-25 | KVM: PPC: Use __kvm_faultin_pfn() to handle page faults on Book3s Radix | Sean Christopherson | 1 | -24/+5
Replace Book3s Radix's homebrewed (read: copy+pasted) fault-in logic with __kvm_faultin_pfn(), which functionally does pretty much the exact same thing. Note, when the code was written, KVM indeed didn't do fast GUP without "!atomic && !async", but that has long since changed (KVM tries fast GUP for all writable mappings). Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-62-seanjc@google.com>
2024-10-25 | KVM: PPC: Use __kvm_faultin_pfn() to handle page faults on Book3s HV | Sean Christopherson | 1 | -21/+4
Replace Book3s HV's homebrewed fault-in logic with __kvm_faultin_pfn(), which functionally does pretty much the exact same thing. Note, when the code was written, KVM indeed didn't do fast GUP without "!atomic && !async", but that has long since changed (KVM tries fast GUP for all writable mappings). Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-61-seanjc@google.com>
2024-10-25 | KVM: RISC-V: Use kvm_faultin_pfn() when mapping pfns into the guest | Sean Christopherson | 1 | -7/+4
Convert RISC-V to __kvm_faultin_pfn()+kvm_release_faultin_page(), which are new APIs to consolidate arch code and provide consistent behavior across all KVM architectures. Opportunistically fix a s/priort/prior typo in the related comment. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Anup Patel <anup@brainfault.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-60-seanjc@google.com>
2024-10-25KVM: RISC-V: Mark "struct page" pfns accessed before dropping mmu_lockSean Christopherson1-3/+3
Mark pages accessed before dropping mmu_lock when faulting in guest memory so that RISC-V can convert to kvm_release_faultin_page() without tripping its lockdep assertion on mmu_lock being held. Marking pages accessed outside of mmu_lock is ok (not great, but safe), but marking pages _dirty_ outside of mmu_lock can make filesystems unhappy (see the link below). Do both under mmu_lock to minimize the chances of doing the wrong thing in the future. Link: https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Anup Patel <anup@brainfault.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-59-seanjc@google.com>
2024-10-25KVM: RISC-V: Mark "struct page" pfns dirty iff a stage-2 PTE is installedSean Christopherson1-1/+3
Don't mark pages dirty if KVM bails from the page fault handler without installing a stage-2 mapping, i.e. if the page is guaranteed to not be written by the guest. In addition to being a (very) minor fix, this paves the way for converting RISC-V to use kvm_release_faultin_page(). Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Anup Patel <anup@brainfault.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-58-seanjc@google.com>