* kvm-arm64/pgtable-fixes-6.4:
: .
: Fixes for concurrent S2 mapping race from Oliver:
:
: "So it appears that there is a race between two parallel stage-2 map
: walkers that could lead to mapping the incorrect PA for a given IPA, as
: the IPA -> PA relationship picks up an unintended offset. This series
: eliminates the problem by using the current IPA of the walk as the
: source-of-truth regarding where we are in a map operation."
: .
KVM: arm64: Constify start/end/phys fields of the pgtable walker data
KVM: arm64: Infer PA offset from VA in hyp map walker
KVM: arm64: Infer the PA offset from IPA in stage-2 map walker
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/misc-6.4:
: .
: Minor changes for 6.4:
:
: - Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...)
:
: - FP/SVE/SME documentation update, in the hope that this field
: becomes clearer...
:
: - Add workaround for the usual Apple SEIS brokenness
:
: - Random comment fixes
: .
KVM: arm64: vgic: Add Apple M2 PRO/MAX cpus to the list of broken SEIS implementations
KVM: arm64: Clarify host SME state management
KVM: arm64: Restructure check for SVE support in FP trap handler
KVM: arm64: Document check for TIF_FOREIGN_FPSTATE
KVM: arm64: Fix repeated words in comments
KVM: arm64: Use the bitmap API to allocate bitmaps
KVM: arm64: Slightly optimize flush_context()
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Pull kvm updates from Paolo Bonzini:
"s390:
- More phys_to_virt conversions
- Improvement of AP management for VSIE (nested virtualization)
ARM64:
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features being
moved to VMMs rather than being implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one. This
last part allows the NV timer code to be implemented on top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
x86:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
enabled, and by giving the guest control of CR0.WP when EPT is
enabled on VMX (VMX-only because SVM doesn't support per-bit
controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
return as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Avoid unnecessary writes+flushes when the guest is only adding new
PTEs
- Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
optimizations when emulating invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
single A/D bit using a LOCK AND instead of XCHG, and skip all of
the "handle changed SPTE" overhead associated with writing the
entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid
having to walk (potentially) all descriptors during insertion and
deletion, which gets quite expensive if the guest is spamming
fork()
- Disallow virtualizing legacy LBRs if architectural LBRs are
available, as the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably
PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features
- Overhaul the vmx_pmu_caps selftest to better validate
PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- AMD SVM:
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
- Intel AMX:
- Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
XTILE_DATA is not being reported due to userspace not opting in
via prctl()
- Fix a bug in emulation of ENCLS in compatibility mode
- Allow emulation of NOP and PAUSE for L2
- AMX selftests improvements
- Misc cleanups
MIPS:
- Constify MIPS's internal callbacks (a leftover from the hardware
enabling rework that landed in 6.3)
Generic:
- Drop unnecessary casts from "void *" throughout kvm_main.c
- Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
struct size by 8 bytes on 64-bit kernels by utilizing a padding
hole
Documentation:
- Fix goof introduced by the conversion to rST"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
KVM: s390: pci: fix virtual-physical confusion on module unload/load
KVM: s390: vsie: clarifications on setting the APCB
KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
KVM: selftests: Test the PMU event "Instructions retired"
KVM: selftests: Copy full counter values from guest in PMU event filter test
KVM: selftests: Use error codes to signal errors in PMU event filter test
KVM: selftests: Print detailed info in PMU event filter asserts
KVM: selftests: Add helpers for PMC asserts in PMU event filter test
KVM: selftests: Add a common helper for the PMU event filter guest code
KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
KVM: arm64: vhe: Drop extra isb() on guest exit
KVM: arm64: vhe: Synchronise with page table walker on MMU update
KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
KVM: arm64: nvhe: Synchronise with page table walker on TLBI
KVM: arm64: Handle 32bit CNTPCTSS traps
KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
KVM: arm64: vgic: Don't acquire its_lock before config_lock
KVM: selftests: Add test to verify KVM's supported XCR0
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the performance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultfd,
permitting userspace to wr-protect unpopulated ptes in anon memory.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen gives userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removed vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also converted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page reclaim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
|
|
We share the same handler for general floating point and SVE traps with a
check to make sure we don't handle any SVE traps if the system doesn't
have SVE support. Since we will be adding SME support and wish to handle
it along with the other FP-related traps, rewrite the check to be more
scalable and a bit clearer, ensuring we don't misidentify SME traps as SVE ones.
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221214-kvm-arm64-sme-context-switch-v2-2-57ba0082e9ff@kernel.org
|
|
As we are revamping the way the pgtable walker evaluates some of the
data, make it clear that we rely on some of the fields to be constant
across the lifetime of a walk.
For this, flag the start, end and phys fields of the walk data as
'const', which will generate an error if we were to accidentally
update these fields again.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
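As a rough illustration of the intent (a hypothetical sketch only, not
necessarily the exact kernel layout), the walk data ends up along these lines:

 /* Hypothetical sketch: fields that must stay fixed for the whole walk are const. */
 struct kvm_pgtable_walk_data {
         struct kvm_pgtable_walker       *walker;

         const u64                       start;  /* address where the walk began */
         u64                             addr;   /* current position, updated as the walk progresses */
         const u64                       end;    /* address where the walk stops */
 };

Any accidental write to 'start' or 'end' now fails to compile, which is the
whole point of the change.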
|
|
Similar to the recently fixed stage-2 walker, the hyp map walker
increments the PA and VA of a walk separately. Unlike stage-2, there is
no bug here as the map walker has exclusive access to the stage-1 page
tables.
Nonetheless, in the interest of continuity throughout the page table
code, tweak the hyp map walker to avoid incrementing the PA and instead
use the VA as the authoritative source of how far along a table walk has
gotten. Calculate the PA to use for a leaf PTE by adding the offset of
the VA from the start of the walk to the starting PA.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230421071606.1603916-3-oliver.upton@linux.dev
|
|
Until now, the page table walker counted increments to the PA and IPA
of a walk in two separate places. While the PA is incremented as soon as
a leaf PTE is installed in stage2_map_walker_try_leaf(), the IPA is
actually bumped in the generic table walker context. Critically,
__kvm_pgtable_visit() rereads the PTE after the LEAF callback returns
to work out if a table or leaf was installed, and only bumps the IPA for
a leaf PTE.
This arrangement worked fine when we handled faults behind the write lock,
as the walker had exclusive access to the stage-2 page tables. However,
commit 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel")
started handling all stage-2 faults behind the read lock, opening up a
race where a walker could increment the PA but not the IPA of a walk.
Nothing good ensues, as the walker starts mapping with the incorrect
IPA -> PA relationship.
For example, assume that two vCPUs took a data abort on the same IPA.
One observes that dirty logging is disabled, and the other observes that
it is enabled:
vCPU attempting PMD mapping               vCPU attempting PTE mapping
=======================================   =====================================
 /* install PMD */
 stage2_make_pte(ctx, leaf);
 data->phys += granule;
                                           /* replace PMD with a table */
                                           stage2_try_break_pte(ctx, data->mmu);
                                           stage2_make_pte(ctx, table);
 /* table is observed */
 ctx.old = READ_ONCE(*ptep);
 table = kvm_pte_table(ctx.old, level);

 /*
  * map walk continues w/o incrementing
  * IPA.
  */
 __kvm_pgtable_walk(..., level + 1);
Bring an end to the whole mess by using the IPA as the single source of
truth for how far along a walk has gotten. Work out the correct PA to
map by calculating the IPA offset from the beginning of the walk and add
that to the starting physical address.
Cc: stable@vger.kernel.org
Fixes: 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230421071606.1603916-2-oliver.upton@linux.dev
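The gist of the fix can be sketched as follows (a simplified illustration; the
helper shown here is illustrative rather than the literal kernel code):

 /*
  * Simplified sketch: the IPA is the single source of truth, and the PA
  * for a leaf mapping is derived from it on demand rather than being
  * incremented separately.
  */
 static u64 stage2_map_walker_phys_addr(u64 walk_start_ipa, u64 current_ipa,
                                        u64 walk_start_phys)
 {
         u64 ipa_offset = current_ipa - walk_start_ipa;

         return walk_start_phys + ipa_offset;
 }

With the PA derived this way, the racing walker described above can no longer
end up with a skewed IPA -> PA relationship.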
|
|
* kvm-arm64/spec-ptw:
: .
: On taking an exception from EL1&0 to EL2(&0), the page table walker is
: allowed to carry on with speculative walks started from EL1&0 while
: running at EL2 (see R_LFHQG). Given that the PTW may be actively using
: the EL1&0 system registers, the only safe way to deal with it is to
: issue a DSB before changing any of it.
:
: We already did the right thing for SPE and TRBE, but ignored the PTW
: for unknown reasons (probably because the architecture wasn't crystal
: clear at the time).
:
: This requires a bit of surgery in the nvhe code, though most of these
: patches are comments so that my future self can understand the purpose
: of these barriers. The VHE code is largely unaffected, thanks to the
: DSB in the context switch.
: .
KVM: arm64: vhe: Drop extra isb() on guest exit
KVM: arm64: vhe: Synchronise with page table walker on MMU update
KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
KVM: arm64: nvhe: Synchronise with page table walker on TLBI
KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
__kvm_vcpu_run_vhe() ends on VHE with an isb(). However, this
function is only reachable via kvm_call_hyp_ret(), which already
contains an isb() in order to mimic the behaviour of nVHE and
provide a context synchronisation event.
We thus have two isb()s back to back, which is one too many.
Drop the first one and solely rely on the one in the helper.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Contrary to nVHE, VHE is a lot easier when it comes to dealing
with speculative page table walks started at EL1. As we only change
EL1&0 translation regime when context-switching, we already benefit
from the effect of the DSB that sits in the context switch code.
We only need to take care of it in the NV case, where we can
flip between two EL1 contexts (one of them being the virtual
EL2) without a context switch.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
|
|
We rely on the presence of a DSB at the end of kvm_flush_dcache_to_poc()
that, on top of ensuring completion of the cache clean, also covers
the speculative page table walk started from EL1.
Document this dependency.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
|
|
A TLBI from EL2 impacting EL1 involves messing with the EL1&0
translation regime, and the page table walker may still be
performing speculative walks.
Piggyback on the existing DSBs to always have a DSB ISH that
will synchronise all load/store operations that the PTW may
still have in flight.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
When taking an exception between the EL1&0 translation regime and
the EL2 translation regime, the page table walker is allowed to
complete the walks started from EL0 or EL1 while running at EL2.
It means that altering the system registers that define the EL1&0
translation regime is fraught with danger *unless* we wait for
the completion of such walk with a DSB (R_LFHQG and subsequent
statements in the ARM ARM). We already did the right thing for
other external agents (SPE, TRBE), but not the PTW.
Rework the existing SPE/TRBE synchronisation to include the PTW,
and add the missing DSB on guest exit.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
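The resulting pattern can be pictured as follows (an illustrative sketch under
the assumption that a speculative EL1&0 walk may still be in flight; the helper
name is made up, the barrier scope depends on the call site, and the real call
sites live in the VHE/nVHE switch code):

 /*
  * Illustrative sketch: wait for speculative EL1&0 walks to complete
  * before EL2 rewrites the registers defining that translation regime.
  */
 static inline void switch_el1_translation_regime(u64 ttbr0, u64 tcr)
 {
         dsb(nsh);                       /* R_LFHQG: let the PTW finish */
         write_sysreg(ttbr0, ttbr0_el1);
         write_sysreg(tcr, tcr_el1);
         isb();                          /* synchronise the new regime */
 }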
|
|
MAX_ORDER is currently defined as the number of orders the page allocator
supports: the user can ask the buddy allocator for page orders between 0 and
MAX_ORDER-1. This definition is counter-intuitive and has led to a number of
bugs all over the kernel.
Change the definition of MAX_ORDER to be inclusive: the range of orders the
user can ask the buddy allocator for is now 0..MAX_ORDER.
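For callers the difference is a simple bound change; a purely illustrative
example:

 /* Illustrative only: visiting every order the buddy allocator supports. */
 static void visit_all_orders(void (*fn)(unsigned int order))
 {
         unsigned int order;

         /* Previously: for (order = 0; order < MAX_ORDER; order++) */
         for (order = 0; order <= MAX_ORDER; order++)    /* MAX_ORDER is now inclusive */
                 fn(order);
 }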
[kirill@shutemov.name: fix min() warning]
Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The existing pKVM code attempts to advertise CSV2/3 using values
initialized to 0, but never set. To advertise CSV2/3 to protected
guests, pass the CSV2/3 values to hyp when initializing hyp's
view of guests' ID_AA64PFR0_EL1.
Similar to non-protected KVM, these are system-wide, rather than
per cpu, for simplicity.
Fixes: 6c30bfb18d0b ("KVM: arm64: Add handlers for protected VM System Registers")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20230404152321.413064-1-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Emulating EL2 also means emulating the EL2 timers. To do so, we expand
our timer framework to deal with at most 4 timers. At any given time,
two timers are using the HW timers, and the two others are purely
emulated.
The role of deciding which is which at any given time is left to a
mapping function which is called every time we need to make such a
decision.
Reviewed-by: Colton Lewis <coltonlewis@google.com>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-18-maz@kernel.org
|
|
Being able to set a global offset isn't enough.
With NV, we also need a per-vcpu, per-timer offset (for example,
CNTVCT_EL0 being offset by CNTVOFF_EL2).
Use a similar method as the VM-wide offset to have a timer point
to the shadow register that contains the offset value.
Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-17-maz@kernel.org
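Conceptually, each timer context carries pointers to the offsets that apply to
it, along the lines of this hedged sketch (structure and function names are
illustrative, not the kernel's exact ones):

 /* Hedged sketch of the per-timer offset indirection. */
 struct timer_offset {
         u64 *vm_offset;         /* VM-wide offset, NULL if unused */
         u64 *vcpu_offset;       /* per-vCPU shadow register (e.g. CNTVOFF_EL2), NULL if unused */
 };

 static u64 timer_read_counter(u64 raw_count, const struct timer_offset *ofs)
 {
         u64 offset = 0;

         if (ofs->vm_offset)
                 offset += *ofs->vm_offset;
         if (ofs->vcpu_offset)
                 offset += *ofs->vcpu_offset;

         return raw_count - offset;
 }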
|
|
Now that it is likely that CNTPCT_EL0 accesses will trap,
fast-track the emulation of the counter read which doesn't
need more than a simple offsetting.
One day, we'll have CNTPOFF everywhere. One day.
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-14-maz@kernel.org
|
|
CNTPOFF_EL2 is awesome, but it is mostly vapourware, and no publicly
available implementation has it. So for the common mortals, let's
implement the emulated version of this thing.
It means trapping accesses to the physical counter and timer, and
emulate some of it as necessary.
As for CNTPOFF_EL2, nobody sets the offset yet.
Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-6-maz@kernel.org
|
|
Pull kvm updates from Paolo Bonzini:
"ARM:
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogeneous systems. Non-secure
software has no practical need to traverse the caches by set/way in
the first place
- Add support for taking stage-2 access faults in parallel. This was
an accidental omission in the original parallel faults
implementation, but should provide a marginal improvement to
machines w/o FEAT_HAFDBS (such as hardware from the fruit company)
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception
handling and masking unsupported features for nested guests
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at
reducing the trap overhead of running nested
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems
- Avoid VM-wide stop-the-world operations when a vCPU accesses its
own redistributor
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected
exceptions in the host
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add [Oliver]
as co-maintainer
RISC-V:
- Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE
- Correctly place the guest in S-mode after redirecting a trap to the
guest
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
s390:
- Sort out confusion between virtual and physical addresses, which
currently are the same on s390
- A new ioctl that performs cmpxchg on guest memory
- A few fixes
x86:
- Change tdp_mmu to a read-only parameter
- Separate TDP and shadow MMU page fault paths
- Enable Hyper-V invariant TSC control
- Fix a variety of APICv and AVIC bugs, some of them real-world, some
of them affecting architecturally legal configurations that are
unlikely to happen in practice
- Mark APIC timer as expired if it's in one-shot mode and the count
underflows while the vCPU task was being migrated
- Advertise support for Intel's new fast REP string features
- Fix a double-shootdown issue in the emergency reboot code
- Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give
SVM similar treatment to VMX
- Update Xen's TSC info CPUID sub-leaves as appropriate
- Add support for Hyper-V's extended hypercalls, where "support" at
this point is just forwarding the hypercalls to userspace
- Clean up the kvm->lock vs. kvm->srcu sequences when updating the
PMU and MSR filters
- One-off fixes and cleanups
- Fix and cleanup the range-based TLB flushing code, used when KVM is
running on Hyper-V
- Add support for filtering PMU events using a mask. If userspace
wants to restrict heavily what events the guest can use, it can now
do so without needing an absurd number of filter entries
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel Sapphire Rapids
- Fix a mostly benign overflow bug in SEV's
send|receive_update_data()
- Move several SVM-specific flags into vcpu_svm
x86 Intel:
- Handle NMI VM-Exits before leaving the noinstr region
- A few trivial cleanups in the VM-Enter flows
- Stop enabling VMFUNC for L1 purely to document that KVM doesn't
support EPTP switching (or any other VM function) for L1
- Fix a crash when using eVMCS's enlightened MSR bitmaps
Generic:
- Clean up the hardware enable and initialization flow, which was
scattered around multiple arch-specific hooks. Instead, just let
the arch code call into generic code. Both x86 and ARM should
benefit from not having to fight common KVM code's notion of how to
do initialization
- Account allocations in generic kvm_arch_alloc_vm()
- Fix a memory leak if coalesced MMIO unregistration fails
selftests:
- On x86, cache the CPU vendor (AMD vs. Intel) and use the info to
emit the correct hypercall instruction instead of relying on KVM to
patch in VMMCALL
- Use TAP interface for kvm_binary_stats_test and tsc_msrs_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (325 commits)
KVM: SVM: hyper-v: placate modpost section mismatch error
KVM: x86/mmu: Make tdp_mmu_allowed static
KVM: arm64: nv: Use reg_to_encoding() to get sysreg ID
KVM: arm64: nv: Only toggle cache for virtual EL2 when SCTLR_EL2 changes
KVM: arm64: nv: Filter out unsupported features from ID regs
KVM: arm64: nv: Emulate EL12 register accesses from the virtual EL2
KVM: arm64: nv: Allow a sysreg to be hidden from userspace only
KVM: arm64: nv: Emulate PSTATE.M for a guest hypervisor
KVM: arm64: nv: Add accessors for SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
KVM: arm64: nv: Handle SMCs taken from virtual EL2
KVM: arm64: nv: Handle trapped ERET from virtual EL2
KVM: arm64: nv: Inject HVC exceptions to the virtual EL2
KVM: arm64: nv: Support virtual EL2 exceptions
KVM: arm64: nv: Handle HCR_EL2.NV system register traps
KVM: arm64: nv: Add nested virt VCPU primitives for vEL2 VCPU state
KVM: arm64: nv: Add EL2 system registers to vcpu context
KVM: arm64: nv: Allow userspace to set PSR_MODE_EL2x
KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set
KVM: arm64: nv: Introduce nested virtualization VCPU feature
KVM: arm64: Use the S2 MMU context to iterate over S2 table
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- Support for arm64 SME 2 and 2.1. SME2 introduces a new 512-bit
architectural register (ZT0, for the look-up table feature) that
Linux needs to save/restore
- Include TPIDR2 in the signal context and add the corresponding
kselftests
- Perf updates: Arm SPEv1.2 support, HiSilicon uncore PMU updates, ACPI
support to the Marvell DDR and TAD PMU drivers, reset DTM_PMU_CONFIG
(ARM CMN) at probe time
- Support for DYNAMIC_FTRACE_WITH_CALL_OPS on arm64
- Permit EFI boot with MMU and caches on. Instead of cleaning the
entire loaded kernel image to the PoC and disabling the MMU and
caches before branching to the kernel bare metal entry point, leave
the MMU and caches enabled and rely on EFI's cacheable 1:1 mapping of
all of system RAM to populate the initial page tables
- Expose the AArch32 (compat) ELF_HWCAP features to userspace in an arm64
kernel (the arm32 kernel only defines the values)
- Harden the arm64 shadow call stack pointer handling: stash the shadow
stack pointer in the task struct on interrupt, load it directly from
this structure
- Signal handling cleanups to remove redundant validation of size
information and avoid reading the same data from userspace twice
- Refactor the hwcap macros to make use of the automatically generated
ID registers. It should make adding new hwcaps less error-prone
- Further arm64 sysreg conversion and some fixes
- arm64 kselftest fixes and improvements
- Pointer authentication cleanups: don't sign leaf functions, unify
asm-arch manipulation
- Pseudo-NMI code generation optimisations
- Minor fixes for SME and TPIDR2 handling
- Miscellaneous updates: ARCH_FORCE_MAX_ORDER is now selectable,
replace strtobool() to kstrtobool() in the cpufeature.c code, apply
dynamic shadow call stack in two passes, intercept pfn changes in
set_pte_at() without the required break-before-make sequence, attempt
to dump all instructions on unhandled kernel faults
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (130 commits)
arm64: fix .idmap.text assertion for large kernels
kselftest/arm64: Don't require FA64 for streaming SVE+ZA tests
kselftest/arm64: Copy whole EXTRA context
arm64: kprobes: Drop ID map text from kprobes blacklist
perf: arm_spe: Print the version of SPE detected
perf: arm_spe: Add support for SPEv1.2 inverted event filtering
perf: Add perf_event_attr::config3
arm64/sme: Fix __finalise_el2 SMEver check
drivers/perf: fsl_imx8_ddr_perf: Remove set-but-not-used variable
arm64/signal: Only read new data when parsing the ZT context
arm64/signal: Only read new data when parsing the ZA context
arm64/signal: Only read new data when parsing the SVE context
arm64/signal: Avoid rereading context frame sizes
arm64/signal: Make interface for restore_fpsimd_context() consistent
arm64/signal: Remove redundant size validation from parse_user_sigframe()
arm64/signal: Don't redundantly verify FPSIMD magic
arm64/cpufeature: Use helper macros to specify hwcaps
arm64/cpufeature: Always use symbolic name for feature value in hwcaps
arm64/sysreg: Initial unsigned annotations for ID registers
arm64/sysreg: Initial annotation of signed ID registers
...
|
|
* kvm-arm64/nv-prefix:
: Preamble to NV support, courtesy of Marc Zyngier.
:
: This brings in a set of prerequisite patches for supporting nested
: virtualization in KVM/arm64. Of course, there is a long way to go until
: NV is actually enabled in KVM.
:
: - Introduce cpucap / vCPU feature flag to pivot the NV code on
:
: - Add support for EL2 vCPU register state
:
: - Basic nested exception handling
:
: - Hide unsupported features from the ID registers for NV-capable VMs
KVM: arm64: nv: Use reg_to_encoding() to get sysreg ID
KVM: arm64: nv: Only toggle cache for virtual EL2 when SCTLR_EL2 changes
KVM: arm64: nv: Filter out unsupported features from ID regs
KVM: arm64: nv: Emulate EL12 register accesses from the virtual EL2
KVM: arm64: nv: Allow a sysreg to be hidden from userspace only
KVM: arm64: nv: Emulate PSTATE.M for a guest hypervisor
KVM: arm64: nv: Add accessors for SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
KVM: arm64: nv: Handle SMCs taken from virtual EL2
KVM: arm64: nv: Handle trapped ERET from virtual EL2
KVM: arm64: nv: Inject HVC exceptions to the virtual EL2
KVM: arm64: nv: Support virtual EL2 exceptions
KVM: arm64: nv: Handle HCR_EL2.NV system register traps
KVM: arm64: nv: Add nested virt VCPU primitives for vEL2 VCPU state
KVM: arm64: nv: Add EL2 system registers to vcpu context
KVM: arm64: nv: Allow userspace to set PSR_MODE_EL2x
KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set
KVM: arm64: nv: Introduce nested virtualization VCPU feature
KVM: arm64: Use the S2 MMU context to iterate over S2 table
arm64: Add ARM64_HAS_NESTED_VIRT cpufeature
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/misc:
: Miscellaneous updates
:
: - Convert CPACR_EL1_TTA to the new, generated system register
: definitions.
:
: - Serialize toggling CPACR_EL1.SMEN to avoid unexpected exceptions when
: accessing SVCR in the host.
:
: - Avoid quiescing the guest if a vCPU accesses its own redistributor's
: SGIs/PPIs, eliminating the need to IPI. Largely an optimization for
: nested virtualization, as the L1 accesses the affected registers
: rather often.
:
: - Conversion to kstrtobool()
:
: - Common definition of INVALID_GPA across architectures
:
: - Enable CONFIG_USERFAULTFD for CI runs of KVM selftests
KVM: arm64: Fix non-kerneldoc comments
KVM: selftests: Enable USERFAULTFD
KVM: selftests: Remove redundant setbuf()
arm64/sysreg: clean up some inconsistent indenting
KVM: MMU: Make the definition of 'INVALID_GPA' common
KVM: arm64: vgic-v3: Use kstrtobool() instead of strtobool()
KVM: arm64: vgic-v3: Limit IPI-ing when accessing GICR_{C,S}ACTIVER0
KVM: arm64: Synchronize SMEN on vcpu schedule out
KVM: arm64: Kill CPACR_EL1_TTA definition
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/psci-relay-fixes:
: Fixes for CPU on/resume with pKVM, courtesy Quentin Perret.
:
: A consequence of deprivileging the host is that pKVM relays PSCI calls
: on behalf of the host. pKVM's CPU initialization failed to fully
: initialize the CPU's EL2 state, which notably led to unexpected SVE
: traps resulting in a hyp panic.
:
: The issue is addressed by reusing parts of __finalise_el2 to restore CPU
: state in the PSCI relay.
KVM: arm64: Finalise EL2 state from pKVM PSCI relay
KVM: arm64: Use sanitized values in __check_override in nVHE
KVM: arm64: Introduce finalise_el2_state macro
KVM: arm64: Provide sanitized SYS_ID_AA64SMFR0_EL1 to nVHE
|
|
* kvm-arm64/parallel-access-faults:
: Parallel stage-2 access fault handling
:
: The parallel faults changes that went into 6.2 covered most stage-2
: aborts, with the exception of stage-2 access faults. Building on top of
: the new infrastructure, this series adds support for handling access
: faults (i.e. updating the access flag) in parallel.
:
: This is expected to provide a performance uplift for cores that do not
: implement FEAT_HAFDBS, such as those from the fruit company.
KVM: arm64: Condition HW AF updates on config option
KVM: arm64: Handle access faults behind the read lock
KVM: arm64: Don't serialize if the access flag isn't set
KVM: arm64: Return EAGAIN for invalid PTE in attr walker
KVM: arm64: Ignore EAGAIN for walks outside of a fault
KVM: arm64: Use KVM's pte type/helpers in handle_access_fault()
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/virtual-cache-geometry:
: Virtualized cache geometry for KVM guests, courtesy of Akihiko Odaki.
:
: KVM/arm64 has always exposed the host cache geometry directly to the
: guest, even though non-secure software should never perform CMOs by
: Set/Way. This was slightly wrong, as the cache geometry was derived from
: the PE on which the vCPU thread was running and not a sanitized value.
:
: All together this leads to issues migrating VMs on heterogeneous
: systems, as the cache geometry saved/restored could be inconsistent.
:
: KVM/arm64 now presents 1 level of cache with 1 set and 1 way. The cache
: geometry is entirely controlled by userspace, such that migrations from
: older kernels continue to work.
KVM: arm64: Mark some VM-scoped allocations as __GFP_ACCOUNT
KVM: arm64: Normalize cache configuration
KVM: arm64: Mask FEAT_CCIDX
KVM: arm64: Always set HCR_TID2
arm64/cache: Move CLIDR macro definitions
arm64/sysreg: Add CCSIDR2_EL1
arm64/sysreg: Convert CCSIDR_EL1 to automatic generation
arm64: Allow the definition of UNKNOWN system register fields
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
We can no longer blindly copy the VCPU's PSTATE into SPSR_EL2 and return
to the guest and vice versa when taking an exception to the hypervisor,
because we emulate virtual EL2 in EL1 and therefore have to translate
the mode field from EL2 to EL1 and vice versa.
This requires keeping track of the state in which we enter the guest, for which
we transiently use a dedicated flag.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-15-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Support injecting exceptions and performing exception returns to and
from virtual EL2. This must be done entirely in software except when
taking an exception from vEL0 to vEL2 when the virtual HCR_EL2.{E2H,TGE}
== {1,1} (a VHE guest hypervisor).
[maz: switch to common exception injection framework, illegal exception
return handling]
Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230209175820.1939006-10-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
'for-next/misc', 'for-next/sme2', 'for-next/tpidr2', 'for-next/scs', 'for-next/compat-hwcap', 'for-next/ftrace', 'for-next/efi-boot-mmu-on', 'for-next/ptrauth' and 'for-next/pseudo-nmi', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
perf: arm_spe: Print the version of SPE detected
perf: arm_spe: Add support for SPEv1.2 inverted event filtering
perf: Add perf_event_attr::config3
drivers/perf: fsl_imx8_ddr_perf: Remove set-but-not-used variable
perf: arm_spe: Support new SPEv1.2/v8.7 'not taken' event
perf: arm_spe: Use new PMSIDR_EL1 register enums
perf: arm_spe: Drop BIT() and use FIELD_GET/PREP accessors
arm64/sysreg: Convert SPE registers to automatic generation
arm64: Drop SYS_ from SPE register defines
perf: arm_spe: Use feature numbering for PMSEVFR_EL1 defines
perf/marvell: Add ACPI support to TAD uncore driver
perf/marvell: Add ACPI support to DDR uncore driver
perf/arm-cmn: Reset DTM_PMU_CONFIG at probe
drivers/perf: hisi: Extract initialization of "cpa_pmu->pmu"
drivers/perf: hisi: Simplify the parameters of hisi_pmu_init()
drivers/perf: hisi: Advertise the PERF_PMU_CAP_NO_EXCLUDE capability
* for-next/sysreg:
: arm64 sysreg and cpufeature fixes/updates
KVM: arm64: Use symbolic definition for ISR_EL1.A
arm64/sysreg: Add definition of ISR_EL1
arm64/sysreg: Add definition for ICC_NMIAR1_EL1
arm64/cpufeature: Remove 4 bit assumption in ARM64_FEATURE_MASK()
arm64/sysreg: Fix errors in 32 bit enumeration values
arm64/cpufeature: Fix field sign for DIT hwcap detection
* for-next/sme:
: SME-related updates
arm64/sme: Optimise SME exit on syscall entry
arm64/sme: Don't use streaming mode to probe the maximum SME VL
arm64/ptrace: Use system_supports_tpidr2() to check for TPIDR2 support
* for-next/kselftest: (23 commits)
: arm64 kselftest fixes and improvements
kselftest/arm64: Don't require FA64 for streaming SVE+ZA tests
kselftest/arm64: Copy whole EXTRA context
kselftest/arm64: Fix enumeration of systems without 128 bit SME for SSVE+ZA
kselftest/arm64: Fix enumeration of systems without 128 bit SME
kselftest/arm64: Don't require FA64 for streaming SVE tests
kselftest/arm64: Limit the maximum VL we try to set via ptrace
kselftest/arm64: Correct buffer size for SME ZA storage
kselftest/arm64: Remove the local NUM_VL definition
kselftest/arm64: Verify simultaneous SSVE and ZA context generation
kselftest/arm64: Verify that SSVE signal context has SVE_SIG_FLAG_SM set
kselftest/arm64: Remove spurious comment from MTE test Makefile
kselftest/arm64: Support build of MTE tests with clang
kselftest/arm64: Initialise current at build time in signal tests
kselftest/arm64: Don't pass headers to the compiler as source
kselftest/arm64: Remove redundant _start labels from FP tests
kselftest/arm64: Fix .pushsection for strings in FP tests
kselftest/arm64: Run BTI selftests on systems without BTI
kselftest/arm64: Fix test numbering when skipping tests
kselftest/arm64: Skip non-power of 2 SVE vector lengths in fp-stress
kselftest/arm64: Only enumerate power of two VLs in syscall-abi
...
* for-next/misc:
: Miscellaneous arm64 updates
arm64/mm: Intercept pfn changes in set_pte_at()
Documentation: arm64: correct spelling
arm64: traps: attempt to dump all instructions
arm64: Apply dynamic shadow call stack patching in two passes
arm64: el2_setup.h: fix spelling typo in comments
arm64: Kconfig: fix spelling
arm64: cpufeature: Use kstrtobool() instead of strtobool()
arm64: Avoid repeated AA64MMFR1_EL1 register read on pagefault path
arm64: make ARCH_FORCE_MAX_ORDER selectable
* for-next/sme2: (23 commits)
: Support for arm64 SME 2 and 2.1
arm64/sme: Fix __finalise_el2 SMEver check
kselftest/arm64: Remove redundant _start labels from zt-test
kselftest/arm64: Add coverage of SME 2 and 2.1 hwcaps
kselftest/arm64: Add coverage of the ZT ptrace regset
kselftest/arm64: Add SME2 coverage to syscall-abi
kselftest/arm64: Add test coverage for ZT register signal frames
kselftest/arm64: Teach the generic signal context validation about ZT
kselftest/arm64: Enumerate SME2 in the signal test utility code
kselftest/arm64: Cover ZT in the FP stress test
kselftest/arm64: Add a stress test program for ZT0
arm64/sme: Add hwcaps for SME 2 and 2.1 features
arm64/sme: Implement ZT0 ptrace support
arm64/sme: Implement signal handling for ZT
arm64/sme: Implement context switching for ZT0
arm64/sme: Provide storage for ZT0
arm64/sme: Add basic enumeration for SME2
arm64/sme: Enable host kernel to access ZT0
arm64/sme: Manually encode ZT0 load and store instructions
arm64/esr: Document ISS for ZT0 being disabled
arm64/sme: Document SME 2 and SME 2.1 ABI
...
* for-next/tpidr2:
: Include TPIDR2 in the signal context
kselftest/arm64: Add test case for TPIDR2 signal frame records
kselftest/arm64: Add TPIDR2 to the set of known signal context records
arm64/signal: Include TPIDR2 in the signal context
arm64/sme: Document ABI for TPIDR2 signal information
* for-next/scs:
: arm64: harden shadow call stack pointer handling
arm64: Stash shadow stack pointer in the task struct on interrupt
arm64: Always load shadow stack pointer directly from the task struct
* for-next/compat-hwcap:
: arm64: Expose compat ARMv8 AArch32 features (HWCAPs)
arm64: Add compat hwcap SSBS
arm64: Add compat hwcap SB
arm64: Add compat hwcap I8MM
arm64: Add compat hwcap ASIMDBF16
arm64: Add compat hwcap ASIMDFHM
arm64: Add compat hwcap ASIMDDP
arm64: Add compat hwcap FPHP and ASIMDHP
* for-next/ftrace:
: Add arm64 support for DYNAMIC_FTRACE_WITH_CALL_OPS
arm64: avoid executing padding bytes during kexec / hibernation
arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
arm64: ftrace: Update stale comment
arm64: patching: Add aarch64_insn_write_literal_u64()
arm64: insn: Add helpers for BTI
arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
ACPI: Don't build ACPICA with '-Os'
Compiler attributes: GCC cold function alignment workarounds
ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS
* for-next/efi-boot-mmu-on:
: Permit arm64 EFI boot with MMU and caches on
arm64: kprobes: Drop ID map text from kprobes blacklist
arm64: head: Switch endianness before populating the ID map
efi: arm64: enter with MMU and caches enabled
arm64: head: Clean the ID map and the HYP text to the PoC if needed
arm64: head: avoid cache invalidation when entering with the MMU on
arm64: head: record the MMU state at primary entry
arm64: kernel: move identity map out of .text mapping
arm64: head: Move all finalise_el2 calls to after __enable_mmu
* for-next/ptrauth:
: arm64 pointer authentication cleanup
arm64: pauth: don't sign leaf functions
arm64: unify asm-arch manipulation
* for-next/pseudo-nmi:
: Pseudo-NMI code generation optimisations
arm64: irqflags: use alternative branches for pseudo-NMI logic
arm64: add ARM64_HAS_GIC_PRIO_RELAXED_SYNC cpucap
arm64: make ARM64_HAS_GIC_PRIO_MASKING depend on ARM64_HAS_GIC_CPUIF_SYSREGS
arm64: rename ARM64_HAS_IRQ_PRIO_MASKING to ARM64_HAS_GIC_PRIO_MASKING
arm64: rename ARM64_HAS_SYSREG_GIC_CPUIF to ARM64_HAS_GIC_CPUIF_SYSREGS
|
|
The EL2 state is not initialised correctly when a CPU comes out of
CPU_{SUSPEND,OFF} as the finalise_el2 function is not being called.
Let's directly call finalise_el2_state from this path to solve the
issue.
Fixes: 504ee23611c4 ("arm64: Add the arm64.nosve command line option")
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20230201103755.1398086-5-qperret@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
We will need a sanitized copy of SYS_ID_AA64SMFR0_EL1 from the nVHE EL2
code shortly, so make sure to provide it with a copy.
Signed-off-by: Quentin Perret <qperret@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230201103755.1398086-2-qperret@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
We currently have a non-standard SYS_ prefix in the constants generated
for the SPE register bitfields. Drop this in preparation for automatic
register definition generation.
The SPE mask defines were unshifted, and the SPE register field
enumerations were shifted. The autogenerated defines are the opposite,
so make the necessary adjustments.
No functional changes.
Tested-by: James Clark <james.clark@arm.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20220825-arm-spe-v8-7-v4-2-327f860daf28@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Since the One True Way is to use the new generated definition,
kill the KVM-specific definition of CPACR_EL1_TTA, and move
over to CPACR_ELx_TTA, hopefully for the same result.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230112154803.1808559-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
As it currently stands, KVM makes use of FEAT_HAFDBS unconditionally.
Use of the feature in the rest of the kernel is guarded by an associated
Kconfig option.
Align KVM with the rest of the kernel and only enable VTCR_HA when
ARM64_HW_AFDBM is enabled. This can be helpful for testing changes to
the stage-2 access fault path on Armv8.1+ implementations.
Link: https://lore.kernel.org/r/20221202185156.696189-7-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
As the underlying software walkers are able to traverse and update
stage-2 in parallel there is no need to serialize access faults.
Only take the read lock when handling an access fault.
Link: https://lore.kernel.org/r/20221202185156.696189-6-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Of course, if the PTE wasn't changed then there are absolutely no
serialization requirements. Skip the DSB for an unsuccessful update to
the access flag.
Link: https://lore.kernel.org/r/20221202185156.696189-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Return EAGAIN for invalid PTEs in the attr walker, signaling to the
caller that any serialization and/or invalidation can be elided.
Link: https://lore.kernel.org/r/20221202185156.696189-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The page table walkers are invoked outside fault handling paths, such as
write protecting a range of memory. EAGAIN is generally used by the
walkers to retry execution due to races on a particular PTE, like taking
an access fault on a PTE being invalidated from another thread.
This early return behavior is undesirable for walkers that operate
outside a fault handler. Suppress EAGAIN and continue the walk if
operating outside a fault handler.
Link: https://lore.kernel.org/r/20221202185156.696189-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
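The behaviour amounts to filtering the visitor's return value based on whether
the walk is servicing a fault; a hedged sketch (the structure and flag below
are illustrative, not the kernel's exact API):

 #include <linux/errno.h>        /* -EAGAIN */
 #include <linux/types.h>        /* bool */

 /* Hedged sketch: -EAGAIN only escapes to callers that are handling a fault. */
 struct walk_ctx {
         int     (*visitor)(struct walk_ctx *ctx);
         bool    handling_fault;         /* true when invoked from a fault handler */
 };

 static int visit_pte(struct walk_ctx *ctx)
 {
         int ret = ctx->visitor(ctx);

         if (ret == -EAGAIN && !ctx->handling_fault)
                 ret = 0;                /* not a fault: ignore the race and keep walking */

         return ret;
 }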
|
|
Always set HCR_TID2 to trap CTR_EL0, CCSIDR2_EL1, CLIDR_EL1, and
CSSELR_EL1. This saves a few lines of code and allows us to employ their
access trap handlers for more purposes than anticipated by the old
condition for setting HCR_TID2.
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20230112023852.42012-6-akihiko.odaki@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Now that we are generating ISR_EL1 we have acquired a constant for
ISR_EL1.A, use it rather than the magic number we had been using in the KVM
entry code.
Suggested-by: Marc Zyngier <maz@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20221208-arm64-isr-el1-v2-3-89f7073a1ca9@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The former is an AArch32 legacy, so let's move over to the
verbose (and strictly identical) version.
This involves moving some of the #defines that were private
to KVM into the more generic esr.h.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on (see merge
commit 382b5b87a97d: "Fix a number of issues with MTE, such as
races on the tags being initialised vs the PG_mte_tagged flag as
well as the lack of support for VM_SHARED when KVM is involved.
Patches from Catalin Marinas and Peter Collingbourne").
- Merge the pKVM shadow vcpu state tracking that allows the
hypervisor to have its own view of a vcpu, keeping that state
private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
not-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB
pages only) as a prefix of the oncoming support for 4kB and 16kB
pages.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address
support
- Removal of a unused function
x86:
- Allow compiling out SMM support
- Cleanup and documentation of SMM state save area format
- Preserve interrupt shadow in SMM state save area
- Respond to generic signals during slow page faults
- Fixes and optimizations for the non-executable huge page errata
fix.
- Reprogram all performance counters on PMU filter change
- Cleanups to Hyper-V emulation and tests
- Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
guest running on top of a L1 Hyper-V hypervisor)
- Advertise several new Intel features
- x86 Xen-for-KVM:
- Allow the Xen runstate information to cross a page boundary
- Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
- Add support for 32-bit guests in SCHEDOP_poll
- Notable x86 fixes and cleanups:
- One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
- Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
a few years back when eliminating unnecessary barriers when
switching between vmcs01 and vmcs02.
- Clean up vmread_error_trampoline() to make it more obvious that
params must be passed on the stack, even for x86-64.
- Let userspace set all supported bits in MSR_IA32_FEAT_CTL
irrespective of the current guest CPUID.
- Fudge around a race with TSC refinement that results in KVM
incorrectly thinking a guest needs TSC scaling when running on a
CPU with a constant TSC, but no hardware-enumerated TSC
frequency.
- Advertise (on AMD) that the SMM_CTL MSR is not supported
- Remove unnecessary exports
Generic:
- Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
- Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
- Fix build errors that occur in certain setups (unsure exactly what
is unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
- Introduce actual atomics for clear/set_bit() in selftests
- Add support for pinning vCPUs in dirty_log_perf_test.
- Rename the so called "perf_util" framework to "memstress".
- Add a lightweight pseudo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress
tests.
- Add a common ucall implementation; code dedup and pre-work for
running SEV (and beyond) guests in selftests.
- Provide a common constructor and arch hook, which will eventually
be used by x86 to automatically select the right hypercall (AMD vs.
Intel).
- A bunch of added/enabled/fixed selftests for ARM64, covering
memslots, breakpoints, stage-2 faults and access tracking.
- x86-specific selftest changes:
- Clean up x86's page table management.
- Clean up and enhance the "smaller maxphyaddr" test, and add a
related test to cover generic emulation failure.
- Clean up the nEPT support checks.
- Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
- Fix an ordering issue in the AMX test introduced by recent
conversions to use kvm_cpu_has(), and harden the code to guard
against similar bugs in the future. Anything that triggers
caching of KVM's supported CPUID, kvm_cpu_has() in this case,
effectively hides opt-in XSAVE features if the caching occurs
before the test opts in via prctl().
Documentation:
- Remove deleted ioctls from documentation
- Clean up the docs for the x86 MSR filter.
- Various fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
KVM: x86: Add proper ReST tables for userspace MSR exits/flags
KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
KVM: arm64: selftests: Align VA space allocator with TTBR0
KVM: arm64: Fix benign bug with incorrect use of VA_BITS
KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
KVM: x86: Advertise that the SMM_CTL MSR is not supported
KVM: x86: remove unnecessary exports
KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
tools: KVM: selftests: Convert clear/set_bit() to actual atomics
tools: Drop "atomic_" prefix from atomic test_and_set_bit()
tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
perf tools: Use dedicated non-atomic clear/set bit helpers
tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
KVM: arm64: selftests: Enable single-step without a "full" ucall()
KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
KVM: Remove stale comment about KVM_REQ_UNHALT
KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
KVM: Reference to kvm_userspace_memory_region in doc and comments
KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
...
|
|
* kvm-arm64/misc-6.2:
: .
: Misc fixes for 6.2:
:
: - Fix formatting for the pvtime documentation
:
: - Fix a comment in the VHE-specific Makefile
: .
KVM: arm64: Fix typo in comment
KVM: arm64: Fix pvtime documentation
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/pkvm-vcpu-state: (25 commits)
: .
: Large drop of pKVM patches from Will Deacon and co, adding
: a private vm/vcpu state at EL2, managed independently from
: the EL1 state. From the cover letter:
:
: "This is version six of the pKVM EL2 state series, extending the pKVM
: hypervisor code so that it can dynamically instantiate and manage VM
: data structures without the host being able to access them directly.
: These structures consist of a hyp VM, a set of hyp vCPUs and the stage-2
: page-table for the MMU. The pages used to hold the hypervisor structures
: are returned to the host when the VM is destroyed."
: .
KVM: arm64: Use the pKVM hyp vCPU structure in handle___kvm_vcpu_run()
KVM: arm64: Don't unnecessarily map host kernel sections at EL2
KVM: arm64: Explicitly map 'kvm_vgic_global_state' at EL2
KVM: arm64: Maintain a copy of 'kvm_arm_vmid_bits' at EL2
KVM: arm64: Unmap 'kvm_arm_hyp_percpu_base' from the host
KVM: arm64: Return guest memory from EL2 via dedicated teardown memcache
KVM: arm64: Instantiate guest stage-2 page-tables at EL2
KVM: arm64: Consolidate stage-2 initialisation into a single function
KVM: arm64: Add generic hyp_memcache helpers
KVM: arm64: Provide I-cache invalidation by virtual address at EL2
KVM: arm64: Initialise hypervisor copies of host symbols unconditionally
KVM: arm64: Add per-cpu fixmap infrastructure at EL2
KVM: arm64: Instantiate pKVM hypervisor VM and vCPU structures from EL1
KVM: arm64: Add infrastructure to create and track pKVM instances at EL2
KVM: arm64: Rename 'host_kvm' to 'host_mmu'
KVM: arm64: Add hyp_spinlock_t static initializer
KVM: arm64: Include asm/kvm_mmu.h in nvhe/mem_protect.h
KVM: arm64: Add helpers to pin memory shared with the hypervisor at EL2
KVM: arm64: Prevent the donation of no-map pages
KVM: arm64: Implement do_donate() helper for donating memory
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/parallel-faults:
: .
: Parallel stage-2 fault handling, courtesy of Oliver Upton.
: From the cover letter:
:
: "Presently KVM only takes a read lock for stage 2 faults if it believes
: the fault can be fixed by relaxing permissions on a PTE (write unprotect
: for dirty logging). Otherwise, stage 2 faults grab the write lock, which
: predictably can pile up all the vCPUs in a sufficiently large VM.
:
: Like the TDP MMU for x86, this series loosens the locking around
: manipulations of the stage 2 page tables to allow parallel faults. RCU
: and atomics are exploited to safely build/destroy the stage 2 page
: tables in light of multiple software observers."
: .
KVM: arm64: Reject shared table walks in the hyp code
KVM: arm64: Don't acquire RCU read lock for exclusive table walks
KVM: arm64: Take a pointer to walker data in kvm_dereference_pteref()
KVM: arm64: Handle stage-2 faults in parallel
KVM: arm64: Make table->block changes parallel-aware
KVM: arm64: Make leaf->leaf PTE changes parallel-aware
KVM: arm64: Make block->table PTE changes parallel-aware
KVM: arm64: Split init and set for table PTE
KVM: arm64: Atomically update stage 2 leaf attributes in parallel walks
KVM: arm64: Protect stage-2 traversal with RCU
KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make
KVM: arm64: Use an opaque type for pteps
KVM: arm64: Add a helper to tear down unlinked stage-2 subtrees
KVM: arm64: Don't pass kvm_pgtable through kvm_pgtable_walk_data
KVM: arm64: Pass mm_ops through the visitor context
KVM: arm64: Stash observed pte value in visitor context
KVM: arm64: Combine visitor arguments into a context structure
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Exclusive table walks are the only supported table walk in the hyp, as
there is no construct like RCU available in the hypervisor code. Reject
any attempt to do a shared table walk by returning an error and allowing
the caller to clean up the mess.
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221118182222.3932898-4-oliver.upton@linux.dev
|
|
Marek reported a BUG resulting from the recent parallel faults changes,
as the hyp stage-1 map walker attempted to allocate table memory while
holding the RCU read lock:
BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:274
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by swapper/0/1:
#0: ffff80000a8a44d0 (kvm_hyp_pgd_mutex){+.+.}-{3:3}, at:
__create_hyp_mappings+0x80/0xc4
#1: ffff80000a927720 (rcu_read_lock){....}-{1:2}, at:
kvm_pgtable_walk+0x0/0x1f4
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3+ #5918
Hardware name: Raspberry Pi 3 Model B (DT)
Call trace:
dump_backtrace.part.0+0xe4/0xf0
show_stack+0x18/0x40
dump_stack_lvl+0x8c/0xb8
dump_stack+0x18/0x34
__might_resched+0x178/0x220
__might_sleep+0x48/0xa0
prepare_alloc_pages+0x178/0x1a0
__alloc_pages+0x9c/0x109c
alloc_page_interleave+0x1c/0xc4
alloc_pages+0xec/0x160
get_zeroed_page+0x1c/0x44
kvm_hyp_zalloc_page+0x14/0x20
hyp_map_walker+0xd4/0x134
kvm_pgtable_visitor_cb.isra.0+0x38/0x5c
__kvm_pgtable_walk+0x1a4/0x220
kvm_pgtable_walk+0x104/0x1f4
kvm_pgtable_hyp_map+0x80/0xc4
__create_hyp_mappings+0x9c/0xc4
kvm_mmu_init+0x144/0x1cc
kvm_arch_init+0xe4/0xef4
kvm_init+0x3c/0x3d0
arm_init+0x20/0x30
do_one_initcall+0x74/0x400
kernel_init_freeable+0x2e0/0x350
kernel_init+0x24/0x130
ret_from_fork+0x10/0x20
Since the hyp stage-1 table walkers are serialized by kvm_hyp_pgd_mutex,
RCU protection really doesn't add anything. Don't acquire the RCU read
lock for an exclusive walk.
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221118182222.3932898-3-oliver.upton@linux.dev
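The resulting logic conditions the RCU critical section on the walk being
shared, roughly as in this hedged sketch (helper names are illustrative; the
kernel's actual begin/end helpers differ in detail):

 #include <linux/rcupdate.h>     /* rcu_read_lock()/rcu_read_unlock() */
 #include <asm/kvm_pgtable.h>    /* struct kvm_pgtable_walker, KVM_PGTABLE_WALK_SHARED */

 /* Hedged sketch: only shared walks need RCU protection. */
 static void walk_begin(struct kvm_pgtable_walker *walker)
 {
         if (walker->flags & KVM_PGTABLE_WALK_SHARED)
                 rcu_read_lock();
 }

 static void walk_end(struct kvm_pgtable_walker *walker)
 {
         if (walker->flags & KVM_PGTABLE_WALK_SHARED)
                 rcu_read_unlock();
 }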
|
|
Rather than passing through the state of the KVM_PGTABLE_WALK_SHARED
flag, just take a pointer to the whole walker structure instead. Move
around struct kvm_pgtable and the RCU indirection such that the
associated ifdeffery remains in one place while ensuring the walker +
flags definitions precede their use.
No functional change intended.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221118182222.3932898-2-oliver.upton@linux.dev
|
|
As a stepping stone towards deprivileging the host's access to the
guest's vCPU structures, introduce some naive flush/sync routines to
copy most of the host vCPU into the hyp vCPU on vCPU run and back
again on return to EL1.
This allows us to run using the pKVM hyp structures when KVM is
initialised in protected mode.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-27-will@kernel.org
|