summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-11-10KVM: arm64: selftests: Test with every breakpoint/watchpointReiji Watanabe1-12/+42
Currently, the debug-exceptions test always uses only {break,watch}point#0 and the highest numbered context-aware breakpoint. Modify the test to use all {break,watch}points and context-aware breakpoints supported on the system. Signed-off-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020054202.2119018-10-reijiw@google.com
2022-11-10KVM: arm64: selftests: Add a test case for a linked watchpointReiji Watanabe1-0/+35
Currently, the debug-exceptions test doesn't have a test case for a linked watchpoint. Add a test case for the linked watchpoint to the test. The new test case uses the highest numbered context-aware breakpoint (for Context ID match), and the watchpoint#0, which is linked to the context-aware breakpoint. Signed-off-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020054202.2119018-9-reijiw@google.com
2022-11-10KVM: arm64: selftests: Add a test case for a linked breakpointReiji Watanabe1-6/+57
Currently, the debug-exceptions test doesn't have a test case for a linked breakpoint. Add a test case for the linked breakpoint to the test. The new test case uses a pair of breakpoints. One is the higiest numbered context-aware breakpoint (for Context ID match), and the other one is the breakpoint#0 (for Address Match), which is linked to the context-aware breakpoint. Signed-off-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020054202.2119018-8-reijiw@google.com
2022-11-10KVM: arm64: selftests: Change debug_version() to take ID_AA64DFR0_EL1Reiji Watanabe1-5/+4
Change debug_version() to take the ID_AA64DFR0_EL1 value instead of vcpu as an argument, and change its callsite to read ID_AA64DFR0_EL1 (and pass it to debug_version()). Subsequent patches will reuse the register value in the callsite. No functional change intended. Signed-off-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020054202.2119018-7-reijiw@google.com
2022-11-10KVM: arm64: selftests: Stop unnecessary test stage tracking of debug-exceptionsReiji Watanabe1-37/+9
Currently, debug-exceptions test unnecessarily tracks some test stages using GUEST_SYNC(). The code for it needs to be updated as test cases are added or removed. Stop doing the unnecessary stage tracking, as they are not so useful and are a bit pain to maintain. Signed-off-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020054202.2119018-6-reijiw@google.com
2022-11-10KVM: arm64: selftests: Add helpers to enable debug exceptionsReiji Watanabe1-12/+13
Add helpers to enable breakpoint and watchpoint exceptions. No functional change intended. Signed-off-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020054202.2119018-5-reijiw@google.com
2022-11-10KVM: arm64: selftests: Remove the hard-coded {b,w}pn#0 from debug-exceptionsReiji Watanabe1-18/+32
Remove the hard-coded {break,watch}point #0 from the guest_code() in debug-exceptions to allow {break,watch}point number to be specified. Change reset_debug_state() to zeroing all dbg{b,w}{c,v}r_el0 registers so that guest_code() can use the function to reset those registers even when non-zero {break,watch}points are specified for guest_code(). Subsequent patches will add test cases for non-zero {break,watch}points. Signed-off-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020054202.2119018-4-reijiw@google.com
2022-11-10KVM: arm64: selftests: Add write_dbg{b,w}{c,v}r helpers in debug-exceptionsReiji Watanabe1-4/+68
Introduce helpers in the debug-exceptions test to write to dbg{b,w}{c,v}r registers. Those helpers will be useful for test cases that will be added to the test in subsequent patches. No functional change intended. Signed-off-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020054202.2119018-3-reijiw@google.com
2022-11-10KVM: arm64: selftests: Use FIELD_GET() to extract ID register fieldsReiji Watanabe3-5/+8
Use FIELD_GET() macro to extract ID register fields for existing aarch64 selftests code. No functional change intended. Signed-off-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020054202.2119018-2-reijiw@google.com
2022-11-10KVM: selftests: memslot_perf_test: Report optimal memory slotsGavin Shan1-4/+41
The memory area in each slot should be aligned to host page size. Otherwise, the test will fail. For example, the following command fails with the following messages with 64KB-page-size-host and 4KB-pae-size-guest. It's not user friendly to abort the test. Lets do something to report the optimal memory slots, instead of failing the test. # ./memslot_perf_test -v -s 1000 Number of memory slots: 999 Testing map performance with 1 runs, 5 seconds each Adding slots 1..999, each slot with 8 pages + 216 extra pages last ==== Test Assertion Failure ==== lib/kvm_util.c:824: vm_adjust_num_guest_pages(vm->mode, npages) == npages pid=19872 tid=19872 errno=0 - Success 1 0x00000000004065b3: vm_userspace_mem_region_add at kvm_util.c:822 2 0x0000000000401d6b: prepare_vm at memslot_perf_test.c:273 3 (inlined by) test_execute at memslot_perf_test.c:756 4 (inlined by) test_loop at memslot_perf_test.c:994 5 (inlined by) main at memslot_perf_test.c:1073 6 0x0000ffff7ebb4383: ?? ??:0 7 0x00000000004021ff: _start at :? Number of guest pages is not compatible with the host. Try npages=16 Report the optimal memory slots instead of failing the test when the memory area in each slot isn't aligned to host page size. With this applied, the optimal memory slots is reported. # ./memslot_perf_test -v -s 1000 Number of memory slots: 999 Testing map performance with 1 runs, 5 seconds each Memslot count too high for this test, decrease the cap (max is 514) Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020071209.559062-7-gshan@redhat.com
2022-11-10KVM: selftests: memslot_perf_test: Consolidate memoryGavin Shan1-17/+26
The addresses and sizes passed to vm_userspace_mem_region_add() and madvise() should be aligned to host page size, which can be 64KB on aarch64. So it's wrong by passing additional fixed 4KB memory area to various tests. Fix it by passing additional fixed 64KB memory area to various tests. We also add checks to ensure that none of host/guest page size exceeds 64KB. MEM_TEST_MOVE_SIZE is fixed up to 192KB either. With this, the following command works fine on 64KB-page-size-host and 4KB-page-size-guest. # ./memslot_perf_test -v -s 512 Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020071209.559062-6-gshan@redhat.com
2022-11-10KVM: selftests: memslot_perf_test: Support variable guest page sizeGavin Shan1-81/+129
The test case is obviously broken on aarch64 because non-4KB guest page size is supported. The guest page size on aarch64 could be 4KB, 16KB or 64KB. This supports variable guest page size, mostly for aarch64. - The host determines the guest page size when virtual machine is created. The value is also passed to guest through the synchronization area. - The number of guest pages are unknown until the virtual machine is to be created. So all the related macros are dropped. Instead, their values are dynamically calculated based on the guest page size. - The static checks on memory sizes and pages becomes dependent on guest page size, which is unknown until the virtual machine is about to be created. So all the static checks are converted to dynamic checks, done in check_memory_sizes(). - As the address passed to madvise() should be aligned to host page, the size of page chunk is automatically selected, other than one page. - MEM_TEST_MOVE_SIZE has fixed and non-working 64KB. It will be consolidated in next patch. However, the comments about how it's calculated has been correct. - All other changes included in this patch are almost mechanical replacing '4096' with 'guest_page_size'. Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020071209.559062-5-gshan@redhat.com
2022-11-10KVM: selftests: memslot_perf_test: Probe memory slots for onceGavin Shan1-13/+19
prepare_vm() is called in every iteration and run. The allowed memory slots (KVM_CAP_NR_MEMSLOTS) are probed for multiple times. It's not free and unnecessary. Move the probing logic for the allowed memory slots to parse_args() for once, which is upper layer of prepare_vm(). No functional change intended. Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020071209.559062-4-gshan@redhat.com
2022-11-10KVM: selftests: memslot_perf_test: Consolidate loop conditions in prepare_vm()Gavin Shan1-6/+5
There are two loops in prepare_vm(), which have different conditions. 'slot' is treated as meory slot index in the first loop, but index of the host virtual address array in the second loop. It makes it a bit hard to understand the code. Change the usage of 'slot' in the second loop, to treat it as the memory slot index either. No functional change intended. Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020071209.559062-3-gshan@redhat.com
2022-11-10KVM: selftests: memslot_perf_test: Use data->nslots in prepare_vm()Gavin Shan1-5/+5
In prepare_vm(), 'data->nslots' is assigned with 'max_mem_slots - 1' at the beginning, meaning they are interchangeable. Use 'data->nslots' isntead of 'max_mem_slots - 1'. With this, it becomes easier to move the logic of probing number of slots into upper layer in subsequent patches. No functional change intended. Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020071209.559062-2-gshan@redhat.com
2022-11-10KVM: arm64: Handle stage-2 faults in parallelOliver Upton4-28/+13
The stage-2 map walker has been made parallel-aware, and as such can be called while only holding the read side of the MMU lock. Rip out the conditional locking in user_mem_abort() and instead grab the read lock. Continue to take the write lock from other callsites to kvm_pgtable_stage2_map(). Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107220033.1895655-1-oliver.upton@linux.dev
2022-11-10KVM: arm64: Make table->block changes parallel-awareOliver Upton1-12/+3
stage2_map_walker_try_leaf() and friends now handle stage-2 PTEs generically, and perform the correct flush when a table PTE is removed. Additionally, they've been made parallel-aware, using an atomic break to take ownership of the PTE. Stop clearing the PTE in the pre-order callback and instead let stage2_map_walker_try_leaf() deal with it. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107220006.1895572-1-oliver.upton@linux.dev
2022-11-10KVM: arm64: Make leaf->leaf PTE changes parallel-awareOliver Upton1-14/+12
Convert stage2_map_walker_try_leaf() to use the new break-before-make helpers, thereby making the handler parallel-aware. As before, avoid the break-before-make if recreating the existing mapping. Additionally, retry execution if another vCPU thread is modifying the same PTE. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215934.1895478-1-oliver.upton@linux.dev
2022-11-10KVM: arm64: Make block->table PTE changes parallel-awareOliver Upton1-5/+75
In order to service stage-2 faults in parallel, stage-2 table walkers must take exclusive ownership of the PTE being worked on. An additional requirement of the architecture is that software must perform a 'break-before-make' operation when changing the block size used for mapping memory. Roll these two concepts together into helpers for performing a 'break-before-make' sequence. Use a special PTE value to indicate a PTE has been locked by a software walker. Additionally, use an atomic compare-exchange to 'break' the PTE when the stage-2 page tables are possibly shared with another software walker. Elide the DSB + TLBI if the evicted PTE was invalid (and thus not subject to break-before-make). All of the atomics do nothing for now, as the stage-2 walker isn't fully ready to perform parallel walks. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215855.1895367-1-oliver.upton@linux.dev
2022-11-10KVM: arm64: Split init and set for table PTEOliver Upton1-10/+10
Create a helper to initialize a table and directly call smp_store_release() to install it (for now). Prepare for a subsequent change that generalizes PTE writes with a helper. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-11-oliver.upton@linux.dev
2022-11-10KVM: arm64: Atomically update stage 2 leaf attributes in parallel walksOliver Upton1-9/+22
The stage2 attr walker is already used for parallel walks. Since commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging"), KVM acquires the read lock when write-unprotecting a PTE. However, the walker only uses a simple store to update the PTE. This is safe as the only possible race is with hardware updates to the access flag, which is benign. However, a subsequent change to KVM will allow more changes to the stage 2 page tables to be done in parallel. Prepare the stage 2 attribute walker by performing atomic updates to the PTE when walking in parallel. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-10-oliver.upton@linux.dev
2022-11-10KVM: arm64: Protect stage-2 traversal with RCUOliver Upton3-2/+71
Use RCU to safely walk the stage-2 page tables in parallel. Acquire and release the RCU read lock when traversing the page tables. Defer the freeing of table memory to an RCU callback. Indirect the calls into RCU and provide stubs for hypervisor code, as RCU is not available in such a context. The RCU protection doesn't amount to much at the moment, as readers are already protected by the read-write lock (all walkers that free table memory take the write lock). Nonetheless, a subsequent change will futher relax the locking requirements around the stage-2 MMU, thereby depending on RCU. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-9-oliver.upton@linux.dev
2022-11-10KVM: arm64: Tear down unlinked stage-2 subtree after break-before-makeOliver Upton4-63/+39
The break-before-make sequence is a bit annoying as it opens a window wherein memory is unmapped from the guest. KVM should replace the PTE as quickly as possible and avoid unnecessary work in between. Presently, the stage-2 map walker tears down a removed table before installing a block mapping when coalescing a table into a block. As the removed table is no longer visible to hardware walkers after the DSB+TLBI, it is possible to move the remaining cleanup to happen after installing the new PTE. Reshuffle the stage-2 map walker to install the new block entry in the pre-order callback. Unwire all of the teardown logic and replace it with a call to kvm_pgtable_stage2_free_removed() after fixing the PTE. The post-order visitor is now completely unnecessary, so drop it. Finally, touch up the comments to better represent the now simplified map walker. Note that the call to tear down the unlinked stage-2 is indirected as a subsequent change will use an RCU callback to trigger tear down. RCU is not available to pKVM, so there is a need to use different implementations on pKVM and non-pKVM VMs. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-8-oliver.upton@linux.dev
2022-11-10KVM: arm64: Use an opaque type for ptepsOliver Upton3-15/+23
Use an opaque type for pteps and require visitors explicitly dereference the pointer before using. Protecting page table memory with RCU requires that KVM dereferences RCU-annotated pointers before using. However, RCU is not available for use in the nVHE hypervisor and the opaque type can be conditionally annotated with RCU for the stage-2 MMU. Call the type a 'pteref' to avoid a naming collision with raw pteps. No functional change intended. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-7-oliver.upton@linux.dev
2022-11-10KVM: arm64: Add a helper to tear down unlinked stage-2 subtreesOliver Upton2-0/+34
A subsequent change to KVM will move the tear down of an unlinked stage-2 subtree out of the critical path of the break-before-make sequence. Introduce a new helper for tearing down unlinked stage-2 subtrees. Leverage the existing stage-2 free walkers to do so, with a deep call into __kvm_pgtable_walk() as the subtree is no longer reachable from the root. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-6-oliver.upton@linux.dev
2022-11-10KVM: arm64: Don't pass kvm_pgtable through kvm_pgtable_walk_dataOliver Upton1-13/+5
In order to tear down page tables from outside the context of kvm_pgtable (such as an RCU callback), stop passing a pointer through kvm_pgtable_walk_data. No functional change intended. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-5-oliver.upton@linux.dev
2022-11-10KVM: arm64: Pass mm_ops through the visitor contextOliver Upton3-41/+26
As a prerequisite for getting visitors off of struct kvm_pgtable, pass mm_ops through the visitor context. No functional change intended. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-4-oliver.upton@linux.dev
2022-11-10KVM: arm64: Stash observed pte value in visitor contextOliver Upton4-51/+48
Rather than reading the ptep all over the shop, read the ptep once from __kvm_pgtable_visit() and stick it in the visitor context. Reread the ptep after visiting a leaf in case the callback installed a new table underneath. No functional change intended. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-3-oliver.upton@linux.dev
2022-11-10KVM: arm64: Combine visitor arguments into a context structureOliver Upton4-156/+154
Passing new arguments by value to the visitor callbacks is extremely inflexible for stuffing new parameters used by only some of the visitors. Use a context structure instead and pass the pointer through to the visitor callback. While at it, redefine the 'flags' parameter to the visitor to contain the bit indicating the phase of the walk. Pass the entire set of flags through the context structure such that the walker can communicate additional state to the visitor callback. No functional change intended. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-2-oliver.upton@linux.dev
2022-11-10KVM: selftests: Automate choosing dirty ring size in dirty_log_testGavin Shan1-4/+22
In the dirty ring case, we rely on vcpu exit due to full dirty ring state. On ARM64 system, there are 4096 host pages when the host page size is 64KB. In this case, the vcpu never exits due to the full dirty ring state. The similar case is 4KB page size on host and 64KB page size on guest. The vcpu corrupts same set of host pages, but the dirty page information isn't collected in the main thread. This leads to infinite loop as the following log shows. # ./dirty_log_test -M dirty-ring -c 65536 -m 5 Setting log mode to: 'dirty-ring' Test iterations: 32, interval: 10 (ms) Testing guest mode: PA-bits:40, VA-bits:48, 4K pages guest physical test memory offset: 0xffbffe0000 vcpu stops because vcpu is kicked out... Notifying vcpu to continue vcpu continues now. Iteration 1 collected 576 pages <No more output afterwards> Fix the issue by automatically choosing the best dirty ring size, to ensure vcpu exit due to full dirty ring state. The option '-c' becomes a hint to the dirty ring count, instead of the value of it. Signed-off-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-8-gshan@redhat.com
2022-11-10KVM: selftests: Clear dirty ring states between two modes in dirty_log_testGavin Shan1-10/+17
There are two states, which need to be cleared before next mode is executed. Otherwise, we will hit failure as the following messages indicate. - The variable 'dirty_ring_vcpu_ring_full' shared by main and vcpu thread. It's indicating if the vcpu exit due to full ring buffer. The value can be carried from previous mode (VM_MODE_P40V48_4K) to current one (VM_MODE_P40V48_64K) when VM_MODE_P40V48_16K isn't supported. - The current ring buffer index needs to be reset before next mode (VM_MODE_P40V48_64K) is executed. Otherwise, the stale value is carried from previous mode (VM_MODE_P40V48_4K). # ./dirty_log_test -M dirty-ring Setting log mode to: 'dirty-ring' Test iterations: 32, interval: 10 (ms) Testing guest mode: PA-bits:40, VA-bits:48, 4K pages guest physical test memory offset: 0xffbfffc000 : Dirtied 995328 pages Total bits checked: dirty (1012434), clear (7114123), track_next (966700) Testing guest mode: PA-bits:40, VA-bits:48, 64K pages guest physical test memory offset: 0xffbffc0000 vcpu stops because vcpu is kicked out... vcpu continues now. Notifying vcpu to continue Iteration 1 collected 0 pages vcpu stops because dirty ring is full... vcpu continues now. vcpu stops because dirty ring is full... vcpu continues now. vcpu stops because dirty ring is full... ==== Test Assertion Failure ==== dirty_log_test.c:369: cleared == count pid=10541 tid=10541 errno=22 - Invalid argument 1 0x0000000000403087: dirty_ring_collect_dirty_pages at dirty_log_test.c:369 2 0x0000000000402a0b: log_mode_collect_dirty_pages at dirty_log_test.c:492 3 (inlined by) run_test at dirty_log_test.c:795 4 (inlined by) run_test at dirty_log_test.c:705 5 0x0000000000403a37: for_each_guest_mode at guest_modes.c:100 6 0x0000000000401ccf: main at dirty_log_test.c:938 7 0x0000ffff9ecd279b: ?? ??:0 8 0x0000ffff9ecd286b: ?? ??:0 9 0x0000000000401def: _start at ??:? Reset dirty pages (0) mismatch with collected (35566) Fix the issues by clearing 'dirty_ring_vcpu_ring_full' and the ring buffer index before next new mode is to be executed. Signed-off-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-7-gshan@redhat.com
2022-11-10KVM: selftests: Use host page size to map ring buffer in dirty_log_testGavin Shan1-1/+1
In vcpu_map_dirty_ring(), the guest's page size is used to figure out the offset in the virtual area. It works fine when we have same page sizes on host and guest. However, it fails when the page sizes on host and guest are different on arm64, like below error messages indicates. # ./dirty_log_test -M dirty-ring -m 7 Setting log mode to: 'dirty-ring' Test iterations: 32, interval: 10 (ms) Testing guest mode: PA-bits:40, VA-bits:48, 64K pages guest physical test memory offset: 0xffbffc0000 vcpu stops because vcpu is kicked out... Notifying vcpu to continue vcpu continues now. ==== Test Assertion Failure ==== lib/kvm_util.c:1477: addr == MAP_FAILED pid=9000 tid=9000 errno=0 - Success 1 0x0000000000405f5b: vcpu_map_dirty_ring at kvm_util.c:1477 2 0x0000000000402ebb: dirty_ring_collect_dirty_pages at dirty_log_test.c:349 3 0x00000000004029b3: log_mode_collect_dirty_pages at dirty_log_test.c:478 4 (inlined by) run_test at dirty_log_test.c:778 5 (inlined by) run_test at dirty_log_test.c:691 6 0x0000000000403a57: for_each_guest_mode at guest_modes.c:105 7 0x0000000000401ccf: main at dirty_log_test.c:921 8 0x0000ffffb06ec79b: ?? ??:0 9 0x0000ffffb06ec86b: ?? ??:0 10 0x0000000000401def: _start at ??:? Dirty ring mapped private Fix the issue by using host's page size to map the ring buffer. Signed-off-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-6-gshan@redhat.com
2022-11-10KVM: arm64: Enable ring-based dirty memory trackingGavin Shan6-1/+28
Enable ring-based dirty memory tracking on ARM64: - Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL. - Enable CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP. - Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page offset. - Add ARM64 specific kvm_arch_allow_write_without_running_vcpu() to keep the site of saving vgic/its tables out of the no-running-vcpu radar. Signed-off-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-5-gshan@redhat.com
2022-11-10KVM: Support dirty ring in conjunction with bitmapGavin Shan8-17/+112
ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is enabled. It's conflicting with that ring-based dirty page tracking always requires a running VCPU context. Introduce a new flavor of dirty ring that requires the use of both VCPU dirty rings and a dirty bitmap. The expectation is that for non-VCPU sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to the dirty bitmap. Userspace should scan the dirty bitmap before migrating the VM to the target. Use an additional capability to advertise this behavior. The newly added capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Suggested-by: Marc Zyngier <maz@kernel.org> Suggested-by: Peter Xu <peterx@redhat.com> Co-developed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-4-gshan@redhat.com
2022-11-10KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.hGavin Shan2-2/+1
Not all architectures like ARM64 need to override the function. Move its declaration to kvm_dirty_ring.h to avoid the following compiling warning on ARM64 when the feature is enabled. arch/arm64/kvm/../../../virt/kvm/dirty_ring.c:14:12: \ warning: no previous prototype for 'kvm_cpu_dirty_log_size' \ [-Wmissing-prototypes] \ int __weak kvm_cpu_dirty_log_size(void) Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-3-gshan@redhat.com
2022-11-10KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULLGavin Shan5-25/+46
The VCPU isn't expected to be runnable when the dirty ring becomes soft full, until the dirty pages are harvested and the dirty ring is reset from userspace. So there is a check in each guest's entrace to see if the dirty ring is soft full or not. The VCPU is stopped from running if its dirty ring has been soft full. The similar check will be needed when the feature is going to be supported on ARM64. As Marc Zyngier suggested, a new event will avoid pointless overhead to check the size of the dirty ring ('vcpu->kvm->dirty_ring_size') in each guest's entrance. Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring becomes soft full in kvm_dirty_ring_push(). The event is only cleared in the check, done in the newly added helper kvm_dirty_ring_check_request(). Since the VCPU is not runnable when the dirty ring becomes soft full, the KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent the VCPU from running until the dirty pages are harvested and the dirty ring is reset by userspace. kvm_dirty_ring_soft_full() becomes a private function with the newly added helper kvm_dirty_ring_check_request(). The alignment for the various event definitions in kvm_host.h is changed to tab character by the way. In order to avoid using 'container_of()', the argument @ring is replaced by @vcpu in kvm_dirty_ring_push(). Link: https://lore.kernel.org/kvmarm/87lerkwtm5.wl-maz@kernel.org Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-2-gshan@redhat.com
2022-11-10Merge tag 'kvmarm-fixes-6.1-3' into kvm-arm64/dirty-ringMarc Zyngier7-36/+34
KVM/arm64 fixes for 6.1, take #3 - Fix the pKVM stage-1 walker erronously using the stage-2 accessor - Correctly convert vcpu->kvm to a hyp pointer when generating an exception in a nVHE+MTE configuration - Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them - Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE - Document the boot requirements for FGT when entering the kernel at EL1 Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-11-09KVM: replace direct irq.h inclusionPaolo Bonzini9-61/+34
virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source directory (i.e. not from arch/*/include) for the sole purpose of retrieving irqchip_in_kernel. Making the function inline in a header that is already included, such as asm/kvm_host.h, is not possible because it needs to look at struct kvm which is defined after asm/kvm_host.h is included. So add a kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is only performance critical on arm64 and x86, and the non-inline function is enough on all other architectures. irq.h can then be deleted from all architectures except x86. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counterLike Xu4-21/+22
Defer reprogramming counters and handling overflow via KVM_REQ_PMU when incrementing counters. KVM skips emulated WRMSR in the VM-Exit fastpath, the fastpath runs with IRQs disabled, skipping instructions can increment and reprogram counters, reprogramming counters can sleep, and sleeping is disallowed while IRQs are disabled. [*] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 [*] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM [*] preempt_count: 1, expected: 0 [*] RCU nest depth: 0, expected: 0 [*] INFO: lockdep is turned off. [*] irq event stamp: 0 [*] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [*] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0 [*] softirqs last enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0 [*] softirqs last disabled at (0): [<0000000000000000>] 0x0 [*] Preemption disabled at: [*] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm] [*] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2 [*] Call Trace: [*] <TASK> [*] dump_stack_lvl+0x6c/0x9b [*] __might_resched.cold+0x22e/0x297 [*] __mutex_lock+0xc0/0x23b0 [*] perf_event_ctx_lock_nested+0x18f/0x340 [*] perf_event_pause+0x1a/0x110 [*] reprogram_counter+0x2af/0x1490 [kvm] [*] kvm_pmu_trigger_event+0x429/0x950 [kvm] [*] kvm_skip_emulated_instruction+0x48/0x90 [kvm] [*] handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm] [*] vmx_vcpu_run+0x268e/0x3b80 [kvm_intel] [*] vcpu_enter_guest+0x1d22/0x3dc0 [kvm] Add a field to kvm_pmc to track the previous counter value in order to defer overflow detection to kvm_pmu_handle_event() (the counter must be paused before handling overflow, and that may increment the counter). Opportunistically shrink sizeof(struct kvm_pmc) a bit. Suggested-by: Wanpeng Li <wanpengli@tencent.com> Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20220831085328.45489-6-likexu@tencent.com [sean: avoid re-triggering KVM_REQ_PMU on overflow, tweak changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()Like Xu5-10/+22
Batch reprogramming PMU counters by setting KVM_REQ_PMU and thus deferring reprogramming kvm_pmu_handle_event() to avoid reprogramming a counter multiple times during a single VM-Exit. Deferring programming will also allow KVM to fix a bug where immediately reprogramming a counter can result in sleeping (taking a mutex) while interrupts are disabled in the VM-Exit fastpath. Introduce kvm_pmu_request_counter_reprogam() to make it obvious that KVM is _requesting_ a reprogram and not actually doing the reprogram. Opportunistically refine related comments to avoid misunderstandings. Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20220831085328.45489-5-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Clear "reprogram" bit if counter is disabled or disallowedSean Christopherson1-14/+24
When reprogramming a counter, clear the counter's "reprogram pending" bit if the counter is disabled (by the guest) or is disallowed (by the userspace filter). In both cases, there's no need to re-attempt programming on the next coincident KVM_REQ_PMU as enabling the counter by either method will trigger reprogramming. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Force reprogramming of all counters on PMU filter changeSean Christopherson2-2/+22
Force vCPUs to reprogram all counters on a PMU filter change to provide a sane ABI for userspace. Use the existing KVM_REQ_PMU to do the programming, and take advantage of the fact that the reprogram_pmi bitmap fits in a u64 to set all bits in a single atomic update. Note, setting the bitmap and making the request needs to be done _after_ the SRCU synchronization to ensure that vCPUs will reprogram using the new filter. KVM's current "lazy" approach is confusing and non-deterministic. It's confusing because, from a developer perspective, the code is buggy as it makes zero sense to let userspace modify the filter but then not actually enforce the new filter. The lazy approach is non-deterministic because KVM enforces the filter whenever a counter is reprogrammed, not just on guest WRMSRs, i.e. a guest might gain/lose access to an event at random times depending on what is going on in the host. Note, the resulting behavior is still non-determinstic while the filter is in flux. If userspace wants to guarantee deterministic behavior, all vCPUs should be paused during the filter update. Jim Mattson <jmattson@google.com> Fixes: 66bb8a065f5a ("KVM: x86: PMU Event Filter") Cc: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: WARN if TDP MMU SP disallows hugepage after being zappedSean Christopherson1-4/+3
Extend the accounting sanity check in kvm_recover_nx_huge_pages() to the TDP MMU, i.e. verify that zapping a shadow page unaccounts the disallowed NX huge page regardless of the MMU type. Recovery runs while holding mmu_lock for write and so it should be impossible to get false positives on the WARN. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()Mingwei Zhang1-1/+2
Explicitly check if a NX huge page is disallowed when determining if a page fault needs to be forced to use a smaller sized page. KVM currently assumes that the NX huge page mitigation is the only scenario where KVM will force a shadow page instead of a huge page, and so unnecessarily keeps an existing shadow page instead of replacing it with a huge page. Any scenario that causes KVM to zap leaf SPTEs may result in having a SP that can be made huge without violating the NX huge page mitigation. E.g. prior to commit 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging"), KVM would keep shadow pages after disabling dirty logging due to a live migration being canceled, resulting in degraded performance due to running with 4kb pages instead of huge pages. Although the dirty logging case is "fixed", that fix is coincidental, i.e. is an implementation detail, and there are other scenarios where KVM will zap leaf SPTEs. E.g. zapping leaf SPTEs in response to a host page migration (mmu_notifier invalidation) to create a huge page would yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume a huge page is disallowed. Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> [sean: use spte_to_child_sp(), massage changelog, fold into if-statement] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Add helper to convert SPTE value to its shadow pageSean Christopherson4-19/+29
Add a helper to convert a SPTE to its shadow page to deduplicate a variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to get the shadow page for a SPTE without dropping high bits. Opportunistically add a comment in mmu_free_root_page() documenting why it treats the root HPA as a SPTE. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pagesSean Christopherson2-19/+12
Track the number of TDP MMU "shadow" pages instead of tracking the pages themselves. With the NX huge page list manipulation moved out of the common linking flow, elminating the list-based tracking means the happy path of adding a shadow page doesn't need to acquire a spinlock and can instead inc/dec an atomic. Keep the tracking as the WARN during TDP MMU teardown on leaked shadow pages is very, very useful for detecting KVM bugs. Tracking the number of pages will also make it trivial to expose the counter to userspace as a stat in the future, which may or may not be desirable. Note, the TDP MMU needs to use a separate counter (and stat if that ever comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the TDP MMU consumes so few pages relative to shadow paging), and including TDP MMU pages in that counter would break both the shrinker and shadow MMUs, e.g. if a VM is using nested TDP. Cc: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTESean Christopherson3-25/+39
Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP visible to other readers, i.e. before setting its SPTE. This will allow KVM to query the flag when determining if a shadow page can be replaced by a NX huge page without violating the rules of the mitigation. Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible for another CPU to see a shadow page without an up-to-date nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated dance. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUsSean Christopherson2-1/+13
Account and track NX huge pages for nonpaging MMUs so that a future enhancement to precisely check if a shadow page can't be replaced by a NX huge page doesn't get false positives. Without correct tracking, KVM can get stuck in a loop if an instruction is fetching and writing data on the same huge page, e.g. KVM installs a small executable page on the fetch fault, replaces it with an NX huge page on the write fault, and faults again on the fetch. Alternatively, and perhaps ideally, KVM would simply not enforce the workaround for nonpaging MMUs. The guest has no page tables to abuse and KVM is guaranteed to switch to a different MMU on CR0.PG being toggled so there's no security or performance concerns. However, getting make_spte() to play nice now and in the future is unnecessarily complex. In the current code base, make_spte() can enforce the mitigation if TDP is enabled or the MMU is indirect, but make_spte() may not always have a vCPU/MMU to work with, e.g. if KVM were to support in-line huge page promotion when disabling dirty logging. Without a vCPU/MMU, KVM could either pass in the correct information and/or derive it from the shadow page, but the former is ugly and the latter subtly non-trivial due to the possibility of direct shadow pages in indirect MMUs. Given that using shadow paging with an unpaged guest is far from top priority _and_ has been subjected to the workaround since its inception, keep it simple and just fix the accounting glitch. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20221019165618.927057-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Rename NX huge pages fields/functions for consistencySean Christopherson5-51/+71
Rename most of the variables/functions involved in the NX huge page mitigation to provide consistency, e.g. lpage vs huge page, and NX huge vs huge NX, and also to provide clarity, e.g. to make it obvious the flag applies only to the NX huge page mitigation, not to any condition that prevents creating a huge page. Add a comment explaining what the newly named "possible_nx_huge_pages" tracks. Leave the nx_lpage_splits stat alone as the name is ABI and thus set in stone. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20221019165618.927057-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Tag disallowed NX huge pages even if they're not trackedSean Christopherson4-13/+39
Tag shadow pages that cannot be replaced with an NX huge page regardless of whether or not zapping the page would allow KVM to immediately create a huge page, e.g. because something else prevents creating a huge page. I.e. track pages that are disallowed from being NX huge pages regardless of whether or not the page could have been huge at the time of fault. KVM currently tracks pages that were disallowed from being huge due to the NX workaround if and only if the page could otherwise be huge. But that fails to handled the scenario where whatever restriction prevented KVM from installing a huge page goes away, e.g. if dirty logging is disabled, the host mapping level changes, etc... Failure to tag shadow pages appropriately could theoretically lead to false negatives, e.g. if a fetch fault requests a small page and thus isn't tracked, and a read/write fault later requests a huge page, KVM will not reject the huge page as it should. To avoid yet another flag, initialize the list_head and use list_empty() to determine whether or not a page is on the list of NX huge pages that should be recovered. Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is more involved due to mmu_lock being held for read. This will be addressed in a future commit. Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>