summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/x86.c
AgeCommit message (Collapse)AuthorFilesLines
2023-11-06x86: kvm Require const tsc for RTThomas Gleixner1-0/+8
Non constant TSC is a nightmare on bare metal already, but with virtualization it becomes a complete disaster because the workarounds are horrible latency wise. That's also a preliminary for running RT in a guest on top of a RT host. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2021-10-31Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-14/+22
Pull kvm fixes from Paolo Bonzini: - Fixes for s390 interrupt delivery - Fixes for Xen emulator bugs showing up as debug kernel WARNs - Fix another issue with SEV/ES string I/O VMGEXITs * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Take srcu lock in post_kvm_run_save() KVM: SEV-ES: fix another issue with string I/O VMGEXITs KVM: x86/xen: Fix kvm_xen_has_interrupt() sleeping in kvm_vcpu_block() KVM: x86: switch pvclock_gtod_sync_lock to a raw spinlock KVM: s390: preserve deliverable_mask in __airqs_kick_single_vcpu KVM: s390: clear kicked_mask before sleeping again
2021-10-28KVM: x86: Take srcu lock in post_kvm_run_save()David Woodhouse1-0/+8
The Xen interrupt injection for event channels relies on accessing the guest's vcpu_info structure in __kvm_xen_has_interrupt(), through a gfn_to_hva_cache. This requires the srcu lock to be held, which is mostly the case except for this code path: [ 11.822877] WARNING: suspicious RCU usage [ 11.822965] ----------------------------- [ 11.823013] include/linux/kvm_host.h:664 suspicious rcu_dereference_check() usage! [ 11.823131] [ 11.823131] other info that might help us debug this: [ 11.823131] [ 11.823196] [ 11.823196] rcu_scheduler_active = 2, debug_locks = 1 [ 11.823253] 1 lock held by dom:0/90: [ 11.823292] #0: ffff998956ec8118 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x680 [ 11.823379] [ 11.823379] stack backtrace: [ 11.823428] CPU: 2 PID: 90 Comm: dom:0 Kdump: loaded Not tainted 5.4.34+ #5 [ 11.823496] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 11.823612] Call Trace: [ 11.823645] dump_stack+0x7a/0xa5 [ 11.823681] lockdep_rcu_suspicious+0xc5/0x100 [ 11.823726] __kvm_xen_has_interrupt+0x179/0x190 [ 11.823773] kvm_cpu_has_extint+0x6d/0x90 [ 11.823813] kvm_cpu_accept_dm_intr+0xd/0x40 [ 11.823853] kvm_vcpu_ready_for_interrupt_injection+0x20/0x30 < post_kvm_run_save() inlined here > [ 11.823906] kvm_arch_vcpu_ioctl_run+0x135/0x6a0 [ 11.823947] kvm_vcpu_ioctl+0x263/0x680 Fixes: 40da8ccd724f ("KVM: x86/xen: Add event channel interrupt vector upcall") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Cc: stable@vger.kernel.org Message-Id: <606aaaf29fca3850a63aa4499826104e77a72346.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-25KVM: x86: switch pvclock_gtod_sync_lock to a raw spinlockDavid Woodhouse1-14/+14
On the preemption path when updating a Xen guest's runstate times, this lock is taken inside the scheduler rq->lock, which is a raw spinlock. This was shown in a lockdep warning: [ 89.138354] ============================= [ 89.138356] [ BUG: Invalid wait context ] [ 89.138358] 5.15.0-rc5+ #834 Tainted: G S I E [ 89.138360] ----------------------------- [ 89.138361] xen_shinfo_test/2575 is trying to lock: [ 89.138363] ffffa34a0364efd8 (&kvm->arch.pvclock_gtod_sync_lock){....}-{3:3}, at: get_kvmclock_ns+0x1f/0x130 [kvm] [ 89.138442] other info that might help us debug this: [ 89.138444] context-{5:5} [ 89.138445] 4 locks held by xen_shinfo_test/2575: [ 89.138447] #0: ffff972bdc3b8108 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x77/0x6f0 [kvm] [ 89.138483] #1: ffffa34a03662e90 (&kvm->srcu){....}-{0:0}, at: kvm_arch_vcpu_ioctl_run+0xdc/0x8b0 [kvm] [ 89.138526] #2: ffff97331fdbac98 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xff/0xbd0 [ 89.138534] #3: ffffa34a03662e90 (&kvm->srcu){....}-{0:0}, at: kvm_arch_vcpu_put+0x26/0x170 [kvm] ... [ 89.138695] get_kvmclock_ns+0x1f/0x130 [kvm] [ 89.138734] kvm_xen_update_runstate+0x14/0x90 [kvm] [ 89.138783] kvm_xen_update_runstate_guest+0x15/0xd0 [kvm] [ 89.138830] kvm_arch_vcpu_put+0xe6/0x170 [kvm] [ 89.138870] kvm_sched_out+0x2f/0x40 [kvm] [ 89.138900] __schedule+0x5de/0xbd0 Cc: stable@vger.kernel.org Reported-by: syzbot+b282b65c2c68492df769@syzkaller.appspotmail.com Fixes: 30b5c851af79 ("KVM: x86/xen: Add support for vCPU runstate information") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <1b02a06421c17993df337493a68ba923f3bd5c0f.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22KVM: SEV-ES: go over the sev_pio_data buffer in multiple passes if neededPaolo Bonzini1-16/+56
The PIO scratch buffer is larger than a single page, and therefore it is not possible to copy it in a single step to vcpu->arch/pio_data. Bound each call to emulator_pio_in/out to a single page; keep track of how many I/O operations are left in vcpu->arch.sev_pio_count, so that the operation can be restarted in the complete_userspace_io callback. For OUT, this means that the previous kvm_sev_es_outs implementation becomes an iterator of the loop, and we can consume the sev_pio_data buffer before leaving to userspace. For IN, instead, consuming the buffer and decreasing sev_pio_count is always done in the complete_userspace_io callback, because that is when the memcpy is done into sev_pio_data. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reported-by: Felix Wilhelm <fwilhelm@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22KVM: SEV-ES: keep INS functions togetherPaolo Bonzini1-9/+9
Make the diff a little nicer when we actually get to fixing the bug. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22KVM: x86: remove unnecessary arguments from complete_emulator_pio_inPaolo Bonzini1-5/+6
complete_emulator_pio_in can expect that vcpu->arch.pio has been filled in, and therefore does not need the size and count arguments. This makes things nicer when the function is called directly from a complete_userspace_io callback. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22KVM: x86: split the two parts of emulator_pio_inPaolo Bonzini1-17/+28
emulator_pio_in handles both the case where the data is pending in vcpu->arch.pio.count, and the case where I/O has to be done via either an in-kernel device or a userspace exit. For SEV-ES we would like to split these, to identify clearly the moment at which the sev_pio_data is consumed. To this end, create two different functions: __emulator_pio_in fills in vcpu->arch.pio.count, while complete_emulator_pio_in clears it and releases vcpu->arch.pio.data. Because this patch has to be backported, things are left a bit messy. kernel_pio() operates on vcpu->arch.pio, which leads to emulator_pio_in() having with two calls to complete_emulator_pio_in(). It will be fixed in the next release. While at it, remove the unused void* val argument of emulator_pio_in_out. The function currently hardcodes vcpu->arch.pio_data as the source/destination buffer, which sucks but will be fixed after the more severe SEV-ES buffer overflow. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22KVM: SEV-ES: clean up kvm_sev_es_ins/outsPaolo Bonzini1-16/+15
A few very small cleanups to the functions, smushed together because the patch is already very small like this: - inline emulator_pio_in_emulated and emulator_pio_out_emulated, since we already have the vCPU - remove the data argument and pull setting vcpu->arch.sev_pio_data into the caller - remove unnecessary clearing of vcpu->arch.pio.count when emulation is done by the kernel (and therefore vcpu->arch.pio.count is already clear on exit from emulator_pio_in and emulator_pio_out). No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22KVM: x86: leave vcpu->arch.pio.count alone in emulator_pio_in_outPaolo Bonzini1-4/+9
Currently emulator_pio_in clears vcpu->arch.pio.count twice if emulator_pio_in_out performs kernel PIO. Move the clear into emulator_pio_out where it is actually necessary. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22KVM: SEV-ES: rename guest_ins_data to sev_pio_dataPaolo Bonzini1-2/+2
We will be using this field for OUTS emulation as well, in case the data that is pushed via OUTS spans more than one page. In that case, there will be a need to save the data pointer across exits to userspace. So, change the name to something that refers to any kind of PIO. Also spell out what it is used for, namely SEV-ES. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-21KVM: x86: check for interrupts before deciding whether to exit the fast pathPaolo Bonzini1-5/+5
The kvm_x86_sync_pir_to_irr callback can sometimes set KVM_REQ_EVENT. If that happens exactly at the time that an exit is handled as EXIT_FASTPATH_REENTER_GUEST, vcpu_enter_guest will go incorrectly through the loop that calls kvm_x86_run, instead of processing the request promptly. Fixes: 379a3c8ee444 ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath") Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18KVM: X86: fix lazy allocation of rmapsPaolo Bonzini1-1/+2
If allocation of rmaps fails, but some of the pointers have already been written, those pointers can be cleaned up when the memslot is freed, or even reused later for another attempt at allocating the rmaps. Therefore there is no need to WARN, as done for example in memslot_rmap_alloc, but the allocation *must* be skipped lest KVM will overwrite the previous pointer and will indeed leak memory. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22kvm: x86: Add AMD PMU MSRs to msrs_to_save_all[]Fares Mehanna1-0/+7
Intel PMU MSRs is in msrs_to_save_all[], so add AMD PMU MSRs to have a consistent behavior between Intel and AMD when using KVM_GET_MSRS, KVM_SET_MSRS or KVM_GET_MSR_INDEX_LIST. We have to add legacy and new MSRs to handle guests running without X86_FEATURE_PERFCTR_CORE. Signed-off-by: Fares Mehanna <faresx@amazon.de> Message-Id: <20210915133951.22389-1-faresx@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22KVM: x86: reset pdptrs_from_userspace when exiting smmMaxim Levitsky1-0/+7
When exiting SMM, pdpts are loaded again from the guest memory. This fixes a theoretical bug, when exit from SMM triggers entry to the nested guest which re-uses some of the migration code which uses this flag as a workaround for a legacy userspace. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22KVM: x86: Identify vCPU0 by its vcpu_idx instead of its vCPUs array entrySean Christopherson1-1/+1
Use vcpu_idx to identify vCPU0 when updating HyperV's TSC page, which is shared by all vCPUs and "owned" by vCPU0 (because vCPU0 is the only vCPU that's guaranteed to exist). Using kvm_get_vcpu() to find vCPU works, but it's a rather odd and suboptimal method to check the index of a given vCPU. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210910183220.2397812-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22KVM: x86: Handle SRCU initialization failure during page track initHaimin Zhang1-1/+6
Check the return of init_srcu_struct(), which can fail due to OOM, when initializing the page track mechanism. Lack of checking leads to a NULL pointer deref found by a modified syzkaller. Reported-by: TCS Robot <tcs_robot@tencent.com> Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com> Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com> [Move the call towards the beginning of kvm_arch_init_vm. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22KVM: x86: Clear KVM's cached guest CR3 at RESET/INITSean Christopherson1-0/+3
Explicitly zero the guest's CR3 and mark it available+dirty at RESET/INIT. Per Intel's SDM and AMD's APM, CR3 is zeroed at both RESET and INIT. For RESET, this is a nop as vcpu is zero-allocated. For INIT, the bug has likely escaped notice because no firmware/kernel puts its page tables root at PA=0, let alone relies on INIT to get the desired CR3 for such page tables. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210921000303.400537-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22KVM: x86: Mark all registers as avail/dirty at vCPU creationSean Christopherson1-0/+2
Mark all registers as available and dirty at vCPU creation, as the vCPU has obviously not been loaded into hardware, let alone been given the chance to be modified in hardware. On SVM, reading from "uninitialized" hardware is a non-issue as VMCBs are zero allocated (thus not truly uninitialized) and hardware does not allow for arbitrary field encoding schemes. On VMX, backing memory for VMCSes is also zero allocated, but true initialization of the VMCS _technically_ requires VMWRITEs, as the VMX architectural specification technically allows CPU implementations to encode fields with arbitrary schemes. E.g. a CPU could theoretically store the inverted value of every field, which would result in VMREAD to a zero-allocated field returns all ones. In practice, only the AR_BYTES fields are known to be manipulated by hardware during VMREAD/VMREAD; no known hardware or VMM (for nested VMX) does fancy encoding of cacheable field values (CR0, CR3, CR4, etc...). In other words, this is technically a bug fix, but practically speakings it's a glorified nop. Failure to mark registers as available has been a lurking bug for quite some time. The original register caching supported only GPRs (+RIP, which is kinda sorta a GPR), with the masks initialized at ->vcpu_reset(). That worked because the two cacheable registers, RIP and RSP, are generally speaking not read as side effects in other flows. Arguably, commit aff48baa34c0 ("KVM: Fetch guest cr3 from hardware on demand") was the first instance of failure to mark regs available. While _just_ marking CR3 available during vCPU creation wouldn't have fixed the VMREAD from an uninitialized VMCS bug because ept_update_paging_mode_cr0() unconditionally read vmcs.GUEST_CR3, marking CR3 _and_ intentionally not reading GUEST_CR3 when it's available would have avoided VMREAD to a technically-uninitialized VMCS. Fixes: aff48baa34c0 ("KVM: Fetch guest cr3 from hardware on demand") Fixes: 6de4f3ada40b ("KVM: Cache pdptrs") Fixes: 6de12732c42c ("KVM: VMX: Optimize vmx_get_rflags()") Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields") Fixes: bd31fe495d0d ("KVM: VMX: Add proper cache tracking for CR0") Fixes: f98c1e77127d ("KVM: VMX: Add proper cache tracking for CR4") Fixes: 5addc235199f ("KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags") Fixes: 8791585837f6 ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210921000303.400537-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is ↵Zelin Deng1-0/+4
adjusted When MSR_IA32_TSC_ADJUST is written by guest due to TSC ADJUST feature especially there's a big tsc warp (like a new vCPU is hot-added into VM which has been up for a long time), tsc_offset is added by a large value then go back to guest. This causes system time jump as tsc_timestamp is not adjusted in the meantime and pvclock monotonic character. To fix this, just notify kvm to update vCPU's guest time before back to guest. Cc: stable@vger.kernel.org Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <1619576521-81399-2-git-send-email-zelin.deng@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20KVM: x86: implement KVM_GUESTDBG_BLOCKIRQMaxim Levitsky1-0/+4
KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts while running. This change is mostly intended for more robust single stepping of the guest and it has the following benefits when enabled: * Resuming from a breakpoint is much more reliable. When resuming execution from a breakpoint, with interrupts enabled, more often than not, KVM would inject an interrupt and make the CPU jump immediately to the interrupt handler and eventually return to the breakpoint, to trigger it again. From the user point of view it looks like the CPU never executed a single instruction and in some cases that can even prevent forward progress, for example, when the breakpoint is placed by an automated script (e.g lx-symbols), which does something in response to the breakpoint and then continues the guest automatically. If the script execution takes enough time for another interrupt to arrive, the guest will be stuck on the same breakpoint RIP forever. * Normal single stepping is much more predictable, since it won't land the debugger into an interrupt handler. * RFLAGS.TF has less chance to be leaked to the guest: We set that flag behind the guest's back to do single stepping but if single step lands us into an interrupt/exception handler it will be leaked to the guest in the form of being pushed to the stack. This doesn't completely eliminate this problem as exceptions can still happen, but at least this reduces the chances of this happening. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210811122927.900604-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20KVM: x86/mmu: Add detailed page size statsMingwei Zhang1-1/+3
Existing KVM code tracks the number of large pages regardless of their sizes. Therefore, when large page of 1GB (or larger) is adopted, the information becomes less useful because lpages counts a mix of 1G and 2M pages. So remove the lpages since it is easy for user space to aggregate the info. Instead, provide a comprehensive page stats of all sizes from 4K to 512G. Suggested-by: Ben Gardon <bgardon@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Cc: Jing Zhang <jingzhangos@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Sean Christopherson <seanjc@google.com> Message-Id: <20210803044607.599629-4-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20KVM: stats: Support linear and logarithmic histogram statisticsJing Zhang1-4/+0
Add new types of KVM stats, linear and logarithmic histogram. Histogram are very useful for observing the value distribution of time or size related stats. Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210802165633.1866976-2-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20KVM: SVM: avoid refreshing avic if its state didn't changeMaxim Levitsky1-1/+8
Since AVIC can be inhibited and uninhibited rapidly it is possible that we have nothing to do by the time the svm_refresh_apicv_exec_ctrl is called. Detect and avoid this, which will be useful when we will start calling avic_vcpu_load/avic_vcpu_put when the avic inhibition state changes. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210810205251.424103-14-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20KVM: x86: APICv: fix race in kvm_request_apicv_update on SVMMaxim Levitsky1-15/+24
Currently on SVM, the kvm_request_apicv_update toggles the APICv memslot without doing any synchronization. If there is a mismatch between that memslot state and the AVIC state, on one of the vCPUs, an APIC mmio access can be lost: For example: VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT VCPU1: access an APIC mmio register. Since AVIC is still disabled on VCPU1, the access will not be intercepted by it, and neither will it cause MMIO fault, but rather it will just be read/written from/to the dummy page mapped into the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT. Fix that by adding a lock guarding the AVIC state changes, and carefully order the operations of kvm_request_apicv_update to avoid this race: 1. Take the lock 2. Send KVM_REQ_APICV_UPDATE 3. Update the apic inhibit reason 4. Release the lock This ensures that at (2) all vCPUs are kicked out of the guest mode, but don't yet see the new avic state. Then only after (4) all other vCPUs can update their AVIC state and resume. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210810205251.424103-10-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20KVM: x86: don't disable APICv memslot when inhibitedMaxim Levitsky1-13/+8
Thanks to the former patches, it is now possible to keep the APICv memslot always enabled, and it will be invisible to the guest when it is inhibited This code is based on a suggestion from Sean Christopherson: https://lkml.org/lkml/2021/7/19/2970 Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210810205251.424103-9-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20KVM: X86: Introduce kvm_mmu_slot_lpages() helpersPeter Xu1-4/+2
Introduce kvm_mmu_slot_lpages() to calculcate lpage_info and rmap array size. The other __kvm_mmu_slot_lpages() can take an extra parameter of npages rather than fetching from the memslot pointer. Start to use the latter one in kvm_alloc_memslot_metadata(). Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220455.26054-4-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13KVM: x86: Kill off __ex() and __kvm_handle_fault_on_reboot()Sean Christopherson1-1/+8
Remove the __kvm_handle_fault_on_reboot() and __ex() macros now that all VMX and SVM instructions use asm goto to handle the fault (or in the case of VMREAD, completely custom logic). Drop kvm_spurious_fault()'s asmlinkage annotation as __kvm_handle_fault_on_reboot() was the only flow that invoked it from assembly code. Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Like Xu <like.xu.linux@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210809173955.1710866-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13KVM: VMX: Reset DR6 only when KVM_DEBUGREG_WONT_EXITPaolo Bonzini1-6/+0
The commit efdab992813fb ("KVM: x86: fix escape of guest dr6 to the host") fixed a bug by resetting DR6 unconditionally when the vcpu being scheduled out. But writing to debug registers is slow, and it can be visible in perf results sometimes, even if neither the host nor the guest activate breakpoints. Since KVM_DEBUGREG_WONT_EXIT on Intel processors is the only case where DR6 gets the guest value, and it never happens at all on SVM, the register can be cleared in vmx.c right after reading it. Reported-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXITPaolo Bonzini1-1/+0
Commit c77fb5fe6f03 ("KVM: x86: Allow the guest to run with dirty debug registers") allows the guest accessing to DRs without exiting when KVM_DEBUGREG_WONT_EXIT and we need to ensure that they are synchronized on entry to the guest---including DR6 that was not synced before the commit. But the commit sets the hardware DR6 not only when KVM_DEBUGREG_WONT_EXIT, but also when KVM_DEBUGREG_BP_ENABLED. The second case is unnecessary and just leads to a more case which leaks stale DR6 to the host which has to be resolved by unconditionally reseting DR6 in kvm_arch_vcpu_put(). Even if KVM_DEBUGREG_WONT_EXIT, however, setting the host DR6 only matters on VMX because SVM always uses the DR6 value from the VMCB. So move this line to vmx.c and make it conditional on KVM_DEBUGREG_WONT_EXIT. Reported-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13KVM: X86: Remove unneeded KVM_DEBUGREG_RELOADLai Jiangshan1-3/+0
Commit ae561edeb421 ("KVM: x86: DR0-DR3 are not clear on reset") added code to ensure eff_db are updated when they're modified through non-standard paths. But there is no reason to also update hardware DRs unless hardware breakpoints are active or DR exiting is disabled, and in those cases updating hardware is handled by KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_BP_ENABLED. KVM_DEBUGREG_RELOAD just causes unnecesarry load of hardware DRs and is better to be removed. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210809174307.145263-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-05KVM: xen: do not use struct gfn_to_hva_cachePaolo Bonzini1-0/+1
gfn_to_hva_cache is not thread-safe, so it is usually used only within a vCPU (whose code is protected by vcpu->mutex). The Xen interface implementation has such a cache in kvm->arch, but it is not really used except to store the location of the shared info page. Replace shinfo_set and shinfo_cache with just the value that is passed via KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the initialization value is not zero anymore and therefore kvm_xen_init_vm needs to be introduced. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03KVM: const-ify all relevant uses of struct kvm_memory_slotHamza Mahfooz1-5/+2
As alluded to in commit f36f3f2846b5 ("KVM: add "new" argument to kvm_arch_commit_memory_region"), a bunch of other places where struct kvm_memory_slot is used, needs to be refactored to preserve the "const"ness of struct kvm_memory_slot across-the-board. Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> Message-Id: <20210713023338.57108-1-someguy@effective-light.com> [Do not touch body of slot_rmap_walk_init. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Preserve guest's CR0.CD/NW on INITSean Christopherson1-1/+13
Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as defined by both Intel's SDM and AMD's APM. Note, current versions of Intel's SDM are very poorly written with respect to INIT behavior. Table 9-1. "IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT" quite clearly lists power-up, RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1. But the SDM then attempts to qualify CD/NW behavior in a footnote: 2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits are cleared. Presumably that footnote is only meant for INIT, as the RESET case and especially the power-up case are rather non-sensical. Another footnote all but confirms that: 6. Internal caches are invalid after power-up and RESET, but left unchanged with an INIT. Bare metal testing shows that CD/NW are indeed preserved on INIT (someone else can hack their BIOS to check RESET and power-up :-D). Reported-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-47-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: SVM: Emulate #INIT in response to triple fault shutdownSean Christopherson1-0/+1
Emulate a full #INIT instead of simply initializing the VMCB if the guest hits a shutdown. Initializing the VMCB but not other vCPU state, much of which is mirrored by the VMCB, results in incoherent and broken vCPU state. Ideally, KVM would not automatically init anything on shutdown, and instead put the vCPU into e.g. KVM_MP_STATE_UNINITIALIZED and force userspace to explicitly INIT or RESET the vCPU. Even better would be to add KVM_MP_STATE_SHUTDOWN, since technically NMI can break shutdown (and SMI on Intel CPUs). But, that ship has sailed, and emulating #INIT is the next best thing as that has at least some connection with reality since there exist bare metal platforms that automatically INIT the CPU if it hits shutdown. Fixes: 46fe4ddd9dbb ("[PATCH] KVM: SVM: Propagate cpu shutdown events to userspace") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-45-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86Sean Christopherson1-0/+8
Move the setting of CR0, CR4, EFER, RFLAGS, and RIP from vendor code to common x86. VMX and SVM now have near-identical sequences, the only difference being that VMX updates the exception bitmap. Updating the bitmap on SVM is unnecessary, but benign. Unfortunately it can't be left behind in VMX due to the need to update exception intercepts after the control registers are set. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-37-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86/mmu: Skip the permission_fault() check on MMIO if CR0.PG=0Sean Christopherson1-3/+3
Skip the MMU permission_fault() check if paging is disabled when verifying the cached MMIO GVA is usable. The check is unnecessary and can theoretically get a false positive since the MMU doesn't zero out "permissions" or "pkru_mask" when guest paging is disabled. The obvious alternative is to zero out all the bitmasks when configuring nonpaging MMUs, but that's unnecessary work and doesn't align with the MMU's general approach of doing as little as possible for flows that are supposed to be unreachable. This is nearly a nop as the false positive is nothing more than an insignificant performance blip, and more or less limited to string MMIO when L1 is running with paging disabled. KVM doesn't cache MMIO if L2 is active with nested TDP since the "GVA" is really an L2 GPA. If L2 is active without nested TDP, then paging can't be disabled as neither VMX nor SVM allows entering the guest without paging of some form. Jumping back to L1 with paging disabled, in that case direct_map is true and so KVM will use CR2 as a GPA; the only time it doesn't is if the fault from the emulator doesn't match or emulator_can_use_gpa(), and that fails only on string MMIO and other instructions with multiple memory operands. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-27-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Move EDX initialization at vCPU RESET to common codeSean Christopherson1-0/+13
Move the EDX initialization at vCPU RESET, which is now identical between VMX and SVM, into common code. No functional change intended. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Flush the guest's TLB on INITSean Christopherson1-0/+12
Flush the guest's TLB on INIT, as required by Intel's SDM. Although AMD's APM states that the TLBs are unchanged by INIT, it's not clear that that's correct as the APM also states that the TLB is flush on "External initialization of the processor." Regardless, relying on the guest to be paranoid is unnecessarily risky, while an unnecessary flush is benign from a functional perspective and likely has no measurable impact on guest performance. Note, as of the April 2021 version of Intels' SDM, it also contradicts itself with respect to TLB flushing. The overview of INIT explicitly calls out the TLBs as being invalidated, while a table later in the same section says they are unchanged. 9.1 INITIALIZATION OVERVIEW: The major difference is that during an INIT, the internal caches, MSRs, MTRRs, and x87 FPU state are left unchanged (although, the TLBs and BTB are invalidated as with a hardware reset) Table 9-1: Register Power up Reset INIT Data and Code Cache, TLBs: Invalid[6] Invalid[6] Unchanged Given Core2's erratum[*] about global TLB entries not being flush on INIT, it's safe to assume that the table is simply wrong. AZ28. INIT Does Not Clear Global Entries in the TLB Problem: INIT may not flush a TLB entry when: • The processor is in protected mode with paging enabled and the page global enable flag is set (PGE bit of CR4 register) • G bit for the page table entry is set • TLB entry is present in TLB when INIT occurs • Software may encounter unexpected page fault or incorrect address translation due to a TLB entry erroneously left in TLB after INIT. Workaround: Write to CR3, CR4 (setting bits PSE, PGE or PAE) or CR0 (setting bits PG or PE) registers before writing to memory early in BIOS code to clear all the global entries from TLB. Status: For the steppings affected, see the Summary Tables of Changes. [*] https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: APICv: drop immediate APICv disablement on current vCPUMaxim Levitsky1-11/+1
Special case of disabling the APICv on the current vCPU right away in kvm_request_apicv_update doesn't bring much benefit vs raising KVM_REQ_APICV_UPDATE on it instead, since this request will be processed on the next entry to the guest. (the comment about having another #VMEXIT is wrong). It also hides various assumptions that APIVc enable state matches the APICv inhibit state, as this special case only makes those states match on the current vCPU. Previous patches fixed few such assumptions so now it should be safe to drop this special case. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210713142023.106183-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: X86: Add per-vm stat for max rmap list sizePeter Xu1-0/+1
Add a new statistic max_mmu_rmap_size, which stores the maximum size of rmap for the vm. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210625153214.43106-2-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Hoist kvm_dirty_regs check out of sync_regs()Sean Christopherson1-4/+2
Move the kvm_dirty_regs vs. KVM_SYNC_X86_VALID_FIELDS check out of sync_regs() and into its sole caller, kvm_arch_vcpu_ioctl_run(). This allows a future patch to allow synchronizing select state for protected VMs. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <889017a8d31cea46472e0c64b234ef5919278ed9.1625186503.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the VMSean Christopherson1-0/+4
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <0e8760a26151f47dc47052b25ca8b84fffe0641e.1625186503.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-30KVM: x86: accept userspace interrupt only if no event is injectedPaolo Bonzini1-2/+11
Once an exception has been injected, any side effects related to the exception (such as setting CR2 or DR6) have been taked place. Therefore, once KVM sets the VM-entry interruption information field or the AMD EVENTINJ field, the next VM-entry must deliver that exception. Pending interrupts are processed after injected exceptions, so in theory it would not be a problem to use KVM_INTERRUPT when an injected exception is present. However, DOSEMU is using run->ready_for_interrupt_injection to detect interrupt windows and then using KVM_SET_SREGS/KVM_SET_REGS to inject the interrupt manually. For this to work, the interrupt window must be delayed after the completion of the previous event injection. Cc: stable@vger.kernel.org Reported-by: Stas Sergeev <stsp2@yandex.ru> Tested-by: Stas Sergeev <stsp2@yandex.ru> Fixes: 71cc849b7093 ("KVM: x86: Fix split-irqchip vs interrupt injection window request") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-26KVM: x86: Check the right feature bit for MSR_KVM_ASYNC_PF_ACK accessVitaly Kuznetsov1-2/+2
MSR_KVM_ASYNC_PF_ACK MSR is part of interrupt based asynchronous page fault interface and not the original (deprecated) KVM_FEATURE_ASYNC_PF. This is stated in Documentation/virt/kvm/msr.rst. Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20210722123018.260035-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-3/+2
Pull kvm fixes from Paolo Bonzini: - Allow again loading KVM on 32-bit non-PAE builds - Fixes for host SMIs on AMD - Fixes for guest SMIs on AMD - Fixes for selftests on s390 and ARM - Fix memory leak - Enforce no-instrumentation area on vmentry when hardware breakpoints are in use. * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits) KVM: selftests: smm_test: Test SMM enter from L2 KVM: nSVM: Restore nested control upon leaving SMM KVM: nSVM: Fix L1 state corruption upon return from SMM KVM: nSVM: Introduce svm_copy_vmrun_state() KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails KVM: SVM: add module param to control the #SMI interception KVM: SVM: remove INIT intercept handler KVM: SVM: #SMI interception must not skip the instruction KVM: VMX: Remove vmx_msr_index from vmx.h KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run() KVM: selftests: Address extra memslot parameters in vm_vaddr_alloc kvm: debugfs: fix memory leak in kvm_create_vm_debugfs KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR ...
2021-07-15KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()Lai Jiangshan1-0/+2
When the host is using debug registers but the guest is not using them nor is the guest in guest-debug state, the kvm code does not reset the host debug registers before kvm_x86->run(). Rather, it relies on the hardware vmentry instruction to automatically reset the dr7 registers which ensures that the host breakpoints do not affect the guest. This however violates the non-instrumentable nature around VM entry and exit; for example, when a host breakpoint is set on vcpu->arch.cr2, Another issue is consistency. When the guest debug registers are active, the host breakpoints are reset before kvm_x86->run(). But when the guest debug registers are inactive, the host breakpoints are delayed to be disabled. The host tracing tools may see different results depending on what the guest is doing. To fix the problems, we clear %db7 unconditionally before kvm_x86->run() if the host has set any breakpoints, no matter if the guest is using them or not. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com> Cc: stable@vger.kernel.org [Only clear %db7 instead of reloading all debug registers. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-14Revert "KVM: x86: WARN and reject loading KVM if NX is supported but not ↵Sean Christopherson1-3/+0
enabled" Let KVM load if EFER.NX=0 even if NX is supported, the analysis and testing (or lack thereof) for the non-PAE host case was garbage. If the kernel won't be using PAE paging, .Ldefault_entry in head_32.S skips over the entire EFER sequence. Hopefully that can be changed in the future to allow KVM to require EFER.NX, but the motivation behind KVM's requirement isn't yet merged. Reverting and revisiting the mess at a later date is by far the safest approach. This reverts commit 8bbed95d2cb6e5de8a342d761a89b0a04faed7be. Fixes: 8bbed95d2cb6 ("KVM: x86: WARN and reject loading KVM if NX is supported but not enabled") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210625001853.318148-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-07Merge tag 'x86-fpu-2021-07-07' of ↵Linus Torvalds1-26/+30
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu updates from Thomas Gleixner: "Fixes and improvements for FPU handling on x86: - Prevent sigaltstack out of bounds writes. The kernel unconditionally writes the FPU state to the alternate stack without checking whether the stack is large enough to accomodate it. Check the alternate stack size before doing so and in case it's too small force a SIGSEGV instead of silently corrupting user space data. - MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never been updated despite the fact that the FPU state which is stored on the signal stack has grown over time which causes trouble in the field when AVX512 is available on a CPU. The kernel does not expose the minimum requirements for the alternate stack size depending on the available and enabled CPU features. ARM already added an aux vector AT_MINSIGSTKSZ for the same reason. Add it to x86 as well. - A major cleanup of the x86 FPU code. The recent discoveries of XSTATE related issues unearthed quite some inconsistencies, duplicated code and other issues. The fine granular overhaul addresses this, makes the code more robust and maintainable, which allows to integrate upcoming XSTATE related features in sane ways" * tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again x86/fpu/signal: Let xrstor handle the features to init x86/fpu/signal: Handle #PF in the direct restore path x86/fpu: Return proper error codes from user access functions x86/fpu/signal: Split out the direct restore code x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing() x86/fpu/signal: Sanitize the xstate check on sigframe x86/fpu/signal: Remove the legacy alignment check x86/fpu/signal: Move initial checks into fpu__restore_sig() x86/fpu: Mark init_fpstate __ro_after_init x86/pkru: Remove xstate fiddling from write_pkru() x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate() x86/fpu: Remove PKRU handling from switch_fpu_finish() x86/fpu: Mask PKRU from kernel XRSTOR[S] operations x86/fpu: Hook up PKRU into ptrace() x86/fpu: Add PKRU storage outside of task XSAVE buffer x86/fpu: Dont restore PKRU in fpregs_restore_userspace() x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi() x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs() x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs() ...
2021-06-25kvm: x86: Allow userspace to handle emulation errorsAaron Lewis1-4/+36
Add a fallback mechanism to the in-kernel instruction emulator that allows userspace the opportunity to process an instruction the emulator was unable to. When the in-kernel instruction emulator fails to process an instruction it will either inject a #UD into the guest or exit to userspace with exit reason KVM_INTERNAL_ERROR. This is because it does not know how to proceed in an appropriate manner. This feature lets userspace get involved to see if it can figure out a better path forward. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210510144834.658457-2-aaronlewis@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>