From 5c247d08bc81bbad4c662dcf5654137a2f8483ec Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Feb 2026 20:10:10 +0000 Subject: KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on nested #VMEXIT. This value is incorrect in some cases, causing L1 to run L2 with a corrupted CR2. This could lead to segfaults or data corruption if L2 is in the middle of handling a #PF and reads a corrupted CR2. Use the correct value in vcpu->arch.cr2 instead. The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN of L2, and sync'd back to vcpu->arch.cr2 shortly after. The values are only out of sync in two cases: after save+restore, and after a #PF is injected into L2. In either case, if a #VMEXIT to L1 is synthesized before L2 runs, using the value in vmcb02 would be incorrect. After save+restore, the value of CR2 is restored by KVM_SET_SREGS into vcpu->arch.cr2. It is not reflected in vmcb02 until a VMRUN of L2. Before that, vmcb02 holds whatever it held before restore, which would be zero on a new vCPU that never ran nested. If a #VMEXIT to L1 is synthesized before L2 ever runs, using vcpu->arch.cr2 to update vmcb12 is the right thing to do. The #PF injection case is more nuanced. Although the APM is a bit unclear about when CR2 is written during a #PF, the SDM is clearer: Processors update CR2 whenever a page fault is detected. If a second page fault occurs while an earlier page fault is being delivered, the faulting linear address of the second fault will overwrite the contents of CR2 (replacing the previous address). These updates to CR2 occur even if the page fault results in a double fault or occurs during the delivery of a double fault. KVM injecting the exception surely counts as the #PF being "detected". More importantly, when an exception is injected into L2 at the time of a synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly, such that an L1 hypervisor can re-inject the exception. If CR2 is not written at that point, the L1 hypervisor has no way of correctly re-injecting the #PF. Hence, if a #VMEXIT to L1 is synthesized after the #PF is injected into L2 but before L2 actually runs, using vcpu->arch.cr2 to update vmcb12 is also the right thing to do. Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for L2, only when it is injected. The distinction is important, because only injected (but not intercepted) exceptions are propagated to L1 through exit_int_info. It would be incorrect to update CR2 in vmcb12 for a pending #PF, as L1 would perceive an updated CR2 value with no #PF.
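To make the lifecycle concrete, here is a simplified sketch of the sync points described above (the helper names are hypothetical; only the ordering matters):

  /*
   * Sketch of the CR2 sync points; between the two points,
   * vcpu->arch.cr2 is the only up-to-date copy of L2's CR2.
   */
  static void before_vmrun_of_l2(struct kvm_vcpu *vcpu, struct vmcb *vmcb02)
  {
          vmcb02->save.cr2 = vcpu->arch.cr2;      /* sync'd in before VMRUN */
  }

  static void after_vmexit_of_l2(struct kvm_vcpu *vcpu, struct vmcb *vmcb02)
  {
          vcpu->arch.cr2 = vmcb02->save.cr2;      /* sync'd out after #VMEXIT */
  }

A #VMEXIT synthesized between those two points (after KVM_SET_SREGS, or after a #PF is injected but before L2 runs) must therefore read vcpu->arch.cr2, not vmcb02.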
Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260203201010.1871056-1-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 53ab6ce3cc26..99f8b8de8159 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1156,7 +1156,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm) vmcb12->save.efer = svm->vcpu.arch.efer; vmcb12->save.cr0 = kvm_read_cr0(vcpu); vmcb12->save.cr3 = kvm_read_cr3(vcpu); - vmcb12->save.cr2 = vmcb02->save.cr2; + vmcb12->save.cr2 = vcpu->arch.cr2; vmcb12->save.cr4 = svm->vcpu.arch.cr4; vmcb12->save.rflags = kvm_get_rflags(vcpu); vmcb12->save.rip = kvm_rip_read(vcpu); -- cgit v1.2.3 From d0ad1b05bbe6f8da159a4dfb6692b3b7ce30ccc8 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 17 Feb 2026 16:54:38 -0800 Subject: KVM: x86: Defer non-architectural delivery of exception payload to userspace read When attempting to play nice with userspace that hasn't enabled KVM_CAP_EXCEPTION_PAYLOAD, defer KVM's non-architectural delivery of the payload until userspace actually reads relevant vCPU state, and more importantly, force delivery of the payload in *all* paths where userspace saves relevant vCPU state, not just KVM_GET_VCPU_EVENTS. Ignoring userspace save/restore for the moment, delivering the payload before the exception is injected is wrong regardless of whether L1 or L2 is running. To make matters even more confusing, the flaw *currently* being papered over by the !is_guest_mode() check isn't even the same bug that commit da998b46d244 ("kvm: x86: Defer setting of CR2 until #PF delivery") was trying to avoid. At the time of commit da998b46d244, KVM didn't correctly handle exception intercepts, as KVM would wait until VM-Entry into L2 was imminent to check if the queued exception should morph to a nested VM-Exit. I.e. KVM would deliver the payload to L2 and then synthesize a VM-Exit into L1. But the payload was only the most blatant issue, e.g. waiting to check exception intercepts would also lead to KVM incorrectly escalating a should-be-intercepted #PF into a #DF. That underlying bug was eventually fixed by commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending VM-Exits at queue time"), but in the interim, commit a06230b62b89 ("KVM: x86: Deliver exception payload on KVM_GET_VCPU_EVENTS") came along and subtly added another dependency on the !is_guest_mode() check. While not recorded in the changelog, the motivation for deferring the !exception_payload_enabled delivery was to fix a flaw where a synthesized MTF (Monitor Trap Flag) VM-Exit would drop a pending #DB and clobber DR6. On a VM-Exit, VMX CPUs save pending #DB information into the VMCS, which is emulated by KVM in nested_vmx_update_pending_dbg() by grabbing the payload from the queued/pending exception. I.e. prematurely delivering the payload would cause the pending #DB to not be recorded in the VMCS, and of course, clobber L2's DR6 as seen by L1. Jumping back to save+restore, the quirked behavior of forcing delivery of the payload only works if userspace does KVM_GET_VCPU_EVENTS *before* CR2 or DR6 is saved, i.e. before KVM_GET_SREGS{,2} and KVM_GET_DEBUGREGS. E.g. if userspace does KVM_GET_SREGS before KVM_GET_VCPU_EVENTS, then the CR2 saved by userspace won't contain the payload for the exception saved by KVM_GET_VCPU_EVENTS.
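For example, a hypothetical userspace save sequence that loses the payload under the old quirked behavior (a sketch; vcpu_fd is the vCPU file descriptor, and only the ioctl ordering matters):

  struct kvm_sregs sregs;
  struct kvm_vcpu_events events;

  /* CR2 is saved before the payload is (non-architecturally) delivered... */
  ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
  /* ...so delivering the payload here is too late for the saved CR2. */
  ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events);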
Deliberately deliver the payload in the store_regs() path, as it's the least awful option even though userspace may not be doing save+restore: if userspace _is_ doing save+restore, it could elide KVM_GET_SREGS, knowing that SREGS were already saved when the vCPU exited. Link: https://lore.kernel.org/all/20200207103608.110305-1-oupton@google.com Cc: Yosry Ahmed Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed Tested-by: Yosry Ahmed Link: https://patch.msgid.link/20260218005438.2619063-1-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/x86.c | 62 ++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 39 insertions(+), 23 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a03530795707..6e87ec52fa06 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr, vcpu->arch.exception.error_code = error_code; vcpu->arch.exception.has_payload = has_payload; vcpu->arch.exception.payload = payload; - if (!is_guest_mode(vcpu)) - kvm_deliver_exception_payload(vcpu, - &vcpu->arch.exception); return; } @@ -5531,18 +5528,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu, return 0; } -static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, - struct kvm_vcpu_events *events) +static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu) { - struct kvm_queued_exception *ex; - - process_nmi(vcpu); - -#ifdef CONFIG_KVM_SMM - if (kvm_check_request(KVM_REQ_SMI, vcpu)) - process_smi(vcpu); -#endif - /* * KVM's ABI only allows for one exception to be migrated. Luckily, * the only time there can be two queued exceptions is if there's a @@ -5553,21 +5540,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, if (vcpu->arch.exception_vmexit.pending && !vcpu->arch.exception.pending && !vcpu->arch.exception.injected) - ex = &vcpu->arch.exception_vmexit; - else - ex = &vcpu->arch.exception; + return &vcpu->arch.exception_vmexit; + + return &vcpu->arch.exception; +} + +static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu) +{ + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); /* - * In guest mode, payload delivery should be deferred if the exception - * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1 - * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability, - * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not - * propagate the payload and so it cannot be safely deferred. Deliver - * the payload if the capability hasn't been requested. + * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver + * the pending exception payload when userspace saves *any* vCPU state + * that interacts with exception payloads to avoid breaking userspace. + * + * Architecturally, KVM must not deliver an exception payload until the + * exception is actually injected, e.g. to avoid losing pending #DB + * information (which VMX tracks in the VMCS), and to avoid clobbering + * state if the exception is never injected for whatever reason. But + * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or + * may not propagate the payload across save+restore, and so KVM can't + * safely defer delivery of the payload.
*/ if (!vcpu->kvm->arch.exception_payload_enabled && ex->pending && ex->has_payload) kvm_deliver_exception_payload(vcpu, ex); +} + +static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, + struct kvm_vcpu_events *events) +{ + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu); + + process_nmi(vcpu); + +#ifdef CONFIG_KVM_SMM + if (kvm_check_request(KVM_REQ_SMI, vcpu)) + process_smi(vcpu); +#endif + + kvm_handle_exception_payload_quirk(vcpu); memset(events, 0, sizeof(*events)); @@ -5746,6 +5758,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, vcpu->arch.guest_state_protected) return -EINVAL; + kvm_handle_exception_payload_quirk(vcpu); + memset(dbgregs, 0, sizeof(*dbgregs)); BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db)); @@ -12136,6 +12150,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) if (vcpu->arch.guest_state_protected) goto skip_protected_regs; + kvm_handle_exception_payload_quirk(vcpu); + kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS); kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS); kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES); -- cgit v1.2.3 From e63fb1379f4b9300a44739964e69549bebbcdca4 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 10 Feb 2026 01:08:06 +0000 Subject: KVM: nSVM: Mark all of vmcb02 dirty when restoring nested state When restoring a vCPU in guest mode, any state restored before KVM_SET_NESTED_STATE (e.g. KVM_SET_SREGS) will mark the corresponding dirty bits in vmcb01, as it is the active VMCB before switching to vmcb02 in svm_set_nested_state(). Hence, mark all fields in vmcb02 dirty in svm_set_nested_state() to capture any previously restored fields. Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260210010806.3204289-1-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 99f8b8de8159..d5a8f5608f2d 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1909,6 +1909,12 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, svm_switch_vmcb(svm, &svm->nested.vmcb02); nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip, svm->vmcb->save.cs.base); + /* + * Any previously restored state (e.g. KVM_SET_SREGS) would mark fields + * dirty in vmcb01 instead of vmcb02, so mark all of vmcb02 dirty here. + */ + vmcb_mark_all_dirty(svm->vmcb); + /* * While the nested guest CR3 is already checked and set by * KVM_SET_SREGS, it was set when nested state was yet loaded, -- cgit v1.2.3 From 24f7d36b824b65cf1a2db3db478059187b2a37b0 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 24 Feb 2026 22:50:17 +0000 Subject: KVM: nSVM: Ensure AVIC is inhibited when restoring a vCPU to guest mode On nested VMRUN, KVM ensures AVIC is inhibited by requesting KVM_REQ_APICV_UPDATE, triggering a check of inhibit reasons, finding APICV_INHIBIT_REASON_NESTED, and disabling AVIC. However, when KVM_SET_NESTED_STATE is performed on a vCPU not in guest mode with AVIC enabled, KVM_REQ_APICV_UPDATE is not requested, and AVIC is not inhibited. Request KVM_REQ_APICV_UPDATE in the KVM_SET_NESTED_STATE path if AVIC is active, similar to the nested VMRUN path. 
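A hypothetical restore sequence that hit the bug (a sketch; only the ordering and the missing request matter):

  /* vCPU created with AVIC active, then placed directly in guest mode. */
  ioctl(vcpu_fd, KVM_SET_NESTED_STATE, &state);  /* no KVM_REQ_APICV_UPDATE */
  ioctl(vcpu_fd, KVM_RUN, NULL);                 /* L2 runs with AVIC uninhibited */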
Fixes: f44509f849fe ("KVM: x86: SVM: allow AVIC to co-exist with a nested guest running") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260224225017.3303870-1-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index d5a8f5608f2d..3667f8ba5268 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1928,6 +1928,9 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, svm->nested.force_msr_bitmap_recalc = true; + if (kvm_vcpu_apicv_active(vcpu)) + kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu); + kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); ret = 0; out_free: -- cgit v1.2.3 From 778d8c1b2a6ffe622ddcd3bb35b620e6e41f4da0 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:43 +0000 Subject: KVM: nSVM: Sync NextRIP to cached vmcb12 after VMRUN of L2 After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs fields written by the CPU from vmcb02 to the cached vmcb12. This is because the cached vmcb12 is used as the authoritative copy of some of the controls, and is the payload when saving/restoring nested state. NextRIP is also written by the CPU (in some cases) after VMRUN, but is not sync'd to the cached vmcb12. As a result, it is corrupted after save/restore (replaced by the original value written by L1 on nested VMRUN). This could cause problems for both KVM (e.g. when injecting a soft IRQ) and L1 (e.g. when using NextRIP to advance RIP after emulating an instruction). Fix this by sync'ing NextRIP to the cache after VMRUN of L2, but only after completing interrupts (not in nested_sync_control_from_vmcb02()), as KVM may update NextRIP (e.g. when re-injecting a soft IRQ). Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-2-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 8f8bc863e214..07f096758f34 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4435,6 +4435,16 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) svm_complete_interrupts(vcpu); + /* + * Update the cache after completing interrupts to get an accurate + * NextRIP, e.g. when re-injecting a soft interrupt. + * + * FIXME: Rework svm_get_nested_state() to not pull data from the + * cache (except for maybe int_ctl). + */ + if (is_guest_mode(vcpu)) + svm->nested.ctl.next_rip = svm->vmcb->control.next_rip; + return svm_exit_handlers_fastpath(vcpu); } -- cgit v1.2.3 From 03bee264f8ebfd39e0254c98e112d033a7aa9055 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:44 +0000 Subject: KVM: nSVM: Sync interrupt shadow to cached vmcb12 after VMRUN of L2 After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs fields written by the CPU from vmcb02 to the cached vmcb12. This is because the cached vmcb12 is used as the authoritative copy of some of the controls, and is the payload when saving/restoring nested state. int_state is also written by the CPU, specifically bit 0 (i.e. SVM_INTERRUPT_SHADOW_MASK) for nested VMs, but it is not sync'd to the cached vmcb12.
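For reference, the bit in question, using the existing SVM definitions (a sketch, not part of the patch):

  /* Bit 0 of int_state is the interrupt shadow. */
  bool shadow = vmcb02->control.int_state & SVM_INTERRUPT_SHADOW_MASK;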
This does not cause a problem if KVM_SET_NESTED_STATE precedes KVM_SET_VCPU_EVENTS in the restore path, as an interrupt shadow would be correctly restored to vmcb02 (KVM_SET_VCPU_EVENTS overwrites what KVM_SET_NESTED_STATE restored in int_state). However, if KVM_SET_VCPU_EVENTS precedes KVM_SET_NESTED_STATE, an interrupt shadow would be restored into vmcb01 instead of vmcb02. This would mostly be benign for L1 (delays an interrupt), but not for L2. For L2, the vCPU could hang (e.g. if a wakeup interrupt is delivered before a HLT that should have been in an interrupt shadow). Sync int_state to the cached vmcb12 in nested_sync_control_from_vmcb02() to avoid this problem. With that, KVM_SET_NESTED_STATE restores the correct interrupt shadow state, and if KVM_SET_VCPU_EVENTS follows, it would overwrite it with the same value. Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-3-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 3667f8ba5268..2308e40691c4 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -521,6 +521,7 @@ void nested_sync_control_from_vmcb02(struct vcpu_svm *svm) u32 mask; svm->nested.ctl.event_inj = svm->vmcb->control.event_inj; svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err; + svm->nested.ctl.int_state = svm->vmcb->control.int_state; /* Only a few fields of int_ctl are written by the processor. */ mask = V_IRQ_MASK | V_TPR_MASK; -- cgit v1.2.3 From 2303ca26fbb005a45aaf5a547465f978df906cb7 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:45 +0000 Subject: KVM: selftests: Extend state_test to check vGIF V_GIF_MASK is one of the fields written by the CPU after VMRUN, and sync'd by KVM from vmcb02 to the cached vmcb12 after running L2. Part of the reason for that sync is to make sure V_GIF_MASK is saved/restored correctly, as the cached vmcb12 is the payload of nested state. Verify that V_GIF_MASK is saved/restored correctly in state_test by enabling vGIF in vmcb12, toggling GIF in L2 at different GUEST_SYNC() points, and verifying that V_GIF_MASK is correctly propagated to the nested state. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-4-yosry@kernel.org Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/x86/state_test.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/tools/testing/selftests/kvm/x86/state_test.c b/tools/testing/selftests/kvm/x86/state_test.c index f2c7a1c297e3..57c7546f3d7c 100644 --- a/tools/testing/selftests/kvm/x86/state_test.c +++ b/tools/testing/selftests/kvm/x86/state_test.c @@ -26,7 +26,9 @@ void svm_l2_guest_code(void) GUEST_SYNC(4); /* Exit to L1 */ vmcall(); + clgi(); GUEST_SYNC(6); + stgi(); /* Done, exit to L1 and never come back.
*/ vmcall(); } @@ -41,6 +43,8 @@ static void svm_l1_guest_code(struct svm_test_data *svm) generic_svm_setup(svm, svm_l2_guest_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); + vmcb->control.int_ctl |= (V_GIF_ENABLE_MASK | V_GIF_MASK); + GUEST_SYNC(3); run_guest(vmcb, svm->vmcb_gpa); GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_VMMCALL); @@ -222,6 +226,24 @@ static void __attribute__((__flatten__)) guest_code(void *arg) GUEST_DONE(); } +void svm_check_nested_state(int stage, struct kvm_x86_state *state) +{ + struct vmcb *vmcb = (struct vmcb *)state->nested.data.svm; + + if (kvm_cpu_has(X86_FEATURE_VGIF)) { + if (stage == 4) + TEST_ASSERT_EQ(!!(vmcb->control.int_ctl & V_GIF_MASK), 1); + if (stage == 6) + TEST_ASSERT_EQ(!!(vmcb->control.int_ctl & V_GIF_MASK), 0); + } +} + +void check_nested_state(int stage, struct kvm_x86_state *state) +{ + if (kvm_has_cap(KVM_CAP_NESTED_STATE) && kvm_cpu_has(X86_FEATURE_SVM)) + svm_check_nested_state(stage, state); +} + int main(int argc, char *argv[]) { uint64_t *xstate_bv, saved_xstate_bv; @@ -278,6 +300,8 @@ int main(int argc, char *argv[]) kvm_vm_release(vm); + check_nested_state(stage, state); + /* Restore state in a new VM. */ vcpu = vm_recreate_with_one_vcpu(vm); vcpu_load_state(vcpu, state); -- cgit v1.2.3 From e5cdd34b5f74c4a0c72fe43092192f347d999e77 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:46 +0000 Subject: KVM: selftests: Extend state_test to check next_rip Similar to vGIF, extend state_test to make sure that next_rip is saved correctly in nested state. GUEST_SYNC() in L2 causes IO emulation by KVM, which advances the RIP to the value of next_rip. Hence, if next_rip is saved correctly, its value should match the saved RIP value. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-5-yosry@kernel.org Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/x86/state_test.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/tools/testing/selftests/kvm/x86/state_test.c b/tools/testing/selftests/kvm/x86/state_test.c index 57c7546f3d7c..992a52504a4a 100644 --- a/tools/testing/selftests/kvm/x86/state_test.c +++ b/tools/testing/selftests/kvm/x86/state_test.c @@ -236,6 +236,17 @@ void svm_check_nested_state(int stage, struct kvm_x86_state *state) if (stage == 6) TEST_ASSERT_EQ(!!(vmcb->control.int_ctl & V_GIF_MASK), 0); } + + if (kvm_cpu_has(X86_FEATURE_NRIPS)) { + /* + * GUEST_SYNC() causes IO emulation in KVM, in which case the + * RIP is advanced before exiting to userspace. Hence, the RIP + * in the saved state should be the same as nRIP saved by the + * CPU in the VMCB. + */ + if (stage == 6) + TEST_ASSERT_EQ(vmcb->control.next_rip, state->regs.rip); + } } void check_nested_state(int stage, struct kvm_x86_state *state) -- cgit v1.2.3 From 8d397582f6b5e9fbcf09781c7c934b4910e94a50 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:47 +0000 Subject: KVM: nSVM: Always use NextRIP as vmcb02's NextRIP after first L2 VMRUN For guests with NRIPS disabled, L1 does not provide NextRIP when running an L2 with an injected soft interrupt; instead, it advances the current RIP before running it. KVM uses the current RIP as the NextRIP in vmcb02 to emulate a CPU without NRIPS. However, after L2 runs the first time, NextRIP will be updated by the CPU and/or KVM, and the current RIP is no longer the correct value to use in vmcb02. Hence, after save/restore, use the current RIP if and only if a nested run is pending; otherwise, use NextRIP.
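For context, this is roughly what a hypothetical nrips=0-aware L1 does when injecting a soft interrupt into L2 (a sketch, not KVM code; rip_after_intn is a made-up name for the manually advanced RIP):

  /* L1 advances RIP itself, because the CPU will not consume NextRIP. */
  vmcb12->save.rip = rip_after_intn;
  vmcb12->control.event_inj = vector | SVM_EVTINJ_TYPE_SOFT | SVM_EVTINJ_VALID;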
Give soft_int_next_rip the same treatment, as it's the same logic, just for a narrower use case. Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-6-yosry@kernel.org [sean: give soft_int_next_rip the same treatment] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 28 ++++++++++++++++++---------- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 2308e40691c4..1cc083f95e6a 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -845,24 +845,32 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm, vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err; /* - * next_rip is consumed on VMRUN as the return address pushed on the + * NextRIP is consumed on VMRUN as the return address pushed on the * stack for injected soft exceptions/interrupts. If nrips is exposed - * to L1, take it verbatim from vmcb12. If nrips is supported in - * hardware but not exposed to L1, stuff the actual L2 RIP to emulate - * what a nrips=0 CPU would do (L1 is responsible for advancing RIP - * prior to injecting the event). + * to L1, take it verbatim from vmcb12. + * + * If nrips is supported in hardware but not exposed to L1, stuff the + * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is + * responsible for advancing RIP prior to injecting the event). This is + * only the case for the first L2 run after VMRUN. After that (e.g. + * during save/restore), NextRIP is updated by the CPU and/or KVM, and + * the value of the L2 RIP from vmcb12 should not be used. */ - if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) - vmcb02->control.next_rip = svm->nested.ctl.next_rip; - else if (boot_cpu_has(X86_FEATURE_NRIPS)) - vmcb02->control.next_rip = vmcb12_rip; + if (boot_cpu_has(X86_FEATURE_NRIPS)) { + if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || + !svm->nested.nested_run_pending) + vmcb02->control.next_rip = svm->nested.ctl.next_rip; + else + vmcb02->control.next_rip = vmcb12_rip; + } svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj); if (is_evtinj_soft(vmcb02->control.event_inj)) { svm->soft_int_injected = true; svm->soft_int_csbase = vmcb12_csbase; svm->soft_int_old_rip = vmcb12_rip; - if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) + if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || + !svm->nested.nested_run_pending) svm->soft_int_next_rip = svm->nested.ctl.next_rip; else svm->soft_int_next_rip = vmcb12_rip; -- cgit v1.2.3 From a0592461f39c00b28f552fe842a063a00043eaa8 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 25 Feb 2026 00:59:48 +0000 Subject: KVM: nSVM: Delay stuffing L2's current RIP into NextRIP until vCPU run For guests with NRIPS disabled, L1 does not provide NextRIP when running an L2 with an injected soft interrupt; instead, it advances L2's RIP before running it. KVM uses L2's current RIP as the NextRIP in vmcb02 to emulate a CPU without NRIPS. However, in svm_set_nested_state(), the value used for L2's current RIP comes from vmcb02, which is just whatever the vCPU had in vmcb02 before restoring nested state (zero on a freshly created vCPU). Passing the cached RIP value instead (i.e. kvm_rip_read()) would only fix the issue if registers are restored before nested state. Instead, split the logic of setting NextRIP in vmcb02.
Handle the 'normal' case of initializing vmcb02's NextRIP using NextRIP from vmcb12 (or KVM_GET_NESTED_STATE's payload) in nested_vmcb02_prepare_control(). Delay the special case of stuffing L2's current RIP into vmcb02's NextRIP until shortly before the vCPU is run, to make sure the most up-to-date value of RIP is used regardless of KVM_SET_REGS and KVM_SET_NESTED_STATE's relative ordering. Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-7-yosry@kernel.org [sean: use new helper, svm_fixup_nested_rips()] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 25 ++++++++----------------- arch/x86/kvm/svm/svm.c | 25 +++++++++++++++++++++++++ 2 files changed, 33 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 1cc083f95e6a..76d959d15e14 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -845,24 +845,15 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm, vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err; /* - * NextRIP is consumed on VMRUN as the return address pushed on the - * stack for injected soft exceptions/interrupts. If nrips is exposed - * to L1, take it verbatim from vmcb12. - * - * If nrips is supported in hardware but not exposed to L1, stuff the - * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is - * responsible for advancing RIP prior to injecting the event). This is - * only the case for the first L2 run after VMRUN. After that (e.g. - * during save/restore), NextRIP is updated by the CPU and/or KVM, and - * the value of the L2 RIP from vmcb12 should not be used. + * If nrips is exposed to L1, take NextRIP as-is. Otherwise, L1 + * advances L2's RIP before VMRUN instead of using NextRIP. KVM will + * stuff the current RIP as vmcb02's NextRIP before L2 is run. After + * the first run of L2 (e.g. after save+restore), NextRIP is updated by + * the CPU and/or KVM and should be used regardless of L1's support. */ - if (boot_cpu_has(X86_FEATURE_NRIPS)) { - if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || - !svm->nested.nested_run_pending) - vmcb02->control.next_rip = svm->nested.ctl.next_rip; - else - vmcb02->control.next_rip = vmcb12_rip; - } + if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || + !svm->nested.nested_run_pending) + vmcb02->control.next_rip = svm->nested.ctl.next_rip; svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj); if (is_evtinj_soft(vmcb02->control.event_inj)) { diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 07f096758f34..f862bafc381a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3738,6 +3738,29 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected) svm->vmcb->control.event_inj = intr->nr | SVM_EVTINJ_VALID | type; } +static void svm_fixup_nested_rips(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (!is_guest_mode(vcpu) || !svm->nested.nested_run_pending) + return; + + /* + * If nrips is supported in hardware but not exposed to L1, stuff the + * actual L2 RIP to emulate what a nrips=0 CPU would do (L1 is + * responsible for advancing RIP prior to injecting the event). Once L2 + * runs after L1 executes VMRUN, NextRIP is updated by the CPU and/or + * KVM, and this is no longer needed. 
+ * + * This is done here (as opposed to when preparing vmcb02) to use the + * most up-to-date value of RIP regardless of the order of restoring + * registers and nested state in the vCPU save+restore path. + */ + if (boot_cpu_has(X86_FEATURE_NRIPS) && + !guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) + svm->vmcb->control.next_rip = kvm_rip_read(vcpu); +} + void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode, int trig_mode, int vector) { @@ -4334,6 +4357,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) kvm_register_is_dirty(vcpu, VCPU_EXREG_ERAPS)) svm->vmcb->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP; + svm_fixup_nested_rips(vcpu); + svm_hv_update_vp_id(svm->vmcb, vcpu); /* -- cgit v1.2.3 From c64bc6ed1764c1b7e3c0017019f743196074092f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 4 Mar 2026 16:06:56 -0800 Subject: KVM: nSVM: Delay setting soft IRQ RIP tracking fields until vCPU run In the save+restore path, when restoring nested state, the values of RIP and CS base passed into nested_vmcb02_prepare_control() are mostly incorrect. They are both pulled from the vmcb02. For CS base, the value is only correct if system regs are restored before nested state. The value of RIP is whatever the vCPU had in vmcb02 before restoring nested state (zero on a freshly created vCPU). Instead, take a similar approach to NextRIP, and delay initializing the RIP tracking fields until shortly before the vCPU is run, to make sure the most up-to-date values of RIP and CS base are used regardless of KVM_SET_SREGS, KVM_SET_REGS, and KVM_SET_NESTED_STATE's relative ordering. Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260225005950.3739782-8-yosry@kernel.org [sean: deal with the svm_cancel_injection() madness] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 17 ++++++++--------- arch/x86/kvm/svm/svm.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 37 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 76d959d15e14..3e2841598a36 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -742,9 +742,7 @@ static bool is_evtinj_nmi(u32 evtinj) return type == SVM_EVTINJ_TYPE_NMI; } -static void nested_vmcb02_prepare_control(struct vcpu_svm *svm, - unsigned long vmcb12_rip, - unsigned long vmcb12_csbase) +static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) { u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK; u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK; @@ -856,15 +854,16 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm, vmcb02->control.next_rip = svm->nested.ctl.next_rip; svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj); + + /* + * soft_int_csbase, soft_int_old_rip, and soft_int_next_rip (if L1 + * doesn't have NRIPS) are initialized later, before the vCPU is run. 
*/ if (is_evtinj_soft(vmcb02->control.event_inj)) { svm->soft_int_injected = true; - svm->soft_int_csbase = vmcb12_csbase; - svm->soft_int_old_rip = vmcb12_rip; if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || !svm->nested.nested_run_pending) svm->soft_int_next_rip = svm->nested.ctl.next_rip; } /* LBR_CTL_ENABLE_MASK is controlled by svm_update_lbrv() */ @@ -962,7 +961,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr); svm_switch_vmcb(svm, &svm->nested.vmcb02); - nested_vmcb02_prepare_control(svm, vmcb12->save.rip, vmcb12->save.cs.base); + nested_vmcb02_prepare_control(svm); nested_vmcb02_prepare_save(svm, vmcb12); ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3, @@ -1907,7 +1906,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, nested_copy_vmcb_control_to_cache(svm, ctl); svm_switch_vmcb(svm, &svm->nested.vmcb02); - nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip, svm->vmcb->save.cs.base); + nested_vmcb02_prepare_control(svm); /* * Any previously restored state (e.g. KVM_SET_SREGS) would mark fields diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index f862bafc381a..d82e30c40eaa 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3637,6 +3637,16 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) return svm_invoke_exit_handler(vcpu, svm->vmcb->control.exit_code); } +static void svm_set_nested_run_soft_int_state(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm->soft_int_csbase = svm->vmcb->save.cs.base; + svm->soft_int_old_rip = kvm_rip_read(vcpu); + if (!guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) + svm->soft_int_next_rip = kvm_rip_read(vcpu); +} + static int pre_svm_run(struct kvm_vcpu *vcpu) { struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, vcpu->cpu); @@ -3759,6 +3769,13 @@ static void svm_fixup_nested_rips(struct kvm_vcpu *vcpu) if (boot_cpu_has(X86_FEATURE_NRIPS) && !guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) svm->vmcb->control.next_rip = kvm_rip_read(vcpu); + + /* + * Similarly, initialize the soft int metadata here to use the most + * up-to-date values of RIP and CS base, regardless of restore order. + */ + if (svm->soft_int_injected) + svm_set_nested_run_soft_int_state(vcpu); } void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode, @@ -4128,6 +4145,18 @@ static void svm_complete_soft_interrupt(struct kvm_vcpu *vcpu, u8 vector, bool is_soft = (type == SVM_EXITINTINFO_TYPE_SOFT); struct vcpu_svm *svm = to_svm(vcpu); + /* + * Initialize the soft int fields *before* reading them below if KVM + * aborted entry to the guest with a nested VMRUN pending. To ensure + * KVM uses up-to-date values for RIP and CS base across save/restore, + * regardless of restore order, KVM waits to set the soft int fields + * until VMRUN is imminent. But when canceling injection, KVM requeues + * the soft int and will reinject it via the standard injection flow, + * and so KVM needs to grab the state from the pending nested VMRUN. + */ + if (is_guest_mode(vcpu) && svm->nested.nested_run_pending) + svm_set_nested_run_soft_int_state(vcpu); + /* * If NRIPS is enabled, KVM must snapshot the pre-VMRUN next_rip that's * associated with the original soft exception/interrupt.
next_rip is -- cgit v1.2.3 From d99df02ff427f461102230f9c5b90a6c64ee8e23 Mon Sep 17 00:00:00 2001 From: Kevin Cheng Date: Sat, 28 Feb 2026 03:33:26 +0000 Subject: KVM: SVM: Inject #UD for INVLPGA if EFER.SVME=0 INVLPGA should cause a #UD when EFER.SVME is not set. Add a check to properly inject #UD when EFER.SVME=0. Fixes: ff092385e828 ("KVM: SVM: Implement INVLPGA") Cc: stable@vger.kernel.org Signed-off-by: Kevin Cheng Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260228033328.2285047-3-chengkev@google.com [sean: tag for stable@] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index d82e30c40eaa..543f9f3f966e 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2367,6 +2367,9 @@ static int invlpga_interception(struct kvm_vcpu *vcpu) gva_t gva = kvm_rax_read(vcpu); u32 asid = kvm_rcx_read(vcpu); + if (nested_svm_check_permissions(vcpu)) + return 1; + /* FIXME: Handle an address size prefix. */ if (!is_long_mode(vcpu)) gva = (u32)gva; -- cgit v1.2.3 From b53ab5167a81537777ac780bbd93d32613aa3bda Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:33:55 +0000 Subject: KVM: nSVM: Avoid clearing VMCB_LBR in vmcb12 svm_copy_lbrs() always marks VMCB_LBR dirty in the destination VMCB. However, nested_svm_vmexit() uses it to copy LBRs to vmcb12, and clearing clean bits in vmcb12 is not architecturally defined. Move vmcb_mark_dirty() to callers and drop it for vmcb12. This also facilitates incoming refactoring that does not pass the entire VMCB to svm_copy_lbrs(). Fixes: d20c796ca370 ("KVM: x86: nSVM: implement nested LBR virtualization") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-2-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 7 +++++-- arch/x86/kvm/svm/svm.c | 2 -- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 3e2841598a36..0a35c815f4d2 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -715,6 +715,7 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 } else { svm_copy_lbrs(vmcb02, vmcb01); } + vmcb_mark_dirty(vmcb02, VMCB_LBR); svm_update_lbrv(&svm->vcpu); } @@ -1231,10 +1232,12 @@ int nested_svm_vmexit(struct vcpu_svm *svm) kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && - (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) + (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) { svm_copy_lbrs(vmcb12, vmcb02); - else + } else { svm_copy_lbrs(vmcb01, vmcb02); + vmcb_mark_dirty(vmcb01, VMCB_LBR); + } svm_update_lbrv(vcpu); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 543f9f3f966e..9b4f5a46d550 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -848,8 +848,6 @@ void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb) to_vmcb->save.br_to = from_vmcb->save.br_to; to_vmcb->save.last_excp_from = from_vmcb->save.last_excp_from; to_vmcb->save.last_excp_to = from_vmcb->save.last_excp_to; - - vmcb_mark_dirty(to_vmcb, VMCB_LBR); } static void __svm_enable_lbrv(struct kvm_vcpu *vcpu) -- cgit v1.2.3 From 361dbe8173c460a2bf8aee23920f6c2dbdcabb94 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:33:56 +0000 Subject: KVM: SVM: Switch svm_copy_lbrs() to a macro In preparation for using svm_copy_lbrs() with 
'struct vmcb_save_area' without a containing 'struct vmcb', and later even 'struct vmcb_save_area_cached', make it a macro. Macros are generally not preferred compared to functions, mainly due to the lack of type safety. However, in this case it seems like having a simple macro copying a few fields is better than copy-pasting the same 5 lines of code in different places. Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-3-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 8 ++++---- arch/x86/kvm/svm/svm.c | 9 --------- arch/x86/kvm/svm/svm.h | 10 +++++++++- 3 files changed, 13 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 0a35c815f4d2..9c64d036e30b 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -710,10 +710,10 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 * Reserved bits of DEBUGCTL are ignored. Be consistent with * svm_set_msr's definition of reserved bits. */ - svm_copy_lbrs(vmcb02, vmcb12); + svm_copy_lbrs(&vmcb02->save, &vmcb12->save); vmcb02->save.dbgctl &= ~DEBUGCTL_RESERVED_BITS; } else { - svm_copy_lbrs(vmcb02, vmcb01); + svm_copy_lbrs(&vmcb02->save, &vmcb01->save); } vmcb_mark_dirty(vmcb02, VMCB_LBR); svm_update_lbrv(&svm->vcpu); @@ -1233,9 +1233,9 @@ int nested_svm_vmexit(struct vcpu_svm *svm) if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) { - svm_copy_lbrs(vmcb12, vmcb02); + svm_copy_lbrs(&vmcb12->save, &vmcb02->save); } else { - svm_copy_lbrs(vmcb01, vmcb02); + svm_copy_lbrs(&vmcb01->save, &vmcb02->save); vmcb_mark_dirty(vmcb01, VMCB_LBR); } diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 9b4f5a46d550..7170f2f623af 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -841,15 +841,6 @@ static void svm_recalc_msr_intercepts(struct kvm_vcpu *vcpu) */ } -void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb) -{ - to_vmcb->save.dbgctl = from_vmcb->save.dbgctl; - to_vmcb->save.br_from = from_vmcb->save.br_from; - to_vmcb->save.br_to = from_vmcb->save.br_to; - to_vmcb->save.last_excp_from = from_vmcb->save.last_excp_from; - to_vmcb->save.last_excp_to = from_vmcb->save.last_excp_to; -} - static void __svm_enable_lbrv(struct kvm_vcpu *vcpu) { to_svm(vcpu)->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK; diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index ebd7b36b1ceb..44d767cd1d25 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -713,8 +713,16 @@ static inline void *svm_vcpu_alloc_msrpm(void) return svm_alloc_permissions_map(MSRPM_SIZE, GFP_KERNEL_ACCOUNT); } +#define svm_copy_lbrs(to, from) \ +do { \ + (to)->dbgctl = (from)->dbgctl; \ + (to)->br_from = (from)->br_from; \ + (to)->br_to = (from)->br_to; \ + (to)->last_excp_from = (from)->last_excp_from; \ + (to)->last_excp_to = (from)->last_excp_to; \ +} while (0) + void svm_vcpu_free_msrpm(void *msrpm); -void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb); void svm_enable_lbrv(struct kvm_vcpu *vcpu); void svm_update_lbrv(struct kvm_vcpu *vcpu); -- cgit v1.2.3 From 3700f0788da6acf73b2df56690f4b201aa4aefd2 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:33:57 +0000 Subject: KVM: SVM: Add missing save/restore handling of LBR MSRs MSR_IA32_DEBUGCTLMSR and LBR MSRs are currently not enumerated by KVM_GET_MSR_INDEX_LIST, and LBR MSRs cannot be set with KVM_SET_MSRS.
So save/restore is completely broken. Fix it by adding the MSRs to msrs_to_save_base, and by allowing writes to the LBR MSRs from userspace only (they are read-only from the guest's perspective), and only if LBR virtualization is enabled. Additionally, to correctly restore L1's LBRs while L2 is running, make sure the LBRs are copied from the captured VMCB01 save area in svm_copy_vmrun_state(). Note, for VMX, this also fixes a flaw where MSR_IA32_DEBUGCTLMSR isn't reported as an MSR to save/restore. Note #2, over-reporting MSR_IA32_LASTxxx on Intel is ok, as KVM already handles unsupported reads and writes thanks to commit b5e2fec0ebc3 ("KVM: Ignore DEBUGCTL MSRs with no effect") (kvm_do_msr_access() will morph the unsupported userspace write into a nop). Fixes: 24e09cbf480a ("KVM: SVM: enable LBR virtualization") Cc: stable@vger.kernel.org Reported-by: Jim Mattson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-4-yosry@kernel.org [sean: guard with lbrv checks, massage changelog] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 5 +++++ arch/x86/kvm/svm/svm.c | 42 +++++++++++++++++++++++++++++++++++++----- arch/x86/kvm/x86.c | 3 +++ 3 files changed, 45 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 9c64d036e30b..2b1066ce23f5 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1099,6 +1099,11 @@ void svm_copy_vmrun_state(struct vmcb_save_area *to_save, to_save->isst_addr = from_save->isst_addr; to_save->ssp = from_save->ssp; } + + if (kvm_cpu_cap_has(X86_FEATURE_LBRV)) { + svm_copy_lbrs(to_save, from_save); + to_save->dbgctl &= ~DEBUGCTL_RESERVED_BITS; + } } void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7170f2f623af..e97c56df41f6 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2787,19 +2787,19 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) msr_info->data = svm->tsc_aux; break; case MSR_IA32_DEBUGCTLMSR: - msr_info->data = svm->vmcb->save.dbgctl; + msr_info->data = lbrv ? svm->vmcb->save.dbgctl : 0; break; case MSR_IA32_LASTBRANCHFROMIP: - msr_info->data = svm->vmcb->save.br_from; + msr_info->data = lbrv ? svm->vmcb->save.br_from : 0; break; case MSR_IA32_LASTBRANCHTOIP: - msr_info->data = svm->vmcb->save.br_to; + msr_info->data = lbrv ? svm->vmcb->save.br_to : 0; break; case MSR_IA32_LASTINTFROMIP: - msr_info->data = svm->vmcb->save.last_excp_from; + msr_info->data = lbrv ? svm->vmcb->save.last_excp_from : 0; break; case MSR_IA32_LASTINTTOIP: - msr_info->data = svm->vmcb->save.last_excp_to; + msr_info->data = lbrv ?
svm->vmcb->save.last_excp_to : 0; break; case MSR_VM_HSAVE_PA: msr_info->data = svm->nested.hsave_msr; @@ -3074,6 +3074,38 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) vmcb_mark_dirty(svm->vmcb, VMCB_LBR); svm_update_lbrv(vcpu); break; + case MSR_IA32_LASTBRANCHFROMIP: + if (!lbrv) + return KVM_MSR_RET_UNSUPPORTED; + if (!msr->host_initiated) + return 1; + svm->vmcb->save.br_from = data; + vmcb_mark_dirty(svm->vmcb, VMCB_LBR); + break; + case MSR_IA32_LASTBRANCHTOIP: + if (!lbrv) + return KVM_MSR_RET_UNSUPPORTED; + if (!msr->host_initiated) + return 1; + svm->vmcb->save.br_to = data; + vmcb_mark_dirty(svm->vmcb, VMCB_LBR); + break; + case MSR_IA32_LASTINTFROMIP: + if (!lbrv) + return KVM_MSR_RET_UNSUPPORTED; + if (!msr->host_initiated) + return 1; + svm->vmcb->save.last_excp_from = data; + vmcb_mark_dirty(svm->vmcb, VMCB_LBR); + break; + case MSR_IA32_LASTINTTOIP: + if (!lbrv) + return KVM_MSR_RET_UNSUPPORTED; + if (!msr->host_initiated) + return 1; + svm->vmcb->save.last_excp_to = data; + vmcb_mark_dirty(svm->vmcb, VMCB_LBR); + break; case MSR_VM_HSAVE_PA: /* * Old kernels did not validate the value written to diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6e87ec52fa06..64da02d1ee00 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -351,6 +351,9 @@ static const u32 msrs_to_save_base[] = { MSR_IA32_U_CET, MSR_IA32_S_CET, MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP, MSR_IA32_PL2_SSP, MSR_IA32_PL3_SSP, MSR_IA32_INT_SSP_TAB, + MSR_IA32_DEBUGCTLMSR, + MSR_IA32_LASTBRANCHFROMIP, MSR_IA32_LASTBRANCHTOIP, + MSR_IA32_LASTINTFROMIP, MSR_IA32_LASTINTTOIP, }; static const u32 msrs_to_save_pmu[] = { -- cgit v1.2.3 From ac17892e51525ccea892b7e3171e2d1e9bb6fa61 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:33:58 +0000 Subject: KVM: selftests: Add a test for LBR save/restore (ft. nested) Add a selftest exercising save/restore with usage of LBRs in both L1 and L2, and making sure all LBRs remain intact. 
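The core recording pattern the test relies on, condensed from the RECORD_AND_CHECK_BRANCH() macro in the diff below (a summary sketch, not additional test code):

  u64 from, to;

  wrmsr(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR);  /* start LBR recording */
  asm volatile("jmp 1f\n 1: nop");               /* take exactly one branch */
  from = rdmsr(MSR_IA32_LASTBRANCHFROMIP);       /* capture the LBR pair */
  to = rdmsr(MSR_IA32_LASTBRANCHTOIP);
  wrmsr(MSR_IA32_DEBUGCTLMSR, 0);                /* stop so the pair survives */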
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-5-yosry@kernel.org Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile.kvm | 1 + .../testing/selftests/kvm/include/x86/processor.h | 5 + .../selftests/kvm/x86/svm_lbr_nested_state.c | 145 +++++++++++++++++++++ 3 files changed, 151 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index fdec90e85467..36b48e766e49 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -112,6 +112,7 @@ TEST_GEN_PROGS_x86 += x86/svm_vmcall_test TEST_GEN_PROGS_x86 += x86/svm_int_ctl_test TEST_GEN_PROGS_x86 += x86/svm_nested_shutdown_test TEST_GEN_PROGS_x86 += x86/svm_nested_soft_inject_test +TEST_GEN_PROGS_x86 += x86/svm_lbr_nested_state TEST_GEN_PROGS_x86 += x86/tsc_scaling_sync TEST_GEN_PROGS_x86 += x86/sync_regs_test TEST_GEN_PROGS_x86 += x86/ucna_injection_test diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h index 4ebae4269e68..db0171935197 100644 --- a/tools/testing/selftests/kvm/include/x86/processor.h +++ b/tools/testing/selftests/kvm/include/x86/processor.h @@ -1360,6 +1360,11 @@ static inline bool kvm_is_ignore_msrs(void) return get_kvm_param_bool("ignore_msrs"); } +static inline bool kvm_is_lbrv_enabled(void) +{ + return !!get_kvm_amd_param_integer("lbrv"); +} + uint64_t *vm_get_pte(struct kvm_vm *vm, uint64_t vaddr); uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, uint64_t a1, uint64_t a2, diff --git a/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c new file mode 100644 index 000000000000..bf16abb1152e --- /dev/null +++ b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c @@ -0,0 +1,145 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2026, Google, Inc. 
+ */ + +#include "test_util.h" +#include "kvm_util.h" +#include "processor.h" +#include "svm_util.h" + + +#define L2_GUEST_STACK_SIZE 64 + +#define DO_BRANCH() do { asm volatile("jmp 1f\n 1: nop"); } while (0) + +struct lbr_branch { + u64 from, to; +}; + +volatile struct lbr_branch l2_branch; + +#define RECORD_AND_CHECK_BRANCH(b) \ +do { \ + wrmsr(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR); \ + DO_BRANCH(); \ + (b)->from = rdmsr(MSR_IA32_LASTBRANCHFROMIP); \ + (b)->to = rdmsr(MSR_IA32_LASTBRANCHTOIP); \ + /* Disable LBR right after to avoid overriding the IPs */ \ + wrmsr(MSR_IA32_DEBUGCTLMSR, 0); \ + \ + GUEST_ASSERT_NE((b)->from, 0); \ + GUEST_ASSERT_NE((b)->to, 0); \ +} while (0) + +#define CHECK_BRANCH_MSRS(b) \ +do { \ + GUEST_ASSERT_EQ((b)->from, rdmsr(MSR_IA32_LASTBRANCHFROMIP)); \ + GUEST_ASSERT_EQ((b)->to, rdmsr(MSR_IA32_LASTBRANCHTOIP)); \ +} while (0) + +#define CHECK_BRANCH_VMCB(b, vmcb) \ +do { \ + GUEST_ASSERT_EQ((b)->from, vmcb->save.br_from); \ + GUEST_ASSERT_EQ((b)->to, vmcb->save.br_to); \ +} while (0) + +static void l2_guest_code(struct svm_test_data *svm) +{ + /* Record a branch, trigger save/restore, and make sure LBRs are intact */ + RECORD_AND_CHECK_BRANCH(&l2_branch); + GUEST_SYNC(true); + CHECK_BRANCH_MSRS(&l2_branch); + vmmcall(); +} + +static void l1_guest_code(struct svm_test_data *svm, bool nested_lbrv) +{ + unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + struct vmcb *vmcb = svm->vmcb; + struct lbr_branch l1_branch; + + /* Record a branch, trigger save/restore, and make sure LBRs are intact */ + RECORD_AND_CHECK_BRANCH(&l1_branch); + GUEST_SYNC(true); + CHECK_BRANCH_MSRS(&l1_branch); + + /* Run L2, which will also do the same */ + generic_svm_setup(svm, l2_guest_code, + &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + if (nested_lbrv) + vmcb->control.virt_ext = LBR_CTL_ENABLE_MASK; + else + vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK; + + run_guest(vmcb, svm->vmcb_gpa); + GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL); + + /* Trigger save/restore one more time before checking, just for kicks */ + GUEST_SYNC(true); + + /* + * If LBR_CTL_ENABLE is set, L1 and L2 should have separate LBR MSRs, so + * expect L1's LBRs to remain intact and L2 LBRs to be in the VMCB. + * Otherwise, the MSRs are shared between L1 & L2 so expect L2's LBRs. + */ + if (nested_lbrv) { + CHECK_BRANCH_MSRS(&l1_branch); + CHECK_BRANCH_VMCB(&l2_branch, vmcb); + } else { + CHECK_BRANCH_MSRS(&l2_branch); + } + GUEST_DONE(); +} + +void test_lbrv_nested_state(bool nested_lbrv) +{ + struct kvm_x86_state *state = NULL; + struct kvm_vcpu *vcpu; + vm_vaddr_t svm_gva; + struct kvm_vm *vm; + struct ucall uc; + + pr_info("Testing with nested LBRV %s\n", nested_lbrv ? 
"enabled" : "disabled"); + + vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code); + vcpu_alloc_svm(vm, &svm_gva); + vcpu_args_set(vcpu, 2, svm_gva, nested_lbrv); + + for (;;) { + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); + switch (get_ucall(vcpu, &uc)) { + case UCALL_SYNC: + /* Save the vCPU state and restore it in a new VM on sync */ + pr_info("Guest triggered save/restore.\n"); + state = vcpu_save_state(vcpu); + kvm_vm_release(vm); + vcpu = vm_recreate_with_one_vcpu(vm); + vcpu_load_state(vcpu, state); + kvm_x86_state_cleanup(state); + break; + case UCALL_ABORT: + REPORT_GUEST_ASSERT(uc); + /* NOT REACHED */ + case UCALL_DONE: + goto done; + default: + TEST_FAIL("Unknown ucall %lu", uc.cmd); + } + } +done: + kvm_vm_free(vm); +} + +int main(int argc, char *argv[]) +{ + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); + TEST_REQUIRE(kvm_is_lbrv_enabled()); + + test_lbrv_nested_state(/*nested_lbrv=*/false); + test_lbrv_nested_state(/*nested_lbrv=*/true); + + return 0; +} -- cgit v1.2.3 From 01ddcdc55e097ca38c28ae656711b8e6d1df71f8 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:33:59 +0000 Subject: KVM: nSVM: Always inject a #GP if mapping VMCB12 fails on nested VMRUN nested_svm_vmrun() currently only injects a #GP if kvm_vcpu_map() fails with -EINVAL. But it could also fail with -EFAULT if creating a host mapping failed. Inject a #GP in all cases, no reason to treat failure modes differently. Fixes: 8c5fbf1a7231 ("KVM/nSVM: Use the new mapping API for mapping guest memory") CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-6-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 2b1066ce23f5..7a472d7c6e98 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1010,12 +1010,9 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) } vmcb12_gpa = svm->vmcb->save.rax; - ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map); - if (ret == -EINVAL) { + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) { kvm_inject_gp(vcpu, 0); return 1; - } else if (ret) { - return kvm_skip_emulated_instruction(vcpu); } ret = kvm_skip_emulated_instruction(vcpu); -- cgit v1.2.3 From 290c8d82023ab0e1d2782d37136541e017174d7c Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:00 +0000 Subject: KVM: nSVM: Refactor checking LBRV enablement in vmcb12 into a helper Refactor the vCPU cap and vmcb12 flag checks into a helper. The unlikely() annotation is dropped, it's unlikely (huh) to make a difference and the CPU will probably predict it better on its own. 
CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-7-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 7a472d7c6e98..d419fd516fa9 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -640,6 +640,12 @@ void nested_vmcb02_compute_g_pat(struct vcpu_svm *svm) svm->nested.vmcb02.ptr->save.g_pat = svm->vmcb01.ptr->save.g_pat; } +static bool nested_vmcb12_has_lbrv(struct kvm_vcpu *vcpu) +{ + return guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && + (to_svm(vcpu)->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK); +} + static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12) { bool new_vmcb12 = false; @@ -704,8 +710,7 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 vmcb_mark_dirty(vmcb02, VMCB_DR); } - if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && - (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) { + if (nested_vmcb12_has_lbrv(vcpu)) { /* * Reserved bits of DEBUGCTL are ignored. Be consistent with * svm_set_msr's definition of reserved bits. @@ -1233,8 +1238,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm) if (!nested_exit_on_intr(svm)) kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); - if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && - (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) { + if (nested_vmcb12_has_lbrv(vcpu)) { svm_copy_lbrs(&vmcb12->save, &vmcb02->save); } else { svm_copy_lbrs(&vmcb01->save, &vmcb02->save); -- cgit v1.2.3 From dcf3648ab71437b504abbfdc4e74622a0f1a56e3 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:01 +0000 Subject: KVM: nSVM: Refactor writing vmcb12 on nested #VMEXIT as a helper Move mapping vmcb12 and updating it out of nested_svm_vmexit() into a helper, no functional change intended. 
CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-8-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 77 +++++++++++++++++++++++++++-------------------- 1 file changed, 44 insertions(+), 33 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index d419fd516fa9..8c01916cb154 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1124,36 +1124,20 @@ void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb) to_vmcb->save.sysenter_eip = from_vmcb->save.sysenter_eip; } -int nested_svm_vmexit(struct vcpu_svm *svm) +static int nested_svm_vmexit_update_vmcb12(struct kvm_vcpu *vcpu) { - struct kvm_vcpu *vcpu = &svm->vcpu; - struct vmcb *vmcb01 = svm->vmcb01.ptr; + struct vcpu_svm *svm = to_svm(vcpu); struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; - struct vmcb *vmcb12; struct kvm_host_map map; + struct vmcb *vmcb12; int rc; rc = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.vmcb12_gpa), &map); - if (rc) { - if (rc == -EINVAL) - kvm_inject_gp(vcpu, 0); - return 1; - } + if (rc) + return rc; vmcb12 = map.hva; - /* Exit Guest-Mode */ - leave_guest_mode(vcpu); - svm->nested.vmcb12_gpa = 0; - WARN_ON_ONCE(svm->nested.nested_run_pending); - - kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); - - /* in case we halted in L2 */ - kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE); - - /* Give the current vmcb to the guest */ - vmcb12->save.es = vmcb02->save.es; vmcb12->save.cs = vmcb02->save.cs; vmcb12->save.ss = vmcb02->save.ss; @@ -1190,10 +1174,48 @@ int nested_svm_vmexit(struct vcpu_svm *svm) if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS)) vmcb12->control.next_rip = vmcb02->control.next_rip; + if (nested_vmcb12_has_lbrv(vcpu)) + svm_copy_lbrs(&vmcb12->save, &vmcb02->save); + vmcb12->control.int_ctl = svm->nested.ctl.int_ctl; vmcb12->control.event_inj = svm->nested.ctl.event_inj; vmcb12->control.event_inj_err = svm->nested.ctl.event_inj_err; + trace_kvm_nested_vmexit_inject(vmcb12->control.exit_code, + vmcb12->control.exit_info_1, + vmcb12->control.exit_info_2, + vmcb12->control.exit_int_info, + vmcb12->control.exit_int_info_err, + KVM_ISA_SVM); + + kvm_vcpu_unmap(vcpu, &map); + return 0; +} + +int nested_svm_vmexit(struct vcpu_svm *svm) +{ + struct kvm_vcpu *vcpu = &svm->vcpu; + struct vmcb *vmcb01 = svm->vmcb01.ptr; + struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; + int rc; + + rc = nested_svm_vmexit_update_vmcb12(vcpu); + if (rc) { + if (rc == -EINVAL) + kvm_inject_gp(vcpu, 0); + return 1; + } + + /* Exit Guest-Mode */ + leave_guest_mode(vcpu); + svm->nested.vmcb12_gpa = 0; + WARN_ON_ONCE(svm->nested.nested_run_pending); + + kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); + + /* in case we halted in L2 */ + kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE); + if (!kvm_pause_in_guest(vcpu->kvm)) { vmcb01->control.pause_filter_count = vmcb02->control.pause_filter_count; vmcb_mark_dirty(vmcb01, VMCB_INTERCEPTS); @@ -1238,9 +1260,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm) if (!nested_exit_on_intr(svm)) kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); - if (nested_vmcb12_has_lbrv(vcpu)) { - svm_copy_lbrs(&vmcb12->save, &vmcb02->save); - } else { + if (!nested_vmcb12_has_lbrv(vcpu)) { svm_copy_lbrs(&vmcb01->save, &vmcb02->save); vmcb_mark_dirty(vmcb01, VMCB_LBR); } @@ -1296,15 +1316,6 @@ int nested_svm_vmexit(struct vcpu_svm *svm) svm->vcpu.arch.dr7 = DR7_FIXED_1; kvm_update_dr7(&svm->vcpu); - 
trace_kvm_nested_vmexit_inject(vmcb12->control.exit_code, - vmcb12->control.exit_info_1, - vmcb12->control.exit_info_2, - vmcb12->control.exit_int_info, - vmcb12->control.exit_int_info_err, - KVM_ISA_SVM); - - kvm_vcpu_unmap(vcpu, &map); - nested_svm_transition_tlb_flush(vcpu); nested_svm_uninit_mmu_context(vcpu); -- cgit v1.2.3 From 1b30e7551767cb95b3e49bb169c72bbd76b56e05 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:02 +0000 Subject: KVM: nSVM: Triple fault if mapping VMCB12 fails on nested #VMEXIT KVM currently injects a #GP and hopes for the best if mapping VMCB12 fails on nested #VMEXIT, and only if the failure mode is -EINVAL. Mapping the VMCB12 could also fail if creating host mappings fails. After the #GP is injected, nested_svm_vmexit() bails early, without cleaning up (e.g. KVM_REQ_GET_NESTED_STATE_PAGES is set, is_guest_mode() is true, etc). Instead of optionally injecting a #GP, triple fault the guest if mapping VMCB12 fails since KVM cannot make a sane recovery. The APM states that a #VMEXIT will triple fault if host state is illegal or an exception occurs while loading host state, so the behavior is not entirely made up. Do not return early from nested_svm_vmexit(); continue cleaning up the vCPU state (e.g. switch back to vmcb01) to handle the failure as gracefully as possible. Fixes: cf74a78b229d ("KVM: SVM: Add VMEXIT handler and intercepts") CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-9-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 8c01916cb154..30c99bbe9927 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1199,12 +1199,8 @@ int nested_svm_vmexit(struct vcpu_svm *svm) struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; int rc; - rc = nested_svm_vmexit_update_vmcb12(vcpu); - if (rc) { - if (rc == -EINVAL) - kvm_inject_gp(vcpu, 0); - return 1; - } + if (nested_svm_vmexit_update_vmcb12(vcpu)) + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); /* Exit Guest-Mode */ leave_guest_mode(vcpu); -- cgit v1.2.3 From 5d291ef0585ed880ed4dd71ea1a5965e0a65fb53 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:03 +0000 Subject: KVM: nSVM: Triple fault if restoring host CR3 fails on nested #VMEXIT If loading L1's CR3 fails on a nested #VMEXIT, nested_svm_vmexit() returns an error code that is ignored by most callers, and continues to run L1 with corrupted state. A sane recovery is not possible in this case, and HW behavior is to cause a shutdown. Inject a triple fault instead, and do not return early from nested_svm_vmexit(). Continue cleaning up the vCPU state (e.g. clear pending exceptions) to handle the failure as gracefully as possible. From the APM: Upon #VMEXIT, the processor performs the following actions in order to return to the host execution context: ... if (illegal host state loaded, or exception while loading host state) shutdown else execute first host instruction following the VMRUN Remove the return value of nested_svm_vmexit(), which is mostly unchecked anyway.
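For context, KVM_REQ_TRIPLE_FAULT is serviced the next time the vCPU enters the run loop. A rough sketch of the common x86 handling (simplified and abbreviated; the exact flow in vcpu_enter_guest() differs in its details):

	if (kvm_check_request(KVM_REQ_TRIPLE_FAULT, vcpu)) {
		if (is_guest_mode(vcpu))
			/* forwarded to L1 as a shutdown, e.g. via nested_svm_triple_fault() */
			kvm_x86_ops.nested_ops->triple_fault(vcpu);
		else
			/* otherwise the whole VM exits to userspace with KVM_EXIT_SHUTDOWN */
			vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
	}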
Fixes: d82aaef9c88a ("KVM: nSVM: use nested_svm_load_cr3() on guest->host switch") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-10-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 10 +++------- arch/x86/kvm/svm/svm.c | 11 ++--------- arch/x86/kvm/svm/svm.h | 6 +++--- 3 files changed, 8 insertions(+), 19 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 30c99bbe9927..5e0feeb50ba3 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1192,12 +1192,11 @@ static int nested_svm_vmexit_update_vmcb12(struct kvm_vcpu *vcpu) return 0; } -int nested_svm_vmexit(struct vcpu_svm *svm) +void nested_svm_vmexit(struct vcpu_svm *svm) { struct kvm_vcpu *vcpu = &svm->vcpu; struct vmcb *vmcb01 = svm->vmcb01.ptr; struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; - int rc; if (nested_svm_vmexit_update_vmcb12(vcpu)) kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); @@ -1316,9 +1315,8 @@ int nested_svm_vmexit(struct vcpu_svm *svm) nested_svm_uninit_mmu_context(vcpu); - rc = nested_svm_load_cr3(vcpu, vmcb01->save.cr3, false, true); - if (rc) - return 1; + if (nested_svm_load_cr3(vcpu, vmcb01->save.cr3, false, true)) + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); /* * Drop what we picked up for L2 via svm_complete_interrupts() so it @@ -1343,8 +1341,6 @@ int nested_svm_vmexit(struct vcpu_svm *svm) */ if (kvm_apicv_activated(vcpu->kvm)) __kvm_vcpu_update_apicv(vcpu); - - return 0; } static void nested_svm_triple_fault(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index e97c56df41f6..7efa71709292 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2234,13 +2234,9 @@ static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode) [SVM_INSTR_VMSAVE] = vmsave_interception, }; struct vcpu_svm *svm = to_svm(vcpu); - int ret; if (is_guest_mode(vcpu)) { - /* Returns '1' or -errno on failure, '0' on success. 
*/ - ret = nested_svm_simple_vmexit(svm, guest_mode_exit_codes[opcode]); - if (ret) - return ret; + nested_svm_simple_vmexit(svm, guest_mode_exit_codes[opcode]); return 1; } return svm_instr_handlers[opcode](vcpu); @@ -4871,7 +4867,6 @@ static int svm_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram) { struct vcpu_svm *svm = to_svm(vcpu); struct kvm_host_map map_save; - int ret; if (!is_guest_mode(vcpu)) return 0; @@ -4891,9 +4886,7 @@ static int svm_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram) svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP]; svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP]; - ret = nested_svm_simple_vmexit(svm, SVM_EXIT_SW); - if (ret) - return ret; + nested_svm_simple_vmexit(svm, SVM_EXIT_SW); /* * KVM uses VMCB01 to store L1 host state while L2 runs but diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 44d767cd1d25..7629cb37c930 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -793,14 +793,14 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu); void svm_copy_vmrun_state(struct vmcb_save_area *to_save, struct vmcb_save_area *from_save); void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb); -int nested_svm_vmexit(struct vcpu_svm *svm); +void nested_svm_vmexit(struct vcpu_svm *svm); -static inline int nested_svm_simple_vmexit(struct vcpu_svm *svm, u32 exit_code) +static inline void nested_svm_simple_vmexit(struct vcpu_svm *svm, u32 exit_code) { svm->vmcb->control.exit_code = exit_code; svm->vmcb->control.exit_info_1 = 0; svm->vmcb->control.exit_info_2 = 0; - return nested_svm_vmexit(svm); + nested_svm_vmexit(svm); } int nested_svm_exit_handled(struct vcpu_svm *svm); -- cgit v1.2.3 From f85a6ce06e4a0d49652f57967a649ab09e06287c Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:04 +0000 Subject: KVM: nSVM: Clear GIF on nested #VMEXIT(INVALID) According to the APM, GIF is set to 0 on any #VMEXIT, including an #VMEXIT(INVALID) due to failed consistency checks. Clear GIF on consistency check failures. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-11-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 5e0feeb50ba3..ac7d7f82c82b 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1035,6 +1035,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) vmcb12->control.exit_code = SVM_EXIT_ERR; vmcb12->control.exit_info_1 = 0; vmcb12->control.exit_info_2 = 0; + svm_set_gif(svm, false); goto out; } -- cgit v1.2.3 From 69b721a86d0dcb026f6db7d111dcde7550442d2e Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:05 +0000 Subject: KVM: nSVM: Clear EVENTINJ fields in vmcb12 on nested #VMEXIT According to the APM, from the reference of the VMRUN instruction: Upon #VMEXIT, the processor performs the following actions in order to return to the host execution context: ... clear EVENTINJ field in VMCB KVM already syncs EVENTINJ fields from vmcb02 to cached vmcb12 on every L2->L0 #VMEXIT. Since these fields are zeroed by the CPU on #VMEXIT, they will mostly be zeroed in vmcb12 on nested #VMEXIT by nested_svm_vmexit(). However, this is not the case when: 1. Consistency checks fail, as nested_svm_vmexit() is not called. 2. Entering guest mode fails before L2 runs (e.g. due to failed load of CR3). 
(2) was broken by commit 2d8a42be0e2b ("KVM: nSVM: synchronize VMCB controls updated by the processor on every vmexit"), as prior to that nested_svm_vmexit() always zeroed EVENTINJ fields. Explicitly clear the fields in all nested #VMEXIT code paths. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Fixes: 2d8a42be0e2b ("KVM: nSVM: synchronize VMCB controls updated by the processor on every vmexit") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-12-yosry@kernel.org [sean: massage changelog formatting] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index ac7d7f82c82b..90c8bc641bf3 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1035,6 +1035,8 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) vmcb12->control.exit_code = SVM_EXIT_ERR; vmcb12->control.exit_info_1 = 0; vmcb12->control.exit_info_2 = 0; + vmcb12->control.event_inj = 0; + vmcb12->control.event_inj_err = 0; svm_set_gif(svm, false); goto out; } @@ -1178,9 +1180,9 @@ static int nested_svm_vmexit_update_vmcb12(struct kvm_vcpu *vcpu) if (nested_vmcb12_has_lbrv(vcpu)) svm_copy_lbrs(&vmcb12->save, &vmcb02->save); + vmcb12->control.event_inj = 0; + vmcb12->control.event_inj_err = 0; vmcb12->control.int_ctl = svm->nested.ctl.int_ctl; - vmcb12->control.event_inj = svm->nested.ctl.event_inj; - vmcb12->control.event_inj_err = svm->nested.ctl.event_inj_err; trace_kvm_nested_vmexit_inject(vmcb12->control.exit_code, vmcb12->control.exit_info_1, -- cgit v1.2.3 From 8998e1d012f3f45d0456f16706682cef04c3c436 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:06 +0000 Subject: KVM: nSVM: Clear tracking of L1->L2 NMI and soft IRQ on nested #VMEXIT KVM clears tracking of L1->L2 injected NMIs (i.e. nmi_l1_to_l2) and soft IRQs (i.e. soft_int_injected) on a synthesized #VMEXIT(INVALID) due to failed VMRUN. However, they are not explicitly cleared in other synthesized #VMEXITs. soft_int_injected is always cleared after the first VMRUN of L2 when completing interrupts, as any re-injection is then tracked by KVM (instead of purely in vmcb02). nmi_l1_to_l2 is not cleared after the first VMRUN if NMI injection failed, as KVM still needs to keep track that the NMI originated from L1 to avoid blocking NMIs for L1. It is only cleared when the NMI injection succeeds. KVM could synthesize a #VMEXIT to L1 before successfully injecting the NMI into L2 (e.g. due to a #NPF on L2's NMI handler in L1's NPTs). In this case, nmi_l1_to_l2 will remain true, and KVM may not correctly mask NMIs and intercept IRET when injecting an NMI into L1. Clear both nmi_l1_to_l2 and soft_int_injected in nested_svm_vmexit(), i.e. for all #VMEXITs except those that occur due to failed consistency checks, as those happen before nmi_l1_to_l2 or soft_int_injected are set. 
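To make the failure mode concrete, an illustrative sequence (comments only, paraphrasing the scenario above; not actual KVM code):

	/*
	 * 1. L1 injects an NMI into L2; KVM tracks its origin so the NMI
	 *    doesn't set virtual NMI-blocking for L1:
	 *		svm->nmi_l1_to_l2 = true;
	 *
	 * 2. KVM synthesizes a #VMEXIT to L1 before L2 runs (e.g. a #NPF
	 *    on L2's NMI handler in L1's NPTs); the injection never
	 *    completes, but nmi_l1_to_l2 is left set.
	 *
	 * 3. KVM later injects an NMI into *L1*; the stale flag causes
	 *    KVM to skip NMI masking and IRET interception for L1.
	 */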
Fixes: 159fc6fa3b7d ("KVM: nSVM: Transparently handle L1 -> L2 NMI re-injection") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-13-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 90c8bc641bf3..d0037f01fb98 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1064,8 +1064,6 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) out_exit_err: svm->nested.nested_run_pending = 0; - svm->nmi_l1_to_l2 = false; - svm->soft_int_injected = false; svm->vmcb->control.exit_code = SVM_EXIT_ERR; svm->vmcb->control.exit_info_1 = 0; @@ -1321,6 +1319,10 @@ void nested_svm_vmexit(struct vcpu_svm *svm) if (nested_svm_load_cr3(vcpu, vmcb01->save.cr3, false, true)) kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + /* Drop tracking for L1->L2 injected NMIs and soft IRQs */ + svm->nmi_l1_to_l2 = false; + svm->soft_int_injected = false; + /* * Drop what we picked up for L2 via svm_complete_interrupts() so it * doesn't end up in L1. -- cgit v1.2.3 From b786e34cde42922dace620e6f56f0858edae2311 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:07 +0000 Subject: KVM: nSVM: Drop nested_vmcb_check_{save/control}() wrappers The wrappers provide little value and make it harder to see what KVM is checking in the normal flow. Drop them. Opportunistically fixup comments referring to the functions, adding '()' to make it clear it's a reference to a function. No functional change intended. Co-developed-by: Sean Christopherson Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-14-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 36 ++++++++++-------------------------- 1 file changed, 10 insertions(+), 26 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index d0037f01fb98..0d447d044101 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -339,8 +339,8 @@ static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size) kvm_vcpu_is_legal_gpa(vcpu, addr + size - 1); } -static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu, - struct vmcb_ctrl_area_cached *control) +static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, + struct vmcb_ctrl_area_cached *control) { if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN))) return false; @@ -367,8 +367,8 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu, } /* Common checks that apply to both L1 and L2 state. 
*/ -static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu, - struct vmcb_save_area_cached *save) +static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu, + struct vmcb_save_area_cached *save) { if (CC(!(save->efer & EFER_SVME))) return false; @@ -402,22 +402,6 @@ static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu, return true; } -static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb_save_area_cached *save = &svm->nested.save; - - return __nested_vmcb_check_save(vcpu, save); -} - -static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl; - - return __nested_vmcb_check_controls(vcpu, ctl); -} - /* * If a feature is not advertised to L1, clear the corresponding vmcb12 * intercept. @@ -469,7 +453,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->pause_filter_count = from->pause_filter_count; to->pause_filter_thresh = from->pause_filter_thresh; - /* Copy asid here because nested_vmcb_check_controls will check it. */ + /* Copy asid here because nested_vmcb_check_controls() will check it */ to->asid = from->asid; to->msrpm_base_pa &= ~0x0fffULL; to->iopm_base_pa &= ~0x0fffULL; @@ -1030,8 +1014,8 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) nested_copy_vmcb_control_to_cache(svm, &vmcb12->control); nested_copy_vmcb_save_to_cache(svm, &vmcb12->save); - if (!nested_vmcb_check_save(vcpu) || - !nested_vmcb_check_controls(vcpu)) { + if (!nested_vmcb_check_save(vcpu, &svm->nested.save) || + !nested_vmcb_check_controls(vcpu, &svm->nested.ctl)) { vmcb12->control.exit_code = SVM_EXIT_ERR; vmcb12->control.exit_info_1 = 0; vmcb12->control.exit_info_2 = 0; @@ -1877,12 +1861,12 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, ret = -EINVAL; __nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl); - if (!__nested_vmcb_check_controls(vcpu, &ctl_cached)) + if (!nested_vmcb_check_controls(vcpu, &ctl_cached)) goto out_free; /* * Processor state contains L2 state. Check that it is * valid for guest mode (see nested_vmcb_check_save). + * valid for guest mode (see nested_vmcb_check_save()). */ cr0 = kvm_read_cr0(vcpu); if (((cr0 & X86_CR0_CD) == 0) && (cr0 & X86_CR0_NW)) @@ -1896,7 +1880,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, if (!(save->cr0 & X86_CR0_PG) || !(save->cr0 & X86_CR0_PE) || (save->rflags & X86_EFLAGS_VM) || - !__nested_vmcb_check_save(vcpu, &save_cached)) + !nested_vmcb_check_save(vcpu, &save_cached)) goto out_free; -- cgit v1.2.3 From e0b6f031d64c086edd563e7af9c0c0a2261dd2a4 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:08 +0000 Subject: KVM: nSVM: Drop the non-architectural consistency check for NP_ENABLE KVM currently fails a nested VMRUN and injects VMEXIT_INVALID (aka SVM_EXIT_ERR) if L1 sets NP_ENABLE and the host does not support NPTs. On first glance, it seems like the check should actually be for guest_cpu_cap_has(X86_FEATURE_NPT) instead, as it is possible for the host to support NPTs but the guest CPUID to not advertise it. However, the consistency check is not architectural to begin with. The APM does not mention VMEXIT_INVALID if NP_ENABLE is set on a processor that does not have X86_FEATURE_NPT. Hence, NP_ENABLE should be ignored if X86_FEATURE_NPT is not available for L1, so sanitize it when copying from the VMCB12 to KVM's cache.
Apart from the consistency check, NP_ENABLE in VMCB12 is currently ignored because the bit is actually copied from VMCB01 to VMCB02, not from VMCB12. Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-15-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 0d447d044101..2ed6530e7bd1 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -348,9 +348,6 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, if (CC(control->asid == 0)) return false; - if (CC((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && !npt_enabled)) - return false; - if (CC(!nested_svm_check_bitmap_pa(vcpu, control->msrpm_base_pa, MSRPM_SIZE))) return false; @@ -431,6 +428,11 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, nested_svm_sanitize_intercept(vcpu, to, SKINIT); nested_svm_sanitize_intercept(vcpu, to, RDPRU); + /* Always clear SVM_NESTED_CTL_NP_ENABLE if the guest cannot use NPTs */ + to->nested_ctl = from->nested_ctl; + if (!guest_cpu_cap_has(vcpu, X86_FEATURE_NPT)) + to->nested_ctl &= ~SVM_NESTED_CTL_NP_ENABLE; + to->iopm_base_pa = from->iopm_base_pa; to->msrpm_base_pa = from->msrpm_base_pa; to->tsc_offset = from->tsc_offset; @@ -444,7 +446,6 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->exit_info_2 = from->exit_info_2; to->exit_int_info = from->exit_int_info; to->exit_int_info_err = from->exit_int_info_err; - to->nested_ctl = from->nested_ctl; to->event_inj = from->event_inj; to->event_inj_err = from->event_inj_err; to->next_rip = from->next_rip; -- cgit v1.2.3 From b71138fcc362c67ebe66747bb22cb4e6b4d6a651 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:09 +0000 Subject: KVM: nSVM: Add missing consistency check for nCR3 validity MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit From the APM Volume #2, 15.25.4 (24593—Rev. 3.42—March 2024): When VMRUN is executed with nested paging enabled (NP_ENABLE = 1), the following conditions are considered illegal state combinations, in addition to those mentioned in “Canonicalization and Consistency Checks”: • Any MBZ bit of nCR3 is set. • Any G_PAT.PA field has an unsupported type encoding or any reserved field in G_PAT has a nonzero value. Add the consistency check for nCR3 being a legal GPA with no MBZ bits set. Note, the G_PAT.PA check is being handled separately[*]. 
Link: https://lore.kernel.org/kvm/20260205214326.1029278-3-jmattson@google.com [*] Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-16-yosry@kernel.org [sean: capture everything in CC(), massage changelog formatting] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 2ed6530e7bd1..a59b976c16db 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -348,6 +348,10 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, if (CC(control->asid == 0)) return false; + if (CC((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && + !kvm_vcpu_is_legal_gpa(vcpu, control->nested_cr3))) + return false; + if (CC(!nested_svm_check_bitmap_pa(vcpu, control->msrpm_base_pa, MSRPM_SIZE))) return false; -- cgit v1.2.3 From 96bd3e76a171a8e21a6387e54e4c420a81968492 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:10 +0000 Subject: KVM: nSVM: Add missing consistency check for EFER, CR0, CR4, and CS MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit According to the APM Volume #2, 15.5, Canonicalization and Consistency Checks (24593—Rev. 3.42—March 2024), the following condition (among others) results in a #VMEXIT with VMEXIT_INVALID (aka SVM_EXIT_ERR): EFER.LME, CR0.PG, CR4.PAE, CS.L, and CS.D are all non-zero. In the list of consistency checks done when EFER.LME and CR0.PG are set, add a check that CS.L and CS.D are not both set, after the existing check that CR4.PAE is set. This is functionally a nop because the nested VMRUN results in SVM_EXIT_ERR in HW, which is forwarded to L1, but KVM makes all consistency checks before a VMRUN is actually attempted. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-17-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++++++ arch/x86/kvm/svm/svm.h | 1 + 2 files changed, 7 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index a59b976c16db..50180565bcfc 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -391,6 +391,10 @@ static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu, CC(!(save->cr0 & X86_CR0_PE)) || CC(!kvm_vcpu_is_legal_cr3(vcpu, save->cr3))) return false; + + if (CC((save->cs.attrib & SVM_SELECTOR_L_MASK) && + (save->cs.attrib & SVM_SELECTOR_DB_MASK))) + return false; } /* Note, SVM doesn't have any additional restrictions on CR4. */ @@ -486,6 +490,8 @@ static void __nested_copy_vmcb_save_to_cache(struct vmcb_save_area_cached *to, * Copy only fields that are validated, as we need them * to avoid TOC/TOU races. 
*/ + to->cs = from->cs; + to->efer = from->efer; to->cr0 = from->cr0; to->cr3 = from->cr3; diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 7629cb37c930..0a5d5a4453b7 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -140,6 +140,7 @@ struct kvm_vmcb_info { }; struct vmcb_save_area_cached { + struct vmcb_seg cs; u64 efer; u64 cr4; u64 cr3; -- cgit v1.2.3 From 7e79f71bca5cf536f92effc7227bd044c2722c11 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:11 +0000 Subject: KVM: nSVM: Add missing consistency check for EVENTINJ MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit According to the APM Volume #2, 15.20 (24593—Rev. 3.42—March 2024): VMRUN exits with VMEXIT_INVALID error code if either: • Reserved values of TYPE have been specified, or • TYPE = 3 (exception) has been specified with a vector that does not correspond to an exception (this includes vector 2, which is an NMI, not an exception). Add the missing consistency checks to KVM. For the second point, inject VMEXIT_INVALID if the vector is anything but the vectors defined by the APM for exceptions. Reserved vectors are also considered invalid, which matches the HW behavior. Vector 9 (i.e. #CSO) is considered invalid because it is reserved on modern CPUs, and no CPUs are known to exist that both support SVM and produce #CSO. Defined exceptions could be different between virtual CPUs as new CPUs define new vectors. In a best effort to dynamically define the valid vectors, make all currently defined vectors valid except those obviously tied to a CPU feature: SHSTK -> #CP and SEV-ES -> #VC. As new vectors are defined, they can similarly be tied to corresponding CPU features. Invalid vectors on specific (e.g. old) CPUs that are missed by KVM should be rejected by HW anyway. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-18-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 50180565bcfc..1c5f0f08bb8c 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -339,6 +339,54 @@ static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size) kvm_vcpu_is_legal_gpa(vcpu, addr + size - 1); } +static bool nested_svm_event_inj_valid_exept(struct kvm_vcpu *vcpu, u8 vector) +{ + /* + * Vectors that do not correspond to a defined exception are invalid + * (including #NMI and reserved vectors). In a best effort to define + * valid exceptions based on the virtual CPU, make all exceptions always + * valid except those obviously tied to a CPU feature. + */ + switch (vector) { + case DE_VECTOR: case DB_VECTOR: case BP_VECTOR: case OF_VECTOR: + case BR_VECTOR: case UD_VECTOR: case NM_VECTOR: case DF_VECTOR: + case TS_VECTOR: case NP_VECTOR: case SS_VECTOR: case GP_VECTOR: + case PF_VECTOR: case MF_VECTOR: case AC_VECTOR: case MC_VECTOR: + case XM_VECTOR: case HV_VECTOR: case SX_VECTOR: + return true; + case CP_VECTOR: + return guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK); + case VC_VECTOR: + return guest_cpu_cap_has(vcpu, X86_FEATURE_SEV_ES); + } + return false; +} + +/* + * According to the APM, VMRUN exits with SVM_EXIT_ERR if SVM_EVTINJ_VALID is + * set and: + * - The type of event_inj is not one of the defined values. 
+ * - The type is SVM_EVTINJ_TYPE_EXEPT, but the vector is not a valid exception. + */ +static bool nested_svm_check_event_inj(struct kvm_vcpu *vcpu, u32 event_inj) +{ + u32 type = event_inj & SVM_EVTINJ_TYPE_MASK; + u8 vector = event_inj & SVM_EVTINJ_VEC_MASK; + + if (!(event_inj & SVM_EVTINJ_VALID)) + return true; + + if (type != SVM_EVTINJ_TYPE_INTR && type != SVM_EVTINJ_TYPE_NMI && + type != SVM_EVTINJ_TYPE_EXEPT && type != SVM_EVTINJ_TYPE_SOFT) + return false; + + if (type == SVM_EVTINJ_TYPE_EXEPT && + !nested_svm_event_inj_valid_exept(vcpu, vector)) + return false; + + return true; +} + static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, struct vmcb_ctrl_area_cached *control) { @@ -364,6 +412,9 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, return false; } + if (CC(!nested_svm_check_event_inj(vcpu, control->event_inj))) + return false; + return true; } -- cgit v1.2.3 From d5bde6113aed8315a2bfe708730b721be9c2f48b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:51 -0800 Subject: KVM: SVM: Explicitly mark vmcb01 dirty after modifying VMCB intercepts When reacting to an intercept update, explicitly mark vmcb01's intercepts dirty, as KVM always initially operates on vmcb01, and nested_svm_vmexit() isn't guaranteed to mark VMCB_INTERCEPTS as dirty. I.e. if L2 is active, KVM will modify the intercepts for L1, but might not mark them as dirty before the next VMRUN of L1. Fixes: 116a0a23676e ("KVM: SVM: Add clean-bit for intercetps, tsc-offset and pause filter count") Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 1c5f0f08bb8c..5b639d98bf09 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -128,11 +128,13 @@ void recalc_intercepts(struct vcpu_svm *svm) struct vmcb_ctrl_area_cached *g; unsigned int i; - vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); + vmcb_mark_dirty(svm->vmcb01.ptr, VMCB_INTERCEPTS); if (!is_guest_mode(&svm->vcpu)) return; + vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); + c = &svm->vmcb->control; h = &svm->vmcb01.ptr->control; g = &svm->nested.ctl; -- cgit v1.2.3 From c36991c6f8d2ab56ee67aff04e3c357f45cfc76c Mon Sep 17 00:00:00 2001 From: Kevin Cheng Date: Tue, 3 Mar 2026 16:22:22 -0800 Subject: KVM: nSVM: Raise #UD if unhandled VMMCALL isn't intercepted by L1 Explicitly synthesize a #UD for VMMCALL if L2 is active, L1 does NOT want to intercept VMMCALL, nested_svm_l2_tlb_flush_enabled() is true, and the hypercall is something other than one of the supported Hyper-V hypercalls. When all of the above conditions are met, KVM will intercept VMMCALL but never forward it to L1, i.e. will let L2 make hypercalls as if it were L1. The TLFS says a whole lot of nothing about this scenario, so go with the architectural behavior, which says that VMMCALL #UDs if it's not intercepted. Opportunistically do a 2-for-1 stub trade by stub-ifying the new API instead of the helpers it uses. The last remaining "single" stub will soon be dropped as well. 
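From L2's perspective, the architecturally correct outcome is a plain #UD. A hypothetical selftest-style guest snippet (illustrative only; the surrounding handler wiring is assumed and is not part of this series):

	static void l2_guest_code(void)
	{
		/*
		 * L1 does not intercept VMMCALL, so this must raise #UD in
		 * L2; KVM must not complete it as a hypercall on L1's
		 * behalf (the lone exception being the supported Hyper-V
		 * L2 TLB flush hypercalls, which L0 handles directly).
		 */
		asm volatile("vmmcall");
	}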
Suggested-by: Sean Christopherson Fixes: 3f4a812edf5c ("KVM: nSVM: hyper-v: Enable L2 TLB flush") Cc: Vitaly Kuznetsov Cc: stable@vger.kernel.org Signed-off-by: Kevin Cheng Link: https://patch.msgid.link/20260228033328.2285047-5-chengkev@google.com [sean: rewrite changelog and comment, tag for stable, remove defunct stubs] Reviewed-by: Yosry Ahmed Reviewed-by: Vitaly Kuznetsov Link: https://patch.msgid.link/20260304002223.1105129-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/hyperv.h | 8 -------- arch/x86/kvm/svm/hyperv.h | 11 +++++++++++ arch/x86/kvm/svm/nested.c | 4 +--- arch/x86/kvm/svm/svm.c | 19 ++++++++++++++++++- 4 files changed, 30 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index 6ce160ffa678..6301f79fcbae 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -305,14 +305,6 @@ static inline bool kvm_hv_has_stimer_pending(struct kvm_vcpu *vcpu) { return false; } -static inline bool kvm_hv_is_tlb_flush_hcall(struct kvm_vcpu *vcpu) -{ - return false; -} -static inline bool guest_hv_cpuid_has_l2_tlb_flush(struct kvm_vcpu *vcpu) -{ - return false; -} static inline int kvm_hv_verify_vp_assist(struct kvm_vcpu *vcpu) { return 0; diff --git a/arch/x86/kvm/svm/hyperv.h b/arch/x86/kvm/svm/hyperv.h index d3f8bfc05832..9af03970d40c 100644 --- a/arch/x86/kvm/svm/hyperv.h +++ b/arch/x86/kvm/svm/hyperv.h @@ -41,6 +41,13 @@ static inline bool nested_svm_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu) return hv_vcpu->vp_assist_page.nested_control.features.directhypercall; } +static inline bool nested_svm_is_l2_tlb_flush_hcall(struct kvm_vcpu *vcpu) +{ + return guest_hv_cpuid_has_l2_tlb_flush(vcpu) && + nested_svm_l2_tlb_flush_enabled(vcpu) && + kvm_hv_is_tlb_flush_hcall(vcpu); +} + void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu); #else /* CONFIG_KVM_HYPERV */ static inline void nested_svm_hv_update_vm_vp_ids(struct kvm_vcpu *vcpu) {} @@ -48,6 +55,10 @@ static inline bool nested_svm_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu) { return false; } +static inline bool nested_svm_is_l2_tlb_flush_hcall(struct kvm_vcpu *vcpu) +{ + return false; +} static inline void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu) {} #endif /* CONFIG_KVM_HYPERV */ diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 5b639d98bf09..0f7893a7cb04 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1738,9 +1738,7 @@ int nested_svm_exit_special(struct vcpu_svm *svm) } case SVM_EXIT_VMMCALL: /* Hyper-V L2 TLB flush hypercall is handled by L0 */ - if (guest_hv_cpuid_has_l2_tlb_flush(vcpu) && - nested_svm_l2_tlb_flush_enabled(vcpu) && - kvm_hv_is_tlb_flush_hcall(vcpu)) + if (nested_svm_is_l2_tlb_flush_hcall(vcpu)) return NESTED_EXIT_HOST; break; default: diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7efa71709292..9e6864cf58d3 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -52,6 +52,7 @@ #include "svm.h" #include "svm_ops.h" +#include "hyperv.h" #include "kvm_onhyperv.h" #include "svm_onhyperv.h" @@ -3248,6 +3249,22 @@ static int bus_lock_exit(struct kvm_vcpu *vcpu) return 0; } +static int vmmcall_interception(struct kvm_vcpu *vcpu) +{ + /* + * Inject a #UD if L2 is active and the VMMCALL isn't a Hyper-V TLB + * hypercall, as VMMCALL #UDs if it's not intercepted, and this path is + * reachable if and only if L1 doesn't want to intercept VMMCALL or has + * enabled L0 (KVM) handling of Hyper-V L2 TLB flush hypercalls. 
+ */ + if (is_guest_mode(vcpu) && !nested_svm_is_l2_tlb_flush_hcall(vcpu)) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 1; + } + + return kvm_emulate_hypercall(vcpu); +} + static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -3298,7 +3315,7 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = { [SVM_EXIT_TASK_SWITCH] = task_switch_interception, [SVM_EXIT_SHUTDOWN] = shutdown_interception, [SVM_EXIT_VMRUN] = vmrun_interception, - [SVM_EXIT_VMMCALL] = kvm_emulate_hypercall, + [SVM_EXIT_VMMCALL] = vmmcall_interception, [SVM_EXIT_VMLOAD] = vmload_interception, [SVM_EXIT_VMSAVE] = vmsave_interception, [SVM_EXIT_STGI] = stgi_interception, -- cgit v1.2.3 From 33d3617a52f9930d22b2af59f813c2fbdefa6dd5 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 3 Mar 2026 16:22:23 -0800 Subject: KVM: nSVM: Always intercept VMMCALL when L2 is active Always intercept VMMCALL now that KVM properly synthesizes a #UD as appropriate, i.e. when L1 doesn't want to intercept VMMCALL, to avoid putting L2 into an infinite #UD loop if KVM_X86_QUIRK_FIX_HYPERCALL_INSN is enabled. By letting L2 execute VMMCALL natively and thus #UD, for all intents and purposes KVM morphs the VMMCALL intercept into a #UD intercept (KVM always intercepts #UD). When the hypercall quirk is enabled, KVM "emulates" VMMCALL in response to the #UD by trying to fixup the opcode to the "right" vendor, then restarts the guest, without skipping the VMMCALL. As a result, the guest sees an endless stream of #UDs since it's already executing the correct vendor hypercall instruction, i.e. the emulator doesn't anticipate that the #UD could be due to lack of interception, as opposed to a truly undefined opcode. Fixes: 0d945bd93511 ("KVM: SVM: Don't allow nested guest to VMMCALL into host") Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed Reviewed-by: Vitaly Kuznetsov Link: https://patch.msgid.link/20260304002223.1105129-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/hyperv.h | 4 ---- arch/x86/kvm/svm/nested.c | 7 ------- 2 files changed, 11 deletions(-) diff --git a/arch/x86/kvm/svm/hyperv.h b/arch/x86/kvm/svm/hyperv.h index 9af03970d40c..f70d076911a6 100644 --- a/arch/x86/kvm/svm/hyperv.h +++ b/arch/x86/kvm/svm/hyperv.h @@ -51,10 +51,6 @@ static inline bool nested_svm_is_l2_tlb_flush_hcall(struct kvm_vcpu *vcpu) void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu); #else /* CONFIG_KVM_HYPERV */ static inline void nested_svm_hv_update_vm_vp_ids(struct kvm_vcpu *vcpu) {} -static inline bool nested_svm_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu) -{ - return false; -} static inline bool nested_svm_is_l2_tlb_flush_hcall(struct kvm_vcpu *vcpu) { return false; diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 0f7893a7cb04..fb86f09985e7 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -158,13 +158,6 @@ void recalc_intercepts(struct vcpu_svm *svm) vmcb_clr_intercept(c, INTERCEPT_VINTR); } - /* - * We want to see VMMCALLs from a nested guest only when Hyper-V L2 TLB - * flush feature is enabled. 
- */ - if (!nested_svm_l2_tlb_flush_enabled(&svm->vcpu)) - vmcb_clr_intercept(c, INTERCEPT_VMMCALL); - for (i = 0; i < MAX_INTERCEPT; i++) c->intercepts[i] |= g->intercepts[i]; -- cgit v1.2.3 From 69f779f79e0d1ff321a89ab56cdcab34613104c0 Mon Sep 17 00:00:00 2001 From: Kevin Cheng Date: Tue, 3 Mar 2026 16:30:09 -0800 Subject: KVM: SVM: Move STGI and CLGI intercept handling Move STGI/CLGI intercept handling to svm_recalc_instruction_intercepts() in preparation for making the function EFER.SVME-aware. This will allow configuring STGI/CLGI intercepts along with the intercepts for other SVM instructions when EFER.SVME is toggled (KVM needs to intercept SVM instructions when EFER.SVME=0 to inject #UD). When clearing the STGI intercept in particular, request KVM_REQ_EVENT if there is at least one pending GIF-controlled event. This avoids breaking NMI/SMI window tracking, as enable_{nmi,smi}_window() sets INTERCEPT_STGI to detect when NMIs become unblocked. KVM_REQ_EVENT forces kvm_check_and_inject_events() to re-evaluate pending events and re-enable the intercept if needed. Extract the pending GIF event check into a helper function svm_has_pending_gif_event() to deduplicate the logic between svm_recalc_instruction_intercepts() and svm_set_gif(). Signed-off-by: Kevin Cheng [sean: keep vgif handling out of the "Intel CPU model" path] Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260304003010.1108257-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 32 ++++++++++++++++++++++++-------- 1 file changed, 24 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 9e6864cf58d3..30d3291e4738 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -999,6 +999,14 @@ void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu) preempt_enable(); } +static bool svm_has_pending_gif_event(struct vcpu_svm *svm) +{ + return svm->vcpu.arch.smi_pending || + svm->vcpu.arch.nmi_pending || + kvm_cpu_has_injectable_intr(&svm->vcpu) || + kvm_apic_has_pending_init_or_sipi(&svm->vcpu); +} + /* Evaluate instruction intercepts that depend on guest CPUID features. */ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu) { @@ -1042,6 +1050,20 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu) } } + if (vgif) { + svm_clr_intercept(svm, INTERCEPT_STGI); + svm_clr_intercept(svm, INTERCEPT_CLGI); + + /* + * Process pending events when clearing STGI/CLGI intercepts if + * there's at least one pending event that is masked by GIF, so + * that KVM re-evaluates if the intercept needs to be set again + * to track when GIF is re-enabled (e.g. for NMI injection). 
+ */ + if (svm_has_pending_gif_event(svm)) + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + } + if (kvm_need_rdpmc_intercept(vcpu)) svm_set_intercept(svm, INTERCEPT_RDPMC); else @@ -1185,11 +1207,8 @@ static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event) if (vnmi) svm->vmcb->control.int_ctl |= V_NMI_ENABLE_MASK; - if (vgif) { - svm_clr_intercept(svm, INTERCEPT_STGI); - svm_clr_intercept(svm, INTERCEPT_CLGI); + if (vgif) svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK; - } if (vls) svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; @@ -2306,10 +2325,7 @@ void svm_set_gif(struct vcpu_svm *svm, bool value) svm_clear_vintr(svm); enable_gif(svm); - if (svm->vcpu.arch.smi_pending || - svm->vcpu.arch.nmi_pending || - kvm_cpu_has_injectable_intr(&svm->vcpu) || - kvm_apic_has_pending_init_or_sipi(&svm->vcpu)) + if (svm_has_pending_gif_event(svm)) kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); } else { disable_gif(svm); -- cgit v1.2.3 From 460c7eb2e7594319abcb2066c737cb8b5eb78213 Mon Sep 17 00:00:00 2001 From: Kevin Cheng Date: Tue, 3 Mar 2026 16:30:10 -0800 Subject: KVM: SVM: Recalc instruction intercepts when EFER.SVME is toggled The AMD APM states that VMRUN, VMLOAD, VMSAVE, CLGI, VMMCALL, and INVLPGA instructions should generate a #UD when EFER.SVME is cleared. Currently, when VMLOAD, VMSAVE, or CLGI are executed in L1 with EFER.SVME cleared, no #UD is generated in certain cases. This is because the intercepts for these instructions are cleared based on whether or not vls or vgif is enabled. The #UD fails to be generated when the intercepts are absent. Fix the missing #UD generation by ensuring that all relevant instructions have intercepts set when EFER.SVME is disabled. VMMCALL is special because KVM's ABI is that VMCALL/VMMCALL are always supported for L1 and never fault. Signed-off-by: Kevin Cheng [sean: isolate Intel CPU "compatibility" in EFER.SVME=1 path] Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260304003010.1108257-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 35 +++++++++++++++++++++++------------ 1 file changed, 23 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 30d3291e4738..5fbd87450f4f 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -245,6 +245,8 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer) if (svm_gp_erratum_intercept && !sev_guest(vcpu->kvm)) set_exception_intercept(svm, GP_VECTOR); } + + kvm_make_request(KVM_REQ_RECALC_INTERCEPTS, vcpu); } svm->vmcb->save.efer = efer | EFER_SVME; @@ -1032,27 +1034,31 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu) } /* - * No need to toggle VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK here, it is - * always set if vls is enabled. If the intercepts are set, the bit is - * meaningless anyway. + * Intercept instructions that #UD if EFER.SVME=0, as SVME must be set + * even when running the guest, i.e. hardware will only ever see + * EFER.SVME=1. + * + * No need to toggle any of the vgif/vls/etc. enable bits here, as they + * are set when the VMCB is initialized and never cleared (if the + * relevant intercepts are set, the enablements are meaningless anyway). 
*/ - if (guest_cpuid_is_intel_compatible(vcpu)) { + if (!(vcpu->arch.efer & EFER_SVME)) { svm_set_intercept(svm, INTERCEPT_VMLOAD); svm_set_intercept(svm, INTERCEPT_VMSAVE); + svm_set_intercept(svm, INTERCEPT_CLGI); + svm_set_intercept(svm, INTERCEPT_STGI); } else { /* * If hardware supports Virtual VMLOAD VMSAVE then enable it * in VMCB and clear intercepts to avoid #VMEXIT. */ - if (vls) { + if (guest_cpuid_is_intel_compatible(vcpu)) { + svm_set_intercept(svm, INTERCEPT_VMLOAD); + svm_set_intercept(svm, INTERCEPT_VMSAVE); + } else if (vls) { svm_clr_intercept(svm, INTERCEPT_VMLOAD); svm_clr_intercept(svm, INTERCEPT_VMSAVE); - } - } - - if (vgif) { - svm_clr_intercept(svm, INTERCEPT_STGI); - svm_clr_intercept(svm, INTERCEPT_CLGI); + } /* * Process pending events when clearing STGI/CLGI intercepts if @@ -1060,8 +1066,13 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu) * that KVM re-evaluates if the intercept needs to be set again * to track when GIF is re-enabled (e.g. for NMI injection). */ - if (svm_has_pending_gif_event(svm)) - kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + if (vgif) { + svm_clr_intercept(svm, INTERCEPT_CLGI); + svm_clr_intercept(svm, INTERCEPT_STGI); + + if (svm_has_pending_gif_event(svm)) + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + } } if (kvm_need_rdpmc_intercept(vcpu)) -- cgit v1.2.3 From 0b97f929831a70e7ad6d9dbd30ae1f65dd43526d Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:52 -0800 Subject: KVM: SVM: Separate recalc_intercepts() into nested vs. non-nested parts Extract the non-nested aspects of recalc_intercepts() into a separate helper, svm_mark_intercepts_dirty(), to make it clear that the call isn't *just* recalculating (vmcb02's) intercepts, and to not bury non-nested code in nested.c. As suggested by Yosry, opportunistically prepend "nested_vmcb02_" to recalc_intercepts() so that it's obvious the function specifically deals with recomputing intercepts for L2. No functional change intended. Cc: Yosry Ahmed Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 9 ++------- arch/x86/kvm/svm/sev.c | 2 +- arch/x86/kvm/svm/svm.c | 4 ++-- arch/x86/kvm/svm/svm.h | 26 ++++++++++++++++++++------ 4 files changed, 25 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index fb86f09985e7..21ee75d6cdff 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -122,17 +122,12 @@ static bool nested_vmcb_needs_vls_intercept(struct vcpu_svm *svm) return false; } -void recalc_intercepts(struct vcpu_svm *svm) +void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) { struct vmcb_control_area *c, *h; struct vmcb_ctrl_area_cached *g; unsigned int i; - vmcb_mark_dirty(svm->vmcb01.ptr, VMCB_INTERCEPTS); - - if (!is_guest_mode(&svm->vcpu)) - return; - vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); c = &svm->vmcb->control; @@ -962,7 +957,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) * Merge guest and host intercepts - must be called with vcpu in * guest-mode to take effect. 
*/ - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to_vmcb) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 3f9c1aa39a0a..fea4a65758ad 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -4639,7 +4639,7 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm, bool init_event) if (!sev_vcpu_has_debug_swap(svm)) { vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_READ); vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_WRITE); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } else { /* * Disable #DB intercept iff DebugSwap is enabled. KVM doesn't diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 5fbd87450f4f..1901e9feff51 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -638,7 +638,7 @@ static void set_dr_intercepts(struct vcpu_svm *svm) vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_READ); vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_WRITE); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static void clr_dr_intercepts(struct vcpu_svm *svm) @@ -647,7 +647,7 @@ static void clr_dr_intercepts(struct vcpu_svm *svm) vmcb->control.intercepts[INTERCEPT_DR] = 0; - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static bool msr_write_intercepted(struct kvm_vcpu *vcpu, u32 msr) diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 0a5d5a4453b7..267ef8a3359b 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -358,8 +358,6 @@ struct svm_cpu_data { DECLARE_PER_CPU(struct svm_cpu_data, svm_data); -void recalc_intercepts(struct vcpu_svm *svm); - static __always_inline struct kvm_svm *to_kvm_svm(struct kvm *kvm) { return container_of(kvm, struct kvm_svm, kvm); @@ -487,6 +485,22 @@ static inline bool vmcb12_is_intercept(struct vmcb_ctrl_area_cached *control, u3 return __vmcb_is_intercept((unsigned long *)&control->intercepts, bit); } +void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm); + +static inline void svm_mark_intercepts_dirty(struct vcpu_svm *svm) +{ + vmcb_mark_dirty(svm->vmcb01.ptr, VMCB_INTERCEPTS); + + /* + * If L2 is active, recalculate the intercepts for vmcb02 to account + * for the changes made to vmcb01. All intercept configuration is done + * for vmcb01 and then propagated to vmcb02 to combine KVM's intercepts + * with L1's intercepts (from the vmcb12 snapshot). 
+ */ + if (is_guest_mode(&svm->vcpu)) + nested_vmcb02_recalc_intercepts(svm); +} + static inline void set_exception_intercept(struct vcpu_svm *svm, u32 bit) { struct vmcb *vmcb = svm->vmcb01.ptr; @@ -494,7 +508,7 @@ static inline void set_exception_intercept(struct vcpu_svm *svm, u32 bit) WARN_ON_ONCE(bit >= 32); vmcb_set_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static inline void clr_exception_intercept(struct vcpu_svm *svm, u32 bit) @@ -504,7 +518,7 @@ static inline void clr_exception_intercept(struct vcpu_svm *svm, u32 bit) WARN_ON_ONCE(bit >= 32); vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static inline void svm_set_intercept(struct vcpu_svm *svm, int bit) @@ -513,7 +527,7 @@ static inline void svm_set_intercept(struct vcpu_svm *svm, int bit) vmcb_set_intercept(&vmcb->control, bit); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static inline void svm_clr_intercept(struct vcpu_svm *svm, int bit) @@ -522,7 +536,7 @@ static inline void svm_clr_intercept(struct vcpu_svm *svm, int bit) vmcb_clr_intercept(&vmcb->control, bit); - recalc_intercepts(svm); + svm_mark_intercepts_dirty(svm); } static inline bool svm_is_intercept(struct vcpu_svm *svm, int bit) -- cgit v1.2.3 From a367b6e10372b46fa10debd889e89aa65ca65aee Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 18 Feb 2026 15:09:53 -0800 Subject: KVM: nSVM: WARN and abort vmcb02 intercepts recalc if vmcb02 isn't active WARN and bail early from nested_vmcb02_recalc_intercepts() if vmcb02 isn't the active/current VMCB, as recalculating intercepts for vmcb01 using logic intended for merging vmcb12 and vmcb01 intercepts can yield unexpected and unwanted results. In addition to hardening against general bugs, this will provide additional safeguards "if" nested_vmcb02_recalc_intercepts() is invoked directly from nested_vmcb02_prepare_control(). Signed-off-by: Yosry Ahmed [sean: split to separate patch, bail early on "failure"] Link: https://patch.msgid.link/20260218230958.2877682-4-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 21ee75d6cdff..75e7deef51a5 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -128,6 +128,9 @@ void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) struct vmcb_ctrl_area_cached *g; unsigned int i; + if (WARN_ON_ONCE(svm->vmcb != svm->nested.vmcb02.ptr)) + return; + vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); c = &svm->vmcb->control; -- cgit v1.2.3 From 4a80c4bc1f10645fe3fc51d4c116f69096340683 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:54 -0800 Subject: KVM: nSVM: Directly (re)calc vmcb02 intercepts from nested_vmcb02_prepare_control() Now that nested_vmcb02_recalc_intercepts() provides guardrails against it being incorrectly called without vmcb02 active, invoke it directly from nested_vmcb02_prepare_control() instead of bouncing through svm_mark_intercepts_dirty(), which unnecessarily marks vmcb01 as dirty. 
Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-5-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 75e7deef51a5..5ee77a5130d3 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -960,7 +960,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) * Merge guest and host intercepts - must be called with vcpu in * guest-mode to take effect. */ - svm_mark_intercepts_dirty(svm); + nested_vmcb02_recalc_intercepts(svm); } static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to_vmcb) -- cgit v1.2.3 From 586160b750914d5bd636f395a2ba9248c6f346e5 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:55 -0800 Subject: KVM: nSVM: Use intuitive local variables in nested_vmcb02_recalc_intercepts() Now that nested_vmcb02_recalc_intercepts() is explicitly scoped to deal with *only* recalculating vmcb02 intercepts, rename its local variables to use more intuitive names. The current "c", "h", and "g" local variables, for the current VMCB, vmcb01, and (cached) vmcb12 respectively, are short and sweet, but don't do much to help unfamiliar readers understand what the code is doing. Use vmcb02/vmcb01/vmcb12_ctrl in lieu of c/h/g to make it clear the function is updating intercepts in vmcb02 based on the intercepts in vmcb01 and (cached) vmcb12. Opportunistically change the existing WARN_ON to a WARN_ON_ONCE so that a KVM bug doesn't unintentionally DoS the host. No functional change intended. Signed-off-by: Yosry Ahmed [sean: use WARN_ON_ONCE, keep local vmcb12 cache as vmcb12_ctrl] Link: https://patch.msgid.link/20260218230958.2877682-6-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 33 +++++++++++++++------------------ 1 file changed, 15 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 5ee77a5130d3..46804b54200d 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -124,23 +124,20 @@ static bool nested_vmcb_needs_vls_intercept(struct vcpu_svm *svm) void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) { - struct vmcb_control_area *c, *h; - struct vmcb_ctrl_area_cached *g; + struct vmcb_ctrl_area_cached *vmcb12_ctrl = &svm->nested.ctl; + struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; + struct vmcb *vmcb01 = svm->vmcb01.ptr; unsigned int i; - if (WARN_ON_ONCE(svm->vmcb != svm->nested.vmcb02.ptr)) + if (WARN_ON_ONCE(svm->vmcb != vmcb02)) return; - vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); - - c = &svm->vmcb->control; - h = &svm->vmcb01.ptr->control; - g = &svm->nested.ctl; + vmcb_mark_dirty(vmcb02, VMCB_INTERCEPTS); for (i = 0; i < MAX_INTERCEPT; i++) - c->intercepts[i] = h->intercepts[i]; + vmcb02->control.intercepts[i] = vmcb01->control.intercepts[i]; - if (g->int_ctl & V_INTR_MASKING_MASK) { + if (vmcb12_ctrl->int_ctl & V_INTR_MASKING_MASK) { /* * If L2 is active and V_INTR_MASKING is enabled in vmcb12, * disable intercept of CR8 writes as L2's CR8 does not affect @@ -151,17 +148,17 @@ void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) * the effective RFLAGS.IF for L1 interrupts will never be set * while L2 is running (L2's RFLAGS.IF doesn't affect L1 IRQs). 
*/ - vmcb_clr_intercept(c, INTERCEPT_CR8_WRITE); - if (!(svm->vmcb01.ptr->save.rflags & X86_EFLAGS_IF)) - vmcb_clr_intercept(c, INTERCEPT_VINTR); + vmcb_clr_intercept(&vmcb02->control, INTERCEPT_CR8_WRITE); + if (!(vmcb01->save.rflags & X86_EFLAGS_IF)) + vmcb_clr_intercept(&vmcb02->control, INTERCEPT_VINTR); } for (i = 0; i < MAX_INTERCEPT; i++) - c->intercepts[i] |= g->intercepts[i]; + vmcb02->control.intercepts[i] |= vmcb12_ctrl->intercepts[i]; /* If SMI is not intercepted, ignore guest SMI intercept as well */ if (!intercept_smi) - vmcb_clr_intercept(c, INTERCEPT_SMI); + vmcb_clr_intercept(&vmcb02->control, INTERCEPT_SMI); if (nested_vmcb_needs_vls_intercept(svm)) { /* @@ -169,10 +166,10 @@ void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) * we must intercept these instructions to correctly * emulate them in case L1 doesn't intercept them. */ - vmcb_set_intercept(c, INTERCEPT_VMLOAD); - vmcb_set_intercept(c, INTERCEPT_VMSAVE); + vmcb_set_intercept(&vmcb02->control, INTERCEPT_VMLOAD); + vmcb_set_intercept(&vmcb02->control, INTERCEPT_VMSAVE); } else { - WARN_ON(!(c->virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK)); + WARN_ON_ONCE(!(vmcb02->control.virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK)); } } -- cgit v1.2.3 From ef09eebc5736add3415b6efb009fdb7c47a504c7 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 18 Feb 2026 15:09:56 -0800 Subject: KVM: nSVM: Use vmcb12_is_intercept() in nested_sync_control_from_vmcb02() Use vmcb12_is_intercept() instead of open-coding the intercept check. No functional change intended. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-7-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 46804b54200d..c965d10f3187 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -570,7 +570,7 @@ void nested_sync_control_from_vmcb02(struct vcpu_svm *svm) * int_ctl (because it was never recognized while L2 was running). */ if (svm_is_intercept(svm, INTERCEPT_VINTR) && - !test_bit(INTERCEPT_VINTR, (unsigned long *)svm->nested.ctl.intercepts)) + !vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_VINTR)) mask &= ~V_IRQ_MASK; if (nested_vgif_enabled(svm)) -- cgit v1.2.3 From af75470944f4c978956001cd6034f67469957c1b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:57 -0800 Subject: KVM: nSVM: Move vmcb_ctrl_area_cached.bus_lock_rip to svm_nested_state Move "bus_lock_rip" from "vmcb_ctrl_area_cached" to "svm_nested_state" as "last_bus_lock_rip" to more accurately reflect what it tracks, and because it is NOT a cached vmcb12 control field. The misplaced field isn't all that apparent in the current code base, as KVM uses "svm->nested.ctl" broadly, but the bad placement becomes glaringly obvious if "svm->nested.ctl" is captured as a local "vmcb12_ctrl" variable. No functional change intended. 
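To see why the field is per-vCPU nested state rather than cached vmcb12 data, here is the logic it feeds, condensed from the hunks below: on nested VMRUN, KVM re-arms bus lock detection unless L2 is resuming at the very RIP that already took a bus-lock exit:

	/* nested_vmcb02_prepare_control(), condensed */
	if (vmcb02->save.rip && (svm->nested.last_bus_lock_rip == vmcb02->save.rip))
		vmcb02->control.bus_lock_counter = 1;	/* let the retry make progress */
	else
		vmcb02->control.bus_lock_counter = 0;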
Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-8-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 8 ++++---- arch/x86/kvm/svm/svm.c | 2 +- arch/x86/kvm/svm/svm.h | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index c965d10f3187..dc4cca7df47e 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -850,7 +850,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) * L1 re-enters L2, the same instruction will trigger a VM-Exit and the * entire cycle start over. */ - if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == vmcb02->save.rip)) + if (vmcb02->save.rip && (svm->nested.last_bus_lock_rip == vmcb02->save.rip)) vmcb02->control.bus_lock_counter = 1; else vmcb02->control.bus_lock_counter = 0; @@ -1255,11 +1255,11 @@ void nested_svm_vmexit(struct vcpu_svm *svm) } /* - * Invalidate bus_lock_rip unless KVM is still waiting for the guest - * to make forward progress before re-enabling bus lock detection. + * Invalidate last_bus_lock_rip unless KVM is still waiting for the + * guest to make forward progress before re-enabling bus lock detection. */ if (!vmcb02->control.bus_lock_counter) - svm->nested.ctl.bus_lock_rip = INVALID_GPA; + svm->nested.last_bus_lock_rip = INVALID_GPA; nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 1901e9feff51..62501c120112 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3271,7 +3271,7 @@ static int bus_lock_exit(struct kvm_vcpu *vcpu) vcpu->arch.complete_userspace_io = complete_userspace_buslock; if (is_guest_mode(vcpu)) - svm->nested.ctl.bus_lock_rip = vcpu->arch.cui_linear_rip; + svm->nested.last_bus_lock_rip = vcpu->arch.cui_linear_rip; return 0; } diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 267ef8a3359b..6c3b3fae91ec 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -174,7 +174,6 @@ struct vmcb_ctrl_area_cached { u64 nested_cr3; u64 virt_ext; u32 clean; - u64 bus_lock_rip; union { #if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(CONFIG_KVM_HYPERV) struct hv_vmcb_enlightenments hv_enlightenments; @@ -189,6 +188,7 @@ struct svm_nested_state { u64 vm_cr_msr; u64 vmcb12_gpa; u64 last_vmcb12_gpa; + u64 last_bus_lock_rip; /* * The MSR permissions map used for vmcb02, which is the merge result -- cgit v1.2.3 From 56bfbe68f78ece2ea9b15f31ec8f7543d8942e3b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 18 Feb 2026 15:09:58 -0800 Subject: KVM: nSVM: Capture svm->nested.ctl as vmcb12_ctrl when preparing vmcb02 Grab svm->nested.ctl as vmcb12_ctrl when preparing the vmcb02 controls to make it more obvious that much of the data is coming from vmcb12 (or rather, a snapshot of vmcb12 at the time of L1's VMRUN). Opportunistically reorder the variable definitions to create a pretty reverse fir tree. No functional change intended. 
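For readers unfamiliar with the idiom, a "reverse fir tree" (a.k.a. reverse Christmas tree) orders local declarations from longest line to shortest, as in the resulting block from the diff below:

	struct vmcb_ctrl_area_cached *vmcb12_ctrl = &svm->nested.ctl;
	struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
	struct vmcb *vmcb01 = svm->vmcb01.ptr;
	struct kvm_vcpu *vcpu = &svm->vcpu;
	u32 pause_count12, pause_thresh12;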
Cc: Yosry Ahmed Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260218230958.2877682-9-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 39 +++++++++++++++++++-------------------- 1 file changed, 19 insertions(+), 20 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index dc4cca7df47e..146faa7584a1 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -789,11 +789,11 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK; u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK; - struct kvm_vcpu *vcpu = &svm->vcpu; - struct vmcb *vmcb01 = svm->vmcb01.ptr; + struct vmcb_ctrl_area_cached *vmcb12_ctrl = &svm->nested.ctl; struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; - u32 pause_count12; - u32 pause_thresh12; + struct vmcb *vmcb01 = svm->vmcb01.ptr; + struct kvm_vcpu *vcpu = &svm->vcpu; + u32 pause_count12, pause_thresh12; nested_svm_transition_tlb_flush(vcpu); @@ -806,7 +806,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) */ if (guest_cpu_cap_has(vcpu, X86_FEATURE_VGIF) && - (svm->nested.ctl.int_ctl & V_GIF_ENABLE_MASK)) + (vmcb12_ctrl->int_ctl & V_GIF_ENABLE_MASK)) int_ctl_vmcb12_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK); else int_ctl_vmcb01_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK); @@ -864,10 +864,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) if (nested_npt_enabled(svm)) nested_svm_init_mmu_context(vcpu); - vcpu->arch.tsc_offset = kvm_calc_nested_tsc_offset( - vcpu->arch.l1_tsc_offset, - svm->nested.ctl.tsc_offset, - svm->tsc_ratio_msr); + vcpu->arch.tsc_offset = kvm_calc_nested_tsc_offset(vcpu->arch.l1_tsc_offset, + vmcb12_ctrl->tsc_offset, + svm->tsc_ratio_msr); vmcb02->control.tsc_offset = vcpu->arch.tsc_offset; @@ -876,13 +875,13 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) nested_svm_update_tsc_ratio_msr(vcpu); vmcb02->control.int_ctl = - (svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) | + (vmcb12_ctrl->int_ctl & int_ctl_vmcb12_bits) | (vmcb01->control.int_ctl & int_ctl_vmcb01_bits); - vmcb02->control.int_vector = svm->nested.ctl.int_vector; - vmcb02->control.int_state = svm->nested.ctl.int_state; - vmcb02->control.event_inj = svm->nested.ctl.event_inj; - vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err; + vmcb02->control.int_vector = vmcb12_ctrl->int_vector; + vmcb02->control.int_state = vmcb12_ctrl->int_state; + vmcb02->control.event_inj = vmcb12_ctrl->event_inj; + vmcb02->control.event_inj_err = vmcb12_ctrl->event_inj_err; /* * If nrips is exposed to L1, take NextRIP as-is. 
Otherwise, L1 @@ -893,7 +892,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) */ if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || !svm->nested.nested_run_pending) - vmcb02->control.next_rip = svm->nested.ctl.next_rip; + vmcb02->control.next_rip = vmcb12_ctrl->next_rip; svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj); @@ -905,7 +904,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) svm->soft_int_injected = true; if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || !svm->nested.nested_run_pending) - svm->soft_int_next_rip = svm->nested.ctl.next_rip; + svm->soft_int_next_rip = vmcb12_ctrl->next_rip; } /* LBR_CTL_ENABLE_MASK is controlled by svm_update_lbrv() */ @@ -914,11 +913,11 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) vmcb02->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; if (guest_cpu_cap_has(vcpu, X86_FEATURE_PAUSEFILTER)) - pause_count12 = svm->nested.ctl.pause_filter_count; + pause_count12 = vmcb12_ctrl->pause_filter_count; else pause_count12 = 0; if (guest_cpu_cap_has(vcpu, X86_FEATURE_PFTHRESHOLD)) - pause_thresh12 = svm->nested.ctl.pause_filter_thresh; + pause_thresh12 = vmcb12_ctrl->pause_filter_thresh; else pause_thresh12 = 0; if (kvm_pause_in_guest(svm->vcpu.kvm)) { @@ -932,7 +931,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) vmcb02->control.pause_filter_thresh = vmcb01->control.pause_filter_thresh; /* ... but ensure filtering is disabled if so requested. */ - if (vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_PAUSE)) { + if (vmcb12_is_intercept(vmcb12_ctrl, INTERCEPT_PAUSE)) { if (!pause_count12) vmcb02->control.pause_filter_count = 0; if (!pause_thresh12) @@ -949,7 +948,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) * L2 is the "guest"). */ if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS)) - vmcb02->control.erap_ctl = (svm->nested.ctl.erap_ctl & + vmcb02->control.erap_ctl = (vmcb12_ctrl->erap_ctl & ERAP_CONTROL_ALLOW_LARGER_RAP) | ERAP_CONTROL_CLEAR_RAP; -- cgit v1.2.3 From 1aea80dd42cf46d11af5ff7874a4f4dae77efd6a Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 3 Mar 2026 08:58:06 -0800 Subject: KVM: SVM: Rename vmcb->nested_ctl to vmcb->misc_ctl The 'nested_ctl' field is misnamed. Although the first bit is for nested paging, the other defined bits are for SEV/SEV-ES. Other bits in the same field according to the APM (but not defined by KVM) include "Guest Mode Execution Trap", "Enable INVLPGB/TLBSYNC", and other control bits unrelated to 'nested'. There is nothing common among these bits, so just name the field misc_ctl. Also rename the flags accordingly. 
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-19-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/include/asm/svm.h | 8 ++++---- arch/x86/kvm/svm/nested.c | 14 +++++++------- arch/x86/kvm/svm/sev.c | 4 ++-- arch/x86/kvm/svm/svm.c | 4 ++-- arch/x86/kvm/svm/svm.h | 4 ++-- tools/testing/selftests/kvm/include/x86/svm.h | 6 +++--- tools/testing/selftests/kvm/lib/x86/svm.c | 2 +- 7 files changed, 21 insertions(+), 21 deletions(-) diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index edde36097ddc..983db6575141 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -142,7 +142,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area { u64 exit_info_2; u32 exit_int_info; u32 exit_int_info_err; - u64 nested_ctl; + u64 misc_ctl; u64 avic_vapic_bar; u64 ghcb_gpa; u32 event_inj; @@ -239,9 +239,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define SVM_IOIO_SIZE_MASK (7 << SVM_IOIO_SIZE_SHIFT) #define SVM_IOIO_ASIZE_MASK (7 << SVM_IOIO_ASIZE_SHIFT) -#define SVM_NESTED_CTL_NP_ENABLE BIT(0) -#define SVM_NESTED_CTL_SEV_ENABLE BIT(1) -#define SVM_NESTED_CTL_SEV_ES_ENABLE BIT(2) +#define SVM_MISC_ENABLE_NP BIT(0) +#define SVM_MISC_ENABLE_SEV BIT(1) +#define SVM_MISC_ENABLE_SEV_ES BIT(2) #define SVM_TSC_RATIO_RSVD 0xffffff0000000000ULL diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 146faa7584a1..789f38c55541 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -386,7 +386,7 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu, if (CC(control->asid == 0)) return false; - if (CC((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && + if (CC((control->misc_ctl & SVM_MISC_ENABLE_NP) && !kvm_vcpu_is_legal_gpa(vcpu, control->nested_cr3))) return false; @@ -477,10 +477,10 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, nested_svm_sanitize_intercept(vcpu, to, SKINIT); nested_svm_sanitize_intercept(vcpu, to, RDPRU); - /* Always clear SVM_NESTED_CTL_NP_ENABLE if the guest cannot use NPTs */ - to->nested_ctl = from->nested_ctl; + /* Always clear SVM_MISC_ENABLE_NP if the guest cannot use NPTs */ + to->misc_ctl = from->misc_ctl; if (!guest_cpu_cap_has(vcpu, X86_FEATURE_NPT)) - to->nested_ctl &= ~SVM_NESTED_CTL_NP_ENABLE; + to->misc_ctl &= ~SVM_MISC_ENABLE_NP; to->iopm_base_pa = from->iopm_base_pa; to->msrpm_base_pa = from->msrpm_base_pa; @@ -823,7 +823,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) } /* Copied from vmcb01. msrpm_base can be overwritten later. 
*/ - vmcb02->control.nested_ctl = vmcb01->control.nested_ctl; + vmcb02->control.misc_ctl = vmcb01->control.misc_ctl; vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa; vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa; vmcb_mark_dirty(vmcb02, VMCB_PERM_MAP); @@ -982,7 +982,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, vmcb12->save.rip, vmcb12->control.int_ctl, vmcb12->control.event_inj, - vmcb12->control.nested_ctl, + vmcb12->control.misc_ctl, vmcb12->control.nested_cr3, vmcb12->save.cr3, KVM_ISA_SVM); @@ -1770,7 +1770,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst, dst->exit_info_2 = from->exit_info_2; dst->exit_int_info = from->exit_int_info; dst->exit_int_info_err = from->exit_int_info_err; - dst->nested_ctl = from->nested_ctl; + dst->misc_ctl = from->misc_ctl; dst->event_inj = from->event_inj; dst->event_inj_err = from->event_inj_err; dst->next_rip = from->next_rip; diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index fea4a65758ad..a0f1ce2b9a7b 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -4599,7 +4599,7 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm, bool init_event) struct kvm_sev_info *sev = to_kvm_sev_info(svm->vcpu.kvm); struct vmcb *vmcb = svm->vmcb01.ptr; - svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ES_ENABLE; + svm->vmcb->control.misc_ctl |= SVM_MISC_ENABLE_SEV_ES; /* * An SEV-ES guest requires a VMSA area that is a separate from the @@ -4670,7 +4670,7 @@ void sev_init_vmcb(struct vcpu_svm *svm, bool init_event) { struct kvm_vcpu *vcpu = &svm->vcpu; - svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ENABLE; + svm->vmcb->control.misc_ctl |= SVM_MISC_ENABLE_SEV; clr_exception_intercept(svm, UD_VECTOR); /* diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 62501c120112..c626cbacaf4a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1186,7 +1186,7 @@ static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event) if (npt_enabled) { /* Setup VMCB for Nested Paging */ - control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE; + control->misc_ctl |= SVM_MISC_ENABLE_NP; svm_clr_intercept(svm, INTERCEPT_INVLPG); clr_exception_intercept(svm, PF_VECTOR); svm_clr_intercept(svm, INTERCEPT_CR3_READ); @@ -3417,7 +3417,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu) pr_err("%-20s%016llx\n", "exit_info2:", control->exit_info_2); pr_err("%-20s%08x\n", "exit_int_info:", control->exit_int_info); pr_err("%-20s%08x\n", "exit_int_info_err:", control->exit_int_info_err); - pr_err("%-20s%lld\n", "nested_ctl:", control->nested_ctl); + pr_err("%-20s%lld\n", "misc_ctl:", control->misc_ctl); pr_err("%-20s%016llx\n", "nested_cr3:", control->nested_cr3); pr_err("%-20s%016llx\n", "avic_vapic_bar:", control->avic_vapic_bar); pr_err("%-20s%016llx\n", "ghcb:", control->ghcb_gpa); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 6c3b3fae91ec..ab7eebd3fcff 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -167,7 +167,7 @@ struct vmcb_ctrl_area_cached { u64 exit_info_2; u32 exit_int_info; u32 exit_int_info_err; - u64 nested_ctl; + u64 misc_ctl; u32 event_inj; u32 event_inj_err; u64 next_rip; @@ -593,7 +593,7 @@ static inline bool gif_set(struct vcpu_svm *svm) static inline bool nested_npt_enabled(struct vcpu_svm *svm) { - return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE; + return svm->nested.ctl.misc_ctl & SVM_MISC_ENABLE_NP; } static inline bool nested_vnmi_enabled(struct vcpu_svm *svm) diff --git 
a/tools/testing/selftests/kvm/include/x86/svm.h b/tools/testing/selftests/kvm/include/x86/svm.h index 10b30b38bb3f..d81d8a9f5bfb 100644 --- a/tools/testing/selftests/kvm/include/x86/svm.h +++ b/tools/testing/selftests/kvm/include/x86/svm.h @@ -97,7 +97,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area { u64 exit_info_2; u32 exit_int_info; u32 exit_int_info_err; - u64 nested_ctl; + u64 misc_ctl; u64 avic_vapic_bar; u8 reserved_4[8]; u32 event_inj; @@ -175,8 +175,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define SVM_VM_CR_SVM_LOCK_MASK 0x0008ULL #define SVM_VM_CR_SVM_DIS_MASK 0x0010ULL -#define SVM_NESTED_CTL_NP_ENABLE BIT(0) -#define SVM_NESTED_CTL_SEV_ENABLE BIT(1) +#define SVM_MISC_ENABLE_NP BIT(0) +#define SVM_MISC_ENABLE_SEV BIT(1) struct __attribute__ ((__packed__)) vmcb_seg { u16 selector; diff --git a/tools/testing/selftests/kvm/lib/x86/svm.c b/tools/testing/selftests/kvm/lib/x86/svm.c index 2e5c480c9afd..eb20b00112c7 100644 --- a/tools/testing/selftests/kvm/lib/x86/svm.c +++ b/tools/testing/selftests/kvm/lib/x86/svm.c @@ -126,7 +126,7 @@ void generic_svm_setup(struct svm_test_data *svm, void *guest_rip, void *guest_r guest_regs.rdi = (u64)svm; if (svm->ncr3_gpa) { - ctrl->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE; + ctrl->misc_ctl |= SVM_MISC_ENABLE_NP; ctrl->nested_cr3 = svm->ncr3_gpa; } } -- cgit v1.2.3 From 7e6eab9be2200f83ab03ab2b921ea7ca47a6c3b4 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:13 +0000 Subject: KVM: SVM: Rename vmcb->virt_ext to vmcb->misc_ctl2 'virt' is confusing in the VMCB because it is relative and ambiguous. The 'virt_ext' field includes bits for LBR virtualization and VMSAVE/VMLOAD virtualization, so it's just another miscellaneous control field. Name it as such. While at it, move the definitions of the bits below those for 'misc_ctl' and rename them for consistency. 
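Taken together with the previous patch, the two renames leave the VMCB with two consistently named miscellaneous control fields; for reference, the bit definitions as added by the hunks above and below (the trailing comments here are editorial summaries of the commit messages, not part of the patches):

	#define SVM_MISC_ENABLE_NP			BIT(0)		/* nested paging */
	#define SVM_MISC_ENABLE_SEV			BIT(1)
	#define SVM_MISC_ENABLE_SEV_ES			BIT(2)

	#define SVM_MISC2_ENABLE_V_LBR			BIT_ULL(0)	/* LBR virtualization */
	#define SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE	BIT_ULL(1)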
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-20-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/include/asm/svm.h | 7 +++---- arch/x86/kvm/svm/nested.c | 16 ++++++++-------- arch/x86/kvm/svm/svm.c | 18 +++++++++--------- arch/x86/kvm/svm/svm.h | 2 +- tools/testing/selftests/kvm/include/x86/svm.h | 8 ++++---- .../selftests/kvm/x86/nested_vmsave_vmload_test.c | 16 ++++++++-------- tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c | 4 ++-- 7 files changed, 35 insertions(+), 36 deletions(-) diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index 983db6575141..c169256c415f 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -148,7 +148,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area { u32 event_inj; u32 event_inj_err; u64 nested_cr3; - u64 virt_ext; + u64 misc_ctl2; u32 clean; u32 reserved_5; u64 next_rip; @@ -222,9 +222,6 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define X2APIC_MODE_SHIFT 30 #define X2APIC_MODE_MASK (1 << X2APIC_MODE_SHIFT) -#define LBR_CTL_ENABLE_MASK BIT_ULL(0) -#define VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK BIT_ULL(1) - #define SVM_INTERRUPT_SHADOW_MASK BIT_ULL(0) #define SVM_GUEST_INTERRUPT_MASK BIT_ULL(1) @@ -243,6 +240,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define SVM_MISC_ENABLE_SEV BIT(1) #define SVM_MISC_ENABLE_SEV_ES BIT(2) +#define SVM_MISC2_ENABLE_V_LBR BIT_ULL(0) +#define SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE BIT_ULL(1) #define SVM_TSC_RATIO_RSVD 0xffffff0000000000ULL #define SVM_TSC_RATIO_MIN 0x0000000000000001ULL diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 789f38c55541..d3e3721fa223 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -116,7 +116,7 @@ static bool nested_vmcb_needs_vls_intercept(struct vcpu_svm *svm) if (!nested_npt_enabled(svm)) return true; - if (!(svm->nested.ctl.virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK)) + if (!(svm->nested.ctl.misc_ctl2 & SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE)) return true; return false; @@ -169,7 +169,7 @@ void nested_vmcb02_recalc_intercepts(struct vcpu_svm *svm) vmcb_set_intercept(&vmcb02->control, INTERCEPT_VMLOAD); vmcb_set_intercept(&vmcb02->control, INTERCEPT_VMSAVE); } else { - WARN_ON_ONCE(!(vmcb02->control.virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK)); + WARN_ON_ONCE(!(vmcb02->control.misc_ctl2 & SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE)); } } @@ -499,7 +499,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->event_inj_err = from->event_inj_err; to->next_rip = from->next_rip; to->nested_cr3 = from->nested_cr3; - to->virt_ext = from->virt_ext; + to->misc_ctl2 = from->misc_ctl2; to->pause_filter_count = from->pause_filter_count; to->pause_filter_thresh = from->pause_filter_thresh; @@ -679,7 +679,7 @@ void nested_vmcb02_compute_g_pat(struct vcpu_svm *svm) static bool nested_vmcb12_has_lbrv(struct kvm_vcpu *vcpu) { return guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && - (to_svm(vcpu)->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK); + (to_svm(vcpu)->nested.ctl.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR); } static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12) @@ -907,10 +907,10 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) svm->soft_int_next_rip = vmcb12_ctrl->next_rip; } - /* LBR_CTL_ENABLE_MASK is controlled by svm_update_lbrv() */ + /* SVM_MISC2_ENABLE_V_LBR is controlled by svm_update_lbrv() */ if (!nested_vmcb_needs_vls_intercept(svm)) - vmcb02->control.virt_ext |= 
VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + vmcb02->control.misc_ctl2 |= SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; if (guest_cpu_cap_has(vcpu, X86_FEATURE_PAUSEFILTER)) pause_count12 = vmcb12_ctrl->pause_filter_count; @@ -1774,8 +1774,8 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst, dst->event_inj = from->event_inj; dst->event_inj_err = from->event_inj_err; dst->next_rip = from->next_rip; - dst->nested_cr3 = from->nested_cr3; - dst->virt_ext = from->virt_ext; + dst->nested_cr3 = from->nested_cr3; + dst->misc_ctl2 = from->misc_ctl2; dst->pause_filter_count = from->pause_filter_count; dst->pause_filter_thresh = from->pause_filter_thresh; /* 'clean' and 'hv_enlightenments' are not changed by KVM */ diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index c626cbacaf4a..7decb68f38f6 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -713,7 +713,7 @@ void *svm_alloc_permissions_map(unsigned long size, gfp_t gfp_mask) static void svm_recalc_lbr_msr_intercepts(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - bool intercept = !(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK); + bool intercept = !(svm->vmcb->control.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR); if (intercept == svm->lbr_msrs_intercepted) return; @@ -846,7 +846,7 @@ static void svm_recalc_msr_intercepts(struct kvm_vcpu *vcpu) static void __svm_enable_lbrv(struct kvm_vcpu *vcpu) { - to_svm(vcpu)->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK; + to_svm(vcpu)->vmcb->control.misc_ctl2 |= SVM_MISC2_ENABLE_V_LBR; } void svm_enable_lbrv(struct kvm_vcpu *vcpu) @@ -858,16 +858,16 @@ void svm_enable_lbrv(struct kvm_vcpu *vcpu) static void __svm_disable_lbrv(struct kvm_vcpu *vcpu) { KVM_BUG_ON(sev_es_guest(vcpu->kvm), vcpu->kvm); - to_svm(vcpu)->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK; + to_svm(vcpu)->vmcb->control.misc_ctl2 &= ~SVM_MISC2_ENABLE_V_LBR; } void svm_update_lbrv(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - bool current_enable_lbrv = svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK; + bool current_enable_lbrv = svm->vmcb->control.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR; bool enable_lbrv = (svm->vmcb->save.dbgctl & DEBUGCTLMSR_LBR) || (is_guest_mode(vcpu) && guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) && - (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK)); + (svm->nested.ctl.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR)); if (enable_lbrv && !current_enable_lbrv) __svm_enable_lbrv(vcpu); @@ -1222,7 +1222,7 @@ static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event) svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK; if (vls) - svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + svm->vmcb->control.misc_ctl2 |= SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; if (vcpu->kvm->arch.bus_lock_detection_enabled) svm_set_intercept(svm, INTERCEPT_BUSLOCK); @@ -3423,7 +3423,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu) pr_err("%-20s%016llx\n", "ghcb:", control->ghcb_gpa); pr_err("%-20s%08x\n", "event_inj:", control->event_inj); pr_err("%-20s%08x\n", "event_inj_err:", control->event_inj_err); - pr_err("%-20s%lld\n", "virt_ext:", control->virt_ext); + pr_err("%-20s%lld\n", "misc_ctl2:", control->misc_ctl2); pr_err("%-20s%016llx\n", "next_rip:", control->next_rip); pr_err("%-20s%016llx\n", "avic_backing_page:", control->avic_backing_page); pr_err("%-20s%016llx\n", "avic_logical_id:", control->avic_logical_id); @@ -4472,7 +4472,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) * VM-Exit), as running with the host's DEBUGCTL can negatively affect 
* guest state and can even be fatal, e.g. due to Bus Lock Detect. */ - if (!(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK) && + if (!(svm->vmcb->control.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR) && vcpu->arch.host_debugctl != svm->vmcb->save.dbgctl) update_debugctlmsr(svm->vmcb->save.dbgctl); @@ -4503,7 +4503,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI)) kvm_before_interrupt(vcpu, KVM_HANDLING_NMI); - if (!(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK) && + if (!(svm->vmcb->control.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR) && vcpu->arch.host_debugctl != svm->vmcb->save.dbgctl) update_debugctlmsr(vcpu->arch.host_debugctl); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index ab7eebd3fcff..760a8a6d45cd 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -172,7 +172,7 @@ struct vmcb_ctrl_area_cached { u32 event_inj_err; u64 next_rip; u64 nested_cr3; - u64 virt_ext; + u64 misc_ctl2; u32 clean; union { #if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(CONFIG_KVM_HYPERV) diff --git a/tools/testing/selftests/kvm/include/x86/svm.h b/tools/testing/selftests/kvm/include/x86/svm.h index d81d8a9f5bfb..c8539166270e 100644 --- a/tools/testing/selftests/kvm/include/x86/svm.h +++ b/tools/testing/selftests/kvm/include/x86/svm.h @@ -103,7 +103,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area { u32 event_inj; u32 event_inj_err; u64 nested_cr3; - u64 virt_ext; + u64 misc_ctl2; u32 clean; u32 reserved_5; u64 next_rip; @@ -155,9 +155,6 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define AVIC_ENABLE_SHIFT 31 #define AVIC_ENABLE_MASK (1 << AVIC_ENABLE_SHIFT) -#define LBR_CTL_ENABLE_MASK BIT_ULL(0) -#define VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK BIT_ULL(1) - #define SVM_INTERRUPT_SHADOW_MASK 1 #define SVM_IOIO_STR_SHIFT 2 @@ -178,6 +175,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define SVM_MISC_ENABLE_NP BIT(0) #define SVM_MISC_ENABLE_SEV BIT(1) +#define SVM_MISC2_ENABLE_V_LBR BIT_ULL(0) +#define SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE BIT_ULL(1) + struct __attribute__ ((__packed__)) vmcb_seg { u16 selector; u16 attrib; diff --git a/tools/testing/selftests/kvm/x86/nested_vmsave_vmload_test.c b/tools/testing/selftests/kvm/x86/nested_vmsave_vmload_test.c index 6764a48f9d4d..71717118d692 100644 --- a/tools/testing/selftests/kvm/x86/nested_vmsave_vmload_test.c +++ b/tools/testing/selftests/kvm/x86/nested_vmsave_vmload_test.c @@ -79,8 +79,8 @@ static void l1_guest_code(struct svm_test_data *svm) svm->vmcb->control.intercept |= (BIT_ULL(INTERCEPT_VMSAVE) | BIT_ULL(INTERCEPT_VMLOAD)); - /* ..VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK cleared.. */ - svm->vmcb->control.virt_ext &= ~VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + /* ..SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE cleared.. 
*/ + svm->vmcb->control.misc_ctl2 &= ~SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; svm->vmcb->save.rip = (u64)l2_guest_code_vmsave; run_guest(svm->vmcb, svm->vmcb_gpa); @@ -90,8 +90,8 @@ static void l1_guest_code(struct svm_test_data *svm) run_guest(svm->vmcb, svm->vmcb_gpa); GUEST_ASSERT_EQ(svm->vmcb->control.exit_code, SVM_EXIT_VMLOAD); - /* ..and VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK set */ - svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + /* ..and SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE set */ + svm->vmcb->control.misc_ctl2 |= SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; svm->vmcb->save.rip = (u64)l2_guest_code_vmsave; run_guest(svm->vmcb, svm->vmcb_gpa); @@ -106,20 +106,20 @@ static void l1_guest_code(struct svm_test_data *svm) BIT_ULL(INTERCEPT_VMLOAD)); /* - * Without VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK, the GPA will be + * Without SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE, the GPA will be * interpreted as an L1 GPA, so VMCB0 should be used. */ svm->vmcb->save.rip = (u64)l2_guest_code_vmcb0; - svm->vmcb->control.virt_ext &= ~VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + svm->vmcb->control.misc_ctl2 &= ~SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; run_guest(svm->vmcb, svm->vmcb_gpa); GUEST_ASSERT_EQ(svm->vmcb->control.exit_code, SVM_EXIT_VMMCALL); /* - * With VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK, the GPA will be interpeted as + * With SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE, the GPA will be interpeted as * an L2 GPA, and translated through the NPT to VMCB1. */ svm->vmcb->save.rip = (u64)l2_guest_code_vmcb1; - svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + svm->vmcb->control.misc_ctl2 |= SVM_MISC2_ENABLE_V_VMLOAD_VMSAVE; run_guest(svm->vmcb, svm->vmcb_gpa); GUEST_ASSERT_EQ(svm->vmcb->control.exit_code, SVM_EXIT_VMMCALL); diff --git a/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c index bf16abb1152e..ff99438824d3 100644 --- a/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c +++ b/tools/testing/selftests/kvm/x86/svm_lbr_nested_state.c @@ -69,9 +69,9 @@ static void l1_guest_code(struct svm_test_data *svm, bool nested_lbrv) &l2_guest_stack[L2_GUEST_STACK_SIZE]); if (nested_lbrv) - vmcb->control.virt_ext = LBR_CTL_ENABLE_MASK; + vmcb->control.misc_ctl2 = SVM_MISC2_ENABLE_V_LBR; else - vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK; + vmcb->control.misc_ctl2 &= ~SVM_MISC2_ENABLE_V_LBR; run_guest(vmcb, svm->vmcb_gpa); GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL); -- cgit v1.2.3 From 84dc9fd0354d3d0e02faf2f7b3f4d1228c2571ea Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:14 +0000 Subject: KVM: nSVM: Cache all used fields from VMCB12 Currently, most fields used from VMCB12 are cached in svm->nested.{ctl/save}. This is mainly to avoid TOC-TOU bugs. However, for the save area, only the fields used in the consistency checks (i.e. nested_vmcb_check_save()) were being cached. Other fields are read directly from guest memory in nested_vmcb02_prepare_save(). While probably benign, this still makes it possible for TOC-TOU bugs to happen. For example, RAX, RSP, and RIP are read twice, once to store in VMCB02, and once to store in vcpu->arch.regs. It is possible for the guest to modify the value between both reads, potentially causing nasty bugs. Harden against such bugs by caching everything in svm->nested.save. Cache all the needed fields, and keep all accesses to the VMCB12 strictly in nested_svm_vmrun() for caching and early error injection. 
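To make the double-read hazard concrete, here is a minimal before/after sketch using RIP (condensed from the hunks below; the race window is where the guest could rewrite vmcb12 in guest memory, e.g. from another vCPU):

	/* Before: two independent reads of guest-writable vmcb12 memory. */
	kvm_rip_write(vcpu, vmcb12->save.rip);	/* read #1 */
	/* ... the guest can modify vmcb12->save.rip here ... */
	vmcb02->save.rip = vmcb12->save.rip;	/* read #2 may see a different RIP */

	/* After: a single read into the cache at VMRUN, consumed everywhere. */
	to->rip = from->rip;			/* __nested_copy_vmcb_save_to_cache() */
	kvm_rip_write(vcpu, save->rip);
	vmcb02->save.rip = save->rip;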
Following changes will further limit the access to the VMCB12 in the nested VMRUN path. Introduce vmcb12_is_dirty() to use with the cached control fields instead of vmcb_is_dirty(), similar to vmcb12_is_intercept(). Opportunistically order the copies in __nested_copy_vmcb_save_to_cache() by the order in which the fields are defined in struct vmcb_save_area. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-21-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 116 ++++++++++++++++++++++++++-------------------- arch/x86/kvm/svm/svm.c | 2 +- arch/x86/kvm/svm/svm.h | 27 ++++++++++- 3 files changed, 93 insertions(+), 52 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index d3e3721fa223..0c3f2db6ac0b 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -507,11 +507,11 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->asid = from->asid; to->msrpm_base_pa &= ~0x0fffULL; to->iopm_base_pa &= ~0x0fffULL; + to->clean = from->clean; #ifdef CONFIG_KVM_HYPERV /* Hyper-V extensions (Enlightened VMCB) */ if (kvm_hv_hypercall_enabled(vcpu)) { - to->clean = from->clean; memcpy(&to->hv_enlightenments, &from->hv_enlightenments, sizeof(to->hv_enlightenments)); } @@ -527,19 +527,34 @@ void nested_copy_vmcb_control_to_cache(struct vcpu_svm *svm, static void __nested_copy_vmcb_save_to_cache(struct vmcb_save_area_cached *to, struct vmcb_save_area *from) { - /* - * Copy only fields that are validated, as we need them - * to avoid TOC/TOU races. - */ + to->es = from->es; to->cs = from->cs; + to->ss = from->ss; + to->ds = from->ds; + to->gdtr = from->gdtr; + to->idtr = from->idtr; + + to->cpl = from->cpl; to->efer = from->efer; - to->cr0 = from->cr0; - to->cr3 = from->cr3; to->cr4 = from->cr4; - - to->dr6 = from->dr6; + to->cr3 = from->cr3; + to->cr0 = from->cr0; to->dr7 = from->dr7; + to->dr6 = from->dr6; + + to->rflags = from->rflags; + to->rip = from->rip; + to->rsp = from->rsp; + + to->s_cet = from->s_cet; + to->ssp = from->ssp; + to->isst_addr = from->isst_addr; + + to->rax = from->rax; + to->cr2 = from->cr2; + + svm_copy_lbrs(to, from); } void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm, @@ -682,8 +697,10 @@ static bool nested_vmcb12_has_lbrv(struct kvm_vcpu *vcpu) (to_svm(vcpu)->nested.ctl.misc_ctl2 & SVM_MISC2_ENABLE_V_LBR); } -static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12) +static void nested_vmcb02_prepare_save(struct vcpu_svm *svm) { + struct vmcb_ctrl_area_cached *control = &svm->nested.ctl; + struct vmcb_save_area_cached *save = &svm->nested.save; bool new_vmcb12 = false; struct vmcb *vmcb01 = svm->vmcb01.ptr; struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; @@ -699,48 +716,48 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 svm->nested.force_msr_bitmap_recalc = true; } - if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_SEG))) { - vmcb02->save.es = vmcb12->save.es; - vmcb02->save.cs = vmcb12->save.cs; - vmcb02->save.ss = vmcb12->save.ss; - vmcb02->save.ds = vmcb12->save.ds; - vmcb02->save.cpl = vmcb12->save.cpl; + if (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_SEG))) { + vmcb02->save.es = save->es; + vmcb02->save.cs = save->cs; + vmcb02->save.ss = save->ss; + vmcb02->save.ds = save->ds; + vmcb02->save.cpl = save->cpl; vmcb_mark_dirty(vmcb02, VMCB_SEG); } - if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_DT))) { - vmcb02->save.gdtr = vmcb12->save.gdtr; - vmcb02->save.idtr = 
vmcb12->save.idtr; + if (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_DT))) { + vmcb02->save.gdtr = save->gdtr; + vmcb02->save.idtr = save->idtr; vmcb_mark_dirty(vmcb02, VMCB_DT); } if (guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK) && - (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_CET)))) { - vmcb02->save.s_cet = vmcb12->save.s_cet; - vmcb02->save.isst_addr = vmcb12->save.isst_addr; - vmcb02->save.ssp = vmcb12->save.ssp; + (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_CET)))) { + vmcb02->save.s_cet = save->s_cet; + vmcb02->save.isst_addr = save->isst_addr; + vmcb02->save.ssp = save->ssp; vmcb_mark_dirty(vmcb02, VMCB_CET); } - kvm_set_rflags(vcpu, vmcb12->save.rflags | X86_EFLAGS_FIXED); + kvm_set_rflags(vcpu, save->rflags | X86_EFLAGS_FIXED); svm_set_efer(vcpu, svm->nested.save.efer); svm_set_cr0(vcpu, svm->nested.save.cr0); svm_set_cr4(vcpu, svm->nested.save.cr4); - svm->vcpu.arch.cr2 = vmcb12->save.cr2; + svm->vcpu.arch.cr2 = save->cr2; - kvm_rax_write(vcpu, vmcb12->save.rax); - kvm_rsp_write(vcpu, vmcb12->save.rsp); - kvm_rip_write(vcpu, vmcb12->save.rip); + kvm_rax_write(vcpu, save->rax); + kvm_rsp_write(vcpu, save->rsp); + kvm_rip_write(vcpu, save->rip); /* In case we don't even reach vcpu_run, the fields are not updated */ - vmcb02->save.rax = vmcb12->save.rax; - vmcb02->save.rsp = vmcb12->save.rsp; - vmcb02->save.rip = vmcb12->save.rip; + vmcb02->save.rax = save->rax; + vmcb02->save.rsp = save->rsp; + vmcb02->save.rip = save->rip; - if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_DR))) { + if (unlikely(new_vmcb12 || vmcb12_is_dirty(control, VMCB_DR))) { vmcb02->save.dr7 = svm->nested.save.dr7 | DR7_FIXED_1; svm->vcpu.arch.dr6 = svm->nested.save.dr6 | DR6_ACTIVE_LOW; vmcb_mark_dirty(vmcb02, VMCB_DR); @@ -751,7 +768,7 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 * Reserved bits of DEBUGCTL are ignored. Be consistent with * svm_set_msr's definition of reserved bits. 
*/ - svm_copy_lbrs(&vmcb02->save, &vmcb12->save); + svm_copy_lbrs(&vmcb02->save, save); vmcb02->save.dbgctl &= ~DEBUGCTL_RESERVED_BITS; } else { svm_copy_lbrs(&vmcb02->save, &vmcb01->save); @@ -971,28 +988,29 @@ static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl; } -int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, - struct vmcb *vmcb12, bool from_vmrun) +int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun) { struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb_ctrl_area_cached *control = &svm->nested.ctl; + struct vmcb_save_area_cached *save = &svm->nested.save; int ret; trace_kvm_nested_vmenter(svm->vmcb->save.rip, vmcb12_gpa, - vmcb12->save.rip, - vmcb12->control.int_ctl, - vmcb12->control.event_inj, - vmcb12->control.misc_ctl, - vmcb12->control.nested_cr3, - vmcb12->save.cr3, + save->rip, + control->int_ctl, + control->event_inj, + control->misc_ctl, + control->nested_cr3, + save->cr3, KVM_ISA_SVM); - trace_kvm_nested_intercepts(vmcb12->control.intercepts[INTERCEPT_CR] & 0xffff, - vmcb12->control.intercepts[INTERCEPT_CR] >> 16, - vmcb12->control.intercepts[INTERCEPT_EXCEPTION], - vmcb12->control.intercepts[INTERCEPT_WORD3], - vmcb12->control.intercepts[INTERCEPT_WORD4], - vmcb12->control.intercepts[INTERCEPT_WORD5]); + trace_kvm_nested_intercepts(control->intercepts[INTERCEPT_CR] & 0xffff, + control->intercepts[INTERCEPT_CR] >> 16, + control->intercepts[INTERCEPT_EXCEPTION], + control->intercepts[INTERCEPT_WORD3], + control->intercepts[INTERCEPT_WORD4], + control->intercepts[INTERCEPT_WORD5]); svm->nested.vmcb12_gpa = vmcb12_gpa; @@ -1003,7 +1021,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, svm_switch_vmcb(svm, &svm->nested.vmcb02); nested_vmcb02_prepare_control(svm); - nested_vmcb02_prepare_save(svm, vmcb12); + nested_vmcb02_prepare_save(svm); ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3, nested_npt_enabled(svm), from_vmrun); @@ -1091,7 +1109,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) svm->nested.nested_run_pending = 1; - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12, true)) + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) goto out_exit_err; if (nested_svm_merge_msrpm(vcpu)) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7decb68f38f6..2c511f86b79d 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -5004,7 +5004,7 @@ static int svm_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) vmcb12 = map.hva; nested_copy_vmcb_control_to_cache(svm, &vmcb12->control); nested_copy_vmcb_save_to_cache(svm, &vmcb12->save); - ret = enter_svm_guest_mode(vcpu, smram64->svm_guest_vmcb_gpa, vmcb12, false); + ret = enter_svm_guest_mode(vcpu, smram64->svm_guest_vmcb_gpa, false); if (ret) goto unmap_save; diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 760a8a6d45cd..995c8de3f660 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -140,13 +140,32 @@ struct kvm_vmcb_info { }; struct vmcb_save_area_cached { + struct vmcb_seg es; struct vmcb_seg cs; + struct vmcb_seg ss; + struct vmcb_seg ds; + struct vmcb_seg gdtr; + struct vmcb_seg idtr; + u8 cpl; u64 efer; u64 cr4; u64 cr3; u64 cr0; u64 dr7; u64 dr6; + u64 rflags; + u64 rip; + u64 rsp; + u64 s_cet; + u64 ssp; + u64 isst_addr; + u64 rax; + u64 cr2; + u64 dbgctl; + u64 br_from; + u64 br_to; + u64 last_excp_from; + u64 last_excp_to; }; struct vmcb_ctrl_area_cached { @@ -419,6 +438,11 @@ static inline bool 
vmcb_is_dirty(struct vmcb *vmcb, int bit) return !test_bit(bit, (unsigned long *)&vmcb->control.clean); } +static inline bool vmcb12_is_dirty(struct vmcb_ctrl_area_cached *control, int bit) +{ + return !test_bit(bit, (unsigned long *)&control->clean); +} + static __always_inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu) { return container_of(vcpu, struct vcpu_svm, vcpu); @@ -799,8 +823,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm) int __init nested_svm_init_msrpm_merge_offsets(void); -int enter_svm_guest_mode(struct kvm_vcpu *vcpu, - u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun); +int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb_gpa, bool from_vmrun); void svm_leave_nested(struct kvm_vcpu *vcpu); void svm_free_nested(struct vcpu_svm *svm); int svm_allocate_nested(struct vcpu_svm *svm); -- cgit v1.2.3 From b709087e9e544259d1d075ced91cc4ab769a8ae2 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:15 +0000 Subject: KVM: nSVM: Restrict mapping vmcb12 on nested VMRUN All accesses to the vmcb12 in the guest memory on nested VMRUN are limited to nested_svm_vmrun() copying vmcb12 fields and writing them on failed consistency checks. However, vmcb12 remains mapped throughout nested_svm_vmrun(). Mapping and unmapping around usages is possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is used after being unmapped. Move reading the vmcb12, copying to cache, and consistency checks from nested_svm_vmrun() into a new helper, nested_svm_copy_vmcb12_to_cache() to limit the scope of the mapping. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-22-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 89 +++++++++++++++++++++++++++-------------------- 1 file changed, 51 insertions(+), 38 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 0c3f2db6ac0b..c61b4923963e 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1041,12 +1041,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun) return 0; } -int nested_svm_vmrun(struct kvm_vcpu *vcpu) +static int nested_svm_copy_vmcb12_to_cache(struct kvm_vcpu *vcpu, u64 vmcb12_gpa) { struct vcpu_svm *svm = to_svm(vcpu); - int ret; - struct vmcb *vmcb12; struct kvm_host_map map; + struct vmcb *vmcb12; + int r = 0; + + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) + return -EFAULT; + + vmcb12 = map.hva; + nested_copy_vmcb_control_to_cache(svm, &vmcb12->control); + nested_copy_vmcb_save_to_cache(svm, &vmcb12->save); + + if (!nested_vmcb_check_save(vcpu, &svm->nested.save) || + !nested_vmcb_check_controls(vcpu, &svm->nested.ctl)) { + vmcb12->control.exit_code = SVM_EXIT_ERR; + vmcb12->control.exit_info_1 = 0; + vmcb12->control.exit_info_2 = 0; + vmcb12->control.event_inj = 0; + vmcb12->control.event_inj_err = 0; + svm_set_gif(svm, false); + r = -EINVAL; + } + + kvm_vcpu_unmap(vcpu, &map); + return r; +} + +int nested_svm_vmrun(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + int ret, err; u64 vmcb12_gpa; struct vmcb *vmcb01 = svm->vmcb01.ptr; @@ -1067,32 +1094,23 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) return ret; } + if (WARN_ON_ONCE(!svm->nested.initialized)) + return -EINVAL; + vmcb12_gpa = svm->vmcb->save.rax; - if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) { + err = nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa); + if (err == -EFAULT) { kvm_inject_gp(vcpu, 0); return 1; } + /* + * Advance RIP if #GP or 
#UD are not injected, but otherwise stop if + * copying and checking vmcb12 failed. + */ ret = kvm_skip_emulated_instruction(vcpu); - - vmcb12 = map.hva; - - if (WARN_ON_ONCE(!svm->nested.initialized)) - return -EINVAL; - - nested_copy_vmcb_control_to_cache(svm, &vmcb12->control); - nested_copy_vmcb_save_to_cache(svm, &vmcb12->save); - - if (!nested_vmcb_check_save(vcpu, &svm->nested.save) || - !nested_vmcb_check_controls(vcpu, &svm->nested.ctl)) { - vmcb12->control.exit_code = SVM_EXIT_ERR; - vmcb12->control.exit_info_1 = 0; - vmcb12->control.exit_info_2 = 0; - vmcb12->control.event_inj = 0; - vmcb12->control.event_inj_err = 0; - svm_set_gif(svm, false); - goto out; - } + if (err) + return ret; /* * Since vmcb01 is not in use, we can use it to store some of the L1 @@ -1109,23 +1127,18 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) svm->nested.nested_run_pending = 1; - if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true)) - goto out_exit_err; - - if (nested_svm_merge_msrpm(vcpu)) - goto out; - -out_exit_err: - svm->nested.nested_run_pending = 0; - - svm->vmcb->control.exit_code = SVM_EXIT_ERR; - svm->vmcb->control.exit_info_1 = 0; - svm->vmcb->control.exit_info_2 = 0; + if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) || + !nested_svm_merge_msrpm(vcpu)) { + svm->nested.nested_run_pending = 0; + svm->nmi_l1_to_l2 = false; + svm->soft_int_injected = false; - nested_svm_vmexit(svm); + svm->vmcb->control.exit_code = SVM_EXIT_ERR; + svm->vmcb->control.exit_info_1 = 0; + svm->vmcb->control.exit_info_2 = 0; -out: - kvm_vcpu_unmap(vcpu, &map); + nested_svm_vmexit(svm); + } return ret; } -- cgit v1.2.3 From a2b858051cf03d4f0abca014cddd424675be5316 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:16 +0000 Subject: KVM: nSVM: Use PAGE_MASK to drop lower bits of bitmap GPAs from vmcb12 Use PAGE_MASK to drop the lower bits from IOPM_BASE_PA and MSRPM_BASE_PA while copying them instead of dropping the bits afterward with a hardcoded mask. No functional change intended. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-23-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index c61b4923963e..fd7045904948 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -482,8 +482,8 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, if (!guest_cpu_cap_has(vcpu, X86_FEATURE_NPT)) to->misc_ctl &= ~SVM_MISC_ENABLE_NP; - to->iopm_base_pa = from->iopm_base_pa; - to->msrpm_base_pa = from->msrpm_base_pa; + to->iopm_base_pa = from->iopm_base_pa & PAGE_MASK; + to->msrpm_base_pa = from->msrpm_base_pa & PAGE_MASK; to->tsc_offset = from->tsc_offset; to->tlb_ctl = from->tlb_ctl; to->erap_ctl = from->erap_ctl; @@ -505,8 +505,6 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, /* Copy asid here because nested_vmcb_check_controls() will check it */ to->asid = from->asid; - to->msrpm_base_pa &= ~0x0fffULL; - to->iopm_base_pa &= ~0x0fffULL; to->clean = from->clean; #ifdef CONFIG_KVM_HYPERV -- cgit v1.2.3 From 30a1d2fa819039e06bc6242669f6fd45df039a41 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:17 +0000 Subject: KVM: nSVM: Sanitize TLB_CONTROL field when copying from vmcb12 The APM defines possible values for TLB_CONTROL as 0, 1, 3, and 7 -- all of which are always allowed for KVM guests as KVM always supports X86_FEATURE_FLUSHBYASID. 
Only copy bits 0 to 2 from vmcb12's TLB_CONTROL, such that no unhandled or reserved bits end up in vmcb02. Note that TLB_CONTROL in vmcb12 is currently ignored by KVM, as it nukes the TLB on nested transitions anyway (see nested_svm_transition_tlb_flush()). However, such sanitization will be needed once the TODOs there are addressed, and it's minimal churn to add it now. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-24-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/include/asm/svm.h | 2 ++ arch/x86/kvm/svm/nested.c | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index c169256c415f..16cf4f435aeb 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -182,6 +182,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define TLB_CONTROL_FLUSH_ASID 3 #define TLB_CONTROL_FLUSH_ASID_LOCAL 7 +#define TLB_CONTROL_MASK GENMASK(2, 0) + #define ERAP_CONTROL_ALLOW_LARGER_RAP BIT(0) #define ERAP_CONTROL_CLEAR_RAP BIT(1) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index fd7045904948..c4680270e54f 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -485,7 +485,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->iopm_base_pa = from->iopm_base_pa & PAGE_MASK; to->msrpm_base_pa = from->msrpm_base_pa & PAGE_MASK; to->tsc_offset = from->tsc_offset; - to->tlb_ctl = from->tlb_ctl; + to->tlb_ctl = from->tlb_ctl & TLB_CONTROL_MASK; to->erap_ctl = from->erap_ctl; to->int_ctl = from->int_ctl; -- cgit v1.2.3 From c8123e82725648b1b13103ce3d8066ce13ab81b7 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:18 +0000 Subject: KVM: nSVM: Sanitize INT/EVENTINJ fields when copying from vmcb12 Make sure all fields used from vmcb12 in creating the vmcb02 are sanitized, such that no unhandled or reserved bits end up in the vmcb02. The following control fields are read from vmcb12 and have bits that are either reserved or not handled/advertised by KVM: tlb_ctl, int_ctl, int_state, int_vector, event_inj, misc_ctl, and misc_ctl2. The following fields do not require any extra sanitizing: - tlb_ctl: already being sanitized. - int_ctl: bits from vmcb12 are copied bit-by-bit as needed. - misc_ctl: only used in consistency checks (particularly NP_ENABLE). - misc_ctl2: bits from vmcb12 are copied bit-by-bit as needed. For the remaining fields (int_vector, int_state, and event_inj), make sure only defined bits are copied from L1's vmcb12 into KVM's cache by defining appropriate masks where needed. 
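As a quick sanity check of the new event_inj mask: with the existing SVM_EVTINJ_* layout (vector in bits 7:0, type in bits 10:8, valid_err at bit 11, valid at bit 31), the definition added below works out to

	SVM_EVTINJ_RESERVED_BITS == ~(GENMASK(7, 0) | GENMASK(10, 8) | BIT(11) | BIT(31))
	                         == GENMASK(30, 12)

so "from->event_inj & ~SVM_EVTINJ_RESERVED_BITS" strips exactly the architecturally reserved bits 30:12 while preserving every defined field.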
Suggested-by: Jim Mattson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-25-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/include/asm/svm.h | 5 +++++ arch/x86/kvm/svm/nested.c | 8 ++++---- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index 16cf4f435aeb..bcfeb5e7c0ed 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -224,6 +224,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area { #define X2APIC_MODE_SHIFT 30 #define X2APIC_MODE_MASK (1 << X2APIC_MODE_SHIFT) +#define SVM_INT_VECTOR_MASK GENMASK(7, 0) + #define SVM_INTERRUPT_SHADOW_MASK BIT_ULL(0) #define SVM_GUEST_INTERRUPT_MASK BIT_ULL(1) @@ -637,6 +639,9 @@ static inline void __unused_size_checks(void) #define SVM_EVTINJ_VALID (1 << 31) #define SVM_EVTINJ_VALID_ERR (1 << 11) +#define SVM_EVTINJ_RESERVED_BITS ~(SVM_EVTINJ_VEC_MASK | SVM_EVTINJ_TYPE_MASK | \ + SVM_EVTINJ_VALID_ERR | SVM_EVTINJ_VALID) + #define SVM_EXITINTINFO_VEC_MASK SVM_EVTINJ_VEC_MASK #define SVM_EXITINTINFO_TYPE_MASK SVM_EVTINJ_TYPE_MASK diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index c4680270e54f..bf1e1ca22d9c 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -488,18 +488,18 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu, to->tlb_ctl = from->tlb_ctl & TLB_CONTROL_MASK; to->erap_ctl = from->erap_ctl; to->int_ctl = from->int_ctl; - to->int_vector = from->int_vector; - to->int_state = from->int_state; + to->int_vector = from->int_vector & SVM_INT_VECTOR_MASK; + to->int_state = from->int_state & SVM_INTERRUPT_SHADOW_MASK; to->exit_code = from->exit_code; to->exit_info_1 = from->exit_info_1; to->exit_info_2 = from->exit_info_2; to->exit_int_info = from->exit_int_info; to->exit_int_info_err = from->exit_int_info_err; - to->event_inj = from->event_inj; + to->event_inj = from->event_inj & ~SVM_EVTINJ_RESERVED_BITS; to->event_inj_err = from->event_inj_err; to->next_rip = from->next_rip; to->nested_cr3 = from->nested_cr3; - to->misc_ctl2 = from->misc_ctl2; + to->misc_ctl2 = from->misc_ctl2; to->pause_filter_count = from->pause_filter_count; to->pause_filter_thresh = from->pause_filter_thresh; -- cgit v1.2.3 From b6dc21d896a02b5fd305f505a4ec4dad50ecd8fb Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:19 +0000 Subject: KVM: nSVM: Only copy SVM_MISC_ENABLE_NP from VMCB01's misc_ctl The 'misc_ctl' field in VMCB02 is taken as-is from VMCB01. However, the only bit that needs to be copied is SVM_MISC_ENABLE_NP, as all other known bits in misc_ctl are related to SEV guests, and KVM doesn't support nested virtualization for SEV guests. Only copy SVM_MISC_ENABLE_NP to harden against future bugs if/when other bits are set for L1 but should not be set for L2. Opportunistically add a comment explaining why SVM_MISC_ENABLE_NP is taken from VMCB01 and not VMCB02. Suggested-by: Jim Mattson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-26-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index bf1e1ca22d9c..b191c6cab57d 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -837,8 +837,16 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) V_NMI_BLOCKING_MASK); } - /* Copied from vmcb01. 
msrpm_base can be overwritten later. */ - vmcb02->control.misc_ctl = vmcb01->control.misc_ctl; + /* + * Copied from vmcb01. msrpm_base can be overwritten later. + * + * SVM_MISC_ENABLE_NP in vmcb12 is only used for consistency checks. If + * L1 enables NPTs, KVM shadows L1's NPTs and uses those to run L2. If + * L1 disables NPT, KVM runs L2 with the same NPTs used to run L1. For + * the latter, L1 runs L2 with shadow page tables that translate L2 GVAs + * to L1 GPAs, so the same NPTs can be used for L1 and L2. + */ + vmcb02->control.misc_ctl = vmcb01->control.misc_ctl & SVM_MISC_ENABLE_NP; vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa; vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa; vmcb_mark_dirty(vmcb02, VMCB_PERM_MAP); -- cgit v1.2.3 From 5e4c6da0bb925bc91a6020511e85bd9574f8474a Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Tue, 3 Mar 2026 00:34:20 +0000 Subject: KVM: selftest: Add a selftest for VMRUN/#VMEXIT with unmappable vmcb12 Add a test that verifies that KVM correctly injects a #GP for nested VMRUN and a shutdown for nested #VMEXIT, if the GPA of vmcb12 cannot be mapped. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260303003421.2185681-27-yosry@kernel.org Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile.kvm | 1 + .../kvm/x86/svm_nested_invalid_vmcb12_gpa.c | 98 ++++++++++++++++++++++ 2 files changed, 99 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index 36b48e766e49..f12e7c17d379 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -110,6 +110,7 @@ TEST_GEN_PROGS_x86 += x86/state_test TEST_GEN_PROGS_x86 += x86/vmx_preemption_timer_test TEST_GEN_PROGS_x86 += x86/svm_vmcall_test TEST_GEN_PROGS_x86 += x86/svm_int_ctl_test +TEST_GEN_PROGS_x86 += x86/svm_nested_invalid_vmcb12_gpa TEST_GEN_PROGS_x86 += x86/svm_nested_shutdown_test TEST_GEN_PROGS_x86 += x86/svm_nested_soft_inject_test TEST_GEN_PROGS_x86 += x86/svm_lbr_nested_state diff --git a/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c b/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c new file mode 100644 index 000000000000..c6d5f712120d --- /dev/null +++ b/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2026, Google LLC. 
+ */ +#include "kvm_util.h" +#include "vmx.h" +#include "svm_util.h" +#include "kselftest.h" + + +#define L2_GUEST_STACK_SIZE 64 + +#define SYNC_GP 101 +#define SYNC_L2_STARTED 102 + +u64 valid_vmcb12_gpa; +int gp_triggered; + +static void guest_gp_handler(struct ex_regs *regs) +{ + GUEST_ASSERT(!gp_triggered); + GUEST_SYNC(SYNC_GP); + gp_triggered = 1; + regs->rax = valid_vmcb12_gpa; +} + +static void l2_guest_code(void) +{ + GUEST_SYNC(SYNC_L2_STARTED); + vmcall(); +} + +static void l1_guest_code(struct svm_test_data *svm, u64 invalid_vmcb12_gpa) +{ + unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + + generic_svm_setup(svm, l2_guest_code, + &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + valid_vmcb12_gpa = svm->vmcb_gpa; + + run_guest(svm->vmcb, invalid_vmcb12_gpa); /* #GP */ + + /* GP handler should jump here */ + GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL); + GUEST_DONE(); +} + +int main(int argc, char *argv[]) +{ + struct kvm_x86_state *state; + vm_vaddr_t nested_gva = 0; + struct kvm_vcpu *vcpu; + uint32_t maxphyaddr; + u64 max_legal_gpa; + struct kvm_vm *vm; + struct ucall uc; + + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); + + vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code); + vm_install_exception_handler(vcpu->vm, GP_VECTOR, guest_gp_handler); + + /* + * Find the max legal GPA that is not backed by a memslot (i.e. cannot + * be mapped by KVM). + */ + maxphyaddr = kvm_cpuid_property(vcpu->cpuid, X86_PROPERTY_MAX_PHY_ADDR); + max_legal_gpa = BIT_ULL(maxphyaddr) - PAGE_SIZE; + vcpu_alloc_svm(vm, &nested_gva); + vcpu_args_set(vcpu, 2, nested_gva, max_legal_gpa); + + /* VMRUN with max_legal_gpa, KVM injects a #GP */ + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); + TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); + TEST_ASSERT_EQ(uc.args[1], SYNC_GP); + + /* + * Enter L2 (with a legit vmcb12 GPA), then overwrite vmcb12 GPA with + * max_legal_gpa. KVM will fail to map vmcb12 on nested VM-Exit and + * cause a shutdown. + */ + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); + TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); + TEST_ASSERT_EQ(uc.args[1], SYNC_L2_STARTED); + + state = vcpu_save_state(vcpu); + state->nested.hdr.svm.vmcb_pa = max_legal_gpa; + vcpu_load_state(vcpu, state); + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SHUTDOWN); + + kvm_x86_state_cleanup(state); + kvm_vm_free(vm); + return 0; +} -- cgit v1.2.3 From 66b207f175f1cd52b083c4d90d03cc1c15b8ae6a Mon Sep 17 00:00:00 2001 From: Jim Mattson Date: Mon, 23 Feb 2026 16:54:39 -0800 Subject: KVM: x86: SVM: Remove vmcb_is_dirty() After commit dd26d1b5d6ed ("KVM: nSVM: Cache all used fields from VMCB12"), vmcb_is_dirty() has no callers. Remove the function. 
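For context, vmcb_is_dirty() belonged to the VMCB clean-bits protocol: a clear bit in control.clean tells the CPU that the corresponding VMCB area may have changed and must be reloaded on the next VMRUN. The surviving write-side idiom, sketched here using the helpers visible in the diffs of this series (the LBR field is just an example):

	/* Any KVM write to a cached VMCB field must clear the matching clean bit: */
	svm->vmcb->save.br_from = data;
	vmcb_mark_dirty(svm->vmcb, VMCB_LBR);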
Signed-off-by: Jim Mattson Link: https://patch.msgid.link/20260224005500.1471972-2-jmattson@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.h | 5 ----- 1 file changed, 5 deletions(-) diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 995c8de3f660..c53068848628 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -433,11 +433,6 @@ static inline void vmcb_mark_dirty(struct vmcb *vmcb, int bit) vmcb->control.clean &= ~(1 << bit); } -static inline bool vmcb_is_dirty(struct vmcb *vmcb, int bit) -{ - return !test_bit(bit, (unsigned long *)&vmcb->control.clean); -} - static inline bool vmcb12_is_dirty(struct vmcb_ctrl_area_cached *control, int bit) { return !test_bit(bit, (unsigned long *)&control->clean); -- cgit v1.2.3 From cdc69269b18a19cb76eaf7bf4fa47fe270dcaf11 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 9 Feb 2026 19:51:41 +0000 Subject: KVM: SVM: Triple fault L1 on unintercepted EFER.SVME clear by L2 KVM tracks when EFER.SVME is set and cleared to initialize and tear down nested state. However, it doesn't differentiate whether EFER.SVME is getting toggled in L1 or L2+. If L2 clears EFER.SVME, and L1 does not intercept the EFER write, KVM exits guest mode and tears down nested state while L2 is running, executing L1 without injecting a proper #VMEXIT. According to the APM: The effect of turning off EFER.SVME while a guest is running is undefined; therefore, the VMM should always prevent guests from writing EFER. Since the behavior is architecturally undefined, KVM gets to choose what to do. Inject a triple fault into L1 as a more graceful option than running L1 with corrupted state. Co-developed-by: Sean Christopherson Signed-off-by: Yosry Ahmed base-commit: 95deaec3557dced322e2540bfa426e60e5373d46 Link: https://patch.msgid.link/20260209195142.2554532-2-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 2c511f86b79d..4bf0f5d7167f 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -217,6 +217,19 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer) if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) { if (!(efer & EFER_SVME)) { + /* + * Architecturally, clearing EFER.SVME while a guest is + * running yields undefined behavior, i.e. KVM can do + * literally anything. Force the vCPU back into L1 as + * that is the safest option for KVM, but synthesize a + * triple fault (for L1!) so that KVM at least doesn't + * run random L2 code in the context of L1. Do so if + * and only if the vCPU is actively running, e.g. to + * avoid false positives if userspace is stuffing state. + */ + if (is_guest_mode(vcpu) && vcpu->wants_to_run) + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + svm_leave_nested(vcpu); /* #GP intercept is still needed for vmware backdoor */ if (!enable_vmware_backdoor) -- cgit v1.2.3 From 3900e56eb184abcc8a16ab52af24ea255589acc2 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 9 Feb 2026 19:51:42 +0000 Subject: KVM: selftests: Add a test for L2 clearing EFER.SVME without intercept Add a test that verifies KVM's newly introduced behavior of synthesizing a triple fault in L1 if L2 clears EFER.SVME without an L1 interception (which is architecturally undefined).
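As a concrete illustration of the APM's advice quoted above, an L1 hypervisor built on the selftest library used below could prevent the unintercepted clear by enabling MSR-bitmap interception and setting the write-intercept bit for EFER. A hedged sketch, assuming the MSRPM layout from APM vol. 2 (MSRs 0xC0000000-0xC0001FFF occupy the map's second 2KiB, two bits per MSR: read, then write); l1_intercept_efer_writes() is an illustrative helper, not part of the test that follows:

	static void l1_intercept_efer_writes(struct svm_test_data *svm)
	{
		u8 *msrpm = svm->msr;

		/* Enable MSR permission-map interception for L2. */
		svm->vmcb->control.intercept |= 1ULL << INTERCEPT_MSR_PROT;

		/* EFER = 0xc0000080: byte 0x800 + (0x80 * 2 bits) / 8 = 0x820, write bit = bit 1. */
		msrpm[0x820] |= 0x02;
	}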
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260209195142.2554532-3-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile.kvm | 1 + .../selftests/kvm/x86/svm_nested_clear_efer_svme.c | 55 ++++++++++++++++++++++ 2 files changed, 56 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86/svm_nested_clear_efer_svme.c diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index f12e7c17d379..ba87cd31872b 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -110,6 +110,7 @@ TEST_GEN_PROGS_x86 += x86/state_test TEST_GEN_PROGS_x86 += x86/vmx_preemption_timer_test TEST_GEN_PROGS_x86 += x86/svm_vmcall_test TEST_GEN_PROGS_x86 += x86/svm_int_ctl_test +TEST_GEN_PROGS_x86 += x86/svm_nested_clear_efer_svme TEST_GEN_PROGS_x86 += x86/svm_nested_invalid_vmcb12_gpa TEST_GEN_PROGS_x86 += x86/svm_nested_shutdown_test TEST_GEN_PROGS_x86 += x86/svm_nested_soft_inject_test diff --git a/tools/testing/selftests/kvm/x86/svm_nested_clear_efer_svme.c b/tools/testing/selftests/kvm/x86/svm_nested_clear_efer_svme.c new file mode 100644 index 000000000000..a521a9eed061 --- /dev/null +++ b/tools/testing/selftests/kvm/x86/svm_nested_clear_efer_svme.c @@ -0,0 +1,55 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2026, Google LLC. + */ +#include "kvm_util.h" +#include "vmx.h" +#include "svm_util.h" +#include "kselftest.h" + + +#define L2_GUEST_STACK_SIZE 64 + +static void l2_guest_code(void) +{ + unsigned long efer = rdmsr(MSR_EFER); + + /* generic_svm_setup() initializes EFER_SVME set for L2 */ + GUEST_ASSERT(efer & EFER_SVME); + wrmsr(MSR_EFER, efer & ~EFER_SVME); + + /* Unreachable, L1 should be shutdown */ + GUEST_ASSERT(0); +} + +static void l1_guest_code(struct svm_test_data *svm) +{ + unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + + generic_svm_setup(svm, l2_guest_code, + &l2_guest_stack[L2_GUEST_STACK_SIZE]); + run_guest(svm->vmcb, svm->vmcb_gpa); + + /* Unreachable, L1 should be shutdown */ + GUEST_ASSERT(0); +} + +int main(int argc, char *argv[]) +{ + struct kvm_vcpu *vcpu; + struct kvm_vm *vm; + vm_vaddr_t nested_gva = 0; + + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); + + vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code); + + vcpu_alloc_svm(vm, &nested_gva); + vcpu_args_set(vcpu, 1, nested_gva); + + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SHUTDOWN); + + kvm_vm_free(vm); + return 0; +} -- cgit v1.2.3 From 0b4a043a54144aef3e5a2597c29c6adb5e6c47dc Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 10 Mar 2026 15:04:14 -0700 Subject: KVM: SVM: Add a helper to get LBR field pointer to dedup MSR accesses Add a helper to get a pointer to the corresponding VMCB field given an LBR MSR index, and use it to dedup the handling in svm_{g,s}et_msr(). No functional change intended. 
Suggested-by: Yosry Ahmed Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260310220414.2569208-1-seanjc@google.com [sean: use KVM_BUG_ON() instead of BUILD_BUG(), clang ain't smart enough] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 49 ++++++++++++++++++++----------------------------- 1 file changed, 20 insertions(+), 29 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 4bf0f5d7167f..5e6bd7fca298 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2751,6 +2751,24 @@ static int svm_get_feature_msr(u32 msr, u64 *data) return 0; } +static u64 *svm_vmcb_lbr(struct vcpu_svm *svm, u32 msr) +{ + switch (msr) { + case MSR_IA32_LASTBRANCHFROMIP: + return &svm->vmcb->save.br_from; + case MSR_IA32_LASTBRANCHTOIP: + return &svm->vmcb->save.br_to; + case MSR_IA32_LASTINTFROMIP: + return &svm->vmcb->save.last_excp_from; + case MSR_IA32_LASTINTTOIP: + return &svm->vmcb->save.last_excp_to; + default: + break; + } + KVM_BUG_ON(1, svm->vcpu.kvm); + return &svm->vmcb->save.br_from; +} + static bool sev_es_prevent_msr_access(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { @@ -2827,16 +2845,10 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) msr_info->data = lbrv ? svm->vmcb->save.dbgctl : 0; break; case MSR_IA32_LASTBRANCHFROMIP: - msr_info->data = lbrv ? svm->vmcb->save.br_from : 0; - break; case MSR_IA32_LASTBRANCHTOIP: - msr_info->data = lbrv ? svm->vmcb->save.br_to : 0; - break; case MSR_IA32_LASTINTFROMIP: - msr_info->data = lbrv ? svm->vmcb->save.last_excp_from : 0; - break; case MSR_IA32_LASTINTTOIP: - msr_info->data = lbrv ? svm->vmcb->save.last_excp_to : 0; + msr_info->data = lbrv ? *svm_vmcb_lbr(svm, msr_info->index) : 0; break; case MSR_VM_HSAVE_PA: msr_info->data = svm->nested.hsave_msr; @@ -3112,35 +3124,14 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) svm_update_lbrv(vcpu); break; case MSR_IA32_LASTBRANCHFROMIP: - if (!lbrv) - return KVM_MSR_RET_UNSUPPORTED; - if (!msr->host_initiated) - return 1; - svm->vmcb->save.br_from = data; - vmcb_mark_dirty(svm->vmcb, VMCB_LBR); - break; case MSR_IA32_LASTBRANCHTOIP: - if (!lbrv) - return KVM_MSR_RET_UNSUPPORTED; - if (!msr->host_initiated) - return 1; - svm->vmcb->save.br_to = data; - vmcb_mark_dirty(svm->vmcb, VMCB_LBR); - break; case MSR_IA32_LASTINTFROMIP: - if (!lbrv) - return KVM_MSR_RET_UNSUPPORTED; - if (!msr->host_initiated) - return 1; - svm->vmcb->save.last_excp_from = data; - vmcb_mark_dirty(svm->vmcb, VMCB_LBR); - break; case MSR_IA32_LASTINTTOIP: if (!lbrv) return KVM_MSR_RET_UNSUPPORTED; if (!msr->host_initiated) return 1; - svm->vmcb->save.last_excp_to = data; + *svm_vmcb_lbr(svm, ecx) = data; vmcb_mark_dirty(svm->vmcb, VMCB_LBR); break; case MSR_VM_HSAVE_PA: -- cgit v1.2.3 From 520a1347faf46c2c00c3499de05fdecc6d254c2e Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Fri, 6 Mar 2026 21:08:56 +0000 Subject: KVM: nSVM: Simplify error handling of nested_svm_copy_vmcb12_to_cache() nested_svm_vmrun() currently stores the return value of nested_svm_copy_vmcb12_to_cache() in a local variable 'err', separate from the generally used 'ret' variable. This is done to have a single call to kvm_skip_emulated_instruction(), such that we can store the return value of kvm_skip_emulated_instruction() in 'ret', and then re-check the return value of nested_svm_copy_vmcb12_to_cache() in 'err'. The code is unnecessarily confusing. 
Instead, call kvm_skip_emulated_instruction() in the failure path of nested_svm_copy_vmcb12_to_cache() if the return value is not -EFAULT, and drop 'err'. Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260306210900.1933788-3-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index b191c6cab57d..3ffde1ff719b 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1079,7 +1079,7 @@ static int nested_svm_copy_vmcb12_to_cache(struct kvm_vcpu *vcpu, u64 vmcb12_gpa int nested_svm_vmrun(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - int ret, err; + int ret; u64 vmcb12_gpa; struct vmcb *vmcb01 = svm->vmcb01.ptr; @@ -1104,19 +1104,20 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) return -EINVAL; vmcb12_gpa = svm->vmcb->save.rax; - err = nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa); - if (err == -EFAULT) { - kvm_inject_gp(vcpu, 0); - return 1; + + ret = nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa); + if (ret) { + if (ret == -EFAULT) { + kvm_inject_gp(vcpu, 0); + return 1; + } + + /* Advance RIP past VMRUN as part of the nested #VMEXIT. */ + return kvm_skip_emulated_instruction(vcpu); } - /* - * Advance RIP if #GP or #UD are not injected, but otherwise stop if - * copying and checking vmcb12 failed. - */ + /* At this point, VMRUN is guaranteed to not fault; advance RIP. */ ret = kvm_skip_emulated_instruction(vcpu); - if (err) - return ret; /* * Since vmcb01 is not in use, we can use it to store some of the L1 -- cgit v1.2.3 From 3d4470d71fbf70576636947aba1ae51adbad5225 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Thu, 12 Mar 2026 16:48:22 -0700 Subject: KVM: x86: Move nested_run_pending to kvm_vcpu_arch Move the nested_run_pending field present in both svm_nested_state and nested_vmx to the common kvm_vcpu_arch. This allows common code to use it without plumbing it through per-vendor helpers. nested_run_pending remains zero-initialized, as the entire kvm_vcpu struct is, and all further accesses are done through vcpu->arch instead of svm->nested or vmx->nested. No functional change intended. Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed [sean: expand the comment in the field declaration] Link: https://patch.msgid.link/20260312234823.3120658-2-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 9 ++++++++ arch/x86/kvm/svm/nested.c | 18 ++++++++-------- arch/x86/kvm/svm/svm.c | 16 +++++++------- arch/x86/kvm/svm/svm.h | 4 ---- arch/x86/kvm/vmx/nested.c | 46 ++++++++++++++++++++--------------------- arch/x86/kvm/vmx/vmx.c | 16 +++++++------- arch/x86/kvm/vmx/vmx.h | 3 --- 7 files changed, 57 insertions(+), 55 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ff07c45e3c73..19b3790e5e99 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1098,6 +1098,15 @@ struct kvm_vcpu_arch { */ bool pdptrs_from_userspace; + /* + * Set if an emulated nested VM-Enter to L2 is pending completion. KVM + * must not synthesize a VM-Exit to L1 before entering L2, as VM-Exits + * can only occur at instruction boundaries. The only exception is + * VMX's "notify" exits, which exist in large part to break the CPU out + * of infinite ucode loops, but can corrupt vCPU state in the process!
+ */ + bool nested_run_pending; + #if IS_ENABLED(CONFIG_HYPERV) hpa_t hv_root_tdp; #endif diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 3ffde1ff719b..e24f5450f121 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -914,7 +914,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) * the CPU and/or KVM and should be used regardless of L1's support. */ if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || - !svm->nested.nested_run_pending) + !vcpu->arch.nested_run_pending) vmcb02->control.next_rip = vmcb12_ctrl->next_rip; svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj); @@ -926,7 +926,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) if (is_evtinj_soft(vmcb02->control.event_inj)) { svm->soft_int_injected = true; if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS) || - !svm->nested.nested_run_pending) + !vcpu->arch.nested_run_pending) svm->soft_int_next_rip = vmcb12_ctrl->next_rip; } @@ -1132,11 +1132,11 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) if (!npt_enabled) vmcb01->save.cr3 = kvm_read_cr3(vcpu); - svm->nested.nested_run_pending = 1; + vcpu->arch.nested_run_pending = 1; if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) || !nested_svm_merge_msrpm(vcpu)) { - svm->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; svm->nmi_l1_to_l2 = false; svm->soft_int_injected = false; @@ -1278,7 +1278,7 @@ void nested_svm_vmexit(struct vcpu_svm *svm) /* Exit Guest-Mode */ leave_guest_mode(vcpu); svm->nested.vmcb12_gpa = 0; - WARN_ON_ONCE(svm->nested.nested_run_pending); + WARN_ON_ONCE(vcpu->arch.nested_run_pending); kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); @@ -1488,7 +1488,7 @@ void svm_leave_nested(struct kvm_vcpu *vcpu) struct vcpu_svm *svm = to_svm(vcpu); if (is_guest_mode(vcpu)) { - svm->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; svm->nested.vmcb12_gpa = INVALID_GPA; leave_guest_mode(vcpu); @@ -1673,7 +1673,7 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu) * previously injected event, the pending exception occurred while said * event was being delivered and thus needs to be handled. */ - bool block_nested_exceptions = svm->nested.nested_run_pending; + bool block_nested_exceptions = vcpu->arch.nested_run_pending; /* * New events (not exceptions) are only recognized at instruction * boundaries. 
If an event needs reinjection, then KVM is handling a @@ -1848,7 +1848,7 @@ static int svm_get_nested_state(struct kvm_vcpu *vcpu, kvm_state.size += KVM_STATE_NESTED_SVM_VMCB_SIZE; kvm_state.flags |= KVM_STATE_NESTED_GUEST_MODE; - if (svm->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) kvm_state.flags |= KVM_STATE_NESTED_RUN_PENDING; } @@ -1985,7 +1985,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, svm_set_gif(svm, !!(kvm_state->flags & KVM_STATE_NESTED_GIF_SET)); - svm->nested.nested_run_pending = + vcpu->arch.nested_run_pending = !!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING); svm->nested.vmcb12_gpa = kvm_state->hdr.svm.vmcb_pa; diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 5e6bd7fca298..dbd35340e7b0 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3820,7 +3820,7 @@ static void svm_fixup_nested_rips(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); - if (!is_guest_mode(vcpu) || !svm->nested.nested_run_pending) + if (!is_guest_mode(vcpu) || !vcpu->arch.nested_run_pending) return; /* @@ -3968,7 +3968,7 @@ bool svm_nmi_blocked(struct kvm_vcpu *vcpu) static int svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { struct vcpu_svm *svm = to_svm(vcpu); - if (svm->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; if (svm_nmi_blocked(vcpu)) @@ -4010,7 +4010,7 @@ static int svm_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection) { struct vcpu_svm *svm = to_svm(vcpu); - if (svm->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; if (svm_interrupt_blocked(vcpu)) @@ -4222,7 +4222,7 @@ static void svm_complete_soft_interrupt(struct kvm_vcpu *vcpu, u8 vector, * the soft int and will reinject it via the standard injection flow, * and so KVM needs to grab the state from the pending nested VMRUN. */ - if (is_guest_mode(vcpu) && svm->nested.nested_run_pending) + if (is_guest_mode(vcpu) && vcpu->arch.nested_run_pending) svm_set_nested_run_soft_int_state(vcpu); /* @@ -4525,11 +4525,11 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) nested_sync_control_from_vmcb02(svm); /* Track VMRUNs that have made past consistency checking */ - if (svm->nested.nested_run_pending && + if (vcpu->arch.nested_run_pending && !svm_is_vmrun_failure(svm->vmcb->control.exit_code)) ++vcpu->stat.nested_run; - svm->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; } svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING; @@ -4898,7 +4898,7 @@ bool svm_smi_blocked(struct kvm_vcpu *vcpu) static int svm_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { struct vcpu_svm *svm = to_svm(vcpu); - if (svm->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; if (svm_smi_blocked(vcpu)) @@ -5013,7 +5013,7 @@ static int svm_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) if (ret) goto unmap_save; - svm->nested.nested_run_pending = 1; + vcpu->arch.nested_run_pending = 1; unmap_save: kvm_vcpu_unmap(vcpu, &map_save); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index c53068848628..5b287ad83b69 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -215,10 +215,6 @@ struct svm_nested_state { */ void *msrpm; - /* A VMRUN has started but has not yet been performed, so - * we cannot inject a nested vmexit yet. 
*/ - bool nested_run_pending; - /* cache for control fields of the guest */ struct vmcb_ctrl_area_cached ctl; diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 248635da6766..031075467a6d 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2273,7 +2273,7 @@ static void vmx_start_preemption_timer(struct kvm_vcpu *vcpu, static u64 nested_vmx_calc_efer(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) { - if (vmx->nested.nested_run_pending && + if (vmx->vcpu.arch.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)) return vmcs12->guest_ia32_efer; else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) @@ -2513,7 +2513,7 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0 /* * Interrupt/Exception Fields */ - if (vmx->nested.nested_run_pending) { + if (vmx->vcpu.arch.nested_run_pending) { vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, vmcs12->vm_entry_intr_info_field); vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, @@ -2621,7 +2621,7 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3); } - if (kvm_mpx_supported() && vmx->nested.nested_run_pending && + if (kvm_mpx_supported() && vmx->vcpu.arch.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)) vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs); } @@ -2718,7 +2718,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, !(evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1); } - if (vmx->nested.nested_run_pending && + if (vcpu->arch.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) { kvm_set_dr(vcpu, 7, vmcs12->guest_dr7); vmx_guest_debugctl_write(vcpu, vmcs12->guest_ia32_debugctl & @@ -2728,13 +2728,13 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, vmx_guest_debugctl_write(vcpu, vmx->nested.pre_vmenter_debugctl); } - if (!vmx->nested.nested_run_pending || + if (!vcpu->arch.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) vmcs_write_cet_state(vcpu, vmx->nested.pre_vmenter_s_cet, vmx->nested.pre_vmenter_ssp, vmx->nested.pre_vmenter_ssp_tbl); - if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending || + if (kvm_mpx_supported() && (!vcpu->arch.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))) vmcs_write64(GUEST_BNDCFGS, vmx->nested.pre_vmenter_bndcfgs); vmx_set_rflags(vcpu, vmcs12->guest_rflags); @@ -2747,7 +2747,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask; vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits); - if (vmx->nested.nested_run_pending && + if (vcpu->arch.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)) { vmcs_write64(GUEST_IA32_PAT, vmcs12->guest_ia32_pat); vcpu->arch.pat = vmcs12->guest_ia32_pat; @@ -3335,7 +3335,7 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu, * to bit 8 (LME) if bit 31 in the CR0 field (corresponding to * CR0.PG) is 1. 
*/ - if (to_vmx(vcpu)->nested.nested_run_pending && + if (vcpu->arch.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)) { if (CC(!kvm_valid_efer(vcpu, vmcs12->guest_ia32_efer)) || CC(ia32e != !!(vmcs12->guest_ia32_efer & EFER_LMA)) || @@ -3613,15 +3613,15 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu, kvm_service_local_tlb_flush_requests(vcpu); - if (!vmx->nested.nested_run_pending || + if (!vcpu->arch.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) vmx->nested.pre_vmenter_debugctl = vmx_guest_debugctl_read(); if (kvm_mpx_supported() && - (!vmx->nested.nested_run_pending || + (!vcpu->arch.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))) vmx->nested.pre_vmenter_bndcfgs = vmcs_read64(GUEST_BNDCFGS); - if (!vmx->nested.nested_run_pending || + if (!vcpu->arch.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) vmcs_read_cet_state(vcpu, &vmx->nested.pre_vmenter_s_cet, &vmx->nested.pre_vmenter_ssp, @@ -3830,7 +3830,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) * We're finally done with prerequisite checking, and can start with * the nested entry. */ - vmx->nested.nested_run_pending = 1; + vcpu->arch.nested_run_pending = 1; vmx->nested.has_preemption_timer_deadline = false; status = nested_vmx_enter_non_root_mode(vcpu, true); if (unlikely(status != NVMX_VMENTRY_SUCCESS)) @@ -3862,12 +3862,12 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) !nested_cpu_has(vmcs12, CPU_BASED_NMI_WINDOW_EXITING) && !(nested_cpu_has(vmcs12, CPU_BASED_INTR_WINDOW_EXITING) && (vmcs12->guest_rflags & X86_EFLAGS_IF))) { - vmx->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; return kvm_emulate_halt_noskip(vcpu); } break; case GUEST_ACTIVITY_WAIT_SIPI: - vmx->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; kvm_set_mp_state(vcpu, KVM_MP_STATE_INIT_RECEIVED); break; default: @@ -3877,7 +3877,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) return 1; vmentry_failed: - vmx->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; if (status == NVMX_VMENTRY_KVM_INTERNAL_ERROR) return 0; if (status == NVMX_VMENTRY_VMEXIT) @@ -4274,7 +4274,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu) * previously injected event, the pending exception occurred while said * event was being delivered and thus needs to be handled. */ - bool block_nested_exceptions = vmx->nested.nested_run_pending; + bool block_nested_exceptions = vcpu->arch.nested_run_pending; /* * Events that don't require injection, i.e. 
that are virtualized by * hardware, aren't blocked by a pending VM-Enter as KVM doesn't need @@ -4643,7 +4643,7 @@ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) if (nested_cpu_has_preemption_timer(vmcs12) && vmcs12->vm_exit_controls & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER && - !vmx->nested.nested_run_pending) + !vcpu->arch.nested_run_pending) vmcs12->vmx_preemption_timer_value = vmx_get_preemption_timer_value(vcpu); @@ -5042,7 +5042,7 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason, vmx->nested.mtf_pending = false; /* trying to cancel vmlaunch/vmresume is a bug */ - WARN_ON_ONCE(vmx->nested.nested_run_pending); + WARN_ON_ONCE(vcpu->arch.nested_run_pending); #ifdef CONFIG_KVM_HYPERV if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) { @@ -6665,7 +6665,7 @@ bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu) unsigned long exit_qual; u32 exit_intr_info; - WARN_ON_ONCE(vmx->nested.nested_run_pending); + WARN_ON_ONCE(vcpu->arch.nested_run_pending); /* * Late nested VM-Fail shares the same flow as nested VM-Exit since KVM @@ -6761,7 +6761,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu, if (is_guest_mode(vcpu)) { kvm_state.flags |= KVM_STATE_NESTED_GUEST_MODE; - if (vmx->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) kvm_state.flags |= KVM_STATE_NESTED_RUN_PENDING; if (vmx->nested.mtf_pending) @@ -6836,7 +6836,7 @@ out: void vmx_leave_nested(struct kvm_vcpu *vcpu) { if (is_guest_mode(vcpu)) { - to_vmx(vcpu)->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; nested_vmx_vmexit(vcpu, -1, 0, 0); } free_nested(vcpu); @@ -6973,7 +6973,7 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu, if (!(kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)) return 0; - vmx->nested.nested_run_pending = + vcpu->arch.nested_run_pending = !!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING); vmx->nested.mtf_pending = @@ -7025,7 +7025,7 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu, return 0; error_guest_mode: - vmx->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; return ret; } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 967b58a8ab9d..9ef3fb04403d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5279,7 +5279,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu) int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { - if (to_vmx(vcpu)->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; /* An NMI must not be injected into L2 if it's supposed to VM-Exit. */ @@ -5306,7 +5306,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu) int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection) { - if (to_vmx(vcpu)->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; /* @@ -6118,7 +6118,7 @@ static bool vmx_unhandleable_emulation_required(struct kvm_vcpu *vcpu) * only reachable if userspace modifies L2 guest state after KVM has * performed the nested VM-Enter consistency checks. */ - if (vmx->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return true; /* @@ -6802,7 +6802,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) * invalid guest state should never happen as that means KVM knowingly * allowed a nested VM-Enter with an invalid vmcs12. More below. 
*/ - if (KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm)) + if (KVM_BUG_ON(vcpu->arch.nested_run_pending, vcpu->kvm)) return -EIO; if (is_guest_mode(vcpu)) { @@ -7730,11 +7730,11 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) * Track VMLAUNCH/VMRESUME that have made past guest state * checking. */ - if (vmx->nested.nested_run_pending && + if (vcpu->arch.nested_run_pending && !vmx_get_exit_reason(vcpu).failed_vmentry) ++vcpu->stat.nested_run; - vmx->nested.nested_run_pending = 0; + vcpu->arch.nested_run_pending = 0; } if (unlikely(vmx->fail)) @@ -8491,7 +8491,7 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu) int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { /* we need a nested vmexit to enter SMM, postpone if run is pending */ - if (to_vmx(vcpu)->nested.nested_run_pending) + if (vcpu->arch.nested_run_pending) return -EBUSY; return !is_smm(vcpu); } @@ -8532,7 +8532,7 @@ int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) if (ret) return ret; - vmx->nested.nested_run_pending = 1; + vcpu->arch.nested_run_pending = 1; vmx->nested.smm.guest_mode = false; } return 0; diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 70bfe81dea54..db84e8001da5 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -138,9 +138,6 @@ struct nested_vmx { */ bool enlightened_vmcs_enabled; - /* L2 must run next, and mustn't decide to exit to L1. */ - bool nested_run_pending; - /* Pending MTF VM-exit into L1. */ bool mtf_pending; -- cgit v1.2.3 From 7212094baef5acabef1969d77781a6527c09d743 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 12 Mar 2026 16:48:23 -0700 Subject: KVM: x86: Suppress WARNs on nested_run_pending after userspace exit To end an ongoing game of whack-a-mole between KVM and syzkaller, WARN on illegally cancelling a pending nested VM-Enter if and only if userspace has NOT gained control of the vCPU since the nested run was initiated. As proven time and time again by syzkaller, userspace can clobber vCPU state so as to force a VM-Exit that violates KVM's architectural modelling of VMRUN/VMLAUNCH/VMRESUME. To detect that userspace has gained control, while minimizing the risk of operating on stale data, convert nested_run_pending from a pure boolean to a tri-state of sorts, where '0' is still "not pending", '1' is "pending", and '2' is "pending but untrusted". Then on KVM_RUN, if the flag is in the "trusted pending" state, move it to "untrusted pending". Note, moving the state to "untrusted" even if KVM_RUN is ultimately rejected is a-ok, because for the "untrusted" state to matter, KVM must get past kvm_x86_vcpu_pre_run() at some point for the vCPU. Reviewed-by: Yosry Ahmed Link: https://patch.msgid.link/20260312234823.3120658-3-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 8 +++++++- arch/x86/kvm/svm/nested.c | 11 +++++++---- arch/x86/kvm/svm/svm.c | 2 +- arch/x86/kvm/vmx/nested.c | 12 +++++++----- arch/x86/kvm/vmx/vmx.c | 2 +- arch/x86/kvm/x86.c | 7 +++++++ arch/x86/kvm/x86.h | 10 ++++++++++ 7 files changed, 40 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 19b3790e5e99..c54c969c88ee 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1104,8 +1104,14 @@ struct kvm_vcpu_arch { * can only occur at instruction boundaries. 
The only exception is * VMX's "notify" exits, which exist in large part to break the CPU out * of infinite ucode loops, but can corrupt vCPU state in the process! + * + * For all intents and purposes, this is a boolean, but it's tracked as + * a u8 so that KVM can detect when userspace may have stuffed vCPU + * state and generated an architecturally-impossible VM-Exit. */ - bool nested_run_pending; +#define KVM_NESTED_RUN_PENDING 1 +#define KVM_NESTED_RUN_PENDING_UNTRUSTED 2 + u8 nested_run_pending; #if IS_ENABLED(CONFIG_HYPERV) hpa_t hv_root_tdp; diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index e24f5450f121..88e878160229 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1132,7 +1132,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) if (!npt_enabled) vmcb01->save.cr3 = kvm_read_cr3(vcpu); - vcpu->arch.nested_run_pending = 1; + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING; if (enter_svm_guest_mode(vcpu, vmcb12_gpa, true) || !nested_svm_merge_msrpm(vcpu)) { @@ -1278,7 +1278,8 @@ void nested_svm_vmexit(struct vcpu_svm *svm) /* Exit Guest-Mode */ leave_guest_mode(vcpu); svm->nested.vmcb12_gpa = 0; - WARN_ON_ONCE(vcpu->arch.nested_run_pending); + + kvm_warn_on_nested_run_pending(vcpu); kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); @@ -1985,8 +1986,10 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu, svm_set_gif(svm, !!(kvm_state->flags & KVM_STATE_NESTED_GIF_SET)); - vcpu->arch.nested_run_pending = - !!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING); + if (kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING) + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING_UNTRUSTED; + else + vcpu->arch.nested_run_pending = 0; svm->nested.vmcb12_gpa = kvm_state->hdr.svm.vmcb_pa; diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index dbd35340e7b0..f4b0aeba948f 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -5013,7 +5013,7 @@ static int svm_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) if (ret) goto unmap_save; - vcpu->arch.nested_run_pending = 1; + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING; unmap_save: kvm_vcpu_unmap(vcpu, &map_save); diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 031075467a6d..48d2991886cb 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -3830,7 +3830,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) * We're finally done with prerequisite checking, and can start with * the nested entry. 
*/ - vcpu->arch.nested_run_pending = 1; + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING; vmx->nested.has_preemption_timer_deadline = false; status = nested_vmx_enter_non_root_mode(vcpu, true); if (unlikely(status != NVMX_VMENTRY_SUCCESS)) @@ -5042,7 +5042,7 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason, vmx->nested.mtf_pending = false; /* trying to cancel vmlaunch/vmresume is a bug */ - WARN_ON_ONCE(vcpu->arch.nested_run_pending); + kvm_warn_on_nested_run_pending(vcpu); #ifdef CONFIG_KVM_HYPERV if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) { @@ -6665,7 +6665,7 @@ bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu) unsigned long exit_qual; u32 exit_intr_info; - WARN_ON_ONCE(vcpu->arch.nested_run_pending); + kvm_warn_on_nested_run_pending(vcpu); /* * Late nested VM-Fail shares the same flow as nested VM-Exit since KVM @@ -6973,8 +6973,10 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu, if (!(kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)) return 0; - vcpu->arch.nested_run_pending = - !!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING); + if (kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING) + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING_UNTRUSTED; + else + vcpu->arch.nested_run_pending = 0; vmx->nested.mtf_pending = !!(kvm_state->flags & KVM_STATE_NESTED_MTF_PENDING); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9ef3fb04403d..d75f6b22d74c 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -8532,7 +8532,7 @@ int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) if (ret) return ret; - vcpu->arch.nested_run_pending = 1; + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING; vmx->nested.smm.guest_mode = false; } return 0; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 64da02d1ee00..aa29f90c6e96 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -11913,6 +11913,13 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) static int kvm_x86_vcpu_pre_run(struct kvm_vcpu *vcpu) { + /* + * Userspace may have modified vCPU state, mark nested_run_pending as + * "untrusted" to avoid triggering false-positive WARNs. + */ + if (vcpu->arch.nested_run_pending == KVM_NESTED_RUN_PENDING) + vcpu->arch.nested_run_pending = KVM_NESTED_RUN_PENDING_UNTRUSTED; + /* * SIPI_RECEIVED is obsolete; KVM leaves the vCPU in Wait-For-SIPI and * tracks the pending SIPI separately. SIPI_RECEIVED is still accepted diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 94d4f07aaaa0..9fe3a53fd8be 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -188,6 +188,16 @@ static inline bool kvm_can_set_cpuid_and_feature_msrs(struct kvm_vcpu *vcpu) return vcpu->arch.last_vmentry_cpu == -1 && !is_guest_mode(vcpu); } +/* + * WARN if a nested VM-Enter is pending completion, and userspace hasn't gained + * control since the nested VM-Enter was initiated (in which case, userspace + * may have modified vCPU state to induce an architecturally invalid VM-Exit). 
+ */ +static inline void kvm_warn_on_nested_run_pending(struct kvm_vcpu *vcpu) +{ + WARN_ON_ONCE(vcpu->arch.nested_run_pending == KVM_NESTED_RUN_PENDING); +} + static inline void kvm_set_mp_state(struct kvm_vcpu *vcpu, int mp_state) { vcpu->arch.mp_state = mp_state; -- cgit v1.2.3 From c85aaff26d55920d783adac431a59ec738a35aef Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:24 +0000 Subject: KVM: SVM: Properly check RAX in the emulator for SVM instructions Architecturally, VMRUN/VMLOAD/VMSAVE should generate a #GP if the physical address in RAX is not supported. check_svme_pa() hardcodes this to checking that bits 63-48 are not set. This is incorrect on HW supporting 52 bits of physical address space. Additionally, the emulator does not check whether the address is misaligned, which should also result in a #GP. Use page_address_valid(), which properly checks both alignment and the legality of the address based on the guest's MAXPHYADDR. Plumb it through x86_emulate_ops, similar to is_canonical_addr(), to avoid directly accessing the vCPU object in emulator code. Fixes: 01de8b09e606 ("KVM: SVM: Add intercept checks for SVM instructions") Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-2-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/emulate.c | 3 +-- arch/x86/kvm/kvm_emulate.h | 2 ++ arch/x86/kvm/x86.c | 6 ++++++ 3 files changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c8e292e9a24d..202c376ff501 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3874,8 +3874,7 @@ static int check_svme_pa(struct x86_emulate_ctxt *ctxt) { u64 rax = reg_read(ctxt, VCPU_REGS_RAX); - /* Valid physical address? */ - if (rax & 0xffff000000000000ULL) + if (!ctxt->ops->page_address_valid(ctxt, rax)) return emulate_gp(ctxt, 0); return check_svme(ctxt); diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index fb3dab4b5a53..0abff36d0994 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -245,6 +245,8 @@ struct x86_emulate_ops { bool (*is_canonical_addr)(struct x86_emulate_ctxt *ctxt, gva_t addr, unsigned int flags); + + bool (*page_address_valid)(struct x86_emulate_ctxt *ctxt, gpa_t gpa); }; /* Type, address-of, and value of an instruction's operand.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index aa29f90c6e96..2410401c57d8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8907,6 +8907,11 @@ static bool emulator_is_canonical_addr(struct x86_emulate_ctxt *ctxt, return !is_noncanonical_address(addr, emul_to_vcpu(ctxt), flags); } +static bool emulator_page_address_valid(struct x86_emulate_ctxt *ctxt, gpa_t gpa) +{ + return page_address_valid(emul_to_vcpu(ctxt), gpa); +} + static const struct x86_emulate_ops emulate_ops = { .vm_bugged = emulator_vm_bugged, .read_gpr = emulator_read_gpr, @@ -8954,6 +8959,7 @@ static const struct x86_emulate_ops emulate_ops = { .set_xcr = emulator_set_xcr, .get_untagged_addr = emulator_get_untagged_addr, .is_canonical_addr = emulator_is_canonical_addr, + .page_address_valid = emulator_page_address_valid, }; static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask) -- cgit v1.2.3 From 27f70eaa8661c031f6c5efa4d72c7c4544cc41fc Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:25 +0000 Subject: KVM: SVM: Refactor SVM instruction handling on #GP intercept Instead of returning an opcode from svm_instr_opcode() and then passing it to emulate_svm_instr(), which uses it to find the corresponding exit code and intercept handler, return the exit code directly from svm_instr_opcode(), and rename it to svm_get_decoded_instr_exit_code(). emulate_svm_instr() boils down to synthesizing a #VMEXIT or calling the intercept handler, so open-code it in gp_interception(), and use svm_invoke_exit_handler() to call the intercept handler based on the exit code. This allows for dropping the SVM_INSTR_* enum, and the const array mapping its values to exit codes and intercept handlers. In gp_interception(), handle SVM instructions first with an early return, and invert the is_guest_mode() checks, un-indenting the rest of the code. No functional change intended.
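For reference, the modrm values matched by the decode helper come directly from the instruction encodings: VMRUN is 0F 01 D8, VMLOAD is 0F 01 DA, and VMSAVE is 0F 01 DB, each taking a VMCB physical address implicitly in rAX, which is why a two-byte 0F 01 opcode (ctxt->b == 0x1 with opcode_len == 2) plus the modrm byte fully identifies the instruction. A guest-side sketch emitting the raw VMLOAD encoding, illustrative only:

	static inline void vmload_raw(unsigned long vmcb_pa)
	{
		/* VMLOAD: 0F 01 DA, implicit rAX operand */
		asm volatile(".byte 0x0f, 0x01, 0xda" : : "a"(vmcb_pa) : "memory");
	}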
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-3-yosry@kernel.org [sean: add BUILD_BUG_ON(), tweak formatting/naming] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 77 +++++++++++++++++--------------------------------- 1 file changed, 26 insertions(+), 51 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index f4b0aeba948f..927764894b89 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2236,54 +2236,28 @@ static int vmrun_interception(struct kvm_vcpu *vcpu) return nested_svm_vmrun(vcpu); } -enum { - NONE_SVM_INSTR, - SVM_INSTR_VMRUN, - SVM_INSTR_VMLOAD, - SVM_INSTR_VMSAVE, -}; - -/* Return NONE_SVM_INSTR if not SVM instrs, otherwise return decode result */ -static int svm_instr_opcode(struct kvm_vcpu *vcpu) +/* Return 0 if not SVM instr, otherwise return associated exit_code */ +static u64 svm_get_decoded_instr_exit_code(struct kvm_vcpu *vcpu) { struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; if (ctxt->b != 0x1 || ctxt->opcode_len != 2) - return NONE_SVM_INSTR; + return 0; + + BUILD_BUG_ON(!SVM_EXIT_VMRUN || !SVM_EXIT_VMLOAD || !SVM_EXIT_VMSAVE); switch (ctxt->modrm) { case 0xd8: /* VMRUN */ - return SVM_INSTR_VMRUN; + return SVM_EXIT_VMRUN; case 0xda: /* VMLOAD */ - return SVM_INSTR_VMLOAD; + return SVM_EXIT_VMLOAD; case 0xdb: /* VMSAVE */ - return SVM_INSTR_VMSAVE; + return SVM_EXIT_VMSAVE; default: break; } - return NONE_SVM_INSTR; -} - -static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode) -{ - const int guest_mode_exit_codes[] = { - [SVM_INSTR_VMRUN] = SVM_EXIT_VMRUN, - [SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD, - [SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE, - }; - int (*const svm_instr_handlers[])(struct kvm_vcpu *vcpu) = { - [SVM_INSTR_VMRUN] = vmrun_interception, - [SVM_INSTR_VMLOAD] = vmload_interception, - [SVM_INSTR_VMSAVE] = vmsave_interception, - }; - struct vcpu_svm *svm = to_svm(vcpu); - - if (is_guest_mode(vcpu)) { - nested_svm_simple_vmexit(svm, guest_mode_exit_codes[opcode]); - return 1; - } - return svm_instr_handlers[opcode](vcpu); + return 0; } /* @@ -2298,7 +2272,7 @@ static int gp_interception(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); u32 error_code = svm->vmcb->control.exit_info_1; - int opcode; + u64 svm_exit_code; /* Both #GP cases have zero error_code */ if (error_code) @@ -2308,27 +2282,28 @@ static int gp_interception(struct kvm_vcpu *vcpu) if (x86_decode_emulated_instruction(vcpu, 0, NULL, 0) != EMULATION_OK) goto reinject; - opcode = svm_instr_opcode(vcpu); - - if (opcode == NONE_SVM_INSTR) { - if (!enable_vmware_backdoor) - goto reinject; - - /* - * VMware backdoor emulation on #GP interception only handles - * IN{S}, OUT{S}, and RDPMC. - */ - if (!is_guest_mode(vcpu)) - return kvm_emulate_instruction(vcpu, - EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE); - } else { + svm_exit_code = svm_get_decoded_instr_exit_code(vcpu); + if (svm_exit_code) { /* All SVM instructions expect page aligned RAX */ if (svm->vmcb->save.rax & ~PAGE_MASK) goto reinject; - return emulate_svm_instr(vcpu, opcode); + if (!is_guest_mode(vcpu)) + return svm_invoke_exit_handler(vcpu, svm_exit_code); + + nested_svm_simple_vmexit(svm, svm_exit_code); + return 1; } + /* + * VMware backdoor emulation on #GP interception only handles + * IN{S}, OUT{S}, and RDPMC, and only for L1. 
+ */ + if (!enable_vmware_backdoor || is_guest_mode(vcpu)) + goto reinject; + + return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE); + reinject: kvm_queue_exception_e(vcpu, GP_VECTOR, error_code); return 1; -- cgit v1.2.3 From 435741a4e766e3704af03c9ac634a73b9e75fc4c Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:26 +0000 Subject: KVM: SVM: Properly check RAX on #GP intercept of SVM instructions When KVM intercepts #GP on an SVM instruction, it re-injects the #GP if the instruction was executed with a misaligned RAX. However, a #GP should also be reinjected if RAX contains an illegal GPA. According to the APM, one of the #GP conditions is: rAX referenced a physical address above the maximum supported physical address. Replace the PAGE_MASK check with page_address_valid(), which checks both page-alignment as well as the legality of the GPA based on the vCPU's MAXPHYADDR. Use kvm_register_read() to read RAX so that bits 63:32 are dropped when the vCPU is in 32-bit mode, i.e. to avoid a false positive when checking the validity of the address. Note that this is currently only a problem if KVM is running an L2 guest and ends up synthesizing a #VMEXIT to L1, as the RAX check takes precedence over the intercept. Otherwise, if KVM emulates the instruction, kvm_vcpu_map() should fail on illegal GPAs and inject a #GP anyway. However, following patches will change the failure behavior of kvm_vcpu_map(), so make sure the #GP interception handler does this appropriately. Opportunistically add a teaser FIXME about the SVM instructions handling on #GP belonging in the emulator. Fixes: 82a11e9c6fa2 ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions") Fixes: d1cba6c92237 ("KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround") Suggested-by: Sean Christopherson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-4-yosry@kernel.org [sean: massage wording with respect to kvm_register_read()] Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 927764894b89..f68958447e58 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2282,10 +2282,10 @@ static int gp_interception(struct kvm_vcpu *vcpu) if (x86_decode_emulated_instruction(vcpu, 0, NULL, 0) != EMULATION_OK) goto reinject; + /* FIXME: Handle SVM instructions through the emulator */ svm_exit_code = svm_get_decoded_instr_exit_code(vcpu); if (svm_exit_code) { - /* All SVM instructions expect page aligned RAX */ - if (svm->vmcb->save.rax & ~PAGE_MASK) + if (!page_address_valid(vcpu, kvm_register_read(vcpu, VCPU_REGS_RAX))) goto reinject; if (!is_guest_mode(vcpu)) -- cgit v1.2.3 From d2fbeb61e1451eba09eb3249aaf1f01d4c5c1f8b Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:27 +0000 Subject: KVM: SVM: Move RAX legality check to SVM insn interception handlers When #GP is intercepted by KVM, the #GP interception handler checks whether the GPA in RAX is legal and reinjects the #GP accordingly. Otherwise, it calls into the appropriate interception handler for VMRUN/VMLOAD/VMSAVE. The intercept handlers do not check RAX. However, the intercept handlers need to do the RAX check, because if the guest has a smaller MAXPHYADDR, RAX could be legal from the hardware perspective (i.e. CPU does not inject #GP), but not from the vCPU's perspective.
Note that with allow_smaller_maxphyaddr, neither NPT nor VLS can be used, so VMLOAD/VMSAVE have to be intercepted, and RAX can always be checked against the vCPU's MAXPHYADDR. Move the check into the interception handlers for VMRUN/VMLOAD/VMSAVE as the CPU does not check RAX before the interception. Read RAX using kvm_register_read() to avoid a false negative on page_address_valid() on 32-bit due to garbage in the higher bits. Keep the check in the #GP intercept handler in the nested case where a #VMEXIT is synthesized into L1, as the RAX check is still needed there and takes precedence over the intercept. Opportunistically add a FIXME about the #VMEXIT being synthesized into L1, as it needs to be conditional. Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-5-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 +++++- arch/x86/kvm/svm/svm.c | 20 ++++++++++++++++---- 2 files changed, 21 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 88e878160229..16f4bc4f48f5 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1103,7 +1103,11 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) if (WARN_ON_ONCE(!svm->nested.initialized)) return -EINVAL; - vmcb12_gpa = svm->vmcb->save.rax; + vmcb12_gpa = kvm_register_read(vcpu, VCPU_REGS_RAX); + if (!page_address_valid(vcpu, vmcb12_gpa)) { + kvm_inject_gp(vcpu, 0); + return 1; + } ret = nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa); if (ret) { diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index f68958447e58..3472916657e1 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2185,6 +2185,7 @@ static int intr_interception(struct kvm_vcpu *vcpu) static int vmload_vmsave_interception(struct kvm_vcpu *vcpu, bool vmload) { + u64 vmcb12_gpa = kvm_register_read(vcpu, VCPU_REGS_RAX); struct vcpu_svm *svm = to_svm(vcpu); struct vmcb *vmcb12; struct kvm_host_map map; @@ -2193,7 +2194,12 @@ static int vmload_vmsave_interception(struct kvm_vcpu *vcpu, bool vmload) if (nested_svm_check_permissions(vcpu)) return 1; - ret = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->vmcb->save.rax), &map); + if (!page_address_valid(vcpu, vmcb12_gpa)) { + kvm_inject_gp(vcpu, 0); + return 1; + } + + ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map); if (ret) { if (ret == -EINVAL) kvm_inject_gp(vcpu, 0); @@ -2285,12 +2291,18 @@ static int gp_interception(struct kvm_vcpu *vcpu) /* FIXME: Handle SVM instructions through the emulator */ svm_exit_code = svm_get_decoded_instr_exit_code(vcpu); if (svm_exit_code) { - if (!page_address_valid(vcpu, kvm_register_read(vcpu, VCPU_REGS_RAX))) - goto reinject; - if (!is_guest_mode(vcpu)) return svm_invoke_exit_handler(vcpu, svm_exit_code); + if (!page_address_valid(vcpu, kvm_register_read(vcpu, VCPU_REGS_RAX))) + goto reinject; + + /* + * FIXME: Only synthesize a #VMEXIT if L1 sets the intercept, + * but only after the VMLOAD/VMSAVE exit handlers can properly + * handle VMLOAD/VMSAVE from L2 with VLS enabled in L1 (i.e. + * RAX is an L2 GPA that needs translation through L1's NPT.
+ */ nested_svm_simple_vmexit(svm, svm_exit_code); return 1; } -- cgit v1.2.3 From 783cf7d01fb8788f37735c0a6c3955024189287c Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:28 +0000 Subject: KVM: SVM: Check EFER.SVME and CPL on #GP intercept of SVM instructions When KVM intercepts #GP on an SVM instruction from L2, it checks the legality of RAX, and injects a #GP if RAX is illegal, or otherwise synthesizes a #VMEXIT to L1. However, checking EFER.SVME and CPL takes precedence over both the RAX check and the intercept. Call nested_svm_check_permissions() first to cover both. Note that if #GP is intercepted on an SVM instruction in L1, the intercept handlers of VMRUN/VMLOAD/VMSAVE already perform these checks. Note #2: if KVM does not intercept #GP, the check for EFER.SVME is not done in the correct order, because KVM handles it by intercepting the instructions when EFER.SVME=0 and injecting #UD. However, a #GP injected by hardware would happen before the instruction intercept, leading to #GP taking precedence over #UD from the guest's perspective. Opportunistically add a FIXME for this. Fixes: 82a11e9c6fa2 ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions") Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-6-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 3472916657e1..7d0d95f40cd2 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1054,6 +1054,11 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu) * No need to toggle any of the vgif/vls/etc. enable bits here, as they * are set when the VMCB is initialized and never cleared (if the * relevant intercepts are set, the enablements are meaningless anyway). + * + * FIXME: When #GP is not intercepted, a #GP on these instructions (e.g. + * due to CPL > 0) could be injected by hardware before the instruction + * is intercepted, leading to #GP taking precedence over #UD from the + * guest's perspective. */ if (!(vcpu->arch.efer & EFER_SVME)) { svm_set_intercept(svm, INTERCEPT_VMLOAD); @@ -2294,6 +2299,9 @@ static int gp_interception(struct kvm_vcpu *vcpu) if (!is_guest_mode(vcpu)) return svm_invoke_exit_handler(vcpu, svm_exit_code); + if (nested_svm_check_permissions(vcpu)) + return 1; + if (!page_address_valid(vcpu, kvm_register_read(vcpu, VCPU_REGS_RAX))) goto reinject; -- cgit v1.2.3 From 878b8efa2adbbfffc97f68cbba243cdf18d943c0 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:29 +0000 Subject: KVM: SVM: Treat mapping failures equally in VMLOAD/VMSAVE emulation Currently, a #GP is only injected if kvm_vcpu_map() fails with -EINVAL. But it could also fail with -EFAULT if creating a host mapping failed. Inject a #GP in all cases; there is no reason to treat the failure modes differently. Similar to commit 01ddcdc55e09 ("KVM: nSVM: Always inject a #GP if mapping VMCB12 fails on nested VMRUN"), treat all failures equally.
Fixes: 8c5fbf1a7231 ("KVM/nSVM: Use the new mapping API for mapping guest memory") Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-7-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/svm.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7d0d95f40cd2..b83d524a6e78 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2204,10 +2204,8 @@ static int vmload_vmsave_interception(struct kvm_vcpu *vcpu, bool vmload) return 1; } - ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map); - if (ret) { - if (ret == -EINVAL) - kvm_inject_gp(vcpu, 0); + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) { + kvm_inject_gp(vcpu, 0); return 1; } -- cgit v1.2.3 From 2daf71bfd77d0b7ba7b81d1a6ac872ebb338ff31 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:30 +0000 Subject: KVM: nSVM: Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails KVM currently injects a #GP if mapping vmcb12 fails when emulating VMRUN/VMLOAD/VMSAVE. This is not architectural behavior, as a #GP should only be injected if the physical address is unsupported or misaligned. Instead, handle it as an emulation failure, similar to how nVMX handles failures to read/write guest memory in several emulation paths. When virtual VMLOAD/VMSAVE is enabled, if vmcb12's GPA is not mapped in the NPTs, a VMEXIT(#NPF) will be generated, and KVM will install an MMIO SPTE and emulate the instruction if there is no corresponding memslot. x86_emulate_insn() will return EMULATION_FAILED, as VMLOAD/VMSAVE are not handled as part of the twobyte_insn cases. Although this too is an emulation failure, it only returns straight to userspace if KVM_CAP_EXIT_ON_EMULATION_FAILURE is set; otherwise, KVM injects a #UD and exits to userspace only if the vCPU is not in guest mode. So the behavior is slightly different when virtual VMLOAD/VMSAVE is enabled. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Reported-by: Jim Mattson Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-8-yosry@kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/nested.c | 6 ++---- arch/x86/kvm/svm/svm.c | 6 ++---- 2 files changed, 4 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 16f4bc4f48f5..b42d95fc8499 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -1111,10 +1111,8 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu) ret = nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa); if (ret) { - if (ret == -EFAULT) { - kvm_inject_gp(vcpu, 0); - return 1; - } + if (ret == -EFAULT) + return kvm_handle_memory_failure(vcpu, X86EMUL_IO_NEEDED, NULL); /* Advance RIP past VMRUN as part of the nested #VMEXIT.
*/ return kvm_skip_emulated_instruction(vcpu); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index b83d524a6e78..1e51cbb80e86 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2204,10 +2204,8 @@ static int vmload_vmsave_interception(struct kvm_vcpu *vcpu, bool vmload) return 1; } - if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) { - kvm_inject_gp(vcpu, 0); - return 1; - } + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map)) + return kvm_handle_memory_failure(vcpu, X86EMUL_IO_NEEDED, NULL); vmcb12 = map.hva; -- cgit v1.2.3 From 428543fbf06c498d9835d549920c2206befc1589 Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:31 +0000 Subject: KVM: selftests: Rework svm_nested_invalid_vmcb12_gpa The test currently ostensibly verifies that VMRUN causes a #GP if the vmcb12 GPA is valid but unmappable. However, it calls run_guest() with the invalid vmcb12 GPA, and the #GP is produced by VMLOAD, not VMRUN. Additionally, the underlying logic just changed to match architectural behavior, and all of VMRUN/VMLOAD/VMSAVE fail emulation if vmcb12 cannot be mapped. The CPU still injects a #GP if the vmcb12 GPA exceeds MAXPHYADDR. Rework the test to use the KVM_ONE_VCPU_TEST[_SUITE] harness, and test all of VMRUN/VMLOAD/VMSAVE with both an invalid GPA (-1ULL), which causes a #GP, and a valid but unmappable GPA, which causes an emulation failure. Execute the instructions directly from L1 instead of via run_guest() to make sure the #GP or emulation failure is produced by the right instruction. Leave the #VMEXIT-with-unmappable-GPA test case as-is, but wrap it in the test harness as well. Opportunistically drop gp_triggered, as the test already checks that a #GP was injected through a GUEST_SYNC. Also, use the first unmapped GPA instead of the maximum legal GPA, as some CPUs inject a #GP for the maximum legal GPA (likely in a reserved area).
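For context on why the two GPA flavors fail differently (this sketch is editorial, not part of the test): KVM's RAX legality check reduces to page alignment plus a MAXPHYADDR bound, so -1ULL is rejected with a #GP up front, while a legal-but-unbacked GPA passes the check and only fails later at kvm_vcpu_map(), surfacing as an emulation failure. A standalone sketch, with MAXPHYADDR hard-coded rather than read from CPUID, and 1ULL << 32 standing in for the test's unmappable_gpa() result:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed value; KVM reads the real one from CPUID leaf 0x80000008. */
#define MAXPHYADDR 48

/* Mirrors the shape of KVM's page_address_valid(). */
static bool gpa_is_legal(uint64_t gpa)
{
	return !(gpa & 0xfffULL) && !(gpa >> MAXPHYADDR);
}

int main(void)
{
	/* -1ULL: unaligned and above 2^MAXPHYADDR -> #GP from KVM. */
	printf("-1ULL       legal? %d\n", gpa_is_legal(-1ULL));

	/*
	 * First GPA past the VM's memslots: legal, so the check passes,
	 * but kvm_vcpu_map() fails -> emulation failure exit, not #GP.
	 */
	printf("unmappable  legal? %d\n", gpa_is_legal(1ULL << 32));
	return 0;
}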
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-9-yosry@kernel.org Signed-off-by: Sean Christopherson --- .../kvm/x86/svm_nested_invalid_vmcb12_gpa.c | 152 ++++++++++++++++----- 1 file changed, 115 insertions(+), 37 deletions(-) diff --git a/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c b/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c index c6d5f712120d..569869bed20b 100644 --- a/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c +++ b/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c @@ -6,6 +6,8 @@ #include "vmx.h" #include "svm_util.h" #include "kselftest.h" +#include "kvm_test_harness.h" +#include "test_util.h" #define L2_GUEST_STACK_SIZE 64 @@ -13,86 +15,162 @@ #define SYNC_GP 101 #define SYNC_L2_STARTED 102 -u64 valid_vmcb12_gpa; -int gp_triggered; +static unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; static void guest_gp_handler(struct ex_regs *regs) { - GUEST_ASSERT(!gp_triggered); GUEST_SYNC(SYNC_GP); - gp_triggered = 1; - regs->rax = valid_vmcb12_gpa; } -static void l2_guest_code(void) +static void l2_code(void) { GUEST_SYNC(SYNC_L2_STARTED); vmcall(); } -static void l1_guest_code(struct svm_test_data *svm, u64 invalid_vmcb12_gpa) +static void l1_vmrun(struct svm_test_data *svm, u64 gpa) { - unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); - generic_svm_setup(svm, l2_guest_code, - &l2_guest_stack[L2_GUEST_STACK_SIZE]); + asm volatile ("vmrun %[gpa]" : : [gpa] "a" (gpa) : "memory"); +} - valid_vmcb12_gpa = svm->vmcb_gpa; +static void l1_vmload(struct svm_test_data *svm, u64 gpa) +{ + generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); - run_guest(svm->vmcb, invalid_vmcb12_gpa); /* #GP */ + asm volatile ("vmload %[gpa]" : : [gpa] "a" (gpa) : "memory"); +} + +static void l1_vmsave(struct svm_test_data *svm, u64 gpa) +{ + generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + asm volatile ("vmsave %[gpa]" : : [gpa] "a" (gpa) : "memory"); +} + +static void l1_vmexit(struct svm_test_data *svm, u64 gpa) +{ + generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); - /* GP handler should jump here */ + run_guest(svm->vmcb, svm->vmcb_gpa); GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL); GUEST_DONE(); } -int main(int argc, char *argv[]) +static u64 unmappable_gpa(struct kvm_vcpu *vcpu) +{ + struct userspace_mem_region *region; + u64 region_gpa_end, vm_gpa_end = 0; + int i; + + hash_for_each(vcpu->vm->regions.slot_hash, i, region, slot_node) { + region_gpa_end = region->region.guest_phys_addr + region->region.memory_size; + vm_gpa_end = max(vm_gpa_end, region_gpa_end); + } + + return vm_gpa_end; +} + +static void test_invalid_vmcb12(struct kvm_vcpu *vcpu) { - struct kvm_x86_state *state; vm_vaddr_t nested_gva = 0; - struct kvm_vcpu *vcpu; - uint32_t maxphyaddr; - u64 max_legal_gpa; - struct kvm_vm *vm; struct ucall uc; - TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); - vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code); vm_install_exception_handler(vcpu->vm, GP_VECTOR, guest_gp_handler); - - /* - * Find the max legal GPA that is not backed by a memslot (i.e. cannot - * be mapped by KVM). 
- */ - maxphyaddr = kvm_cpuid_property(vcpu->cpuid, X86_PROPERTY_MAX_PHY_ADDR); - max_legal_gpa = BIT_ULL(maxphyaddr) - PAGE_SIZE; - vcpu_alloc_svm(vm, &nested_gva); - vcpu_args_set(vcpu, 2, nested_gva, max_legal_gpa); - - /* VMRUN with max_legal_gpa, KVM injects a #GP */ + vcpu_alloc_svm(vcpu->vm, &nested_gva); + vcpu_args_set(vcpu, 2, nested_gva, -1ULL); vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); TEST_ASSERT_EQ(uc.args[1], SYNC_GP); +} + +static void test_unmappable_vmcb12(struct kvm_vcpu *vcpu) +{ + vm_vaddr_t nested_gva = 0; + + vcpu_alloc_svm(vcpu->vm, &nested_gva); + vcpu_args_set(vcpu, 2, nested_gva, unmappable_gpa(vcpu)); + vcpu_run(vcpu); + + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_INTERNAL_ERROR); + TEST_ASSERT_EQ(vcpu->run->emulation_failure.suberror, KVM_INTERNAL_ERROR_EMULATION); +} + +static void test_unmappable_vmcb12_vmexit(struct kvm_vcpu *vcpu) +{ + struct kvm_x86_state *state; + vm_vaddr_t nested_gva = 0; + struct ucall uc; /* - * Enter L2 (with a legit vmcb12 GPA), then overwrite vmcb12 GPA with - * max_legal_gpa. KVM will fail to map vmcb12 on nested VM-Exit and + * Enter L2 (with a legit vmcb12 GPA), then overwrite vmcb12 GPA with an + * unmappable GPA. KVM will fail to map vmcb12 on nested VM-Exit and * cause a shutdown. */ + vcpu_alloc_svm(vcpu->vm, &nested_gva); + vcpu_args_set(vcpu, 2, nested_gva, unmappable_gpa(vcpu)); vcpu_run(vcpu); TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); TEST_ASSERT_EQ(uc.args[1], SYNC_L2_STARTED); state = vcpu_save_state(vcpu); - state->nested.hdr.svm.vmcb_pa = max_legal_gpa; + state->nested.hdr.svm.vmcb_pa = unmappable_gpa(vcpu); vcpu_load_state(vcpu, state); vcpu_run(vcpu); TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SHUTDOWN); kvm_x86_state_cleanup(state); - kvm_vm_free(vm); - return 0; +} + +KVM_ONE_VCPU_TEST_SUITE(vmcb12_gpa); + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmrun_invalid, l1_vmrun) +{ + test_invalid_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmload_invalid, l1_vmload) +{ + test_invalid_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmsave_invalid, l1_vmsave) +{ + test_invalid_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmrun_unmappable, l1_vmrun) +{ + test_unmappable_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmload_unmappable, l1_vmload) +{ + test_unmappable_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmsave_unmappable, l1_vmsave) +{ + test_unmappable_vmcb12(vcpu); +} + +/* + * Invalid vmcb12_gpa cannot be test for #VMEXIT as KVM_SET_NESTED_STATE will + * reject it. + */ +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmexit_unmappable, l1_vmexit) +{ + test_unmappable_vmcb12_vmexit(vcpu); +} + +int main(int argc, char *argv[]) +{ + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); + + return test_harness_run(argc, argv); } -- cgit v1.2.3 From 052ca584bd7c51de0de96e684631570459d46cda Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Mon, 16 Mar 2026 20:27:32 +0000 Subject: KVM: selftests: Drop 'invalid' from svm_nested_invalid_vmcb12_gpa's name The test checks both invalid GPAs as well as unmappable GPAs, so drop 'invalid' from its name. 
Signed-off-by: Yosry Ahmed Link: https://patch.msgid.link/20260316202732.3164936-10-yosry@kernel.org Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile.kvm | 2 +- .../kvm/x86/svm_nested_invalid_vmcb12_gpa.c | 176 --------------------- .../selftests/kvm/x86/svm_nested_vmcb12_gpa.c | 176 +++++++++++++++++++++ 3 files changed, 177 insertions(+), 177 deletions(-) delete mode 100644 tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c create mode 100644 tools/testing/selftests/kvm/x86/svm_nested_vmcb12_gpa.c diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index ba87cd31872b..83792d136ac3 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -111,9 +111,9 @@ TEST_GEN_PROGS_x86 += x86/vmx_preemption_timer_test TEST_GEN_PROGS_x86 += x86/svm_vmcall_test TEST_GEN_PROGS_x86 += x86/svm_int_ctl_test TEST_GEN_PROGS_x86 += x86/svm_nested_clear_efer_svme -TEST_GEN_PROGS_x86 += x86/svm_nested_invalid_vmcb12_gpa TEST_GEN_PROGS_x86 += x86/svm_nested_shutdown_test TEST_GEN_PROGS_x86 += x86/svm_nested_soft_inject_test +TEST_GEN_PROGS_x86 += x86/svm_nested_vmcb12_gpa TEST_GEN_PROGS_x86 += x86/svm_lbr_nested_state TEST_GEN_PROGS_x86 += x86/tsc_scaling_sync TEST_GEN_PROGS_x86 += x86/sync_regs_test diff --git a/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c b/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c deleted file mode 100644 index 569869bed20b..000000000000 --- a/tools/testing/selftests/kvm/x86/svm_nested_invalid_vmcb12_gpa.c +++ /dev/null @@ -1,176 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only -/* - * Copyright (C) 2026, Google LLC. - */ -#include "kvm_util.h" -#include "vmx.h" -#include "svm_util.h" -#include "kselftest.h" -#include "kvm_test_harness.h" -#include "test_util.h" - - -#define L2_GUEST_STACK_SIZE 64 - -#define SYNC_GP 101 -#define SYNC_L2_STARTED 102 - -static unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; - -static void guest_gp_handler(struct ex_regs *regs) -{ - GUEST_SYNC(SYNC_GP); -} - -static void l2_code(void) -{ - GUEST_SYNC(SYNC_L2_STARTED); - vmcall(); -} - -static void l1_vmrun(struct svm_test_data *svm, u64 gpa) -{ - generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); - - asm volatile ("vmrun %[gpa]" : : [gpa] "a" (gpa) : "memory"); -} - -static void l1_vmload(struct svm_test_data *svm, u64 gpa) -{ - generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); - - asm volatile ("vmload %[gpa]" : : [gpa] "a" (gpa) : "memory"); -} - -static void l1_vmsave(struct svm_test_data *svm, u64 gpa) -{ - generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); - - asm volatile ("vmsave %[gpa]" : : [gpa] "a" (gpa) : "memory"); -} - -static void l1_vmexit(struct svm_test_data *svm, u64 gpa) -{ - generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); - - run_guest(svm->vmcb, svm->vmcb_gpa); - GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL); - GUEST_DONE(); -} - -static u64 unmappable_gpa(struct kvm_vcpu *vcpu) -{ - struct userspace_mem_region *region; - u64 region_gpa_end, vm_gpa_end = 0; - int i; - - hash_for_each(vcpu->vm->regions.slot_hash, i, region, slot_node) { - region_gpa_end = region->region.guest_phys_addr + region->region.memory_size; - vm_gpa_end = max(vm_gpa_end, region_gpa_end); - } - - return vm_gpa_end; -} - -static void test_invalid_vmcb12(struct kvm_vcpu *vcpu) -{ - vm_vaddr_t nested_gva = 0; - struct ucall uc; - - - 
vm_install_exception_handler(vcpu->vm, GP_VECTOR, guest_gp_handler); - vcpu_alloc_svm(vcpu->vm, &nested_gva); - vcpu_args_set(vcpu, 2, nested_gva, -1ULL); - vcpu_run(vcpu); - - TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); - TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); - TEST_ASSERT_EQ(uc.args[1], SYNC_GP); -} - -static void test_unmappable_vmcb12(struct kvm_vcpu *vcpu) -{ - vm_vaddr_t nested_gva = 0; - - vcpu_alloc_svm(vcpu->vm, &nested_gva); - vcpu_args_set(vcpu, 2, nested_gva, unmappable_gpa(vcpu)); - vcpu_run(vcpu); - - TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_INTERNAL_ERROR); - TEST_ASSERT_EQ(vcpu->run->emulation_failure.suberror, KVM_INTERNAL_ERROR_EMULATION); -} - -static void test_unmappable_vmcb12_vmexit(struct kvm_vcpu *vcpu) -{ - struct kvm_x86_state *state; - vm_vaddr_t nested_gva = 0; - struct ucall uc; - - /* - * Enter L2 (with a legit vmcb12 GPA), then overwrite vmcb12 GPA with an - * unmappable GPA. KVM will fail to map vmcb12 on nested VM-Exit and - * cause a shutdown. - */ - vcpu_alloc_svm(vcpu->vm, &nested_gva); - vcpu_args_set(vcpu, 2, nested_gva, unmappable_gpa(vcpu)); - vcpu_run(vcpu); - TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); - TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); - TEST_ASSERT_EQ(uc.args[1], SYNC_L2_STARTED); - - state = vcpu_save_state(vcpu); - state->nested.hdr.svm.vmcb_pa = unmappable_gpa(vcpu); - vcpu_load_state(vcpu, state); - vcpu_run(vcpu); - TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SHUTDOWN); - - kvm_x86_state_cleanup(state); -} - -KVM_ONE_VCPU_TEST_SUITE(vmcb12_gpa); - -KVM_ONE_VCPU_TEST(vmcb12_gpa, vmrun_invalid, l1_vmrun) -{ - test_invalid_vmcb12(vcpu); -} - -KVM_ONE_VCPU_TEST(vmcb12_gpa, vmload_invalid, l1_vmload) -{ - test_invalid_vmcb12(vcpu); -} - -KVM_ONE_VCPU_TEST(vmcb12_gpa, vmsave_invalid, l1_vmsave) -{ - test_invalid_vmcb12(vcpu); -} - -KVM_ONE_VCPU_TEST(vmcb12_gpa, vmrun_unmappable, l1_vmrun) -{ - test_unmappable_vmcb12(vcpu); -} - -KVM_ONE_VCPU_TEST(vmcb12_gpa, vmload_unmappable, l1_vmload) -{ - test_unmappable_vmcb12(vcpu); -} - -KVM_ONE_VCPU_TEST(vmcb12_gpa, vmsave_unmappable, l1_vmsave) -{ - test_unmappable_vmcb12(vcpu); -} - -/* - * Invalid vmcb12_gpa cannot be test for #VMEXIT as KVM_SET_NESTED_STATE will - * reject it. - */ -KVM_ONE_VCPU_TEST(vmcb12_gpa, vmexit_unmappable, l1_vmexit) -{ - test_unmappable_vmcb12_vmexit(vcpu); -} - -int main(int argc, char *argv[]) -{ - TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); - - return test_harness_run(argc, argv); -} diff --git a/tools/testing/selftests/kvm/x86/svm_nested_vmcb12_gpa.c b/tools/testing/selftests/kvm/x86/svm_nested_vmcb12_gpa.c new file mode 100644 index 000000000000..569869bed20b --- /dev/null +++ b/tools/testing/selftests/kvm/x86/svm_nested_vmcb12_gpa.c @@ -0,0 +1,176 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2026, Google LLC. 
+ */ +#include "kvm_util.h" +#include "vmx.h" +#include "svm_util.h" +#include "kselftest.h" +#include "kvm_test_harness.h" +#include "test_util.h" + + +#define L2_GUEST_STACK_SIZE 64 + +#define SYNC_GP 101 +#define SYNC_L2_STARTED 102 + +static unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + +static void guest_gp_handler(struct ex_regs *regs) +{ + GUEST_SYNC(SYNC_GP); +} + +static void l2_code(void) +{ + GUEST_SYNC(SYNC_L2_STARTED); + vmcall(); +} + +static void l1_vmrun(struct svm_test_data *svm, u64 gpa) +{ + generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + asm volatile ("vmrun %[gpa]" : : [gpa] "a" (gpa) : "memory"); +} + +static void l1_vmload(struct svm_test_data *svm, u64 gpa) +{ + generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + asm volatile ("vmload %[gpa]" : : [gpa] "a" (gpa) : "memory"); +} + +static void l1_vmsave(struct svm_test_data *svm, u64 gpa) +{ + generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + asm volatile ("vmsave %[gpa]" : : [gpa] "a" (gpa) : "memory"); +} + +static void l1_vmexit(struct svm_test_data *svm, u64 gpa) +{ + generic_svm_setup(svm, l2_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + run_guest(svm->vmcb, svm->vmcb_gpa); + GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL); + GUEST_DONE(); +} + +static u64 unmappable_gpa(struct kvm_vcpu *vcpu) +{ + struct userspace_mem_region *region; + u64 region_gpa_end, vm_gpa_end = 0; + int i; + + hash_for_each(vcpu->vm->regions.slot_hash, i, region, slot_node) { + region_gpa_end = region->region.guest_phys_addr + region->region.memory_size; + vm_gpa_end = max(vm_gpa_end, region_gpa_end); + } + + return vm_gpa_end; +} + +static void test_invalid_vmcb12(struct kvm_vcpu *vcpu) +{ + vm_vaddr_t nested_gva = 0; + struct ucall uc; + + + vm_install_exception_handler(vcpu->vm, GP_VECTOR, guest_gp_handler); + vcpu_alloc_svm(vcpu->vm, &nested_gva); + vcpu_args_set(vcpu, 2, nested_gva, -1ULL); + vcpu_run(vcpu); + + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); + TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); + TEST_ASSERT_EQ(uc.args[1], SYNC_GP); +} + +static void test_unmappable_vmcb12(struct kvm_vcpu *vcpu) +{ + vm_vaddr_t nested_gva = 0; + + vcpu_alloc_svm(vcpu->vm, &nested_gva); + vcpu_args_set(vcpu, 2, nested_gva, unmappable_gpa(vcpu)); + vcpu_run(vcpu); + + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_INTERNAL_ERROR); + TEST_ASSERT_EQ(vcpu->run->emulation_failure.suberror, KVM_INTERNAL_ERROR_EMULATION); +} + +static void test_unmappable_vmcb12_vmexit(struct kvm_vcpu *vcpu) +{ + struct kvm_x86_state *state; + vm_vaddr_t nested_gva = 0; + struct ucall uc; + + /* + * Enter L2 (with a legit vmcb12 GPA), then overwrite vmcb12 GPA with an + * unmappable GPA. KVM will fail to map vmcb12 on nested VM-Exit and + * cause a shutdown. 
+ */ + vcpu_alloc_svm(vcpu->vm, &nested_gva); + vcpu_args_set(vcpu, 2, nested_gva, unmappable_gpa(vcpu)); + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); + TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC); + TEST_ASSERT_EQ(uc.args[1], SYNC_L2_STARTED); + + state = vcpu_save_state(vcpu); + state->nested.hdr.svm.vmcb_pa = unmappable_gpa(vcpu); + vcpu_load_state(vcpu, state); + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SHUTDOWN); + + kvm_x86_state_cleanup(state); +} + +KVM_ONE_VCPU_TEST_SUITE(vmcb12_gpa); + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmrun_invalid, l1_vmrun) +{ + test_invalid_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmload_invalid, l1_vmload) +{ + test_invalid_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmsave_invalid, l1_vmsave) +{ + test_invalid_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmrun_unmappable, l1_vmrun) +{ + test_unmappable_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmload_unmappable, l1_vmload) +{ + test_unmappable_vmcb12(vcpu); +} + +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmsave_unmappable, l1_vmsave) +{ + test_unmappable_vmcb12(vcpu); +} + +/* + * Invalid vmcb12_gpa cannot be test for #VMEXIT as KVM_SET_NESTED_STATE will + * reject it. + */ +KVM_ONE_VCPU_TEST(vmcb12_gpa, vmexit_unmappable, l1_vmexit) +{ + test_unmappable_vmcb12_vmexit(vcpu); +} + +int main(int argc, char *argv[]) +{ + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM)); + + return test_harness_run(argc, argv); +} -- cgit v1.2.3