path: root/arch
Age | Commit message | Author | Files | Lines
2026-03-05 | KVM: arm64: nv: Inject a SEA if failed to read the descriptor | Zenghui Yu (Huawei) | 1 | -1/+3
Failure to read the descriptor (because it is outside of a memslot) should result in a SEA being injected in the guest. Suggested-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/86ms1m9lp3.wl-maz@kernel.org Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev> Link: https://patch.msgid.link/20260225173515.20490-4-zenghui.yu@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-05 | KVM: arm64: nv: Report addrsz fault at level 0 with a bad VTTBR.BADDR | Zenghui Yu (Huawei) | 1 | -1/+2
As per R_BFHQH, " When an Address size fault is generated, the reported fault code indicates one of the following: If the fault was generated due to the TTBR_ELx used in the translation having nonzero address bits above the OA size, then a fault at level 0. " Fix the reported Address size fault level as being 0 if the base address is wrongly programmed by L1. Fixes: 61e30b9eef7f ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic") Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev> Link: https://patch.msgid.link/20260225173515.20490-3-zenghui.yu@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
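As a rough illustration of the rule quoted from R_BFHQH, the level selection can be sketched in C as below. The helper names and the standalone types are assumptions made for this sketch; it is not the code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's): returns true when the
 * translation table base address has any bit set at or above the
 * output address size, given in bits.
 */
static bool baddr_exceeds_oa_size(uint64_t baddr, unsigned int oa_bits)
{
	if (oa_bits >= 64)
		return false;	/* avoid an undefined 64-bit shift */

	return (baddr >> oa_bits) != 0;
}

/*
 * Level to report for an Address size fault: 0 when the base address
 * itself is bad, otherwise the level the walk had reached.
 */
static int addrsz_fault_level(uint64_t baddr, unsigned int oa_bits,
			      int walk_level)
{
	return baddr_exceeds_oa_size(baddr, oa_bits) ? 0 : walk_level;
}
```

With a 48-bit output size, a base address with bit 48 set would be reported at level 0 regardless of how far the walk would otherwise have gone.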
2026-03-05 | KVM: arm64: nv: Check S2 limits based on implemented PA size | Zenghui Yu (Huawei) | 1 | -9/+11
check_base_s2_limits() checks the validity of SL0 and inputsize against ia_size (inputsize again!) but the pseudocode from DDI0487 G.a AArch64.TranslationTableWalk() says that we should check against the implemented PA size. We would otherwise fail to walk S2 with a valid configuration. E.g., granule size = 4KB, inputsize = 40 bits, initial lookup level = 0 (no concatenation) on a system with 48 bits PA range supported is allowed by architecture. Fix it by obtaining PA size by kvm_get_pa_bits(). Note that kvm_get_pa_bits() returns the fixed limit now and should eventually reflect the per VM PARange (one day!). Given that the configured PARange should not be greater than kvm_ipa_limit, it at least fixes the problem described above. While at it, inject a level 0 translation fault to guest if check_base_s2_limits() fails, as per the pseudocode. Fixes: 61e30b9eef7f ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic") Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev> Link: https://patch.msgid.link/20260225173515.20490-2-zenghui.yu@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-05 | x86/reboot: Execute the kernel restart handler upon machine restart | Martin Schiller | 1 | -1/+4
SoC devices like the Intel / MaxLinear Lightning Mountain must be reset by the Reset Control Unit (RCU) instead of using "normal" x86 mechanisms like ACPI, BIOS, KBD, etc. Therefore, the RCU driver (reset-intel-gw) registers a restart handler which triggers the global reset signal. Unfortunately, this is of no use as long as the restart chain is not processed during reboot on x86 systems. That's why do_kernel_restart() must be called when a reboot is performed. This has long been common practice for other architectures. [ bp: Massage commit message. ] Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://patch.msgid.link/20260225-x86_do_kernel_restart-v2-1-81396cf3d44c@dev.tdt.de
2026-03-05 | KVM: arm64: pkvm: Fallback to level-3 mapping on host stage-2 fault | Marc Zyngier | 1 | -1/+1
If, for any odd reason, we cannot converge to a mapping size that is completely contained in a memblock region, we fail to install a S2 mapping and go back to the faulting instruction. Rinse, repeat. This happens when faulting in regions that are smaller than a page or that do not have PAGE_SIZE-aligned boundaries (as witnessed on an O6 board that refuses to boot in protected mode). In this situation, fall back to using a PAGE_SIZE mapping anyway -- it isn't like we can go any lower. Fixes: e728e705802fe ("KVM: arm64: Adjust range correctly during host stage-2 faults") Link: https://lore.kernel.org/r/86wlzr77cn.wl-maz@kernel.org Cc: stable@vger.kernel.org Cc: Quentin Perret <qperret@google.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://patch.msgid.link/20260305132751.2928138-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
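The fallback idea can be sketched in plain C as follows. This is a simplified model, not the pKVM code: struct range, pick_mapping() and the 1GB/2MB/4KB size list are assumptions chosen for a 4KB granule.

```c
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumed 4KB pages */

struct range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

/*
 * Hypothetical helper: pick the largest naturally aligned block around
 * addr that is fully contained in region, and fall back to a single
 * page when no candidate fits.
 */
static struct range pick_mapping(uint64_t addr, struct range region)
{
	/* Assumed block sizes for a 4KB granule: 1GB, 2MB, then 4KB. */
	static const unsigned int shifts[] = { 30, 21, PAGE_SHIFT };
	struct range r;
	unsigned int i;

	for (i = 0; i < sizeof(shifts) / sizeof(shifts[0]); i++) {
		uint64_t size = 1ULL << shifts[i];

		r.start = addr & ~(size - 1);
		r.end = r.start + size;
		if (r.start >= region.start && r.end <= region.end)
			return r;
	}

	/* Nothing fits: map one page anyway instead of faulting forever. */
	r.start = addr & ~((1ULL << PAGE_SHIFT) - 1);
	r.end = r.start + (1ULL << PAGE_SHIFT);
	return r;
}
```

For a region that is smaller than a page, none of the candidates is contained in it, so the sketch returns the enclosing page rather than failing.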
2026-03-05 | KVM: arm64: Eagerly init vgic dist/redist on vgic creation | Marc Zyngier | 1 | -16/+16
If vgic_allocate_private_irqs_locked() fails for any odd reason, we exit kvm_vgic_create() early, leaving dist->rd_regions uninitialised. kvm_vgic_dist_destroy() then comes along and walks into the weeds trying to free the RDs. Got to love this stuff. Solve it by moving all the static initialisation early, and make sure that if we fail halfway, we're in a reasonable shape to perform the rest of the teardown. While at it, reset the vgic model on failure, just in case... Reported-by: syzbot+f6a46b038fc243ac0175@syzkaller.appspotmail.com Tested-by: syzbot+f6a46b038fc243ac0175@syzkaller.appspotmail.com Fixes: b3aa9283c0c50 ("KVM: arm64: vgic: Hoist SGI/PPI alloc from vgic_init() to kvm_create_vgic()") Link: https://lore.kernel.org/r/69a2d58c.050a0220.3a55be.003b.GAE@google.com Link: https://patch.msgid.link/20260228164559.936268-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org
2026-03-05 | arm64: dts: mediatek: mt8195-cherry-dojo: Describe M.2 M-key NVMe slot | Chen-Yu Tsai | 1 | -0/+38
The Dojo device has an M.2 M-key slot for an included NVMe on some models. Add a proper device tree description based on the new M.2 M-key binding. Power for the slot is controlled by the embedded controller. As far as the main SoC is concerned, it is always on. Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Chen-Yu Tsai <wenst@chromium.org> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
2026-03-05 | arm64: dts: mediatek: mt8195-cherry: add WiFi PCIe and BT USB power supplies | Chen-Yu Tsai | 1 | -11/+36
The MT8195 Cherry design features an M.2 E-key slot wired up with PCIe and USB for a WiFi+BT adapter. Previously the power was just enabled all the time with a default pinctrl setting that set the GPIO pin high. With the PCIe slot description DT binding in place, the power supplies can at least be added and tied to the PCIe and USB hosts. Once the M.2 E-key binding is merged, this description can be further converted to an M.2 E-key. Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Chen-Yu Tsai <wenst@chromium.org> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
2026-03-05 | arm64: dts: exynos: add initial support for Samsung Galaxy J5 | Andras Sebok | 2 | -0/+529
Add initial devicetree support for Samsung Galaxy J5 (2017) using Exynos7870 SoC. Signed-off-by: Andras Sebok <sebokandris2009@gmail.com> Signed-off-by: Kaustabh Chakraborty <kauschluss@disroot.org> Link: https://patch.msgid.link/20260304-exynos7870-j5y17lte-v1-2-eb25902c84c8@disroot.org [krzk: Rephrase commit msg] Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
2026-03-05 | KVM: SVM: Triple fault L1 on unintercepted EFER.SVME clear by L2 | Yosry Ahmed | 1 | -0/+13
KVM tracks when EFER.SVME is set and cleared to initialize and tear down nested state. However, it doesn't differentiate if EFER.SVME is getting toggled in L1 or L2+. If L2 clears EFER.SVME, and L1 does not intercept the EFER write, KVM exits guest mode and tears down nested state while L2 is running, executing L1 without injecting a proper #VMEXIT. According to the APM: The effect of turning off EFER.SVME while a guest is running is undefined; therefore, the VMM should always prevent guests from writing EFER. Since the behavior is architecturally undefined, KVM gets to choose what to do. Inject a triple fault into L1 as a more graceful option than running L1 with corrupted state. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> base-commit: 95deaec3557dced322e2540bfa426e60e5373d46 Link: https://patch.msgid.link/20260209195142.2554532-2-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: x86: SVM: Remove vmcb_is_dirty() | Jim Mattson | 1 | -5/+0
After commit dd26d1b5d6ed ("KVM: nSVM: Cache all used fields from VMCB12"), vmcb_is_dirty() has no callers. Remove the function. Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260224005500.1471972-2-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Only copy SVM_MISC_ENABLE_NP from VMCB01's misc_ctl | Yosry Ahmed | 1 | -2/+10
The 'misc_ctl' field in VMCB02 is taken as-is from VMCB01. However, the only bit that needs to be copied is SVM_MISC_ENABLE_NP, as all other known bits in misc_ctl are related to SEV guests, and KVM doesn't support nested virtualization for SEV guests. Only copy SVM_MISC_ENABLE_NP to harden against future bugs if/when other bits are set for L1 but should not be set for L2. Opportunistically add a comment explaining why SVM_MISC_ENABLE_NP is taken from VMCB01 and not VMCB12. Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-26-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
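A minimal sketch of copying a single bit instead of the whole field is shown below. MISC_ENABLE_NP and vmcb02_misc_ctl() are stand-ins defined for this example; the bit position (bit 0) is an assumption for illustration, not taken from the patch.

```c
#include <stdint.h>

/* Assumed for illustration: the nested paging enable bit is bit 0. */
#define MISC_ENABLE_NP	(1ULL << 0)

/*
 * Hypothetical helper: derive the value for vmcb02 from vmcb01,
 * keeping only the nested paging enable bit and dropping the rest.
 */
static uint64_t vmcb02_misc_ctl(uint64_t vmcb01_misc_ctl)
{
	return vmcb01_misc_ctl & MISC_ENABLE_NP;
}
```

Any other bit that happens to be set for L1 is dropped rather than silently inherited by L2.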
2026-03-05 | KVM: nSVM: Sanitize INT/EVENTINJ fields when copying from vmcb12 | Yosry Ahmed | 2 | -4/+9
Make sure all fields used from vmcb12 in creating the vmcb02 are sanitized, such that no unhandled or reserved bits end up in the vmcb02. The following control fields are read from vmcb12 and have bits that are either reserved or not handled/advertised by KVM: tlb_ctl, int_ctl, int_state, int_vector, event_inj, misc_ctl, and misc_ctl2. The following fields do not require any extra sanitizing: - tlb_ctl: already being sanitized. - int_ctl: bits from vmcb12 are copied bit-by-bit as needed. - misc_ctl: only used in consistency checks (particularly NP_ENABLE). - misc_ctl2: bits from vmcb12 are copied bit-by-bit as needed. For the remaining fields (int_vector, int_state, and event_inj), make sure only defined bits are copied from L1's vmcb12 into KVM's cache by defining appropriate masks where needed. Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-25-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Sanitize TLB_CONTROL field when copying from vmcb12 | Yosry Ahmed | 2 | -1/+3
The APM defines possible values for TLB_CONTROL as 0, 1, 3, and 7 -- all of which are always allowed for KVM guests as KVM always supports X86_FEATURE_FLUSHBYASID. Only copy bits 0 to 2 from vmcb12's TLB_CONTROL, such that no unhandled or reserved bits end up in vmcb02. Note that TLB_CONTROL in vmcb12 is currently ignored by KVM, as it nukes the TLB on nested transitions anyway (see nested_svm_transition_tlb_flush()). However, such sanitization will be needed once the TODOs there are addressed, and it's minimal churn to add it now. Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-24-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
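Keeping only bits 0 to 2 can be sketched as below. TLB_CONTROL_MASK and sanitize_tlb_ctl() are names made up for this example, not identifiers from the patch.

```c
#include <stdint.h>

/* Bits 0 to 2, enough to hold the defined values 0, 1, 3 and 7. */
#define TLB_CONTROL_MASK	0x7u

/* Hypothetical helper: drop everything above bit 2. */
static uint8_t sanitize_tlb_ctl(uint8_t tlb_ctl)
{
	return tlb_ctl & TLB_CONTROL_MASK;
}
```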
2026-03-05 | KVM: nSVM: Use PAGE_MASK to drop lower bits of bitmap GPAs from vmcb12 | Yosry Ahmed | 1 | -4/+2
Use PAGE_MASK to drop the lower bits from IOPM_BASE_PA and MSRPM_BASE_PA while copying them instead of dropping the bits afterward with a hardcoded mask. No functional change intended. Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-23-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
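For readers unfamiliar with the macro, the masking amounts to the following. The definitions are userspace stand-ins assuming 4KB pages, and bitmap_base_pa() is a hypothetical helper rather than a function from the patch.

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel macros, assuming 4KB pages. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1ULL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))

/* Hypothetical helper: page-align a bitmap base address while copying it. */
static uint64_t bitmap_base_pa(uint64_t pa_from_vmcb12)
{
	return pa_from_vmcb12 & PAGE_MASK;
}
```

Using the named mask at the point of the copy documents the intent (page alignment) better than a hardcoded constant applied later.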
2026-03-05 | KVM: nSVM: Restrict mapping vmcb12 on nested VMRUN | Yosry Ahmed | 1 | -38/+51
All accesses to the vmcb12 in the guest memory on nested VMRUN are limited to nested_svm_vmrun() copying vmcb12 fields and writing them on failed consistency checks. However, vmcb12 remains mapped throughout nested_svm_vmrun(). Mapping and unmapping around usages is possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is used after being unmapped. Move reading the vmcb12, copying to cache, and consistency checks from nested_svm_vmrun() into a new helper, nested_svm_copy_vmcb12_to_cache() to limit the scope of the mapping. Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-22-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Cache all used fields from VMCB12 | Yosry Ahmed | 3 | -52/+93
Currently, most fields used from VMCB12 are cached in svm->nested.{ctl/save}. This is mainly to avoid TOC-TOU bugs. However, for the save area, only the fields used in the consistency checks (i.e. nested_vmcb_check_save()) were being cached. Other fields are read directly from guest memory in nested_vmcb02_prepare_save(). While probably benign, this still makes it possible for TOC-TOU bugs to happen. For example, RAX, RSP, and RIP are read twice, once to store in VMCB02, and once to store in vcpu->arch.regs. It is possible for the guest to modify the value between both reads, potentially causing nasty bugs. Harden against such bugs by caching everything in svm->nested.save. Cache all the needed fields, and keep all accesses to the VMCB12 strictly in nested_svm_vmrun() for caching and early error injection. Following changes will further limit the access to the VMCB12 in the nested VMRUN path. Introduce vmcb12_is_dirty() to use with the cached control fields instead of vmcb_is_dirty(), similar to vmcb12_is_intercept(). Opportunistically order the copies in __nested_copy_vmcb_save_to_cache() by the order in which the fields are defined in struct vmcb_save_area. Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-21-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
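The hardening pattern described here, reading guest-writable memory once into a private copy and using only that copy afterwards, can be sketched as follows. struct save_area and cache_save_area() are simplified stand-ins invented for this example, not the kernel's structures.

```c
#include <stdint.h>

/* Simplified stand-in for the handful of fields discussed above. */
struct save_area {
	uint64_t rax;
	uint64_t rsp;
	uint64_t rip;
};

/*
 * Hypothetical helper: read each guest-writable field exactly once.
 * Every later consumer uses the cached copy, so a concurrent guest
 * write cannot make two consumers observe different values.
 */
static void cache_save_area(struct save_area *cache,
			    const volatile struct save_area *guest)
{
	cache->rax = guest->rax;
	cache->rsp = guest->rsp;
	cache->rip = guest->rip;
}
```

After the snapshot, both the VMCB02 setup and the register file would be filled from the cache, so they stay consistent with each other even if the guest rewrites its memory in between.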
2026-03-05 | KVM: SVM: Rename vmcb->virt_ext to vmcb->misc_ctl2 | Yosry Ahmed | 4 | -22/+21
'virt' is confusing in the VMCB because it is relative and ambiguous. The 'virt_ext' field includes bits for LBR virtualization and VMSAVE/VMLOAD virtualization, so it's just another miscellaneous control field. Name it as such. While at it, move the definitions of the bits below those for 'misc_ctl' and rename them for consistency. Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-20-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: SVM: Rename vmcb->nested_ctl to vmcb->misc_ctl | Sean Christopherson | 5 | -17/+17
The 'nested_ctl' field is misnamed. Although the first bit is for nested paging, the other defined bits are for SEV/SEV-ES. Other bits in the same field according to the APM (but not defined by KVM) include "Guest Mode Execution Trap", "Enable INVLPGB/TLBSYNC", and other control bits unrelated to 'nested'. There is nothing common among these bits, so just name the field misc_ctl. Also rename the flags accordingly. Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-19-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Capture svm->nested.ctl as vmcb12_ctrl when preparing vmcb02 | Sean Christopherson | 1 | -20/+19
Grab svm->nested.ctl as vmcb12_ctrl when preparing the vmcb02 controls to make it more obvious that much of the data is coming from vmcb12 (or rather, a snapshot of vmcb12 at the time of L1's VMRUN). Opportunistically reorder the variable definitions to create a pretty reverse fir tree. No functional change intended. Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260218230958.2877682-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Move vmcb_ctrl_area_cached.bus_lock_rip to svm_nested_state | Sean Christopherson | 3 | -6/+6
Move "bus_lock_rip" from "vmcb_ctrl_area_cached" to "svm_nested_state" as "last_bus_lock_rip" to more accurately reflect what it tracks, and because it is NOT a cached vmcb12 control field. The misplaced field isn't all that apparent in the current code base, as KVM uses "svm->nested.ctl" broadly, but the bad placement becomes glaringly obvious if "svm->nested.ctl" is captured as a local "vmcb12_ctrl" variable. No functional change intended. Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260218230958.2877682-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Use vmcb12_is_intercept() in nested_sync_control_from_vmcb02() | Yosry Ahmed | 1 | -1/+1
Use vmcb12_is_intercept() instead of open-coding the intercept check. No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260218230958.2877682-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Use intuitive local variables in nested_vmcb02_recalc_intercepts() | Sean Christopherson | 1 | -18/+15
Now that nested_vmcb02_recalc_intercepts() is explicitly scoped to deal with *only* recalculating vmcb02 intercepts, rename its local variables to use more intuitive names. The current "c", "h", and "g" local variables, for the current VMCB, vmcb01, and (cached) vmcb12 respectively, are short and sweet, but don't do much to help unfamiliar readers understand what the code is doing. Use vmcb02/vmcb01/vmcb12_ctrl in lieu of c/h/g to make it clear the function is updating intercepts in vmcb02 based on the intercepts in vmcb01 and (cached) vmcb12. Opportunistically change the existing WARN_ON to a WARN_ON_ONCE so that a KVM bug doesn't unintentionally DoS the host. No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> [sean: use WARN_ON_ONCE, keep local vmcb12 cache as vmcb12_ctrl] Link: https://patch.msgid.link/20260218230958.2877682-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Directly (re)calc vmcb02 intercepts from nested_vmcb02_prepare_control() | Sean Christopherson | 1 | -1/+1
Now that nested_vmcb02_recalc_intercepts() provides guardrails against it being incorrectly called without vmcb02 active, invoke it directly from nested_vmcb02_prepare_control() instead of bouncing through svm_mark_intercepts_dirty(), which unnecessarily marks vmcb01 as dirty. Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260218230958.2877682-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: WARN and abort vmcb02 intercepts recalc if vmcb02 isn't active | Yosry Ahmed | 1 | -0/+3
WARN and bail early from nested_vmcb02_recalc_intercepts() if vmcb02 isn't the active/current VMCB, as recalculating intercepts for vmcb01 using logic intended for merging vmcb12 and vmcb01 intercepts can yield unexpected and unwanted results. In addition to hardening against general bugs, this will provide additional safeguards "if" nested_vmcb02_recalc_intercepts() is invoked directly from nested_vmcb02_prepare_control(). Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> [sean: split to separate patch, bail early on "failure"] Link: https://patch.msgid.link/20260218230958.2877682-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: SVM: Separate recalc_intercepts() into nested vs. non-nested parts | Sean Christopherson | 4 | -16/+25
Extract the non-nested aspects of recalc_intercepts() into a separate helper, svm_mark_intercepts_dirty(), to make it clear that the call isn't *just* recalculating (vmcb02's) intercepts, and to not bury non-nested code in nested.c. As suggested by Yosry, opportunistically prepend "nested_vmcb02_" to recalc_intercepts() so that it's obvious the function specifically deals with recomputing intercepts for L2. No functional change intended. Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260218230958.2877682-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: SVM: Recalc instructions intercepts when EFER.SVME is toggled | Kevin Cheng | 1 | -12/+23
The AMD APM states that VMRUN, VMLOAD, VMSAVE, CLGI, VMMCALL, and INVLPGA instructions should generate a #UD when EFER.SVME is cleared. Currently, when VMLOAD, VMSAVE, or CLGI are executed in L1 with EFER.SVME cleared, no #UD is generated in certain cases. This is because the intercepts for these instructions are cleared based on whether or not vls or vgif is enabled. The #UD fails to be generated when the intercepts are absent. Fix the missing #UD generation by ensuring that all relevant instructions have intercepts set when EFER.SVME is disabled. VMMCALL is special because KVM's ABI is that VMCALL/VMMCALL are always supported for L1 and never fault. Signed-off-by: Kevin Cheng <chengkev@google.com> [sean: isolate Intel CPU "compatibility" in EFER.SVME=1 path] Reviewed-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260304003010.1108257-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: SVM: Move STGI and CLGI intercept handling | Kevin Cheng | 1 | -8/+24
Move STGI/CLGI intercept handling to svm_recalc_instruction_intercepts() in preparation for making the function EFER.SVME-aware. This will allow configuring STGI/CLGI intercepts along with other intercepts for other SVM instructions when EFER.SVME is toggled (KVM needs to intercept SVM instructions when EFER.SVME=0 to inject #UD). When clearing the STGI intercept in particular, request KVM_REQ_EVENT if there is at least one pending GIF-controlled event. This avoids breaking NMI/SMI window tracking, as enable_{nmi,smi}_window() sets INTERCEPT_STGI to detect when NMIs become unblocked. KVM_REQ_EVENT forces kvm_check_and_inject_events() to re-evaluate pending events and re-enable the intercept if needed. Extract the pending GIF event check into a helper function svm_has_pending_gif_event() to deduplicate the logic between svm_recalc_instruction_intercepts() and svm_set_gif(). Signed-off-by: Kevin Cheng <chengkev@google.com> [sean: keep vgif handling out of the "Intel CPU model" path] Reviewed-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260304003010.1108257-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Always intercept VMMCALL when L2 is active | Sean Christopherson | 2 | -11/+0
Always intercept VMMCALL now that KVM properly synthesizes a #UD as appropriate, i.e. when L1 doesn't want to intercept VMMCALL, to avoid putting L2 into an infinite #UD loop if KVM_X86_QUIRK_FIX_HYPERCALL_INSN is enabled. By letting L2 execute VMMCALL natively and thus #UD, for all intents and purposes KVM morphs the VMMCALL intercept into a #UD intercept (KVM always intercepts #UD). When the hypercall quirk is enabled, KVM "emulates" VMMCALL in response to the #UD by trying to fixup the opcode to the "right" vendor, then restarts the guest, without skipping the VMMCALL. As a result, the guest sees an endless stream of #UDs since it's already executing the correct vendor hypercall instruction, i.e. the emulator doesn't anticipate that the #UD could be due to lack of interception, as opposed to a truly undefined opcode. Fixes: 0d945bd93511 ("KVM: SVM: Don't allow nested guest to VMMCALL into host") Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed <yosry@kernel.org> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://patch.msgid.link/20260304002223.1105129-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Raise #UD if unhandled VMMCALL isn't intercepted by L1 | Kevin Cheng | 4 | -12/+30
Explicitly synthesize a #UD for VMMCALL if L2 is active, L1 does NOT want to intercept VMMCALL, nested_svm_l2_tlb_flush_enabled() is true, and the hypercall is something other than one of the supported Hyper-V hypercalls. When all of the above conditions are met, KVM will intercept VMMCALL but never forward it to L1, i.e. will let L2 make hypercalls as if it were L1. The TLFS says a whole lot of nothing about this scenario, so go with the architectural behavior, which says that VMMCALL #UDs if it's not intercepted. Opportunistically do a 2-for-1 stub trade by stub-ifying the new API instead of the helpers it uses. The last remaining "single" stub will soon be dropped as well. Suggested-by: Sean Christopherson <seanjc@google.com> Fixes: 3f4a812edf5c ("KVM: nSVM: hyper-v: Enable L2 TLB flush") Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Kevin Cheng <chengkev@google.com> Link: https://patch.msgid.link/20260228033328.2285047-5-chengkev@google.com [sean: rewrite changelog and comment, tag for stable, remove defunct stubs] Reviewed-by: Yosry Ahmed <yosry@kernel.org> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://patch.msgid.link/20260304002223.1105129-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: SVM: Explicitly mark vmcb01 dirty after modifying VMCB intercepts | Sean Christopherson | 1 | -1/+3
When reacting to an intercept update, explicitly mark vmcb01's intercepts dirty, as KVM always initially operates on vmcb01, and nested_svm_vmexit() isn't guaranteed to mark VMCB_INTERCEPTS as dirty. I.e. if L2 is active, KVM will modify the intercepts for L1, but might not mark them as dirty before the next VMRUN of L1. Fixes: 116a0a23676e ("KVM: SVM: Add clean-bit for intercetps, tsc-offset and pause filter count") Cc: stable@vger.kernel.org Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260218230958.2877682-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Add missing consistency check for EVENTINJ | Yosry Ahmed | 1 | -0/+51
According to the APM Volume #2, 15.20 (24593—Rev. 3.42—March 2024): VMRUN exits with VMEXIT_INVALID error code if either: • Reserved values of TYPE have been specified, or • TYPE = 3 (exception) has been specified with a vector that does not correspond to an exception (this includes vector 2, which is an NMI, not an exception). Add the missing consistency checks to KVM. For the second point, inject VMEXIT_INVALID if the vector is anything but the vectors defined by the APM for exceptions. Reserved vectors are also considered invalid, which matches the HW behavior. Vector 9 (i.e. #CSO) is considered invalid because it is reserved on modern CPUs, and according to LLMs no CPUs exist supporting SVM and producing #CSOs. Defined exceptions could be different between virtual CPUs as new CPUs define new vectors. In a best effort to dynamically define the valid vectors, make all currently defined vectors as valid except those obviously tied to a CPU feature: SHSTK -> #CP and SEV-ES -> #VC. As new vectors are defined, they can similarly be tied to corresponding CPU features. Invalid vectors on specific (e.g. old) CPUs that are missed by KVM should be rejected by HW anyway. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-18-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
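The two quoted conditions can be sketched as a standalone check. This is a deliberately simplified model and not the KVM implementation: it assumes the EVENTINJ layout of vector in bits 7:0, TYPE in bits 10:8 and V in bit 31, assumes the check only matters when V is set, and does not enumerate every reserved or feature-dependent exception vector. All names are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout: vector in bits 7:0, TYPE in 10:8, V in bit 31. */
#define EVENTINJ_VECTOR_MASK	0xffu
#define EVENTINJ_TYPE_SHIFT	8
#define EVENTINJ_TYPE_MASK	0x7u
#define EVENTINJ_VALID		(1u << 31)

enum {
	EVENTINJ_TYPE_INTR	= 0,
	EVENTINJ_TYPE_NMI	= 2,
	EVENTINJ_TYPE_EXCEPTION	= 3,
	EVENTINJ_TYPE_SOFT	= 4,
};

/* Hypothetical check; returns false where VMEXIT_INVALID would be due. */
static bool eventinj_is_valid(uint32_t event_inj)
{
	uint32_t type = (event_inj >> EVENTINJ_TYPE_SHIFT) & EVENTINJ_TYPE_MASK;
	uint32_t vector = event_inj & EVENTINJ_VECTOR_MASK;

	/* Assumption: with V clear there is nothing to inject or check. */
	if (!(event_inj & EVENTINJ_VALID))
		return true;

	switch (type) {
	case EVENTINJ_TYPE_INTR:
	case EVENTINJ_TYPE_NMI:
	case EVENTINJ_TYPE_SOFT:
		return true;
	case EVENTINJ_TYPE_EXCEPTION:
		/*
		 * Simplified: reject NMI (2), #CSO (9) and anything past
		 * the 32 architectural vectors. Other reserved vectors
		 * are not enumerated in this sketch.
		 */
		return vector != 2 && vector != 9 && vector < 32;
	default:
		return false;	/* reserved TYPE encodings */
	}
}
```

A fuller version would replace the exception case with a per-vector table, with some entries gated on guest CPU features as the commit message describes.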
2026-03-05 | KVM: nSVM: Add missing consistency check for EFER, CR0, CR4, and CS | Yosry Ahmed | 2 | -0/+7
According to the APM Volume #2, 15.5, Canonicalization and Consistency Checks (24593—Rev. 3.42—March 2024), the following condition (among others) results in a #VMEXIT with VMEXIT_INVALID (aka SVM_EXIT_ERR): EFER.LME, CR0.PG, CR4.PAE, CS.L, and CS.D are all non-zero. In the list of consistency checks done when EFER.LME and CR0.PG are set, add a check that CS.L and CS.D are not both set, after the existing check that CR4.PAE is set. This is functionally a nop because the nested VMRUN results in SVM_EXIT_ERR in HW, which is forwarded to L1, but KVM makes all consistency checks before a VMRUN is actually attempted. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-17-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
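The ordering of the checks described above can be sketched as follows. The function name and the flat parameter list are assumptions for this example; only the two long-mode conditions mentioned in the message are modelled, not the full list of consistency checks.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFER_LME	(1ULL << 8)	/* long mode enable */
#define CR0_PG		(1ULL << 31)	/* paging */
#define CR4_PAE		(1ULL << 5)	/* physical address extension */

/*
 * Hypothetical check; returns false for the state combinations that
 * would result in VMEXIT_INVALID.
 */
static bool long_mode_state_is_consistent(uint64_t efer, uint64_t cr0,
					  uint64_t cr4, bool cs_l, bool cs_d)
{
	if ((efer & EFER_LME) && (cr0 & CR0_PG)) {
		/* Existing check: long mode paging requires CR4.PAE. */
		if (!(cr4 & CR4_PAE))
			return false;

		/* Added check: CS.L and CS.D must not both be set. */
		if (cs_l && cs_d)
			return false;
	}

	return true;
}
```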
2026-03-05 | KVM: nSVM: Add missing consistency check for nCR3 validity | Yosry Ahmed | 1 | -0/+4
From the APM Volume #2, 15.25.4 (24593—Rev. 3.42—March 2024): When VMRUN is executed with nested paging enabled (NP_ENABLE = 1), the following conditions are considered illegal state combinations, in addition to those mentioned in “Canonicalization and Consistency Checks”: • Any MBZ bit of nCR3 is set. • Any G_PAT.PA field has an unsupported type encoding or any reserved field in G_PAT has a nonzero value. Add the consistency check for nCR3 being a legal GPA with no MBZ bits set. Note, the G_PAT.PA check is being handled separately[*]. Link: https://lore.kernel.org/kvm/20260205214326.1029278-3-jmattson@google.com [*] Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-16-yosry@kernel.org [sean: capture everything in CC(), massage changelog formatting] Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Drop the non-architectural consistency check for NP_ENABLE | Yosry Ahmed | 1 | -4/+5
KVM currently fails a nested VMRUN and injects VMEXIT_INVALID (aka SVM_EXIT_ERR) if L1 sets NP_ENABLE and the host does not support NPTs. At first glance, it seems like the check should actually be for guest_cpu_cap_has(X86_FEATURE_NPT) instead, as it is possible for the host to support NPTs but the guest CPUID to not advertise it. However, the consistency check is not architectural to begin with. The APM does not mention VMEXIT_INVALID if NP_ENABLE is set on a processor that does not have X86_FEATURE_NPT. Hence, NP_ENABLE should be ignored if X86_FEATURE_NPT is not available for L1, so sanitize it when copying from the VMCB12 to KVM's cache. Apart from the consistency check, NP_ENABLE in VMCB12 is currently ignored because the bit is actually copied from VMCB01 to VMCB02, not from VMCB12. Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-15-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Drop nested_vmcb_check_{save/control}() wrappers | Yosry Ahmed | 1 | -26/+10
The wrappers provide little value and make it harder to see what KVM is checking in the normal flow. Drop them. Opportunistically fixup comments referring to the functions, adding '()' to make it clear it's a reference to a function. No functional change intended. Co-developed-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-14-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 | KVM: nSVM: Clear tracking of L1->L2 NMI and soft IRQ on nested #VMEXIT | Yosry Ahmed | 1 | -2/+4
KVM clears tracking of L1->L2 injected NMIs (i.e. nmi_l1_to_l2) and soft IRQs (i.e. soft_int_injected) on a synthesized #VMEXIT(INVALID) due to failed VMRUN. However, they are not explicitly cleared in other synthesized #VMEXITs. soft_int_injected is always cleared after the first VMRUN of L2 when completing interrupts, as any re-injection is then tracked by KVM (instead of purely in vmcb02). nmi_l1_to_l2 is not cleared after the first VMRUN if NMI injection failed, as KVM still needs to keep track that the NMI originated from L1 to avoid blocking NMIs for L1. It is only cleared when the NMI injection succeeds. KVM could synthesize a #VMEXIT to L1 before successfully injecting the NMI into L2 (e.g. due to a #NPF on L2's NMI handler in L1's NPTs). In this case, nmi_l1_to_l2 will remain true, and KVM may not correctly mask NMIs and intercept IRET when injecting an NMI into L1. Clear both nmi_l1_to_l2 and soft_int_injected in nested_svm_vmexit(), i.e. for all #VMEXITs except those that occur due to failed consistency checks, as those happen before nmi_l1_to_l2 or soft_int_injected are set. Fixes: 159fc6fa3b7d ("KVM: nSVM: Transparently handle L1 -> L2 NMI re-injection") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-13-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 KVM: nSVM: Clear EVENTINJ fields in vmcb12 on nested #VMEXIT (Yosry Ahmed, 1 file, -2/+4)
According to the APM, from the reference of the VMRUN instruction:

  Upon #VMEXIT, the processor performs the following actions in order to return to the host execution context:
  ...
  clear EVENTINJ field in VMCB

KVM already syncs EVENTINJ fields from vmcb02 to cached vmcb12 on every L2->L0 #VMEXIT. Since these fields are zeroed by the CPU on #VMEXIT, they will mostly be zeroed in vmcb12 on nested #VMEXIT by nested_svm_vmexit(). However, this is not the case when:

  1. Consistency checks fail, as nested_svm_vmexit() is not called.
  2. Entering guest mode fails before L2 runs (e.g. due to failed load of CR3).

(2) was broken by commit 2d8a42be0e2b ("KVM: nSVM: synchronize VMCB controls updated by the processor on every vmexit"), as prior to that nested_svm_vmexit() always zeroed EVENTINJ fields. Explicitly clear the fields in all nested #VMEXIT code paths.

Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Fixes: 2d8a42be0e2b ("KVM: nSVM: synchronize VMCB controls updated by the processor on every vmexit") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-12-yosry@kernel.org [sean: massage changelog formatting] Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 KVM: nSVM: Clear GIF on nested #VMEXIT(INVALID) (Yosry Ahmed, 1 file, -0/+1)
According to the APM, GIF is set to 0 on any #VMEXIT, including an #VMEXIT(INVALID) due to failed consistency checks. Clear GIF on consistency check failures. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-11-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 KVM: nSVM: Triple fault if restore host CR3 fails on nested #VMEXIT (Yosry Ahmed, 3 files, -19/+8)
If loading L1's CR3 fails on a nested #VMEXIT, nested_svm_vmexit() returns an error code that is ignored by most callers, and continues to run L1 with corrupted state. A sane recovery is not possible in this case, and HW behavior is to cause a shutdown. Inject a triple fault instead, and do not return early from nested_svm_vmexit(). Continue cleaning up the vCPU state (e.g. clear pending exceptions), to handle the failure as gracefully as possible.

From the APM:

  Upon #VMEXIT, the processor performs the following actions in order to return to the host execution context:
  ...
  if (illegal host state loaded, or exception while loading host state)
    shutdown
  else
    execute first host instruction following the VMRUN

Remove the return value of nested_svm_vmexit(), which is mostly unchecked anyway.

Fixes: d82aaef9c88a ("KVM: nSVM: use nested_svm_load_cr3() on guest->host switch") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-10-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 KVM: nSVM: Triple fault if mapping VMCB12 fails on nested #VMEXIT (Yosry Ahmed, 1 file, -6/+2)
KVM currently injects a #GP and hopes for the best if mapping VMCB12 fails on nested #VMEXIT, and only if the failure mode is -EINVAL. Mapping the VMCB12 could also fail if creating host mappings fails. After the #GP is injected, nested_svm_vmexit() bails early, without cleaning up (e.g. KVM_REQ_GET_NESTED_STATE_PAGES is set, is_guest_mode() is true, etc). Instead of optionally injecting a #GP, triple fault the guest if mapping VMCB12 fails since KVM cannot make a sane recovery. The APM states that a #VMEXIT will triple fault if host state is illegal or an exception occurs while loading host state, so the behavior is not entirely made up. Do not return early from nested_svm_vmexit(), continue cleaning up the vCPU state (e.g. switch back to vmcb01), to handle the failure as gracefully as possible. Fixes: cf74a78b229d ("KVM: SVM: Add VMEXIT handler and intercepts") CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-9-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
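The "request a triple fault, but keep cleaning up" control flow described above can be sketched as follows. This is a minimal toy model, not the kernel's implementation: `struct toy_vcpu`, its fields and `toy_nested_vmexit()` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum toy_vmcb { TOY_VMCB01, TOY_VMCB02 };

/* Hypothetical vCPU state, reduced to what the changelog mentions. */
struct toy_vcpu {
	bool guest_mode;            /* models is_guest_mode() */
	bool get_nested_pages_req;  /* models a pending nested-pages request */
	bool triple_fault_req;      /* models a pending triple fault request */
	bool vmcb12_written;        /* exit state was stored for L1 to read */
	enum toy_vmcb active;       /* which VMCB the vCPU currently uses */
};

/* map_err is the (toy) result of mapping vmcb12: 0 or a negative errno. */
static void toy_nested_vmexit(struct toy_vcpu *vcpu, int map_err)
{
	if (map_err)
		vcpu->triple_fault_req = true;  /* note: no early return */
	else
		vcpu->vmcb12_written = true;

	/* Cleanup runs on both paths so the vCPU is left in L1 context. */
	vcpu->guest_mode = false;
	vcpu->get_nested_pages_req = false;
	vcpu->active = TOY_VMCB01;
}

static void toy_demo(void)
{
	struct toy_vcpu vcpu = {
		.guest_mode = true,
		.get_nested_pages_req = true,
		.active = TOY_VMCB02,
	};

	toy_nested_vmexit(&vcpu, -EFAULT);
	assert(vcpu.triple_fault_req && !vcpu.vmcb12_written);
	assert(!vcpu.guest_mode && !vcpu.get_nested_pages_req);
	assert(vcpu.active == TOY_VMCB01);
}
```

The design point is that the failure only changes what is reported to the guest; the state transition back to L1 is unconditional.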
2026-03-05 KVM: nSVM: Refactor writing vmcb12 on nested #VMEXIT as a helper (Yosry Ahmed, 1 file, -33/+44)
Move mapping vmcb12 and updating it out of nested_svm_vmexit() into a helper. No functional change intended. CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-8-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 KVM: nSVM: Refactor checking LBRV enablement in vmcb12 into a helper (Yosry Ahmed, 1 file, -4/+8)
Refactor the vCPU cap and vmcb12 flag checks into a helper. The unlikely() annotation is dropped; it's unlikely (huh) to make a difference and the CPU will probably predict it better on its own. CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-7-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 KVM: nSVM: Always inject a #GP if mapping VMCB12 fails on nested VMRUN (Yosry Ahmed, 1 file, -4/+1)
nested_svm_vmrun() currently only injects a #GP if kvm_vcpu_map() fails with -EINVAL. But it could also fail with -EFAULT if creating a host mapping failed. Inject a #GP in all cases; there is no reason to treat failure modes differently. Fixes: 8c5fbf1a7231 ("KVM/nSVM: Use the new mapping API for mapping guest memory") CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-6-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
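The change in the #GP decision can be expressed as two small predicates. This is a sketch only: `toy_old_injects_gp()` and `toy_new_injects_gp()` are hypothetical names, and the only behavior modeled is the one the changelog states (old: #GP for -EINVAL only; new: #GP for any mapping failure).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Before: only an -EINVAL mapping failure led to a #GP. */
static bool toy_old_injects_gp(int map_ret)
{
	return map_ret == -EINVAL;
}

/* After: any mapping failure leads to a #GP. */
static bool toy_new_injects_gp(int map_ret)
{
	return map_ret != 0;
}

static void toy_demo(void)
{
	/* The two predicates differ only for failures other than -EINVAL. */
	assert(toy_old_injects_gp(-EINVAL) && toy_new_injects_gp(-EINVAL));
	assert(!toy_old_injects_gp(-EFAULT) && toy_new_injects_gp(-EFAULT));
	assert(!toy_old_injects_gp(0) && !toy_new_injects_gp(0));
}
```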
2026-03-05 KVM: SVM: Add missing save/restore handling of LBR MSRs (Yosry Ahmed, 3 files, -5/+45)
MSR_IA32_DEBUGCTLMSR and LBR MSRs are currently not enumerated by KVM_GET_MSR_INDEX_LIST, and LBR MSRs cannot be set with KVM_SET_MSRS. So save/restore is completely broken. Fix it by adding the MSRs to msrs_to_save_base, and allowing writes to LBR MSRs from userspace only (as they are read-only MSRs) if LBR virtualization is enabled. Additionally, to correctly restore L1's LBRs while L2 is running, make sure the LBRs are copied from the captured VMCB01 save area in svm_copy_vmrun_state(). Note, for VMX, this also fixes a flaw where MSR_IA32_DEBUGCTLMSR isn't reported as an MSR to save/restore. Note #2, over-reporting MSR_IA32_LASTxxx on Intel is ok, as KVM already handles unsupported reads and writes thanks to commit b5e2fec0ebc3 ("KVM: Ignore DEBUGCTL MSRs with no effect") (kvm_do_msr_access() will morph the unsupported userspace write into a nop). Fixes: 24e09cbf480a ("KVM: SVM: enable LBR virtualization") Cc: stable@vger.kernel.org Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-4-yosry@kernel.org [sean: guard with lbrv checks, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 KVM: SVM: Switch svm_copy_lbrs() to a macro (Yosry Ahmed, 3 files, -14/+13)
In preparation for using svm_copy_lbrs() with 'struct vmcb_save_area' without a containing 'struct vmcb', and later even 'struct vmcb_save_area_cached', make it a macro. Macros are generally not preferred compared to functions, mainly due to type-safety. However, in this case it seems like having a simple macro copying a few fields is better than copy-pasting the same 5 lines of code in different places. Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-3-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
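The trade-off described above (a macro instead of a function so that differently typed structures can share one copy routine) can be illustrated with a small standalone sketch. The struct and macro names here are hypothetical, and the five field names are assumptions made for the example rather than a statement about the kernel's structure layout.

```c
#include <assert.h>
#include <stdint.h>

/* Two distinct toy types that happen to share the same LBR field names. */
struct toy_save_area {
	uint64_t rip;
	uint64_t dbgctl, br_from, br_to, last_excp_from, last_excp_to;
};

struct toy_save_area_cached {
	uint64_t dbgctl, br_from, br_to, last_excp_from, last_excp_to;
};

/* A function would need one fixed parameter type; the macro only needs
 * both operands to be pointers to something with these members. */
#define TOY_COPY_LBRS(to, from)					\
	do {							\
		(to)->dbgctl = (from)->dbgctl;			\
		(to)->br_from = (from)->br_from;		\
		(to)->br_to = (from)->br_to;			\
		(to)->last_excp_from = (from)->last_excp_from;	\
		(to)->last_excp_to = (from)->last_excp_to;	\
	} while (0)

static void toy_demo(void)
{
	struct toy_save_area src = {
		.rip = 0x1000, .dbgctl = 1, .br_from = 2, .br_to = 3,
		.last_excp_from = 4, .last_excp_to = 5,
	};
	struct toy_save_area_cached cached = { 0 };
	struct toy_save_area dst = { 0 };

	TOY_COPY_LBRS(&cached, &src);   /* full area -> cached copy */
	TOY_COPY_LBRS(&dst, &cached);   /* cached copy -> full area */

	assert(dst.br_from == 2 && dst.last_excp_to == 5);
	assert(dst.rip == 0);           /* non-LBR fields are untouched */
}
```

The cost is the one the changelog names: the compiler checks member names at each expansion site but not a single declared parameter type.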
2026-03-05 KVM: nSVM: Avoid clearing VMCB_LBR in vmcb12 (Yosry Ahmed, 2 files, -4/+5)
svm_copy_lbrs() always marks VMCB_LBR dirty in the destination VMCB. However, nested_svm_vmexit() uses it to copy LBRs to vmcb12, and clearing clean bits in vmcb12 is not architecturally defined. Move vmcb_mark_dirty() to callers and drop it for vmcb12. This also facilitates incoming refactoring that does not pass the entire VMCB to svm_copy_lbrs(). Fixes: d20c796ca370 ("KVM: x86: nSVM: implement nested LBR virtualization") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-2-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
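Moving the dirty marking out of the copy helper and into its callers can be sketched as below. This is a toy model under stated assumptions: all `toy_` names are hypothetical, `TOY_CLEAN_LBR` is an illustrative bit position, and "marking dirty" is modeled as clearing a bit in a clean-bits word.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CLEAN_LBR 10u  /* illustrative bit position, not authoritative */

struct toy_vmcb {
	uint32_t clean;            /* set bit = cached state is still valid */
	uint64_t br_from, br_to;
};

static void toy_mark_dirty(struct toy_vmcb *vmcb, unsigned int bit)
{
	vmcb->clean &= ~(1u << bit);
}

/* The copy helper no longer touches the clean bits. */
static void toy_copy_lbrs(struct toy_vmcb *to, const struct toy_vmcb *from)
{
	to->br_from = from->br_from;
	to->br_to = from->br_to;
}

/* Caller that targets a VMCB the CPU will run: mark the LBR state dirty. */
static void toy_load_lbrs_for_run(struct toy_vmcb *hw, const struct toy_vmcb *from)
{
	toy_copy_lbrs(hw, from);
	toy_mark_dirty(hw, TOY_CLEAN_LBR);
}

/* Caller that targets L1's vmcb12: copy only, leave clean bits alone. */
static void toy_save_lbrs_to_vmcb12(struct toy_vmcb *vmcb12, const struct toy_vmcb *from)
{
	toy_copy_lbrs(vmcb12, from);
}

static void toy_demo(void)
{
	struct toy_vmcb src = { .clean = 0, .br_from = 0x10, .br_to = 0x20 };
	struct toy_vmcb hw = { .clean = 0xffffffffu };
	struct toy_vmcb vmcb12 = { .clean = 0xffffffffu };

	toy_load_lbrs_for_run(&hw, &src);
	toy_save_lbrs_to_vmcb12(&vmcb12, &src);

	assert((hw.clean & (1u << TOY_CLEAN_LBR)) == 0);
	assert(vmcb12.clean == 0xffffffffu);
	assert(vmcb12.br_from == 0x10 && hw.br_to == 0x20);
}
```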
2026-03-05 KVM: SVM: Inject #UD for INVLPGA if EFER.SVME=0 (Kevin Cheng, 1 file, -0/+3)
INVLPGA should cause a #UD when EFER.SVME is not set. Add a check to properly inject #UD when EFER.SVME=0. Fixes: ff092385e828 ("KVM: SVM: Implement INVLPGA") Cc: stable@vger.kernel.org Signed-off-by: Kevin Cheng <chengkev@google.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260228033328.2285047-3-chengkev@google.com [sean: tag for stable@] Signed-off-by: Sean Christopherson <seanjc@google.com>
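The added check amounts to testing one EFER bit before emulating the instruction. A minimal sketch, assuming a hypothetical `toy_handle_invlpga()` helper and result enum (EFER.SVME is bit 12 of the EFER MSR; everything else here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_EFER_SVME (1ULL << 12)  /* EFER.SVME: SVM enable */

enum toy_result {
	TOY_EMULATE_INVLPGA,  /* proceed with normal emulation */
	TOY_INJECT_UD,        /* raise #UD in the guest */
};

/* Decide what to do with an intercepted INVLPGA given the guest's EFER. */
static enum toy_result toy_handle_invlpga(uint64_t guest_efer)
{
	if (!(guest_efer & TOY_EFER_SVME))
		return TOY_INJECT_UD;

	return TOY_EMULATE_INVLPGA;
}

static void toy_demo(void)
{
	assert(toy_handle_invlpga(0) == TOY_INJECT_UD);
	assert(toy_handle_invlpga(TOY_EFER_SVME) == TOY_EMULATE_INVLPGA);
}
```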
2026-03-05 KVM: nSVM: Delay setting soft IRQ RIP tracking fields until vCPU run (Sean Christopherson, 2 files, -9/+37)
In the save+restore path, when restoring nested state, the values of RIP and CS base passed into nested_vmcb02_prepare_control() are mostly incorrect. They are both pulled from the vmcb02. For CS base, the value is only correct if system regs are restored before nested state. The value of RIP is whatever the vCPU had in vmcb02 before restoring nested state (zero on a freshly created vCPU). Instead, take a similar approach to NextRIP, and delay initializing the RIP tracking fields until shortly before the vCPU is run, to make sure the most up-to-date values of RIP and CS base are used regardless of KVM_SET_SREGS, KVM_SET_REGS, and KVM_SET_NESTED_STATE's relative ordering. Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260225005950.3739782-8-yosry@kernel.org [sean: deal with the svm_cancel_injection() madness] Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-05 KVM: nSVM: Delay stuffing L2's current RIP into NextRIP until vCPU run (Yosry Ahmed, 2 files, -17/+33)
For guests with NRIPS disabled, L1 does not provide NextRIP when running an L2 with an injected soft interrupt, instead it advances L2's RIP before running it. KVM uses L2's current RIP as the NextRIP in vmcb02 to emulate a CPU without NRIPS. However, in svm_set_nested_state(), the value used for L2's current RIP comes from vmcb02, which is just whatever the vCPU had in vmcb02 before restoring nested state (zero on a freshly created vCPU). Passing the cached RIP value instead (i.e. kvm_rip_read()) would only fix the issue if registers are restored before nested state. Instead, split the logic of setting NextRIP in vmcb02. Handle the 'normal' case of initializing vmcb02's NextRIP using NextRIP from vmcb12 (or KVM_GET_NESTED_STATE's payload) in nested_vmcb02_prepare_control(). Delay the special case of stuffing L2's current RIP into vmcb02's NextRIP until shortly before the vCPU is run, to make sure the most up-to-date value of RIP is used regardless of KVM_SET_REGS and KVM_SET_NESTED_STATE's relative ordering. Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260225005950.3739782-7-yosry@kernel.org [sean: use new helper, svm_fixup_nested_rips()] Signed-off-by: Sean Christopherson <seanjc@google.com>
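Why deferring the fixup makes the result independent of restore ordering can be shown with a toy model. All names are hypothetical, and the condition for needing the fixup is simplified to "L1 has no NRIPS" (the changelog additionally ties it to an injected soft interrupt).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_vcpu {
	uint64_t rip;               /* L2's current RIP */
	uint64_t vmcb02_next_rip;   /* NextRIP that hardware will see */
	bool next_rip_fixup;        /* stuff RIP into NextRIP before running */
};

/* Models restoring nested state: use the provided NextRIP when L1 has
 * NRIPS, otherwise only record that a fixup is needed later. */
static void toy_set_nested_state(struct toy_vcpu *v, bool l1_has_nrips,
				 uint64_t vmcb12_next_rip)
{
	if (l1_has_nrips) {
		v->vmcb02_next_rip = vmcb12_next_rip;
		v->next_rip_fixup = false;
	} else {
		v->next_rip_fixup = true;
	}
}

/* Models restoring general-purpose registers. */
static void toy_set_regs(struct toy_vcpu *v, uint64_t rip)
{
	v->rip = rip;
}

/* Runs shortly before entering the guest, after all restores are done. */
static void toy_pre_run_fixup(struct toy_vcpu *v)
{
	if (v->next_rip_fixup) {
		v->vmcb02_next_rip = v->rip;
		v->next_rip_fixup = false;
	}
}

static void toy_demo(void)
{
	struct toy_vcpu a = { 0 }, b = { 0 };

	/* Order 1: registers first, then nested state. */
	toy_set_regs(&a, 0x1000);
	toy_set_nested_state(&a, false, 0);
	toy_pre_run_fixup(&a);

	/* Order 2: nested state first, then registers. */
	toy_set_nested_state(&b, false, 0);
	toy_set_regs(&b, 0x1000);
	toy_pre_run_fixup(&b);

	assert(a.vmcb02_next_rip == 0x1000);
	assert(b.vmcb02_next_rip == 0x1000);
}
```

Had `toy_set_nested_state()` copied `v->rip` immediately, order 2 would have latched the stale value 0 from the freshly created vCPU.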