| Age | Commit message (Collapse) | Author | Files | Lines |
|
Failure to read the descriptor (because it is outside of a memslot) should
result in a SEA being injected in the guest.
Suggested-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/86ms1m9lp3.wl-maz@kernel.org
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260225173515.20490-4-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
As per R_BFHQH,
" When an Address size fault is generated, the reported fault code
indicates one of the following:
If the fault was generated due to the TTBR_ELx used in the translation
having nonzero address bits above the OA size, then a fault at level 0. "
Fix the reported Address size fault level as being 0 if the base address is
wrongly programmed by L1.
Fixes: 61e30b9eef7f ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic")
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260225173515.20490-3-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
check_base_s2_limits() checks the validity of SL0 and inputsize against
ia_size (inputsize again!) but the pseudocode from DDI0487 G.a
AArch64.TranslationTableWalk() says that we should check against the
implemented PA size.
We would otherwise fail to walk S2 with a valid configuration. E.g.,
granule size = 4KB, inputsize = 40 bits, initial lookup level = 0 (no
concatenation) on a system with 48 bits PA range supported is allowed by
architecture.
Fix it by obtaining PA size by kvm_get_pa_bits(). Note that
kvm_get_pa_bits() returns the fixed limit now and should eventually reflect
the per VM PARange (one day!). Given that the configured PARange should not
be greater that kvm_ipa_limit, it at least fixes the problem described
above.
While at it, inject a level 0 translation fault to guest if
check_base_s2_limits() fails, as per the pseudocode.
Fixes: 61e30b9eef7f ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic")
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260225173515.20490-2-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
SoC devices like the Intel / MaxLinear Lightning Mountain must be reset by the
Reset Control Unit (RCU) instead of using "normal" x86 mechanisms like ACPI,
BIOS, KBD, etc.
Therefore, the RCU driver (reset-intel-gw) registers a restart handler which
triggers the global reset signal.
Unfortunately, this is of no use as long as the restart chain is not processed
during reboot on x86 systems.
That's why do_kernel_restart() must be called when a reboot is performed. This
has long been common practice for other architectures.
[ bp: Massage commit message. ]
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20260225-x86_do_kernel_restart-v2-1-81396cf3d44c@dev.tdt.de
|
|
If, for any odd reason, we cannot converge to mapping size that is
completely contained in a memblock region, we fail to install a S2
mapping and go back to the faulting instruction. Rince, repeat.
This happens when faulting in regions that are smaller than a page
or that do not have PAGE_SIZE-aligned boundaries (as witnessed on
an O6 board that refuses to boot in protected mode).
In this situation, fallback to using a PAGE_SIZE mapping anyway --
it isn't like we can go any lower.
Fixes: e728e705802fe ("KVM: arm64: Adjust range correctly during host stage-2 faults")
Link: https://lore.kernel.org/r/86wlzr77cn.wl-maz@kernel.org
Cc: stable@vger.kernel.org
Cc: Quentin Perret <qperret@google.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://patch.msgid.link/20260305132751.2928138-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
If vgic_allocate_private_irqs_locked() fails for any odd reason,
we exit kvm_vgic_create() early, leaving dist->rd_regions uninitialised.
kvm_vgic_dist_destroy() then comes along and walks into the weeds
trying to free the RDs. Got to love this stuff.
Solve it by moving all the static initialisation early, and make
sure that if we fail halfway, we're in a reasonable shape to
perform the rest of the teardown. While at it, reset the vgic model
on failure, just in case...
Reported-by: syzbot+f6a46b038fc243ac0175@syzkaller.appspotmail.com
Tested-by: syzbot+f6a46b038fc243ac0175@syzkaller.appspotmail.com
Fixes: b3aa9283c0c50 ("KVM: arm64: vgic: Hoist SGI/PPI alloc from vgic_init() to kvm_create_vgic()")
Link: https://lore.kernel.org/r/69a2d58c.050a0220.3a55be.003b.GAE@google.com
Link: https://patch.msgid.link/20260228164559.936268-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
|
|
The Dojo device has a M.2 M-key slot for an included NVMe on some
models.
Add a proper device tree description based on the new M.2 M-key binding.
Power for the slot is controlled by the embedded controller. As far as
the main SoC is concerned, it is always on.
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
|
|
The MT8195 Cherry design features an M.2 E-key slot wired up with PCIe
and USB for a WiFi+BT adapter. Previously the power was just enabled
all the time with a default pinctrl setting that set the GPIO pin high.
With the PCIe slot description DT binding in place, the power supplies
can at least be added and tied to the PCIe and USB hosts. Once the
M.2 E-key binding is merged, this description can be further converted
to an M.2 E-key.
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
|
|
Add initial devicetree support for Samsung Galaxy J5 (2017) using
Exynos7870 SoC.
Signed-off-by: Andras Sebok <sebokandris2009@gmail.com>
Signed-off-by: Kaustabh Chakraborty <kauschluss@disroot.org>
Link: https://patch.msgid.link/20260304-exynos7870-j5y17lte-v1-2-eb25902c84c8@disroot.org
[krzk: Rephrase commit msg]
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
|
|
KVM tracks when EFER.SVME is set and cleared to initialize and tear down
nested state. However, it doesn't differentiate if EFER.SVME is getting
toggled in L1 or L2+. If L2 clears EFER.SVME, and L1 does not intercept
the EFER write, KVM exits guest mode and tears down nested state while
L2 is running, executing L1 without injecting a proper #VMEXIT.
According to the APM:
The effect of turning off EFER.SVME while a guest is running is
undefined; therefore, the VMM should always prevent guests from
writing EFER.
Since the behavior is architecturally undefined, KVM gets to choose what
to do. Inject a triple fault into L1 as a more graceful option that
running L1 with corrupted state.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
base-commit: 95deaec3557dced322e2540bfa426e60e5373d46
Link: https://patch.msgid.link/20260209195142.2554532-2-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
After commit dd26d1b5d6ed ("KVM: nSVM: Cache all used fields from VMCB12"),
vmcb_is_dirty() has no callers. Remove the function.
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260224005500.1471972-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The 'misc_ctl' field in VMCB02 is taken as-is from VMCB01. However, the
only bit that needs to copied is SVM_MISC_ENABLE_NP, as all other known
bits in misc_ctl are related to SEV guests, and KVM doesn't support
nested virtualization for SEV guests.
Only copy SVM_MISC_ENABLE_NP to harden against future bugs if/when other
bits are set for L1 but should not be set for L2.
Opportunistically add a comment explaining why SVM_MISC_ENABLE_NP is
taken from VMCB01 and not VMCB02.
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-26-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Make sure all fields used from vmcb12 in creating the vmcb02 are
sanitized, such that no unhandled or reserved bits end up in the vmcb02.
The following control fields are read from vmcb12 and have bits that are
either reserved or not handled/advertised by KVM: tlb_ctl, int_ctl,
int_state, int_vector, event_inj, misc_ctl, and misc_ctl2.
The following fields do not require any extra sanitizing:
- tlb_ctl: already being sanitized.
- int_ctl: bits from vmcb12 are copied bit-by-bit as needed.
- misc_ctl: only used in consistency checks (particularly NP_ENABLE).
- misc_ctl2: bits from vmcb12 are copied bit-by-bit as needed.
For the remaining fields (int_vector, int_state, and event_inj), make
sure only defined bits are copied from L1's vmcb12 into KVM'cache by
defining appropriate masks where needed.
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-25-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The APM defines possible values for TLB_CONTROL as 0, 1, 3, and 7 -- all
of which are always allowed for KVM guests as KVM always supports
X86_FEATURE_FLUSHBYASID. Only copy bits 0 to 2 from vmcb12's
TLB_CONTROL, such that no unhandled or reserved bits end up in vmcb02.
Note that TLB_CONTROL in vmcb12 is currently ignored by KVM, as it nukes
the TLB on nested transitions anyway (see
nested_svm_transition_tlb_flush()). However, such sanitization will be
needed once the TODOs there are addressed, and it's minimal churn to add
it now.
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-24-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use PAGE_MASK to drop the lower bits from IOPM_BASE_PA and MSRPM_BASE_PA
while copying them instead of dropping the bits afterward with a
hardcoded mask.
No functional change intended.
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-23-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
All accesses to the vmcb12 in the guest memory on nested VMRUN are
limited to nested_svm_vmrun() copying vmcb12 fields and writing them on
failed consistency checks. However, vmcb12 remains mapped throughout
nested_svm_vmrun(). Mapping and unmapping around usages is possible,
but it becomes easy-ish to introduce bugs where 'vmcb12' is used after
being unmapped.
Move reading the vmcb12, copying to cache, and consistency checks from
nested_svm_vmrun() into a new helper, nested_svm_copy_vmcb12_to_cache()
to limit the scope of the mapping.
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-22-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Currently, most fields used from VMCB12 are cached in
svm->nested.{ctl/save}. This is mainly to avoid TOC-TOU bugs. However,
for the save area, only the fields used in the consistency checks (i.e.
nested_vmcb_check_save()) were being cached. Other fields are read
directly from guest memory in nested_vmcb02_prepare_save().
While probably benign, this still makes it possible for TOC-TOU bugs to
happen. For example, RAX, RSP, and RIP are read twice, once to store in
VMCB02, and once to store in vcpu->arch.regs. It is possible for the
guest to modify the value between both reads, potentially causing nasty
bugs.
Harden against such bugs by caching everything in svm->nested.save.
Cache all the needed fields, and keep all accesses to the VMCB12
strictly in nested_svm_vmrun() for caching and early error injection.
Following changes will further limit the access to the VMCB12 in the
nested VMRUN path.
Introduce vmcb12_is_dirty() to use with the cached control fields
instead of vmcb_is_dirty(), similar to vmcb12_is_intercept().
Opportunistically order the copies in __nested_copy_vmcb_save_to_cache()
by the order in which the fields are defined in struct vmcb_save_area.
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-21-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
'virt' is confusing in the VMCB because it is relative and ambiguous.
The 'virt_ext' field includes bits for LBR virtualization and
VMSAVE/VMLOAD virtualization, so it's just another miscellaneous control
field. Name it as such.
While at it, move the definitions of the bits below those for
'misc_ctl' and rename them for consistency.
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-20-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The 'nested_ctl' field is misnamed. Although the first bit is for nested
paging, the other defined bits are for SEV/SEV-ES. Other bits in the
same field according to the APM (but not defined by KVM) include "Guest
Mode Execution Trap", "Enable INVLPGB/TLBSYNC", and other control bits
unrelated to 'nested'.
There is nothing common among these bits, so just name the field
misc_ctl. Also rename the flags accordingly.
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-19-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Grab svm->nested.ctl as vmcb12_ctrl when preparing the vmcb02 controls to
make it more obvious that much of the data is coming from vmcb12 (or
rather, a snapshot of vmcb12 at the time of L1's VMRUN).
Opportunistically reorder the variable definitions to create a pretty
reverse fir tree.
No functional change intended.
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260218230958.2877682-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move "bus_lock_rip" from "vmcb_ctrl_area_cached" to "svm_nested_state" as
"last_bus_lock_rip" to more accurately reflect what it tracks, and because
it is NOT a cached vmcb12 control field. The misplaced field isn't all
that apparent in the current code base, as KVM uses "svm->nested.ctl"
broadly, but the bad placement becomes glaringly obvious if
"svm->nested.ctl" is captured as a local "vmcb12_ctrl" variable.
No functional change intended.
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260218230958.2877682-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use vmcb12_is_intercept() instead of open-coding the intercept check.
No functional change intended.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260218230958.2877682-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that nested_vmcb02_recalc_intercepts() is explicitly scoped to deal
with *only* recalculating vmcb02 intercepts, rename its local variables to
use more intuivite names. The current "c", "h", and "g" local variables,
for the current VMCB, vmcb01, and (cached) vmcb12 respectively, are short
and sweet, but don't do much to help unfamiliar readers understand what
the code is doing.
Use vmcb12_ctrl/vmcb01/vmcb02/vmcb12_ctrl in lieu of c/h/g to make it clear
the function is updating intercepts in vmcb02 based on the intercepts in
vmcb01 and (cached) vmcb12.
Opportunistically change the existing WARN_ON to a WARN_ON_ONCE so that a
KVM bug doesn't unintentionally DoS the host.
No functional change intended.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
[sean: use WARN_ON_ONCE, keep local vmcb12 cache as vmcb12_ctrl]
Link: https://patch.msgid.link/20260218230958.2877682-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
nested_vmcb02_prepare_control()
Now that nested_vmcb02_recalc_intercepts() provides guardrails against it
being incorrectly called without vmcb02 active, invoke it directly from
nested_vmcb02_recalc_intercepts() instead of bouncing through
svm_mark_intercepts_dirty(), which unnecessarily marks vmcb01 as dirty.
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260218230958.2877682-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN and bail early from nested_vmcb02_recalc_intercepts() if vmcb02 isn't
the active/current VMCB, as recalculating intercepts for vmcb01 using logic
intended for merging vmcb12 and vmcb01 intercepts can yield unexpected and
unwanted results.
In addition to hardening against general bugs, this will provide additional
safeguards "if" nested_vmcb02_recalc_intercepts() is invoked directly from
nested_vmcb02_prepare_control().
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
[sean: split to separate patch, bail early on "failure"]
Link: https://patch.msgid.link/20260218230958.2877682-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Extract the non-nested aspects of recalc_intercepts() into a separate
helper, svm_mark_intercepts_dirty(), to make it clear that the call isn't
*just* recalculating (vmcb02's) intercepts, and to not bury non-nested
code in nested.c.
As suggested by Yosry, opportunistically prepend "nested_vmbc02_" to
recalc_intercepts() so that it's obvious the function specifically deals
with recomputing intercepts for L2.
No functional change intended.
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260218230958.2877682-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The AMD APM states that VMRUN, VMLOAD, VMSAVE, CLGI, VMMCALL, and
INVLPGA instructions should generate a #UD when EFER.SVME is cleared.
Currently, when VMLOAD, VMSAVE, or CLGI are executed in L1 with
EFER.SVME cleared, no #UD is generated in certain cases. This is because
the intercepts for these instructions are cleared based on whether or
not vls or vgif is enabled. The #UD fails to be generated when the
intercepts are absent.
Fix the missing #UD generation by ensuring that all relevant
instructions have intercepts set when SVME.EFER is disabled.
VMMCALL is special because KVM's ABI is that VMCALL/VMMCALL are always
supported for L1 and never fault.
Signed-off-by: Kevin Cheng <chengkev@google.com>
[sean: isolate Intel CPU "compatibility" in EFER.SVME=1 path]
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260304003010.1108257-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move STGI/CLGI intercept handling to svm_recalc_instruction_intercepts()
in preparation for making the function EFER.SVME-aware. This will allow
configuring STGI/CLGI intercepts along with other intercepts for other SVM
instructions when EFER.SVME is toggled (KVM needs to intercept SVM
instructions when EFER.SVME=0 to inject #UD).
When clearing the STGI intercept in particular, request KVM_REQ_EVENT if
there is at least one a pending GIF-controlled event. This avoids breaking
NMI/SMI window tracking, as enable_{nmi,smi}_window() sets INTERCEPT_STGI
to detect when NMIs become unblocked. KVM_REQ_EVENT forces
kvm_check_and_inject_events() to re-evaluate pending events and re-enable
the intercept if needed.
Extract the pending GIF event check into a helper function
svm_has_pending_gif_event() to deduplicate the logic between
svm_recalc_instruction_intercepts() and svm_set_gif().
Signed-off-by: Kevin Cheng <chengkev@google.com>
[sean: keep vgif handling out of the "Intel CPU model" path]
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260304003010.1108257-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Always intercept VMMCALL now that KVM properly synthesizes a #UD as
appropriate, i.e. when L1 doesn't want to intercept VMMCALL, to avoid
putting L2 into an infinite #UD loop if KVM_X86_QUIRK_FIX_HYPERCALL_INSN
is enabled.
By letting L2 execute VMMCALL natively and thus #UD, for all intents and
purposes KVM morphs the VMMCALL intercept into a #UD intercept (KVM always
intercepts #UD). When the hypercall quirk is enabled, KVM "emulates"
VMMCALL in response to the #UD by trying to fixup the opcode to the "right"
vendor, then restarts the guest, without skipping the VMMCALL. As a
result, the guest sees an endless stream of #UDs since it's already
executing the correct vendor hypercall instruction, i.e. the emulator
doesn't anticipate that the #UD could be due to lack of interception, as
opposed to a truly undefined opcode.
Fixes: 0d945bd93511 ("KVM: SVM: Don't allow nested guest to VMMCALL into host")
Cc: stable@vger.kernel.org
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/20260304002223.1105129-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly synthesize a #UD for VMMCALL if L2 is active, L1 does NOT want
to intercept VMMCALL, nested_svm_l2_tlb_flush_enabled() is true, and the
hypercall is something other than one of the supported Hyper-V hypercalls.
When all of the above conditions are met, KVM will intercept VMMCALL but
never forward it to L1, i.e. will let L2 make hypercalls as if it were L1.
The TLFS says a whole lot of nothing about this scenario, so go with the
architectural behavior, which says that VMMCALL #UDs if it's not
intercepted.
Opportunistically do a 2-for-1 stub trade by stub-ifying the new API
instead of the helpers it uses. The last remaining "single" stub will
soon be dropped as well.
Suggested-by: Sean Christopherson <seanjc@google.com>
Fixes: 3f4a812edf5c ("KVM: nSVM: hyper-v: Enable L2 TLB flush")
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Kevin Cheng <chengkev@google.com>
Link: https://patch.msgid.link/20260228033328.2285047-5-chengkev@google.com
[sean: rewrite changelog and comment, tag for stable, remove defunct stubs]
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/20260304002223.1105129-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When reacting to an intercept update, explicitly mark vmcb01's intercepts
dirty, as KVM always initially operates on vmcb01, and nested_svm_vmexit()
isn't guaranteed to mark VMCB_INTERCEPTS as dirty. I.e. if L2 is active,
KVM will modify the intercepts for L1, but might not mark them as dirty
before the next VMRUN of L1.
Fixes: 116a0a23676e ("KVM: SVM: Add clean-bit for intercetps, tsc-offset and pause filter count")
Cc: stable@vger.kernel.org
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260218230958.2877682-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
According to the APM Volume #2, 15.20 (24593—Rev. 3.42—March 2024):
VMRUN exits with VMEXIT_INVALID error code if either:
• Reserved values of TYPE have been specified, or
• TYPE = 3 (exception) has been specified with a vector that does not
correspond to an exception (this includes vector 2, which is an NMI,
not an exception).
Add the missing consistency checks to KVM. For the second point, inject
VMEXIT_INVALID if the vector is anything but the vectors defined by the
APM for exceptions. Reserved vectors are also considered invalid, which
matches the HW behavior. Vector 9 (i.e. #CSO) is considered invalid
because it is reserved on modern CPUs, and according to LLMs no CPUs
exist supporting SVM and producing #CSOs.
Defined exceptions could be different between virtual CPUs as new CPUs
define new vectors. In a best effort to dynamically define the valid
vectors, make all currently defined vectors as valid except those
obviously tied to a CPU feature: SHSTK -> #CP and SEV-ES -> #VC. As new
vectors are defined, they can similarly be tied to corresponding CPU
features.
Invalid vectors on specific (e.g. old) CPUs that are missed by KVM
should be rejected by HW anyway.
Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler")
CC: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-18-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
According to the APM Volume #2, 15.5, Canonicalization and Consistency
Checks (24593—Rev. 3.42—March 2024), the following condition (among
others) results in a #VMEXIT with VMEXIT_INVALID (aka SVM_EXIT_ERR):
EFER.LME, CR0.PG, CR4.PAE, CS.L, and CS.D are all non-zero.
In the list of consistency checks done when EFER.LME and CR0.PG are set,
add a check that CS.L and CS.D are not both set, after the existing
check that CR4.PAE is set.
This is functionally a nop because the nested VMRUN results in
SVM_EXIT_ERR in HW, which is forwarded to L1, but KVM makes all
consistency checks before a VMRUN is actually attempted.
Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-17-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
From the APM Volume #2, 15.25.4 (24593—Rev. 3.42—March 2024):
When VMRUN is executed with nested paging enabled (NP_ENABLE = 1), the
following conditions are considered illegal state combinations, in
addition to those mentioned in “Canonicalization and Consistency Checks”:
• Any MBZ bit of nCR3 is set.
• Any G_PAT.PA field has an unsupported type encoding or any
reserved field in G_PAT has a nonzero value.
Add the consistency check for nCR3 being a legal GPA with no MBZ bits
set. Note, the G_PAT.PA check is being handled separately[*].
Link: https://lore.kernel.org/kvm/20260205214326.1029278-3-jmattson@google.com [*]
Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-16-yosry@kernel.org
[sean: capture everything in CC(), massage changelog formatting]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM currenty fails a nested VMRUN and injects VMEXIT_INVALID (aka
SVM_EXIT_ERR) if L1 sets NP_ENABLE and the host does not support NPTs.
On first glance, it seems like the check should actually be for
guest_cpu_cap_has(X86_FEATURE_NPT) instead, as it is possible for the
host to support NPTs but the guest CPUID to not advertise it.
However, the consistency check is not architectural to begin with. The
APM does not mention VMEXIT_INVALID if NP_ENABLE is set on a processor
that does not have X86_FEATURE_NPT. Hence, NP_ENABLE should be ignored
if X86_FEATURE_NPT is not available for L1, so sanitize it when copying
from the VMCB12 to KVM's cache.
Apart from the consistency check, NP_ENABLE in VMCB12 is currently
ignored because the bit is actually copied from VMCB01 to VMCB02, not
from VMCB12.
Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-15-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The wrappers provide little value and make it harder to see what KVM is
checking in the normal flow. Drop them.
Opportunistically fixup comments referring to the functions, adding '()'
to make it clear it's a reference to a function.
No functional change intended.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-14-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM clears tracking of L1->L2 injected NMIs (i.e. nmi_l1_to_l2) and soft
IRQs (i.e. soft_int_injected) on a synthesized #VMEXIT(INVALID) due to
failed VMRUN. However, they are not explicitly cleared in other
synthesized #VMEXITs.
soft_int_injected is always cleared after the first VMRUN of L2 when
completing interrupts, as any re-injection is then tracked by KVM
(instead of purely in vmcb02).
nmi_l1_to_l2 is not cleared after the first VMRUN if NMI injection
failed, as KVM still needs to keep track that the NMI originated from L1
to avoid blocking NMIs for L1. It is only cleared when the NMI injection
succeeds.
KVM could synthesize a #VMEXIT to L1 before successfully injecting the
NMI into L2 (e.g. due to a #NPF on L2's NMI handler in L1's NPTs). In
this case, nmi_l1_to_l2 will remain true, and KVM may not correctly mask
NMIs and intercept IRET when injecting an NMI into L1.
Clear both nmi_l1_to_l2 and soft_int_injected in nested_svm_vmexit(), i.e.
for all #VMEXITs except those that occur due to failed consistency checks,
as those happen before nmi_l1_to_l2 or soft_int_injected are set.
Fixes: 159fc6fa3b7d ("KVM: nSVM: Transparently handle L1 -> L2 NMI re-injection")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-13-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
According to the APM, from the reference of the VMRUN instruction:
Upon #VMEXIT, the processor performs the following actions in order to
return to the host execution context:
...
clear EVENTINJ field in VMCB
KVM already syncs EVENTINJ fields from vmcb02 to cached vmcb12 on every
L2->L0 #VMEXIT. Since these fields are zeroed by the CPU on #VMEXIT, they
will mostly be zeroed in vmcb12 on nested #VMEXIT by nested_svm_vmexit().
However, this is not the case when:
1. Consistency checks fail, as nested_svm_vmexit() is not called.
2. Entering guest mode fails before L2 runs (e.g. due to failed load of
CR3).
(2) was broken by commit 2d8a42be0e2b ("KVM: nSVM: synchronize VMCB
controls updated by the processor on every vmexit"), as prior to that
nested_svm_vmexit() always zeroed EVENTINJ fields.
Explicitly clear the fields in all nested #VMEXIT code paths.
Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler")
Fixes: 2d8a42be0e2b ("KVM: nSVM: synchronize VMCB controls updated by the processor on every vmexit")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-12-yosry@kernel.org
[sean: massage changelog formatting]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
According to the APM, GIF is set to 0 on any #VMEXIT, including
an #VMEXIT(INVALID) due to failed consistency checks. Clear GIF on
consistency check failures.
Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-11-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
If loading L1's CR3 fails on a nested #VMEXIT, nested_svm_vmexit()
returns an error code that is ignored by most callers, and continues to
run L1 with corrupted state. A sane recovery is not possible in this
case, and HW behavior is to cause a shutdown. Inject a triple fault
instead, and do not return early from nested_svm_vmexit(). Continue
cleaning up the vCPU state (e.g. clear pending exceptions), to handle
the failure as gracefully as possible.
From the APM:
Upon #VMEXIT, the processor performs the following actions in order to
return to the host execution context:
...
if (illegal host state loaded, or exception while loading host state)
shutdown
else
execute first host instruction following the VMRUN
Remove the return value of nested_svm_vmexit(), which is mostly
unchecked anyway.
Fixes: d82aaef9c88a ("KVM: nSVM: use nested_svm_load_cr3() on guest->host switch")
CC: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-10-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM currently injects a #GP and hopes for the best if mapping VMCB12
fails on nested #VMEXIT, and only if the failure mode is -EINVAL.
Mapping the VMCB12 could also fail if creating host mappings fails.
After the #GP is injected, nested_svm_vmexit() bails early, without
cleaning up (e.g. KVM_REQ_GET_NESTED_STATE_PAGES is set, is_guest_mode()
is true, etc).
Instead of optionally injecting a #GP, triple fault the guest if mapping
VMCB12 fails since KVM cannot make a sane recovery. The APM states that
a #VMEXIT will triple fault if host state is illegal or an exception
occurs while loading host state, so the behavior is not entirely made
up.
Do not return early from nested_svm_vmexit(), continue cleaning up the
vCPU state (e.g. switch back to vmcb01), to handle the failure as
gracefully as possible.
Fixes: cf74a78b229d ("KVM: SVM: Add VMEXIT handler and intercepts")
CC: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-9-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move mapping vmcb12 and updating it out of nested_svm_vmexit() into a
helper, no functional change intended.
CC: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-8-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Refactor the vCPU cap and vmcb12 flag checks into a helper. The
unlikely() annotation is dropped, it's unlikely (huh) to make a
difference and the CPU will probably predict it better on its own.
CC: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-7-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
nested_svm_vmrun() currently only injects a #GP if kvm_vcpu_map() fails
with -EINVAL. But it could also fail with -EFAULT if creating a host
mapping failed. Inject a #GP in all cases, no reason to treat failure
modes differently.
Fixes: 8c5fbf1a7231 ("KVM/nSVM: Use the new mapping API for mapping guest memory")
CC: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-6-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
MSR_IA32_DEBUGCTLMSR and LBR MSRs are currently not enumerated by
KVM_GET_MSR_INDEX_LIST, and LBR MSRs cannot be set with KVM_SET_MSRS. So
save/restore is completely broken.
Fix it by adding the MSRs to msrs_to_save_base, and allowing writes to
LBR MSRs from userspace only (as they are read-only MSRs) if LBR
virtualization is enabled. Additionally, to correctly restore L1's LBRs
while L2 is running, make sure the LBRs are copied from the captured
VMCB01 save area in svm_copy_vmrun_state().
Note, for VMX, this also fixes a flaw where MSR_IA32_DEBUGCTLMSR isn't
reported as an MSR to save/restore.
Note #2, over-reporting MSR_IA32_LASTxxx on Intel is ok, as KVM already
handles unsupported reads and writes thanks to commit b5e2fec0ebc3 ("KVM:
Ignore DEBUGCTL MSRs with no effect") (kvm_do_msr_access() will morph the
unsupported userspace write into a nop).
Fixes: 24e09cbf480a ("KVM: SVM: enable LBR virtualization")
Cc: stable@vger.kernel.org
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-4-yosry@kernel.org
[sean: guard with lbrv checks, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In preparation for using svm_copy_lbrs() with 'struct vmcb_save_area'
without a containing 'struct vmcb', and later even 'struct
vmcb_save_area_cached', make it a macro.
Macros are generally not preferred compared to functions, mainly due to
type-safety. However, in this case it seems like having a simple macro
copying a few fields is better than copy-pasting the same 5 lines of
code in different places.
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-3-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
svm_copy_lbrs() always marks VMCB_LBR dirty in the destination VMCB.
However, nested_svm_vmexit() uses it to copy LBRs to vmcb12, and
clearing clean bits in vmcb12 is not architecturally defined.
Move vmcb_mark_dirty() to callers and drop it for vmcb12.
This also facilitates incoming refactoring that does not pass the entire
VMCB to svm_copy_lbrs().
Fixes: d20c796ca370 ("KVM: x86: nSVM: implement nested LBR virtualization")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-2-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
INVLPGA should cause a #UD when EFER.SVME is not set. Add a check to
properly inject #UD when EFER.SVME=0.
Fixes: ff092385e828 ("KVM: SVM: Implement INVLPGA")
Cc: stable@vger.kernel.org
Signed-off-by: Kevin Cheng <chengkev@google.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260228033328.2285047-3-chengkev@google.com
[sean: tag for stable@]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In the save+restore path, when restoring nested state, the values of RIP
and CS base passed into nested_vmcb02_prepare_control() are mostly
incorrect. They are both pulled from the vmcb02. For CS base, the value
is only correct if system regs are restored before nested state. The
value of RIP is whatever the vCPU had in vmcb02 before restoring nested
state (zero on a freshly created vCPU).
Instead, take a similar approach to NextRIP, and delay initializing the
RIP tracking fields until shortly before the vCPU is run, to make sure
the most up-to-date values of RIP and CS base are used regardless of
KVM_SET_SREGS, KVM_SET_REGS, and KVM_SET_NESTED_STATE's relative
ordering.
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
CC: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260225005950.3739782-8-yosry@kernel.org
[sean: deal with the svm_cancel_injection() madness]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
For guests with NRIPS disabled, L1 does not provide NextRIP when running
an L2 with an injected soft interrupt, instead it advances L2's RIP
before running it. KVM uses L2's current RIP as the NextRIP in vmcb02 to
emulate a CPU without NRIPS.
However, in svm_set_nested_state(), the value used for L2's current RIP
comes from vmcb02, which is just whatever the vCPU had in vmcb02 before
restoring nested state (zero on a freshly created vCPU). Passing the
cached RIP value instead (i.e. kvm_rip_read()) would only fix the issue
if registers are restored before nested state.
Instead, split the logic of setting NextRIP in vmcb02. Handle the
'normal' case of initializing vmcb02's NextRIP using NextRIP from vmcb12
(or KVM_GET_NESTED_STATE's payload) in nested_vmcb02_prepare_control().
Delay the special case of stuffing L2's current RIP into vmcb02's
NextRIP until shortly before the vCPU is run, to make sure the most
up-to-date value of RIP is used regardless of KVM_SET_REGS and
KVM_SET_NESTED_STATE's relative ordering.
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
CC: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260225005950.3739782-7-yosry@kernel.org
[sean: use new helper, svm_fixup_nested_rips()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|