| Age | Commit message (Collapse) | Author | Files | Lines |
|
Add properties declared in the new DT binding nxp,lpc3220-dmamux.yaml
and corresponding phandles.
[vzapolskiy]:
1. rebased the change,
2. dmamux unit address shall be 0x78 instead of 0x7c,
3. removed unsupported 'dmas' properties from sd, ssp0, ssp1 and HS UARTs,
4. more non-functional updates by reordering properies,
5. minor updates to the commit message.
Link to the original change:
* https://lore.kernel.org/linux-arm-kernel/20240627150046.258795-6-piotr.wojtaszczyk@timesys.com/
Signed-off-by: Piotr Wojtaszczyk <piotr.wojtaszczyk@timesys.com>
Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
|
|
The clock controller is a part of NXP LPC32xx system control block (SCB),
and SCB provides a number of controllers apart of the clock controller.
[vzapolskiy]:
1. kept a simple comment,
2. renamed SoC specific compatible to 'nxp,lpc3220-scb' due to the SoC UM,
3. changed size in 'ranges', since it should cover more SCB functions,
4. updated the commit message.
Link to the original change:
* https://lore.kernel.org/linux-arm-kernel/20240627150046.258795-5-piotr.wojtaszczyk@timesys.com/
Signed-off-by: Piotr Wojtaszczyk <piotr.wojtaszczyk@timesys.com>
Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
|
|
SLC and MLC NAND flash controllers fire the muxed interrupt FLASH_INT to
the SoC, add the interrupt property to the SLC device tree node.
Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
|
|
The device tree node name of NAND controllers shall be 'nand-controller',
while 'flash' name is the name of NAND chip device tree nodes.
Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
|
|
PL022 binding require two clocks to be defined but NXP LPC32xx platform
doesn't comply with the bindings and define only one clock i.e apb_pclk.
Update SPI clocks and clocks-names property by adding appropriate clock
reference to make it compliant with the bindings.
Noteworthy, strictly speaking the change tackles DT ABI by changing
the order in the list of clock-names property values, however this level
of impact is considered as acceptable.
Cc: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: Kuldeep Singh <singh.kuldeep87k@gmail.com>
[vzapolskiy: rebased and minor update to the commit message]
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
|
|
Add basic support for pcb8385 [1]. It is a modular board which allows
to add different daughter cards on which there are different PHYs.
This adds support for UART, LEDs and I2C.
[1] https://www.microchip.com/en-us/development-tool/ev83e85a
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20251208083545.3642168-3-horatiu.vultur@microchip.com
Signed-off-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
|
|
The arm64 kernel doesn't boot with annotated branches
(PROFILE_ANNOTATED_BRANCHES) enabled and CONFIG_DEBUG_VIRTUAL together.
Bisecting it, I found that disabling branch profiling in arch/arm64/mm
solved the problem. Narrowing down a bit further, I found that
physaddr.c is the file that needs to have branch profiling disabled to
get the machine to boot.
I suspect that it might invoke some ftrace helper very early in the boot
process and ftrace is still not enabled(!?).
Rather than playing whack-a-mole with individual files, disable branch
profiling for the entire arch/arm64 tree, similar to what x86 already
does in arch/x86/Kbuild.
Cc: stable@vger.kernel.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This initial version includes support for OV9282 camera sensor.
Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Reviewed-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>
Link: https://lore.kernel.org/r/20260108170550.359968-4-loic.poulain@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
The PM8008 device is a dedicated camera PMIC integrating all the necessary
camera power management features.
Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260108170550.359968-3-loic.poulain@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
Add pinctrl configuration for the four available camera master clocks.
Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>
Link: https://lore.kernel.org/r/20260108170550.359968-2-loic.poulain@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
synchronize_vcpu_pstate() doesn't make use of the reference to exit_code,
remove the parameter.
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://msgid.link/20251216103053.47224-5-alexandru.elisei@arm.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
__pvkm_host_share_hyp() and __pkvm_host_unshare_hyp() both have one
parameter, the pfn, not two. Even though correctness isn't impacted because
the SMCCC handlers pass the first argument and ignore the second one, let's
call the functions with the proper number of arguments.
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://msgid.link/20251216103053.47224-4-alexandru.elisei@arm.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
Configuring a register trap without specifying an accessor function is
abviously a bug. Instead of calling die() when that happens, let's be a
bit more helpful and print the register encoding. Also inject an
undefined instruction exception in the guest, similar to other unhandled
register accesses.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Link: https://msgid.link/20251216103053.47224-3-alexandru.elisei@arm.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
Commit fb10ddf35c1c ("KVM: arm64: Compute per-vCPU FGTs at vcpu_load()")
introduced per-VCPU FGT traps. For an unprotected pKVM VCPU, the untrusted
host FGT configuration is copied in pkvm_vcpu_init_traps(), which is called
from __pkvm_init_vcpu(). __pkvm_init_vcpu() is called once per VCPU (when
the VCPU is first run) which means that the uninitialized, zero, values for
the FGT registers end up being used for the entire lifetime of the VCPU.
This causes both unwanted traps (for the inverse polarity trap bits) and
the guest being allowed to access registers it shouldn't.
Fix it by copying the FGT traps for unprotected pKVM VCPUs when the
untrusted host loads the VCPU.
Fixes: fb10ddf35c1c ("KVM: arm64: Compute per-vCPU FGTs at vcpu_load()")
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20251216103053.47224-2-alexandru.elisei@arm.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
GIF==0 together with EFER.SVME==0 is a valid architectural
state. Don't return -EINVAL for KVM_SET_NESTED_STATE when this
combination is specified.
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251121204803.991707-2-yosry.ahmed@linux.dev
[sean: disallow KVM_STATE_NESTED_RUN_PENDING with SVME=0]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Clearing EFER.SVME is not architected to set GIF. Don't set GIF when
emulating a change to EFER that clears EFER.SVME.
However, keep setting GIF if clearing EFER.SVME causes force-leaving the
nested guest through svm_leave_nested(), to maintain a sane behavior of
not leaving GIF cleared after exiting the guest. In every other path,
setting GIF is either correct/desirable, or irrelevant because the
caller immediately and unconditionally sets/clears GIF.
This is more-or-less KVM defining HW behavior, but leaving GIF cleared
would also be defining HW behavior anyway.
Note that if force-leaving the nested guest is considered a SHUTDOWN,
then this could violate the APM-specified behavior:
If the processor enters the shutdown state (due to a triple fault for
instance) while GIF is clear, it can only be restarted by means of a
RESET.
However, a SHUTDOWN leaves the VMCB undefined, so there's not a lot that
KVM can do in this case. Also, if vGIF is enabled on SHUTDOWN, KVM has
no way of finding out of GIF was cleared.
The only way for KVM to handle this without making up HW behavior is to
completely terminate the VM, so settle for doing the relatively "sane"
thing of setting GIF when force-leaving nested.
Fixes: c513f484c558 ("KVM: nSVM: leave guest mode when clearing EFER.SVME")
Signed-off-by: Jim Mattson <jmattson@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251121204803.991707-3-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The current XN implementation is tied to the EL2 translation regime,
and fall flat on its face with the EL2&0 one that is used for hVHE,
as the permission bit for privileged execution is a different one.
Fixes: 6537565fd9b7f ("KVM: arm64: Adjust EL2 stage-1 leaf AP bits when ARM64_KVM_HVHE is set")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Fuad Tabba <tabba@google.com>
Link: https://msgid.link/20251210173024.561160-2-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
Explicitly check for the vgic being v3 when disabling TWI. Failure to
check this can result in using the wrong view of the vgic CPU IF union
causing undesirable/unexpected behaviour.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20260106165154.3321753-1-sascha.bischoff@arm.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
Enable drivers for hardware present on Apple Silicon machines. None of
these drivers are critical so build them as modules. The reset and RTC
macsmc drivers would be useful as built-in drivers but they have large
dependencies so keep them as modules.
The size increases are minor and are offsetted by already merged
"default ARCH_APPLE" removals from the linked original submission.
vmlinux 157782640 -> 157902032 (+119392)
Image 41007616 -> 41073152 (+ 65536)
Link: https://lore.kernel.org/asahi/20250612-apple-kconfig-defconfig-v1-0-0e6f9cb512c1@kernel.org/
Signed-off-by: Janne Grunau <j@jannau.net>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Link: https://patch.msgid.link/20251231-arch-arm64-apple-defconfig-v1-2-4ff19805ba39@jannau.net
Signed-off-by: Sven Peter <sven@kernel.org>
|
|
Critical devices like the NVMe on Apple silicon systems depend on
APPLE_PMGR_PWRSTATE and others are powered down coming out of reset so
select APPLE_PMGR_PWRSTATE to ensure these devices work. Since the
systems are not totally unsusable (boot to initramfs is expected to
work) allow !PM builds without APPLE_PMGR_PWRSTATE.
Link: https://lore.kernel.org/asahi/2e022f4e-4c87-4da1-9d02-f7a3ae7c5798@arm.com/
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Janne Grunau <j@jannau.net>
Link: https://patch.msgid.link/20251231-arch-arm64-apple-defconfig-v1-1-4ff19805ba39@jannau.net
Signed-off-by: Sven Peter <sven@kernel.org>
|
|
AMD CPUs with the Enhanced Return Address Predictor Security (ERAPS)
feature (available on Zen5+) obviate the need for FILL_RETURN_BUFFER
sequences right after VMEXITs. ERAPS adds guest/host tags to entries in
the RSB (a.k.a. RAP). This helps with speculation protection across the
VM boundary, and it also preserves host and guest entries in the RSB that
can improve software performance (which would otherwise be flushed due to
the FILL_RETURN_BUFFER sequences).
Importantly, ERAPS also improves cross-domain security by clearing the RAP
in certain situations. Specifically, the RAP is cleared in response to
actions that are typically tied to software context switching between
tasks. Per the APM:
The ERAPS feature eliminates the need to execute CALL instructions to
clear the return address predictor in most cases. On processors that
support ERAPS, return addresses from CALL instructions executed in host
mode are not used in guest mode, and vice versa. Additionally, the
return address predictor is cleared in all cases when the TLB is
implicitly invalidated and in the following cases:
• MOV CR3 instruction
• INVPCID other than single address invalidation (operation type 0)
ERAPS also allows CPUs to extends the size of the RSB/RAP from the older
standard (of 32 entries) to a new size, enumerated in CPUID leaf
0x80000021:EBX bits 23:16 (64 entries in Zen5 CPUs).
In hardware, ERAPS is always-on, when running in host context, the CPU
uses the full RSB/RAP size without any software changes necessary.
However, when running in guest context, the CPU utilizes the full size of
the RSB/RAP if and only if the new ALLOW_LARGER_RAP flag is set in the
VMCB; if the flag is not set, the CPU limits itself to the historical size
of 32 entires.
Requiring software to opt-in for guest usage of RAPs larger than 32 entries
allows hypervisors, i.e. KVM, to emulate the aforementioned conditions in
which the RAP is cleared as well as the guest/host split. E.g. if the CPU
unconditionally used the full RAP for guests, failure to clear the RAP on
transitions between L1 or L2, or on emulated guest TLB flushes, would
expose the guest to RAP-based attacks as a guest without support for ERAPS
wouldn't know that its FILL_RETURN_BUFFER sequence is insufficient.
Address the ~two broad categories of ERAPS emulation, and advertise
ERAPS support to userspace, along with the RAP size enumerated in CPUID.
1. Architectural RAP clearing: as above, CPUs with ERAPS clear RAP entries
on several conditions, including CR3 updates. To handle scenarios
where a relevant operation is handled in common code (emulation of
INVPCID and to a lesser extent MOV CR3), piggyback VCPU_EXREG_CR3 and
create an alias, VCPU_EXREG_ERAPS. SVM doesn't utilize CR3 dirty
tracking, and so for all intents and purposes VCPU_EXREG_CR3 is unused.
Aliasing VCPU_EXREG_ERAPS ensures that any flow that writes CR3 will
also clear the guest's RAP, and allows common x86 to mark ERAPS vCPUs
as needing a RAP clear without having to add a new request (or other
mechanism).
2. Nested guests: the ERAPS feature adds host/guest tagging to entries
in the RSB, but does not distinguish between the guest ASIDs. To
prevent the case of an L2 guest poisoning the RSB to attack the L1
guest, the CPU exposes a new VMCB bit (CLEAR_RAP). The next
VMRUN with a VMCB that has this bit set causes the CPU to flush the
RSB before entering the guest context. Set the bit in VMCB01 after a
nested #VMEXIT to ensure the next time the L1 guest runs, its RSB
contents aren't polluted by the L2's contents. Similarly, before
entry into a nested guest, set the bit for VMCB02, so that the L1
guest's RSB contents are not leaked/used in the L2 context.
Enable ALLOW_LARGER_RAP (and emulate RAP clears) if and only if ERAPS is
exposed to the guest. Enabling ALLOW_LARGER_RAP unconditionally wouldn't
cause any functional issues, but ignoring userspace's (and L1's) desires
would put KVM into a grey area, which is especially undesirable due to the
potential security implications. E.g. if a use case wants to have L1 do
manual RAP clearing even when ERAPS is present in hardware, enabling
ALLOW_LARGER_RAP could result in L1 leaving stale entries in the RAP.
ERAPS is documented in AMD APM Vol 2 (Pub 24593), in revisions 3.43 and
later.
Signed-off-by: Amit Shah <amit.shah@amd.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Amit Shah <amit.shah@amd.com>
Link: https://patch.msgid.link/aR913X8EqO6meCqa@google.com
|
|
If a feature is not advertised in the guest's CPUID, prevent L1 from
intercepting the unsupported instructions by clearing the corresponding
intercept in KVM's cached vmcb12.
When an L2 guest executes an instruction that is not advertised to L1,
we expect a #UD exception to be injected by L0. However, the nested svm
exit handler first checks if the instruction intercept is set in vmcb12,
and if so, synthesizes an exit from L2 to L1 instead of a #UD exception.
If a feature is not advertised, the L1 intercept should be ignored.
While creating KVM's cached vmcb12, sanitize the intercepts for
instructions that are not advertised in the guest CPUID. This
effectively ignores the L1 intercept on nested vm exit handling. It also
ignores the L1 intercept when computing the intercepts in vmcb02, so if
L0 (for some reason) does not intercept the instruction, KVM won't
intercept it at all.
Signed-off-by: Kevin Cheng <chengkev@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251215192510.2300816-1-chengkev@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Wire up SVM's #NPF handler to fast MMIO. While SVM doesn't provide a
dedicated exit reason, it's trivial to key off PFERR_RSVD_MASK. Like VMX,
restrict the fast path to L1 to avoid having to deal with nGPA=>GPA
translations.
For simplicity, use the fast path if and only if the next RIP is known.
While KVM could utilize EMULTYPE_SKIP, doing so would require additional
logic to deal with SEV guests, e.g. to go down the slow path if the
instruction buffer is empty. All modern CPUs support next RIP, and in
practice the next RIP will be available for any guest fast path.
Copy+paste the kvm_io_bus_write() + trace_kvm_fast_mmio() logic even
though KVM would ideally provide a small helper, as such a helper would
need to either be a macro or non-inline to avoid including trace.h in a
header (trace.h must not be included by x86.c prior to CREATE_TRACE_POINTS
being defined).
Link: https://patch.msgid.link/20251113221642.1673023-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename "fault_address" to "gpa" in KVM's #NPF handler and track it as a
gpa_t to more precisely document what type of address is being captured,
and because "gpa" is much more succinct.
No functional change intended.
Link: https://patch.msgid.link/20251113221642.1673023-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the WARN in svm_set_nested_state() on nested_svm_load_cr3() failing
as it is trivially easy to trigger from userspace by modifying CPUID after
loading CR3. E.g. modifying the state restoration selftest like so:
--- tools/testing/selftests/kvm/x86/state_test.c
+++ tools/testing/selftests/kvm/x86/state_test.c
@@ -280,7 +280,16 @@ int main(int argc, char *argv[])
/* Restore state in a new VM. */
vcpu = vm_recreate_with_one_vcpu(vm);
- vcpu_load_state(vcpu, state);
+
+ if (stage == 4) {
+ state->sregs.cr3 = BIT(44);
+ vcpu_load_state(vcpu, state);
+
+ vcpu_set_cpuid_property(vcpu, X86_PROPERTY_MAX_PHY_ADDR, 36);
+ __vcpu_nested_state_set(vcpu, &state->nested);
+ } else {
+ vcpu_load_state(vcpu, state);
+ }
/*
* Restore XSAVE state in a dummy vCPU, first without doing
generates:
WARNING: CPU: 30 PID: 938 at arch/x86/kvm/svm/nested.c:1877 svm_set_nested_state+0x34a/0x360 [kvm_amd]
Modules linked in: kvm_amd kvm irqbypass [last unloaded: kvm]
CPU: 30 UID: 1000 PID: 938 Comm: state_test Tainted: G W 6.18.0-rc7-58e10b63777d-next-vm
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:svm_set_nested_state+0x34a/0x360 [kvm_amd]
Call Trace:
<TASK>
kvm_arch_vcpu_ioctl+0xf33/0x1700 [kvm]
kvm_vcpu_ioctl+0x4e6/0x8f0 [kvm]
__x64_sys_ioctl+0x8f/0xd0
do_syscall_64+0x61/0xad0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Simply delete the WARN instead of trying to prevent userspace from shoving
"illegal" state into CR3. For better or worse, KVM's ABI allows userspace
to set CPUID after SREGS, and vice versa, and KVM is very permissive when
it comes to guest CPUID. I.e. attempting to enforce the virtual CPU model
when setting CPUID could break userspace. Given that the WARN doesn't
provide any meaningful protection for KVM or benefit for userspace, simply
drop it even though the odds of breaking userspace are minuscule.
Opportunistically delete a spurious newline.
Fixes: b222b0b88162 ("KVM: nSVM: refactor the CR3 reload on migration")
Cc: stable@vger.kernel.org
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251216161755.1775409-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
iPad Pro (12.9-inch) uses DWI backlight, while the 9.7-inch model does not.
Add DWI backlight node for s8001 and enable it for J98a and J99a.
Signed-off-by: Nick Chan <towinchenmi@gmail.com>
Link: https://patch.msgid.link/20251231-b4-j98a-j99a-dwi-bl-v1-1-24793c2b99fc@gmail.com
Signed-off-by: Sven Peter <sven@kernel.org>
|
|
Don't read guest CR3 in kvm_arch_setup_async_pf() if the MMU is direct
and use INVALID_GPA instead.
When KVM tries to perform the host-only async page fault for the shared
memory of TDX guests, the following WARNING is triggered:
WARNING: CPU: 1 PID: 90922 at arch/x86/kvm/vmx/main.c:483 vt_cache_reg+0x16/0x20
Call Trace:
__kvm_mmu_faultin_pfn
kvm_mmu_faultin_pfn
kvm_tdp_page_fault
kvm_mmu_do_page_fault
kvm_mmu_page_fault
tdx_handle_ept_violation
This WARNING is triggered when calling kvm_mmu_get_guest_pgd() to cache
the guest CR3 in kvm_arch_setup_async_pf() for later use in
kvm_arch_async_page_ready() to determine if it's possible to fix the
page fault in the current vCPU context to save one VM exit. However, when
guest state is protected, KVM cannot read the guest CR3.
Since protected guests aren't compatible with shadow paging, i.e, they
must use direct MMU, avoid calling kvm_mmu_get_guest_pgd() to read guest
CR3 when the MMU is direct and use INVALID_GPA instead.
Note that for protected guests mmu->root_role.direct is always true, so
that kvm_mmu_get_guest_pgd() in kvm_arch_async_page_ready() won't be
reached.
Reported-by: Farrah Chen <farrah.chen@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20251212135051.2155280-1-xiaoyao.li@intel.com
[sean: explicitly cast to "unsigned long" to make 32-bit builds happy]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
PV MSR
Return KVM_MSR_RET_UNSUPPORTED instead of '1' (which for all intents and
purposes means "invalid") when rejecting accesses to KVM PV MSRs to adhere
to KVM's ABI of allowing host reads and writes of '0' to MSRs that are
advertised to userspace via KVM_GET_MSR_INDEX_LIST, even if the vCPU model
doesn't support the MSR.
E.g. running a QEMU VM with
-cpu host,-kvmclock,kvm-pv-enforce-cpuid
yields:
qemu: error: failed to set MSR 0x12 to 0x0
qemu: target/i386/kvm/kvm.c:3301: kvm_buf_set_msrs:
Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20251230205948.4094097-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
For consistency with commit 7afe79f5734a ("KVM: nVMX: Mark vmcs12's APIC
access page dirty when unmapping"), which marks the page dirty during
unmap operations, also mark it dirty during vmcs12 page synchronization.
Signed-off-by: Fred Griffoul <fgriffo@amazon.co.uk>
[sean: use kvm_vcpu_map_mark_dirty()]
Link: https://patch.msgid.link/20251121223444.355422-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move nested_mark_vmcs12_pages_dirty() to vmx.c now that it's only used in
the VM-Exit path, and add "all" to its name to document that its purpose
is to mark all (mapped-out-of-band) vmcs12 pages as dirty.
No functional change intended.
Link: https://patch.msgid.link/20251121223444.355422-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly mark the vmcs12 vAPIC and PI descriptor pages as dirty when
delivering a nested posted interrupt instead of marking all vmcs12 pages
as dirty. This will allow marking the APIC access page (and any future
vmcs12 pages) as dirty in nested_mark_vmcs12_pages_dirty() without over-
dirtying in the nested PI case. Manually marking the vAPIC and PID pages
as dirty also makes the flow a bit more self-documenting, e.g. it's not
obvious at first glance that vmx->nested.pi_desc is actually a host kernel
mapping of a vmcs12 page.
No functional change intended.
Link: https://patch.msgid.link/20251121223444.355422-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Mark vmcs12 pages as dirty (in KVM's dirty log bitmap) if and only if the
page is mapped, i.e. if the page is actually "active" in vmcs02. For some
pages, KVM simply disables the associated VMCS control if the vmcs12 page
is unreachable, i.e. it's possible for nested VM-Enter to succeed with a
"bad" vmcs12 page.
Link: https://patch.msgid.link/20251121223444.355422-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Extend mediated PMU support for Intel CPUs without support for saving
PERF_GLOBAL_CONTROL into the guest VMCS field on VM-Exit, e.g. for Skylake
and its derivatives, as well as Icelake. While supporting CPUs without
VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL isn't completely trivial, it's not that
complex either. And not supporting such CPUs would mean not supporting 7+
years of Intel CPUs released in the past 10 years.
On VM-Exit, immediately propagate the saved PERF_GLOBAL_CTRL to the VMCS
as well as KVM's software cache so that KVM doesn't need to add full EXREG
tracking of PERF_GLOBAL_CTRL. In practice, the vast majority of VM-Exits
won't trigger software writes to guest PERF_GLOBAL_CTRL, so deferring the
VMWRITE to the next VM-Enter would only delay the inevitable without
batching/avoiding VMWRITEs.
Note! Take care to refresh VM_EXIT_MSR_STORE_COUNT on nested VM-Exit, as
it's unfortunately possible that KVM could recalculate MSR intercepts
while L2 is active, e.g. if userspace loads nested state and _then_ sets
PERF_CAPABILITIES. Eating the VMWRITE on every nested VM-Exit is
unfortunate, but that's a pre-existing problem and can/should be solved
separately, e.g. modifying the number of auto-load entries while L2 is
active is also uncommon on modern CPUs.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Initialize vmcs01.VM_EXIT_MSR_STORE_ADDR to point at the vCPU's
msr_autostore list in anticipation of utilizing the auto-store
functionality, and to harden KVM against stray reads to pfn 0 (or, in
theory, a random pfn if the underlying CPU uses a complex scheme for
encoding VMCS data). The MSR auto lists are supposed to be ignored if the
associated COUNT VMCS field is '0', but leaving the ADDR field
zero-initialized in memory is an unnecessary risk (albeit a minuscule risk)
given that the cost is a single VMWRITE during vCPU creation.
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a helper to add an MSR to a VMCS's "auto" list to deduplicate the code
in add_atomic_switch_msr(), and so that the functionality can be used in
the future for managing the MSR auto-store list.
No functional change intended.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-43-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Undo the bundling of the "host" and "guest" MSR auto-load list logic so
that the code can be deduplicated by factoring out the logic to a separate
helper. Now that "list full" situations are treated as fatal to the VM,
there is no need to pre-check both lists.
For all intents and purposes, this reverts the add_atomic_switch_msr()
changes made by commit 3190709335dd ("x86/KVM/VMX: Separate the VMX
AUTOLOAD guest/host number accounting").
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When adding an MSR to the auto-load lists, update the MSR index in the
list entry if and only if a new entry is being inserted, as 'i' can only
be non-negative if vmx_find_loadstore_msr_slot() found an entry with the
MSR's index. Unnecessarily setting the index is benign, but it makes it
harder to see that updating the value is necessary even when an existing
entry for the MSR was found.
No functional change intended.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN and bug the VM if either MSR auto-load list is full when adding an
MSR to the lists, as the set of MSRs that KVM loads via the lists is
finite and entirely KVM controlled, i.e. overflowing the lists shouldn't
be possible in a fully released version of KVM. Terminate the VM as the
core KVM infrastructure has no insight as to _why_ an MSR is being added
to the list, and failure to load an MSR on VM-Enter and/or VM-Exit could
be fatal to the host. E.g. running the host with a guest-controlled PEBS
MSR could generate unexpected writes to the DS buffer and crash the host.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the "on VM-Enter only" parameter from add_atomic_switch_msr() as it
is no longer used, and for all intents and purposes was never used. The
functionality was added, under embargo, by commit 989e3992d2ec
("x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs"),
and then ripped out by commit 2f055947ae5e ("x86/kvm: Drop L1TF MSR list
approach") just a few commits later.
2f055947ae5e x86/kvm: Drop L1TF MSR list approach
72c6d2db64fa x86/litf: Introduce vmx status variable
215af5499d9e cpu/hotplug: Online siblings when SMT control is turned on
390d975e0c4e x86/KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required
989e3992d2ec x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs
Furthermore, it's extremely unlikely KVM will ever _need_ to load an MSR
value via the auto-load lists only on VM-Enter. MSRs writes via the lists
aren't optimized in any way, and so the only reason to use the lists
instead of a WRMSR are for cases where the MSR _must_ be load atomically
with respect to VM-Enter (and/or VM-Exit). While one could argue that
command MSRs, e.g. IA32_FLUSH_CMD, "need" to be done exact at VM-Enter, in
practice doing such flushes within a few instructons of VM-Enter is more
than sufficient.
Note, the shortlog and changelog for commit 390d975e0c4e ("x86/KVM/VMX: Use
MSR save list for IA32_FLUSH_CMD if required") are misleading and wrong.
That commit added MSR_IA32_FLUSH_CMD to the VM-Enter _load_ list, not the
VM-Enter save list (which doesn't exist, only VM-Exit has a store/save
list).
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-39-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a helper to remove an MSR from an auto-{load,store} list to dedup the
msr_autoload code, and in anticipation of adding similar functionality for
msr_autostore.
No functional change intended.
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-38-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rework nVMX's use of the MSR auto-store list to snapshot TSC to sneak
MSR_IA32_TSC into the list _without_ updating KVM's software tracking,
and drop the generic functionality so that future usage of the store list
for nested specific logic needs to consider the implications of modifying
the list. Updating the list only for vmcs02 and only on nested VM-Enter
is a disaster waiting to happen, as it means vmcs01 is stale relative to
the software tracking, and KVM could unintentionally leave an MSR in the
store list in perpetuity while running L1, e.g. if KVM addressed the first
issue and updated vmcs01 on nested VM-Exit without removing TSC from the
list.
Furthermore, mixing KVM's desire to save an MSR with L1's desire to save
an MSR result KVM clobbering/ignoring the needs of vmcs01 or vmcs02.
E.g. if KVM added MSR_IA32_TSC to the store list for its own purposes, and
then _removed_ MSR_IA32_TSC from the list after emulating nested VM-Enter,
then KVM would remove MSR_IA32_TSC from the list even though saving TSC on
VM-Exit from L2 is still desirable (to provide L1 with an accurate TSC).
Similarly, removing an MSR from the list based on vmcs12's settings could
drop an MSR that KVM wants to save for its own purposes.
In practice, the issues are currently benign, because KVM doesn't use the
store list for vmcs01. But that will change with upcoming mediated PMU
support.
Alternatively, a "full" solution would be to track MSR list entries for
vmcs12 separately from KVM's standard lists, but MSR_IA32_TSC is likely
the only MSR that KVM would ever want to save on _every_ VM-Exit purely
based on vmcs12. I.e. the added complexity isn't remotely justified at
this time.
Opportunistically escalate from a pr_warn_ratelimited() to a full WARN as
KVM reserves eight entries in each MSR list, and as above KVM uses at most
one entry.
Opportunistically make vmx_find_loadstore_msr_slot() local to vmx.c as
using it directly from nested code is unsafe due to the potential for
mixing vmcs01 and vmcs02 state (see above).
Cc: Jim Mattson <jmattson@google.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-37-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the intermediate "guest" field from vcpu_vmx.msr_autostore as the
value saved on VM-Exit isn't guaranteed to be the guest's value, it's
purely whatever is in hardware at the time of VM-Exit. E.g. KVM's only
use of the store list at the momemnt is to snapshot TSC at VM-Exit, and
the value saved is always the raw TSC even if TSC-offseting and/or
TSC-scaling is enabled for the guest.
And unlike msr_autoload, there is no need differentiate between "on-entry"
and "on-exit".
No functional change intended.
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-36-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When loading a mediated PMU state, elide the WRMSRs to load PMCs with the
guest's value if the value in hardware already matches the guest's value.
For the relatively common case where neither the guest nor the host is
actively using the PMU, i.e. when all/many counters are '0', eliding the
WRMSRs reduces the latency of handling VM-Exit by a measurable amount
(WRMSR is significantly more expensive than RDPMC).
As measured by KVM-Unit-Tests' CPUID VM-Exit testcase, this provides a
a ~25% reduction in latency (4k => 3k cycles) on Intel Emerald Rapids,
and a ~13% reduction (6.2k => 5.3k cycles) on AMD Turin.
Cc: Manali Shukla <manali.shukla@amd.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-35-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Expose enable_mediated_pmu parameter to user space, i.e. allow userspace
to enable/disable mediated vPMU support.
Document the mediated versus perf-based behavior as part of the
kernel-parameters.txt entry, and opportunistically add an entry for the
core enable_pmu param as well.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-34-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add MSRs that might be passed through to L1 when running with a mediated
PMU to the nested SVM's set of to-be-merged MSR indices, i.e. disable
interception of PMU MSRs when running L2 if both KVM (L0) and L1 disable
interception. There is no need for KVM to interpose on such MSR accesses,
e.g. if L1 exposes a mediated PMU (or equivalent) to L2.
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Merge KVM's PMU MSR interception bitmaps with those of L1, i.e. merge the
bitmaps of vmcs01 and vmcs12, e.g. so that KVM doesn't interpose on MSR
accesses unnecessarily if L1 exposes a mediated PMU (or equivalent) to L2.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
[sean: rewrite changelog and comment, omit MSRs that are always intercepted]
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add macros nested_vmx_merge_msr_bitmaps_xxx() to simplify nested MSR
interception setting. No function change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Mediated vPMU needs to accumulate the emulated instructions into counter
and load the counter into HW at vm-entry.
Moreover, if the accumulation leads to counter overflow, KVM needs to
update GLOBAL_STATUS and inject PMI into guest as well.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't handle exits in the fastpath if emulation is required, i.e. if an
instruction needs to be skipped, the mediated PMU is enabled, and one or
more PMCs is counting instructions. With the mediated PMU, KVM's cache of
PMU state is inconsistent with respect to hardware until KVM exits the
inner run loop (when the mediated PMU is "put").
Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Implement the PMU "world switch" between host perf and guest mediated PMU.
When loading guest state, call into perf to switch from host to guest, and
then load guest state into hardware, and then reverse those actions when
putting guest state.
On the KVM side, when loading guest state, zero PERF_GLOBAL_CTRL to ensure
all counters are disabled, then load selectors and counters, and finally
call into vendor code to load control/status information. While VMX and
SVM use different mechanisms to avoid counting host activity while guest
controls are loaded, both implementations require PERF_GLOBAL_CTRL to be
zeroed when the event selectors are in flux.
When putting guest state, reverse the order, and save and zero controls
and status prior to saving+zeroing selectors and counters. Defer clearing
PERF_GLOBAL_CTRL to vendor code, as only SVM needs to manually clear the
MSR; VMX configures PERF_GLOBAL_CTRL to be atomically cleared by the CPU
on VM-Exit.
Handle the difference in MSR layouts between Intel and AMD by communicating
the bases and stride via kvm_pmu_ops. Because KVM requires Intel v4 (and
full-width writes) and AMD v2, the MSRs to load/save are constant for a
given vendor, i.e. do not vary based on the guest PMU, and do not vary
based on host PMU (because KVM will simply disable mediated PMU support if
the necessary MSRs are unsupported).
Except for retrieving the guest's PERF_GLOBAL_CTRL, which needs to be read
before invoking any fastpath handler (spoiler alert), perform the context
switch around KVM's inner run loop. State only needs to be synchronized
from hardware before KVM can access the software "caches".
Note, VMX already grabs the guest's PERF_GLOBAL_CTRL immediately after
VM-Exit, as hardware saves value into the VMCS.
Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|