kernel/linux.git/arch/arm64/kvm, branch v6.18.21

KVM: arm64: Discard PC update state on vcpu reset

2026-04-02T11:23:19+00:00

commit 1744a6ef48b9a48f017e3e1a0d05de0a6978396e upstream. Our vcpu reset suffers from a particularly interesting flaw, as it does not correctly deal with state that will have an effect on the execution flow out of reset. Take the following completely random example, never seen in the wild and that never resulted in a couple of sleepless nights: /s - vcpu-A issues a PSCI_CPU_OFF using the SMC conduit - SMC being a trapped instruction (as opposed to HVC which is always normally executed), we annotate the vcpu as needing to skip the next instruction, which is the SMC itself - vcpu-A is now safely off - vcpu-B issues a PSCI_CPU_ON for vcpu-A, providing a starting PC - vcpu-A gets reset, get the new PC, and is sent on its merry way - right at the point of entering the guest, we notice that a PC increment is pending (remember the earlier SMC?) - vcpu-A skips its first instruction... What could possibly go wrong? Well, I'm glad you asked. For pKVM as a NV guest, that first instruction is extremely significant, as it indicates whether the CPU is booting or resuming. Having skipped that instruction, nothing makes any sense anymore, and CPU hotplugging fails. This is all caused by the decoupling of PC update from the handling of an exception that triggers such update, making it non-obvious what affects what when. Fix this train wreck by discarding all the PC-affecting state on vcpu reset. Fixes: f5e30680616ab ("KVM: arm64: Move __adjust_pc out of line") Cc: stable@vger.kernel.org Reviewed-by: Suzuki K Poulose Reviewed-by: Joey Gouly Link: https://patch.msgid.link/20260312140850.822968-1-maz@kernel.org Signed-off-by: Marc Zyngier Signed-off-by: Greg Kroah-Hartman

KVM: arm64: Eagerly init vgic dist/redist on vgic creation

2026-03-19T15:08:49+00:00

[ Upstream commit ac6769c8f948dff33265c50e524aebf9aa6f1be0 ] If vgic_allocate_private_irqs_locked() fails for any odd reason, we exit kvm_vgic_create() early, leaving dist->rd_regions uninitialised. kvm_vgic_dist_destroy() then comes along and walks into the weeds trying to free the RDs. Got to love this stuff. Solve it by moving all the static initialisation early, and make sure that if we fail halfway, we're in a reasonable shape to perform the rest of the teardown. While at it, reset the vgic model on failure, just in case... Reported-by: syzbot+f6a46b038fc243ac0175@syzkaller.appspotmail.com Tested-by: syzbot+f6a46b038fc243ac0175@syzkaller.appspotmail.com Fixes: b3aa9283c0c50 ("KVM: arm64: vgic: Hoist SGI/PPI alloc from vgic_init() to kvm_create_vgic()") Link: https://lore.kernel.org/r/69a2d58c.050a0220.3a55be.003b.GAE@google.com Link: https://patch.msgid.link/20260228164559.936268-1-maz@kernel.org Signed-off-by: Marc Zyngier Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

KVM: arm64: gic: Set vgic_model before initing private IRQs

2026-03-19T15:08:49+00:00

[ Upstream commit 9435c1e1431003e23aa34ef8e46c30d09c3dbcb5 ] Different GIC types require the private IRQs to be initialised differently. GICv5 is the culprit as it supports both a different number of private IRQs, and all of these are PPIs (there are no SGIs). Moreover, as GICv5 uses the top bits of the interrupt ID to encode the type, the intid also needs to computed differently. Up until now, the GIC model has been set after initialising the private IRQs for a VCPU. Move this earlier to ensure that the GIC model is available when configuring the private IRQs. While we're at it, also move the setting of the in_kernel flag and implementation revision to keep them grouped together as before. Signed-off-by: Sascha Bischoff Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/20260128175919.3828384-7-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier Stable-dep-of: ac6769c8f948 ("KVM: arm64: Eagerly init vgic dist/redist on vgic creation") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

KVM: arm64: pkvm: Fallback to level-3 mapping on host stage-2 fault

2026-03-19T15:08:25+00:00

commit 8531d5a83d8eb8affb5c0249b466c28d94192603 upstream. If, for any odd reason, we cannot converge to mapping size that is completely contained in a memblock region, we fail to install a S2 mapping and go back to the faulting instruction. Rince, repeat. This happens when faulting in regions that are smaller than a page or that do not have PAGE_SIZE-aligned boundaries (as witnessed on an O6 board that refuses to boot in protected mode). In this situation, fallback to using a PAGE_SIZE mapping anyway -- it isn't like we can go any lower. Fixes: e728e705802fe ("KVM: arm64: Adjust range correctly during host stage-2 faults") Link: https://lore.kernel.org/r/86wlzr77cn.wl-maz@kernel.org Cc: stable@vger.kernel.org Cc: Quentin Perret Reviewed-by: Quentin Perret Link: https://patch.msgid.link/20260305132751.2928138-1-maz@kernel.org Signed-off-by: Marc Zyngier Signed-off-by: Greg Kroah-Hartman

KVM: arm64: Fix protected mode handling of pages larger than 4kB

2026-03-19T15:08:24+00:00

commit 08f97454b7fa39bfcf82524955c771d2d693d6fe upstream. Since 3669ddd8fa8b5 ("KVM: arm64: Add a range to pkvm_mappings"), pKVM tracks the memory that has been mapped into a guest in a side data structure. Crucially, it uses it to find out whether a page has already been mapped, and therefore refuses to map it twice. So far, so good. However, this very patch completely breaks non-4kB page support, with guests being unable to boot. The most obvious symptom is that we take the same fault repeatedly, and not making forward progress. A quick investigation shows that this is because of the above rejection code. As it turns out, there are multiple issues at play: - while the HPFAR_EL2 register gives you the faulting IPA minus the bottom 12 bits, it will still give you the extra bits that are part of the page offset for anything larger than 4kB, even for a level-3 mapping - pkvm_pgtable_stage2_map() assumes that the address passed as a parameter is aligned to the size of the intended mapping - the faulting address is only aligned for a non-page mapping When the planets are suitably aligned (pun intended), the guest faults on a page by accessing it past the bottom 4kB, and extra bits get set in the HPFAR_EL2 register. If this results in a page mapping (which is likely with large granule sizes), nothing aligns it further down, and pkvm_mapping_iter_first() finds an intersection that doesn't really exist. We assume this is a spurious fault and return -EAGAIN. And again... This doesn't hit outside of the protected code, as the page table code always aligns the IPA down to a page boundary, hiding the issue for everyone else. Fix it by always forcing the alignment on vma_pagesize, irrespective of the value of vma_pagesize. Fixes: 3669ddd8fa8b5 ("KVM: arm64: Add a range to pkvm_mappings") Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Signed-off-by: Marc Zyngier Link: https://https://patch.msgid.link/20260222141000.3084258-1-maz@kernel.org Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

KVM: arm64: Fix ID register initialization for non-protected pKVM guests

2026-03-12T11:09:08+00:00

[ Upstream commit 7e7c2cf0024d89443a7af52e09e47b1fe634ab17 ] In protected mode, the hypervisor maintains a separate instance of the `kvm` structure for each VM. For non-protected VMs, this structure is initialized from the host's `kvm` state. Currently, `pkvm_init_features_from_host()` copies the `KVM_ARCH_FLAG_ID_REGS_INITIALIZED` flag from the host without the underlying `id_regs` data being initialized. This results in the hypervisor seeing the flag as set while the ID registers remain zeroed. Consequently, `kvm_has_feat()` checks at EL2 fail (return 0) for non-protected VMs. This breaks logic that relies on feature detection, such as `ctxt_has_tcrx()` for TCR2_EL1 support. As a result, certain system registers (e.g., TCR2_EL1, PIR_EL1, POR_EL1) are not saved/restored during the world switch, which could lead to state corruption. Fix this by explicitly copying the ID registers from the host `kvm` to the hypervisor `kvm` for non-protected VMs during initialization, since we trust the host with its non-protected guests' features. Also ensure `KVM_ARCH_FLAG_ID_REGS_INITIALIZED` is cleared initially in `pkvm_init_features_from_host` so that `vm_copy_id_regs` can properly initialize them and set the flag once done. Fixes: 41d6028e28bd ("KVM: arm64: Convert the SVE guest vcpu flag to a vm flag") Signed-off-by: Fuad Tabba Link: https://patch.msgid.link/20260213143815.1732675-4-tabba@google.com Signed-off-by: Marc Zyngier Signed-off-by: Sasha Levin

KVM: arm64: Hide S1POE from guests when not supported by the host

2026-03-12T11:09:08+00:00

[ Upstream commit f66857bafd4f151c5cc6856e47be2e12c1721e43 ] When CONFIG_ARM64_POE is disabled, KVM does not save/restore POR_EL1. However, ID_AA64MMFR3_EL1 sanitisation currently exposes the feature to guests whenever the hardware supports it, ignoring the host kernel configuration. If a guest detects this feature and attempts to use it, the host will fail to context-switch POR_EL1, potentially leading to state corruption. Fix this by masking ID_AA64MMFR3_EL1.S1POE in the sanitised system registers, preventing KVM from advertising the feature when the host does not support it (i.e. system_supports_poe() is false). Fixes: 70ed7238297f ("KVM: arm64: Sanitise ID_AA64MMFR3_EL1") Signed-off-by: Fuad Tabba Link: https://patch.msgid.link/20260213143815.1732675-2-tabba@google.com Signed-off-by: Marc Zyngier Signed-off-by: Sasha Levin

KVM: arm64: nv: Return correct RES0 bits for FGT registers

2026-03-04T12:21:12+00:00

[ Upstream commit 2eb80a2eee18762a33aa770d742d64fe47852c7e ] We had extended the sysreg masking infrastructure to more general registers, instead of restricting it to VNCR-backed registers, since commit a0162020095e ("KVM: arm64: Extend masking facility to arbitrary registers"). Fix kvm_get_sysreg_res0() to reflect this fact. Note that we're sure that we only deal with FGT registers in kvm_get_sysreg_res0(), the if (sr < __VNCR_START__) is actually a never false, which should probably be removed later. Fixes: 69c19e047dfe ("KVM: arm64: Add TCR2_EL2 to the sysreg arrays") Signed-off-by: Zenghui Yu (Huawei) Link: https://patch.msgid.link/20260121101631.41037-1-zenghui.yu@linux.dev Signed-off-by: Marc Zyngier Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin

KVM: arm64: VHE: Compute fgt traps before activating them

2025-11-12T10:52:58+00:00

On VHE, the Fine Grain Traps registers are written to hardware in kvm_arch_vcpu_load()->..->__activate_traps_hfgxtr(), but the fgt array is computed later, in kvm_vcpu_load_fgt(). This can lead to zero being written to the FGT registers the first time a VCPU is loaded. Also, any changes to the fgt array will be visible only after the VCPU is scheduled out, and then back in, which is not the intended behaviour. Fix it by computing the fgt array just before the fgt traps are written to hardware. Fixes: fb10ddf35c1c ("KVM: arm64: Compute per-vCPU FGTs at vcpu_load()") Signed-off-by: Alexandru Elisei Reviewed-by: Oliver Upton Link: https://patch.msgid.link/20251112102853.47759-1-alexandru.elisei@arm.com Signed-off-by: Marc Zyngier

KVM: arm64: Finalize ID registers only once per VM

2025-11-11T12:24:22+00:00

Owing to the ID registers being global to the VM, there is no point in computing them more than once. However, recent changes making use of kvm_set_vm_id_reg() outlined that we repeatedly hammer the ID registers when we shouldn't. Gate the ID reg update on the VM having never run. Fixes: 50e7cce81b9b2 ("KVM: arm64: Limit clearing of ID_{AA64PFR0,PFR1}_EL1.GIC to userspace irqchip") Fixes: 5cb57a1aff755 ("KVM: arm64: Zero ID_AA64PFR0_EL1.GIC when no GICv3 is presented to the guest") Closes: https://lore.kernel.org/r/aRHf6x5umkTYhYJ3@finisterre.sirena.org.uk Reported-by: Mark Brown Tested-by: Mark Brown Link: https://patch.msgid.link/20251110173010.1918424-1-maz@kernel.org Signed-off-by: Marc Zyngier