path: root/arch
2025-04-29  riscv: switch set_icache_stale_mask() to using non-atomic assign_cpu() (Yury Norov, 1 file, -1/+1)
The atomic cpumask_assign_cpu() follows non-atomic cpumask_setall(), which makes the whole operation non-atomic. Fix this by relaxing to non-atomic __assign_cpu(). Fixes: 7c1e5b9690b0e14 ("riscv: Disable preemption while handling PR_RISCV_CTX_SW_FENCEI_OFF") Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
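For illustration, a minimal sketch of the pattern being fixed. The __assign_cpu(cpu, mask, value) signature is an assumption based on the changelog, by analogy with assign_bit()/__assign_bit(); this is not the exact upstream hunk:

    #include <linux/cpumask.h>

    /*
     * Once the non-atomic cpumask_setall() has touched the mask, making the
     * single per-CPU update atomic buys nothing, so the non-atomic variant
     * is sufficient and cheaper.
     */
    static void set_icache_stale_sketch(struct cpumask *mask, unsigned int cpu, bool stale)
    {
            cpumask_setall(mask);           /* non-atomic bulk write */
            __assign_cpu(cpu, mask, stale); /* keep the per-CPU update non-atomic too */
    }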
2025-04-29  x86/bugs: Restructure L1TF mitigation (David Kaplan, 3 files, -6/+22)
Restructure L1TF to use select/apply functions to create consistent vulnerability handling. Define new AUTO mitigation for L1TF. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-16-david.kaplan@amd.com
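The select/apply split referred to throughout this series looks roughly like the sketch below; the function bodies are schematic, based on the changelogs rather than the exact tree:

    /* Phase 1: decide on a mitigation, honouring the command line. */
    static void __init l1tf_select_mitigation(void)
    {
            if (!boot_cpu_has_bug(X86_BUG_L1TF) || cpu_mitigations_off())
                    l1tf_mitigation = L1TF_MITIGATION_OFF;
            else if (l1tf_mitigation == L1TF_MITIGATION_AUTO)
                    l1tf_mitigation = L1TF_MITIGATION_FLUSH;
    }

    /* Phase 2: act on the final choice (possibly updated in between). */
    static void __init l1tf_apply_mitigation(void)
    {
            if (l1tf_mitigation == L1TF_MITIGATION_FULL_FORCE)
                    cpu_smt_disable(true);  /* e.g. force SMT off */
    }

Keeping selection and application separate lets one vulnerability's final choice (e.g. retbleed) update another's (e.g. spectre_v2) before anything is committed to hardware.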
2025-04-29  x86/bugs: Restructure SSB mitigation (David Kaplan, 1 file, -19/+17)
Restructure SSB to use select/apply functions to create consistent vulnerability handling. Remove __ssb_select_mitigation() and split the functionality between the select/apply functions. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-15-david.kaplan@amd.com
2025-04-29  x86/bugs: Restructure spectre_v2 mitigation (David Kaplan, 1 file, -45/+56)
Restructure spectre_v2 to use select/update/apply functions to create consistent vulnerability handling. The spectre_v2 mitigation may be updated based on the selected retbleed mitigation. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-14-david.kaplan@amd.com
2025-04-29  x86/bugs: Restructure BHI mitigation (David Kaplan, 1 file, -4/+27)
Restructure BHI mitigation to use select/update/apply functions to create consistent vulnerability handling. BHI mitigation was previously selected from within spectre_v2_select_mitigation() and now is selected from cpu_select_mitigation() like with all others. Define new AUTO mitigation for BHI. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-13-david.kaplan@amd.com
2025-04-29  x86/bugs: Restructure spectre_v2_user mitigation (David Kaplan, 1 file, -74/+94)
Restructure spectre_v2_user to use select/update/apply functions to create consistent vulnerability handling. The IBPB/STIBP choices are first decided based on the spectre_v2_user command line but can be modified by the spectre_v2 command line option as well. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-12-david.kaplan@amd.com
2025-04-29  arm64: dts: st: Use 128kB size for aliased GIC400 register access on stm32mp23 SoCs (Christian Bruel, 1 file, -3/+3)
Adjust the size of 8kB GIC regions to 128kB so that each 4kB is mapped 16 times over a 64kB region. The offset is then adjusted in the irq-gic driver. See commit 12e14066f4835 ("irqchip/GIC: Add workaround for aliased GIC400") Fixes: e9b03ef21386e ("arm64: dts: st: introduce stm32mp23 SoCs family") Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Christian Bruel <christian.bruel@foss.st.com> Link: https://lore.kernel.org/r/20250415111654.2103767-7-christian.bruel@foss.st.com Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-04-29  arm64: dts: st: Adjust interrupt-controller for stm32mp23 SoCs (Christian Bruel, 1 file, -2/+1)
Use gic-400 compatible and remove address-cells = <1> for aarch64 Fixes: e9b03ef21386e ("arm64: dts: st: introduce stm32mp23 SoCs family") Signed-off-by: Christian Bruel <christian.bruel@foss.st.com> Link: https://lore.kernel.org/r/20250415111654.2103767-6-christian.bruel@foss.st.com Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-04-29  arm64: dts: st: Use 128kB size for aliased GIC400 register access on stm32mp21 SoCs (Christian Bruel, 1 file, -3/+3)
Adjust the size of 8kB GIC regions to 128kB so that each 4kB is mapped 16 times over a 64kB region. The offset is then adjusted in the irq-gic driver. See commit 12e14066f4835 ("irqchip/GIC: Add workaround for aliased GIC400") Fixes: 7a57b1bb1afbf ("arm64: dts: st: introduce stm32mp21 SoCs family") Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Christian Bruel <christian.bruel@foss.st.com> Link: https://lore.kernel.org/r/20250415111654.2103767-5-christian.bruel@foss.st.com Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-04-29  arm64: dts: st: Adjust interrupt-controller for stm32mp21 SoCs (Christian Bruel, 1 file, -1/+1)
Use gic-400 compatible for aarch64 Fixes: 7a57b1bb1afbf ("arm64: dts: st: introduce stm32mp21 SoCs family") Signed-off-by: Christian Bruel <christian.bruel@foss.st.com> Link: https://lore.kernel.org/r/20250415111654.2103767-4-christian.bruel@foss.st.com Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-04-29  arm64: dts: st: Use 128kB size for aliased GIC400 register access on stm32mp25 SoCs (Christian Bruel, 1 file, -3/+3)
Adjust the size of 8kB GIC regions to 128kB so that each 4kB is mapped 16 times over a 64kB region. The offset is then adjusted in the irq-gic driver. See commit 12e14066f4835 ("irqchip/GIC: Add workaround for aliased GIC400") Fixes: 5d30d03aaf785 ("arm64: dts: st: introduce stm32mp25 SoCs family") Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Christian Bruel <christian.bruel@foss.st.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250415111654.2103767-3-christian.bruel@foss.st.com Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-04-29  arm64: dts: st: Adjust interrupt-controller for stm32mp25 SoCs (Christian Bruel, 1 file, -2/+1)
Use gic-400 compatible and remove address-cells = <1> on aarch64 Fixes: 5d30d03aaf785 ("arm64: dts: st: introduce stm32mp25 SoCs family") Signed-off-by: Christian Bruel <christian.bruel@foss.st.com> Link: https://lore.kernel.org/r/20250415111654.2103767-2-christian.bruel@foss.st.com Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-04-29  Merge tag 'imx-fixes-6.15' of https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into arm/fixes (Arnd Bergmann, 4 files, -9/+53)
i.MX fixes for 6.15:
- An i.MX8MP change from Ahmad Fatoum to fix the broken nominal device tree caused by commit 9f7595b3e5ae ("arm64: dts: imx8mp: configure GPU and NPU clocks to overdrive rate")
- A MAINTAINERS update from Michael Riesch to exclude Sony IMX image sensor drivers from the i.MX entry
- An i.MX95 device tree change from Richard Zhu to correct the range of the PCIe app-reg region
- An opos6ul device tree change from Sébastien Szymanski to fix an Ethernet regression caused by commit c7e73b5051d6 ("ARM: imx: mach-imx6ul: remove 14x14 EVK specific PHY fixup")
- An imx8mm-verdin device tree change from Wojciech Dubowik to fix an SD card regression caused by commit f5aab0438ef1 ("regulator: pca9450: Fix enable register for LDO5")
* tag 'imx-fixes-6.15' of https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
  arm64: dts: imx8mm-verdin: Link reg_usdhc2_vqmmc to usdhc2
  MAINTAINERS: add exclude for dt-bindings to imx entry
  ARM: dts: opos6ul: add ksz8081 phy properties
  arm64: dts: imx95: Correct the range of PCIe app-reg region
  arm64: dts: imx8mp: configure GPU and NPU clocks in nominal DTSI
2025-04-29  Merge tag 'juno-fix-6.15' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes (Arnd Bergmann, 1 file, -11/+11)
Armv8 Morello fix for v6.15: just a single fix addressing the cache node inconsistencies. It removes the unnecessary CPU number from the L2 cache node names, since they are local to CPU nodes and should simply be named "l2-cache", and relocates the shared L3 cache node from under cpu@0/l2-cache to the /cpus node, which is the standard location for shared caches.
* tag 'juno-fix-6.15' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
  arm64: dts: morello: Fix-up cache nodes
2025-04-29  KVM: x86: Unify cross-vCPU IBPB (Sean Christopherson, 6 files, -47/+25)
Both SVM and VMX have similar implementations for executing an IBPB between running different vCPUs on the same CPU to create separate prediction domains for different vCPUs. For VMX, when the currently loaded VMCS is changed in vmx_vcpu_load_vmcs(), an IBPB is executed if there is no 'buddy', which is the case on vCPU load. The intention is to execute an IBPB when switching vCPUs, but not when switching the VMCS within the same vCPU. Executing an IBPB on nested transitions within the same vCPU is handled separately and conditionally in nested_vmx_vmexit(). For SVM, the current VMCB is tracked on vCPU load and an IBPB is executed when it is changed. The intention is also to execute an IBPB when switching vCPUs, although it is possible that in some cases an IBPB is executed when switching VMCBs for the same vCPU. Executing an IBPB on nested transitions should be handled separately, and is proposed at [1]. Unify the logic by tracking the last loaded vCPU and executing the IBPB on vCPU change in kvm_arch_vcpu_load() instead. When a vCPU is destroyed, make sure all references to it are removed from any CPU. This is similar to how SVM clears the current_vmcb tracking on vCPU destruction. Remove the current VMCB tracking in SVM as it is no longer required, as well as the 'buddy' parameter to vmx_vcpu_load_vmcs(). [1] https://lore.kernel.org/lkml/20250221163352.3818347-4-yosry.ahmed@linux.dev Link: https://lore.kernel.org/all/20250320013759.3965869-1-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> [sean: tweak comment to stay at/under 80 columns] Signed-off-by: Sean Christopherson <seanjc@google.com>
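A rough sketch of the unified tracking described above; the per-CPU variable name and helper are assumptions, not the exact upstream identifiers:

    /* Last vCPU that ran on this physical CPU. */
    static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_last_vcpu);

    static void ibpb_on_vcpu_switch(struct kvm_vcpu *vcpu)
    {
            /* One IBPB per vCPU change, regardless of VMCS/VMCB switches. */
            if (__this_cpu_read(kvm_last_vcpu) != vcpu) {
                    indirect_branch_prediction_barrier();
                    __this_cpu_write(kvm_last_vcpu, vcpu);
            }
    }

On vCPU destruction, the stale pointers would be cleared with a for_each_possible_cpu() walk, mirroring what SVM did for current_vmcb.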
2025-04-29  KVM: SVM: Clear current_vmcb during vCPU free for all *possible* CPUs (Yosry Ahmed, 1 file, -1/+1)
When freeing a vCPU and thus its VMCB, clear current_vmcb for all possible CPUs, not just online CPUs, as it's theoretically possible a CPU could go offline and come back online in conjunction with KVM reusing the page for a new VMCB. Link: https://lore.kernel.org/all/20250320013759.3965869-1-yosry.ahmed@linux.dev Fixes: fd65d3142f73 ("kvm: svm: Ensure an IBPB on all affected CPUs when freeing a vmcb") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> [sean: split to separate patch, write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-29  arm64: pageattr: Explicitly bail out when changing permissions for vmalloc_huge mappings (Dev Jain, 1 file, -3/+3)
arm64 uses apply_to_page_range to change permissions for kernel vmalloc mappings, which does not support changing permissions for block mappings. This function will change permissions until it encounters a block mapping, and will bail out with a warning. Since there are no reports of this triggering, it implies that there are currently no cases of code doing a vmalloc_huge() followed by partial permission change. But this is a footgun waiting to go off, so let's detect it early and avoid the possibility of permissions in an intermediate state. So, explicitly disallow changing permissions for VM_ALLOW_HUGE_VMAP mappings. Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20250403052844.61818-1-dev.jain@arm.com Signed-off-by: Will Deacon <will@kernel.org>
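A sketch of the early bail-out, assuming it sits in arm64's change_memory_common()-style path (helper name hypothetical):

    #include <linux/vmalloc.h>

    static int reject_huge_vmalloc(unsigned long addr)
    {
            struct vm_struct *area = find_vm_area((void *)addr);

            /*
             * apply_to_page_range() cannot modify block mappings, so refuse
             * VM_ALLOW_HUGE_VMAP areas up front instead of bailing midway
             * with permissions left in an intermediate state.
             */
            if (!area || (area->flags & VM_ALLOW_HUGE_VMAP))
                    return -EINVAL;

            return 0;
    }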
2025-04-29  arm64: Extend pr_crit message on invalid FDT (Bartosz Szczepanek, 1 file, -5/+5)
Log the size in addition to the physical and virtual addresses. It has the potential to be helpful when the DTB exceeds the 2 MB limit. Initialize size to 0 to print out a sane value if fixmap_remap_fdt fails without setting the size. Signed-off-by: Bartosz Szczepanek <bsz@amazon.de> Link: https://lore.kernel.org/r/20250423084851.26449-1-bsz@amazon.de Signed-off-by: Will Deacon <will@kernel.org>
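Roughly what the extended diagnostic amounts to (format string illustrative, not the exact one merged):

    int size = 0;   /* prints a sane value if fixmap_remap_fdt() fails early */
    void *dt_virt = fixmap_remap_fdt(dt_phys, &size, PAGE_KERNEL_RO);

    if (!dt_virt)
            pr_crit("Invalid FDT at pa %pa (va %p, size %d)\n", &dt_phys, dt_virt, size);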
2025-04-29  arm64: Kconfig: remove unnecessary selection of CRC32 (Eric Biggers, 1 file, -1/+0)
The selection of CRC32 by ARM64 was added by commit 7481cddf29ed ("arm64/lib: add accelerated crc32 routines") as a workaround for the fact that, at the time, the CRC32 library functions used weak symbols to allow architecture-specific overrides. That only worked when CRC32 was built-in, and thus ARM64 was made to just force CRC32 to built-in. Now that the CRC32 library no longer uses weak symbols, that no longer applies. And the selection does not fulfill a user dependency either; those all have their own selections from other options. Therefore, the selection of CRC32 by ARM64 is no longer necessary. Remove it. Note that this does not necessarily result in CRC32 no longer being set to y, as it still tends to get selected by something else anyway. Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250414174018.6359-1-ebiggers@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2025-04-29  arm64: Add missing includes for mem_encrypt (Jason Gunthorpe, 2 files, -0/+4)
Doing: #include <linux/mem_encrypt.h> Causes a bunch of compiler failures due to missing implicit includes that don't happen on x86: ../arch/arm64/include/asm/rsi_cmds.h:117:2: error: call to undeclared library function 'memcpy' with type 'void *(void *, const void *, unsigned long)'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 117 | memcpy(&regs.a1, challenge, size); ../arch/arm64/include/asm/mem_encrypt.h:19:49: warning: declaration of 'struct device' will not be visible outside of this function [-Wvisibility] 19 | static inline bool force_dma_unencrypted(struct device *dev) ../arch/arm64/include/asm/rsi_cmds.h:44:38: error: call to undeclared function 'virt_to_phys'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 44 | arm_smccc_smc(SMC_RSI_REALM_CONFIG, virt_to_phys(cfg), Add the missing includes to the arch/arm headers to avoid this. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/0-v1-47aadfbd64cd+25795-arm_memenc_h_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org>
2025-04-29  arm64: Support ARM64_VA_BITS=52 when setting ARCH_MMAP_RND_BITS_MAX (Kornel Dulęba, 1 file, -3/+3)
When 52-bit virtual addressing was introduced, the select logic for ARCH_MMAP_RND_BITS_MAX was never updated to account for it. Because of that, the rnd max bits knob is set to the default value of 18 when ARM64_VA_BITS=52. Fix this by setting ARCH_MMAP_RND_BITS_MAX to the same value that would be used if 48-bit addressing was used. Higher values can't be used here because 52-bit addressing is used only if the caller provides a hint to mmap, with a fallback to 48-bit. The knob in question is an upper bound for what the user can set in /proc/sys/vm/mmap_rnd_bits, which in turn is used to determine how many random bits can be inserted into the base address used for mmap allocations. Since 48-bit allocations are legal with ARM64_VA_BITS=52, we need to make sure that the base address is small enough to facilitate this. Fixes: b6d00d47e81a ("arm64: mm: Introduce 52-bit Kernel VAs") Signed-off-by: Kornel Dulęba <korneld@google.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20250417114754.3238273-1-korneld@google.com Signed-off-by: Will Deacon <will@kernel.org>
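For context, the knob is consumed roughly like this by the generic mmap layout code (simplified):

    unsigned long arch_mmap_rnd(void)
    {
            unsigned long rnd;

            /* mmap_rnd_bits is bounded above by ARCH_MMAP_RND_BITS_MAX */
            rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);

            return rnd << PAGE_SHIFT;
    }

With the fix, the randomized base stays low enough that the default 48-bit allocations still fit even when ARM64_VA_BITS=52.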
2025-04-29  arm64: Expose AIDR_EL1 via sysfs (Oliver Upton, 2 files, -0/+4)
The KVM PV ABI recently added a feature that allows the VM to discover the set of physical CPU implementations, identified by a tuple of {MIDR_EL1, REVIDR_EL1, AIDR_EL1}. Unlike other KVM PV features, the expectation is that the VMM implements the hypercall instead of KVM as it has the authoritative view of where the VM gets scheduled. To do this the VMM needs to know the values of these registers on any CPU in the system. While MIDR_EL1 and REVIDR_EL1 are already exposed, AIDR_EL1 is not. Provide it in sysfs along with the other identification registers. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20250403231626.3181116-1-oliver.upton@linux.dev Signed-off-by: Will Deacon <will@kernel.org>
2025-04-29  arm64: vdso: Use __arch_counter_get_cntvct() (Breno Leitao, 1 file, -20/+2)
While reading how `cntvct_el0` was read in the kernel, I found that __arch_get_hw_counter() is doing something very similar to what __arch_counter_get_cntvct() is already doing. Use the existing __arch_counter_get_cntvct() function instead of duplicating similar inline assembly code in __arch_get_hw_counter(). Both functions were performing nearly identical operations to read the cntvct_el0 register. The only difference was that __arch_get_hw_counter() included a memory clobber in its inline assembly, which appears unnecessary in this context. This change simplifies the code by eliminating duplicate functionality and improves maintainability by centralizing the counter access logic in a single implementation. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20250407-arm-vdso-v1-1-7012de25b195@debian.org Signed-off-by: Will Deacon <will@kernel.org>
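Both helpers reduce to the same access; a simplified sketch of the retained form, ignoring the errata alternatives in the real header:

    static __always_inline u64 read_cntvct(void)
    {
            u64 cnt;

            /*
             * The isb keeps the counter read from being speculated early.
             * There is deliberately no "memory" clobber; that clobber was
             * the only extra thing the removed duplicate carried.
             */
            asm volatile("isb\n\tmrs %0, cntvct_el0" : "=r" (cnt));

            return cnt;
    }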
2025-04-29  arm64: enable PREEMPT_LAZY (Mark Rutland, 3 files, -8/+11)
For an architecture to enable CONFIG_ARCH_HAS_RESCHED_LAZY, two things are required: 1) Adding a TIF_NEED_RESCHED_LAZY flag definition 2) Checking for TIF_NEED_RESCHED_LAZY in the appropriate locations 2) is handled in a generic manner by CONFIG_GENERIC_ENTRY, which isn't (yet) implemented for arm64. However, outside of core scheduler code, TIF_NEED_RESCHED_LAZY only needs to be checked on a kernel exit, meaning: o return/entry to userspace. o return/entry to guest. The return/entry to a guest is all handled by xfer_to_guest_mode_handle_work() which already does the right thing, so it can be left as-is. arm64 doesn't use common entry's exit_to_user_mode_prepare(), so update its return to user path to check for TIF_NEED_RESCHED_LAZY and call into schedule() accordingly. Link: https://lore.kernel.org/linux-rt-users/20241216190451.1c61977c@mordecai.tesarici.cz/ Link: https://lore.kernel.org/all/xhsmh4j0fl0p3.mognet@vschneid-thinkpadt14sgen2i.remote.csb/ Signed-off-by: Mark Rutland <mark.rutland@arm.com> [testdrive, _TIF_WORK_MASK fixlet and changelog.] Signed-off-by: Mike Galbraith <efault@gmx.de> [Another round of testing; changelog faff] Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20250305104925.189198-2-vschneid@redhat.com Signed-off-by: Will Deacon <will@kernel.org>
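Conceptually, the return-to-user work loop gains a check along these lines (a sketch; the other work handlers are elided to a comment, and it assumes _TIF_NEED_RESCHED_LAZY was added to _TIF_WORK_MASK):

    static void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
    {
            do {
                    if (thread_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
                            schedule();     /* both flags funnel into the scheduler */
                    } else {
                            local_irq_enable();
                            /* signals, uprobes, foreign FP state, ... handled here */
                    }

                    local_irq_disable();
                    thread_flags = read_thread_flags();
            } while (thread_flags & _TIF_WORK_MASK);
    }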
2025-04-29  Merge branch kvm-arm64/nv-pmu-fixes into kvmarm-master/next (Marc Zyngier, 4 files, -23/+92)
* kvm-arm64/nv-pmu-fixes:
  Fixes for NV PMU emulation. From the cover letter:
  "Joey reports that some of his PMU tests do not behave quite as expected:
  - MDCR_EL2.HPMN is set to 0 out of reset
  - PMCR_EL0.P should reset all the counters when written from EL2
  Oliver points out that setting PMCR_EL0.N from userspace by writing to the register is silly with NV, and that we need a new PMU attribute instead.
  On top of that, I figured out that we had a number of little gotchas:
  - It is possible for a guest to write an HPMN value that is out of bound, and it seems valuable to limit it
  - PMCR_EL0.N should be the maximum number of counters when read from EL2, and MDCR_EL2.HPMN when read from EL0/EL1
  - Prevent userspace from updating PMCR_EL0.N when EL2 is available"
  KVM: arm64: Let kvm_vcpu_read_pmcr() return an EL-dependent value for PMCR_EL0.N
  KVM: arm64: Handle out-of-bound write to MDCR_EL2.HPMN
  KVM: arm64: Don't let userspace write to PMCR_EL0.N when the vcpu has EL2
  KVM: arm64: Allow userspace to limit the number of PMU counters for EL2 VMs
  KVM: arm64: Contextualise the handling of PMCR_EL0.P writes
  KVM: arm64: Fix MDCR_EL2.HPMN reset value
  KVM: arm64: Repaint pmcr_n into nr_pmu_counters
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-29  arm64/cpufeature: Add missing id_aa64mmfr4 feature reg update (Yicong Yang, 1 file, -0/+2)
Add missing id_aa64mmfr4 feature register check and update in update_cpu_features(). Update the taint status as well. Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20250329034409.21354-2-yangyicong@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2025-04-29  arm64/mm: Remove randomization of the linear map (Ard Biesheuvel, 4 files, -27/+0)
Since commit 97d6786e0669 ("arm64: mm: account for hotplug memory when randomizing the linear region") the decision whether or not to randomize the placement of the system's DRAM inside the linear map is based on the capabilities of the CPU rather than how much memory is present at boot time. This change was necessary because memory hotplug may result in DRAM appearing in places that are not covered by the linear region at all (and therefore unusable) if the decision is solely based on the memory map at boot. In the Android GKI kernel, which requires support for memory hotplug, and is built with a reduced virtual address space of only 39 bits wide, randomization of the linear map never happens in practice as a result. And even on arm64 kernels built with support for 48 bit virtual addressing, the wider PArange of recent CPUs means that linear map randomization is slowly becoming a feature that only works on systems that will soon be obsolete. So let's just remove this feature. We can always bring it back in an improved form if there is a real need for it. Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250318134949.3194334-2-ardb+git@google.com Signed-off-by: Will Deacon <will@kernel.org>
2025-04-29  arm64/fpsimd: Avoid unnecessary per-CPU buffers for EFI runtime calls (Ard Biesheuvel, 2 files, -31/+27)
The EFI specification has some elaborate rules about which runtime services may be called while another runtime service call is already in progress. In Linux, however, for simplicity, all EFI runtime service invocations are serialized via the efi_runtime_lock semaphore. This implies that calls to the helper pair arch_efi_call_virt_setup() and arch_efi_call_virt_teardown() are serialized too, and are guaranteed not to nest. Furthermore, the arm64 arch code has its own spinlock to serialize use of the EFI runtime stack, of which only a single instance exists. This all means that the FP/SIMD and SVE state preserve/restore logic in __efi_fpsimd_begin() and __efi_fpsimd_end() are also serialized, and only a single instance of the associated per-CPU variables can ever be in use at the same time. There is therefore no need at all for per-CPU variables here, and they can all be replaced with singleton instances. This saves a non-trivial amount of memory on systems with many CPUs. To be more robust against potential future changes in the core EFI code that may invalidate the reasoning above, move the invocations of __efi_fpsimd_begin() and __efi_fpsimd_end() into the critical section covered by the efi_rt_lock spinlock. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250318132421.3155799-2-ardb+git@google.com Signed-off-by: Will Deacon <will@kernel.org>
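The gist of the change, sketched with assumed variable names:

    /*
     * Before: one save area per CPU, though only one can ever be live:
     *     static DEFINE_PER_CPU(struct user_fpsimd_state, efi_fpsimd_state);
     * After: a single save area, serialized by the existing efi_rt_lock.
     */
    static struct user_fpsimd_state efi_fpsimd_state;

    static void efi_virt_call_setup(void)
    {
            raw_spin_lock(&efi_rt_lock);
            __efi_fpsimd_begin();   /* now inside the lock, so the singleton is safe */
    }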
2025-04-29  x86/bugs: Restructure retbleed mitigation (David Kaplan, 1 file, -97/+98)
Restructure retbleed mitigation to use select/update/apply functions to create consistent vulnerability handling. The retbleed_update_mitigation() simplifies the dependency between spectre_v2 and retbleed. The command line options now directly select a preferred mitigation which simplifies the logic. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-11-david.kaplan@amd.com
2025-04-29  LoongArch: entry: Migrate ret_from_fork() to C (Charlie Jenkins, 3 files, -18/+45)
LoongArch is the only architecture that calls syscall_exit_to_user_mode() from assembly. Move the call into C so that this function can be inlined across all architectures. Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250320-riscv_optimize_entry-v6-3-63e187e26041@rivosinc.com
2025-04-29  riscv: entry: Split ret_from_fork() into user and kernel (Charlie Jenkins, 3 files, -10/+23)
This function was unified into a single function in commit ab9164dae273 ("riscv: entry: Consolidate ret_from_kernel_thread into ret_from_fork"). However, that imposed a performance degradation. Partially reverting this commit to have ret_from_fork() split again results in a 1% increase in the number of times fork can be called per second. Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/all/20250320-riscv_optimize_entry-v6-2-63e187e26041@rivosinc.com
2025-04-29  riscv: entry: Convert ret_from_fork() to C (Charlie Jenkins, 3 files, -11/+19)
Move the main section of ret_from_fork() to C to allow inlining of syscall_exit_to_user_mode(). Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/all/20250320-riscv_optimize_entry-v6-1-63e187e26041@rivosinc.com
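A sketch of the C-side landing pads for the two cases above, with names and signatures assumed from the changelogs rather than copied from the tree:

    asmlinkage void ret_from_fork_user(struct pt_regs *regs)
    {
            regs->a0 = 0;                    /* child of fork() returns 0 */
            syscall_exit_to_user_mode(regs); /* now inlinable, unlike from asm */
    }

    asmlinkage void ret_from_fork_kernel(int (*fn)(void *), void *arg,
                                         struct pt_regs *regs)
    {
            fn(arg);                         /* kernel thread payload */
            syscall_exit_to_user_mode(regs); /* reached on kernel_execve() */
    }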
2025-04-29  powerpc: Don't use --- in kernel logs (Christophe Leroy, 2 files, -5/+5)
When a kernel log containing --- at the start of a line is copied into a patch message, 'git am' drops everything located after that ---. Replace --- by ---- to avoid that. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/54a1f8d2c3fb5b95434039724c8c141052ae5cc0.1739346038.git.christophe.leroy@csgroup.eu
2025-04-29  powerpc/crash: Fix non-smp kexec preparation (Eddie James, 1 file, -1/+4)
In non-smp configurations, crash_kexec_prepare is never called in the crash shutdown path. One result of this is that the crashing_cpu variable is never set, preventing crash_save_cpu from storing the NT_PRSTATUS elf note in the core dump. Fixes: c7255058b543 ("powerpc/crash: save cpu register data in crash_smp_send_stop()") Signed-off-by: Eddie James <eajames@linux.ibm.com> Reviewed-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250211162054.857762-1-eajames@linux.ibm.com
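One plausible shape for the fix, per the changelog (a sketch, not the exact upstream hunk):

    void default_machine_crash_shutdown(struct pt_regs *regs)
    {
            /*
             * With !SMP, crash_smp_send_stop() (which would call
             * crash_kexec_prepare()) is not in the path, so prepare here to
             * get crashing_cpu set before the registers are saved.
             */
            crash_kexec_prepare();
            crash_save_cpu(regs, smp_processor_id());
    }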
2025-04-29  powerpc: do not build ppc_save_regs.o always (Jiri Slaby (SUSE), 1 file, -1/+1)
The Fixes commit below tried to add CONFIG_PPC_BOOK3S to one of the conditions to enable the build of ppc_save_regs.o. But it failed to do so, in fact. The commit omitted to add a dollar sign. Therefore, ppc_save_regs.o is built always these days (as "(CONFIG_PPC_BOOK3S)" is never an empty string). Fix this by adding the missing dollar sign. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Fixes: fc2a5a6161a2 ("powerpc/64s: ppc_save_regs is now needed for all 64s builds") Acked-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250417105305.397128-1-jirislaby@kernel.org
2025-04-29  powerpc/pseries/msi: Avoid reading PCI device registers in reduced power states (Gautam Menghani, 1 file, -1/+6)
When a system is being suspended to RAM, the PCI devices are also suspended and the PPC code ends up calling pseries_msi_compose_msg() and this triggers the BUG_ON() in __pci_read_msi_msg() because the device at this point is in reduced power state. In reduced power state, the memory mapped registers of the PCI device are not accessible. To replicate the bug: 1. Make sure deep sleep is selected # cat /sys/power/mem_sleep s2idle [deep] 2. Make sure console is not suspended (so that dmesg logs are visible) echo N > /sys/module/printk/parameters/console_suspend 3. Suspend the system echo mem > /sys/power/state To fix this behaviour, read the cached msi message of the device when the device is not in PCI_D0 power state instead of touching the hardware. Fixes: a5f3d2c17b07 ("powerpc/pseries/pci: Add MSI domains") Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250305090237.294633-1-gautam@linux.ibm.com
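The shape of the fix, per the changelog (a sketch; the function name is hypothetical):

    static void compose_msg_safe(struct pci_dev *pdev, struct irq_data *data,
                                 struct msi_msg *msg)
    {
            /*
             * Below D0 the device's memory-mapped registers may be
             * inaccessible, and reading them trips the BUG_ON() in
             * __pci_read_msi_msg(); use the message cached at setup time.
             */
            if (pdev->current_state != PCI_D0)
                    get_cached_msi_msg(data->irq, msg);
            else
                    __pci_read_msi_msg(irq_data_get_msi_desc(data), msg);
    }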
2025-04-29  powerpc/bpf: fix JIT code size calculation of bpf trampoline (Hari Bathini, 4 files, -39/+35)
arch_bpf_trampoline_size() provides JIT size of the BPF trampoline before the buffer for JIT'ing it is allocated. The total number of instructions emitted for BPF trampoline JIT code depends on where the final image is located. So, the size arrived at with the dummy pass in arch_bpf_trampoline_size() can vary from the actual size needed in arch_prepare_bpf_trampoline(). When the instructions accounted in arch_bpf_trampoline_size() is less than the number of instructions emitted during the actual JIT compile of the trampoline, the below warning is produced: WARNING: CPU: 8 PID: 204190 at arch/powerpc/net/bpf_jit_comp.c:981 __arch_prepare_bpf_trampoline.isra.0+0xd2c/0xdcc which is: /* Make sure the trampoline generation logic doesn't overflow */ if (image && WARN_ON_ONCE(&image[ctx->idx] > (u32 *)rw_image_end - BPF_INSN_SAFETY)) { So, during the dummy pass, instead of providing some arbitrary image location, account for maximum possible instructions if and when there is a dependency with image location for JIT'ing. Fixes: d243b62b7bd3 ("powerpc64/bpf: Add support for bpf trampolines") Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/all/6168bfc8-659f-4b5a-a6fb-90a916dde3b3@linux.ibm.com/ Cc: stable@vger.kernel.org # v6.13+ Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250422082609.949301-1-hbathini@linux.ibm.com
2025-04-29  powerpc64/ftrace: fix clobbered r15 during livepatching (Hari Bathini, 1 file, -1/+1)
While r15 is clobbered always with PPC_FTRACE_OUT_OF_LINE, it is not restored in livepatch sequence leading to not so obvious fails like below: BUG: Unable to handle kernel data access on write at 0xc0000000000f9078 Faulting instruction address: 0xc0000000018ff958 Oops: Kernel access of bad area, sig: 11 [#1] ... NIP: c0000000018ff958 LR: c0000000018ff930 CTR: c0000000009c0790 REGS: c00000005f2e7790 TRAP: 0300 Tainted: G K (6.14.0+) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 2822880b XER: 20040000 CFAR: c0000000008addc0 DAR: c0000000000f9078 DSISR: 0a000000 IRQMASK: 1 GPR00: c0000000018f2584 c00000005f2e7a30 c00000000280a900 c000000017ffa488 GPR04: 0000000000000008 0000000000000000 c0000000018f24fc 000000000000000d GPR08: fffffffffffe0000 000000000000000d 0000000000000000 0000000000008000 GPR12: c0000000009c0790 c000000017ffa480 c00000005f2e7c78 c0000000000f9070 GPR16: c00000005f2e7c90 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000000 c00000005f3efa80 c00000005f2e7c60 c00000005f2e7c88 GPR24: c00000005f2e7c60 0000000000000001 c0000000000f9078 0000000000000000 GPR28: 00007fff97960000 c000000017ffa480 0000000000000000 c0000000000f9078 ... Call Trace: check_heap_object+0x34/0x390 (unreliable) __mutex_unlock_slowpath.isra.0+0xe4/0x230 seq_read_iter+0x430/0xa90 proc_reg_read_iter+0xa4/0x200 vfs_read+0x41c/0x510 ksys_read+0xa4/0x190 system_call_exception+0x1d0/0x440 system_call_vectored_common+0x15c/0x2ec Fix it by restoring r15 always. Fixes: eec37961a56a ("powerpc64/ftrace: Move ftrace sequence out of line") Reported-by: Viktor Malik <vmalik@redhat.com> Closes: https://lore.kernel.org/lkml/1aec4a9a-a30b-43fd-b303-7a351caeccb7@redhat.com Cc: stable@vger.kernel.org # v6.13+ Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Tested-by: Viktor Malik <vmalik@redhat.com> Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250416191227.201146-1-hbathini@linux.ibm.com
2025-04-28  m68k: mac: Fix macintosh_config for Mac II (Finn Thain, 1 file, -1/+1)
When booted on my Mac II, the kernel prints this: Detected Macintosh model: 6 Apple Macintosh Unknown The catch-all entry ("Unknown") is mac_data_table[0] which is only needed in the unlikely event that the bootinfo model ID can't be matched. When model ID is 6, the search should begin and end at mac_data_table[1]. Fix the off-by-one error that causes this problem. Cc: Joshua Thompson <funaho@jurai.org> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Finn Thain <fthain@linux-m68k.org> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/d0f30a551064ca4810b1c48d5a90954be80634a9.1745453246.git.fthain@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
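The lookup pattern involved, sketched (table layout inferred from the changelog):

    static struct mac_model *mac_find_model(int model_id)
    {
            int i;

            /*
             * mac_data_table[0] is the "Unknown" catch-all, so the search
             * must start at [1]; starting at [0] let the Mac II's model ID
             * match the catch-all entry and misreport the machine.
             */
            for (i = 1; mac_data_table[i].ident != -1; i++)
                    if (mac_data_table[i].ident == model_id)
                            return &mac_data_table[i];

            return &mac_data_table[0];
    }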
2025-04-28  m68k: Replace strcpy() with strscpy() in hardware_proc_show() (Thorsten Blum, 1 file, -1/+1)
strcpy() is deprecated; use strscpy() instead. Link: https://github.com/KSPP/linux/issues/88 Cc: linux-hardening@vger.kernel.org Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Jean-Michel Hautbois <jeanmichel.hautbois@yoseli.org> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/20250421122839.363619-1-thorsten.blum@linux.dev Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2025-04-28  x86/sgx: Use SHA-256 library API instead of crypto_shash API (Eric Biggers, 3 files, -31/+3)
This user of SHA-256 does not support any other algorithm, so the crypto_shash abstraction provides no value. Just use the SHA-256 library API instead, which is much simpler and easier to use. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20250428183838.799333-1-ebiggers%40kernel.org
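Illustrating the simplification; the library entry point is sha256() from <crypto/sha2.h>, and the wrapper name here is hypothetical:

    #include <crypto/sha2.h>

    static void sgx_sha256_digest(const void *data, size_t len,
                                  u8 digest[SHA256_DIGEST_SIZE])
    {
            /*
             * One call replaces the crypto_alloc_shash()/init/update/final
             * sequence; no tfm allocation also means no error paths.
             */
            sha256(data, len, digest);
    }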
2025-04-28  arm64/fpsimd: signal: Clear TPIDR2 when delivering signals (Mark Rutland, 1 file, -0/+1)
Linux is intended to be compatible with userspace written to Arm's AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension (SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's ZA tile is lazily callee-saved and caller-restored. In this scheme, TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer" pointer. This scheme has been implemented in GCC and LLVM, with necessary runtime support implemented in glibc. AAPCS64 does not specify how the ZA lazy saving scheme is expected to interact with signal handling, and the behaviour that AAPCS64 currently recommends for (sig)setjmp() and (sig)longjmp() does not always compose safely with signal handling, as explained below. When Linux delivers a signal, it creates signal frames which contain the original values of PSTATE.ZA, the ZA tile, and TPIDR2_EL0. Between saving the original state and entering the signal handler, Linux clears PSTATE.ZA, but leaves TPIDR2_EL0 unchanged. Consequently a signal handler can be entered with PSTATE.ZA=0 (meaning accesses to ZA will trap), while TPIDR2_EL0 is non-null (which may indicate that ZA needs to be lazily saved, depending on the contents of the TPIDR2 block). While in this state, libc and/or compiler runtime code, such as longjmp(), may attempt to save ZA. As PSTATE.ZA=0, these accesses will trap, causing the kernel to inject a SIGILL. Note that by virtue of lazy saving occurring in libc and/or C runtime code, this can be triggered by application/library code which is unaware of SME. To avoid the problem above, the kernel must ensure that signal handlers are entered with PSTATE.ZA and TPIDR2_EL0 configured in a manner which complies with the ZA lazy saving scheme. Practically speaking, the only choice is to enter signal handlers with PSTATE.ZA=0 and TPIDR2_EL0=NULL. This change should not impact SME code which does not follow the ZA lazy saving scheme (and hence does not use TPIDR2_EL0). An alternative approach that was considered is to have the signal handler inherit the original values of both PSTATE.ZA and TPIDR2_EL0, relying on lazy save/restore sequences being idempotent and capable of racing safely. This is not safe as signal handlers must be assumed to have a "private ZA" interface, and therefore cannot be entered with PSTATE.ZA=1 and TPIDR2_EL0=NULL, but it is legitimate for signals to be taken from this state. With the kernel fixed to clear TPIDR2_EL0, there are a couple of remaining issues (largely masked by the first issue) that must be fixed in userspace: (1) When a (sig)setjmp() + (sig)longjmp() pair cross a signal boundary, ZA state may be discarded when it needs to be preserved. Currently, the ZA lazy saving scheme recommends that setjmp() does not save ZA, and recommends that longjmp() is responsible for saving ZA. A call to longjmp() in a signal handler will not have visibility of ZA state that existed prior to entry to the signal, and when a longjmp() is used to bypass a usual signal return, unsaved ZA state will be discarded erroneously. To fix this, it is necessary for setjmp() to eagerly save ZA state, and for longjmp() to configure PSTATE.ZA=0 and TPIDR2_EL0=NULL. This works regardless of whether a signal boundary is crossed. (2) When a C++ exception is thrown and crosses a signal boundary before it is caught, ZA state may be discarded when it needs to be preserved.
AAPCS64 requires that exception handlers are entered with PSTATE.{SM,ZA}={0,0} and TPIDR2_EL0=NULL, with exception unwind code expected to perform any necessary save of ZA state. Where it is necessary to perform an exception unwind across an exception boundary, the unwind code must recover any necessary ZA state (along with TPIDR2) from signal frames. Fix the kernel as described above, with setup_return() clearing TPIDR2_EL0 when delivering a signal. Folk on CC are working on fixes for the remaining userspace issues, including updates/fixes to the AAPCS64 specification and glibc. [1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf [2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst Fixes: 39782210eb7e ("arm64/sme: Implement ZA signal handling") Fixes: 39e54499280f ("arm64/signal: Include TPIDR2 in the signal context") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Daniel Kiss <daniel.kiss@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Richard Sandiford <richard.sandiford@arm.com> Cc: Sander De Smalen <sander.desmalen@arm.com> Cc: Tamas Petz <tamas.petz@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yury Khrustalev <yury.khrustalev@arm.com> Link: https://lore.kernel.org/r/20250417190113.3778111-1-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
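The kernel-side fix itself is one line; a sketch of the setup_return() change (the wrapper is hypothetical):

    static void clear_tpidr2_on_signal(void)
    {
            /*
             * The original PSTATE.ZA/ZA/TPIDR2_EL0 values are already saved
             * in the signal frame; enter the handler with PSTATE.ZA=0 and
             * TPIDR2_EL0=NULL so that lazy-save sequences in libc and
             * compiler runtime code see a consistent state.
             */
            if (system_supports_sme())
                    write_sysreg_s(0, SYS_TPIDR2_EL0);
    }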
2025-04-28  x86/microcode/AMD: Use sha256() instead of init/update/final (Eric Biggers, 1 file, -4/+1)
Just call sha256() instead of doing the init/update/final sequence. No functional changes. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250428183006.782501-1-ebiggers@kernel.org
2025-04-28  x86/sev: Remove unnecessary GFP_KERNEL_ACCOUNT for temporary variables (Peng Hao, 1 file, -2/+2)
Some variables allocated in sev_send_update_data are released when the function exits, so there is no need to set GFP_KERNEL_ACCOUNT. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Link: https://lore.kernel.org/r/20250428063013.62311-1-flyingpeng@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28  KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload() (Yan Zhao, 2 files, -0/+4)
Check request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to free obsolete roots in kvm_mmu_reload() to prevent kvm_mmu_reload() from seeing a stale obsolete root. Since kvm_mmu_reload() can be called outside the vcpu_enter_guest() path (e.g., kvm_arch_vcpu_pre_fault_memory()), it may be invoked after a root has been marked obsolete and before vcpu_enter_guest() is invoked to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS and set root.hpa to invalid. This causes kvm_mmu_reload() to fail to load a new root, which can lead to kvm_arch_vcpu_pre_fault_memory() being stuck in the while loop in kvm_tdp_map_page() since RET_PF_RETRY is always returned due to is_page_fault_stale(). Keep the existing check of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS in vcpu_enter_guest() since the cost of kvm_check_request() is negligible, especially a check that's guarded by kvm_request_pending(). Export symbol of kvm_mmu_free_obsolete_roots() as kvm_mmu_reload() is inline and may be called outside of kvm.ko. Fixes: 6e01b7601dfe ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013333.5817-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
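The resulting shape of kvm_mmu_reload(), as described (a sketch):

    static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
    {
            /*
             * Free obsolete roots first so a root marked obsolete outside
             * vcpu_enter_guest() (e.g. pre-fault) is not reloaded stale.
             */
            if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
                    kvm_mmu_free_obsolete_roots(vcpu);

            if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
                    return 0;

            return kvm_mmu_load(vcpu);
    }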
2025-04-28  KVM: x86/mmu: Warn if PFN changes on shadow-present SPTE in shadow MMU (Yan Zhao, 1 file, -2/+3)
Warn if PFN changes on shadow-present SPTE in mmu_set_spte(). KVM should _never_ change the PFN of a shadow-present SPTE. In mmu_set_spte(), there is a WARN_ON_ONCE() on pfn changes on shadow-present SPTE in mmu_spte_update() to detect this condition. However, that WARN_ON_ONCE() is not hittable since mmu_set_spte() invokes drop_spte() earlier before mmu_spte_update(), which clears SPTE to a !shadow-present state. So, before invoking drop_spte(), add a WARN_ON_ONCE() in mmu_set_spte() to warn PFN change of a shadow-present SPTE. For the spurious prefetch fault, only return RET_PF_SPURIOUS directly when PFN is not changed. When PFN changes, fall through to follow the sequence of drop_spte(), warn of PFN change, make_spte(), flush tlb, rmap_add(). Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013310.5781-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28  KVM: x86/tdp_mmu: WARN if PFN changes for spurious faults (Yan Zhao, 1 file, -1/+3)
Add a WARN() to assert that KVM does _not_ change the PFN of a shadow-present SPTE during spurious fault handling. KVM should _never_ change the PFN of a shadow-present SPTE and TDP MMU already BUG()s on this. However, spurious faults just return early before the existing BUG() could be hit. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013238.5732-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28  KVM: x86/tdp_mmu: Merge prefetch and access checks for spurious faults (Yan Zhao, 1 file, -5/+1)
Combine prefetch and is_access_allowed() checks into a unified path to detect spurious faults, since both cases now share identical logic. No functional changes. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013210.5701-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28  KVM: x86/mmu: Further check old SPTE is leaf for spurious prefetch fault (Yan Zhao, 2 files, -2/+3)
Instead of simply treating a prefetch fault as spurious when there's a shadow-present old SPTE, further check if the old SPTE is a leaf to determine if a prefetch fault is spurious. It's not reasonable to treat a prefetch fault as spurious when there's a shadow-present non-leaf SPTE without a corresponding shadow-present leaf SPTE. e.g., in the following sequence, a prefetch fault should not be considered spurious: 1. add a memslot with size 4K 2. prefault GPA A in the memslot 3. delete the memslot (zap all disabled) 4. re-add the memslot with size 2M 5. prefault GPA A again. In step 5, the prefetch fault attempts to install a 2M huge entry. Since step 3 zaps the leaf SPTE for GPA A while keeping the non-leaf SPTE, the leaf entry will remain empty after step 5 if the prefetch fault is regarded as spurious due to a shadow-present non-leaf SPTE. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013111.5648-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
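The tightened spurious-fault test, roughly (TDP MMU flavour; a fragment of the fault handler, not the whole function):

    /* Spurious only if a *leaf* SPTE is already shadow-present. */
    if (fault->prefetch && is_shadow_present_pte(iter->old_spte) &&
        is_last_spte(iter->old_spte, iter->level))
            return RET_PF_SPURIOUS;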
2025-04-28  KVM: VMX: Flush shadow VMCS on emergency reboot (Chao Gao, 1 file, -1/+4)
Ensure the shadow VMCS cache is evicted during an emergency reboot to prevent potential memory corruption if the cache is evicted after reboot. This issue was identified through code inspection, as __loaded_vmcs_clear() flushes both the normal VMCS and the shadow VMCS. Avoid checking the "launched" state during an emergency reboot, unlike the behavior in __loaded_vmcs_clear(). This is important because reboot NMIs can interfere with operations like copy_shadow_to_vmcs12(), where shadow VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur right after the VMCS load, the shadow VMCSes will be active but the "launched" state may not be set. Fixes: 16f5b9034b69 ("KVM: nVMX: Copy processor-specific shadow-vmcs to VMCS12") Cc: stable@vger.kernel.org Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250324140849.2099723-1-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
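The emergency-path clearing then looks roughly like this (cf. __loaded_vmcs_clear(); a sketch with an assumed function name):

    static void vmx_emergency_disable_sketch(void)
    {
            int cpu = raw_smp_processor_id();
            struct loaded_vmcs *v;

            list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
                                loaded_vmcss_on_cpu_link) {
                    vmcs_clear(v->vmcs);
                    /*
                     * Flush the shadow VMCS unconditionally: a reboot NMI can
                     * land between VMPTRLD and "launched" being set, so the
                     * launched check done by __loaded_vmcs_clear() is skipped.
                     */
                    if (v->shadow_vmcs)
                            vmcs_clear(v->shadow_vmcs);
            }
    }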