summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)AuthorFilesLines
2025-11-20LoongArch: Fix NUMA node parsing with numa_memblksBibo Mao1-42/+18
On physical machine, NUMA node id comes from high bit 44:48 of physical address. However it is not true on virt machine. With general method, it comes from ACPI SRAT table. Here the common function numa_memblks_init() is used to parse NUMA node information with numa_memblks. Cc: <stable@vger.kernel.org> Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-11-20LoongArch: Consolidate CPU names in /proc/cpuinfoHuacai Chen3-23/+34
Some processors have no IOCSR.VENDOR and IOCSR.CPUNAME, some processors have these registers but there is no valid information. Consolidate CPU names in /proc/cpuinfo: 1. Add "PRID" to display the PRID & Core-Name; 2. Let "Model Name" display "Unknown" if no valid name. Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-11-20LoongArch: Use UAPI types in ptrace UAPI headerThomas Weißschuh1-22/+18
The kernel UAPI headers already contain fixed-width integer types, there is no need to rely on the libc types. There may not be a libc available or the libc may not provides the <stdint.h>, like for example on nolibc. This also aligns the header with the rest of the LoongArch UAPI headers. Fixes: 803b0fc5c3f2 ("LoongArch: Add process management") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-11-20KVM: x86: Add emulator support for decoding VEX prefixesPaolo Bonzini1-10/+112
After all the changes done in the previous patches, the only thing left to support AVX MOV instructions is to expand the VEX prefix into the appropriate REX, 66/F3/F2 and map prefixes. Three-operand instructions are not supported. The Avx bit in this case is not cleared, in fact it is used as the sign that the instruction does support VEX encoding. Until it is added to any instruction, however, the only functional change is to change some not-implemented instructions to #UD if they correspond to a VEX prefix with an invalid map. Co-developed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20251114003633.60689-10-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-20KVM: x86: Refactor REX prefix handling in instruction emulationChang S. Bae2-13/+31
Restructure how to represent and interpret REX fields, preparing for handling of both REX2 and VEX. REX uses the upper four bits of a single byte as a fixed identifier, and the lower four bits containing the data. VEX and REX2 extends this so that the first byte identifies the prefix and the rest encode additional bits; and while VEX only has the same four data bits as REX, eight zero bits are a valid value for the data bits of REX2. So, stop storing the REX byte as-is. Instead, store only the low bits of the REX prefix and track separately whether a REX-like prefix was used. No functional changes intended. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Message-ID: <20251110180131.28264-11-chang.seok.bae@intel.com> [Extracted from APX series; removed bitfields and REX2-specific default. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20251114003633.60689-9-pbonzini@redhat.com [sean: name REX_{BXRW} enum "rex_bits"] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-20KVM: x86: Add AVX support to the emulator's register fetch and writebackPaolo Bonzini3-17/+114
Prepare struct operand for hosting AVX registers. Remove the existing, incomplete code that placed the Avx flag in the operand alignment field, and repurpose the name for a separate bit that indicates: - after decode, whether an instruction supports the VEX prefix; - before writeback, that the instruction did have the VEX prefix and therefore 1) it can have op_bytes == 32; 2) t should clear high bytes of XMM registers. Right now the bit will never be set and the patch has no intended functional change. However, this is actually more vexing than the decoder changes itself, and therefore worth separating. Co-developed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20251114003633.60689-8-pbonzini@redhat.com [sean: guard ymm[8-15] accesses with #ifdef CONFIG_X86_64] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-20KVM: x86: Add x86_emulate_ops.get_xcr() callbackPaolo Bonzini2-0/+10
This will be necessary in order to check whether AVX is enabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Link: https://patch.msgid.link/20251114003633.60689-7-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-20KVM: x86: Share emulator's common register decoding codePaolo Bonzini1-32/+17
Remove all duplicate handling of register operands, including picking the right register class and fetching it, by extracting a new function that can be used for both REG and MODRM operands. Centralize setting op->orig_val = op->val in fetch_register_operand() as well. No functional change intended. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Link: https://patch.msgid.link/20251114003633.60689-6-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-20KVM: x86: Move op_prefix to struct x86_emulate_ctxt (from x86_decode_insn())Paolo Bonzini2-4/+5
VEX decode will need to set it based on the "pp" bits, so make it a field in the struct rather than a local variable. No functional change intended. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Link: https://patch.msgid.link/20251114003633.60689-5-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-20KVM: x86: Improve formatting of the emulator's flags tablePaolo Bonzini1-16/+11
Align a little better the comments on the right side and list explicitly the bits used by multi-bit fields. No functional change intended. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Link: https://patch.msgid.link/20251114003633.60689-4-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-20KVM: x86: Move Src2Shift up one bit (use bits 36:32 for Src2 in the emulator)Paolo Bonzini1-1/+2
An irresistible microoptimization (changing accesses to Src2 to just an AND :)) that also frees a bit for AVX in the low flags word. This makes it closer to SSE since both of them can access XMM registers, pointlessly shaving another clock cycle or two (maybe). No functional change intended. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com Link: https://patch.msgid.link/20251114003633.60689-3-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-20KVM: x86: Add support for emulating MOVNTDQAPaolo Bonzini1-4/+9
MOVNTDQA is a simple MOV instruction, in fact it has the same characteristics as 0F E7 (MOVNTDQ) other than the aligned-address requirement. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20251114003633.60689-2-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-20KVM: arm64: Reschedule as needed when destroying the stage-2 page-tablesRaghavendra Rao Ananta1-1/+25
When a large VM, specifically one that holds a significant number of PTEs, gets abruptly destroyed, the following warning is seen during the page-table walk: sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE Tainted: [O]=OOT_MODULE Call trace: show_stack+0x20/0x38 (C) dump_stack_lvl+0x3c/0xb8 dump_stack+0x18/0x30 resched_latency_warn+0x7c/0x88 sched_tick+0x1c4/0x268 update_process_times+0xa8/0xd8 tick_nohz_handler+0xc8/0x168 __hrtimer_run_queues+0x11c/0x338 hrtimer_interrupt+0x104/0x308 arch_timer_handler_phys+0x40/0x58 handle_percpu_devid_irq+0x8c/0x1b0 generic_handle_domain_irq+0x48/0x78 gic_handle_irq+0x1b8/0x408 call_on_irq_stack+0x24/0x30 do_interrupt_handler+0x54/0x78 el1_interrupt+0x44/0x88 el1h_64_irq_handler+0x18/0x28 el1h_64_irq+0x84/0x88 stage2_free_walker+0x30/0xa0 (P) __kvm_pgtable_walk+0x11c/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 kvm_pgtable_walk+0xc4/0x140 kvm_pgtable_stage2_destroy+0x5c/0xf0 kvm_free_stage2_pgd+0x6c/0xe8 kvm_uninit_stage2_mmu+0x24/0x48 kvm_arch_flush_shadow_all+0x80/0xa0 kvm_mmu_notifier_release+0x38/0x78 __mmu_notifier_release+0x15c/0x250 exit_mmap+0x68/0x400 __mmput+0x38/0x1c8 mmput+0x30/0x68 exit_mm+0xd4/0x198 do_exit+0x1a4/0xb00 do_group_exit+0x8c/0x120 get_signal+0x6d4/0x778 do_signal+0x90/0x718 do_notify_resume+0x70/0x170 el0_svc+0x74/0xd8 el0t_64_sync_handler+0x60/0xc8 el0t_64_sync+0x1b0/0x1b8 The warning is seen majorly on the host kernels that are configured not to force-preempt, such as CONFIG_PREEMPT_NONE=y. To avoid this, instead of walking the entire page-table in one go, split it into smaller ranges, by checking for cond_resched() between each range. Since the path is executed during VM destruction, after the page-table structure is unlinked from the KVM MMU, relying on cond_resched_rwlock_write() isn't necessary. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Link: https://msgid.link/20251113052452.975081-4-rananta@google.com Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-11-20KVM: arm64: Split kvm_pgtable_stage2_destroy()Raghavendra Rao Ananta5-9/+73
Split kvm_pgtable_stage2_destroy() into two: - kvm_pgtable_stage2_destroy_range(), that performs the page-table walk and free the entries over a range of addresses. - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD. This refactoring enables subsequent patches to free large page-tables in chunks, calling cond_resched() between each chunk, to yield the CPU as necessary. Existing callers of kvm_pgtable_stage2_destroy(), that probably cannot take advantage of this (such as nVMHE), will continue to function as is. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Suggested-by: Oliver Upton <oupton@kernel.org> Link: https://msgid.link/20251113052452.975081-3-rananta@google.com Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-11-20KVM: arm64: Only drop references on empty tables in stage2_free_walkerOliver Upton1-6/+32
A subsequent change to the way KVM frees stage-2s will invoke the free walker on sub-ranges of the VM's IPA space, meaning there's potential for only partially visiting a table's PTEs. Split the leaf and table visitors and only drop references on a table when the page count reaches 1, implying there are no valid PTEs that need to be visited. Invalidate the table PTE to avoid traversing the stale reference. Link: https://msgid.link/20251113052452.975081-2-rananta@google.com Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-11-19KVM: arm64: Use kvzalloc() for kvm struct allocationOliver Upton1-1/+1
Physically-allocated KVM structs aren't necessary when in VHE mode as there's no need to share with the hyp's address space. Of course, there can still be a performance benefit from physical allocations. Use kvzalloc() for opportunistic physical allocations. Acked-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://msgid.link/20251119093822.2513142-3-oupton@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-11-19KVM: arm64: Drop useless __GFP_HIGHMEM from kvm struct allocationOliver Upton1-1/+1
A recent change on the receiving end of vmalloc() started warning about unsupported GFP flags passed by the caller. Nathan reports that this warning fires in kvm_arch_alloc_vm(), owing to the fact that KVM is passing a meaningless __GFP_HIGHMEM. Do as the warning says and fix the code. Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/kvmarm/20251118224448.GA998046@ax162/ Acked-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://msgid.link/20251119093822.2513142-2-oupton@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-11-19arm_mpam: Add probe/remove for mpam msc driver and kbuild boiler plateJames Morse1-0/+1
Probing MPAM is convoluted. MSCs that are integrated with a CPU may only be accessible from those CPUs, and they may not be online. Touching the hardware early is pointless as MPAM can't be used until the system-wide common values for num_partid and num_pmg have been discovered. Start with driver probe/remove and mapping the MSC. Cc: Carl Worth <carl@os.amperecomputing.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19ACPI / MPAM: Parse the MPAM tableJames Morse1-0/+1
Add code to parse the arm64 specific MPAM table, looking up the cache level from the PPTT and feeding the end result into the MPAM driver. This happens in two stages. Platform devices are created first for the MSC devices. Once the driver probes it calls acpi_mpam_parse_resources() to discover the RIS entries the MSC contains. For now the MPAM hook mpam_ris_create() is stubbed out, but will update the MPAM driver with optional discovered data about the RIS entries. CC: Carl Worth <carl@os.amperecomputing.com> Link: https://developer.arm.com/documentation/den0065/3-0bet/?lang=en Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19arm64: kconfig: Add Kconfig entry for MPAMJames Morse1-0/+23
The bulk of the MPAM driver lives outside the arch code because it largely manages MMIO devices that generate interrupts. The driver needs a Kconfig symbol to enable it. As MPAM is only found on arm64 platforms, the arm64 tree is the most natural home for the Kconfig option. This Kconfig option will later be used by the arch code to enable or disable the MPAM context-switch code, and to register properties of CPUs with the MPAM driver. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> CC: Dave Martin <dave.martin@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19Merge tag 'soc-fixes-6.18-3' of ↵Linus Torvalds21-46/+69
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull SoC fixes from Arnd Bergmann: "These are mainly devicetree fixes for the arm platforms from Rockchips NXP, ASpeed and Broadcom, addressing issues with accidental overclocking, pinctrl, network and dtc warnings. There are additional fixes for regressions with the i.MX reset and memory controller drivers as well as the Tegra memory controller driver. Minor updates to the MAINTAINERS file, tee documentation and defconfigs bring those up to date with recent changes elsewhere" * tag 'soc-fixes-6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (29 commits) MAINTAINERS: sync omap devicetree maintainers with omap platform MAINTAINERS: Update Krzysztof Kozlowski's email arm64: dts: rockchip: fix PCIe 3.3V regulator voltage on orangepi-5 arm64: dts: rockchip: disable HS400 on RK3588 Tiger arm64: dts: rockchip: drop reset from rk3576 i2c9 node tee: <uapi/linux/tee.h: fix all kernel-doc issues arm64: dts: rockchip: Fix USB power enable pin for BTT CB2 and Pi2 arm64: dts: broadcom: bcm2712: rpi-5: Add ethernet0 alias arm64: dts: broadcom: Assign clock rates in eth node for RPi5 reset: imx8mp-audiomix: Fix bad mask values ARM: dts: BCM53573: Fix address of Luxul XAP-1440's Ethernet PHY arm64: defconfig: Fix V3D deferred probe timeout arm64: dts: rockchip: Fix vccio4-supply on rk3566-pinetab2 arm64: dts: rockchip: include rk3399-base instead of rk3399 in rk3399-op1 arm64: dts: imx8mp-kontron: Fix USB OTG role switching arm64: dts: imx95: Fix MSI mapping for PCIe endpoint nodes arm64: dts: imx8-ss-img: Avoid gpio0_mipi_csi GPIOs being deferred arm: imx_v6_v7_defconfig: enable ext4 directly memory: tegra210: Fix incorrect client ids arm64: dts: rockchip: Fix indentation on rk3399 haikou demo dtso ...
2025-11-19riscv: hwprobe: Expose Zicbop extension and its block sizeYao Zihong3-1/+9
- Add `RISCV_HWPROBE_EXT_ZICBOP` to report the presence of the Zicbop extension. - Add `RISCV_HWPROBE_KEY_ZICBOP_BLOCK_SIZE` to expose the block size (in bytes) when Zicbop is supported. - Update hwprobe.rst to document the new extension bit and block size key, following the existing Zicbom/Zicboz style. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Yao Zihong <zihong.plct@isrc.iscas.ac.cn> Link: https://patch.msgid.link/20251118162436.15485-2-zihong.plct@isrc.iscas.ac.cn [pjw@kernel.org: updated to apply] Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19riscv: Introduce Zalasr instructionsXu Lu1-0/+79
Introduce l{b|h|w|d}.{aq|aqrl} and s{b|h|w|d}.{rl|aqrl} instruction encodings. Signed-off-by: Xu Lu <luxu.kernel@bytedance.com> Reviewed-by: Guo Ren <guoren@kernel.org> Link: https://patch.msgid.link/20251020042056.30283-5-luxu.kernel@bytedance.com Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19riscv: hwprobe: Export Zalasr extensionXu Lu2-0/+2
Export the Zalasr extension to userspace using hwprobe. Signed-off-by: Xu Lu <luxu.kernel@bytedance.com> Link: https://patch.msgid.link/20251020042056.30283-4-luxu.kernel@bytedance.com Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19riscv: Add ISA extension parsing for ZalasrXu Lu2-0/+2
Add parsing for Zalasr ISA extension. Signed-off-by: Xu Lu <luxu.kernel@bytedance.com> Link: https://patch.msgid.link/20251020042056.30283-2-luxu.kernel@bytedance.com [pjw@kernel.org: updated to apply] Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19riscv: ptrace: Optimize the allocation of vector regsetYong-Xuan Wang3-3/+24
The vector regset uses the maximum possible vlen value to estimate the .n field. But not all the hardwares support the maximum vlen. Linux might wastes time to prepare a large memory buffer(about 2^6 pages) for the vector regset. The regset can only copy vector registers when the process are using vector. Add .active callback and determine the n field of vector regset in riscv_v_setup_ctx_cache() doesn't affect the ptrace syscall and coredump. It can avoid oversized allocations and better matches real hardware limits. Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Reviewed-by: Greentime Hu <greentime.hu@sifive.com> Reviewed-by: Andy Chiu <andybnac@gmail.com> Tested-by: Andy Chiu <andybnac@gmail.com> Link: https://patch.msgid.link/20251013091318.467864-2-yongxuan.wang@sifive.com Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19riscv: cmpxchg: Use riscv_has_extension_likelyVivian Wang1-8/+4
Use riscv_has_extension_likely() to check for RISCV_ISA_EXT_ZAWRS, replacing the use of asm goto with ALTERNATIVE. The "likely" variant is used to match the behavior of the original implementation using ALTERNATIVE("j %l[no_zawrs]", "nop", ...). Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn> Link: https://patch.msgid.link/20251020-riscv-altn-helper-wip-v4-5-ef941c87669a@iscas.ac.cn Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19riscv: bitops: Use riscv_has_extension_likelyVivian Wang1-24/+8
Use riscv_has_extension_likely() to check for RISCV_ISA_EXT_ZBB, replacing the use of asm goto with ALTERNATIVE. The "likely" variant is used to match the behavior of the original implementation using ALTERNATIVE("j %l[legacy]", "nop", ...). Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn> Link: https://patch.msgid.link/20251020-riscv-altn-helper-wip-v4-4-ef941c87669a@iscas.ac.cn Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19riscv: hweight: Use riscv_has_extension_likelyVivian Wang1-16/+8
Use riscv_has_extension_likely() to check for RISCV_ISA_EXT_ZBB, replacing the use of asm goto with ALTERNATIVE. The "likely" variant is used to match the behavior of the original implementation using ALTERNATIVE("j %l[legacy]", "nop", ...). Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn> Link: https://patch.msgid.link/20251020-riscv-altn-helper-wip-v4-3-ef941c87669a@iscas.ac.cn Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19riscv: checksum: Use riscv_has_extension_likelyVivian Wang2-50/+16
Use riscv_has_extension_likely() to check for RISCV_ISA_EXT_ZBB, replacing the use of asm goto with ALTERNATIVE. The "likely" variant is used to match the behavior of the original implementation using ALTERNATIVE("j %l[no_zbb]", "nop", ...). While we're at it, also remove bogus comment about Zbb being likely available. We have to choose between "likely" and "unlikely" due to limitations of the asm goto feature, but that does not mean we should put a bad comment on why we pick "likely" over "unlikely". Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn> Link: https://patch.msgid.link/20251020-riscv-altn-helper-wip-v4-2-ef941c87669a@iscas.ac.cn Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19riscv: pgtable: Use riscv_has_extension_unlikelyVivian Wang2-20/+17
Use riscv_has_extension_unlikely() to check for RISCV_ISA_EXT_SVVPTC, replacing the use of asm goto with ALTERNATIVE. The "unlikely" variant is used to match the behavior of the original implementation using ALTERNATIVE("nop", "j %l[svvptc]", ...). Note that this makes the check for RISCV_ISA_EXT_SVVPTC a runtime one if RISCV_ALTERNATIVE=n, but it should still be worthwhile to do so given that TLB flushes are relatively slow. Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn> Link: https://patch.msgid.link/20251020-riscv-altn-helper-wip-v4-1-ef941c87669a@iscas.ac.cn Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19riscv: Remove __GFP_HIGHMEM maskingVishal Moola (Oracle)1-2/+2
Remove unnecessary __GFP_HIGHMEM masking, which was introduced with commit 380f2c1ae9d4 ("riscv: convert alloc_{pmd, pte}_late() to use ptdescs"). GFP_KERNEL doesn't contain __GFP_HIGHMEM. Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Link: https://patch.msgid.link/20251107182620.95844-1-vishal.moola@gmail.com Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19RISC-V: Enable HOTPLUG_PARALLEL for secondary CPUsAnup Patel2-1/+16
The core kernel already supports parallel bringup of secondary CPUs (aka HOTPLUG_PARALLEL). The x86 and MIPS architectures already use HOTPLUG_PARALLEL and ARM is also moving toward it. On RISC-V, there is no arch specific global data accessed in the RISC-V secondary CPU bringup path so enabling HOTPLUG_PARALLEL for RISC-V would only require: 1) Providing RISC-V specific arch_cpuhp_kick_ap_alive() 2) Calling cpuhp_ap_sync_alive() from smp_callin() This patch is tested natively with OpenSBI on QEMU RV64 virt machine with 64 cores and also tested with KVM RISC-V guest with 32 VCPUs. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Link: https://patch.msgid.link/20250905122512.71684-1-apatel@ventanamicro.com Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-11-19arm64, tlbflush: don't TLBI broadcast if page reused in write faultHuang Ying4-9/+72
A multi-thread customer workload with large memory footprint uses fork()/exec() to run some external programs every tens seconds. When running the workload on an arm64 server machine, it's observed that quite some CPU cycles are spent in the TLB flushing functions. While running the workload on the x86_64 server machine, it's not. This causes the performance on arm64 to be much worse than that on x86_64. During the workload running, after fork()/exec() write-protects all pages in the parent process, memory writing in the parent process will cause a write protection fault. Then the page fault handler will make the PTE/PDE writable if the page can be reused, which is almost always true in the workload. On arm64, to avoid the write protection fault on other CPUs, the page fault handler flushes the TLB globally with TLBI broadcast after changing the PTE/PDE. However, this isn't always necessary. Firstly, it's safe to leave some stale read-only TLB entries as long as they will be flushed finally. Secondly, it's quite possible that the original read-only PTE/PDEs aren't cached in remote TLB at all if the memory footprint is large. In fact, on x86_64, the page fault handler doesn't flush the remote TLB in this situation, which benefits the performance a lot. To improve the performance on arm64, make the write protection fault handler flush the TLB locally instead of globally via TLBI broadcast after making the PTE/PDE writable. If there are stale read-only TLB entries in the remote CPUs, the page fault handler on these CPUs will regard the page fault as spurious and flush the stale TLB entries. To test the patchset, make the usemem.c from vm-scalability (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git). support calling fork()/exec() periodically. To mimic the behavior of the customer workload, run usemem with 4 threads, access 100GB memory, and call fork()/exec() every 40 seconds. Test results show that with the patchset the score of usemem improves ~40.6%. The cycles% of TLB flush functions reduces from ~50.5% to ~0.3% in perf profile. Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19KVM: x86: Add a helper to dedup loading guest/host XCR0 and XSSBinbin Wu1-23/+10
Add and use a helper, kvm_load_xfeatures(), to dedup the code that loads guest/host xfeatures. Opportunistically return early if X86_CR4_OSXSAVE is not set to reduce indentations. No functional change intended. Suggested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20251110050539.3398759-1-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19KVM: x86: Load guest/host PKRU outside of the fastpath run loopSean Christopherson4-12/+10
Move KVM's swapping of PKRU outside of the fastpath loop, as there is no KVM code anywhere in the fastpath that accesses guest/userspace memory, i.e. that can consume protection keys. As documented by commit 1be0e61c1f25 ("KVM, pkeys: save/restore PKRU when guest/host switches"), KVM just needs to ensure the host's PKRU is loaded when KVM (or the kernel at-large) may access userspace memory. And at the time of commit 1be0e61c1f25, KVM didn't have a fastpath, and PKU was strictly contained to VMX, i.e. there was no reason to swap PKRU outside of vmx_vcpu_run(). Over time, the "need" to swap PKRU close to VM-Enter was likely falsely solidified by the association with XFEATUREs in commit 37486135d3a7 ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c"), and XFEATURE swapping was in turn moved close to VM-Enter/VM-Exit as a KVM hack-a-fix ution for an #MC handler bug by commit 1811d979c716 ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context"). Deferring the PKRU loads shaves ~40 cycles off the fastpath for Intel, and ~60 cycles for AMD. E.g. using INVD in KVM-Unit-Test's vmexit.c, with extra hacks to enable CR4.PKE and PKRU=(-1u & ~0x3), latency numbers for AMD Turin go from ~1560 => ~1500, and for Intel Emerald Rapids, go from ~810 => ~770. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://patch.msgid.link/20251118222328.2265758-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loopSean Christopherson1-13/+26
Move KVM's swapping of XFEATURE masks, i.e. XCR0 and XSS, out of the fastpath loop now that the guts of the #MC handler runs in task context, i.e. won't invoke schedule() with preemption disabled and clobber state (or crash the kernel) due to trying to context switch XSTATE with a mix of host and guest state. For all intents and purposes, this reverts commit 1811d979c716 ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context"), which papered over an egregious bug/flaw in the #MC handler where it would do schedule() even though IRQs are disabled. E.g. the call stack from the commit: kvm_load_guest_xcr0 ... kvm_x86_ops->run(vcpu) vmx_vcpu_run vmx_complete_atomic_exit kvm_machine_check do_machine_check do_memory_failure memory_failure lock_page Commit 1811d979c716 "fixed" the immediate issue of XRSTORS exploding, but completely ignored that scheduling out a vCPU task while IRQs and preemption is wildly broken. Thankfully, commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work") (somewhat incidentally?) fixed that flaw by pushing the meat of the work to the user-return path, i.e. to task context. KVM has also hardened itself against #MC goofs by moving #MC forwarding to kvm_x86_ops.handle_exit_irqoff(), i.e. out of the fastpath. While that's by no means a robust fix, restoring as much state as possible before handling the #MC will hopefully provide some measure of protection in the event that #MC handling goes off the rails again. Note, KVM always intercepts XCR0 writes for vCPUs without protected state, e.g. there's no risk of consuming a stale XCR0 when determining if a PKRU update is needed; kvm_load_host_xfeatures() only reads, and never writes, vcpu->arch.xcr0. Deferring the XCR0 and XSS loads shaves ~300 cycles off the fastpath for Intel, and ~500 cycles for AMD. E.g. using INVD in KVM-Unit-Test's vmexit.c, which an extra hack to enable CR4.OXSAVE, latency numbers for AMD Turin go from ~2000 => 1500, and for Intel Emerald Rapids, go from ~1300 => ~1000. Cc: Jon Kohler <jon@nutanix.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://patch.msgid.link/20251118222328.2265758-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpathSean Christopherson2-8/+11
Handle Machine Checks (#MC) that happen on VM-Enter (VMX or TDX) outside of KVM's fastpath so that as much host state as possible is re-loaded before invoking the kernel's #MC handler. The only requirement is that KVM invokes the #MC handler before enabling IRQs (and even that could _probably_ be related to handling #MCs before enabling preemption). Waiting to handle #MCs until "more" host state is loaded hardens KVM against flaws in the #MC handler, which has historically been quite brittle. E.g. prior to commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work"), the #MC code could trigger a schedule() with IRQs and preemption disabled. That led to a KVM hack-a-fix in commit 1811d979c716 ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context"). Note, vmx_handle_exit_irqoff() is common to VMX and TDX guests. Cc: Tony Lindgren <tony.lindgren@linux.intel.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Jon Kohler <jon@nutanix.com> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Link: https://patch.msgid.link/20251118222328.2265758-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19perf/x86/intel/uncore: Remove superfluous checkJiri Slaby (SUSE)1-2/+0
The 'pmu' pointer cannot be NULL, as it is taken as a pointer to an array. Remove the superfluous NULL check. Found by Coverity: CID#1497507. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Liang Kan <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://patch.msgid.link/20251119091538.825307-1-jirislaby@kernel.org
2025-11-19KVM: SVM: Handle #MCs in guest outside of fastpathSean Christopherson1-9/+9
Handle Machine Checks (#MC) that happen in the guest (by forwarding them to the host) outside of KVM's fastpath so that as much host state as possible is re-loaded before invoking the kernel's #MC handler. The only requirement is that KVM invokes the #MC handler before enabling IRQs (and even that could _probably_ be relaxed to handling #MCs before enabling preemption). Waiting to handle #MCs until "more" host state is loaded hardens KVM against flaws in the #MC handler, which has historically been quite brittle. E.g. prior to commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work"), the #MC code could trigger a schedule() with IRQs and preemption disabled. That led to a KVM hack-a-fix in commit 1811d979c716 ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context"). Note, except for #MCs on VM-Enter, VMX already handles #MCs outside of the fastpath. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://patch.msgid.link/20251118222328.2265758-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19KVM: x86: Unify L1TF flushing under per-CPU variableBrendan Jackman6-23/+24
Currently the tracking of the need to flush L1D for L1TF is tracked by two bits: one per-CPU and one per-vCPU. The per-vCPU bit is always set when the vCPU shows up on a core, so there is no interesting state that's truly per-vCPU. Indeed, this is a requirement, since L1D is a part of the physical CPU. So simplify this by combining the two bits. The vCPU bit was being written from preemption-enabled regions. To play nice with those cases, wrap all calls from KVM and use a raw write so that request a flush with preemption enabled doesn't trigger what would effectively be DEBUG_PREEMPT false positives. Preemption doesn't need to be disabled, as kvm_arch_vcpu_load() will mark the new CPU as needing a flush if the vCPU task is migrated, or if userspace runs the vCPU on a different task. Signed-off-by: Brendan Jackman <jackmanb@google.com> [sean: put raw write in KVM instead of in a hardirq.h variant] Link: https://patch.msgid.link/20251113233746.1703361-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19KVM: VMX: Disable L1TF L1 data cache flush if CONFIG_CPU_MITIGATIONS=nSean Christopherson2-14/+46
Disable support for flushing the L1 data cache to mitigate L1TF if CPU mitigations are disabled for the entire kernel. KVM's mitigation of L1TF is in no way special enough to justify ignoring CONFIG_CPU_MITIGATIONS=n. Deliberately use CPU_MITIGATIONS instead of the more precise MITIGATION_L1TF, as MITIGATION_L1TF only controls the default behavior, i.e. CONFIG_MITIGATION_L1TF=n doesn't completely disable L1TF mitigations in the kernel. Keep the vmentry_l1d_flush module param to avoid breaking existing setups, and leverage the .set path to alert the user to the fact that vmentry_l1d_flush will be ignored. Don't bother validating the incoming value; if an admin misconfigures vmentry_l1d_flush, the fact that the bad configuration won't be detected when running with CONFIG_CPU_MITIGATIONS=n is likely the least of their worries. Reviewed-by: Brendan Jackman <jackmanb@google.com> Link: https://patch.msgid.link/20251113233746.1703361-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19KVM: VMX: Bundle all L1 data cache flush mitigation code togetherSean Christopherson1-87/+87
Move vmx_l1d_flush(), vmx_cleanup_l1d_flush(), and the vmentry_l1d_flush param code up in vmx.c so that all of the L1 data cache flushing code is bundled together. This will allow conditioning the mitigation code on CONFIG_CPU_MITIGATIONS=y with minimal #ifdefs. No functional change intended. Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://patch.msgid.link/20251113233746.1703361-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19x86/bugs: KVM: Move VM_CLEAR_CPU_BUFFERS into SVM as SVM_CLEAR_CPU_BUFFERSSean Christopherson2-5/+4
Now that VMX encodes its own sequence for clearing CPU buffers, move VM_CLEAR_CPU_BUFFERS into SVM to minimize the chances of KVM botching a mitigation in the future, e.g. using VM_CLEAR_CPU_BUFFERS instead of checking multiple mitigation flags. No functional change intended. Reviewed-by: Brendan Jackman <jackmanb@google.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://patch.msgid.link/20251113233746.1703361-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19KVM: VMX: Handle MMIO Stale Data in VM-Enter assembly via ALTERNATIVES_2Sean Christopherson2-15/+14
Rework the handling of the MMIO Stale Data mitigation to clear CPU buffers immediately prior to VM-Enter, i.e. in the same location that KVM emits a VERW for unconditional (at runtime) clearing. Co-locating the code and using a single ALTERNATIVES_2 makes it more obvious how VMX mitigates the various vulnerabilities. Deliberately order the alternatives as: 0. Do nothing 1. Clear if vCPU can access MMIO 2. Clear always since the last alternative wins in ALTERNATIVES_2(), i.e. so that KVM will honor the strictest mitigation (always clear CPU buffers) if multiple mitigations are selected. E.g. even if the kernel chooses to mitigate MMIO Stale Data via X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO, another mitigation may enable X86_FEATURE_CLEAR_CPU_BUF_VM, and that other thing needs to win. Note, decoupling the MMIO mitigation from the L1TF mitigation also fixes a mostly-benign flaw where KVM wouldn't do any clearing/flushing if the L1TF mitigation is configured to conditionally flush the L1D, and the MMIO mitigation but not any other "clear CPU buffers" mitigation is enabled. For that specific scenario, KVM would skip clearing CPU buffers for the MMIO mitigation even though the kernel requested a clear on every VM-Enter. Note #2, the flaw goes back to the introduction of the MDS mitigation. The MDS mitigation was inadvertently fixed by commit 43fb862de8f6 ("KVM/VMX: Move VERW closer to VMentry for MDS mitigation"), but previous kernels that flush CPU buffers in vmx_vcpu_enter_exit() are affected (though it's unlikely the flaw is meaningfully exploitable even older kernels). Fixes: 650b68a0622f ("x86/kvm/vmx: Add MDS protection when L1D Flush is not active") Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Reviewed-by: Brendan Jackman <jackmanb@google.com> Link: https://patch.msgid.link/20251113233746.1703361-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19x86/bugs: Use an x86 feature to track the MMIO Stale Data mitigationSean Christopherson5-15/+9
Convert the MMIO Stale Data mitigation tracking from a static branch into an x86 feature flag so that it can be used via ALTERNATIVE_2 in KVM. No functional change intended. Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Reviewed-by: Brendan Jackman <jackmanb@google.com> Link: https://patch.msgid.link/20251113233746.1703361-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19x86/bugs: Decouple ALTERNATIVE usage from VERW macro definitionSean Christopherson1-11/+16
Decouple the use of ALTERNATIVE from the encoding of VERW to clear CPU buffers so that KVM can use ALTERNATIVE_2 to handle "always clear buffers" and "clear if guest can access host MMIO" in a single statement. No functional change intended. Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://patch.msgid.link/20251113233746.1703361-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19x86/bugs: Use VM_CLEAR_CPU_BUFFERS in VMX as wellPawan Gupta2-5/+10
TSA mitigation: d8010d4ba43e ("x86/bugs: Add a Transient Scheduler Attacks mitigation") introduced VM_CLEAR_CPU_BUFFERS for guests on AMD CPUs. Currently on Intel CLEAR_CPU_BUFFERS is being used for guests which has a much broader scope (kernel->user also). Make mitigations on Intel consistent with TSA. This would help handling the guest-only mitigations better in future. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> [sean: make CLEAR_CPU_BUF_VM mutually exclusive with the MMIO mitigation] Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Brendan Jackman <jackmanb@google.com> Link: https://patch.msgid.link/20251113233746.1703361-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19KVM: VMX: Use on-stack copy of @flags in __vmx_vcpu_run()Sean Christopherson2-16/+7
When testing for VMLAUNCH vs. VMRESUME, use the copy of @flags from the stack instead of first moving it to EBX, and then propagating VMX_RUN_VMRESUME to RFLAGS.CF (because RBX is clobbered with the guest value prior to the conditional branch to VMLAUNCH). Stashing information in RFLAGS is gross, especially with the writer and reader being bifurcated by yet more gnarly assembly code. Opportunistically drop the SHIFT macros as they existed purely to allow the VM-Enter flow to use Bit Test. Suggested-by: Borislav Petkov <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Brendan Jackman <jackmanb@google.com> Link: https://patch.msgid.link/20251113233746.1703361-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-19KVM: x86: Allocate/free user_return_msrs at kvm.ko (un)loading timeChao Gao1-27/+13
Move user_return_msrs allocation/free from vendor modules (kvm-intel.ko and kvm-amd.ko) (un)loading time to kvm.ko's to make it less risky to access user_return_msrs in kvm.ko. Tying the lifetime of user_return_msrs to vendor modules makes every access to user_return_msrs prone to use-after-free issues as vendor modules may be unloaded at any time. Opportunistically turn the per-CPU variable into full structs, as there's no practical difference between statically allocating the memory and allocating it unconditionally during module_init(). Zero out kvm_nr_uret_msrs on vendor module exit to further minimize the chances of consuming stale data, and WARN on vendor module load if KVM thinks there are existing user-return MSRs. Note! The user-return MSRs also need to be "destroyed" if ops->hardware_setup() fails, as both SVM and VMX expect common KVM to clean up (because common code, not vendor code, is responsible for kvm_nr_uret_msrs). Signed-off-by: Chao Gao <chao.gao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20251108013601.902918-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>