summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2025-01-25Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds37-942/+1769
Pull kvm updates from Paolo Bonzini: "Loongarch: - Clear LLBCTL if secondary mmu mapping changes - Add hypercall service support for usermode VMM x86: - Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a direct call to kvm_tdp_page_fault() when RETPOLINE is enabled - Ensure that all SEV code is compiled out when disabled in Kconfig, even if building with less brilliant compilers - Remove a redundant TLB flush on AMD processors when guest CR4.PGE changes - Use str_enabled_disabled() to replace open coded strings - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache prior to every VM-Enter - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities instead of just those where KVM needs to manage state and/or explicitly enable the feature in hardware. Along the way, refactor the code to make it easier to add features, and to make it more self-documenting how KVM is handling each feature - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios (e.g. if emulation is triggered by the exit), and brings parity between VMX and SVM - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU, due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU - Make the completion of hypercalls go through the complete_hypercall function pointer argument, no matter if the hypercall exits to userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically went to userspace, and all the others did not; the new code need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at all whether there was an exit to userspace or not - As part of enabling TDX virtual machines, support support separation of private/shared EPT into separate roots. When TDX will be enabled, operations on private pages will need to go through the privileged TDX Module via SEAMCALLs; as a result, they are limited and relatively slow compared to reading a PTE. The patches included in 6.14 allow KVM to keep a mirror of the private EPT in host memory, and define entries in kvm_x86_ops to operate on external page tables such as the TDX private EPT - The recently introduced conversion of the NX-page reclamation kthread to vhost_task moved the task under the main process. The task is created as soon as KVM_CREATE_VM was invoked and this, of course, broke userspace that didn't expect to see any child task of the VM process until it started creating its own userspace threads. In particular crosvm refuses to fork() if procfs shows any child task, so unbreak it by creating the task lazily. This is arguably a userspace bug, as there can be other kinds of legitimate worker tasks and they wouldn't impede fork(); but it's not like userspace has a way to distinguish kernel worker tasks right now. Should they show as "Kthread: 1" in proc/.../status? x86 - Intel: - Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit while L2 is active, while ultimately results in a hardware-accelerated L1 EOI effectively being lost - Honor event priority when emulating Posted Interrupt delivery during nested VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the interrupt - Rework KVM's processing of the Page-Modification Logging buffer to reap entries in the same order they were created, i.e. to mark gfns dirty in the same order that hardware marked the page/PTE dirty - Misc cleanups Generic: - Cleanup and harden kvm_set_memory_region(); add proper lockdep assertions when setting memory regions and add a dedicated API for setting KVM-internal memory regions. The API can then explicitly disallow all flags for KVM-internal memory regions - Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug where KVM would return a pointer to a vCPU prior to it being fully online, and give kvm_for_each_vcpu() similar treatment to fix a similar flaw - Wait for a vCPU to come online prior to executing a vCPU ioctl, to fix a bug where userspace could coerce KVM into handling the ioctl on a vCPU that isn't yet onlined - Gracefully handle xarray insertion failures; even though such failures are impossible in practice after xa_reserve(), reserving an entry is always followed by xa_store() which does not know (or differentiate) whether there was an xa_reserve() before or not RISC-V: - Zabha, Svvptc, and Ziccrse extension support for guests. None of them require anything in KVM except for detecting them and marking them as supported; Zabha adds byte and halfword atomic operations, while the others are markers for specific operation of the TLB and of LL/SC instructions respectively - Virtualize SBI system suspend extension for Guest/VM - Support firmware counters which can be used by the guests to collect statistics about traps that occur in the host Selftests: - Rework vcpu_get_reg() to return a value instead of using an out-param, and update all affected arch code accordingly - Convert the max_guest_memory_test into a more generic mmu_stress_test. The basic gist of the "conversion" is to have the test do mprotect() on guest memory while vCPUs are accessing said memory, e.g. to verify KVM and mmu_notifiers are working as intended - Play nice with treewrite builds of unsupported architectures, e.g. arm (32-bit), as KVM selftests' Makefile doesn't do anything to ensure the target architecture is actually one KVM selftests supports - Use the kernel's $(ARCH) definition instead of the target triple for arch specific directories, e.g. arm64 instead of aarch64, mainly so as not to be different from the rest of the kernel - Ensure that format strings for logging statements are checked by the compiler even when the logging statement itself is disabled - Attempt to whack the last LLC references/misses mole in the Intel PMU counters test by adding a data load and doing CLFLUSH{OPT} on the data instead of the code being executed. It seems that modern Intel CPUs have learned new code prefetching tricks that bypass the PMU counters - Fix a flaw in the Intel PMU counters test where it asserts that events are counting correctly without actually knowing what the events count given the underlying hardware; this can happen if Intel reuses a formerly microarchitecture-specific event encoding as an architectural event, as was the case for Top-Down Slots" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits) kvm: defer huge page recovery vhost task to later KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault() KVM: Disallow all flags for KVM-internal memslots KVM: x86: Drop double-underscores from __kvm_set_memory_region() KVM: Add a dedicated API for setting KVM-internal memslots KVM: Assert slots_lock is held when setting memory regions KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API) LoongArch: KVM: Add hypercall service support for usermode VMM LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup() KVM: VMX: read the PML log in the same order as it was written KVM: VMX: refactor PML terminology KVM: VMX: Fix comment of handle_vmx_instruction() KVM: VMX: Reinstate __exit attribute for vmx_exit() KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup() KVM: x86: Avoid double RDPKRU when loading host/guest PKRU KVM: x86: Use LVT_TIMER instead of an open coded literal RISC-V: KVM: Add new exit statstics for redirected traps RISC-V: KVM: Update firmware counters for various events RISC-V: KVM: Redirect instruction access fault trap to guest ...
2025-01-25Merge tag 'hyperv-next-signed-20250123' of ↵Linus Torvalds15-840/+20
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Introduce a new set of Hyper-V headers in include/hyperv and replace the old hyperv-tlfs.h with the new headers (Nuno Das Neves) - Fixes for the Hyper-V VTL mode (Roman Kisel) - Fixes for cpu mask usage in Hyper-V code (Michael Kelley) - Document the guest VM hibernation behaviour (Michael Kelley) - Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain) * tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: Documentation: hyperv: Add overview of guest VM hibernation hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id() hyperv: Do not overlap the hvcall IO areas in get_vtl() hyperv: Enable the hypercall output page for the VTL mode hv_balloon: Fallback to generic_online_page() for non-HV hot added mem Drivers: hv: vmbus: Log on missing offers if any Drivers: hv: vmbus: Wait for boot-time offers during boot and resume uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup iommu/hyper-v: Don't assume cpu_possible_mask is dense Drivers: hv: Don't assume cpu_possible_mask is dense x86/hyperv: Don't assume cpu_possible_mask is dense hyperv: Remove the now unused hyperv-tlfs.h files hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h hyperv: Add new Hyper-V headers in include/hyperv hyperv: Clean up unnecessary #includes hyperv: Move hv_connection_id to hyperv-tlfs.h
2025-01-25Merge tag 'efi-next-for-v6.14' of ↵Linus Torvalds3-10/+8
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI updates from Ard Biesheuvel: - Increase the headroom in the EFI memory map allocation created by the EFI stub. This is needed because event callbacks called during ExitBootServices() may cause fragmentation, and reallocation is not allowed after that. - Drop obsolete UGA graphics code and switch to a more ergonomic API to traverse handle buffers. Simplify some error paths using a __free() helper while at it. - Fix some W=1 warnings when CONFIG_EFI=n - Rely on the dentry cache to keep track of the contents of the efivarfs filesystem, rather than using a separate linked list. - Improve and extend efivarfs test cases. - Synchronize efivarfs with underlying variable store on resume from hibernation - this is needed because the firmware itself or another OS running on the same machine may have modified it. - Fix x86 EFI stub build with GCC 15. - Fix kexec/x86 false positive warning in EFI memory attributes table sanity check. * tag 'efi-next-for-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (23 commits) x86/efi: skip memattr table on kexec boot efivarfs: add variable resync after hibernation efivarfs: abstract initial variable creation routine efi: libstub: Use '-std=gnu11' to fix build with GCC 15 selftests/efivarfs: add concurrent update tests selftests/efivarfs: fix tests for failed write removal efivarfs: fix error on write to new variable leaving remnants efivarfs: remove unused efivarfs_list efivarfs: move variable lifetime management into the inodes selftests/efivarfs: add check for disallowing file truncation efivarfs: prevent setting of zero size on the inodes in the cache efi: sysfb_efi: fix W=1 warnings when EFI is not set efi/libstub: Use __free() helper for pool deallocations efi/libstub: Use cleanup helpers for freeing copies of the memory map efi/libstub: Simplify PCI I/O handle buffer traversal efi/libstub: Refactor and clean up GOP resolution picker code efi/libstub: Simplify GOP handling code efi/libstub: Use C99-style for loop to traverse handle buffer x86/efistub: Drop long obsolete UGA support efivarfs: make variable_is_present use dcache lookup ...
2025-01-24kvm: defer huge page recovery vhost task to laterKeith Busch3-6/+21
Some libraries want to ensure they are single threaded before forking, so making the kernel's kvm huge page recovery process a vhost task of the user process breaks those. The minijail library used by crosvm is one such affected application. Defer the task to after the first VM_RUN call, which occurs after the parent process has forked all its jailed processes. This needs to happen only once for the kvm instance, so introduce some general-purpose infrastructure for that, too. It's similar in concept to pthread_once; except it is actually usable, because the callback takes a parameter. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Alyssa Ross <hi@alyssa.is> Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250123153543.2769928-1-kbusch@meta.com> [Move call_once API to include/linux. - Paolo] Cc: stable@vger.kernel.org Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-01-24Merge tag 'v6.14-p1' of ↵Linus Torvalds8-242/+221
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Remove physical address skcipher walking - Fix boot-up self-test race Algorithms: - Optimisations for x86/aes-gcm - Optimisations for x86/aes-xts - Remove VMAC - Remove keywrap Drivers: - Remove n2 Others: - Fixes for padata UAF - Fix potential rhashtable deadlock by moving schedule_work outside lock" * tag 'v6.14-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (75 commits) rhashtable: Fix rhashtable_try_insert test dt-bindings: crypto: qcom,inline-crypto-engine: Document the SM8750 ICE dt-bindings: crypto: qcom,prng: Document SM8750 RNG dt-bindings: crypto: qcom-qce: Document the SM8750 crypto engine crypto: asymmetric_keys - Remove unused key_being_used_for[] padata: avoid UAF for reorder_work padata: fix UAF in padata_reorder padata: add pd get/put refcnt helper crypto: skcipher - call cond_resched() directly crypto: skcipher - optimize initializing skcipher_walk fields crypto: skcipher - clean up initialization of skcipher_walk::flags crypto: skcipher - fold skcipher_walk_skcipher() into skcipher_walk_virt() crypto: skcipher - remove redundant check for SKCIPHER_WALK_SLOW crypto: skcipher - remove redundant clamping to page size crypto: skcipher - remove unnecessary page alignment of bounce buffer crypto: skcipher - document skcipher_walk_done() and rename some vars crypto: omap - switch from scatter_walk to plain offset crypto: powerpc/p10-aes-gcm - simplify handling of linear associated data crypto: bcm - Drop unused setting of local 'ptr' variable crypto: hisilicon/qm - support new function communication ...
2025-01-24Merge tag 'platform-drivers-x86-v6.14-1' of ↵Linus Torvalds2-9/+62
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver updates from Ilpo Järvinen: "acer-wmi: - Add support for PH14-51, PH16-72, and Nitro AN515-58 - Add proper hwmon support - Improve error handling when reading "gaming system info" - Replace direct EC reads for the current platform profile with WMI calls to handle EC address variations - Replace custom platform_profile cycling with the generic one ACPI: - platform_profile: Major refactoring and improvements - Support registering multiple platform_profile handlers concurrently to avoid the need to quirk which handler takes precedence - Support reporting "custom" profile for cases where the current profile is ambiguous or when settings tweaks are done outside the pre-defined profile - Abstract and layer platform_profile API better using the class_dev and drvdata - Various minor improvements - Add Documentation and kerneldoc amd/hsmp: - Add support for HSMP protocol v7 amd/pmc: - Support AMD 1Ah family 70h - Support STB with Ryzen desktop SoCs amd/pmf: - Support Custom BIOS inputs for PMF TA - Support passing SRA sensor data from AMD SFH (HID) to PMF TA dell-smo8800: - Move SMO88xx quirk away from the generic i2c-i801 driver - Add accelerometer support for Dell Latitude E6330/E6430 and XPS 9550 - Support probing accelerometer for models yet to be listed in the DMI mapping table because ACPI lacks i2c-address for the accelerometer (behind a module parameter because probing might be dangerous) HID: - amd_sfh: Add support for exporting SRA sensor data hp-wmi: - Add fan and thermal support for Victus 16-s1000 input: - Add key for phone linking - i8042: Add context for the i8042 filter to enable cleaning up the filter related global variables from pdx86 drivers lenovo-wmi-camera: - Use SW_CAMERA_LENS_COVER instead of KEY_CAMERA_ACCESS mellanox mlxbf-pmc: - Add support for monitoring cycle count - Add Documentation thinkpad_acpi: - Add support for phone link key tools/power/x86/intel-speed-select: - Fix Turbo Ratio Limit restore x86-android-tables: - Add support for Vexia EDU ATLA 10 Bluetooth and EC battery driver And miscellaneous cleanups / refactoring / improvements" * tag 'platform-drivers-x86-v6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (133 commits) platform/x86: acer-wmi: Fix initialization of last_non_turbo_profile platform/x86: acer-wmi: Ignore AC events platform/mellanox: mlxreg-io: use sysfs_emit() instead of sprintf() platform/mellanox: mlxreg-hotplug: use sysfs_emit() instead of sprintf() platform/mellanox: mlxbf-bootctl: use sysfs_emit() instead of sprintf() platform/x86: hp-wmi: Add fan and thermal profile support for Victus 16-s1000 ACPI: platform_profile: Add a prefix to log messages ACPI: platform_profile: Add documentation ACPI: platform_profile: Clean platform_profile_handler ACPI: platform_profile: Move platform_profile_handler ACPI: platform_profile: Remove platform_profile_handler from exported symbols platform/x86: thinkpad_acpi: Use devm_platform_profile_register() platform/x86: inspur_platform_profile: Use devm_platform_profile_register() platform/x86: hp-wmi: Use devm_platform_profile_register() platform/x86: ideapad-laptop: Use devm_platform_profile_register() platform/x86: dell-pc: Use devm_platform_profile_register() platform/x86: asus-wmi: Use devm_platform_profile_register() platform/x86: amd: pmf: sps: Use devm_platform_profile_register() platform/x86: acer-wmi: Use devm_platform_profile_register() platform/surface: surface_platform_profile: Use devm_platform_profile_register() ...
2025-01-24Merge tag 'x86_tdx_for_6.14-rc1' of ↵Linus Torvalds9-112/+252
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 TDX updates from Dave Hansen: "Intel Trust Domain updates. The existing TDX code needs a _bit_ of metadata from the TDX module. But KVM is going to need a bunch more very shortly. Rework the interface with the TDX module to be more consistent and handle the new higher volume. The TDX module has added a few new features. The first is a promise not to clobber RBP under any circumstances. Basically the kernel now will refuse to use any modules that don't have this promise. Second, enable the new "REDUCE_VE" feature. This ensures that the TDX module will not send some silly virtualization exceptions that the guest had no good way to handle anyway. - Centralize global metadata infrastructure - Use new TDX module features for exception suppression and RBP clobbering" * tag 'x86_tdx_for_6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/virt/tdx: Require the module to assert it has the NO_RBP_MOD mitigation x86/virt/tdx: Switch to use auto-generated global metadata reading code x86/virt/tdx: Use dedicated struct members for PAMT entry sizes x86/virt/tdx: Use auto-generated code to read global metadata x86/virt/tdx: Start to track all global metadata in one structure x86/virt/tdx: Rename 'struct tdx_tdmr_sysinfo' to reflect the spec better x86/tdx: Dump attributes and TD_CTLS on boot x86/tdx: Disable unnecessary virtualization exceptions
2025-01-24Merge tag 'x86-boot-2025-01-21' of ↵Linus Torvalds17-218/+300
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Ingo Molnar: - A large and involved preparatory series to pave the way to add exception handling for relocate_kernel - which will be a debugging facility that has aided in the field to debug an exceptionally hard to debug early boot bug. Plus assorted cleanups and fixes that were discovered along the way, by David Woodhouse: - Clean up and document register use in relocate_kernel_64.S - Use named labels in swap_pages in relocate_kernel_64.S - Only swap pages for ::preserve_context mode - Allocate PGD for x86_64 transition page tables separately - Copy control page into place in machine_kexec_prepare() - Invoke copy of relocate_kernel() instead of the original - Move relocate_kernel to kernel .data section - Add data section to relocate_kernel - Drop page_list argument from relocate_kernel() - Eliminate writes through kernel mapping of relocate_kernel page - Clean up register usage in relocate_kernel() - Mark relocate_kernel page as ROX instead of RWX - Disable global pages before writing to control page - Ensure preserve_context flag is set on return to kernel - Use correct swap page in swap_pages function - Fix stack and handling of re-entry point for ::preserve_context - Mark machine_kexec() with __nocfi - Cope with relocate_kernel() not being at the start of the page - Use typedef for relocate_kernel_fn function prototype - Fix location of relocate_kernel with -ffunction-sections (fix by Nathan Chancellor) - A series to remove the last remaining absolute symbol references from .head.text, and enforce this at build time, by Ard Biesheuvel: - Avoid WARN()s and panic()s in early boot code - Don't hang but terminate on failure to remap SVSM CA - Determine VA/PA offset before entering C code - Avoid intentional absolute symbol references in .head.text - Disable UBSAN in early boot code - Move ENTRY_TEXT to the start of the image - Move .head.text into its own output section - Reject absolute references in .head.text - The above build-time enforcement uncovered a handful of bugs of essentially non-working code, and a wrokaround for a toolchain bug, fixed by Ard Biesheuvel as well: - Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12 - Disable UBSAN on SEV code that may execute very early - Disable ftrace branch profiling in SEV startup code - And miscellaneous cleanups: - kexec_core: Add and update comments regarding the KEXEC_JUMP flow (Rafael J. Wysocki) - x86/sysfs: Constify 'struct bin_attribute' (Thomas Weißschuh)" * tag 'x86-boot-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) x86/sev: Disable ftrace branch profiling in SEV startup code x86/kexec: Use typedef for relocate_kernel_fn function prototype x86/kexec: Cope with relocate_kernel() not being at the start of the page kexec_core: Add and update comments regarding the KEXEC_JUMP flow x86/kexec: Mark machine_kexec() with __nocfi x86/kexec: Fix location of relocate_kernel with -ffunction-sections x86/kexec: Fix stack and handling of re-entry point for ::preserve_context x86/kexec: Use correct swap page in swap_pages function x86/kexec: Ensure preserve_context flag is set on return to kernel x86/kexec: Disable global pages before writing to control page x86/sev: Don't hang but terminate on failure to remap SVSM CA x86/sev: Disable UBSAN on SEV code that may execute very early x86/boot/64: Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12 x86/sysfs: Constify 'struct bin_attribute' x86/kexec: Mark relocate_kernel page as ROX instead of RWX x86/kexec: Clean up register usage in relocate_kernel() x86/kexec: Eliminate writes through kernel mapping of relocate_kernel page x86/kexec: Drop page_list argument from relocate_kernel() x86/kexec: Add data section to relocate_kernel x86/kexec: Move relocate_kernel to kernel .data section ...
2025-01-24x86/acpi: Fix LAPIC/x2APIC parsing orderZhang Rui1-5/+45
On some systems, the same CPU (with the same APIC ID) is assigned a different logical CPU id after commit ec9aedb2aa1a ("x86/acpi: Ignore invalid x2APIC entries"). This means that Linux enumerates the CPUs in a different order, which violates ACPI specification[1] that states: "OSPM should initialize processors in the order that they appear in the MADT" The problematic commit parses all LAPIC entries before any x2APIC entries, aiming to ignore x2APIC entries with APIC ID < 255 when valid LAPIC entries exist. However, it disrupts the CPU enumeration order on systems where x2APIC entries precede LAPIC entries in the MADT. Fix this problem by: 1) Parsing LAPIC entries first without registering them in the topology to evaluate whether valid LAPIC entries exist. 2) Restoring the MADT in order parser which invokes either the LAPIC or the X2APIC parser function depending on the entry type. The X2APIC parser still ignores entries < 0xff in case that #1 found valid LAPIC entries independent of their position in the MADT table. Link: https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#madt-processor-local-apic-sapic-structure-entry-order Cc: All applicable <stable@vger.kernel.org> Reported-by: Jim Mattson <jmattson@google.com> Closes: https://lore.kernel.org/all/20241010213136.668672-1-jmattson@google.com/ Fixes: ec9aedb2aa1a ("x86/acpi: Ignore invalid x2APIC entries") Signed-off-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Tested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20250117081420.4046737-1-rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-23x86/efi: skip memattr table on kexec bootDave Young1-0/+5
efi_memattr_init() added a sanity check to avoid firmware caused corruption. The check is based on efi memmap entry numbers, but kexec only takes the runtime related memmap entries thus this caused many false warnings, see below thread for details: https://lore.kernel.org/all/20250108215957.3437660-2-usamaarif642@gmail.com/ Ard suggests to skip the efi memattr table in kexec, this makes sense because those memattr fixups are not critical. Fixes: 8fbe4c49c0cc ("efi/memattr: Ignore table if the size is clearly bogus") Cc: <stable@vger.kernel.org> # v6.13+ Reported-by: Breno Leitao <leitao@debian.org> Reported-and-tested-by: Usama Arif <usamaarif642@gmail.com> Suggested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Dave Young <dyoung@redhat.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-01-23Merge tag 'crc-for-linus' of ↵Linus Torvalds12-678/+225
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC updates from Eric Biggers: - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be directly accessible via the library API, instead of requiring the crypto API. This is much simpler and more efficient. - Convert some users such as ext4 to use the CRC32 library API instead of the crypto API. More conversions like this will come later. - Add a KUnit test that tests and benchmarks multiple CRC variants. Remove older, less-comprehensive tests that are made redundant by this. - Add an entry to MAINTAINERS for the kernel's CRC library code. I'm volunteering to maintain it. I have additional cleanups and optimizations planned for future cycles. * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (31 commits) MAINTAINERS: add entry for CRC library powerpc/crc: delete obsolete crc-vpmsum_test.c lib/crc32test: delete obsolete crc32test.c lib/crc16_kunit: delete obsolete crc16_kunit.c lib/crc_kunit.c: add KUnit test suite for CRC library functions powerpc/crc-t10dif: expose CRC-T10DIF function through lib arm64/crc-t10dif: expose CRC-T10DIF function through lib arm/crc-t10dif: expose CRC-T10DIF function through lib x86/crc-t10dif: expose CRC-T10DIF function through lib crypto: crct10dif - expose arch-optimized lib function lib/crc-t10dif: add support for arch overrides lib/crc-t10dif: stop wrapping the crypto API scsi: target: iscsi: switch to using the crc32c library f2fs: switch to using the crc32 library jbd2: switch to using the crc32c library ext4: switch to using the crc32c library lib/crc32: make crc32c() go directly to lib bcachefs: Explicitly select CRYPTO from BCACHEFS_FS x86/crc32: expose CRC32 functions through lib x86/crc32: update prototype for crc32_pclmul_le_16() ...
2025-01-22Merge tag 'kthread-for-6.14-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "Kthreads affinity follow either of 4 existing different patterns: 1) Per-CPU kthreads must stay affine to a single CPU and never execute relevant code on any other CPU. This is currently handled by smpboot code which takes care of CPU-hotplug operations. Affinity here is a correctness constraint. 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't run anywhere else. The affinity is set through kthread_bind_mask() and the subsystem takes care by itself to handle CPU-hotplug operations. Affinity here is assumed to be a correctness constraint. 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This is not a correctness constraint but merely a preference in terms of memory locality. kswapd and kcompactd both fall into this category. The affinity is set manually like for any other task and CPU-hotplug is supposed to be handled by the relevant subsystem so that the task is properly reaffined whenever a given CPU from the node comes up. Also care should be taken so that the node affinity doesn't cross isolated (nohz_full) cpumask boundaries. 4) Similar to the previous point except kthreads have a _preferred_ affinity different than a node. Both RCU boost kthreads and RCU exp kworkers fall into this category as they refer to "RCU nodes" from a distinctly distributed tree. Currently the preferred affinity patterns (3 and 4) have at least 4 identified users, with more or less success when it comes to handle CPU-hotplug operations and CPU isolation. Each of which do it in its own ad-hoc way. This is an infrastructure proposal to handle this with the following API changes: - kthread_create_on_node() automatically affines the created kthread to its target node unless it has been set as per-cpu or bound with kthread_bind[_mask]() before the first wake-up. - kthread_affine_preferred() is a new function that can be called right after kthread_create_on_node() to specify a preferred affinity different than the specified node. When the preferred affinity can't be applied because the possible targets are offline or isolated (nohz_full), the kthread is affine to the housekeeping CPUs (which means to all online CPUs most of the time or only the non-nohz_full CPUs when nohz_full= is set). kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been converted, along with a few old drivers. Summary of the changes: - Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu() - Introduce task_cpu_fallback_mask() that defines the default last resort affinity of a task to become nohz_full aware - Add some correctness check to ensure kthread_bind() is always called before the first kthread wake up. - Default affine kthread to its preferred node. - Convert kswapd / kcompactd and remove their halfway working ad-hoc affinity implementation - Implement kthreads preferred affinity - Unify kthread worker and kthread API's style - Convert RCU kthreads to the new API and remove the ad-hoc affinity implementation" * tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: kthread: modify kernel-doc function name to match code rcu: Use kthread preferred affinity for RCU exp kworkers treewide: Introduce kthread_run_worker[_on_cpu]() kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format rcu: Use kthread preferred affinity for RCU boost kthread: Implement preferred affinity mm: Create/affine kswapd to its preferred node mm: Create/affine kcompactd to its preferred node kthread: Default affine kthread to its preferred NUMA node kthread: Make sure kthread hasn't started while binding it sched,arm64: Handle CPU isolation on last resort fallback rq selection arm64: Exclude nohz_full CPUs from 32bits el0 support lib: test_objpool: Use kthread_run_on_cpu() kallsyms: Use kthread_run_on_cpu() soc/qman: test: Use kthread_run_on_cpu() arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-22Merge tag 'ftrace-v6.14' of ↵Linus Torvalds6-55/+81
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace updates from Steven Rostedt: - Have fprobes built on top of function graph infrastructure The fprobe logic is an optimized kprobe that uses ftrace to attach to functions when a probe is needed at the start or end of the function. The fprobe and kretprobe logic implements a similar method as the function graph tracer to trace the end of the function. That is to hijack the return address and jump to a trampoline to do the trace when the function exits. To do this, a shadow stack needs to be created to store the original return address. Fprobes and function graph do this slightly differently. Fprobes (and kretprobes) has slots per callsite that are reserved to save the return address. This is fine when just a few points are traced. But users of fprobes, such as BPF programs, are starting to add many more locations, and this method does not scale. The function graph tracer was created to trace all functions in the kernel. In order to do this, when function graph tracing is started, every task gets its own shadow stack to hold the return address that is going to be traced. The function graph tracer has been updated to allow multiple users to use its infrastructure. Now have fprobes be one of those users. This will also allow for the fprobe and kretprobe methods to trace the return address to become obsolete. With new technologies like CFI that need to know about these methods of hijacking the return address, going toward a solution that has only one method of doing this will make the kernel less complex. - Cleanup with guard() and free() helpers There were several places in the code that had a lot of "goto out" in the error paths to either unlock a lock or free some memory that was allocated. But this is error prone. Convert the code over to use the guard() and free() helpers that let the compiler unlock locks or free memory when the function exits. - Remove disabling of interrupts in the function graph tracer When function graph tracer was first introduced, it could race with interrupts and NMIs. To prevent that race, it would disable interrupts and not trace NMIs. But the code has changed to allow NMIs and also interrupts. This change was done a long time ago, but the disabling of interrupts was never removed. Remove the disabling of interrupts in the function graph tracer is it is not needed. This greatly improves its performance. - Allow the :mod: command to enable tracing module functions on the kernel command line. The function tracer already has a way to enable functions to be traced in modules by writing ":mod:<module>" into set_ftrace_filter. That will enable either all the functions for the module if it is loaded, or if it is not, it will cache that command, and when the module is loaded that matches <module>, its functions will be enabled. This also allows init functions to be traced. But currently events do not have that feature. Because enabling function tracing can be done very early at boot up (before scheduling is enabled), the commands that can be done when function tracing is started is limited. Having the ":mod:" command to trace module functions as they are loaded is very useful. Update the kernel command line function filtering to allow it. * tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits) ftrace: Implement :mod: cache filtering on kernel command line tracing: Adopt __free() and guard() for trace_fprobe.c bpf: Use ftrace_get_symaddr() for kprobe_multi probes ftrace: Add ftrace_get_symaddr to convert fentry_ip to symaddr Documentation: probes: Update fprobe on function-graph tracer selftests/ftrace: Add a test case for repeating register/unregister fprobe selftests: ftrace: Remove obsolate maxactive syntax check tracing/fprobe: Remove nr_maxactive from fprobe fprobe: Add fprobe_header encoding feature fprobe: Rewrite fprobe on function-graph tracer s390/tracing: Enable HAVE_FTRACE_GRAPH_FUNC ftrace: Add CONFIG_HAVE_FTRACE_GRAPH_FUNC bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGS tracing: Add ftrace_fill_perf_regs() for perf event tracing: Add ftrace_partial_regs() for converting ftrace_regs to pt_regs fprobe: Use ftrace_regs in fprobe exit handler fprobe: Use ftrace_regs in fprobe entry handler fgraph: Pass ftrace_regs to retfunc fgraph: Replace fgraph_ret_regs with ftrace_regs ...
2025-01-22Merge tag 'irq-core-2025-01-21' of ↵Linus Torvalds5-14/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt subsystem updates from Thomas Gleixner: - Consolidate the machine_kexec_mask_interrupts() by providing a generic implementation and replacing the copy & pasta orgy in the relevant architectures. - Prevent unconditional operations on interrupt chips during kexec shutdown, which can trigger warnings in certain cases when the underlying interrupt has been shut down before. - Make the enforcement of interrupt handling in interrupt context unconditionally available, so that it actually works for non x86 related interrupt chips. The earlier enablement for ARM GIC chips set the required chip flag, but did not notice that the check was hidden behind a config switch which is not selected by ARM[64]. - Decrapify the handling of deferred interrupt affinity setting. Some interrupt chips require that affinity changes are made from the context of handling an interrupt to avoid certain race conditions. For x86 this was the default, but with interrupt remapping this requirement was lifted and a flag was introduced which tells the core code that affinity changes can be done in any context. Unrestricted affinity changes are the default for the majority of interrupt chips. RISCV has the requirement to add the deferred mode to one of it's interrupt controllers, but with the original implementation this would require to add the any context flag to all other RISC-V interrupt chips. That's backwards, so reverse the logic and require that chips, which need the deferred mode have to be marked accordingly. That avoids chasing the 'sane' chips and marking them. - Add multi-node support to the Loongarch AVEC interrupt controller driver. - The usual tiny cleanups, fixes and improvements all over the place. * tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set() genirq/timings: Add kernel-doc for a function parameter genirq: Remove IRQ_MOVE_PCNTXT and related code x86/apic: Convert to IRQCHIP_MOVE_DEFERRED genirq: Provide IRQCHIP_MOVE_DEFERRED hexagon: Remove GENERIC_PENDING_IRQ leftover ARC: Remove GENERIC_PENDING_IRQ genirq: Remove handle_enforce_irqctx() wrapper genirq: Make handle_enforce_irqctx() unconditionally available irqchip/loongarch-avec: Add multi-nodes topology support irqchip/ts4800: Replace seq_printf() by seq_puts() irqchip/ti-sci-inta : Add module build support irqchip/ti-sci-intr: Add module build support irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function irqchip: keystone: Use syscon_regmap_lookup_by_phandle_args genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown kexec: Consolidate machine_kexec_mask_interrupts() implementation genirq: Reuse irq_thread_fn() for forced thread case genirq: Move irq_thread_fn() further up in the code
2025-01-21Merge tag 'sched-core-2025-01-21' of ↵Linus Torvalds3-67/+37
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Fair scheduler (SCHED_FAIR) enhancements: - Behavioral improvements: - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra) - Delayed-dequeue enhancements & fixes: (Vincent Guittot) - Rename h_nr_running into h_nr_queued - Add new cfs_rq.h_nr_runnable - Use the new cfs_rq.h_nr_runnable - Removed unsued cfs_rq.h_nr_delayed - Rename cfs_rq.idle_h_nr_running into h_nr_idle - Remove unused cfs_rq.idle_nr_running - Rename cfs_rq.nr_running into nr_queued - Do not try to migrate delayed dequeue task - Fix variable declaration position - Encapsulate set custom slice in a __setparam_fair() function - Fixes: - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding) - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia) - Cleanups: - Clean up in migrate_degrades_locality() to improve readability (Peter Zijlstra) - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko) - Update comments after sched_tick() rename (Sebastian Andrzej Siewior) - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used() (Valentin Schneider) Deadline scheduler (SCHED_DL) enhancements: - Restore dl_server bandwidth on non-destructive root domain changes (Juri Lelli) - Correctly account for allocated bandwidth during hotplug (Juri Lelli) - Check bandwidth overflow earlier for hotplug (Juri Lelli) - Clean up goto label in pick_earliest_pushable_dl_task() (John Stultz) - Consolidate timer cancellation (Wander Lairson Costa) Load-balancer enhancements: - Improve performance by prioritizing migrating eligible tasks in sched_balance_rq() (Hao Jia) - Do not compute NUMA Balancing stats unnecessarily during load-balancing (K Prateek Nayak) - Do not compute overloaded status unnecessarily during load-balancing (K Prateek Nayak) Generic scheduling code enhancements: - Use READ_ONCE() in task_on_rq_queued(), to consistently use the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal) Isolated CPUs support enhancements: (Waiman Long) - Make "isolcpus=nohz" equivalent to "nohz_full" - Consolidate housekeeping cpumasks that are always identical - Remove HK_TYPE_SCHED - Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE RSEQ enhancements: - Validate read-only fields under DEBUG_RSEQ config (Mathieu Desnoyers) PSI enhancements: - Fix race when task wakes up before psi_sched_switch() adjusts flags (Chengming Zhou) IRQ time accounting performance enhancements: (Yafang Shao) - Define sched_clock_irqtime as static key - Don't account irq time if sched_clock_irqtime is disabled Virtual machine scheduling enhancements: - Don't try to catch up excess steal time (Suleiman Souhlal) Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak) - Convert "sysctl_sched_itmt_enabled" to boolean - Use guard() for itmt_update_mutex - Move the "sched_itmt_enabled" sysctl to debugfs - Remove x86_smt_flags and use cpu_smt_flags directly - Use x86_sched_itmt_flags for PKG domain unconditionally Debugging code & instrumentation enhancements: - Change need_resched warnings to pr_err() (David Rientjes) - Print domain name in /proc/schedstat (K Prateek Nayak) - Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra) - Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal) - Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal) - Update Schedstat version to 17 (Swapnil Sapkal)" * tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits) rseq: Fix rseq unregistration regression psi: Fix race when task wakes up before psi_sched_switch() adjusts flags sched, psi: Don't account irq time if sched_clock_irqtime is disabled sched: Don't account irq time if sched_clock_irqtime is disabled sched: Define sched_clock_irqtime as static key sched/fair: Do not compute overloaded status unnecessarily during lb sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs x86/itmt: Use guard() for itmt_update_mutex x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean sched/core: Prioritize migrating eligible tasks in sched_balance_rq() sched/debug: Change need_resched warnings to pr_err sched/fair: Encapsulate set custom slice in a __setparam_fair() function sched: Fix race between yield_to() and try_to_wake_up() docs: Update Schedstat version to 17 sched/stats: Print domain name in /proc/schedstat sched: Move sched domain name out of CONFIG_SCHED_DEBUG sched: Report the different kinds of imbalances in /proc/schedstat ...
2025-01-21Merge tag 'x86-cleanups-2025-01-21' of ↵Linus Torvalds9-104/+50
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Miscellaneous x86 cleanups and typo fixes, and also the removal of the 'disablelapic' boot parameter" * tag 'x86-cleanups-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ioapic: Remove a stray tab in the IO-APIC type string x86/cpufeatures: Remove "AMD" from the comments to the AMD-specific leaf Documentation/kernel-parameters: Fix a typo in kvm.enable_virt_at_load text x86/cpu: Fix typo in x86_match_cpu()'s doc x86/apic: Remove "disablelapic" cmdline option Documentation: Merge x86-specific boot options doc into kernel-parameters.txt x86/ioremap: Remove unused size parameter in remapping functions x86/ioremap: Simplify setup_data mapping variants x86/boot/compressed: Remove unused header includes from kaslr.c
2025-01-21Merge tag 'perf-core-2025-01-20' of ↵Linus Torvalds16-291/+508
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Seqlock optimizations that arose in a perf context and were merged into the perf tree: - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan) - mm: Convert mm_lock_seq to a proper seqcount (Suren Baghdasaryan) - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren Baghdasaryan) - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra) Core perf enhancements: - Reduce 'struct page' footprint of perf by mapping pages in advance (Lorenzo Stoakes) - Save raw sample data conditionally based on sample type (Yabin Cui) - Reduce sampling overhead by checking sample_type in perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin Cui) - Export perf_exclude_event() (Namhyung Kim) Uprobes scalability enhancements: (Andrii Nakryiko) - Simplify find_active_uprobe_rcu() VMA checks - Add speculative lockless VMA-to-inode-to-uprobe resolution - Simplify session consumer tracking - Decouple return_instance list traversal and freeing - Ensure return_instance is detached from the list before freeing - Reuse return_instances between multiple uretprobes within task - Guard against kmemdup() failing in dup_return_instance() AMD core PMU driver enhancements: - Relax privilege filter restriction on AMD IBS (Namhyung Kim) AMD RAPL energy counters support: (Dhananjay Ugwekar) - Introduce topology_logical_core_id() (K Prateek Nayak) - Remove the unused get_rapl_pmu_cpumask() function - Remove the cpu_to_rapl_pmu() function - Rename rapl_pmu variables - Make rapl_model struct global - Add arguments to the init and cleanup functions - Modify the generic variable names to *_pkg* - Remove the global variable rapl_msrs - Move the cntr_mask to rapl_pmus struct - Add core energy counter support for AMD CPUs Intel core PMU driver enhancements: - Support RDPMC 'metrics clear mode' feature (Kan Liang) - Clarify adaptive PEBS processing (Kan Liang) - Factor out functions for PEBS records processing (Kan Liang) - Simplify the PEBS records processing for adaptive PEBS (Kan Liang) Intel uncore driver enhancements: (Kan Liang) - Convert buggy pmu->func_id use to pmu->registered - Support more units on Granite Rapids" * tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) perf: map pages in advance perf/x86/intel/uncore: Support more units on Granite Rapids perf/x86/intel/uncore: Clean up func_id perf/x86/intel: Support RDPMC metrics clear mode uprobes: Guard against kmemdup() failing in dup_return_instance() perf/x86: Relax privilege filter restriction on AMD IBS perf/core: Export perf_exclude_event() uprobes: Reuse return_instances between multiple uretprobes within task uprobes: Ensure return_instance is detached from the list before freeing uprobes: Decouple return_instance list traversal and freeing uprobes: Simplify session consumer tracking uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution uprobes: simplify find_active_uprobe_rcu() VMA checks mm: introduce mmap_lock_speculate_{try_begin|retry} mm: convert mm_lock_seq to a proper seqcount mm/gup: Use raw_seqcount_try_begin() seqlock: add raw_seqcount_try_begin perf/x86/rapl: Add core energy counter support for AMD CPUs perf/x86/rapl: Move the cntr_mask to rapl_pmus struct perf/x86/rapl: Remove the global variable rapl_msrs ...
2025-01-21Merge tag 'objtool-core-2025-01-20' of ↵Linus Torvalds9-51/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - Introduce the generic section-based annotation infrastructure a.k.a. ASM_ANNOTATE/ANNOTATE (Peter Zijlstra) - Convert various facilities to ASM_ANNOTATE/ANNOTATE: (Peter Zijlstra) - ANNOTATE_NOENDBR - ANNOTATE_RETPOLINE_SAFE - instrumentation_{begin,end}() - VALIDATE_UNRET_BEGIN - ANNOTATE_IGNORE_ALTERNATIVE - ANNOTATE_INTRA_FUNCTION_CALL - {.UN}REACHABLE - Optimize the annotation-sections parsing code (Peter Zijlstra) - Centralize annotation definitions in <linux/objtool.h> - Unify & simplify the barrier_before_unreachable()/unreachable() definitions (Peter Zijlstra) - Convert unreachable() calls to BUG() in x86 code, as unreachable() has unreliable code generation (Peter Zijlstra) - Remove annotate_reachable() and annotate_unreachable(), as it's unreliable against compiler optimizations (Peter Zijlstra) - Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra) - Robustify the annotation code by warning about unknown annotation types (Peter Zijlstra) - Allow arch code to discover jump table size, in preparation of annotated jump table support (Ard Biesheuvel) * tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Convert unreachable() to BUG() objtool: Allow arch code to discover jump table size objtool: Warn about unknown annotation types objtool: Fix ANNOTATE_REACHABLE to be a normal annotation objtool: Convert {.UN}REACHABLE to ANNOTATE objtool: Remove annotate_{,un}reachable() loongarch: Use ASM_REACHABLE x86: Convert unreachable() to BUG() unreachable: Unify objtool: Collect more annotations in objtool.h objtool: Collapse annotate sequences objtool: Convert ANNOTATE_INTRA_FUNCTION_CALL to ANNOTATE objtool: Convert ANNOTATE_IGNORE_ALTERNATIVE to ANNOTATE objtool: Convert VALIDATE_UNRET_BEGIN to ANNOTATE objtool: Convert instrumentation_{begin,end}() to ANNOTATE objtool: Convert ANNOTATE_RETPOLINE_SAFE to ANNOTATE objtool: Convert ANNOTATE_NOENDBR to ANNOTATE objtool: Generic annotation infrastructure
2025-01-21Merge tag 'x86_misc_for_v6.14_rc1' of ↵Linus Torvalds13-314/+293
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 updates from Borislav Petkov: - The first part of a restructuring of AMD's representation of a northbridge which is legacy now, and the creation of the new AMD node concept which represents the Zen architecture of having a collection of I/O devices within an SoC. Those nodes comprise the so-called data fabric on Zen. This has at least one practical advantage of not having to add a PCI ID each time a new data fabric PCI device releases. Eventually, the lot more uniform provider of data fabric functionality amd_node.c will be used by all the drivers which need it - Smaller cleanups * tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/amd_node: Use defines for SMN register offsets x86/amd_node: Remove dependency on AMD_NB x86/amd_node: Update __amd_smn_rw() error paths x86/amd_nb: Move SMN access code to a new amd_node driver x86/amd_nb, hwmon: (k10temp): Simplify amd_pci_dev_to_node_id() x86/amd_nb: Simplify function 3 search x86/amd_nb: Use topology info to get AMD node count x86/amd_nb: Simplify root device search x86/amd_nb: Simplify function 4 search x86: Start moving AMD node functionality out of AMD_NB x86/amd_nb: Clean up early_is_amd_nb() x86/amd_nb: Restrict init function to AMD-based systems x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
2025-01-21Merge tag 'x86_cpu_for_v6.14_rc1' of ↵Linus Torvalds26-253/+151
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: - Remove the less generic CPU matching infra around struct x86_cpu_desc and use the generic struct x86_cpu_id thing - Remove magic naked numbers for CPUID functions and use proper defines of the prefix CPUID_LEAF_*. Consolidate some of the crazy use around the tree - Smaller cleanups and improvements * tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Make all all CPUID leaf names consistent x86/fpu: Remove unnecessary CPUID level check x86/fpu: Move CPUID leaf definitions to common code x86/tsc: Remove CPUID "frequency" leaf magic numbers. x86/tsc: Move away from TSC leaf magic numbers x86/cpu: Move TSC CPUID leaf definition x86/cpu: Refresh DCA leaf reading code x86/cpu: Remove unnecessary MwAIT leaf checks x86/cpu: Use MWAIT leaf definition x86/cpu: Move MWAIT leaf definition to common header x86/cpu: Remove 'x86_cpu_desc' infrastructure x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id' x86/cpu: Replace PEBS use of 'x86_cpu_desc' use with 'x86_cpu_id' x86/cpu: Expose only stepping min/max interface x86/cpu: Introduce new microcode matching helper x86/cpufeature: Document cpu_feature_enabled() as the default to use x86/paravirt: Remove the WBINVD callback x86/cpufeatures: Free up unused feature bits
2025-01-21Merge tag 'x86_sev_for_v6.14_rc1' of ↵Linus Torvalds15-123/+1292
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - A segmented Reverse Map table (RMP) is a across-nodes distributed table of sorts which contains per-node descriptors of each node-local 4K page, denoting its ownership (hypervisor, guest, etc) in the realm of confidential computing. Add support for such a table in order to improve referential locality when accessing or modifying RMP table entries - Add support for reading the TSC in SNP guests by removing any interference or influence the hypervisor might have, with the goal of making a confidential guest even more independent from the hypervisor * tag 'x86_sev_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev: Add the Secure TSC feature for SNP guests x86/tsc: Init the TSC for Secure TSC guests x86/sev: Mark the TSC in a secure TSC guest as reliable x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled guests x86/sev: Prevent GUEST_TSC_FREQ MSR interception for Secure TSC enabled guests x86/sev: Change TSC MSR behavior for Secure TSC enabled guests x86/sev: Add Secure TSC support for SNP guests x86/sev: Relocate SNP guest messaging routines to common code x86/sev: Carve out and export SNP guest messaging init routines virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL virt: sev-guest: Remove is_vmpck_empty() helper x86/sev/docs: Document the SNP Reverse Map Table (RMP) x86/sev: Add full support for a segmented RMP table x86/sev: Treat the contiguous RMP table as a single RMP segment x86/sev: Map only the RMP table entries instead of the full RMP range x86/sev: Move the SNP probe routine out of the way x86/sev: Require the RMPREAD instruction after Zen4 x86/sev: Add support for the RMPREAD instruction x86/sev: Prepare for using the RMPREAD instruction to access the RMP
2025-01-21Merge tag 'x86_microcode_for_v6.14_rc1' of ↵Linus Torvalds1-26/+28
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 microcode loader updates from Borislav Petkov: - A bunch of minor cleanups * tag 'x86_microcode_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/microcode/AMD: Remove ret local var in early_apply_microcode() x86/microcode/AMD: Have __apply_microcode_amd() return bool x86/microcode/AMD: Make __verify_patch_size() return bool x86/microcode/AMD: Remove bogus comment from parse_container() x86/microcode/AMD: Return bool from find_blobs_in_containers()
2025-01-21Merge tag 'x86_cache_for_v6.14_rc1' of ↵Linus Torvalds6-87/+168
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 resource control updates from Borislav Petkov: - Extend resctrl with the capability of total memory bandwidth monitoring, thus accomodating systems which support only total but not local memory bandwidth monitoring. Add the respective new mount options - The usual cleanups * tag 'x86_cache_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/resctrl: Document the new "mba_MBps_event" file x86/resctrl: Add write option to "mba_MBps_event" file x86/resctrl: Add "mba_MBps_event" file to CTRL_MON directories x86/resctrl: Make mba_sc use total bandwidth if local is not supported x86/resctrl: Compute memory bandwidth for all supported events x86/resctrl: Modify update_mba_bw() to use per CTRL_MON group event x86/resctrl: Prepare for per-CTRL_MON group mba_MBps control x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags x86/resctrl: Use kthread_run_on_cpu()
2025-01-21Merge tag 'x86_bugs_for_v6.14_rc1' of ↵Linus Torvalds4-1/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 CPU speculation update from Borislav Petkov: - Add support for AMD hardware which is not affected by SRSO on the user/kernel attack vector and advertise it to guest userspace * tag 'x86_bugs_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM: x86: Advertise SRSO_USER_KERNEL_NO to userspace x86/bugs: Add SRSO_USER_KERNEL_NO support
2025-01-21Merge tag 'ras_core_for_v6.14_rc1' of ↵Linus Torvalds9-261/+168
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Borislav Petkov: - Remove the shared threshold bank hack on AMD and streamline and simplify it - Cleanup and sanitize MCA code * tag 'ras_core_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce/amd: Remove shared threshold bank plumbing x86/mce: Remove the redundant mce_hygon_feature_init() x86/mce: Convert family/model mixed checks to VFM-based checks x86/mce: Break up __mcheck_cpu_apply_quirks() x86/mce: Make four functions return bool x86/mce/threshold: Remove the redundant this_cpu_dec_return() x86/mce: Make several functions return bool
2025-01-20x86: use cmov for user address maskingLinus Torvalds2-9/+8
This was a suggestion by David Laight, and while I was slightly worried that some micro-architecture would predict cmov like a conditional branch, there is little reason to actually believe any core would be that broken. Intel documents that their existing cores treat CMOVcc as a data dependency that will constrain speculation in their "Speculative Execution Side Channel Mitigations" whitepaper: "Other instructions such as CMOVcc, AND, ADC, SBB and SETcc can also be used to prevent bounds check bypass by constraining speculative execution on current family 6 processors (Intel® Core™, Intel® Atom™, Intel® Xeon® and Intel® Xeon Phi™ processors)" and while that leaves the future uarch issues open, that's certainly true of our traditional SBB usage too. Any core that predicts CMOV will be unusable for various crypto algorithms that need data-independent timing stability, so let's just treat CMOV as the safe choice that simplifies the address masking by avoiding an extra instruction and doesn't need a temporary register. Suggested-by: David Laight <David.Laight@aculab.com> Link: https://www.intel.com/content/dam/develop/external/us/en/documents/336996-speculative-execution-side-channel-mitigations.pdf Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-20x86: use proper 'clac' and 'stac' opcode namesLinus Torvalds1-11/+7
Back when we added SMAP support, all versions of binutils didn't necessarily understand the 'clac' and 'stac' instructions. So we implemented those instructions manually as ".byte" sequences. But we've since upgraded the minimum version of binutils to version 2.25, and that included proper support for the SMAP instructions, and there's no reason for us to use some line noise to express them any more. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-20Merge branch 'kvm-mirror-page-tables' into HEADPaolo Bonzini12-113/+505
As part of enabling TDX virtual machines, support support separation of private/shared EPT into separate roots. Confidential computing solutions almost invariably have concepts of private and shared memory, but they may different a lot in the details. In SEV, for example, the bit is handled more like a permission bit as far as the page tables are concerned: the private/shared bit is not included in the physical address. For TDX, instead, the bit is more like a physical address bit, with the host mapping private memory in one half of the address space and shared in another. Furthermore, the two halves are mapped by different EPT roots and only the shared half is managed by KVM; the private half (also called Secure EPT in Intel documentation) gets managed by the privileged TDX Module via SEAMCALLs. As a result, the operations that actually change the private half of the EPT are limited and relatively slow compared to reading a PTE. For this reason the design for KVM is to keep a mirror of the private EPT in host memory. This allows KVM to quickly walk the EPT and only perform the slower private EPT operations when it needs to actually modify mid-level private PTEs. There are thus three sets of EPT page tables: external, mirror and direct. In the case of TDX (the only user of this framework) the first two cover private memory, whereas the third manages shared memory: external EPT - Hidden within the TDX module, modified via TDX module calls. mirror EPT - Bookkeeping tree used as an optimization by KVM, not used by the processor. direct EPT - Normal EPT that maps unencrypted shared memory. Managed like the EPT of a normal VM. Modifying external EPT ---------------------- Modifications to the mirrored page tables need to also perform the same operations to the private page tables, which will be handled via kvm_x86_ops. Although this prep series does not interact with the TDX module at all to actually configure the private EPT, it does lay the ground work for doing this. In some ways updating the private EPT is as simple as plumbing PTE modifications through to also call into the TDX module; however, the locking is more complicated because inserting a single PTE cannot anymore be done atomically with a single CMPXCHG. For this reason, the existing FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides protecting operation of KVM, it limits the set of cases in which the TDX module will encounter contention on its own PTE locks. Zapping external EPT -------------------- While the framework tries to be relatively generic, and to be understandable without knowing TDX much in detail, some requirements of TDX sometimes leak; for example the private page tables also cannot be zapped while the range has anything mapped, so the mirrored/private page tables need to be protected from KVM operations that zap any non-leaf PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast(). For normal VMs, guest memory is zapped for several reasons: user memory getting paged out by the guest, memslots getting deleted, passthrough of devices with non-coherent DMA. Confidential computing adds to these the conversion of memory between shared and privates. These operations must not zap any private memory that is in use by the guest. This is possible because the only zapping that is out of the control of KVM/userspace is paging out userspace memory, which cannot apply to guestmemfd operations. Thus a TDX VM will only zap private memory from memslot deletion and from conversion between private and shared memory which is triggered by the guest. To avoid zapping too much memory, enums are introduced so that operations can choose to target only private or shared memory, and thus only direct or mirror EPT. For example: Memslot deletion - Private and shared MMU notifier based zapping - Shared only Conversion to shared - Private only Conversion to private - Shared only Other cases of zapping will not be supported for KVM, for example APICv update or non-coherent DMA status update; for the latter, TDX will simply require that the CPU supports self-snoop and honor guest PAT unconditionally for shared memory.
2025-01-20Merge branch 'kvm-userspace-hypercall' into HEADPaolo Bonzini4-41/+67
Make the completion of hypercalls go through the complete_hypercall function pointer argument, no matter if the hypercall exits to userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically went to userspace, and all the others did not; the new code need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at all whether there was an exit to userspace or not.
2025-01-20Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini28-677/+1038
KVM x86 misc changes for 6.14: - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities instead of just those where KVM needs to manage state and/or explicitly enable the feature in hardware. Along the way, refactor the code to make it easier to add features, and to make it more self-documenting how KVM is handling each feature. - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios (e.g. if emulation is triggered by the exit), and brings parity between VMX and SVM. - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively. - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU, due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU.
2025-01-20Merge tag 'kvm-x86-vmx-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini9-82/+120
KVM VMX changes for 6.14: - Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit while L2 is active, while ultimately results in a hardware-accelerated L1 EOI effectively being lost. - Honor event priority when emulating Posted Interrupt delivery during nested VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the interrupt. - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache prior to every VM-Enter. - Rework KVM's processing of the Page-Modification Logging buffer to reap entries in the same order they were created, i.e. to mark gfns dirty in the same order that hardware marked the page/PTE dirty. - Misc cleanups.
2025-01-20Merge tag 'kvm-x86-svm-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini3-20/+9
KVM SVM changes for 6.14: - Macrofy the SEV=n version of the sev_xxx_guest() helpers so that the code is optimized away when building with less than brilliant compilers. - Remove a now-redundant TLB flush when guest CR4.PGE changes. - Use str_enabled_disabled() to replace open coded strings.
2025-01-20Merge tag 'kvm-x86-mmu-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-0/+4
KVM x86 MMU changes for 6.14: - Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
2025-01-20Merge tag 'kvm-memslots-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-3/+4
KVM kvm_set_memory_region() cleanups and hardening for 6.14: - Add proper lockdep assertions when setting memory regions. - Add a dedicated API for setting KVM-internal memory regions. - Explicitly disallow all flags for KVM-internal memory regions.
2025-01-20Grab mm lock before grabbing pt lockMaksym Planeta1-0/+4
Function xen_pin_page calls xen_pte_lock, which in turn grab page table lock (ptlock). When locking, xen_pte_lock expect mm->page_table_lock to be held before grabbing ptlock, but this does not happen when pinning is caused by xen_mm_pin_all. This commit addresses lockdep warning below, which shows up when suspending a Xen VM. [ 3680.658422] Freezing user space processes [ 3680.660156] Freezing user space processes completed (elapsed 0.001 seconds) [ 3680.660182] OOM killer disabled. [ 3680.660192] Freezing remaining freezable tasks [ 3680.661485] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 3680.685254] [ 3680.685265] ================================== [ 3680.685269] WARNING: Nested lock was not taken [ 3680.685274] 6.12.0+ #16 Tainted: G W [ 3680.685279] ---------------------------------- [ 3680.685283] migration/0/19 is trying to lock: [ 3680.685288] ffff88800bac33c0 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: xen_pin_page+0x175/0x1d0 [ 3680.685303] [ 3680.685303] but this task is not holding: [ 3680.685308] init_mm.page_table_lock [ 3680.685311] [ 3680.685311] stack backtrace: [ 3680.685316] CPU: 0 UID: 0 PID: 19 Comm: migration/0 Tainted: G W 6.12.0+ #16 [ 3680.685324] Tainted: [W]=WARN [ 3680.685328] Stopper: multi_cpu_stop+0x0/0x120 <- __stop_cpus.constprop.0+0x8c/0xd0 [ 3680.685339] Call Trace: [ 3680.685344] <TASK> [ 3680.685347] dump_stack_lvl+0x77/0xb0 [ 3680.685356] __lock_acquire+0x917/0x2310 [ 3680.685364] lock_acquire+0xce/0x2c0 [ 3680.685369] ? xen_pin_page+0x175/0x1d0 [ 3680.685373] _raw_spin_lock_nest_lock+0x2f/0x70 [ 3680.685381] ? xen_pin_page+0x175/0x1d0 [ 3680.685386] xen_pin_page+0x175/0x1d0 [ 3680.685390] ? __pfx_xen_pin_page+0x10/0x10 [ 3680.685394] __xen_pgd_walk+0x233/0x2c0 [ 3680.685401] ? stop_one_cpu+0x91/0x100 [ 3680.685405] __xen_pgd_pin+0x5d/0x250 [ 3680.685410] xen_mm_pin_all+0x70/0xa0 [ 3680.685415] xen_pv_pre_suspend+0xf/0x280 [ 3680.685420] xen_suspend+0x57/0x1a0 [ 3680.685428] multi_cpu_stop+0x6b/0x120 [ 3680.685432] ? update_cpumasks_hier+0x7c/0xa60 [ 3680.685439] ? __pfx_multi_cpu_stop+0x10/0x10 [ 3680.685443] cpu_stopper_thread+0x8c/0x140 [ 3680.685448] ? smpboot_thread_fn+0x20/0x1f0 [ 3680.685454] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 3680.685458] smpboot_thread_fn+0xed/0x1f0 [ 3680.685462] kthread+0xde/0x110 [ 3680.685467] ? __pfx_kthread+0x10/0x10 [ 3680.685471] ret_from_fork+0x2f/0x50 [ 3680.685478] ? __pfx_kthread+0x10/0x10 [ 3680.685482] ret_from_fork_asm+0x1a/0x30 [ 3680.685489] </TASK> [ 3680.685491] [ 3680.685491] other info that might help us debug this: [ 3680.685497] 1 lock held by migration/0/19: [ 3680.685500] #0: ffffffff8284df38 (pgd_lock){+.+.}-{3:3}, at: xen_mm_pin_all+0x14/0xa0 [ 3680.685512] [ 3680.685512] stack backtrace: [ 3680.685518] CPU: 0 UID: 0 PID: 19 Comm: migration/0 Tainted: G W 6.12.0+ #16 [ 3680.685528] Tainted: [W]=WARN [ 3680.685531] Stopper: multi_cpu_stop+0x0/0x120 <- __stop_cpus.constprop.0+0x8c/0xd0 [ 3680.685538] Call Trace: [ 3680.685541] <TASK> [ 3680.685544] dump_stack_lvl+0x77/0xb0 [ 3680.685549] __lock_acquire+0x93c/0x2310 [ 3680.685554] lock_acquire+0xce/0x2c0 [ 3680.685558] ? xen_pin_page+0x175/0x1d0 [ 3680.685562] _raw_spin_lock_nest_lock+0x2f/0x70 [ 3680.685568] ? xen_pin_page+0x175/0x1d0 [ 3680.685572] xen_pin_page+0x175/0x1d0 [ 3680.685578] ? __pfx_xen_pin_page+0x10/0x10 [ 3680.685582] __xen_pgd_walk+0x233/0x2c0 [ 3680.685588] ? stop_one_cpu+0x91/0x100 [ 3680.685592] __xen_pgd_pin+0x5d/0x250 [ 3680.685596] xen_mm_pin_all+0x70/0xa0 [ 3680.685600] xen_pv_pre_suspend+0xf/0x280 [ 3680.685607] xen_suspend+0x57/0x1a0 [ 3680.685611] multi_cpu_stop+0x6b/0x120 [ 3680.685615] ? update_cpumasks_hier+0x7c/0xa60 [ 3680.685620] ? __pfx_multi_cpu_stop+0x10/0x10 [ 3680.685625] cpu_stopper_thread+0x8c/0x140 [ 3680.685629] ? smpboot_thread_fn+0x20/0x1f0 [ 3680.685634] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 3680.685638] smpboot_thread_fn+0xed/0x1f0 [ 3680.685642] kthread+0xde/0x110 [ 3680.685645] ? __pfx_kthread+0x10/0x10 [ 3680.685649] ret_from_fork+0x2f/0x50 [ 3680.685654] ? __pfx_kthread+0x10/0x10 [ 3680.685657] ret_from_fork_asm+0x1a/0x30 [ 3680.685662] </TASK> [ 3680.685267] xen:grant_table: Grant tables using version 1 layout [ 3680.685921] OOM killer enabled. [ 3680.685934] Restarting tasks ... done. Signed-off-by: Maksym Planeta <maksym@exostellar.io> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20241204103516.3309112-1-maksym@exostellar.io> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-01-19Merge tag 'x86_urgent_for_v6.13' of ↵Linus Torvalds3-3/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Mark serialize() noinstr so that it can be used from instrumentation- free code - Make sure FRED's RSP0 MSR is synchronized with its corresponding per-CPU value in order to avoid double faults in hotplug scenarios - Disable EXECMEM_ROX on x86 for now because it didn't receive proper x86 maintainers review, went in and broke a bunch of things * tag 'x86_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm: Make serialize() always_inline x86/fred: Fix the FRED RSP0 MSR out of sync with its per-CPU cache x86: Disable EXECMEM_ROX support
2025-01-16x86/asm: Make serialize() always_inlineJuergen Gross1-1/+1
In order to allow serialize() to be used from noinstr code, make it __always_inline. Fixes: 0ef8047b737d ("x86/static-call: provide a way to do very early static-call updates") Closes: https://lore.kernel.org/oe-kbuild-all/202412181756.aJvzih2K-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241218100918.22167-1-jgross@suse.com
2025-01-15genirq: Remove IRQ_MOVE_PCNTXT and related codeThomas Gleixner1-1/+0
Now that x86 is converted over to use the IRQCHIP_MOVE_DEFERRED flags, remove IRQ*_MOVE_PCNTXT and related code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241210103335.626707225@linutronix.de
2025-01-15x86/apic: Convert to IRQCHIP_MOVE_DEFERREDThomas Gleixner6-14/+5
Instead of marking individual interrupts as safe to be migrated in arbitrary contexts, mark the interrupt chips, which require the interrupt to be moved in actual interrupt context, with the new IRQCHIP_MOVE_DEFERRED flag. This makes more sense because this is a per interrupt chip property and not restricted to individual interrupts. That flips the logic from the historical opt-out to a opt-in model. This is simpler to handle for other architectures, which default to unrestricted affinity setting. It also allows to cleanup the redundant core logic significantly. All interrupt chips, which belong to a top-level domain sitting directly on top of the x86 vector domain are marked accordingly, unless the related setup code marks the interrupts with IRQ_MOVE_PCNTXT, i.e. XEN. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Acked-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/all/20241210103335.563277044@linutronix.de
2025-01-15KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()Yan Zhao2-4/+18
Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID) instead of 1 in kvm_mmu_page_fault(). The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e., npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(), kvm_handle_page_fault()). They either check if the return value is > 0 (as in npf_interception()) or pass it further to vcpu_run() to decide whether to break out of the kernel loop and return to the user when r <= 0. Therefore, returning any positive value is equivalent to returning 1. Warn if r == RET_PF_CONTINUE (which should not be a valid value) to ensure a positive return value. This is a preparation to allow TDX's EPT violation handler to check the RET_PF* value and retry internally for RET_PF_RETRY. No functional changes are intended. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250113021138.18875-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-01-15platform/x86: intel: punit_ipc: Remove unused functionDr. David Alan Gilbert1-7/+0
intel_punit_ipc_simple_command() was added in 2015 by commit fdca4f16f57d ("platform:x86: add Intel P-Unit mailbox IPC driver") but never used. Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://lore.kernel.org/r/20241221234807.160473-1-linux@treblig.org Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-01-15PCI: Avoid putting some root ports into D3 on TUXEDO Sirius Gen1Werner Sembach1-0/+30
commit 9d26d3a8f1b0 ("PCI: Put PCIe ports into D3 during suspend") sets the policy that all PCIe ports are allowed to use D3. When the system is suspended if the port is not power manageable by the platform and won't be used for wakeup via a PME this sets up the policy for these ports to go into D3hot. This policy generally makes sense from an OSPM perspective but it leads to problems with wakeup from suspend on the TUXEDO Sirius 16 Gen 1 with a specific old BIOS. This manifests as a system hang. On the affected Device + BIOS combination, add a quirk for the root port of the problematic controller to ensure that these root ports are not put into D3hot at suspend. This patch is based on https://lore.kernel.org/linux-pci/20230708214457.1229-2-mario.limonciello@amd.com but with the added condition both in the documentation and in the code to apply only to the TUXEDO Sirius 16 Gen 1 with a specific old BIOS and only the affected root ports. Fixes: 9d26d3a8f1b0 ("PCI: Put PCIe ports into D3 during suspend") Suggested-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20250114222436.1075456-1-wse@tuxedocomputers.com Co-developed-by: Georg Gottleuber <ggo@tuxedocomputers.com> Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com> Signed-off-by: Werner Sembach <wse@tuxedocomputers.com> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Cc: <stable@vger.kernel.org> # 6.1+
2025-01-15KVM: x86: Drop double-underscores from __kvm_set_memory_region()Sean Christopherson1-1/+1
Now that there's no outer wrapper for __kvm_set_memory_region() and it's static, drop its double-underscore prefix. No functional change intended. Cc: Tao Su <tao1.su@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-15KVM: Add a dedicated API for setting KVM-internal memslotsSean Christopherson1-1/+1
Add a dedicated API for setting internal memslots, and have it explicitly disallow setting userspace memslots. Setting a userspace memslots without a direct command from userspace would result in all manner of issues. No functional change intended. Cc: Tao Su <tao1.su@linux.intel.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-15KVM: Assert slots_lock is held when setting memory regionsSean Christopherson1-1/+2
Add proper lockdep assertions in __kvm_set_memory_region() and __x86_set_memory_region() instead of relying comments. Opportunistically delete __kvm_set_memory_region()'s entire function comment as the API doesn't allocate memory or select a gfn, and the "mostly for framebuffers" comment hasn't been true for a very long time. Cc: Tao Su <tao1.su@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-15x86/fred: Fix the FRED RSP0 MSR out of sync with its per-CPU cacheXin Li (Intel)1-1/+7
The FRED RSP0 MSR is only used for delivering events when running userspace. Linux leverages this property to reduce expensive MSR writes and optimize context switches. The kernel only writes the MSR when about to run userspace *and* when the MSR has actually changed since the last time userspace ran. This optimization is implemented by maintaining a per-CPU cache of FRED RSP0 and then checking that against the value for the top of current task stack before running userspace. However cpu_init_fred_exceptions() writes the MSR without updating the per-CPU cache. This means that the kernel might return to userspace with MSR_IA32_FRED_RSP0==0 when it needed to point to the top of current task stack. This would induce a double fault (#DF), which is bad. A context switch after cpu_init_fred_exceptions() can paper over the issue since it updates the cached value. That evidently happens most of the time explaining how this bug got through. Fix the bug through resynchronizing the FRED RSP0 MSR with its per-CPU cache in cpu_init_fred_exceptions(). Fixes: fe85ee391966 ("x86/entry: Set FRED RSP0 on return to userspace instead of context switch") Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250110174639.1250829-1-xin%40zytor.com
2025-01-14x86/sev: Disable ftrace branch profiling in SEV startup codeArd Biesheuvel1-0/+2
Ftrace branch profiling inserts absolute references to its metadata at call sites, and this implies that this kind of instrumentation cannot be used while executing from the 1:1 mapping of memory. Therefore, disable ftrace branch profiling in the SEV startup routines, by disabling it for the entire SEV core source file. Closes: https://lore.kernel.org/oe-kbuild-all/202501072244.zZrx9864-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250107151826.820147-2-ardb+git@google.com
2025-01-14x86/kexec: Use typedef for relocate_kernel_fn function prototypeDavid Woodhouse3-26/+16
Both i386 and x86_64 now copy the relocate_kernel() function into the control page and execute it from there, using an open-coded function pointer. Use a typedef for it instead. [ bp: Put relocate_kernel_ptr ptr arithmetic on a single line. ] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250109140757.2841269-10-dwmw2@infradead.org
2025-01-14x86/kexec: Cope with relocate_kernel() not being at the start of the pageDavid Woodhouse1-7/+11
A few places in the kexec control code page make the assumption that the first instruction of relocate_kernel is at the very start of the page. To allow for Clang CFI information to be added to relocate_kernel(), as well as the general principle of removing unwarranted assumptions, fix them to use the external __relocate_kernel_start symbol that the linker adds. This means using a separate addq and subq for calculating offsets, as the assembler can no longer calculate the delta directly for itself and relocations aren't that versatile. But those values can at least be used relative to a local label to avoid absolute relocations. Turn the jump from relocate_kernel() to identity_mapped() into a real indirect 'jmp *%rsi' too, while touching it. There was no real reason for it to be a push+ret in the first place, and adding Clang CFI info will also give objtool enough visibility to start complaining 'return with modified stack frame' about it. [ bp: Massage commit message. ] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250109140757.2841269-9-dwmw2@infradead.org
2025-01-14x86/kexec: Mark machine_kexec() with __nocfiDavid Woodhouse1-1/+1
A recent commit caused the relocate_kernel() function to be invoked through a function pointer, but it does not have CFI information. The resulting trap occurs after the IDT and GDT have been invalidated, leading to a triple-fault if CONFIG_CFI_CLANG is enabled. Using SYM_TYPED_FUNC_START() to provide the CFI information looks like it will require a prolonged battle with objtool. And is fairly pointless anyway, as the actual signature comes from a __kcfi_typeid_… symbol emitted from the C code based on the function prototype it thinks that relocate_kernel has, rendering the check somewhat tautological. The simple fix is just to mark machine_kexec() with __nocfi. Fixes: eeebbde57113 ("x86/kexec: Invoke copy of relocate_kernel() instead of the original") Reported-by: Nathan Chancellor <nathan@kernel.org> Suggested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250109140757.2841269-7-dwmw2@infradead.org