summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)AuthorFilesLines
2024-01-01x86/smpboot/64: Handle X2APIC BIOS inconsistency gracefullyThomas Gleixner1-0/+16
commit 69a7386c1ec25476a0c78ffeb59de08a2a08f495 upstream. Chris reported that a Dell PowerEdge T340 system stopped to boot when upgrading to a kernel which contains the parallel hotplug changes. Disabling parallel hotplug on the kernel command line makes it boot again. It turns out that the Dell BIOS has x2APIC enabled and the boot CPU comes up in X2APIC mode, but the APs come up inconsistently in xAPIC mode. Parallel hotplug requires that the upcoming CPU reads out its APIC ID from the local APIC in order to map it to the Linux CPU number. In this particular case the readout on the APs uses the MMIO mapped registers because the BIOS failed to enable x2APIC mode. That readout results in a page fault because the kernel does not have the APIC MMIO space mapped when X2APIC mode was enabled by the BIOS on the boot CPU and the kernel switched to X2APIC mode early. That page fault can't be handled on the upcoming CPU that early and results in a silent boot failure. If parallel hotplug is disabled the system boots because in that case the APIC ID read is not required as the Linux CPU number is provided to the AP in the smpboot control word. When the kernel uses x2APIC mode then the APs are switched to x2APIC mode too slightly later in the bringup process, but there is no reason to do it that late. Cure the BIOS bogosity by checking in the parallel bootup path whether the kernel uses x2APIC mode and if so switching over the APs to x2APIC mode before the APIC ID readout. Fixes: 0c7ffa32dbd6 ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it") Reported-by: Chris Lindee <chris.lindee@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ashok Raj <ashok.raj@intel.com> Tested-by: Chris Lindee <chris.lindee@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/CA%2B2tU59853R49EaU_tyvOZuOTDdcU0RshGyydccp9R1NX9bEeQ@mail.gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-01x86/alternatives: Disable interrupts and sync when optimizing NOPs in placeThomas Gleixner1-1/+11
commit 2dc4196138055eb0340231aecac4d78c2ec2bea5 upstream. apply_alternatives() treats alternatives with the ALT_FLAG_NOT flag set special as it optimizes the existing NOPs in place. Unfortunately, this happens with interrupts enabled and does not provide any form of core synchronization. So an interrupt hitting in the middle of the update and using the affected code path will observe a half updated NOP and crash and burn. The following 3 NOP sequence was observed to expose this crash halfway reliably under QEMU 32bit: 0x90 0x90 0x90 which is replaced by the optimized 3 byte NOP: 0x8d 0x76 0x00 So an interrupt can observe: 1) 0x90 0x90 0x90 nop nop nop 2) 0x8d 0x90 0x90 undefined 3) 0x8d 0x76 0x90 lea -0x70(%esi),%esi 4) 0x8d 0x76 0x00 lea 0x0(%esi),%esi Where only #1 and #4 are true NOPs. The same problem exists for 64bit obviously. Disable interrupts around this NOP optimization and invoke sync_core() before re-enabling them. Fixes: 270a69c4485d ("x86/alternative: Support relocations in alternatives") Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/ZT6narvE%2BLxX%2B7Be@windriver.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-01x86/alternatives: Sync core before enabling interruptsThomas Gleixner1-1/+1
commit 3ea1704a92967834bf0e64ca1205db4680d04048 upstream. text_poke_early() does: local_irq_save(flags); memcpy(addr, opcode, len); local_irq_restore(flags); sync_core(); That's not really correct because the synchronization should happen before interrupts are re-enabled to ensure that a pending interrupt observes the complete update of the opcodes. It's not entirely clear whether the interrupt entry provides enough serialization already, but moving the sync_core() invocation into interrupt disabled region does no harm and is obviously correct. Fixes: 6fffacb30349 ("x86/alternatives, jumplabel: Use text_poke_early() before mm_init()") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/ZT6narvE%2BLxX%2B7Be@windriver.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-13x86/CPU/AMD: Check vendor in the AMD microcode callbackBorislav Petkov (AMD)1-0/+3
commit 9b8493dc43044376716d789d07699f17d538a7c4 upstream. Commit in Fixes added an AMD-specific microcode callback. However, it didn't check the CPU vendor the kernel runs on explicitly. The only reason the Zenbleed check in it didn't run on other x86 vendors hardware was pure coincidental luck: if (!cpu_has_amd_erratum(c, amd_zenbleed)) return; gives true on other vendors because they don't have those families and models. However, with the removal of the cpu_has_amd_erratum() in 05f5f73936fa ("x86/CPU/AMD: Drop now unused CPU erratum checking function") that coincidental condition is gone, leading to the zenbleed check getting executed on other vendors too. Add the explicit vendor check for the whole callback as it should've been done in the first place. Fixes: 522b1d69219d ("x86/cpu/amd: Add a Zenbleed fix") Cc: <stable@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231201184226.16749-1-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-13x86/sev: Fix kernel crash due to late update to read-only ghcb_versionAshwin Dayanand Kamat1-4/+7
[ Upstream commit 27d25348d42161837be08fc63b04a2559d2e781c ] A write-access violation page fault kernel crash was observed while running cpuhotplug LTP testcases on SEV-ES enabled systems. The crash was observed during hotplug, after the CPU was offlined and the process was migrated to different CPU. setup_ghcb() is called again which tries to update ghcb_version in sev_es_negotiate_protocol(). Ideally this is a read_only variable which is initialised during booting. Trying to write it results in a pagefault: BUG: unable to handle page fault for address: ffffffffba556e70 #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation [ ...] Call Trace: <TASK> ? __die_body.cold+0x1a/0x1f ? __die+0x2a/0x35 ? page_fault_oops+0x10c/0x270 ? setup_ghcb+0x71/0x100 ? __x86_return_thunk+0x5/0x6 ? search_exception_tables+0x60/0x70 ? __x86_return_thunk+0x5/0x6 ? fixup_exception+0x27/0x320 ? kernelmode_fixup_or_oops+0xa2/0x120 ? __bad_area_nosemaphore+0x16a/0x1b0 ? kernel_exc_vmm_communication+0x60/0xb0 ? bad_area_nosemaphore+0x16/0x20 ? do_kern_addr_fault+0x7a/0x90 ? exc_page_fault+0xbd/0x160 ? asm_exc_page_fault+0x27/0x30 ? setup_ghcb+0x71/0x100 ? setup_ghcb+0xe/0x100 cpu_init_exception_handling+0x1b9/0x1f0 The fix is to call sev_es_negotiate_protocol() only in the BSP boot phase, and it only needs to be done once in any case. [ mingo: Refined the changelog. ] Fixes: 95d33bfaa3e1 ("x86/sev: Register GHCB memory when SEV-SNP is active") Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Co-developed-by: Bo Gan <bo.gan@broadcom.com> Signed-off-by: Bo Gan <bo.gan@broadcom.com> Signed-off-by: Ashwin Dayanand Kamat <ashwin.kamat@broadcom.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/1701254429-18250-1-git-send-email-kashwindayan@vmware.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-13x86/entry: Convert INT 0x80 emulation to IDTENTRYThomas Gleixner1-1/+1
[ upstream commit be5341eb0d43b1e754799498bd2e8756cc167a41 ] There is no real reason to have a separate ASM entry point implementation for the legacy INT 0x80 syscall emulation on 64-bit. IDTENTRY provides all the functionality needed with the only difference that it does not: - save the syscall number (AX) into pt_regs::orig_ax - set pt_regs::ax to -ENOSYS Both can be done safely in the C code of an IDTENTRY before invoking any of the syscall related functions which depend on this convention. Aside of ASM code reduction this prepares for detecting and handling a local APIC injected vector 0x80. [ kirill.shutemov: More verbose comments ] Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@vger.kernel.org> # v6.0+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28x86/srso: Move retbleed IBPB check into existing 'has_microcode' code blockJosh Poimboeuf1-3/+1
commit 351236947a45a512c517153bbe109fe868d05e6d upstream. Simplify the code flow a bit by moving the retbleed IBPB check into the existing 'has_microcode' block. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/0a22b86b1f6b07f9046a9ab763fc0e0d1b7a91d4.1693889988.git.jpoimboe@kernel.org Cc: Caleb Jorden <cjorden@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28x86/cpu/hygon: Fix the CPU topology evaluation for realPu Wen1-2/+6
commit ee545b94d39a00c93dc98b1dbcbcf731d2eadeb4 upstream. Hygon processors with a model ID > 3 have CPUID leaf 0xB correctly populated and don't need the fixed package ID shift workaround. The fixup is also incorrect when running in a guest. Fixes: e0ceeae708ce ("x86/CPU/hygon: Fix phys_proc_id calculation logic for multi-die processors") Signed-off-by: Pu Wen <puwen@hygon.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/tencent_594804A808BD93A4EBF50A994F228E3A7F07@qq.com Link: https://lore.kernel.org/r/20230814085112.089607918@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28x86/apic/msi: Fix misconfigured non-maskable MSI quirkKoichiro Den1-5/+3
commit b56ebe7c896dc78b5865ec2c4b1dae3c93537517 upstream. commit ef8dd01538ea ("genirq/msi: Make interrupt allocation less convoluted"), reworked the code so that the x86 specific quirk for affinity setting of non-maskable PCI/MSI interrupts is not longer activated if necessary. This could be solved by restoring the original logic in the core MSI code, but after a deeper analysis it turned out that the quirk flag is not required at all. The quirk is only required when the PCI/MSI device cannot mask the MSI interrupts, which in turn also prevents reservation mode from being enabled for the affected interrupt. This allows ot remove the NOMASK quirk bit completely as msi_set_affinity() can instead check whether reservation mode is enabled for the interrupt, which gives exactly the same answer. Even in the momentary non-existing case that the reservation mode would be not set for a maskable MSI interrupt this would not cause any harm as it just would cause msi_set_affinity() to go needlessly through the functionaly equivalent slow path, which works perfectly fine with maskable interrupts as well. Rework msi_set_affinity() to query the reservation mode and remove all NOMASK quirk logic from the core code. [ tglx: Massaged changelog ] Fixes: ef8dd01538ea ("genirq/msi: Make interrupt allocation less convoluted") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Koichiro Den <den@valinux.co.jp> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20231026032036.2462428-1-den@valinux.co.jp Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28x86/shstk: Delay signal entry SSP write until after user accessesRick Edgecombe1-3/+3
commit 31255e072b2e91f97645d792d25b2db744186dd1 upstream. When a signal is being delivered, the kernel needs to make accesses to userspace. These accesses could encounter an access error, in which case the signal delivery itself will trigger a segfault. Usually this would result in the kernel killing the process. But in the case of a SEGV signal handler being configured, the failure of the first signal delivery will result in *another* signal getting delivered. The second signal may succeed if another thread has resolved the issue that triggered the segfault (i.e. a well timed mprotect()/mmap()), or the second signal is being delivered to another stack (i.e. an alt stack). On x86, in the non-shadow stack case, all the accesses to userspace are done before changes to the registers (in pt_regs). The operation is aborted when an access error occurs, so although there may be writes done for the first signal, control flow changes for the signal (regs->ip, regs->sp, etc) are not committed until all the accesses have already completed successfully. This means that the second signal will be delivered as if it happened at the time of the first signal. It will effectively replace the first aborted signal, overwriting the half-written frame of the aborted signal. So on sigreturn from the second signal, control flow will resume happily from the point of control flow where the original signal was delivered. The problem is, when shadow stack is active, the shadow stack SSP register/MSR is updated *before* some of the userspace accesses. This means if the earlier accesses succeed and the later ones fail, the second signal will not be delivered at the same spot on the shadow stack as the first one. So on sigreturn from the second signal, the SSP will be pointing to the wrong location on the shadow stack (off by a frame). Pengfei privately reported that while using a shadow stack enabled glibc, the “signal06” test in the LTP test-suite hung. It turns out it is testing the above described double signal scenario. When this test was compiled with shadow stack, the first signal pushed a shadow stack sigframe, then the second pushed another. When the second signal was handled, the SSP was at the first shadow stack signal frame instead of the original location. The test then got stuck as the #CP from the twice incremented SSP was incorrect and generated segfaults in a loop. Fix this by adjusting the SSP register only after any userspace accesses, such that there can be no failures after the SSP is adjusted. Do this by moving the shadow stack sigframe push logic to happen after all other userspace accesses. Note, sigreturn (as opposed to the signal delivery dealt with in this patch) has ordering behavior that could lead to similar failures. The ordering issues there extend beyond shadow stack to include the alt stack restoration. Fixing that would require cross-arch changes, and the ordering today does not cause any known test or apps breakages. So leave it as is, for now. [ dhansen: minor changelog/subject tweak ] Fixes: 05e36022c054 ("x86/shstk: Handle signals for shadow stack") Reported-by: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20231107182251.91276-1-rick.p.edgecombe%40intel.com Link: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/signal/signal06.c Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-20x86/amd_nb: Use Family 19h Models 60h-7Fh Function 4 IDsYazen Ghannam1-0/+3
commit 2a565258b3f4bbdc7a3c09cd02082cb286a7bffc upstream. Three PCI IDs for DF Function 4 were defined but not used. Add them to the "link" list. Fixes: f8faf3496633 ("x86/amd_nb: Add AMD PCI IDs for SMN communication") Fixes: 23a5b8bb022c ("x86/amd_nb: Add PCI ID for family 19h model 78h") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230803150430.3542854-1-yazen.ghannam@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-20x86/nmi: Fix out-of-order NMI nesting checks & false positive warningPaul E. McKenney1-6/+7
[ Upstream commit f44075ecafb726830e63d33fbca29413149eeeb8 ] The ->idt_seq and ->recv_jiffies variables added by: 1a3ea611fc10 ("x86/nmi: Accumulate NMI-progress evidence in exc_nmi()") ... place the exit-time check of the bottom bit of ->idt_seq after the this_cpu_dec_return() that re-enables NMI nesting. This can result in the following sequence of events on a given CPU in kernels built with CONFIG_NMI_CHECK_CPU=y: o An NMI arrives, and ->idt_seq is incremented to an odd number. In addition, nmi_state is set to NMI_EXECUTING==1. o The NMI is processed. o The this_cpu_dec_return(nmi_state) zeroes nmi_state and returns NMI_EXECUTING==1, thus opting out of the "goto nmi_restart". o Another NMI arrives and ->idt_seq is incremented to an even number, triggering the warning. But all is just fine, at least assuming we don't get so many closely spaced NMIs that the stack overflows or some such. Experience on the fleet indicates that the MTBF of this false positive is about 70 years. Or, for those who are not quite that patient, the MTBF appears to be about one per week per 4,000 systems. Fix this false-positive warning by moving the "nmi_restart" label before the initial ->idt_seq increment/check and moving the this_cpu_dec_return() to follow the final ->idt_seq increment/check. This way, all nested NMIs that get past the NMI_NOT_RUNNING check get a clean ->idt_seq slate. And if they don't get past that check, they will set nmi_state to NMI_LATCHED, which will cause the this_cpu_dec_return(nmi_state) to restart. Fixes: 1a3ea611fc10 ("x86/nmi: Accumulate NMI-progress evidence in exc_nmi()") Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/0cbff831-6e3d-431c-9830-ee65ee7787ff@paulmck-laptop Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20x86/apic: Fake primary thread mask for XEN/PVThomas Gleixner1-0/+11
[ Upstream commit 965e05ff8af98c44f9937366715c512000373164 ] The SMT control mechanism got added as speculation attack vector mitigation. The implemented logic relies on the primary thread mask to be set up properly. This turns out to be an issue with XEN/PV guests because their CPU hotplug mechanics do not enumerate APICs and therefore the mask is never correctly populated. This went unnoticed so far because by chance XEN/PV ends up with smp_num_siblings == 2. So cpu_smt_control stays at its default value CPU_SMT_ENABLED and the primary thread mask is never evaluated in the context of CPU hotplug. This stopped "working" with the upcoming overhaul of the topology evaluation which legitimately provides a fake topology for XEN/PV. That sets smp_num_siblings to 1, which causes the core CPU hot-plug core to refuse to bring up the APs. This happens because cpu_smt_control is set to CPU_SMT_NOT_SUPPORTED which causes cpu_bootable() to evaluate the unpopulated primary thread mask with the conclusion that all non-boot CPUs are not valid to be plugged. The core code has already been made more robust against this kind of fail, but the primary thread mask really wants to be populated to avoid other issues all over the place. Just fake the mask by pretending that all XEN/PV vCPUs are primary threads, which is consistent because all of XEN/PVs topology is fake or non-existent. Fixes: 6a4d2657e048 ("x86/smp: Provide topology_is_primary_thread()") Fixes: f54d4434c281 ("x86/apic: Provide cpu_primary_thread mask") Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.210011520@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20x86/boot: Fix incorrect startup_gdt_descr.sizeYuntao Wang1-1/+1
[ Upstream commit 001470fed5959d01faecbd57fcf2f60294da0de1 ] Since the size value is added to the base address to yield the last valid byte address of the GDT, the current size value of startup_gdt_descr is incorrect (too large by one), fix it. [ mingo: This probably never mattered, because startup_gdt[] is only used in a very controlled fashion - but make it consistent nevertheless. ] Fixes: 866b556efa12 ("x86/head/64: Install startup GDT") Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20230807084547.217390-1-ytcoode@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20x86/srso: Fix vulnerability reporting for missing microcodeJosh Poimboeuf1-14/+22
[ Upstream commit dc6306ad5b0dda040baf1fde3cfd458e6abfc4da ] The SRSO default safe-ret mitigation is reported as "mitigated" even if microcode hasn't been updated. That's wrong because userspace may still be vulnerable to SRSO attacks due to IBPB not flushing branch type predictions. Report the safe-ret + !microcode case as vulnerable. Also report the microcode-only case as vulnerable as it leaves the kernel open to attacks. Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/a8a14f97d1b0e03ec255c81637afdf4cf0ae9c99.1693889988.git.jpoimboe@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20x86/srso: Print mitigation for retbleed IBPB caseJosh Poimboeuf1-3/+3
[ Upstream commit de9f5f7b06a5b7adbfdd8016f011120a4e928add ] When overriding the requested mitigation with IBPB due to retbleed=ibpb, print the mitigation in the usual format instead of a custom error message. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/ec3af919e267773d896c240faf30bfc6a1fd6304.1693889988.git.jpoimboe@kernel.org Stable-dep-of: dc6306ad5b0d ("x86/srso: Fix vulnerability reporting for missing microcode") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20x86/srso: Fix SBPB enablement for (possible) future fixed HWJosh Poimboeuf1-1/+1
[ Upstream commit 1d1142ac51307145dbb256ac3535a1d43a1c9800 ] Make the SBPB check more robust against the (possible) case where future HW has SRSO fixed but doesn't have the SRSO_NO bit set. Fixes: 1b5277c0ea0b ("x86/srso: Add SRSO_NO support") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/cee5050db750b391c9f35f5334f8ff40e66c01b9.1693889988.git.jpoimboe@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-28Merge tag 'x86-urgent-2023-10-28' of ↵Linus Torvalds3-9/+42
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Fix a possible CPU hotplug deadlock bug caused by the new TSC synchronization code - Fix a legacy PIC discovery bug that results in device troubles on affected systems, such as non-working keybards, etc - Add a new Intel CPU model number to <asm/intel-family.h> * tag 'x86-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tsc: Defer marking TSC unstable to a worker x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility x86/cpu: Add model number for Intel Arrow Lake mobile processor
2023-10-27x86/tsc: Defer marking TSC unstable to a workerThomas Gleixner1-1/+9
Tetsuo reported the following lockdep splat when the TSC synchronization fails during CPU hotplug: tsc: Marking TSC unstable due to check_tsc_sync_source failed WARNING: inconsistent lock state inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. ffffffff8cfa1c78 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x23/0x5a0 {IN-HARDIRQ-W} state was registered at: _raw_spin_lock_irqsave+0x3f/0x60 clocksource_mark_unstable+0x1b/0x90 mark_tsc_unstable+0x41/0x50 check_tsc_sync_source+0x14f/0x180 sysvec_call_function_single+0x69/0x90 Possible unsafe locking scenario: lock(watchdog_lock); <Interrupt> lock(watchdog_lock); stack backtrace: _raw_spin_lock+0x30/0x40 clocksource_watchdog+0x23/0x5a0 run_timer_softirq+0x2a/0x50 sysvec_apic_timer_interrupt+0x6e/0x90 The reason is the recent conversion of the TSC synchronization function during CPU hotplug on the control CPU to a SMP function call. In case that the synchronization with the upcoming CPU fails, the TSC has to be marked unstable via clocksource_mark_unstable(). clocksource_mark_unstable() acquires 'watchdog_lock', but that lock is taken with interrupts enabled in the watchdog timer callback to minimize interrupt disabled time. That's obviously a possible deadlock scenario, Before that change the synchronization function was invoked in thread context so this could not happen. As it is not crucical whether the unstable marking happens slightly delayed, defer the call to a worker thread which avoids the lock context problem. Fixes: 9d349d47f0e3 ("x86/smpboot: Make TSC synchronization function call based") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87zg064ceg.ffs@tglx
2023-10-27x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibilityThomas Gleixner2-8/+33
David and a few others reported that on certain newer systems some legacy interrupts fail to work correctly. Debugging revealed that the BIOS of these systems leaves the legacy PIC in uninitialized state which makes the PIC detection fail and the kernel switches to a dummy implementation. Unfortunately this fallback causes quite some code to fail as it depends on checks for the number of legacy PIC interrupts or the availability of the real PIC. In theory there is no reason to use the PIC on any modern system when IO/APIC is available, but the dependencies on the related checks cannot be resolved trivially and on short notice. This needs lots of analysis and rework. The PIC detection has been added to avoid quirky checks and force selection of the dummy implementation all over the place, especially in VM guest scenarios. So it's not an option to revert the relevant commit as that would break a lot of other scenarios. One solution would be to try to initialize the PIC on detection fail and retry the detection, but that puts the burden on everything which does not have a PIC. Fortunately the ACPI/MADT table header has a flag field, which advertises in bit 0 that the system is PCAT compatible, which means it has a legacy 8259 PIC. Evaluate that bit and if set avoid the detection routine and keep the real PIC installed, which then gets initialized (for nothing) and makes the rest of the code with all the dependencies work again. Fixes: e179f6914152 ("x86, irq, pic: Probe for legacy PIC and set legacy_pic appropriately") Reported-by: David Lazar <dlazar@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: David Lazar <dlazar@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Cc: stable@vger.kernel.org Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218003 Link: https://lore.kernel.org/r/875y2u5s8g.ffs@tglx
2023-10-20Merge tag 'sev_fixes_for_v6.6' of ↵Linus Torvalds2-9/+74
//git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: "Take care of a race between when the #VC exception is raised and when the guest kernel gets to emulate certain instructions in SEV-{ES,SNP} guests by: - disabling emulation of MMIO instructions when coming from user mode - checking the IO permission bitmap before emulating IO instructions and verifying the memory operands of INS/OUTS insns" * tag 'sev_fixes_for_v6.6' of //git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev: Check for user-space IOIO pointing to kernel space x86/sev: Check IOBM for IOIO exceptions from user-space x86/sev: Disable MMIO emulation from user mode
2023-10-17x86/sev: Check for user-space IOIO pointing to kernel spaceJoerg Roedel1-2/+29
Check the memory operand of INS/OUTS before emulating the instruction. The #VC exception can get raised from user-space, but the memory operand can be manipulated to access kernel memory before the emulation actually begins and after the exception handler has run. [ bp: Massage commit message. ] Fixes: 597cfe48212a ("x86/boot/compressed/64: Setup a GHCB-based VC Exception handler") Reported-by: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org>
2023-10-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds3-9/+11
Pull kvm fixes from Paolo Bonzini: "ARM: - Fix the handling of the phycal timer offset when FEAT_ECV and CNTPOFF_EL2 are implemented - Restore the functionnality of Permission Indirection that was broken by the Fine Grained Trapping rework - Cleanup some PMU event sharing code MIPS: - Fix W=1 build s390: - One small fix for gisa to avoid stalls x86: - Truncate writes to PMU counters to the counter's width to avoid spurious overflows when emulating counter events in software - Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined architectural behavior) - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to kick the guest out of emulated halt - Fix for loading XSAVE state from an old kernel into a new one - Fixes for AMD AVIC selftests: - Play nice with %llx when formatting guest printf and assert statements - Clean up stale test metadata - Zero-initialize structures in memslot perf test to workaround a suspected 'may be used uninitialized' false positives from GCC" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits) KVM: arm64: timers: Correctly handle TGE flip with CNTPOFF_EL2 KVM: arm64: POR{E0}_EL1 do not need trap handlers KVM: arm64: Add nPIR{E0}_EL1 to HFG traps KVM: MIPS: fix -Wunused-but-set-variable warning KVM: arm64: pmu: Drop redundant check for non-NULL kvm_pmu_events KVM: SVM: Fix build error when using -Werror=unused-but-set-variable x86: KVM: SVM: refresh AVIC inhibition in svm_leave_nested() x86: KVM: SVM: add support for Invalid IPI Vector interception x86: KVM: SVM: always update the x2avic msr interception KVM: selftests: Force load all supported XSAVE state in state test KVM: selftests: Load XSAVE state into untouched vCPU during state test KVM: selftests: Touch relevant XSAVE state in guest for state test KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2} x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer KVM: selftests: Zero-initialize entire test_result in memslot perf test KVM: selftests: Remove obsolete and incorrect test case metadata KVM: selftests: Treat %llx like %lx when formatting guest printf KVM: x86/pmu: Synthesize at most one PMI per VM-exit KVM: x86: Mask LVTPC when handling a PMI KVM: x86/pmu: Truncate counter value to allowed width on write ...
2023-10-15Revert "x86/smp: Put CPUs into INIT on shutdown if possible"Linus Torvalds2-59/+7
This reverts commit 45e34c8af58f23db4474e2bfe79183efec09a18b, and the two subsequent fixes to it: 3f874c9b2aae ("x86/smp: Don't send INIT to non-present and non-booted CPUs") b1472a60a584 ("x86/smp: Don't send INIT to boot CPU") because it seems to result in hung machines at shutdown. Particularly some Dell machines, but Thomas says "The rest seems to be Lenovo and Sony with Alderlake/Raptorlake CPUs - at least that's what I could figure out from the various bug reports. I don't know which CPUs the DELL machines have, so I can't say it's a pattern. I agree with the revert for now" Ashok Raj chimes in: "There was a report (probably this same one), and it turns out it was a bug in the BIOS SMI handler. The client BIOS's were waiting for the lowest APICID to be the SMI rendevous master. If this is MeteorLake, the BSP wasn't the one with the lowest APIC and it triped here. The BIOS change is also being pushed to others for assimilation :) Server BIOS's had this correctly for a while now" and it does look likely to be some bad interaction between SMI and the non-BSP cores having put into INIT (and thus unresponsive until reset). Link: https://bbs.archlinux.org/viewtopic.php?pid=2124429 Link: https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/ Link: https://forum.artixlinux.org/index.php/topic,5997.0.html Link: https://bugzilla.redhat.com/show_bug.cgi?id=2241279 Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-10-15Merge tag 'smp-urgent-2023-10-15' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug fix from Ingo Molnar: "Fix a Longsoon build warning by harmonizing the arch_[un]register_cpu() prototypes between architectures" * tag 'smp-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu-hotplug: Provide prototypes for arch CPU registration
2023-10-12x86/alternatives: Disable KASAN in apply_alternatives()Kirill A. Shutemov1-0/+13
Fei has reported that KASAN triggers during apply_alternatives() on a 5-level paging machine: BUG: KASAN: out-of-bounds in rcu_is_watching() Read of size 4 at addr ff110003ee6419a0 by task swapper/0/0 ... __asan_load4() rcu_is_watching() trace_hardirqs_on() text_poke_early() apply_alternatives() ... On machines with 5-level paging, cpu_feature_enabled(X86_FEATURE_LA57) gets patched. It includes KASAN code, where KASAN_SHADOW_START depends on __VIRTUAL_MASK_SHIFT, which is defined with cpu_feature_enabled(). KASAN gets confused when apply_alternatives() patches the KASAN_SHADOW_START users. A test patch that makes KASAN_SHADOW_START static, by replacing __VIRTUAL_MASK_SHIFT with 56, works around the issue. Fix it for real by disabling KASAN while the kernel is patching alternatives. [ mingo: updated the changelog ] Fixes: 6657fca06e3f ("x86/mm: Allow to boot without LA57 if CONFIG_X86_5LEVEL=y") Reported-by: Fei Yang <fei.yang@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20231012100424.1456-1-kirill.shutemov@linux.intel.com
2023-10-12KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}Sean Christopherson1-4/+1
Mask off xfeatures that aren't exposed to the guest only when saving guest state via KVM_GET_XSAVE{2} instead of modifying user_xfeatures directly. Preserving the maximal set of xfeatures in user_xfeatures restores KVM's ABI for KVM_SET_XSAVE, which prior to commit ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0") allowed userspace to load xfeatures that are supported by the host, irrespective of what xfeatures are exposed to the guest. There is no known use case where userspace *intentionally* loads xfeatures that aren't exposed to the guest, but the bug fixed by commit ad856280ddea was specifically that KVM_GET_SAVE{2} would save xfeatures that weren't exposed to the guest, e.g. would lead to userspace unintentionally loading guest-unsupported xfeatures when live migrating a VM. Restricting KVM_SET_XSAVE to guest-supported xfeatures is especially problematic for QEMU-based setups, as QEMU has a bug where instead of terminating the VM if KVM_SET_XSAVE fails, QEMU instead simply stops loading guest state, i.e. resumes the guest after live migration with incomplete guest state, and ultimately results in guest data corruption. Note, letting userspace restore all host-supported xfeatures does not fix setups where a VM is migrated from a host *without* commit ad856280ddea, to a target with a subset of host-supported xfeatures. However there is no way to safely address that scenario, e.g. KVM could silently drop the unsupported features, but that would be a clear violation of KVM's ABI and so would require userspace to opt-in, at which point userspace could simply be updated to sanitize the to-be-loaded XSAVE state. Reported-by: Tyler Stachecki <stachecki.tyler@gmail.com> Closes: https://lore.kernel.org/all/20230914010003.358162-1-tstachecki@bloomberg.net Fixes: ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0") Cc: stable@vger.kernel.org Cc: Leonardo Bras <leobras@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Message-Id: <20230928001956.924301-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12x86/fpu: Allow caller to constrain xfeatures when copying to uabi bufferSean Christopherson3-5/+10
Plumb an xfeatures mask into __copy_xstate_to_uabi_buf() so that KVM can constrain which xfeatures are saved into the userspace buffer without having to modify the user_xfeatures field in KVM's guest_fpu state. KVM's ABI for KVM_GET_XSAVE{2} is that features that are not exposed to guest must not show up in the effective xstate_bv field of the buffer. Saving only the guest-supported xfeatures allows userspace to load the saved state on a different host with a fewer xfeatures, so long as the target host supports the xfeatures that are exposed to the guest. KVM currently sets user_xfeatures directly to restrict KVM_GET_XSAVE{2} to the set of guest-supported xfeatures, but doing so broke KVM's historical ABI for KVM_SET_XSAVE, which allows userspace to load any xfeatures that are supported by the *host*. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230928001956.924301-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-11cpu-hotplug: Provide prototypes for arch CPU registrationRussell King (Oracle)1-1/+1
Provide common prototypes for arch_register_cpu() and arch_unregister_cpu(). These are called by acpi_processor.c, with weak versions, so the prototype for this is already set. It is generally not necessary for function prototypes to be conditional on preprocessor macros. Some architectures (e.g. Loongarch) are missing the prototype for this, and rather than add it to Loongarch's asm/cpu.h, do the job once for everyone. Since this covers everyone, remove the now unnecessary prototypes in asm/cpu.h, and therefore remove the 'static' from one of ia64's arch_register_cpu() definitions. [ tglx: Bring back the ia64 part and remove the ACPI prototypes ] Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1qkoRr-0088Q8-Da@rmk-PC.armlinux.org.uk
2023-10-11x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUsBorislav Petkov (AMD)1-0/+8
Fix erratum #1485 on Zen4 parts where running with STIBP disabled can cause an #UD exception. The performance impact of the fix is negligible. Reported-by: René Rebe <rene@exactcode.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: René Rebe <rene@exactcode.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/D99589F4-BC5D-430B-87B2-72C20370CF57@exactcode.com
2023-10-09x86/sev: Check IOBM for IOIO exceptions from user-spaceJoerg Roedel2-7/+42
Check the IO permission bitmap (if present) before emulating IOIO #VC exceptions for user-space. These permissions are checked by hardware already before the #VC is raised, but due to the VC-handler decoding race it needs to be checked again in software. Fixes: 25189d08e516 ("x86/sev-es: Add support for handling IOIO exceptions") Reported-by: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Dohrmann <erbse.13@gmx.de> Cc: <stable@kernel.org>
2023-10-09x86/sev: Disable MMIO emulation from user modeBorislav Petkov (AMD)1-0/+3
A virt scenario can be constructed where MMIO memory can be user memory. When that happens, a race condition opens between when the hardware raises the #VC and when the #VC handler gets to emulate the instruction. If the MOVS is replaced with a MOVS accessing kernel memory in that small race window, then write to kernel memory happens as the access checks are not done at emulation time. Disable MMIO emulation in user mode temporarily until a sensible use case appears and justifies properly handling the race window. Fixes: 0118b604c2c9 ("x86/sev-es: Handle MMIO String Instructions") Reported-by: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Dohrmann <erbse.13@gmx.de> Cc: <stable@kernel.org>
2023-10-08x86/resctrl: Fix kernel-doc warningsRandy Dunlap1-5/+5
The kernel test robot reported kernel-doc warnings here: monitor.c:34: warning: Cannot understand * @rmid_free_lru A least recently used list of free RMIDs on line 34 - I thought it was a doc line monitor.c:41: warning: Cannot understand * @rmid_limbo_count count of currently unused but (potentially) on line 41 - I thought it was a doc line monitor.c:50: warning: Cannot understand * @rmid_entry - The entry in the limbo and free lists. on line 50 - I thought it was a doc line We don't have a syntax for documenting individual data items via kernel-doc, so remove the "/**" kernel-doc markers and add a hyphen for consistency. Fixes: 6a445edce657 ("x86/intel_rdt/cqm: Add RDT monitoring initialization") Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231006235132.16227-1-rdunlap@infradead.org
2023-10-02x86/sev: Change npages to unsigned long in snp_accept_memory()Tom Lendacky1-2/+1
In snp_accept_memory(), the npages variables value is calculated from phys_addr_t variables but is an unsigned int. A very large range passed into snp_accept_memory() could lead to truncating npages to zero. This doesn't happen at the moment but let's be prepared. Fixes: 6c3211796326 ("x86/sev: Add SNP-specific unaccepted memory support") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/6d511c25576494f682063c9fb6c705b526a3757e.1687441505.git.thomas.lendacky@amd.com
2023-10-02x86/sev: Use the GHCB protocol when available for SNP CPUID requestsTom Lendacky1-14/+55
SNP retrieves the majority of CPUID information from the SNP CPUID page. But there are times when that information needs to be supplemented by the hypervisor, for example, obtaining the initial APIC ID of the vCPU from leaf 1. The current implementation uses the MSR protocol to retrieve the data from the hypervisor, even when a GHCB exists. The problem arises when an NMI arrives on return from the VMGEXIT. The NMI will be immediately serviced and may generate a #VC requiring communication with the hypervisor. Since a GHCB exists in this case, it will be used. As part of using the GHCB, the #VC handler will write the GHCB physical address into the GHCB MSR and the #VC will be handled. When the NMI completes, processing resumes at the site of the VMGEXIT which is expecting to read the GHCB MSR and find a CPUID MSR protocol response. Since the NMI handling overwrote the GHCB MSR response, the guest will see an invalid reply from the hypervisor and self-terminate. Fix this problem by using the GHCB when it is available. Any NMI received is properly handled because the GHCB contents are copied into a backup page and restored on NMI exit, thus preserving the active GHCB request or result. [ bp: Touchups. ] Fixes: ee0bfa08a345 ("x86/compressed/64: Add support for SEV-SNP CPUID table in #VC handlers") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/a5856fa1ebe3879de91a8f6298b6bbd901c61881.1690578565.git.thomas.lendacky@amd.com
2023-09-29x86/sgx: Resolves SECS reclaim vs. page fault for EAUG raceHaitao Huang1-5/+25
The SGX EPC reclaimer (ksgxd) may reclaim the SECS EPC page for an enclave and set secs.epc_page to NULL. The SECS page is used for EAUG and ELDU in the SGX page fault handler. However, the NULL check for secs.epc_page is only done for ELDU, not EAUG before being used. Fix this by doing the same NULL check and reloading of the SECS page as needed for both EAUG and ELDU. The SECS page holds global enclave metadata. It can only be reclaimed when there are no other enclave pages remaining. At that point, virtually nothing can be done with the enclave until the SECS page is paged back in. An enclave can not run nor generate page faults without a resident SECS page. But it is still possible for a #PF for a non-SECS page to race with paging out the SECS page: when the last resident non-SECS page A triggers a #PF in a non-resident page B, and then page A and the SECS both are paged out before the #PF on B is handled. Hitting this bug requires that race triggered with a #PF for EAUG. Following is a trace when it happens. BUG: kernel NULL pointer dereference, address: 0000000000000000 RIP: 0010:sgx_encl_eaug_page+0xc7/0x210 Call Trace: ? __kmem_cache_alloc_node+0x16a/0x440 ? xa_load+0x6e/0xa0 sgx_vma_fault+0x119/0x230 __do_fault+0x36/0x140 do_fault+0x12f/0x400 __handle_mm_fault+0x728/0x1110 handle_mm_fault+0x105/0x310 do_user_addr_fault+0x1ee/0x750 ? __this_cpu_preempt_check+0x13/0x20 exc_page_fault+0x76/0x180 asm_exc_page_fault+0x27/0x30 Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave") Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20230728051024.33063-1-haitao.huang%40linux.intel.com
2023-09-28x86/srso: Add SRSO mitigation for Hygon processorsPu Wen1-1/+1
Add mitigation for the speculative return stack overflow vulnerability which exists on Hygon processors too. Signed-off-by: Pu Wen <puwen@hygon.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/tencent_4A14812842F104E93AA722EC939483CEFF05@qq.com
2023-09-24x86/kgdb: Fix a kerneldoc warning when build with W=1Christophe JAILLET1-1/+0
When compiled with W=1, the following warning is generated: arch/x86/kernel/kgdb.c:698: warning: Cannot understand * on line 698 - I thought it was a doc line Remove the corresponding empty comment line to fix the warning. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/aad659537c1d4ebd86912a6f0be458676c8e69af.1695401178.git.christophe.jaillet@wanadoo.fr
2023-09-22Merge tag 'x86_urgent_for_v6.6-rc3' of ↵Linus Torvalds2-8/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 rethunk fixes from Borislav Petkov: "Fix the patching ordering between static calls and return thunks" * tag 'x86_urgent_for_v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86,static_call: Fix static-call vs return-thunk x86/alternatives: Remove faulty optimization
2023-09-22Merge tag 'x86-urgent-2023-09-22' of ↵Linus Torvalds5-48/+45
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Fix a kexec bug - Fix an UML build bug - Fix a handful of SRSO related bugs - Fix a shadow stacks handling bug & robustify related code * tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/shstk: Add warning for shadow stack double unmap x86/shstk: Remove useless clone error handling x86/shstk: Handle vfork clone failure correctly x86/srso: Fix SBPB enablement for spec_rstack_overflow=off x86/srso: Don't probe microcode in a guest x86/srso: Set CPUID feature bits independently of bug or mitigation status x86/srso: Fix srso_show_state() side effect x86/asm: Fix build of UML with KASAN x86/mm, kexec, ima: Use memblock_free_late() from ima_free_kexec_buffer()
2023-09-22x86,static_call: Fix static-call vs return-thunkPeter Zijlstra2-1/+3
Commit 7825451fa4dc ("static_call: Add call depth tracking support") failed to realize the problem fixed there is not specific to call depth tracking but applies to all return-thunk uses. Move the fix to the appropriate place and condition. Fixes: ee88d363d156 ("x86,static_call: Use alternative RET encoding") Reported-by: David Kaplan <David.Kaplan@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org>
2023-09-22x86/alternatives: Remove faulty optimizationJosh Poimboeuf1-8/+0
The following commit 095b8303f383 ("x86/alternative: Make custom return thunk unconditional") made '__x86_return_thunk' a placeholder value. All code setting X86_FEATURE_RETHUNK also changes the value of 'x86_return_thunk'. So the optimization at the beginning of apply_returns() is dead code. Also, before the above-mentioned commit, the optimization actually had a bug It bypassed __static_call_fixup(), causing some raw returns to remain unpatched in static call trampolines. Thus the 'Fixes' tag. Fixes: d2408e043e72 ("x86/alternative: Optimize returns patching") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/16d19d2249d4485d8380fb215ffaae81e6b8119e.1693889988.git.jpoimboe@kernel.org
2023-09-19x86/shstk: Add warning for shadow stack double unmapRick Edgecombe1-0/+11
There are several ways a thread's shadow stacks can get unmapped. This can happen on exit or exec, as well as error handling in exec or clone. The task struct already keeps track of the thread's shadow stack. Use the size variable to keep track of if the shadow stack has already been freed. When an attempt to double unmap the thread shadow stack is caught, warn about it and abort the operation. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: H.J. Lu <hjl.tools@gmail.com> Link: https://lore.kernel.org/all/20230908203655.543765-4-rick.p.edgecombe%40intel.com
2023-09-19x86/shstk: Remove useless clone error handlingRick Edgecombe1-7/+0
When clone fails after the shadow stack is allocated, any allocated shadow stack is cleaned up in exit_thread() in copy_process(). So the logic in copy_thread() is unneeded, and also will not handle failures that happen outside of copy_thread(). In addition, since there is a second attempt to unmap the same shadow stack, there is a race where an newly mapped region could get unmapped. So remove the logic in copy_thread() and rely on exit_thread() to handle clone failure. Fixes: b2926a36b97a ("x86/shstk: Handle thread shadow stack") Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: H.J. Lu <hjl.tools@gmail.com> Link: https://lore.kernel.org/all/20230908203655.543765-3-rick.p.edgecombe%40intel.com
2023-09-19x86/shstk: Handle vfork clone failure correctlyRick Edgecombe1-2/+20
Shadow stacks are allocated automatically and freed on exit, depending on the clone flags. The two cases where new shadow stacks are not allocated are !CLONE_VM (fork()) and CLONE_VFORK (vfork()). For !CLONE_VM, although a new stack is not allocated, it can be freed normally because it will happen in the child's copy of the VM. However, for CLONE_VFORK the parent and the child are actually using the same shadow stack. So the kernel doesn't need to allocate *or* free a shadow stack for a CLONE_VFORK child. CLONE_VFORK children already need special tracking to avoid returning to userspace until the child exits or execs. Shadow stack uses this same tracking to avoid freeing CLONE_VFORK shadow stacks. However, the tracking is not setup until the clone has succeeded (internally). Which means, if a CLONE_VFORK fails, the existing logic will not know it is a CLONE_VFORK and proceed to unmap the parents shadow stack. This error handling cleanup logic runs via exit_thread() in the bad_fork_cleanup_thread label in copy_process(). The issue was seen in the glibc test "posix/tst-spawn3-pidfd" while running with shadow stack using currently out-of-tree glibc patches. Fix it by not unmapping the vfork shadow stack in the error case as well. Since clone is implemented in core code, it is not ideal to pass the clone flags along the error path in order to have shadow stack code have symmetric logic in the freeing half of the thread shadow stack handling. Instead use the existing state for thread shadow stacks to track whether the thread is managing its own shadow stack. For CLONE_VFORK, simply set shstk->base and shstk->size to 0, and have it mean the thread is not managing a shadow stack and so should skip cleanup work. Implement this by breaking up the CLONE_VFORK and !CLONE_VM cases in shstk_alloc_thread_stack() to separate conditionals since, the logic is now different between them. In the case of CLONE_VFORK && !CLONE_VM, the existing behavior is to not clean up the shadow stack in the child (which should go away quickly with either be exit or exec), so maintain that behavior by handling the CLONE_VFORK case first in the allocation path. This new logioc cleanly handles the case of normal, successful CLONE_VFORK's skipping cleaning up their shadow stack's on exit as well. So remove the existing, vfork shadow stack freeing logic. This is in deactivate_mm() where vfork_done is used to tell if it is a vfork child that can skip cleaning up the thread shadow stack. Fixes: b2926a36b97a ("x86/shstk: Handle thread shadow stack") Reported-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: H.J. Lu <hjl.tools@gmail.com> Link: https://lore.kernel.org/all/20230908203655.543765-2-rick.p.edgecombe%40intel.com
2023-09-19x86/srso: Fix SBPB enablement for spec_rstack_overflow=offJosh Poimboeuf1-1/+1
If the user has requested no SRSO mitigation, other mitigations can use the lighter-weight SBPB instead of IBPB. Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/b20820c3cfd1003171135ec8d762a0b957348497.1693889988.git.jpoimboe@kernel.org
2023-09-19x86/srso: Don't probe microcode in a guestJosh Poimboeuf1-1/+1
To support live migration, the hypervisor sets the "lowest common denominator" of features. Probing the microcode isn't allowed because any detected features might go away after a migration. As Andy Cooper states: "Linux must not probe microcode when virtualised.  What it may see instantaneously on boot (owing to MSR_PRED_CMD being fully passed through) is not accurate for the lifetime of the VM." Rely on the hypervisor to set the needed IBPB_BRTYPE and SBPB bits. Fixes: 1b5277c0ea0b ("x86/srso: Add SRSO_NO support") Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/3938a7209606c045a3f50305d201d840e8c834c7.1693889988.git.jpoimboe@kernel.org
2023-09-19x86/srso: Set CPUID feature bits independently of bug or mitigation statusJosh Poimboeuf2-31/+10
Booting with mitigations=off incorrectly prevents the X86_FEATURE_{IBPB_BRTYPE,SBPB} CPUID bits from getting set. Also, future CPUs without X86_BUG_SRSO might still have IBPB with branch type prediction flushing, in which case SBPB should be used instead of IBPB. The current code doesn't allow for that. Also, cpu_has_ibpb_brtype_microcode() has some surprising side effects and the setting of these feature bits really doesn't belong in the mitigation code anyway. Move it to earlier. Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/869a1709abfe13b673bdd10c2f4332ca253a40bc.1693889988.git.jpoimboe@kernel.org
2023-09-19x86/srso: Fix srso_show_state() side effectJosh Poimboeuf1-1/+1
Reading the 'spec_rstack_overflow' sysfs file can trigger an unnecessary MSR write, and possibly even a (handled) exception if the microcode hasn't been updated. Avoid all that by just checking X86_FEATURE_IBPB_BRTYPE instead, which gets set by srso_select_mitigation() if the updated microcode exists. Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/27d128899cb8aee9eb2b57ddc996742b0c1d776b.1693889988.git.jpoimboe@kernel.org
2023-09-19x86/xen: move paravirt lazy codeJuergen Gross1-67/+0
Only Xen is using the paravirt lazy mode code, so it can be moved to Xen specific sources. This allows to make some of the functions static or to merge them into their only call sites. While at it do a rename from "paravirt" to "xen" for all moved specifiers. No functional change. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20230913113828.18421-3-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>