summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)AuthorFilesLines
2022-04-20x86/tsx: Disable TSX development mode at bootPawan Gupta4-14/+51
commit 400331f8ffa3bec5c561417e5eec6848464e9160 upstream. A microcode update on some Intel processors causes all TSX transactions to always abort by default[*]. Microcode also added functionality to re-enable TSX for development purposes. With this microcode loaded, if tsx=on was passed on the cmdline, and TSX development mode was already enabled before the kernel boot, it may make the system vulnerable to TSX Asynchronous Abort (TAA). To be on safer side, unconditionally disable TSX development mode during boot. If a viable use case appears, this can be revisited later. [*]: Intel TSX Disable Update for Selected Processors, doc ID: 643557 [ bp: Drop unstable web link, massage heavily. ] Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/347bd844da3a333a9793c6687d4e4eb3b2419a3e.1646943780.git.pawan.kumar.gupta@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-20x86/tsx: Use MSR_TSX_CTRL to clear CPUID bitsPawan Gupta2-7/+48
commit 258f3b8c3210b03386e4ad92b4bd8652b5c1beb3 upstream. tsx_clear_cpuid() uses MSR_TSX_FORCE_ABORT to clear CPUID.RTM and CPUID.HLE. Not all CPUs support MSR_TSX_FORCE_ABORT, alternatively use MSR_IA32_TSX_CTRL when supported. [ bp: Document how and why TSX gets disabled. ] Fixes: 293649307ef9 ("x86/tsx: Clear CPUID bits when TSX always force aborts") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/5b323e77e251a9c8bcdda498c5cc0095be1e1d3c.1646943780.git.pawan.kumar.gupta@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-13x86,static_call: Fix __static_call_return0 for i386Peter Zijlstra1-3/+2
commit 1cd5f059d956e6f614ba6666ecdbcf95db05d5f5 upstream. Paolo reported that the instruction sequence that is used to replace: call __static_call_return0 namely: 66 66 48 31 c0 data16 data16 xor %rax,%rax decodes to something else on i386, namely: 66 66 48 data16 dec %ax 31 c0 xor %eax,%eax Which is a nonsensical sequence that happens to have the same outcome. *However* an important distinction is that it consists of 2 instructions which is a problem when the thing needs to be overwriten with a regular call instruction again. As such, replace the instruction with something that decodes the same on both i386 and x86_64. Fixes: 3f2a8fc4b15d ("static_call/x86: Add __static_call_return0()") Reported-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220318204419.GT8939@worktop.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-13rtc: Check return value from mc146818_get_time()Mateusz Jończyk1-2/+6
[ Upstream commit 0dd8d6cb9eddfe637bcd821bbfd40ebd5a0737b9 ] There are 4 users of mc146818_get_time() and none of them was checking the return value from this function. Change this. Print the appropriate warnings in callers of mc146818_get_time() instead of in the function mc146818_get_time() itself, in order not to add strings to rtc-mc146818-lib.c, which is kind of a library. The callers of alpha_rtc_read_time() and cmos_read_time() may use the contents of (struct rtc_time *) even when the functions return a failure code. Therefore, set the contents of (struct rtc_time *) to 0x00, which looks more sensible then 0xff and aligns with the (possibly stale?) comment in cmos_read_time: /* * If pm_trace abused the RTC for storage, set the timespec to 0, * which tells the caller that this RTC value is unusable. */ For consistency, do this in mc146818_get_time(). Note: hpet_rtc_interrupt() may call mc146818_get_time() many times a second. It is very unlikely, though, that the RTC suddenly stops working and mc146818_get_time() would consistently fail. Only compile-tested on alpha. Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: linux-alpha@vger.kernel.org Cc: x86@kernel.org Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20211210200131.153887-4-mat.jonczyk@o2.pl Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08KVM: x86: fix sending PV IPILi RongQing1-1/+1
commit c15e0ae42c8e5a61e9aca8aac920517cf7b3e94e upstream. If apic_id is less than min, and (max - apic_id) is greater than KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but the new apic_id does not fit the bitmask. In this case __send_ipi_mask should send the IPI. This is mostly theoretical, but it can happen if the apic_ids on three iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0. Fixes: aaffcfd1e82 ("KVM: X86: Implement PV IPIs in linux guest") Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-Id: <1646814944-51801-1-git-send-email-lirongqing@baidu.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-28ACPI / x86: Work around broken XSDT on Advantech DAC-BJ01 boardMark Cilissen1-0/+24
commit e702196bf85778f2c5527ca47f33ef2e2fca8297 upstream. On this board the ACPI RSDP structure points to both a RSDT and an XSDT, but the XSDT points to a truncated FADT. This causes all sorts of trouble and usually a complete failure to boot after the following error occurs: ACPI Error: Unsupported address space: 0x20 (*/hwregs-*) ACPI Error: AE_SUPPORT, Unable to initialize fixed events (*/evevent-*) ACPI: Unable to start ACPI Interpreter This leaves the ACPI implementation in such a broken state that subsequent kernel subsystem initialisations go wrong, resulting in among others mismapped PCI memory, SATA and USB enumeration failures, and freezes. As this is an older embedded platform that will likely never see any BIOS updates to address this issue and its default shipping OS only complies to ACPI 1.0, work around this by forcing `acpi=rsdt`. This patch, applied on top of Linux 5.10.102, was confirmed on real hardware to fix the issue. Signed-off-by: Mark Cilissen <mark@yotsuba.nl> Cc: All applicable <stable@vger.kernel.org> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-19x86/module: Fix the paravirt vs alternative orderPeter Zijlstra1-5/+8
commit 5adf349439d29f92467e864f728dfc23180f3ef9 upstream. Ever since commit 4e6292114c74 ("x86/paravirt: Add new features for paravirt patching") there is an ordering dependency between patching paravirt ops and patching alternatives, the module loader still violates this. Fixes: 4e6292114c74 ("x86/paravirt: Add new features for paravirt patching") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220303112825.068773913@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16x86/traps: Mark do_int3() NOKPROBE_SYMBOLLi Huafei1-0/+1
commit a365a65f9ca1ceb9cf1ac29db4a4f51df7c507ad upstream. Since kprobe_int3_handler() is called in do_int3(), probing do_int3() can cause a breakpoint recursion and crash the kernel. Therefore, do_int3() should be marked as NOKPROBE_SYMBOL. Fixes: 21e28290b317 ("x86/traps: Split int3 handler up") Signed-off-by: Li Huafei <lihuafei1@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220310120915.63349-1-lihuafei1@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16x86/sgx: Free backing memory after faulting the enclave pageJarkko Sakkinen1-9/+48
commit 08999b2489b4c9b939d7483dbd03702ee4576d96 upstream. There is a limited amount of SGX memory (EPC) on each system. When that memory is used up, SGX has its own swapping mechanism which is similar in concept but totally separate from the core mm/* code. Instead of swapping to disk, SGX swaps from EPC to normal RAM. That normal RAM comes from a shared memory pseudo-file and can itself be swapped by the core mm code. There is a hierarchy like this: EPC <-> shmem <-> disk After data is swapped back in from shmem to EPC, the shmem backing storage needs to be freed. Currently, the backing shmem is not freed. This effectively wastes the shmem while the enclave is running. The memory is recovered when the enclave is destroyed and the backing storage freed. Sort this out by freeing memory with shmem_truncate_range(), as soon as a page is faulted back to the EPC. In addition, free the memory for PCMD pages as soon as all PCMD's in a page have been marked as unused by zeroing its contents. Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220303223859.273187-1-jarkko@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16x86/boot: Fix memremap of setup_indirect structuresRoss Philipson4-43/+146
commit 7228918b34615ef6317edcd9a058a057bc54aa32 upstream. As documented, the setup_indirect structure is nested inside the setup_data structures in the setup_data list. The code currently accesses the fields inside the setup_indirect structure but only the sizeof(struct setup_data) is being memremapped. No crash occurred but this is just due to how the area is remapped under the covers. Properly memremap both the setup_data and setup_indirect structures in these cases before accessing them. Fixes: b3c72fc9a78e ("x86/boot: Introduce setup_indirect") Signed-off-by: Ross Philipson <ross.philipson@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/1645668456-22036-2-git-send-email-ross.philipson@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPUWanpeng Li1-3/+6
[ Upstream commit ec756e40e271866f951d77c5e923d8deb6002b15 ] Inspired by commit 3553ae5690a (x86/kvm: Don't use pvqspinlock code if only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1645171838-2855-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-11x86/speculation: Warn about eIBRS + LFENCE + Unprivileged eBPF + SMTJosh Poimboeuf1-2/+25
commit 0de05d056afdb00eca8c7bbb0c79a3438daf700c upstream. The commit 44a3918c8245 ("x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting") added a warning for the "eIBRS + unprivileged eBPF" combination, which has been shown to be vulnerable against Spectre v2 BHB-based attacks. However, there's no warning about the "eIBRS + LFENCE retpoline + unprivileged eBPF" combo. The LFENCE adds more protection by shortening the speculation window after a mispredicted branch. That makes an attack significantly more difficult, even with unprivileged eBPF. So at least for now the logic doesn't warn about that combination. But if you then add SMT into the mix, the SMT attack angle weakens the effectiveness of the LFENCE considerably. So extend the "eIBRS + unprivileged eBPF" warning to also include the "eIBRS + LFENCE + unprivileged eBPF + SMT" case. [ bp: Massage commit message. ] Suggested-by: Alyssa Milburn <alyssa.milburn@linux.intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Warn about Spectre v2 LFENCE mitigationJosh Poimboeuf1-0/+5
commit eafd987d4a82c7bb5aa12f0e3b4f8f3dea93e678 upstream. With: f8a66d608a3e ("x86,bugs: Unconditionally allow spectre_v2=retpoline,amd") it became possible to enable the LFENCE "retpoline" on Intel. However, Intel doesn't recommend it, as it has some weaknesses compared to retpoline. Now AMD doesn't recommend it either. It can still be left available as a cmdline option. It's faster than retpoline but is weaker in certain scenarios -- particularly SMT, but even non-SMT may be vulnerable in some cases. So just unconditionally warn if the user requests it on the cmdline. [ bp: Massage commit message. ] Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Use generic retpoline by default on AMDKim Phillips1-9/+0
commit 244d00b5dd4755f8df892c86cab35fb2cfd4f14b upstream. AMD retpoline may be susceptible to speculation. The speculation execution window for an incorrect indirect branch prediction using LFENCE/JMP sequence may potentially be large enough to allow exploitation using Spectre V2. By default, don't use retpoline,lfence on AMD. Instead, use the generic retpoline. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation ↵Josh Poimboeuf1-6/+29
reporting commit 44a3918c8245ab10c6c9719dd12e7a8d291980d8 upstream. With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable to Spectre v2 BHB-based attacks. When both are enabled, print a warning message and report it in the 'spectre_v2' sysfs vulnerabilities file. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> [fllinden@amazon.com: backported to 5.15] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Add eIBRS + Retpoline optionsPeter Zijlstra1-37/+96
commit 1e19da8522c81bf46b335f84137165741e0d82b7 upstream. Thanks to the chaps at VUsec it is now clear that eIBRS is not sufficient, therefore allow enabling of retpolines along with eIBRS. Add spectre_v2=eibrs, spectre_v2=eibrs,lfence and spectre_v2=eibrs,retpoline options to explicitly pick your preferred means of mitigation. Since there's new mitigations there's also user visible changes in /sys/devices/system/cpu/vulnerabilities/spectre_v2 to reflect these new mitigations. [ bp: Massage commit message, trim error messages, do more precise eIBRS mode checking. ] Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCEPeter Zijlstra (Intel)1-11/+18
commit d45476d9832409371537013ebdd8dc1a7781f97a upstream. The RETPOLINE_AMD name is unfortunate since it isn't necessarily AMD only, in fact Hygon also uses it. Furthermore it will likely be sufficient for some Intel processors. Therefore rename the thing to RETPOLINE_LFENCE to better describe what it is. Add the spectre_v2=retpoline,lfence option as an alias to spectre_v2=retpoline,amd to preserve existing setups. However, the output of /sys/devices/system/cpu/vulnerabilities/spectre_v2 will be changed. [ bp: Fix typos, massage. ] Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> [fllinden@amazon.com: backported to 5.15] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86,bugs: Unconditionally allow spectre_v2=retpoline,amdPeter Zijlstra1-7/+0
commit f8a66d608a3e471e1202778c2a36cbdc96bae73b upstream. Currently Linux prevents usage of retpoline,amd on !AMD hardware, this is unfriendly and gets in the way of testing. Remove this restriction. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Borislav Petkov <bp@suse.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20211026120310.487348118@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-08mm: defer kmemleak object creation of module_alloc()Kefeng Wang1-3/+4
[ Upstream commit 60115fa54ad7b913b7cb5844e6b7ffeb842d55f2 ] Yongqiang reports a kmemleak panic when module insmod/rmmod with KASAN enabled(without KASAN_VMALLOC) on x86[1]. When the module area allocates memory, it's kmemleak_object is created successfully, but the KASAN shadow memory of module allocation is not ready, so when kmemleak scan the module's pointer, it will panic due to no shadow memory with KASAN check. module_alloc __vmalloc_node_range kmemleak_vmalloc kmemleak_scan update_checksum kasan_module_alloc kmemleak_ignore Note, there is no problem if KASAN_VMALLOC enabled, the modules area entire shadow memory is preallocated. Thus, the bug only exits on ARCH which supports dynamic allocation of module area per module load, for now, only x86/arm64/s390 are involved. Add a VM_DEFER_KMEMLEAK flags, defer vmalloc'ed object register of kmemleak in module_alloc() to fix this issue. [1] https://lore.kernel.org/all/6d41e2b9-4692-5ec4-b1cd-cbe29ae89739@huawei.com/ [wangkefeng.wang@huawei.com: fix build] Link: https://lkml.kernel.org/r/20211125080307.27225-1-wangkefeng.wang@huawei.com [akpm@linux-foundation.org: simplify ifdefs, per Andrey] Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com Link: https://lkml.kernel.org/r/20211124142034.192078-1-wangkefeng.wang@huawei.com Fixes: 793213a82de4 ("s390/kasan: dynamic shadow mem allocation for modules") Fixes: 39d114ddc682 ("arm64: add KASAN support") Fixes: bebf56a1b176 ("kasan: enable instrumentation of global variables") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: Yongqiang Liu <liuyongqiang13@huawei.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-02x86/ptrace: Fix xfpregs_set()'s incorrect xmm clearingAndy Lutomirski2-7/+6
commit 44cad52cc14ae10062f142ec16ede489bccf4469 upstream. xfpregs_set() handles 32-bit REGSET_XFP and 64-bit REGSET_FP. The actual code treats these regsets as modern FX state (i.e. the beginning part of XSTATE). The declarations of the regsets thought they were the legacy i387 format. The code thought they were the 32-bit (no xmm8..15) variant of XSTATE and, for good measure, made the high bits disappear by zeroing the wrong part of the buffer. The latter broke ptrace, and everything else confused anyone trying to understand the code. In particular, the nonsense definitions of the regsets confused me when I wrote this code. Clean this all up. Change the declarations to match reality (which shouldn't change the generated code, let alone the ABI) and fix xfpregs_set() to clear the correct bits and to only do so for 32-bit callers. Fixes: 6164331d15f7 ("x86/fpu: Rewrite xfpregs_set()") Reported-by: Luís Ferreira <contact@lsferreira.net> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215524 Link: https://lore.kernel.org/r/YgpFnZpF01WwR8wU@zn.tnic Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-16x86/sgx: Silence softlockup detection when releasing large enclavesReinette Chatre1-0/+2
commit 8795359e35bc33bf86b6d0765aa7f37431db3b9c upstream. Vijay reported that the "unclobbered_vdso_oversubscribed" selftest triggers the softlockup detector. Actual SGX systems have 128GB of enclave memory or more. The "unclobbered_vdso_oversubscribed" selftest creates one enclave which consumes all of the enclave memory on the system. Tearing down such a large enclave takes around a minute, most of it in the loop where the EREMOVE instruction is applied to each individual 4k enclave page. Spending one minute in a loop triggers the softlockup detector. Add a cond_resched() to give other tasks a chance to run and placate the softlockup detector. Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Vijay Dhanraj <vijay.dhanraj@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> (kselftest as sanity check) Link: https://lkml.kernel.org/r/ced01cac1e75f900251b0a4ae1150aa8ebd295ec.1644345232.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPINTony Luck1-0/+1
commit e464121f2d40eabc7d11823fb26db807ce945df4 upstream. Missed adding the Icelake-D CPU to the list. It uses the same MSRs to control and read the inventory number as all the other models. Fixes: dc6b025de95b ("x86/mce: Add Xeon Icelake to list of CPUs that support PPIN") Reported-by: Ailin Xu <ailin.xu@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220121174743.1875294-2-tony.luck@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01x86/MCE/AMD: Allow thresholding interface updates after initYazen Ghannam1-1/+1
commit 1f52b0aba6fd37653416375cb8a1ca673acf8d5f upstream. Changes to the AMD Thresholding sysfs code prevents sysfs writes from updating the underlying registers once CPU init is completed, i.e. "threshold_banks" is set. Allow the registers to be updated if the thresholding interface is already initialized or if in the init path. Use the "set_lvt_off" value to indicate if running in the init path, since this value is only set during init. Fixes: a037f3ca0ea0 ("x86/mce/amd: Make threshold bank setting hotplug robust") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220117161328.19148-1-yazen.ghannam@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27x86/mce: Mark mce_read_aux() noinstrBorislav Petkov1-1/+1
[ Upstream commit db6c996d6ce45dfb44891f0824a65ecec216f47a ] Fixes vmlinux.o: warning: objtool: do_machine_check()+0x681: call to mce_read_aux() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-10-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mce: Mark mce_end() noinstrBorislav Petkov1-3/+11
[ Upstream commit b4813539d37fa31fed62cdfab7bd2dd8929c5b2e ] It is called by the #MC handler which is noinstr. Fixes vmlinux.o: warning: objtool: do_machine_check()+0xbd6: call to memset() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-9-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mce: Mark mce_panic() noinstrBorislav Petkov1-3/+12
[ Upstream commit 3c7ce80a818fa7950be123cac80cd078e5ac1013 ] And allow instrumentation inside it because it does calls to other facilities which will not be tagged noinstr. Fixes vmlinux.o: warning: objtool: do_machine_check()+0xc73: call to mce_panic() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-8-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mce: Allow instrumentation during task work queueingBorislav Petkov1-0/+11
[ Upstream commit 4fbce464db81a42f9a57ee242d6150ec7f996415 ] Fixes vmlinux.o: warning: objtool: do_machine_check()+0xdb1: call to queue_task_work() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-6-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mm: Flush global TLB when switching to trampoline page-tableJoerg Roedel1-10/+2
[ Upstream commit 71d5049b053876afbde6c3273250b76935494ab2 ] Move the switching code into a function so that it can be re-used and add a global TLB flush. This makes sure that usage of memory which is not mapped in the trampoline page-table is reliably caught. Also move the clearing of CR4.PCIDE before the CR3 switch because the cr4_clear_bits() function will access data not mapped into the trampoline page-table. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211202153226.22946-4-joro@8bytes.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mce/inject: Avoid out-of-bounds write when setting flagsZhang Zixun1-1/+1
[ Upstream commit de768416b203ac84e02a757b782a32efb388476f ] A contrived zero-length write, for example, by using write(2): ... ret = write(fd, str, 0); ... to the "flags" file causes: BUG: KASAN: stack-out-of-bounds in flags_write Write of size 1 at addr ffff888019be7ddf by task writefile/3787 CPU: 4 PID: 3787 Comm: writefile Not tainted 5.16.0-rc7+ #12 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 due to accessing buf one char before its start. Prevent such out-of-bounds access. [ bp: Productize into a proper patch. Link below is the next best thing because the original mail didn't get archived on lore. ] Fixes: 0451d14d0561 ("EDAC, mce_amd_inj: Modify flags attribute to use string arguments") Signed-off-by: Zhang Zixun <zhang133010@icloud.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/linux-edac/YcnePfF1OOqoQwrX@zn.tnic/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/gpu: Reserve stolen memory for first integrated Intel GPULucas De Marchi1-1/+9
commit 9c494ca4d3a535f9ca11ad6af1813983c1c6cbdd upstream. "Stolen memory" is memory set aside for use by an Intel integrated GPU. The intel_graphics_quirks() early quirk reserves this memory when it is called for a GPU that appears in the intel_early_ids[] table of integrated GPUs. Previously intel_graphics_quirks() was marked as QFLAG_APPLY_ONCE, so it was called only for the first Intel GPU found. If a discrete GPU happened to be enumerated first, intel_graphics_quirks() was called for it but not for any integrated GPU found later. Therefore, stolen memory for such an integrated GPU was never reserved. For example, this problem occurs in this Alderlake-P (integrated) + DG2 (discrete) topology where the DG2 is found first, but stolen memory is associated with the integrated GPU: - 00:01.0 Bridge `- 03:00.0 DG2 discrete GPU - 00:02.0 Integrated GPU (with stolen memory) Remove the QFLAG_APPLY_ONCE flag and call intel_graphics_quirks() for every Intel GPU. Reserve stolen memory for the first GPU that appears in intel_early_ids[]. [bhelgaas: commit log, add code comment, squash in https://lore.kernel.org/r/20220118190558.2ququ4vdfjuahicm@ldmartin-desk2] Link: https://lore.kernel.org/r/20220114002843.2083382-1-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-29x86/boot: Move EFI range reservation after cmdline parsingMike Rapoport1-3/+3
commit 2f5b3514c33fecad4003ce0f22ca9691492d310b upstream. The memory reservation in arch/x86/platform/efi/efi.c depends on at least two command line parameters. Put it back later in the boot process and move efi_memblock_x86_reserve_range() out of early_memory_reserve(). An attempt to fix this was done in 8d48bf8206f7 ("x86/boot: Pull up cmdline preparation and early param parsing") but that caused other troubles so it got reverted. The bug this is addressing is: Dan reports that Anjaneya Chagam can no longer use the efi=nosoftreserve kernel command line parameter to suppress "soft reservation" behavior. This is due to the fact that the following call-chain happens at boot: early_reserve_memory |-> efi_memblock_x86_reserve_range |-> efi_fake_memmap_early which does if (!efi_soft_reserve_enabled()) return; and that would have set EFI_MEM_NO_SOFT_RESERVE after having parsed "nosoftreserve". However, parse_early_param() gets called *after* it, leading to the boot cmdline not being taken into account. See also https://lore.kernel.org/r/e8dd8993c38702ee6dd73b3c11f158617e665607.camel@intel.com [ bp: Turn into a proper patch. ] Signed-off-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211213112757.2612-4-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-29Revert "x86/boot: Pull up cmdline preparation and early param parsing"Borislav Petkov1-39/+27
commit fbe6183998546f8896ee0b620ece86deff5a2fd1 upstream. This reverts commit 8d48bf8206f77aa8687f0e241e901e5197e52423. It turned out to be a bad idea as it broke supplying mem= cmdline parameters due to parse_memopt() requiring preparatory work like setting up the e820 table in e820__memory_setup() in order to be able to exclude the range specified by mem=. Pulling that up would've broken Xen PV again, see threads at https://lkml.kernel.org/r/20210920120421.29276-1-jgross@suse.com due to xen_memory_setup() needing the first reservations in early_reserve_memory() - kernel and initrd - to have happened already. This could be fixed again by having Xen do those reservations itself... Long story short, revert this and do a simpler fix in a later patch. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211213112757.2612-3-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-08x86/tsc: Disable clocksource watchdog for TSC on qualified platormsFeng Tang1-4/+24
commit b50db7095fe002fa3e16605546cba66bf1b68a3e upstream. There are cases that the TSC clocksource is wrongly judged as unstable by the clocksource watchdog mechanism which tries to validate the TSC against HPET, PM_TIMER or jiffies. While there is hardly a general reliable way to check the validity of a watchdog, Thomas Gleixner proposed [1]: "I'm inclined to lift that requirement when the CPU has: 1) X86_FEATURE_CONSTANT_TSC 2) X86_FEATURE_NONSTOP_TSC 3) X86_FEATURE_NONSTOP_TSC_S3 4) X86_FEATURE_TSC_ADJUST 5) At max. 4 sockets After two decades of horrors we're finally at a point where TSC seems to be halfway reliable and less abused by BIOS tinkerers. TSC_ADJUST was really key as we can now detect even small modifications reliably and the important point is that we can cure them as well (not pretty but better than all other options)." As feature #3 X86_FEATURE_NONSTOP_TSC_S3 only exists on several generations of Atom processorz, and is always coupled with X86_FEATURE_CONSTANT_TSC and X86_FEATURE_NONSTOP_TSC, skip checking it, and also be more defensive to use maximal 2 sockets. The check is done inside tsc_init() before registering 'tsc-early' and 'tsc' clocksources, as there were cases that both of them had been wrongly judged as unreliable. For more background of tsc/watchdog, there is a good summary in [2] [tglx} Update vs. jiffies: On systems where the only remaining clocksource aside of TSC is jiffies there is no way to make this work because that creates a circular dependency. Jiffies accuracy depends on not missing a periodic timer interrupt, which is not guaranteed. That could be detected by TSC, but as TSC is not trusted this cannot be compensated. The consequence is a circulus vitiosus which results in shutting down TSC and falling back to the jiffies clocksource which is even more unreliable. [1]. https://lore.kernel.org/lkml/87eekfk8bd.fsf@nanos.tec.linutronix.de/ [2]. https://lore.kernel.org/lkml/87a6pimt1f.ffs@nanos.tec.linutronix.de/ [ tglx: Refine comment and amend changelog ] Fixes: 6e3cd95234dc ("x86/hpet: Use another crystalball to evaluate HPET usability") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211117023751.24190-2-feng.tang@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-08x86/tsc: Add a timer to make sure TSC_adjust is always checkedFeng Tang1-0/+41
commit c7719e79347803b8e3b6b50da8c6db410a3012b5 upstream. The TSC_ADJUST register is checked every time a CPU enters idle state, but Thomas Gleixner mentioned there is still a caveat that a system won't enter idle [1], either because it's too busy or configured purposely to not enter idle. Setup a periodic timer (every 10 minutes) to make sure the check is happening on a regular base. [1] https://lore.kernel.org/lkml/875z286xtk.fsf@nanos.tec.linutronix.de/ Fixes: 6e3cd95234dc ("x86/hpet: Use another crystalball to evaluate HPET usability") Requested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211117023751.24190-1-feng.tang@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-08x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qwordMichael Sterritt1-18/+39
[ Upstream commit 1d5379d0475419085d3575bd9155f2e558e96390 ] Properly type the operands being passed to __put_user()/__get_user(). Otherwise, these routines truncate data for dependent instructions (e.g., INSW) and only read/write one byte. This has been tested by sending a string with REP OUTSW to a port and then reading it back in with REP INSW on the same port. Previous behavior was to only send and receive the first char of the size. For example, word operations for "abcd" would only read/write "ac". With change, the full string is now written and read back. Fixes: f980f9c31a923 (x86/sev-es: Compile early handler code into kernel image) Signed-off-by: Michael Sterritt <sterritt@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Peter Gonda <pgonda@google.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Link: https://lkml.kernel.org/r/20211119232757.176201-1-sterritt@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-08x86/hyperv: Move required MSRs check to initial platform probingSean Christopherson1-5/+15
[ Upstream commit f3e613e72f66226b3bea1046c1b864f67a3000a4 ] Explicitly check for MSR_HYPERCALL and MSR_VP_INDEX support when probing for running as a Hyper-V guest instead of waiting until hyperv_init() to detect the bogus configuration. Add messages to give the admin a heads up that they are likely running on a broken virtual machine setup. At best, silently disabling Hyper-V is confusing and difficult to debug, e.g. the kernel _says_ it's using all these fancy Hyper-V features, but always falls back to the native versions. At worst, the half baked setup will crash/hang the kernel. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20211104182239.1302956-3-seanjc@google.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25signal: Replace force_fatal_sig with force_exit_sig when in doubtEric W. Biederman1-1/+1
commit fcb116bc43c8c37c052530ead79872f8b2615711 upstream. Recently to prevent issues with SECCOMP_RET_KILL and similar signals being changed before they are delivered SA_IMMUTABLE was added. Unfortunately this broke debuggers[1][2] which reasonably expect to be able to trap synchronous SIGTRAP and SIGSEGV even when the target process is not configured to handle those signals. Add force_exit_sig and use it instead of force_fatal_sig where historically the code has directly called do_exit. This has the implementation benefits of going through the signal exit path (including generating core dumps) without the danger of allowing userspace to ignore or change these signals. This avoids userspace regressions as older kernels exited with do_exit which debuggers also can not intercept. In the future is should be possible to improve the quality of implementation of the kernel by changing some of these force_exit_sig calls to force_fatal_sig. That can be done where it matters on a case-by-case basis with careful analysis. Reported-by: Kyle Huey <me@kylehuey.com> Reported-by: kernel test robot <oliver.sang@intel.com> [1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com [2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020 Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed") Fixes: a3616a3c0272 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die") Fixes: 83a1f27ad773 ("signal/powerpc: On swapcontext failure force SIGSEGV") Fixes: 9bc508cf0791 ("signal/s390: Use force_sigsegv in default_trap_handler") Fixes: 086ec444f866 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig") Fixes: c317d306d550 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails") Fixes: 695dd0d634df ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit") Fixes: 1fbd60df8a85 ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.") Fixes: 941edc5bf174 ("exit/syscall_user_dispatch: Send ordinary signals on failure") Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm.org Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Thomas Backlund <tmb@iki.fi> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-25signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)Eric W. Biederman1-1/+1
commit e21294a7aaae32c5d7154b187113a04db5852e37 upstream. Now that force_fatal_sig exists it is unnecessary and a bit confusing to use force_sigsegv in cases where the simpler force_fatal_sig is wanted. So change every instance we can to make the code clearer. Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Link: https://lkml.kernel.org/r/877de7jrev.fsf@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Thomas Backlund <tmb@iki.fi> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-25signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.Eric W. Biederman1-1/+3
commit 1fbd60df8a852d9c55de8cd3621899cf4c72a5b7 upstream. Update save_v86_state to always complete all of it's work except possibly some of the copies to userspace even if save_v86_state takes a fault. This ensures that the kernel is always in a sane state, even if userspace has done something silly. When save_v86_state takes a fault update it to force userspace to take a SIGSEGV and terminate the userspace application. As Andy pointed out in review of the first version of this change there are races between sigaction and the application terinating. Now that the code has been modified to always perform all save_v86_state's work (except possibly copying to userspace) those races do not matter from a kernel perspective. Forcing the userspace application to terminate (by resetting it's handler to SIGDFL) is there to keep everything as close to the current behavior as possible while removing the unique (and difficult to maintain) use of do_exit. If this new SIGSEGV happens during handle_signal the next time around the exit_to_user_mode_loop, SIGSEGV will be delivered to userspace. All of the callers of handle_vm86_trap and handle_vm86_fault run the exit_to_user_mode_loop before they return to userspace any signal sent to the current task during their execution will be delivered to the current task before that tasks exits to usermode. Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: x86@kernel.org Cc: H Peter Anvin <hpa@zytor.com> v1: https://lkml.kernel.org/r/20211020174406.17889-10-ebiederm@xmission.com Link: https://lkml.kernel.org/r/877de1xcr6.fsf_-_@disp2133 Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Thomas Backlund <tmb@iki.fi> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-25x86/sgx: Fix free page accountingReinette Chatre1-6/+6
commit ac5d272a0ad0419f52e08c91953356e32b075af7 upstream. The SGX driver maintains a single global free page counter, sgx_nr_free_pages, that reflects the number of free pages available across all NUMA nodes. Correspondingly, a list of free pages is associated with each NUMA node and sgx_nr_free_pages is updated every time a page is added or removed from any of the free page lists. The main usage of sgx_nr_free_pages is by the reclaimer that runs when it (sgx_nr_free_pages) goes below a watermark to ensure that there are always some free pages available to, for example, support efficient page faults. With sgx_nr_free_pages accessed and modified from a few places it is essential to ensure that these accesses are done safely but this is not the case. sgx_nr_free_pages is read without any protection and updated with inconsistent protection by any one of the spin locks associated with the individual NUMA nodes. For example: CPU_A CPU_B ----- ----- spin_lock(&nodeA->lock); spin_lock(&nodeB->lock); ... ... sgx_nr_free_pages--; /* NOT SAFE */ sgx_nr_free_pages--; spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock); Since sgx_nr_free_pages may be protected by different spin locks while being modified from different CPUs, the following scenario is possible: CPU_A CPU_B ----- ----- {sgx_nr_free_pages = 100} spin_lock(&nodeA->lock); spin_lock(&nodeB->lock); sgx_nr_free_pages--; sgx_nr_free_pages--; /* LOAD sgx_nr_free_pages = 100 */ /* LOAD sgx_nr_free_pages = 100 */ /* sgx_nr_free_pages-- */ /* sgx_nr_free_pages-- */ /* STORE sgx_nr_free_pages = 99 */ /* STORE sgx_nr_free_pages = 99 */ spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock); In the above scenario, sgx_nr_free_pages is decremented from two CPUs but instead of sgx_nr_free_pages ending with a value that is two less than it started with, it was only decremented by one while the number of free pages were actually reduced by two. The consequence of sgx_nr_free_pages not being protected is that its value may not accurately reflect the actual number of free pages on the system, impacting the availability of free pages in support of many flows. The problematic scenario is when the reclaimer does not run because it believes there to be sufficient free pages while any attempt to allocate a page fails because there are no free pages available. In the SGX driver the reclaimer's watermark is only 32 pages so after encountering the above example scenario 32 times a user space hang is possible when there are no more free pages because of repeated page faults caused by no free pages made available. The following flow was encountered: asm_exc_page_fault ... sgx_vma_fault() sgx_encl_load_page() sgx_encl_eldu() // Encrypted page needs to be loaded from backing // storage into newly allocated SGX memory page sgx_alloc_epc_page() // Allocate a page of SGX memory __sgx_alloc_epc_page() // Fails, no free SGX memory ... if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer wake_up(&ksgxd_waitq); return -EBUSY; // Return -EBUSY giving reclaimer time to run return -EBUSY; return -EBUSY; return VM_FAULT_NOPAGE; The reclaimer is triggered in above flow with the following code: static bool sgx_should_reclaim(unsigned long watermark) { return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list); } In the problematic scenario there were no free pages available yet the value of sgx_nr_free_pages was above the watermark. The allocation of SGX memory thus always failed because of a lack of free pages while no free pages were made available because the reclaimer is never started because of sgx_nr_free_pages' incorrect value. The consequence was that user space kept encountering VM_FAULT_NOPAGE that caused the same address to be accessed repeatedly with the same result. Change the global free page counter to an atomic type that ensures simultaneous updates are done safely. While doing so, move the updating of the variable outside of the spin lock critical section to which it does not belong. Cc: stable@vger.kernel.org Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()") Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/a95a40743bbd3f795b465f30922dde7f1ea9e0eb.1637004094.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-25x86/boot: Pull up cmdline preparation and early param parsingBorislav Petkov1-27/+39
commit 8d48bf8206f77aa8687f0e241e901e5197e52423 upstream. Dan reports that Anjaneya Chagam can no longer use the efi=nosoftreserve kernel command line parameter to suppress "soft reservation" behavior. This is due to the fact that the following call-chain happens at boot: early_reserve_memory |-> efi_memblock_x86_reserve_range |-> efi_fake_memmap_early which does if (!efi_soft_reserve_enabled()) return; and that would have set EFI_MEM_NO_SOFT_RESERVE after having parsed "nosoftreserve". However, parse_early_param() gets called *after* it, leading to the boot cmdline not being taken into account. Therefore, carve out the command line preparation into a separate function which does the early param parsing too. So that it all goes together. And then call that function before early_reserve_memory() so that the params would have been parsed by then. Fixes: 8aa83e6395ce ("x86/setup: Call early_reserve_memory() earlier") Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Anjaneya Chagam <anjaneya.chagam@intel.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/e8dd8993c38702ee6dd73b3c11f158617e665607.camel@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/sev: Make the #VC exception stacks part of the default stacks storageBorislav Petkov1-32/+0
commit 541ac97186d9ea88491961a46284de3603c914fd upstream. The size of the exception stacks was increased by the commit in Fixes, resulting in stack sizes greater than a page in size. The #VC exception handling was only mapping the first (bottom) page, resulting in an SEV-ES guest failing to boot. Make the #VC exception stacks part of the default exception stacks storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map them only when a SEV-ES guest has been detected. Rip out the custom VC stacks mapping and storage code. [ bp: Steal and adapt Tom's commit message. ] Fixes: 7fae4c24a2b8 ("x86: Increase exception stack sizes") Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Brijesh Singh <brijesh.singh@amd.com> Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/sev: Add an x86 version of cc_platform_has()Tom Lendacky2-0/+75
commit aa5a461171f98fde0df78c4f6b5018a1e967cf81 upstream. Introduce an x86 version of the cc_platform_has() function. This will be used to replace vendor specific calls like sme_active(), sev_active(), etc. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210928191009.32551-4-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/mce: Add errata workaround for Skylake SKX37Dave Jones1-2/+3
commit e629fc1407a63dbb748f828f9814463ffc2a0af0 upstream. Errata SKX37 is word-for-word identical to the other errata listed in this workaround. I happened to notice this after investigating a CMCI storm on a Skylake host. While I can't confirm this was the root cause, spurious corrected errors does sound like a likely suspect. Fixes: 2976908e4198 ("x86/mce: Do not log spurious corrected mce errors") Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20211029205759.GA7385@codemonkey.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/sev: Fix stack type check in vc_switch_off_ist()Joerg Roedel1-1/+1
[ Upstream commit 5681981fb788281b09a4ea14d310d30b2bd89132 ] The value of STACK_TYPE_EXCEPTION_LAST points to the last _valid_ exception stack. Reflect that in the check done in the vc_switch_off_ist() function. Fixes: a13644f3a53de ("x86/entry/64: Add entry code for #VC handler") Reported-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211021080833.30875-2-joro@8bytes.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18x86/mm/64: Improve stack overflow warningsPeter Zijlstra2-12/+19
[ Upstream commit 44b979fa302cab91bdd2cc982823e5c13202cd4e ] Current code has an explicit check for hitting the task stack guard; but overflowing any of the other stacks will get you a non-descript general #DF warning. Improve matters by using get_stack_info_noinstr() to detetrmine if and which stack guard page got hit, enabling a better stack warning. In specific, Michael Wang reported what turned out to be an NMI exception stack overflow, which is now clearly reported as such: [] BUG: NMI stack guard page was hit at 0000000085fd977b (stack is 000000003a55b09e..00000000d8cce1a5) Reported-by: Michael Wang <yun.wang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Wang <yun.wang@linux.alibaba.com> Link: https://lkml.kernel.org/r/YUTE/NuqnaWbST8n@hirez.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18x86/iopl: Fake iopl(3) CLI/STI usagePeter Zijlstra2-0/+34
commit b968e84b509da593c50dc3db679e1d33de701f78 upstream. Since commit c8137ace5638 ("x86/iopl: Restrict iopl() permission scope") it's possible to emulate iopl(3) using ioperm(), except for the CLI/STI usage. Userspace CLI/STI usage is very dubious (read broken), since any exception taken during that window can lead to rescheduling anyway (or worse). The IOPL(2) manpage even states that usage of CLI/STI is highly discouraged and might even crash the system. Of course, that won't stop people and HP has the dubious honour of being the first vendor to be found using this in their hp-health package. In order to enable this 'software' to still 'work', have the #GP treat the CLI/STI instructions as NOPs when iopl(3). Warn the user that their program is doing dubious things. Fixes: a24ca9976843 ("x86/iopl: Remove legacy IOPL option") Reported-by: Ondrej Zary <linux@zary.sk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org # v5.5+ Link: https://lkml.kernel.org/r/20210918090641.GD5106@worktop.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/irq: Ensure PI wakeup handler is unregistered before module unloadSean Christopherson1-1/+3
commit 6ff53f6a438f72998f56e82e76694a1df9d1ea2c upstream. Add a synchronize_rcu() after clearing the posted interrupt wakeup handler to ensure all readers, i.e. in-flight IRQ handlers, see the new handler before returning to the caller. If the caller is an exiting module and is unregistering its handler, failure to wait could result in the IRQ handler jumping into an unloaded module. The registration path doesn't require synchronization, as it's the caller's responsibility to not generate interrupts it cares about until after its handler is registered. Fixes: f6b3c72c2366 ("x86/irq: Define a global vector for VT-d Posted-Interrupts") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009001107.3936588-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/cpu: Fix migration safety with X86_BUG_NULL_SELJane Malalane4-7/+42
commit 415de44076640483648d6c0f6d645a9ee61328ad upstream. Currently, Linux probes for X86_BUG_NULL_SEL unconditionally which makes it unsafe to migrate in a virtualised environment as the properties across the migration pool might differ. To be specific, the case which goes wrong is: 1. Zen1 (or earlier) and Zen2 (or later) in a migration pool 2. Linux boots on Zen2, probes and finds the absence of X86_BUG_NULL_SEL 3. Linux is then migrated to Zen1 Linux is now running on a X86_BUG_NULL_SEL-impacted CPU while believing that the bug is fixed. The only way to address the problem is to fully trust the "no longer affected" CPUID bit when virtualised, because in the above case it would be clear deliberately to indicate the fact "you might migrate to somewhere which has this behaviour". Zen3 adds the NullSelectorClearsBase CPUID bit to indicate that loading a NULL segment selector zeroes the base and limit fields, as well as just attributes. Zen2 also has this behaviour but doesn't have the NSCB bit. [ bp: Minor touchups. ] Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Signed-off-by: Borislav Petkov <bp@suse.de> CC: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20211021104744.24126-1-jane.malalane@citrix.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-16x86/fpu: Mask out the invalid MXCSR bits properlyBorislav Petkov1-1/+1
This is a fix for the fix (yeah, /facepalm). The correct mask to use is not the negation of the MXCSR_MASK but the actual mask which contains the supported bits in the MXCSR register. Reported and debugged by Ville Syrjälä <ville.syrjala@linux.intel.com> Fixes: d298b03506d3 ("x86/fpu: Restore the masking out of reserved MXCSR bits") Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Tested-by: Ser Olmy <ser.olmy@protonmail.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/YWgYIYXLriayyezv@intel.com