summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2023-01-18x86/resctrl: Fix event counts regression in reused RMIDsPeter Newman1-16/+33
commit 2a81160d29d65b5876ab3f824fda99ae0219f05e upstream. When creating a new monitoring group, the RMID allocated for it may have been used by a group which was previously removed. In this case, the hardware counters will have non-zero values which should be deducted from what is reported in the new group's counts. resctrl_arch_reset_rmid() initializes the prev_msr value for counters to 0, causing the initial count to be charged to the new group. Resurrect __rmid_read() and use it to initialize prev_msr correctly. Unlike before, __rmid_read() checks for error bits in the MSR read so that callers don't need to. Fixes: 1d81d15db39c ("x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()") Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221220164132.443083-1-peternewman@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-18x86/resctrl: Fix task CLOSID/RMID update racePeter Newman1-1/+11
commit fe1f0714385fbcf76b0cbceb02b7277d842014fc upstream. When the user moves a running task to a new rdtgroup using the task's file interface or by deleting its rdtgroup, the resulting change in CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the task(s) CPUs. x86 allows reordering loads with prior stores, so if the task starts running between a task_curr() check that the CPU hoisted before the stores in the CLOSID/RMID update then it can start running with the old CLOSID/RMID until it is switched again because __rdtgroup_move_task() failed to determine that it needs to be interrupted to obtain the new CLOSID/RMID. Refer to the diagram below: CPU 0 CPU 1 ----- ----- __rdtgroup_move_task(): curr <- t1->cpu->rq->curr __schedule(): rq->curr <- t1 resctrl_sched_in(): t1->{closid,rmid} -> {1,1} t1->{closid,rmid} <- {2,2} if (curr == t1) // false IPI(t1->cpu) A similar race impacts rdt_move_group_tasks(), which updates tasks in a deleted rdtgroup. In both cases, use smp_mb() to order the task_struct::{closid,rmid} stores before the loads in task_curr(). In particular, in the rdt_move_group_tasks() case, simply execute an smp_mb() on every iteration with a matching task. It is possible to use a single smp_mb() in rdt_move_group_tasks(), but this would require two passes and a means of remembering which task_structs were updated in the first loop. However, benchmarking results below showed too little performance impact in the simple approach to justify implementing the two-pass approach. Times below were collected using `perf stat` to measure the time to remove a group containing a 1600-task, parallel workload. CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads) # mkdir /sys/fs/resctrl/test # echo $$ > /sys/fs/resctrl/test/tasks # perf bench sched messaging -g 40 -l 100000 task-clock time ranges collected using: # perf stat rmdir /sys/fs/resctrl/test Baseline: 1.54 - 1.60 ms smp_mb() every matching task: 1.57 - 1.67 ms [ bp: Massage commit message. ] Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR") Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount") Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-18x86/pat: Fix pat_x_mtrr_type() for MTRR disabled caseJuergen Gross1-1/+2
commit 90b926e68f500844dff16b5bcea178dc55cf580a upstream. Since 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen") PAT can be enabled without MTRR. This has resulted in problems e.g. for a SEV-SNP guest running under Hyper-V, when trying to establish a new mapping via memremap() with WB caching mode, as pat_x_mtrr_type() will call mtrr_type_lookup(), which in turn is returning MTRR_TYPE_INVALID due to MTRR being disabled in this configuration. The result is a mapping with UC- caching, leading to severe performance degradation. Fix that by handling MTRR_TYPE_INVALID the same way as MTRR_TYPE_WRBACK in pat_x_mtrr_type() because MTRR_TYPE_INVALID means MTRRs are disabled. [ bp: Massage commit message. ] Fixes: 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen") Reported-by: Michael Kelley (LINUX) <mikelley@microsoft.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230110065427.20767-1-jgross@suse.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-18x86/boot: Avoid using Intel mnemonics in AT&T syntax asmPeter Zijlstra1-2/+2
commit 7c6dd961d0c8e7e8f9fdc65071fb09ece702e18d upstream. With 'GNU assembler (GNU Binutils for Debian) 2.39.90.20221231' the build now reports: arch/x86/realmode/rm/../../boot/bioscall.S: Assembler messages: arch/x86/realmode/rm/../../boot/bioscall.S:35: Warning: found `movsd'; assuming `movsl' was meant arch/x86/realmode/rm/../../boot/bioscall.S:70: Warning: found `movsd'; assuming `movsl' was meant arch/x86/boot/bioscall.S: Assembler messages: arch/x86/boot/bioscall.S:35: Warning: found `movsd'; assuming `movsl' was meant arch/x86/boot/bioscall.S:70: Warning: found `movsd'; assuming `movsl' was meant Which is due to: PR gas/29525 Note that with the dropped CMPSD and MOVSD Intel Syntax string insn templates taking operands, mixed IsString/non-IsString template groups (with memory operands) cannot occur anymore. With that maybe_adjust_templates() becomes unnecessary (and is hence being removed). More details: https://sourceware.org/bugzilla/show_bug.cgi?id=29525 Borislav Petkov further explains: " the particular problem here is is that the 'd' suffix is "conflicting" in the sense that you can have SSE mnemonics like movsD %xmm... and the same thing also for string ops (which is the case here) so apparently the agreement in binutils land is to use the always accepted suffixes 'l' or 'q' and phase out 'd' slowly... " Fixes: 7a734e7dd93b ("x86, setup: "glove box" BIOS calls -- infrastructure") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/Y71I3Ex2pvIxMpsP@hirez.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-18elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size}Catalin Marinas1-2/+2
commit 19e183b54528f11fafeca60fc6d0821e29ff281e upstream. A subsequent fix for arm64 will use this parameter to parse the vma information from the snapshot created by dump_vma_snapshot() rather than traversing the vma list without the mmap_lock. Fixes: 6dd8b1a0b6cb ("arm64: mte: Dump the MTE tags in the core file") Cc: <stable@vger.kernel.org> # 5.18.x Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Seth Jenkins <sethjenkins@google.com> Suggested-by: Seth Jenkins <sethjenkins@google.com> Cc: Will Deacon <will@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221222181251.1345752-3-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-18KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUIDPaolo Bonzini1-16/+16
commit 45e966fcca03ecdcccac7cb236e16eea38cc18af upstream. Passing the host topology to the guest is almost certainly wrong and will confuse the scheduler. In addition, several fields of these CPUID leaves vary on each processor; it is simply impossible to return the right values from KVM_GET_SUPPORTED_CPUID in such a way that they can be passed to KVM_SET_CPUID2. The values that will most likely prevent confusion are all zeroes. Userspace will have to override it anyway if it wishes to present a specific topology to the guest. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-14x86/fpu: Emulate XRSTOR's behavior if the xfeatures PKRU bit is not setKyle Huey2-1/+22
commit d7e5aceace514a2b1b3ca3dc44f93f1704766ca7 upstream. The hardware XRSTOR instruction resets the PKRU register to its hardware init value (namely 0) if the PKRU bit is not set in the xfeatures mask. Emulating that here restores the pre-5.14 behavior for PTRACE_SET_REGSET with NT_X86_XSTATE, and makes sigreturn (which still uses XRSTOR) and ptrace behave identically. KVM has never used XRSTOR and never had this behavior, so KVM opts-out of this emulation by passing a NULL pkru pointer to copy_uabi_to_xstate(). Fixes: e84ba47e313d ("x86/fpu: Hook up PKRU into ptrace()") Signed-off-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20221115230932.7126-6-khuey%40kylehuey.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-14x86/fpu: Allow PKRU to be (once again) written by ptrace.Kyle Huey2-13/+21
commit 4a804c4f8356393d6b5eff7600f07615d7869c13 upstream. Move KVM's PKRU handling code in fpu_copy_uabi_to_guest_fpstate() to copy_uabi_to_xstate() so that it is shared with other APIs that write the XSTATE such as PTRACE_SETREGSET with NT_X86_XSTATE. This restores the pre-5.14 behavior of ptrace. The regression can be seen by running gdb and executing `p $pkru`, `set $pkru = 42`, and `p $pkru`. On affected kernels (5.14+) the write to the PKRU register (which gdb performs through ptrace) is ignored. [ dhansen: removed stable@ tag for now. The ABI was broken for long enough that this is not urgent material. Let's let it stew in tip for a few weeks before it's submitted to stable because there are so many ABIs potentially affected. ] Fixes: e84ba47e313d ("x86/fpu: Hook up PKRU into ptrace()") Signed-off-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20221115230932.7126-5-khuey%40kylehuey.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-14x86/fpu: Add a pkru argument to copy_uabi_to_xstate()Kyle Huey1-3/+13
commit 2c87767c35ee9744f666ccec869d5fe742c3de0a upstream. In preparation for moving PKRU handling code out of fpu_copy_uabi_to_guest_fpstate() and into copy_uabi_to_xstate(), add an argument that copy_uabi_from_kernel_to_xstate() can use to pass the canonical location of the PKRU value. For copy_sigframe_from_user_to_xstate() the kernel will actually restore the PKRU value from the fpstate, but pass in the thread_struct's pkru location anyways for consistency. Signed-off-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20221115230932.7126-4-khuey%40kylehuey.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-14x86/fpu: Add a pkru argument to copy_uabi_from_kernel_to_xstate().Kyle Huey4-4/+4
commit 1c813ce0305571e1b2e4cc4acca451da9e6ad18f upstream. Both KVM (through KVM_SET_XSTATE) and ptrace (through PTRACE_SETREGSET with NT_X86_XSTATE) ultimately call copy_uabi_from_kernel_to_xstate(), but the canonical locations for the current PKRU value for KVM guests and processes in a ptrace stop are different (in the kvm_vcpu_arch and the thread_state structs respectively). In preparation for eventually handling PKRU in copy_uabi_to_xstate, pass in a pointer to the PKRU location. Signed-off-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20221115230932.7126-3-khuey%40kylehuey.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-14x86/fpu: Take task_struct* in copy_sigframe_from_user_to_xstate()Kyle Huey3-4/+4
commit 6a877d2450ace4f27c012519e5a1ae818f931983 upstream. This will allow copy_sigframe_from_user_to_xstate() to grab the address of thread_struct's pkru value in a later patch. Signed-off-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20221115230932.7126-2-khuey%40kylehuey.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12x86/bugs: Flush IBP in ib_prctl_set()Rodrigo Branco1-0/+2
commit a664ec9158eeddd75121d39c9a0758016097fa96 upstream. We missed the window between the TIF flag update and the next reschedule. Signed-off-by: Rodrigo Branco <bsdaemon@google.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12x86/kexec: Fix double-free of elf header bufferTakashi Iwai1-3/+1
commit d00dd2f2645dca04cf399d8fc692f3f69b6dd996 upstream. After b3e34a47f989 ("x86/kexec: fix memory leak of elf header buffer"), freeing image->elf_headers in the error path of crash_load_segments() is not needed because kimage_file_post_load_cleanup() will take care of that later. And not clearing it could result in a double-free. Drop the superfluous vfree() call at the error path of crash_load_segments(). Fixes: b3e34a47f989 ("x86/kexec: fix memory leak of elf header buffer") Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20221122115122.13937-1-tiwai@suse.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07x86/kprobes: Fix optprobe optimization check with CONFIG_RETHUNKMasami Hiramatsu (Google)1-20/+8
commit 63dc6325ff41ee9e570bde705ac34a39c5dbeb44 upstream. Since the CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping speculative execution after function return, kprobe jump optimization always fails on the functions with such INT3 inside the function body. (It already checks the INT3 padding between functions, but not inside the function) To avoid this issue, as same as kprobes, check whether the INT3 comes from kgdb or not, and if so, stop decoding and make it fail. The other INT3 will come from CONFIG_RETHUNK/CONFIG_SLS and those can be treated as a one-byte instruction. Fixes: e463a09af2f0 ("x86: Add straight-line-speculation mitigation") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/167146051929.1374301.7419382929328081706.stgit@devnote3 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07x86/kprobes: Fix kprobes instruction boudary check with CONFIG_RETHUNKMasami Hiramatsu (Google)1-3/+7
commit 1993bf97992df2d560287f3c4120eda57426843d upstream. Since the CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping speculative execution after RET instruction, kprobes always failes to check the probed instruction boundary by decoding the function body if the probed address is after such sequence. (Note that some conditional code blocks will be placed after function return, if compiler decides it is not on the hot path.) This is because kprobes expects kgdb puts the INT3 as a software breakpoint and it will replace the original instruction. But these INT3 are not such purpose, it doesn't need to recover the original instruction. To avoid this issue, kprobes checks whether the INT3 is owned by kgdb or not, and if so, stop decoding and make it fail. The other INT3 will come from CONFIG_RETHUNK/CONFIG_SLS and those can be treated as a one-byte instruction. Fixes: e463a09af2f0 ("x86: Add straight-line-speculation mitigation") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/167146051026.1374301.392728975473572291.stgit@devnote3 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07ftrace/x86: Add back ftrace_expected for ftrace bug reportsSteven Rostedt (Google)1-0/+2
commit fd3dc56253acbe9c641a66d312d8393cd55eb04c upstream. After someone reported a bug report with a failed modification due to the expected value not matching what was found, it came to my attention that the ftrace_expected is no longer set when that happens. This makes for debugging the issue a bit more difficult. Set ftrace_expected to the expected code before calling ftrace_bug, so that it shows what was expected and why it failed. Link: https://lore.kernel.org/all/CA+wXwBQ-VhK+hpBtYtyZP-NiX4g8fqRRWithFOHQW-0coQ3vLg@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20221209105247.01d4e51d@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "x86@kernel.org" <x86@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Fixes: 768ae4406a5c ("x86/ftrace: Use text_poke()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07x86/microcode/intel: Do not retry microcode reloading on the APsAshok Raj1-7/+1
commit be1b670f61443aa5d0d01782e9b8ea0ee825d018 upstream. The retries in load_ucode_intel_ap() were in place to support systems with mixed steppings. Mixed steppings are no longer supported and there is only one microcode image at a time. Any retries will simply reattempt to apply the same image over and over without making progress. [ bp: Zap the circumstantial reasoning from the commit message. ] Fixes: 06b8534cb728 ("x86/microcode: Rework microcode loading") Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221129210832.107850-3-ashok.raj@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1Sean Christopherson1-1/+2
commit 31de69f4eea77b28a9724b3fa55aae104fc91fc7 upstream. Set ENABLE_USR_WAIT_PAUSE in KVM's supported VMX MSR configuration if the feature is supported in hardware and enabled in KVM's base, non-nested configuration, i.e. expose ENABLE_USR_WAIT_PAUSE to L1 if it's supported. This fixes a bug where saving/restoring, i.e. migrating, a vCPU will fail if WAITPKG (the associated CPUID feature) is enabled for the vCPU, and obviously allows L1 to enable the feature for L2. KVM already effectively exposes ENABLE_USR_WAIT_PAUSE to L1 by stuffing the allowed-1 control ina vCPU's virtual MSR_IA32_VMX_PROCBASED_CTLS2 when updating secondary controls in response to KVM_SET_CPUID(2), but (a) that depends on flawed code (KVM shouldn't touch VMX MSRs in response to CPUID updates) and (b) runs afoul of vmx_restore_control_msr()'s restriction that the guest value must be a strict subset of the supported host value. Although no past commit explicitly enabled nested support for WAITPKG, doing so is safe and functionally correct from an architectural perspective as no additional KVM support is needed to virtualize TPAUSE, UMONITOR, and UMWAIT for L2 relative to L1, and KVM already forwards VM-Exits to L1 as necessary (commit bf653b78f960, "KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit"). Note, KVM always keeps the hosts MSR_IA32_UMWAIT_CONTROL resident in hardware, i.e. always runs both L1 and L2 with the host's power management settings for TPAUSE and UMWAIT. See commit bf09fb6cba4f ("KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL") for more details. Fixes: e69e72faa3a0 ("KVM: x86: Add support for user wait instructions") Cc: stable@vger.kernel.org Reported-by: Aaron Lewis <aaronlewis@google.com> Reported-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20221213062306.667649-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itselfYuan ZhaoXiong1-2/+3
commit ef40757743b47cc95de9b4ed41525c94f8dc73d9 upstream. When a VM reboots itself, the reset process will result in an ioctl(KVM_SET_LAPIC, ...) to disable x2APIC mode and set the xAPIC id of the vCPU to its default value, which is the vCPU id. That will be handled in KVM as follows: kvm_vcpu_ioctl_set_lapic kvm_apic_set_state kvm_lapic_set_base => disable X2APIC mode kvm_apic_state_fixup kvm_lapic_xapic_id_updated kvm_xapic_id(apic) != apic->vcpu->vcpu_id kvm_set_apicv_inhibit(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) memcpy(vcpu->arch.apic->regs, s->regs, sizeof(*s)) => update APIC_ID When kvm_apic_set_state invokes kvm_lapic_set_base to disable x2APIC mode, the old 32-bit x2APIC id is still present rather than the 8-bit xAPIC id. kvm_lapic_xapic_id_updated will set the APICV_INHIBIT_REASON_APIC_ID_MODIFIED bit and disable APICv/x2AVIC. Instead, kvm_lapic_xapic_id_updated must be called after APIC_ID is changed. In fact, this fixes another small issue in the code in that potential changes to a vCPU's xAPIC ID need not be tracked for KVM_GET_LAPIC. Fixes: 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base") Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com> Message-Id: <1669984574-32692-1-git-send-email-yuanzhaoxiong@baidu.com> Cc: stable@vger.kernel.org Reported-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07KVM: nVMX: Inject #GP, not #UD, if "generic" VMXON CR0/CR4 check failsSean Christopherson1-11/+33
commit 9cc409325ddd776f6fd6293d5ce93ce1248af6e4 upstream. Inject #GP for if VMXON is attempting with a CR0/CR4 that fails the generic "is CRx valid" check, but passes the CR4.VMXE check, and do the generic checks _after_ handling the post-VMXON VM-Fail. The CR4.VMXE check, and all other #UD cases, are special pre-conditions that are enforced prior to pivoting on the current VMX mode, i.e. occur before interception if VMXON is attempted in VMX non-root mode. All other CR0/CR4 checks generate #GP and effectively have lower priority than the post-VMXON check. Per the SDM: IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ... THEN #UD; ELSIF not in VMX operation THEN IF (CPL > 0) or (in A20M mode) or (the values of CR0 and CR4 are not supported in VMX operation) THEN #GP(0); ELSIF in VMX non-root operation THEN VMexit; ELSIF CPL > 0 THEN #GP(0); ELSE VMfail("VMXON executed in VMX root operation"); FI; which, if re-written without ELSIF, yields: IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ... THEN #UD IF in VMX non-root operation THEN VMexit; IF CPL > 0 THEN #GP(0) IF in VMX operation THEN VMfail("VMXON executed in VMX root operation"); IF (in A20M mode) or (the values of CR0 and CR4 are not supported in VMX operation) THEN #GP(0); Note, KVM unconditionally forwards VMXON VM-Exits that occur in L2 to L1, i.e. there is no need to check the vCPU is not in VMX non-root mode. Add a comment to explain why unconditionally forwarding such exits is functionally correct. Reported-by: Eric Li <ercli@ucdavis.edu> Fixes: c7d855c2aff2 ("KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20221006001956.329314-1-seanjc@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07KVM: VMX: Resume guest immediately when injecting #GP on ECREATESean Christopherson1-1/+3
commit eb3992e833d3a17f9b0a3e0371d0b1d3d566f740 upstream. Resume the guest immediately when injecting a #GP on ECREATE due to an invalid enclave size, i.e. don't attempt ECREATE in the host. The #GP is a terminal fault, e.g. skipping the instruction if ECREATE is successful would result in KVM injecting #GP on the instruction following ECREATE. Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions") Cc: stable@vger.kernel.org Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20220930233132.1723330-1-seanjc@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07x86/MCE/AMD: Clear DFR errors found in THR handlerYazen Ghannam1-13/+20
commit bc1b705b0eee4c645ad8b3bbff3c8a66e9688362 upstream. AMD's MCA Thresholding feature counts errors of all severity levels, not just correctable errors. If a deferred error causes the threshold limit to be reached (it was the error that caused the overflow), then both a deferred error interrupt and a thresholding interrupt will be triggered. The order of the interrupts is not guaranteed. If the threshold interrupt handler is executed first, then it will clear MCA_STATUS for the error. It will not check or clear MCA_DESTAT which also holds a copy of the deferred error. When the deferred error interrupt handler runs it will not find an error in MCA_STATUS, but it will find the error in MCA_DESTAT. This will cause two errors to be logged. Check for deferred errors when handling a threshold interrupt. If a bank contains a deferred error, then clear the bank's MCA_DESTAT register. Define a new helper function to do the deferred error check and clearing of MCA_DESTAT. [ bp: Simplify, convert comment to passive voice. ] Fixes: 37d43acfd79f ("x86/mce/AMD: Redo error logging from APIC LVT interrupt handlers") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220621155943.33623-1-yazen.ghannam@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07x86/fpu/xstate: Fix XSTATE_WARN_ON() to emit relevant diagnosticsAndrew Cooper1-6/+6
commit 48280042f2c6e3ac2cfb1d8b752ab4a7e0baea24 upstream. "XSAVE consistency problem" has been reported under Xen, but that's the extent of my divination skills. Modify XSTATE_WARN_ON() to force the caller to provide relevant diagnostic information, and modify each caller suitably. For check_xstate_against_struct(), this removes a double WARN() where one will do perfectly fine. CC stable as this has been wonky debugging for 7 years and it is good to have there too. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220810221909.12768-1-andrew.cooper3@citrix.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07perf/x86/intel/uncore: Clear attr_update properlyAlexander Antonov1-1/+16
commit 6532783310e2b2f50dc13f46c49aa6546cb6e7a3 upstream. Current clear_attr_update procedure in pmu_set_mapping() sets attr_update field in NULL that is not correct because intel_uncore_type pmu types can contain several groups in attr_update field. For example, SPR platform already has uncore_alias_group to update and then UPI topology group will be added in next patches. Fix current behavior and clear attr_update group related to mapping only. Fixes: bb42b3d39781 ("perf/x86/intel/uncore: Expose an Uncore unit to IIO PMON mapping") Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221117122833.3103580-4-alexander.antonov@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07perf/x86/intel/uncore: Disable I/O stacks to PMU mapping on ICX-DAlexander Antonov2-0/+6
commit efe062705d149b20a15498cb999a9edbb8241e6f upstream. Current implementation of I/O stacks to PMU mapping doesn't support ICX-D. Detect ICX-D system to disable mapping. Fixes: 10337e95e04c ("perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on ICX") Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221117122833.3103580-5-alexander.antonov@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-31x86/apic: Handle no CONFIG_X86_X2APIC on systems with x2APIC enabled by BIOSMateusz Jończyk3-9/+11
[ Upstream commit e3998434da4f5b1f57f8d6a8a9f8502ee3723bae ] A kernel that was compiled without CONFIG_X86_X2APIC was unable to boot on platforms that have x2APIC already enabled in the BIOS before starting the kernel. The kernel was supposed to panic with an approprite error message in validate_x2apic() due to the missing X2APIC support. However, validate_x2apic() was run too late in the boot cycle, and the kernel tried to initialize the APIC nonetheless. This resulted in an earlier panic in setup_local_APIC() because the APIC was not registered. In my experiments, a panic message in setup_local_APIC() was not visible in the graphical console, which resulted in a hang with no indication what has gone wrong. Instead of calling panic(), disable the APIC, which results in a somewhat working system with the PIC only (and no SMP). This way the user is able to diagnose the problem more easily. Disabling X2APIC mode is not an option because it's impossible on systems with locked x2APIC. The proper place to disable the APIC in this case is in check_x2apic(), which is called early from setup_arch(). Doing this in __apic_intr_mode_select() is too late. Make check_x2apic() unconditionally available and remove the empty stub. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Reported-by: Robert Elliott (Servers) <elliott@hpe.com> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/d573ba1c-0dc4-3016-712a-cc23a8a33d42@molgen.mpg.de Link: https://lore.kernel.org/lkml/20220911084711.13694-3-mat.jonczyk@o2.pl Link: https://lore.kernel.org/all/20221129215008.7247-1-mat.jonczyk@o2.pl Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31x86/hyperv: Remove unregister syscore call from Hyper-V cleanupGaurav Kohli1-2/+0
[ Upstream commit 32c97d980e2eef25465d453f2956a9ca68926a3c ] Hyper-V cleanup code comes under panic path where preemption and irq is already disabled. So calling of unregister_syscore_ops might schedule out the thread even for the case where mutex lock is free. hyperv_cleanup unregister_syscore_ops mutex_lock(&syscore_ops_lock) might_sleep Here might_sleep might schedule out this thread, where voluntary preemption config is on and this thread will never comes back. And also this was added earlier to maintain the symmetry which is not required as this can comes during crash shutdown path only. To prevent the same, removing unregister_syscore_ops function call. Signed-off-by: Gaurav Kohli <gauravkohli@linux.microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/1669443291-2575-1-git-send-email-gauravkohli@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31crypto: x86/sm4 - fix crash with CFI enabledEric Biggers2-6/+8
[ Upstream commit 2d203c46a0fa5df0785383b13b722483e1fd27a8 ] sm4_aesni_avx_ctr_enc_blk8(), sm4_aesni_avx_cbc_dec_blk8(), sm4_aesni_avx_cfb_dec_blk8(), sm4_aesni_avx2_ctr_enc_blk16(), sm4_aesni_avx2_cbc_dec_blk16(), and sm4_aesni_avx2_cfb_dec_blk16() are called via indirect function calls. Therefore they need to use SYM_TYPED_FUNC_START instead of SYM_FUNC_START to cause their type hashes to be emitted when the kernel is built with CONFIG_CFI_CLANG=y. Otherwise, the code crashes with a CFI failure. (Or at least that should be the case. For some reason the CFI checks in sm4_avx_cbc_decrypt(), sm4_avx_cfb_decrypt(), and sm4_avx_ctr_crypt() are not always being generated, using current tip-of-tree clang. Anyway, this patch is a good idea anyway.) Fixes: ccace936eec7 ("x86: Add types to indirectly called assembly functions") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31crypto: x86/sm3 - fix possible crash with CFI enabledEric Biggers1-1/+2
[ Upstream commit 8ba490d9f5a56f52091644325a32d3f71a982776 ] sm3_transform_avx() is called via indirect function calls. Therefore it needs to use SYM_TYPED_FUNC_START instead of SYM_FUNC_START to cause its type hash to be emitted when the kernel is built with CONFIG_CFI_CLANG=y. Otherwise, the code crashes with a CFI failure (if the compiler didn't happen to optimize out the indirect call). Fixes: ccace936eec7 ("x86: Add types to indirectly called assembly functions") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31crypto: x86/sha512 - fix possible crash with CFI enabledEric Biggers3-3/+6
[ Upstream commit a1d72fa33186ac69c7d8120c71f41ea4fc23dcc9 ] sha512_transform_ssse3(), sha512_transform_avx(), and sha512_transform_rorx() are called via indirect function calls. Therefore they need to use SYM_TYPED_FUNC_START instead of SYM_FUNC_START to cause their type hashes to be emitted when the kernel is built with CONFIG_CFI_CLANG=y. Otherwise, the code crashes with a CFI failure (if the compiler didn't happen to optimize out the indirect calls). Fixes: ccace936eec7 ("x86: Add types to indirectly called assembly functions") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31crypto: x86/sha256 - fix possible crash with CFI enabledEric Biggers4-4/+8
[ Upstream commit 19940ebbb59c12146d05c5f8acd873197b290648 ] sha256_transform_ssse3(), sha256_transform_avx(), sha256_transform_rorx(), and sha256_ni_transform() are called via indirect function calls. Therefore they need to use SYM_TYPED_FUNC_START instead of SYM_FUNC_START to cause their type hashes to be emitted when the kernel is built with CONFIG_CFI_CLANG=y. Otherwise, the code crashes with a CFI failure (if the compiler didn't happen to optimize out the indirect calls). Fixes: ccace936eec7 ("x86: Add types to indirectly called assembly functions") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31crypto: x86/sha1 - fix possible crash with CFI enabledEric Biggers2-2/+4
[ Upstream commit 32f34bf7e44eeaa241fb845d6f52af5104bc30fd ] sha1_transform_ssse3(), sha1_transform_avx(), and sha1_ni_transform() (but not sha1_transform_avx2()) are called via indirect function calls. Therefore they need to use SYM_TYPED_FUNC_START instead of SYM_FUNC_START to cause their type hashes to be emitted when the kernel is built with CONFIG_CFI_CLANG=y. Otherwise, the code crashes with a CFI failure (if the compiler didn't happen to optimize out the indirect calls). Fixes: ccace936eec7 ("x86: Add types to indirectly called assembly functions") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31crypto: x86/aria - fix crash with CFI enabledEric Biggers1-6/+7
[ Upstream commit c67b553a4f4a8bd921e4c9ceae00e111be09c488 ] aria_aesni_avx_encrypt_16way(), aria_aesni_avx_decrypt_16way(), aria_aesni_avx_ctr_crypt_16way(), aria_aesni_avx_gfni_encrypt_16way(), aria_aesni_avx_gfni_decrypt_16way(), and aria_aesni_avx_gfni_ctr_crypt_16way() are called via indirect function calls. Therefore they need to use SYM_TYPED_FUNC_START instead of SYM_FUNC_START to cause their type hashes to be emitted when the kernel is built with CONFIG_CFI_CLANG=y. Otherwise, the code crashes with a CFI failure. Fixes: ccace936eec7 ("x86: Add types to indirectly called assembly functions") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31crypto: x86/aegis128 - fix possible crash with CFI enabledEric Biggers1-4/+5
[ Upstream commit 8bd9974b6bfcd1e14a001deeca051aed7295559a ] crypto_aegis128_aesni_enc(), crypto_aegis128_aesni_enc_tail(), crypto_aegis128_aesni_dec(), and crypto_aegis128_aesni_dec_tail() are called via indirect function calls. Therefore they need to use SYM_TYPED_FUNC_START instead of SYM_FUNC_START to cause their type hashes to be emitted when the kernel is built with CONFIG_CFI_CLANG=y. Otherwise, the code crashes with a CFI failure (if the compiler didn't happen to optimize out the indirect calls). Fixes: ccace936eec7 ("x86: Add types to indirectly called assembly functions") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31x86/boot: Skip realmode init code when running as Xen PV guestJuergen Gross6-3/+17
[ Upstream commit f1e525009493cbd569e7c8dd7d58157855f8658d ] When running as a Xen PV guest there is no need for setting up the realmode trampoline, as realmode isn't supported in this environment. Trying to setup the trampoline has been proven to be problematic in some cases, especially when trying to debug early boot problems with Xen requiring to keep the EFI boot-services memory mapped (some firmware variants seem to claim basically all memory below 1Mb for boot services). Introduce new x86_platform_ops operations for that purpose, which can be set to a NOP by the Xen PV specific kernel boot code. [ bp: s/call_init_real_mode/do_init_real_mode/ ] Fixes: 084ee1c641a0 ("x86, realmode: Relocator for realmode code") Suggested-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221123114523.3467-1-jgross@suse.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31x86/xen: Fix memory leak in xen_init_lock_cpu()Xiu Jianfeng1-3/+3
[ Upstream commit ca84ce153d887b1dc8b118029976cc9faf2a9b40 ] In xen_init_lock_cpu(), the @name has allocated new string by kasprintf(), if bind_ipi_to_irqhandler() fails, it should be freed, otherwise may lead to a memory leak issue, fix it. Fixes: 2d9e1e2f58b5 ("xen: implement Xen-specific spinlocks") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20221123155858.11382-3-xiujianfeng@huawei.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31x86/xen: Fix memory leak in xen_smp_intr_init{_pv}()Xiu Jianfeng2-18/+18
[ Upstream commit 69143f60868b3939ddc89289b29db593b647295e ] These local variables @{resched|pmu|callfunc...}_name saves the new string allocated by kasprintf(), and when bind_{v}ipi_to_irqhandler() fails, it goes to the @fail tag, and calls xen_smp_intr_free{_pv}() to free resource, however the new string is not saved, which cause a memory leak issue. fix it. Fixes: 9702785a747a ("i386: move xen") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20221123155858.11382-2-xiujianfeng@huawei.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31uprobes/x86: Allow to probe a NOP instruction with 0x66 prefixOleg Nesterov1-1/+3
[ Upstream commit cefa72129e45313655d53a065b8055aaeb01a0c9 ] Intel ICC -hotpatch inserts 2-byte "0x66 0x90" NOP at the start of each function to reserve extra space for hot-patching, and currently it is not possible to probe these functions because branch_setup_xol_ops() wrongly rejects NOP with REP prefix as it treats them like word-sized branch instructions. Fixes: 250bbd12c2fe ("uprobes/x86: Refuse to attach uprobe to "word-sized" branch insns") Reported-by: Seiji Nishikawa <snishika@redhat.com> Suggested-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20221204173933.GA31544@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31perf/x86/intel/uncore: Fix reference count leak in __uncore_imc_init_box()Xiongfeng Wang1-0/+3
[ Upstream commit 17b8d847b92d815d1638f0de154654081d66b281 ] pci_get_device() will increase the reference count for the returned pci_dev, so tgl_uncore_get_mc_dev() will return a pci_dev with its reference count increased. We need to call pci_dev_put() to decrease the reference count before exiting from __uncore_imc_init_box(). Add pci_dev_put() for both normal and error path. Fixes: fdb64822443e ("perf/x86: Add Intel Tiger Lake uncore support") Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20221118063137.121512-5-wangxiongfeng2@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31perf/x86/intel/uncore: Fix reference count leak in snr_uncore_mmio_map()Xiongfeng Wang1-0/+2
[ Upstream commit 8ebd16c11c346751b3944d708e6c181ed4746c39 ] pci_get_device() will increase the reference count for the returned pci_dev, so snr_uncore_get_mc_dev() will return a pci_dev with its reference count increased. We need to call pci_dev_put() to decrease the reference count. Let's add the missing pci_dev_put(). Fixes: ee49532b38dd ("perf/x86/intel/uncore: Add IMC uncore support for Snow Ridge") Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20221118063137.121512-4-wangxiongfeng2@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31perf/x86/intel/uncore: Fix reference count leak in hswep_has_limit_sbox()Xiongfeng Wang1-0/+1
[ Upstream commit 1ff9dd6e7071a561f803135c1d684b13c7a7d01d ] pci_get_device() will increase the reference count for the returned 'dev'. We need to call pci_dev_put() to decrease the reference count. Since 'dev' is only used in pci_read_config_dword(), let's add pci_dev_put() right after it. Fixes: 9d480158ee86 ("perf/x86/intel/uncore: Remove uncore extra PCI dev HSWEP_PCI_PCU_3") Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20221118063137.121512-3-wangxiongfeng2@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31perf/x86/intel/uncore: Fix reference count leak in sad_cfg_iio_topology()Xiongfeng Wang1-0/+2
[ Upstream commit c508eb042d9739bf9473526f53303721b70e9100 ] pci_get_device() will increase the reference count for the returned pci_dev, and also decrease the reference count for the input parameter *from* if it is not NULL. If we break the loop in sad_cfg_iio_topology() with 'dev' not NULL. We need to call pci_dev_put() to decrease the reference count. Since pci_dev_put() can handle the NULL input parameter, we can just add one pci_dev_put() right before 'return ret'. Fixes: c1777be3646b ("perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on SNR") Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20221118063137.121512-2-wangxiongfeng2@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31x86/split_lock: Add sysctl to control the misery modeGuilherme G. Piccoli1-10/+53
[ Upstream commit 727209376f4998bc84db1d5d8af15afea846a92b ] Commit b041b525dab9 ("x86/split_lock: Make life miserable for split lockers") changed the way the split lock detector works when in "warn" mode; basically, it not only shows the warn message, but also intentionally introduces a slowdown through sleeping plus serialization mechanism on such task. Based on discussions in [0], seems the warning alone wasn't enough motivation for userspace developers to fix their applications. This slowdown is enough to totally break some proprietary (aka. unfixable) userspace[1]. Happens that originally the proposal in [0] was to add a new mode which would warns + slowdown the "split locking" task, keeping the old warn mode untouched. In the end, that idea was discarded and the regular/default "warn" mode now slows down the applications. This is quite aggressive with regards proprietary/legacy programs that basically are unable to properly run in kernel with this change. While it is understandable that a malicious application could DoS by split locking, it seems unacceptable to regress old/proprietary userspace programs through a default configuration that previously worked. An example of such breakage was reported in [1]. Add a sysctl to allow controlling the "misery mode" behavior, as per Thomas suggestion on [2]. This way, users running legacy and/or proprietary software are allowed to still execute them with a decent performance while still observing the warning messages on kernel log. [0] https://lore.kernel.org/lkml/20220217012721.9694-1-tony.luck@intel.com/ [1] https://github.com/doitsujin/dxvk/issues/2938 [2] https://lore.kernel.org/lkml/87pmf4bter.ffs@tglx/ [ dhansen: minor changelog tweaks, including clarifying the actual problem ] Fixes: b041b525dab9 ("x86/split_lock: Make life miserable for split lockers") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Tested-by: Andre Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/all/20221024200254.635256-1-gpiccoli%40igalia.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31x86/sgx: Reduce delay and interference of enclave releaseReinette Chatre1-4/+19
[ Upstream commit 7b72c823ddf8aaaec4e9fb28e6fbe4d511e7dad1 ] commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") introduced a cond_resched() during enclave release where the EREMOVE instruction is applied to every 4k enclave page. Giving other tasks an opportunity to run while tearing down a large enclave placates the soft lockup detector but Iqbal found that the fix causes a 25% performance degradation of a workload run using Gramine. Gramine maintains a 1:1 mapping between processes and SGX enclaves. That means if a workload in an enclave creates a subprocess then Gramine creates a duplicate enclave for that subprocess to run in. The consequence is that the release of the enclave used to run the subprocess can impact the performance of the workload that is run in the original enclave, especially in large enclaves when SGX2 is not in use. The workload run by Iqbal behaves as follows: Create enclave (enclave "A") /* Initialize workload in enclave "A" */ Create enclave (enclave "B") /* Run subprocess in enclave "B" and send result to enclave "A" */ Release enclave (enclave "B") /* Run workload in enclave "A" */ Release enclave (enclave "A") The performance impact of releasing enclave "B" in the above scenario is amplified when there is a lot of SGX memory and the enclave size matches the SGX memory. When there is 128GB SGX memory and an enclave size of 128GB, from the time enclave "B" starts the 128GB SGX memory is oversubscribed with a combined demand for 256GB from the two enclaves. Before commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") enclave release was done in a tight loop without giving other tasks a chance to run. Even though the system experienced soft lockups the workload (run in enclave "A") obtained good performance numbers because when the workload started running there was no interference. Commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") gave other tasks opportunity to run while an enclave is released. The impact of this in this scenario is that while enclave "B" is released and needing to access each page that belongs to it in order to run the SGX EREMOVE instruction on it, enclave "A" is attempting to run the workload needing to access the enclave pages that belong to it. This causes a lot of swapping due to the demand for the oversubscribed SGX memory. Longer latencies are experienced by the workload in enclave "A" while enclave "B" is released. Improve the performance of enclave release while still avoiding the soft lockup detector with two enhancements: - Only call cond_resched() after XA_CHECK_SCHED iterations. - Use the xarray advanced API to keep the xarray locked for XA_CHECK_SCHED iterations instead of locking and unlocking at every iteration. This batching solution is copied from sgx_encl_may_map() that also iterates through all enclave pages using this technique. With this enhancement the workload experiences a 5% performance degradation when compared to a kernel without commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves"), an improvement to the reported 25% degradation, while still placating the soft lockup detector. Scenarios with poor performance are still possible even with these enhancements. For example, short workloads creating sub processes while running in large enclaves. Further performance improvements are pursued in user space through avoiding to create duplicate enclaves for certain sub processes, and using SGX2 that will do lazy allocation of pages as needed so enclaves created for sub processes start quickly and release quickly. Fixes: 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") Reported-by: Md Iqbal Hossain <md.iqbal.hossain@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Md Iqbal Hossain <md.iqbal.hossain@intel.com> Link: https://lore.kernel.org/all/00efa80dd9e35dc85753e1c5edb0344ac07bb1f0.1667236485.git.reinette.chatre%40intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-21x86/vdso: Conditionally export __vdso_sgx_enter_enclave()Nathan Chancellor1-0/+2
commit 45be2ad007a9c6bea70249c4cf3e4905afe4caeb upstream. Recently, ld.lld moved from '--undefined-version' to '--no-undefined-version' as the default, which breaks building the vDSO when CONFIG_X86_SGX is not set: ld.lld: error: version script assignment of 'LINUX_2.6' to symbol '__vdso_sgx_enter_enclave' failed: symbol not defined __vdso_sgx_enter_enclave is only included in the vDSO when CONFIG_X86_SGX is set. Only export it if it will be present in the final object, which clears up the error. Fixes: 8466436952017 ("x86/vdso: Implement a vDSO for Intel SGX enclave call") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://github.com/ClangBuiltLinux/linux/issues/1756 Link: https://lore.kernel.org/r/20221109000306.1407357-1-nathan@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-06Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-1/+1
Pull kvm fixes from Paolo Bonzini: "Unless anything comes from the ARM side, this should be the last pull request for this release - and it's mostly documentation: - Document the interaction between KVM_CAP_HALT_POLL and halt_poll_ns - s390: fix multi-epoch extension in nested guests - x86: fix uninitialized variable on nested triple fault" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: Document the interaction between KVM_CAP_HALT_POLL and halt_poll_ns KVM: Move halt-polling documentation into common directory KVM: x86: fix uninitialized variable use on KVM_REQ_TRIPLE_FAULT KVM: s390: vsie: Fix the initialization of the epoch extension (epdx) field
2022-12-03x86/bugs: Make sure MSR_SPEC_CTRL is updated properly upon resume from S3Pawan Gupta3-9/+16
The "force" argument to write_spec_ctrl_current() is currently ambiguous as it does not guarantee the MSR write. This is due to the optimization that writes to the MSR happen only when the new value differs from the cached value. This is fine in most cases, but breaks for S3 resume when the cached MSR value gets out of sync with the hardware MSR value due to S3 resetting it. When x86_spec_ctrl_current is same as x86_spec_ctrl_base, the MSR write is skipped. Which results in SPEC_CTRL mitigations not getting restored. Move the MSR write from write_spec_ctrl_current() to a new function that unconditionally writes to the MSR. Update the callers accordingly and rename functions. [ bp: Rework a bit. ] Fixes: caa0ff24d5d0 ("x86/bugs: Keep a per-CPU IA32_SPEC_CTRL value") Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/806d39b0bfec2fe8f50dc5446dff20f5bb24a959.1669821572.git.pawan.kumar.gupta@linux.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-03Merge tag 'mm-hotfixes-stable-2022-12-02' of ↵Linus Torvalds1-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "15 hotfixes, 11 marked cc:stable. Only three or four of the latter address post-6.0 issues, which is hopefully a sign that things are converging" * tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: revert "kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible" Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths mm/khugepaged: fix GUP-fast interaction by sending IPI mm/khugepaged: take the right locks for page table retraction mm: migrate: fix THP's mapcount on isolation mm: introduce arch_has_hw_nonleaf_pmd_young() mm: add dummy pmd_young() for architectures not having it mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes() tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep" nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry() hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing madvise: use zap_page_range_single for madvise dontneed mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
2022-12-01mm: introduce arch_has_hw_nonleaf_pmd_young()Juergen Gross1-0/+8
When running as a Xen PV guests commit eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") can cause a protection violation in pmdp_test_and_clear_young(): BUG: unable to handle page fault for address: ffff8880083374d0 #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation PGD 3026067 P4D 3026067 PUD 3027067 PMD 7fee5067 PTE 8010000008337065 Oops: 0003 [#1] PREEMPT SMP NOPTI CPU: 7 PID: 158 Comm: kswapd0 Not tainted 6.1.0-rc5-20221118-doflr+ #1 RIP: e030:pmdp_test_and_clear_young+0x25/0x40 This happens because the Xen hypervisor can't emulate direct writes to page table entries other than PTEs. This can easily be fixed by introducing arch_has_hw_nonleaf_pmd_young() similar to arch_has_hw_pte_young() and test that instead of CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG. Link: https://lkml.kernel.org/r/20221123064510.16225-1-jgross@suse.com Fixes: eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") Signed-off-by: Juergen Gross <jgross@suse.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: Yu Zhao <yuzhao@google.com> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: David Hildenbrand <david@redhat.com> [core changes] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-01mm: add dummy pmd_young() for architectures not having itJuergen Gross1-0/+1
In order to avoid #ifdeffery add a dummy pmd_young() implementation as a fallback. This is required for the later patch "mm: introduce arch_has_hw_nonleaf_pmd_young()". Link: https://lkml.kernel.org/r/fd3ac3cd-7349-6bbd-890a-71a9454ca0b3@suse.com Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Sander Eikelenboom <linux@eikelenboom.it> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>