summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2023-08-25x86/crash: optimize CPU changesEric DeVolder1-0/+10
crash_prepare_elf64_headers() writes into the elfcorehdr an ELF PT_NOTE for all possible CPUs. As such, subsequent changes to CPUs (ie. hot un/plug, online/offline) do not need to rewrite the elfcorehdr. The kimage->file_mode term covers kdump images loaded via the kexec_file_load() syscall. Since crash_prepare_elf64_headers() wrote the initial elfcorehdr, no update to the elfcorehdr is needed for CPU changes. The kimage->elfcorehdr_updated term covers kdump images loaded via the kexec_load() syscall. At least one memory or CPU change must occur to cause crash_prepare_elf64_headers() to rewrite the elfcorehdr. Afterwards, no update to the elfcorehdr is needed for CPU changes. This code is intentionally *NOT* hoisted into crash_handle_hotplug_event() as it would prevent the arch-specific handler from running for CPU changes. This would break PPC, for example, which needs to update other information besides the elfcorehdr, on CPU changes. Link: https://lkml.kernel.org/r/20230814214446.6659-9-eric.devolder@oracle.com Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Akhil Raj <lf32.dev@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-25crash: hotplug support for kexec_load()Eric DeVolder2-4/+34
The hotplug support for kexec_load() requires changes to the userspace kexec-tools and a little extra help from the kernel. Given a kdump capture kernel loaded via kexec_load(), and a subsequent hotplug event, the crash hotplug handler finds the elfcorehdr and rewrites it to reflect the hotplug change. That is the desired outcome, however, at kernel panic time, the purgatory integrity check fails (because the elfcorehdr changed), and the capture kernel does not boot and no vmcore is generated. Therefore, the userspace kexec-tools/kexec must indicate to the kernel that the elfcorehdr can be modified (because the kexec excluded the elfcorehdr from the digest, and sized the elfcorehdr memory buffer appropriately). To facilitate hotplug support with kexec_load(): - a new kexec flag KEXEC_UPATE_ELFCOREHDR indicates that it is safe for the kernel to modify the kexec_load()'d elfcorehdr - the /sys/kernel/crash_elfcorehdr_size node communicates the preferred size of the elfcorehdr memory buffer - The sysfs crash_hotplug nodes (ie. /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically take into account kexec_file_load() vs kexec_load() and KEXEC_UPDATE_ELFCOREHDR. This is critical so that the udev rule processing of crash_hotplug is all that is needed to determine if the userspace unload-then-load of the kdump image is to be skipped, or not. The proposed udev rule change looks like: # The kernel updates the crash elfcorehdr for CPU and memory changes SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" The table below indicates the behavior of kexec_load()'d kdump image updates (with the new udev crash_hotplug rule in place): Kernel |Kexec -------+-----+---- Old |Old |New | a | a -------+-----+---- New | a | b -------+-----+---- where kexec 'old' and 'new' delineate kexec-tools has the needed modifications for the crash hotplug feature, and kernel 'old' and 'new' delineate the kernel supports this crash hotplug feature. Behavior 'a' indicates the unload-then-reload of the entire kdump image. For the kexec 'old' column, the unload-then-reload occurs due to the missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with 'new' kexec) does not present the crash_hotplug sysfs node, which leads to the unload-then-reload of the kdump image. Behavior 'b' indicates the desired optimized behavior of the kernel directly modifying the elfcorehdr and avoiding the unload-then-reload of the kdump image. If the udev rule is not updated with crash_hotplug node check, then no matter any combination of kernel or kexec is new or old, the kdump image continues to be unload-then-reload on hotplug changes. To fully support crash hotplug feature, there needs to be a rollout of kernel, kexec-tools and udev rule changes. However, the order of the rollout of these pieces does not matter; kexec_load()'d kdump images still function for hotplug as-is. Link: https://lkml.kernel.org/r/20230814214446.6659-7-eric.devolder@oracle.com Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Suggested-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Akhil Raj <lf32.dev@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-25x86/crash: add x86 crash hotplug supportEric DeVolder3-7/+116
When CPU or memory is hot un/plugged, or off/onlined, the crash elfcorehdr, which describes the CPUs and memory in the system, must also be updated. A new elfcorehdr is generated from the available CPUs and memory and replaces the existing elfcorehdr. The segment containing the elfcorehdr is identified at run-time in crash_core:crash_handle_hotplug_event(). No modifications to purgatory (see 'kexec: exclude elfcorehdr from the segment digest') or boot_params (as the elfcorehdr= capture kernel command line parameter pointer remains unchanged and correct) are needed, just elfcorehdr. For kexec_file_load(), the elfcorehdr segment size is based on NR_CPUS and CRASH_MAX_MEMORY_RANGES in order to accommodate a growing number of CPU and memory resources. For kexec_load(), the userspace kexec utility needs to size the elfcorehdr segment in the same/similar manner. To accommodate kexec_load() syscall in the absence of kexec_file_load() syscall support, prepare_elf_headers() and dependents are moved outside of CONFIG_KEXEC_FILE. [eric.devolder@oracle.com: correct unused function build error] Link: https://lkml.kernel.org/r/20230821182644.2143-1-eric.devolder@oracle.com Link: https://lkml.kernel.org/r/20230814214446.6659-6-eric.devolder@oracle.com Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Akhil Raj <lf32.dev@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21lockdep: fix static memory detection even moreHelge Deller1-18/+0
On the parisc architecture, lockdep reports for all static objects which are in the __initdata section (e.g. "setup_done" in devtmpfs, "kthreadd_done" in init/main.c) this warning: INFO: trying to register non-static key. The warning itself is wrong, because those objects are in the __initdata section, but the section itself is on parisc outside of range from _stext to _end, which is why the static_obj() functions returns a wrong answer. While fixing this issue, I noticed that the whole existing check can be simplified a lot. Instead of checking against the _stext and _end symbols (which include code areas too) just check for the .data and .bss segments (since we check a data object). This can be done with the existing is_kernel_core_data() macro. In addition objects in the __initdata section can be checked with init_section_contains(), and is_kernel_rodata() allows keys to be in the _ro_after_init section. This partly reverts and simplifies commit bac59d18c701 ("x86/setup: Fix static memory detection"). Link: https://lkml.kernel.org/r/ZNqrLRaOi/3wPAdp@p100 Fixes: bac59d18c701 ("x86/setup: Fix static memory detection") Signed-off-by: Helge Deller <deller@gmx.de> Cc: Borislav Petkov <bp@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18x86/kernel: increase kcov coverage under arch/x86/kernel folderPengfei Xu1-5/+4
Currently kcov instrument is disabled for object files under arch/x86/kernel folder. For object files under arch/x86/kernel, actually just disabling the kcov instrument of files:"head32.o or head64.o and sev.o" could achieve successful booting and provide kcov coverage for object files that do not disable kcov instrument. The additional kcov coverage collected from arch/x86/kernel folder helps kernel fuzzing efforts to find bugs. Link to related improvement discussion is below: https://groups.google.com/g/syzkaller/c/Dsl-RYGCqs8/m/x-tfpTyFBAAJ Related ticket is as follow: https://bugzilla.kernel.org/show_bug.cgi?id=198443 Link: https://lkml.kernel.org/r/06c0bb7b5f61e5884bf31180e8c122648c752010.1690771380.git.pengfei.xu@intel.com Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Pengfei Xu <pengfei.xu@intel.com> Cc: Aleksandr Nogikh <nogikh@google.com> Cc: <heng.su@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Kees Cook <keescook@google.com>, Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18nmi_backtrace: allow excluding an arbitrary CPUDouglas Anderson2-3/+3
The APIs that allow backtracing across CPUs have always had a way to exclude the current CPU. This convenience means callers didn't need to find a place to allocate a CPU mask just to handle the common case. Let's extend the API to take a CPU ID to exclude instead of just a boolean. This isn't any more complex for the API to handle and allows the hardlockup detector to exclude a different CPU (the one it already did a trace for) without needing to find space for a CPU mask. Arguably, this new API also encourages safer behavior. Specifically if the caller wants to avoid tracing the current CPU (maybe because they already traced the current CPU) this makes it more obvious to the caller that they need to make sure that the current CPU ID can't change. [akpm@linux-foundation.org: fix trigger_allbutcpu_cpu_backtrace() stub] Link: https://lkml.kernel.org/r/20230804065935.v4.1.Ia35521b91fc781368945161d7b28538f9996c182@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: kernel test robot <lkp@intel.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18range.h: Move resource API and constant to respective filesAndy Shevchenko2-1/+9
range.h works with struct range data type. The resource_size_t is an alien here. (1) Move cap_resource() implementation into its only user, and (2) rename and move RESOURCE_SIZE_MAX to limits.h. Link: https://lkml.kernel.org/r/20230804064636.15368-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Alexander Sverdlin <alexander.sverdlin@nokia.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18x86/asm: replace custom COUNT_ARGS() & CONCATENATE() implementationsAndy Shevchenko1-8/+3
Replace custom implementation of the macros from args.h. Link: https://lkml.kernel.org/r/20230718211147.18647-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Daniel Latypov <dlatypov@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Gow <davidgow@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18kexec: rename ARCH_HAS_KEXEC_PURGATORYEric DeVolder1-1/+1
The Kconfig refactor to consolidate KEXEC and CRASH options utilized option names of the form ARCH_SUPPORTS_<option>. Thus rename the ARCH_HAS_KEXEC_PURGATORY to ARCH_SUPPORTS_KEXEC_PURGATORY to follow the same. Link: https://lkml.kernel.org/r/20230712161545.87870-15-eric.devolder@oracle.com Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18x86/kexec: refactor for kernel/Kconfig.kexecEric DeVolder1-73/+19
The kexec and crash kernel options are provided in the common kernel/Kconfig.kexec. Utilize the common options and provide the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the equivalent set of KEXEC and CRASH options. Link: https://lkml.kernel.org/r/20230712161545.87870-3-eric.devolder@oracle.com Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-30Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds8-56/+121
Pull kvm fixes from Paolo Bonzini: "x86: - Do not register IRQ bypass consumer if posted interrupts not supported - Fix missed device interrupt due to non-atomic update of IRR - Use GFP_KERNEL_ACCOUNT for pid_table in ipiv - Make VMREAD error path play nice with noinstr - x86: Acquire SRCU read lock when handling fastpath MSR writes - Support linking rseq tests statically against glibc 2.35+ - Fix reference count for stats file descriptors - Detect userspace setting invalid CR0 Non-KVM: - Remove coccinelle script that has caused multiple confusion ("debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE() usage", acked by Greg)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits) KVM: selftests: Expand x86's sregs test to cover illegal CR0 values KVM: VMX: Don't fudge CR0 and CR4 for restricted L2 guest KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalid Revert "debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE() usage" KVM: selftests: Verify stats fd is usable after VM fd has been closed KVM: selftests: Verify stats fd can be dup()'d and read KVM: selftests: Verify userspace can create "redundant" binary stats files KVM: selftests: Explicitly free vcpus array in binary stats test KVM: selftests: Clean up stats fd in common stats_test() helper KVM: selftests: Use pread() to read binary stats header KVM: Grab a reference to KVM for VM and vCPU stats file descriptors selftests/rseq: Play nice with binaries statically linked against glibc 2.35+ Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid" KVM: x86: Acquire SRCU read lock when handling fastpath MSR writes KVM: VMX: Use vmread_error() to report VM-Fail in "goto" path KVM: VMX: Make VMREAD error path play nice with noinstr KVM: x86/irq: Conditionally register IRQ bypass consumer again KVM: X86: Use GFP_KERNEL_ACCOUNT for pid_table in ipiv KVM: x86: check the kvm_cpu_get_interrupt result before using it KVM: x86: VMX: set irr_pending in kvm_apic_update_irr ...
2023-07-30Merge tag 'x86_urgent_for_v6.5_rc4' of ↵Linus Torvalds3-9/+26
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - AMD's automatic IBRS doesn't enable cross-thread branch target injection protection (STIBP) for user processes. Enable STIBP on such systems. - Do not delete (but put the ref instead) of AMD MCE error thresholding sysfs kobjects when destroying them in order not to delete the kernfs pointer prematurely - Restore annotation in ret_from_fork_asm() in order to fix kthread stack unwinding from being marked as unreliable and thus breaking livepatching * tag 'x86_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Enable STIBP on AMD if Automatic IBRS is enabled x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks x86: Fix kthread unwind
2023-07-30arch/*/configs/*defconfig: Replace AUTOFS4_FS by AUTOFS_FSSven Joachim2-2/+2
Commit a2225d931f75 ("autofs: remove left-over autofs4 stubs") promised the removal of the fs/autofs/Kconfig fragment for AUTOFS4_FS within a couple of releases, but five years later this still has not happened yet, and AUTOFS4_FS is still enabled in 63 defconfigs. Get rid of it mechanically: git grep -l CONFIG_AUTOFS4_FS -- '*defconfig' | xargs sed -i 's/AUTOFS4_FS/AUTOFS_FS/' Also just remove the AUTOFS4_FS config option stub. Anybody who hasn't regenerated their config file in the last five years will need to just get the new name right when they do. Signed-off-by: Sven Joachim <svenjoac@gmx.de> Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-29KVM: VMX: Don't fudge CR0 and CR4 for restricted L2 guestSean Christopherson1-4/+9
Stuff CR0 and/or CR4 to be compliant with a restricted guest if and only if KVM itself is not configured to utilize unrestricted guests, i.e. don't stuff CR0/CR4 for a restricted L2 that is running as the guest of an unrestricted L1. Any attempt to VM-Enter a restricted guest with invalid CR0/CR4 values should fail, i.e. in a nested scenario, KVM (as L0) should never observe a restricted L2 with incompatible CR0/CR4, since nested VM-Enter from L1 should have failed. And if KVM does observe an active, restricted L2 with incompatible state, e.g. due to a KVM bug, fudging CR0/CR4 instead of letting VM-Enter fail does more harm than good, as KVM will often neglect to undo the side effects, e.g. won't clear rmode.vm86_active on nested VM-Exit, and thus the damage can easily spill over to L1. On the other hand, letting VM-Enter fail due to bad guest state is more likely to contain the damage to L2 as KVM relies on hardware to perform most guest state consistency checks, i.e. KVM needs to be able to reflect a failed nested VM-Enter into L1 irrespective of (un)restricted guest behavior. Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Fixes: bddd82d19e2e ("KVM: nVMX: KVM needs to unset "unrestricted guest" VM-execution control in vmcs02 if vmcs12 doesn't set it") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230613203037.1968489-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalidSean Christopherson5-20/+52
Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid, e.g. due to setting bits 63:32, illegal combinations, or to a value that isn't allowed in VMX (non-)root mode. The VMX checks in particular are "fun" as failure to disallow Real Mode for an L2 that is configured with unrestricted guest disabled, when KVM itself has unrestricted guest enabled, will result in KVM forcing VM86 mode to virtual Real Mode for L2, but then fail to unwind the related metadata when synthesizing a nested VM-Exit back to L1 (which has unrestricted guest enabled). Opportunistically fix a benign typo in the prototype for is_valid_cr4(). Cc: stable@vger.kernel.org Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230613203037.1968489-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"Sean Christopherson1-8/+2
Now that handle_fastpath_set_msr_irqoff() acquires kvm->srcu, i.e. allows dereferencing memslots during WRMSR emulation, drop the requirement that "next RIP" is valid. In hindsight, acquiring kvm->srcu would have been a better fix than avoiding the pastpath, but at the time it was thought that accessing SRCU-protected data in the fastpath was a one-off edge case. This reverts commit 5c30e8101e8d5d020b1d7119117889756a6ed713. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230721224337.2335137-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: x86: Acquire SRCU read lock when handling fastpath MSR writesSean Christopherson1-0/+4
Temporarily acquire kvm->srcu for read when potentially emulating WRMSR in the VM-Exit fastpath handler, as several of the common helpers used during emulation expect the caller to provide SRCU protection. E.g. if the guest is counting instructions retired, KVM will query the PMU event filter when stepping over the WRMSR. dump_stack+0x85/0xdf lockdep_rcu_suspicious+0x109/0x120 pmc_event_is_allowed+0x165/0x170 kvm_pmu_trigger_event+0xa5/0x190 handle_fastpath_set_msr_irqoff+0xca/0x1e0 svm_vcpu_run+0x5c3/0x7b0 [kvm_amd] vcpu_enter_guest+0x2108/0x2580 Alternatively, check_pmu_event_filter() could acquire kvm->srcu, but this isn't the first bug of this nature, e.g. see commit 5c30e8101e8d ("KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"). Providing protection for the entirety of WRMSR emulation will allow reverting the aforementioned commit, and will avoid having to play whack-a-mole when new uses of SRCU-protected structures are inevitably added in common emulation helpers. Fixes: dfdeda67ea2d ("KVM: x86/pmu: Prevent the PMU from counting disallowed events") Reported-by: Greg Thelen <gthelen@google.com> Reported-by: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230721224337.2335137-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: VMX: Use vmread_error() to report VM-Fail in "goto" pathSean Christopherson1-2/+1
Use vmread_error() to report VM-Fail on VMREAD for the "asm goto" case, now that trampoline case has yet another wrapper around vmread_error() to play nice with instrumentation. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230721235637.2345403-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: VMX: Make VMREAD error path play nice with noinstrSean Christopherson3-9/+26
Mark vmread_error_trampoline() as noinstr, and add a second trampoline for the CONFIG_CC_HAS_ASM_GOTO_OUTPUT=n case to enable instrumentation when handling VM-Fail on VMREAD. VMREAD is used in various noinstr flows, e.g. immediately after VM-Exit, and objtool rightly complains that the call to the error trampoline leaves a no-instrumentation section without annotating that it's safe to do so. vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0xc9: call to vmread_error_trampoline() leaves .noinstr.text section Note, strictly speaking, enabling instrumentation in the VM-Fail path isn't exactly safe, but if VMREAD fails the kernel/system is likely hosed anyways, and logging that there is a fatal error is more important than *maybe* encountering slightly unsafe instrumentation. Reported-by: Su Hui <suhui@nfschina.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230721235637.2345403-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: x86/irq: Conditionally register IRQ bypass consumer againLike Xu1-1/+1
As was attempted commit 14717e203186 ("kvm: Conditionally register IRQ bypass consumer"): "if we don't support a mechanism for bypassing IRQs, don't register as a consumer. Initially this applied to AMD processors, but when AVIC support was implemented for assigned devices, kvm_arch_has_irq_bypass() was always returning true. We can still skip registering the consumer where enable_apicv or posted-interrupts capability is unsupported or globally disabled. This eliminates meaningless dev_info()s when the connect fails between producer and consumer", such as on Linux hosts where enable_apicv or posted-interrupts capability is unsupported or globally disabled. Cc: Alex Williamson <alex.williamson@redhat.com> Reported-by: Yong He <alexyonghe@tencent.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217379 Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20230724111236.76570-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: X86: Use GFP_KERNEL_ACCOUNT for pid_table in ipivPeng Hao1-1/+2
The pid_table of ipiv is the persistent memory allocated by per-vcpu, which should be counted into the memory cgroup. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Message-Id: <CAPm50aLxCQ3TQP2Lhc0PX3y00iTRg+mniLBqNDOC=t9CLxMwwA@mail.gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: x86: check the kvm_cpu_get_interrupt result before using itMaxim Levitsky1-3/+7
The code was blindly assuming that kvm_cpu_get_interrupt never returns -1 when there is a pending interrupt. While this should be true, a bug in KVM can still cause this. If -1 is returned, the code before this patch was converting it to 0xFF, and 0xFF interrupt was injected to the guest, which results in an issue which was hard to debug. Add WARN_ON_ONCE to catch this case and skip the injection if this happens again. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230726135945.260841-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: x86: VMX: set irr_pending in kvm_apic_update_irrMaxim Levitsky1-1/+4
When the APICv is inhibited, the irr_pending optimization is used. Therefore, when kvm_apic_update_irr sets bits in the IRR, it must set irr_pending to true as well. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230726135945.260841-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: x86: VMX: __kvm_apic_update_irr must update the IRR atomicallyMaxim Levitsky1-7/+13
If APICv is inhibited, then IPIs from peer vCPUs are done by atomically setting bits in IRR. This means, that when __kvm_apic_update_irr copies PIR to IRR, it has to modify IRR atomically as well. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230726135945.260841-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-26x86/traps: Fix load_unaligned_zeropad() handling for shared TDX memoryKirill A. Shutemov1-7/+11
Commit c4e34dd99f2e ("x86: simplify load_unaligned_zeropad() implementation") changes how exceptions around load_unaligned_zeropad() handled. The kernel now uses the fault_address in fixup_exception() to verify the address calculations for the load_unaligned_zeropad(). It works fine for #PF, but breaks on #VE since no fault address is passed down to fixup_exception(). Propagating ve_info.gla down to fixup_exception() resolves the issue. See commit 1e7769653b06 ("x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page") for more context. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Michael Kelley <mikelley@microsoft.com> Fixes: c4e34dd99f2e ("x86: simplify load_unaligned_zeropad() implementation") Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-22x86/cpu: Enable STIBP on AMD if Automatic IBRS is enabledKim Phillips1-6/+9
Unlike Intel's Enhanced IBRS feature, AMD's Automatic IBRS does not provide protection to processes running at CPL3/user mode, see section "Extended Feature Enable Register (EFER)" in the APM v2 at https://bugzilla.kernel.org/attachment.cgi?id=304652 Explicitly enable STIBP to protect against cross-thread CPL3 branch target injections on systems with Automatic IBRS enabled. Also update the relevant documentation. Fixes: e7862eda309e ("x86/cpu: Support AMD Automatic IBRS") Reported-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230720194727.67022-1-kim.phillips@amd.com
2023-07-22x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocksYazen Ghannam1-2/+2
AMD systems from Family 10h to 16h share MCA bank 4 across multiple CPUs. Therefore, the threshold_bank structure for bank 4, and its threshold_block structures, will be initialized once at boot time. And the kobject for the shared bank will be added to each of the CPUs that share it. Furthermore, the threshold_blocks for the shared bank will be added again to the bank's kobject. These additions will increase the refcount for the bank's kobject. For example, a shared bank with two blocks and shared across two CPUs will be set up like this: CPU0 init bank create and add; bank refcount = 1; threshold_create_bank() block 0 init and add; bank refcount = 2; allocate_threshold_blocks() block 1 init and add; bank refcount = 3; allocate_threshold_blocks() CPU1 init bank add; bank refcount = 3; threshold_create_bank() block 0 add; bank refcount = 4; __threshold_add_blocks() block 1 add; bank refcount = 5; __threshold_add_blocks() Currently in threshold_remove_bank(), if the bank is shared then __threshold_remove_blocks() is called. Here the shared bank's kobject and the bank's blocks' kobjects are deleted. This is done on the first call even while the structures are still shared. Subsequent calls from other CPUs that share the structures will attempt to delete the kobjects. During kobject_del(), kobject->sd is removed. If the kobject is not part of a kset with default_groups, then subsequent kobject_del() calls seem safe even with kobject->sd == NULL. Originally, the AMD MCA thresholding structures did not use default_groups. And so the above behavior was not apparent. However, a recent change implemented default_groups for the thresholding structures. Therefore, kobject_del() will go down the sysfs_remove_groups() code path. In this case, the first kobject_del() may succeed and remove kobject->sd. But subsequent kobject_del() calls will give a WARNing in kernfs_remove_by_name_ns() since kobject->sd == NULL. Use kobject_put() on the shared bank's kobject when "removing" blocks. This decrements the bank's refcount while keeping kobjects enabled until the bank is no longer shared. At that point, kobject_put() will be called on the blocks which drives their refcount to 0 and deletes them and also decrementing the bank's refcount. And finally kobject_put() will be called on the bank driving its refcount to 0 and deleting it. The same example above: CPU1 shutdown bank is shared; bank refcount = 5; threshold_remove_bank() block 0 put parent bank; bank refcount = 4; __threshold_remove_blocks() block 1 put parent bank; bank refcount = 3; __threshold_remove_blocks() CPU0 shutdown bank is no longer shared; bank refcount = 3; threshold_remove_bank() block 0 put block; bank refcount = 2; deallocate_threshold_blocks() block 1 put block; bank refcount = 1; deallocate_threshold_blocks() put bank; bank refcount = 0; threshold_remove_bank() Fixes: 7f99cb5e6039 ("x86/CPU/AMD: Use default_groups in kobj_type") Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Mikulas Patocka <mpatocka@redhat.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/alpine.LRH.2.02.2205301145540.25840@file01.intranet.prod.int.rdu2.redhat.com
2023-07-21x86: Fix kthread unwindPeter Zijlstra1-1/+15
The rewrite of ret_from_form() misplaced an unwind hint which caused all kthread stack unwinds to be marked unreliable, breaking livepatching. Restore the annotation and add a comment to explain the how and why of things. Fixes: 3aec4ecb3d1f ("x86: Rewrite ret_from_fork() in C") Reported-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Petr Mladek <pmladek@suse.com> Link: https://lkml.kernel.org/r/20230719201538.GA3553016@hirez.programming.kicks-ass.net
2023-07-17x86/cpu/amd: Add a Zenbleed fixBorislav Petkov (AMD)5-0/+66
Add a fix for the Zen2 VZEROUPPER data corruption bug where under certain circumstances executing VZEROUPPER can cause register corruption or leak data. The optimal fix is through microcode but in the case the proper microcode revision has not been applied, enable a fallback fix using a chicken bit. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-07-17x86/cpu/amd: Move the errata checking functionality upBorislav Petkov (AMD)1-72/+67
Avoid new and remove old forward declarations. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-07-16Merge tag 'perf_urgent_for_v6.5_rc2' of ↵Linus Torvalds1-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Fix a lockdep warning when the event given is the first one, no event group exists yet but the code still goes and iterates over event siblings * tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix lockdep warning in for_each_sibling_event() on SPR
2023-07-15Merge tag 'x86_urgent_for_6.5_rc2' of ↵Linus Torvalds9-74/+119
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 CFI fixes from Peter Zijlstra: "Fix kCFI/FineIBT weaknesses The primary bug Alyssa noticed was that with FineIBT enabled function prologues have a spurious ENDBR instruction: __cfi_foo: endbr64 subl $hash, %r10d jz 1f ud2 nop 1: foo: endbr64 <--- *sadface* This means that any indirect call that fails to target the __cfi symbol and instead targets (the regular old) foo+0, will succeed due to that second ENDBR. Fixing this led to the discovery of a single indirect call that was still doing this: ret_from_fork(). Since that's an assembly stub the compiler would not generate the proper kCFI indirect call magic and it would not get patched. Brian came up with the most comprehensive fix -- convert the thing to C with only a very thin asm wrapper. This ensures the kernel thread boostrap is a proper kCFI call. While discussing all this, Kees noted that kCFI hashes could/should be poisoned to seal all functions whose address is never taken, further limiting the valid kCFI targets -- much like we already do for IBT. So what was a 'simple' observation and fix cascaded into a bunch of inter-related CFI infrastructure fixes" * tag 'x86_urgent_for_6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cfi: Only define poison_cfi() if CONFIG_X86_KERNEL_IBT=y x86/fineibt: Poison ENDBR at +0 x86: Rewrite ret_from_fork() in C x86/32: Remove schedule_tail_wrapper() x86/cfi: Extend ENDBR sealing to kCFI x86/alternative: Rename apply_ibt_endbr() x86/cfi: Extend {JMP,CAKK}_NOSPEC comment
2023-07-13Merge tag 'trace-v6.5-rc1-3' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix some missing-prototype warnings - Fix user events struct args (did not include size of struct) When creating a user event, the "struct" keyword is to denote that the size of the field will be passed in. But the parsing failed to handle this case. - Add selftest to struct sizes for user events - Fix sample code for direct trampolines. The sample code for direct trampolines attached to handle_mm_fault(). But the prototype changed and the direct trampoline sample code was not updated. Direct trampolines needs to have the arguments correct otherwise it can fail or crash the system. - Remove unused ftrace_regs_caller_ret() prototype. - Quiet false positive of FORTIFY_SOURCE Due to backward compatibility, the structure used to save stack traces in the kernel had a fixed size of 8. This structure is exported to user space via the tracing format file. A change was made to allow more than 8 functions to be recorded, and user space now uses the size field to know how many functions are actually in the stack. But the structure still has size of 8 (even though it points into the ring buffer that has the required amount allocated to hold a full stack. This was fine until the fortifier noticed that the memcpy(&entry->caller, stack, size) was greater than the 8 functions and would complain at runtime about it. Hide this by using a pointer to the stack location on the ring buffer instead of using the address of the entry structure caller field. - Fix a deadloop in reading trace_pipe that was caused by a mismatch between ring_buffer_empty() returning false which then asked to read the data, but the read code uses rb_num_of_entries() that returned zero, and causing a infinite "retry". - Fix a warning caused by not using all pages allocated to store ftrace functions, where this can happen if the linker inserts a bunch of "NULL" entries, causing the accounting of how many pages needed to be off. - Fix histogram synthetic event crashing when the start event is removed and the end event is still using a variable from it - Fix memory leak in freeing iter->temp in tracing_release_pipe() * tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix memory leak of iter->temp when reading trace_pipe tracing/histograms: Add histograms to hist_vars if they have referenced variables tracing: Stop FORTIFY_SOURCE complaining about stack trace caller ftrace: Fix possible warning on checking all pages used in ftrace_process_locs() ring-buffer: Fix deadloop issue on reading trace_pipe tracing: arm64: Avoid missing-prototype warnings selftests/user_events: Test struct size match cases tracing/user_events: Fix struct arg size match check x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret() arm64: ftrace: Add direct call trampoline samples support samples: ftrace: Save required argument registers in sample trampolines
2023-07-13Merge tag 'for-linus-6.5-rc2-tag' of ↵Linus Torvalds1-16/+21
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - a cleanup of the Xen related ELF-notes - a fix for virtio handling in Xen dom0 when running Xen in a VM * tag 'for-linus-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/virtio: Fix NULL deref when a bridge of PCI root bus has no parent x86/Xen: tidy xen-head.S
2023-07-11x86/cfi: Only define poison_cfi() if CONFIG_X86_KERNEL_IBT=yIngo Molnar1-0/+2
poison_cfi() was introduced in: 9831c6253ace ("x86/cfi: Extend ENDBR sealing to kCFI") ... but it's only ever used under CONFIG_X86_KERNEL_IBT=y, and if that option is disabled, we get: arch/x86/kernel/alternative.c:1243:13: error: ‘poison_cfi’ defined but not used [-Werror=unused-function] Guard the definition with CONFIG_X86_KERNEL_IBT. Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-07-11x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret()YueHaibing1-1/+0
This is now unused, so can remove it. Link: https://lore.kernel.org/linux-trace-kernel/20230623091640.21952-1-yuehaibing@huawei.com Cc: <mark.rutland@arm.com> Cc: <tglx@linutronix.de> Cc: <mingo@redhat.com> Cc: <bp@alien8.de> Cc: <dave.hansen@linux.intel.com> Cc: <x86@kernel.org> Cc: <hpa@zytor.com> Cc: <peterz@infradead.org> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-10x86/fineibt: Poison ENDBR at +0Peter Zijlstra1-0/+16
Alyssa noticed that when building the kernel with CFI_CLANG+IBT and booting on IBT enabled hardware to obtain FineIBT, the indirect functions look like: __cfi_foo: endbr64 subl $hash, %r10d jz 1f ud2 nop 1: foo: endbr64 This is because the compiler generates code for kCFI+IBT. In that case the caller does the hash check and will jump to +0, so there must be an ENDBR there. The compiler doesn't know about FineIBT at all; also it is possible to actually use kCFI+IBT when booting with 'cfi=kcfi' on IBT enabled hardware. Having this second ENDBR however makes it possible to elide the CFI check. Therefore, we should poison this second ENDBR when switching to FineIBT mode. Fixes: 931ab63664f0 ("x86/ibt: Implement FineIBT") Reported-by: "Milburn, Alyssa" <alyssa.milburn@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lore.kernel.org/r/20230615193722.194131053@infradead.org
2023-07-10x86: Rewrite ret_from_fork() in CBrian Gerst4-49/+40
When kCFI is enabled, special handling is needed for the indirect call to the kernel thread function. Rewrite the ret_from_fork() function in C so that the compiler can properly handle the indirect call. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lkml.kernel.org/r/20230623225529.34590-3-brgerst@gmail.com
2023-07-10x86/32: Remove schedule_tail_wrapper()Brian Gerst1-23/+10
The unwinder expects a return address at the very top of the kernel stack just below pt_regs and before any stack frame is created. Instead of calling a wrapper, set up a return address as if ret_from_fork() was called from the syscall entry code. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lkml.kernel.org/r/20230623225529.34590-2-brgerst@gmail.com
2023-07-10x86/cfi: Extend ENDBR sealing to kCFIPeter Zijlstra1-1/+43
Kees noted that IBT sealing could be extended to kCFI. Fundamentally it is the list of functions that do not have their address taken and are thus never called indirectly. It doesn't matter that objtool uses IBT infrastructure to determine this list, once we have it it can also be used to clobber kCFI hashes and avoid kCFI indirect calls. Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lkml.kernel.org/r/20230622144321.494426891%40infradead.org
2023-07-10x86/alternative: Rename apply_ibt_endbr()Peter Zijlstra4-6/+9
The current name doesn't reflect what it does very well. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lkml.kernel.org/r/20230622144321.427441595%40infradead.org
2023-07-10x86/cfi: Extend {JMP,CAKK}_NOSPEC commentPeter Zijlstra1-0/+4
With the introduction of kCFI these helpers are no longer equivalent to C indirect calls and should be used with care. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lkml.kernel.org/r/20230622144321.360957723%40infradead.org
2023-07-10perf/x86: Fix lockdep warning in for_each_sibling_event() on SPRNamhyung Kim1-0/+7
On SPR, the load latency event needs an auxiliary event in the same group to work properly. There's a check in intel_pmu_hw_config() for this to iterate sibling events and find a mem-loads-aux event. The for_each_sibling_event() has a lockdep assert to make sure if it disabled hardirq or hold leader->ctx->mutex. This works well if the given event has a separate leader event since perf_try_init_event() grabs the leader->ctx->mutex to protect the sibling list. But it can cause a problem when the event itself is a leader since the event is not initialized yet and there's no ctx for the event. Actually I got a lockdep warning when I run the below command on SPR, but I guess it could be a NULL pointer dereference. $ perf record -d -e cpu/mem-loads/uP true The code path to the warning is: sys_perf_event_open() perf_event_alloc() perf_init_event() perf_try_init_event() x86_pmu_event_init() hsw_hw_config() intel_pmu_hw_config() for_each_sibling_event() lockdep_assert_event_ctx() We don't need for_each_sibling_event() when it's a standalone event. Let's return the error code directly. Fixes: f3c0eba28704 ("perf: Add a few assertions") Reported-by: Greg Thelen <gthelen@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20230704181516.3293665-1-namhyung@kernel.org
2023-07-09Merge tag 'x86_urgent_for_v6.5_rc1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu fix from Borislav Petkov: - Do FPU AP initialization on Xen PV too which got missed by the recent boot reordering work * tag 'x86_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/xen: Fix secondary processors' FPU initialization
2023-07-09Merge tag 'x86-core-2023-07-09' of ↵Linus Torvalds1-0/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Thomas Gleixner: "A single fix for the mechanism to park CPUs with an INIT IPI. On shutdown or kexec, the kernel tries to park the non-boot CPUs with an INIT IPI. But the same code path is also used by the crash utility. If the CPU which panics is not the boot CPU then it sends an INIT IPI to the boot CPU which resets the machine. Prevent this by validating that the CPU which runs the stop mechanism is the boot CPU. If not, leave the other CPUs in HLT" * tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/smp: Don't send INIT to boot CPU
2023-07-07x86/smp: Don't send INIT to boot CPUThomas Gleixner1-0/+8
Parking CPUs in INIT works well, except for the crash case when the CPU which invokes smp_park_other_cpus_in_init() is not the boot CPU. Sending INIT to the boot CPU resets the whole machine. Prevent this by validating that this runs on the boot CPU. If not fall back and let CPUs hang in HLT. Fixes: 45e34c8af58f ("x86/smp: Put CPUs into INIT on shutdown if possible") Reported-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/87ttui91jo.ffs@tglx
2023-07-05x86/xen: Fix secondary processors' FPU initializationJuergen Gross1-0/+1
Moving the call of fpu__init_cpu() from cpu_init() to start_secondary() broke Xen PV guests, as those don't call start_secondary() for APs. Call fpu__init_cpu() in Xen's cpu_bringup(), which is the Xen PV replacement of start_secondary(). Fixes: b81fac906a8f ("x86/fpu: Move FPU initialization into arch_cpu_finalize_init()") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230703130032.22916-1-jgross@suse.com
2023-07-04x86/Xen: tidy xen-head.SJan Beulich1-16/+21
First of all move PV-only ELF notes inside the XEN_PV conditional; note that - HV_START_LOW is dropped altogether, as it was meaningful for 32-bit PV only, - the 32-bit instance of VIRT_BASE is dropped, as it would be dead code once inside the conditional, - while PADDR_OFFSET is not exactly unused for PVH, it defaults to zero there, and the hypervisor (or tool stack) complains if it is present but VIRT_BASE isn't. Then have the "supported features" note actually report reality: All three of the features there are supported and/or applicable only in certain cases. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/f99bacc6-2a2f-41b0-5c0b-e01b7051cb07@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-07-04Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds24-281/+471
Pull kvm updates from Paolo Bonzini: "ARM64: - Eager page splitting optimization for dirty logging, optionally allowing for a VM to avoid the cost of hugepage splitting in the stage-2 fault path. - Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact with services that live in the Secure world. pKVM intervenes on FF-A calls to guarantee the host doesn't misuse memory donated to the hyp or a pKVM guest. - Support for running the split hypervisor with VHE enabled, known as 'hVHE' mode. This is extremely useful for testing the split hypervisor on VHE-only systems, and paves the way for new use cases that depend on having two TTBRs available at EL2. - Generalized framework for configurable ID registers from userspace. KVM/arm64 currently prevents arbitrary CPU feature set configuration from userspace, but the intent is to relax this limitation and allow userspace to select a feature set consistent with the CPU. - Enable the use of Branch Target Identification (FEAT_BTI) in the hypervisor. - Use a separate set of pointer authentication keys for the hypervisor when running in protected mode, as the host is untrusted at runtime. - Ensure timer IRQs are consistently released in the init failure paths. - Avoid trapping CTR_EL0 on systems with Enhanced Virtualization Traps (FEAT_EVT), as it is a register commonly read from userspace. - Erratum workaround for the upcoming AmpereOne part, which has broken hardware A/D state management. RISC-V: - Redirect AMO load/store misaligned traps to KVM guest - Trap-n-emulate AIA in-kernel irqchip for KVM guest - Svnapot support for KVM Guest s390: - New uvdevice secret API - CMM selftest and fixes - fix racy access to target CPU for diag 9c x86: - Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Drop now unnecessary TR/TSS load after VM-Exit on AMD - Print more descriptive information about the status of SEV and SEV-ES during module load - Add a test for splitting and reconstituting hugepages during and after dirty logging - Add support for CPU pinning in demand paging test - Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes included along the way - Add a "nx_huge_pages=never" option to effectively avoid creating NX hugepage recovery threads (because nx_huge_pages=off can be toggled at runtime) - Move handling of PAT out of MTRR code and dedup SVM+VMX code - Fix output of PIC poll command emulation when there's an interrupt - Add a maintainer's handbook to document KVM x86 processes, preferred coding style, testing expectations, etc. - Misc cleanups, fixes and comments Generic: - Miscellaneous bugfixes and cleanups Selftests: - Generate dependency files so that partial rebuilds work as expected" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (153 commits) Documentation/process: Add a maintainer handbook for KVM x86 Documentation/process: Add a label for the tip tree handbook's coding style KVM: arm64: Fix misuse of KVM_ARM_VCPU_POWER_OFF bit index RISC-V: KVM: Remove unneeded semicolon RISC-V: KVM: Allow Svnapot extension for Guest/VM riscv: kvm: define vcpu_sbi_ext_pmu in header RISC-V: KVM: Expose IMSIC registers as attributes of AIA irqchip RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC RISC-V: KVM: Expose APLIC registers as attributes of AIA irqchip RISC-V: KVM: Add in-kernel emulation of AIA APLIC RISC-V: KVM: Implement device interface for AIA irqchip RISC-V: KVM: Skeletal in-kernel AIA irqchip support RISC-V: KVM: Set kvm_riscv_aia_nr_hgei to zero RISC-V: KVM: Add APLIC related defines RISC-V: KVM: Add IMSIC related defines RISC-V: KVM: Implement guest external interrupt line management KVM: x86: Remove PRIx* definitions as they are solely for user space s390/uv: Update query for secret-UVCs s390/uv: replace scnprintf with sysfs_emit s390/uvdevice: Add 'Lock Secret Store' UVC ...
2023-07-01Merge tag 'x86-urgent-2023-07-01' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Thomas Gleixner: "A single regression fix for x86: Moving the invocation of arch_cpu_finalize_init() earlier in the boot process caused a boot regression on IBT enabled system. The root cause is not the move of arch_cpu_finalize_init() itself. The system fails to boot because the subsequent efi_enter_virtual_mode() code has a non-IBT safe EFI call inside. This was not noticed before because IBT was enabled after the EFI initialization. Switching the EFI call to use the IBT safe wrapper cures the problem" * tag 'x86-urgent-2023-07-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/efi: Make efi_set_virtual_address_map IBT safe