| Age | Commit message (Collapse) | Author | Files | Lines |
|
commit e2ffe85b6d2bb7780174b87aa4468a39be17eb81 upstream.
Add KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM to allow L1 to set
FREEZE_IN_SMM in vmcs12's GUEST_IA32_DEBUGCTL field, as permitted
prior to commit 6b1dd26544d0 ("KVM: VMX: Preserve host's
DEBUGCTLMSR_FREEZE_IN_SMM while running the guest"). Enable the quirk
by default for backwards compatibility (like all quirks); userspace
can disable it via KVM_CAP_DISABLE_QUIRKS2 for consistency with the
constraints on WRMSR(IA32_DEBUGCTL).
Note that the quirk only bypasses the consistency check. The vmcs02 bit is
still owned by the host, and PMCs are not frozen during virtualized SMM.
In particular, if a host administrator decides that PMCs should not be
frozen during physical SMM, then L1 has no say in the matter.
Fixes: 095686e6fcb4 ("KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter")
Cc: stable@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260205231537.1278753-1-jmattson@google.com
[sean: tag for stable@, clean-up and fix goofs in the comment and docs]
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Rename quirk. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ae6730ff42b3a13d94b405edeb5e40108b6d21b6 ]
Use the MADT and SRAT table data to compute __num_nodes_per_package.
Specifically, SRAT has already been parsed in x86_numa_init(), which is called
before acpi_boot_init() which parses MADT. So both are available in
topology_init_possible_cpus().
This number is useful to divinate the various Intel CoD/SNC and AMD NPS modes,
since the platforms are failing to provide this otherwise.
Doing it this way is independent of the number of online CPUs and
other such shenanigans.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://patch.msgid.link/20260303110100.004091624@infradead.org
Stable-dep-of: 528d89a4707e ("x86/topo: Fix SNC topology mess")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 48084cc153a5b0fbf0aa98d47670d3be0b9f64d5 ]
The topology setup code needs to know the total number of physical
nodes enumerated in SRAT; however NUMA_EMU can cause the existing
numa_nodes_parsed bitmap to be fictitious. Therefore, keep a copy of
the bitmap specifically to retain the physical node count.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://patch.msgid.link/20260303110059.889884023@infradead.org
Stable-dep-of: 528d89a4707e ("x86/topo: Fix SNC topology mess")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 9073428bb204d921ae15326bb7d4558d9d269aab upstream.
The SEV-SNP IBPB-on-Entry feature does not require a guest-side
implementation. It was added in Zen5 h/w, after the first SNP Zen
implementation, and thus was not accounted for when the initial set of SNP
features were added to the kernel.
In its abundant precaution, commit
8c29f0165405 ("x86/sev: Add SEV-SNP guest feature negotiation support")
included SEV_STATUS' IBPB-on-Entry bit as a reserved bit, thereby masking
guests from using the feature.
Allow guests to make use of IBPB-on-Entry when supported by the hypervisor, as
the bit is now architecturally defined and safe to expose.
Fixes: 8c29f0165405 ("x86/sev: Add SEV-SNP guest feature negotiation support")
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@kernel.org
Link: https://patch.msgid.link/20260203222405.4065706-2-kim.phillips@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a4b0bf6a40f3c107c67a24fbc614510ef5719980 upstream.
efi_free_boot_services() frees memory occupied by EFI_BOOT_SERVICES_CODE
and EFI_BOOT_SERVICES_DATA using memblock_free_late().
There are two issue with that: memblock_free_late() should be used for
memory allocated with memblock_alloc() while the memory reserved with
memblock_reserve() should be freed with free_reserved_area().
More acutely, with CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
efi_free_boot_services() is called before deferred initialization of the
memory map is complete.
Benjamin Herrenschmidt reports that this causes a leak of ~140MB of
RAM on EC2 t3a.nano instances which only have 512MB or RAM.
If the freed memory resides in the areas that memory map for them is
still uninitialized, they won't be actually freed because
memblock_free_late() calls memblock_free_pages() and the latter skips
uninitialized pages.
Using free_reserved_area() at this point is also problematic because
__free_page() accesses the buddy of the freed page and that again might
end up in uninitialized part of the memory map.
Delaying the entire efi_free_boot_services() could be problematic
because in addition to freeing boot services memory it updates
efi.memmap without any synchronization and that's undesirable late in
boot when there is concurrency.
More robust approach is to only defer freeing of the EFI boot services
memory.
Split efi_free_boot_services() in two. First efi_unmap_boot_services()
collects ranges that should be freed into an array then
efi_free_boot_services() later frees them after deferred init is complete.
Link: https://lore.kernel.org/all/ec2aaef14783869b3be6e3c253b2dcbf67dbc12a.camel@kernel.crashing.org
Fixes: 916f676f8dc0 ("x86, efi: Retain boot service code until after switching to virtual mode")
Cc: <stable@vger.kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6517dfbcc918f970a928d9dc17586904bac06893 ]
Add two flags for KVM_CAP_X2APIC_API to allow userspace to control support
for Suppress EOI Broadcasts when using a split IRQCHIP (I/O APIC emulated
by userspace), which KVM completely mishandles. When x2APIC support was
first added, KVM incorrectly advertised and "enabled" Suppress EOI
Broadcast, without fully supporting the I/O APIC side of the equation,
i.e. without adding directed EOI to KVM's in-kernel I/O APIC.
That flaw was carried over to split IRQCHIP support, i.e. KVM advertised
support for Suppress EOI Broadcasts irrespective of whether or not the
userspace I/O APIC implementation supported directed EOIs. Even worse,
KVM didn't actually suppress EOI broadcasts, i.e. userspace VMMs without
support for directed EOI came to rely on the "spurious" broadcasts.
KVM "fixed" the in-kernel I/O APIC implementation by completely disabling
support for Suppress EOI Broadcasts in commit 0bcc3fb95b97 ("KVM: lapic:
stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use"), but
didn't do anything to remedy userspace I/O APIC implementations.
KVM's bogus handling of Suppress EOI Broadcast is problematic when the
guest relies on interrupts being masked in the I/O APIC until well after
the initial local APIC EOI. E.g. Windows with Credential Guard enabled
handles interrupts in the following order:
1. Interrupt for L2 arrives.
2. L1 APIC EOIs the interrupt.
3. L1 resumes L2 and injects the interrupt.
4. L2 EOIs after servicing.
5. L1 performs the I/O APIC EOI.
Because KVM EOIs the I/O APIC at step #2, the guest can get an interrupt
storm, e.g. if the IRQ line is still asserted and userspace reacts to the
EOI by re-injecting the IRQ, because the guest doesn't de-assert the line
until step #4, and doesn't expect the interrupt to be re-enabled until
step #5.
Unfortunately, simply "fixing" the bug isn't an option, as KVM has no way
of knowing if the userspace I/O APIC supports directed EOIs, i.e.
suppressing EOI broadcasts would result in interrupts being stuck masked
in the userspace I/O APIC due to step #5 being ignored by userspace. And
fully disabling support for Suppress EOI Broadcast is also undesirable, as
picking up the fix would require a guest reboot, *and* more importantly
would change the virtual CPU model exposed to the guest without any buy-in
from userspace.
Add KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST and
KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST flags to allow userspace to
explicitly enable or disable support for Suppress EOI Broadcasts. This
gives userspace control over the virtual CPU model exposed to the guest,
as KVM should never have enabled support for Suppress EOI Broadcast without
userspace opt-in. Not setting either flag will result in legacy quirky
behavior for backward compatibility.
Disallow fully enabling SUPPRESS_EOI_BROADCAST when using an in-kernel
I/O APIC, as KVM's history/support is just as tragic. E.g. it's not clear
that commit c806a6ad35bf ("KVM: x86: call irq notifiers with directed EOI")
was entirely correct, i.e. it may have simply papered over the lack of
Directed EOI emulation in the I/O APIC.
Note, Suppress EOI Broadcasts is defined only in Intel's SDM, not in AMD's
APM. But the bit is writable on some AMD CPUs, e.g. Turin, and KVM's ABI
is to support Directed EOI (KVM's name) irrespective of guest CPU vendor.
Fixes: 7543a635aa09 ("KVM: x86: Add KVM exit for IOAPIC EOIs")
Closes: https://lore.kernel.org/kvm/7D497EF1-607D-4D37-98E7-DAF95F099342@nutanix.com
Cc: stable@vger.kernel.org
Suggested-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Khushit Shah <khushit.shah@nutanix.com>
Link: https://patch.msgid.link/20260123125657.3384063-1-khushit.shah@nutanix.com
[sean: clean up minor formatting goofs and fix a comment typo]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 24c8147abb39618d74fcc36e325765e8fe7bdd7a ]
Rustam reported his clang builds did not boot properly; turns out his
.config has: CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y set.
Fix up the FineIBT code to deal with this unusual alignment.
Fixes: 931ab63664f0 ("x86/ibt: Implement FineIBT")
Reported-by: Rustam Kovhaev <rkovhaev@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Rustam Kovhaev <rkovhaev@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a0cb371b521dde44f32cfe954b6ef6f82b407393 ]
The commit 5b472b6e5bd9 ("x86_64/bug: Implement __WARN_printf()")
implemented __WARN_printf(), which changed the mechanism to use UD1
instead of UD2. However, it only handles the trap in the runtime IDT
handler, while the early booting IDT handler lacks this handling. As a
result, the usage of WARN() before the runtime IDT setup can lead to
kernel crashes. Since KMSAN is enabled after the runtime IDT setup, it
is safe to use handle_bug() directly in early_fixup_exception() to
address this issue.
Fixes: 5b472b6e5bd9 ("x86_64/bug: Implement __WARN_printf()")
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/c4fb3645f60d3a78629d9870e8fcc8535281c24f.1768016713.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit aea251799998aa1b78eacdfb308f18ea114ea5b3 ]
Mahe reported missing function from stack trace on top of kprobe
multi program. The missing function is the very first one in the
stacktrace, the one that the bpf program is attached to.
# bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
Attaching 1 probe...
do_syscall_64+134
entry_SYSCALL_64_after_hwframe+118
('*' is used for kprobe_multi attachment)
The reason is that the previous change (the Fixes commit) fixed
stack unwind for tracepoint, but removed attached function address
from the stack trace on top of kprobe multi programs, which I also
overlooked in the related test (check following patch).
The tracepoint and kprobe_multi have different stack setup, but use
same unwind path. I think it's better to keep the previous change,
which fixed tracepoint unwind and instead change the kprobe multi
unwind as explained below.
The bpf program stack unwind calls perf_callchain_kernel for kernel
portion and it follows two unwind paths based on X86_EFLAGS_FIXED
bit in pt_regs.flags.
When the bit set we unwind from stack represented by pt_regs argument,
otherwise we unwind currently executed stack up to 'first_frame'
boundary.
The 'first_frame' value is taken from regs.rsp value, but ftrace_caller
and ftrace_regs_caller (ftrace trampoline) functions set the regs.rsp
to the previous stack frame, so we skip the attached function entry.
If we switch kprobe_multi unwind to use the X86_EFLAGS_FIXED bit,
we set the start of the unwind to the attached function address.
As another benefit we also cut extra unwind cycles needed to reach
the 'first_frame' boundary.
The speedup can be measured with trigger bench for kprobe_multi
program and stacktrace support.
- trigger bench with stacktrace on current code:
kprobe-multi : 0.810 ± 0.001M/s
kretprobe-multi: 0.808 ± 0.001M/s
- and with the fix:
kprobe-multi : 1.264 ± 0.001M/s
kretprobe-multi: 1.401 ± 0.002M/s
With the fix, the entry probe stacktrace:
# bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
Attaching 1 probe...
__x64_sys_newuname+9
do_syscall_64+134
entry_SYSCALL_64_after_hwframe+118
The return probe skips the attached function, because it's no longer
on the stack at the point of the unwind and this way is the same how
standard kretprobe works.
# bpftrace -e 'kretprobe:__x64_sys_newuname* { print(kstack)}'
Attaching 1 probe...
do_syscall_64+134
entry_SYSCALL_64_after_hwframe+118
Fixes: 6d08340d1e35 ("Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"")
Reported-by: Mahe Tardy <mahe.tardy@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20260126211837.472802-3-jolsa@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
Fedora QA reported the following panic:
BUG: unable to handle page fault for address: 0000000040003e54
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20251119-3.fc43 11/19/2025
RIP: 0010:vmware_hypercall4.constprop.0+0x52/0x90
..
Call Trace:
vmmouse_report_events+0x13e/0x1b0
psmouse_handle_byte+0x15/0x60
ps2_interrupt+0x8a/0xd0
...
because the QEMU VMware mouse emulation is buggy, and clears the top 32
bits of %rdi that the kernel kept a pointer in.
The QEMU vmmouse driver saves and restores the register state in a
"uint32_t data[6];" and as a result restores the state with the high
bits all cleared.
RDI originally contained the value of a valid kernel stack address
(0xff5eeb3240003e54). After the vmware hypercall it now contains
0x40003e54, and we get a page fault as a result when it is dereferenced.
The proper fix would be in QEMU, but this works around the issue in the
kernel to keep old setups working, when old kernels had not happened to
keep any state in %rdi over the hypercall.
In theory this same issue exists for all the hypercalls in the vmmouse
driver; in practice it has only been seen with vmware_hypercall3() and
vmware_hypercall4(). For now, just mark RDI/RSI as clobbered for those
two calls. This should have a minimal effect on code generation overall
as it should be rare for the compiler to want to make RDI/RSI live
across hypercalls.
Reported-by: Justin Forbes <jforbes@fedoraproject.org>
Link: https://lore.kernel.org/all/99a9c69a-fc1a-43b7-8d1e-c42d6493b41f@broadcom.com/
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The original patch inverted the PTE unconditionally to avoid
L1TF-vulnerable PTEs, but Linux doesn't make this adjustment in 2-level
paging.
Adjust the logic to use the flip_protnone_guard() helper, which is a nop
on 2-level paging but inverts the address bits in all other paging modes.
This doesn't matter for the Xen aspect of the original change. Linux no
longer supports running 32bit PV under Xen, and Xen doesn't support
running any 32bit PV guests without using PAE paging.
Link: https://lkml.kernel.org/r/20260126211046.2096622-1-andrew.cooper3@citrix.com
Fixes: b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Closes: https://lore.kernel.org/lkml/CAKFNMokwjw68ubYQM9WkzOuH51wLznHpEOMSqtMoV1Rn9JV_gw@mail.gmail.com/
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jann Horn <jannh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
For native, the choice of PTE is fine. There's real memory backing the
non-present PTE. However, for XenPV, Xen complains:
(XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing
To explain, some background on XenPV pagetables:
Xen PV guests are control their own pagetables; they choose the new
PTE value, and use hypercalls to make changes so Xen can audit for
safety.
In addition to a regular reference count, Xen also maintains a type
reference count. e.g. SegDesc (referenced by vGDT/vLDT), Writable
(referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a lower
pagetable level). This is in order to prevent e.g. a page being
inserted into the pagetables for which the guest has a writable mapping.
For non-present mappings, all other bits become software accessible,
and typically contain metadata rather a real frame address. There is
nothing that a reference count could sensibly be tied to. As such, even
if Xen could recognise the address as currently safe, nothing would
prevent that frame from changing owner to another VM in the future.
When Xen detects a PV guest writing a L1TF-PTE, it responds by
activating shadow paging. This is normally only used for the live phase
of migration, and comes with a reasonable overhead.
KFENCE only cares about getting #PF to catch wild accesses; it doesn't
care about the value for non-present mappings. Use a fully inverted PTE,
to avoid hitting the slow path when running under Xen.
While adjusting the logic, take the opportunity to skip all actions if the
PTE is already in the right state, half the number PVOps callouts, and
skip TLB maintenance on a !P -> P transition which benefits non-Xen cases
too.
Link: https://lkml.kernel.org/r/20260106180426.710013-1-andrew.cooper3@citrix.com
Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jann Horn <jannh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- Fix FPU core dumps on certain CPU models
- Fix htmldocs build warning
- Export TLB tracing event name via header
- Remove unused constant from <linux/mm_types.h>
- Fix comments
- Fix whitespace noise in documentation
- Fix variadic structure's definition to un-confuse UBSAN
- Fix posted MSI interrupts irq_retrigger() bug
- Fix asm build failure with older GCC builds
* tag 'x86-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bug: Fix old GCC compile fails
x86/msi: Make irq_retrigger() functional for posted MSI
x86/platform/uv: Fix UBSAN array-index-out-of-bounds
mm: Remove tlb_flush_reason::NR_TLB_FLUSH_REASONS from <linux/mm_types.h>
x86/mm/tlb/trace: Export the TLB_REMOTE_WRONG_CPU enum in <trace/events/tlb.h>
x86/sgx: Remove unmatched quote in __sgx_encl_extend function comment
x86/boot/Documentation: Fix whitespace noise in boot.rst
x86/fpu: Fix FPU state core dump truncation on CPUs with no extended xfeatures
x86/boot/Documentation: Fix htmldocs build warning due to malformed table in boot.rst
|
|
clang is generating very inefficient code for native_save_fl() which is
used for local_irq_save() in critical spots.
Allowing the "pop %0" to use memory:
1) forces the compiler to add annoying stack canaries when
CONFIG_STACKPROTECTOR_STRONG=y in many places.
2) Almost always is followed by an immediate "move memory,register"
One good example is _raw_spin_lock_irqsave, with 8 extra instructions
ffffffff82067a30 <_raw_spin_lock_irqsave>:
ffffffff82067a30: ...
ffffffff82067a39: 53 push %rbx
// Three instructions to ajust the stack, read the per-cpu canary
// and copy it to 8(%rsp)
ffffffff82067a3a: 48 83 ec 10 sub $0x10,%rsp
ffffffff82067a3e: 65 48 8b 05 da 15 45 02 mov %gs:0x24515da(%rip),%rax # <__stack_chk_guard>
ffffffff82067a46: 48 89 44 24 08 mov %rax,0x8(%rsp)
ffffffff82067a4b: 9c pushf
// instead of pop %rbx, compiler uses 2 instructions.
ffffffff82067a4c: 8f 04 24 pop (%rsp)
ffffffff82067a4f: 48 8b 1c 24 mov (%rsp),%rbx
ffffffff82067a53: fa cli
ffffffff82067a54: b9 01 00 00 00 mov $0x1,%ecx
ffffffff82067a59: 31 c0 xor %eax,%eax
ffffffff82067a5b: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
ffffffff82067a5f: 75 1d jne ffffffff82067a7e <_raw_spin_lock_irqsave+0x4e>
// three instructions to check the stack canary
ffffffff82067a61: 65 48 8b 05 b7 15 45 02 mov %gs:0x24515b7(%rip),%rax # <__stack_chk_guard>
ffffffff82067a69: 48 3b 44 24 08 cmp 0x8(%rsp),%rax
ffffffff82067a6e: 75 17 jne ffffffff82067a87
...
// One extra instruction to adjust the stack.
ffffffff82067a73: 48 83 c4 10 add $0x10,%rsp
...
// One more instruction in case the stack was mangled.
ffffffff82067a87: e8 a4 35 ff ff call ffffffff8205b030 <__stack_chk_fail>
This patch changes nothing for gcc, but for clang saves ~20000 bytes of text
even though more functions are inlined.
$ size vmlinux.gcc.before vmlinux.gcc.after vmlinux.clang.before vmlinux.clang.after
text data bss dec hex filename
45565821 25005462 4704800 75276083 47c9f33 vmlinux.gcc.before
45565821 25005462 4704800 75276083 47c9f33 vmlinux.gcc.after
45121072 24638617 5533040 75292729 47ce039 vmlinux.clang.before
45093887 24638633 5536808 75269328 47c84d0 vmlinux.clang.after
$ scripts/bloat-o-meter -t vmlinux.clang.before vmlinux.clang.after
add/remove: 1/2 grow/shrink: 21/533 up/down: 2250/-22112 (-19862)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For some mysterious reasons the GCC 8 and 9 preprocessor manages to
sporadically fumble _ASM_BYTES(0x0f, 0x0b):
$ grep ".byte[ ]*0x0f" defconfig-build/drivers/net/wireless/realtek/rtlwifi/base.s
1: .byte0x0f,0x0b ;
1: .byte 0x0f,0x0b ;
which makes the assembler upset and all that. While there are more
_ASM_BYTES() users (notably the NOP instructions), those don't seem
affected. Therefore replace the offending ASM_UD2 with one using the
ud2 mnemonic.
Reported-by: Jean Delvare <jdelvare@suse.de>
Suggested-by: Uros Bizjak <ubizjak@gmail.com>
Fixes: 85a2d4a890dc ("x86,ibt: Use UDB instead of 0xEA")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251218104659.GT3911114@noisy.programming.kicks-ass.net
|
|
Luigi reported that retriggering a posted MSI interrupt does not work
correctly.
The reason is that the retrigger happens at the vector domain by sending an
IPI to the actual vector on the target CPU. That works correctly exactly
once because the posted MSI interrupt chip does not issue an EOI as that's
only required for the posted MSI notification vector itself.
As a consequence the vector becomes stale in the ISR, which not only
affects this vector but also any lower priority vector in the affected
APIC because the ISR bit is not cleared.
Luigi proposed to set the vector in the remap PIR bitmap and raise the
posted MSI notification vector. That works, but that still does not cure a
related problem:
If there is ever a stray interrupt on such a vector, then the related
APIC ISR bit becomes stale due to the lack of EOI as described above.
Unlikely to happen, but if it happens it's not debuggable at all.
So instead of playing games with the PIR, this can be actually solved
for both cases by:
1) Keeping track of the posted interrupt vector handler state
2) Implementing a posted MSI specific irq_ack() callback which checks that
state. If the posted vector handler is inactive it issues an EOI,
otherwise it delegates that to the posted handler.
This is correct versus affinity changes and concurrent events on the posted
vector as the actual handler invocation is serialized through the interrupt
descriptor lock.
Fixes: ed1e48ea4370 ("iommu/vt-d: Enable posted mode for device MSIs")
Reported-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Luigi Rizzo <lrizzo@google.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251125214631.044440658@linutronix.de
Closes: https://lore.kernel.org/lkml/20251124104836.3685533-1-lrizzo@google.com
|
|
When UBSAN is enabled, multiple array-index-out-of-bounds messages are
printed:
[ 0.000000] [ T0] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:276:23
[ 0.000000] [ T0] index 1 is out of range for type '<unknown> [1]'
...
[ 0.000000] [ T0] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:277:32
[ 0.000000] [ T0] index 1 is out of range for type '<unknown> [1]'
...
[ 0.000000] [ T0] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:282:16
[ 0.000000] [ T0] index 1 is out of range for type '<unknown> [1]'
...
[ 0.515850] [ T1] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:1344:23
[ 0.519851] [ T1] index 1 is out of range for type '<unknown> [1]'
...
[ 0.603850] [ T1] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:1345:32
[ 0.607850] [ T1] index 1 is out of range for type '<unknown> [1]'
...
[ 0.691850] [ T1] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:1353:20
[ 0.695850] [ T1] index 1 is out of range for type '<unknown> [1]'
One-element arrays have been deprecated:
https://docs.kernel.org/process/deprecated.html#zero-length-and-one-element-arrays
Switch entry in struct uv_systab to a flexible array member to fix UBSAN
array-index-out-of-bounds messages.
sizeof(struct uv_systab) is passed to early_memremap() and ioremap(). The
flexible array member is not accessed until the UV system table size is used to
remap the entire UV system table, so changes to sizeof(struct uv_systab) have no
impact.
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/aTxksN-3otY41WvQ@hpe.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-auto
Pull __auto_type to auto conversion from Peter Anvin:
"Convert '__auto_type' to 'auto', defining a macro for 'auto' unless
C23+ is in use"
* tag 'auto-type-conversion-for-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-auto:
tools/virtio: replace "__auto_type" with "auto"
selftests/bpf: replace "__auto_type" with "auto"
arch/x86: replace "__auto_type" with "auto"
arch/nios2: replace "__auto_type" and adjacent equivalent with "auto"
fs/proc: replace "__auto_type" with "const auto"
include/linux: change "__auto_type" to "auto"
compiler_types.h: add "auto" as a macro for "__auto_type"
|
|
Replace instances of "__auto_type" with "auto" in:
arch/x86/include/asm/bug.h
arch/x86/include/asm/string_64.h
arch/x86/include/asm/uaccess_64.h
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Enhancements to Linux as the root partition for Microsoft Hypervisor:
- Support a new mode called L1VH, which allows Linux to drive the
hypervisor running the Azure Host directly
- Support for MSHV crash dump collection
- Allow Linux's memory management subsystem to better manage guest
memory regions
- Fix issues that prevented a clean shutdown of the whole system on
bare metal and nested configurations
- ARM64 support for the MSHV driver
- Various other bug fixes and cleanups
- Add support for Confidential VMBus for Linux guest on Hyper-V
- Secure AVIC support for Linux guests on Hyper-V
- Add the mshv_vtl driver to allow Linux to run as the secure kernel in
a higher virtual trust level for Hyper-V
* tag 'hyperv-next-signed-20251207' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (58 commits)
mshv: Cleanly shutdown root partition with MSHV
mshv: Use reboot notifier to configure sleep state
mshv: Add definitions for MSHV sleep state configuration
mshv: Add support for movable memory regions
mshv: Add refcount and locking to mem regions
mshv: Fix huge page handling in memory region traversal
mshv: Move region management to mshv_regions.c
mshv: Centralize guest memory region destruction
mshv: Refactor and rename memory region handling functions
mshv: adjust interrupt control structure for ARM64
Drivers: hv: use kmalloc_array() instead of kmalloc()
mshv: Add ioctl for self targeted passthrough hvcalls
Drivers: hv: Introduce mshv_vtl driver
Drivers: hv: Export some symbols for mshv_vtl
static_call: allow using STATIC_CALL_TRAMP_STR() from assembly
mshv: Extend create partition ioctl to support cpu features
mshv: Allow mappings that overlap in uaddr
mshv: Fix create memory region overlap check
mshv: add WQ_PERCPU to alloc_workqueue users
Drivers: hv: Use kmalloc_array() instead of kmalloc()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "panic: sys_info: Refactor and fix a potential issue" (Andy Shevchenko)
fixes a build issue and does some cleanup in ib/sys_info.c
- "Implement mul_u64_u64_div_u64_roundup()" (David Laight)
enhances the 64-bit math code on behalf of a PWM driver and beefs up
the test module for these library functions
- "scripts/gdb/symbols: make BPF debug info available to GDB" (Ilya Leoshkevich)
makes BPF symbol names, sizes, and line numbers available to the GDB
debugger
- "Enable hung_task and lockup cases to dump system info on demand" (Feng Tang)
adds a sysctl which can be used to cause additional info dumping when
the hung-task and lockup detectors fire
- "lib/base64: add generic encoder/decoder, migrate users" (Kuan-Wei Chiu)
adds a general base64 encoder/decoder to lib/ and migrates several
users away from their private implementations
- "rbree: inline rb_first() and rb_last()" (Eric Dumazet)
makes TCP a little faster
- "liveupdate: Rework KHO for in-kernel users" (Pasha Tatashin)
reworks the KEXEC Handover interfaces in preparation for Live Update
Orchestrator (LUO), and possibly for other future clients
- "kho: simplify state machine and enable dynamic updates" (Pasha Tatashin)
increases the flexibility of KEXEC Handover. Also preparation for LUO
- "Live Update Orchestrator" (Pasha Tatashin)
is a major new feature targeted at cloud environments. Quoting the
cover letter:
This series introduces the Live Update Orchestrator, a kernel
subsystem designed to facilitate live kernel updates using a
kexec-based reboot. This capability is critical for cloud
environments, allowing hypervisors to be updated with minimal
downtime for running virtual machines. LUO achieves this by
preserving the state of selected resources, such as memory,
devices and their dependencies, across the kernel transition.
As a key feature, this series includes support for preserving
memfd file descriptors, which allows critical in-memory data, such
as guest RAM or any other large memory region, to be maintained in
RAM across the kexec reboot.
Mike Rappaport merits a mention here, for his extensive review and
testing work.
- "kexec: reorganize kexec and kdump sysfs" (Sourabh Jain)
moves the kexec and kdump sysfs entries from /sys/kernel/ to
/sys/kernel/kexec/ and adds back-compatibility symlinks which can
hopefully be removed one day
- "kho: fixes for vmalloc restoration" (Mike Rapoport)
fixes a BUG which was being hit during KHO restoration of vmalloc()
regions
* tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (139 commits)
calibrate: update header inclusion
Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
vmcoreinfo: track and log recoverable hardware errors
kho: fix restoring of contiguous ranges of order-0 pages
kho: kho_restore_vmalloc: fix initialization of pages array
MAINTAINERS: TPM DEVICE DRIVER: update the W-tag
init: replace simple_strtoul with kstrtoul to improve lpj_setup
KHO: fix boot failure due to kmemleak access to non-PRESENT pages
Documentation/ABI: new kexec and kdump sysfs interface
Documentation/ABI: mark old kexec sysfs deprecated
kexec: move sysfs entries to /sys/kernel/kexec
test_kho: always print restore status
kho: free chunks using free_page() instead of kfree()
selftests/liveupdate: add kexec test for multiple and empty sessions
selftests/liveupdate: add simple kexec-based selftest for LUO
selftests/liveupdate: add userspace API selftests
docs: add documentation for memfd preservation via LUO
mm: memfd_luo: allow preserving memfd
liveupdate: luo_file: add private argument to store runtime state
mm: shmem: export some functions to internal.h
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Miscellaneous documentation fixes"
* tag 'x86-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/Documentation: Prefix hexadecimal literals with 0x
x86/boot/Documentation: Spell 'ID' consistently
x86/platform: Fix and extend kernel-doc comments in <asm/x86_init.h>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar:
"Address various objtool scalability bugs/inefficiencies exposed by
allmodconfig builds, plus improve the quality of alternatives
instructions generated code and disassembly"
* tag 'objtool-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Simplify .annotate_insn code generation output some more
objtool: Add more robust signal error handling, detect and warn about stack overflows
objtool: Remove newlines and tabs from annotation macros
objtool: Consolidate annotation macros
x86/asm: Remove ANNOTATE_DATA_SPECIAL usage
x86/alternative: Remove ANNOTATE_DATA_SPECIAL usage
objtool: Fix stack overflow in validate_branch()
|
|
Pull bitmap updates from Yury Norov:
- Runtime field_{get,prep}() (Geert)
- Rust ID pool updates (Alice)
- min_t() simplification (David)
- __sw_hweightN kernel-doc fixes (Andy)
- cpumask.h headers cleanup (Andy)
* tag 'bitmap-for-6.19' of github.com:/norov/linux: (32 commits)
rust_binder: use bitmap for allocation of handles
rust: id_pool: do not immediately acquire new ids
rust: id_pool: do not supply starting capacity
rust: id_pool: rename IdPool::new() to with_capacity()
rust: bitmap: add BitmapVec::new_inline()
rust: bitmap: add MAX_LEN and MAX_INLINE_LEN constants
cpumask: Don't use "proxy" headers
soc: renesas: Use bitfield helpers
clk: renesas: Use bitfield helpers
ALSA: usb-audio: Convert to common field_{get,prep}() helpers
soc: renesas: rz-sysc: Convert to common field_get() helper
pinctrl: ma35: Convert to common field_{get,prep}() helpers
iio: mlx90614: Convert to common field_{get,prep}() helpers
iio: dac: Convert to common field_prep() helper
gpio: aspeed: Convert to common field_{get,prep}() helpers
EDAC/ie31200: Convert to common field_get() helper
crypto: qat - convert to common field_get() helper
clk: at91: Convert to common field_{get,prep}() helpers
bitfield: Add non-constant field_{prep,get}() helpers
bitfield: Add less-checking __FIELD_{GET,PREP}()
...
|
|
Pull KVM updates from Paolo Bonzini:
"ARM:
- Support for userspace handling of synchronous external aborts
(SEAs), allowing the VMM to potentially handle the abort in a
non-fatal manner
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers
in hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU
- Allow page table destruction to reschedule, fixing long
need_resched latencies observed when destroying a large VM
- Minor fixes to KVM and selftests
Loongarch:
- Get VM PMU capability from HW GCFG register
- Add AVEC basic support
- Use 64-bit register definition for EIOINTC
- Add KVM timer test cases for tools/selftests
RISC/V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is increased number of exits
(and worse performance) on z10 and earlier processor; ESCA was
introduced by z114/z196 in 2010
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
SPTE caching is disabled, as there can't be any relevant SPTEs to
zap
- Relocate a misplaced export
- Fix an async #PF bug where KVM would clear the completion queue
when the guest transitioned in and out of paging mode, e.g. when
handling an SMI and then returning to paged mode via RSM
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
keeping the notifier registered is ok; the kernel does not use the
MSRs and the callback will run cleanly and restore host MSRs if the
CPU manages to return to userspace before the system goes down
- Use the checked version of {get,put}_user()
- Fix a long-lurking bug where KVM's lack of catch-up logic for
periodic APIC timers can result in a hard lockup in the host
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
latter behind CONFIG_CPU_MITIGATIONS
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
path; the only reason they were handled in the fast path was to
paper of a bug in the core #MC code, and that has long since been
fixed
- Add emulator support for AVX MOV instructions, to play nice with
emulated devices whose guest drivers like to access PCI BARs with
large multi-byte instructions
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs
- Fix the worst of KVM's lack of EFER.LMSLE emulation
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode
- Fix incorrect handling of selective CR0 writes when checking
intercepts during emulation of L2 instructions
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
on VMRUN and #VMEXIT
- Fix a bug where KVM corrupt the guest code stream when re-injecting
a soft interrupt if the guest patched the underlying code after the
VM-Exit, e.g. when Linux patches code with a temporary INT3
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
to userspace, and extend KVM "support" to all policy bits that
don't require any actual support from KVM
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of
the current vCPU state, partly as worthwhile cleanup, but mostly to
pave the way for tracking per-root TLB flushes, and elide EPT
flushes on pCPU migration if the root is clean from a previous
flush
- Add a few missing nested consistency checks
- Rip out support for doing "early" consistency checks via hardware
as the functionality hasn't been used in years and is no longer
useful in general; replace it with an off-by-default module param
to WARN if hardware fails a check that KVM does not perform
- Fix a currently-benign bug where KVM would drop the guest's
SPEC_CTRL[63:32] on VM-Enter
- Misc cleanups
- Overhaul the TDX code to address systemic races where KVM (acting
on behalf of userspace) could inadvertantly trigger lock contention
in the TDX-Module; KVM was either working around these in weird,
ugly ways, or was simply oblivious to them (though even Yan's
devilish selftests could only break individual VMs, not the host
kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
TDX vCPU, if creating said vCPU failed partway through
- Fix a few sparse warnings (bad annotation, 0 != NULL)
- Use struct_size() to simplify copying TDX capabilities to userspace
- Fix a bug where TDX would effectively corrupt user-return MSR
values if the TDX Module rejects VP.ENTER and thus doesn't clobber
host MSRs as expected
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU
system/VM
- Forcefully override ARCH from x86_64 to x86 to play nice with
specifying ARCH=x86_64 on the command line
- Extend a bunch of nested VMX to validate nested SVM as well
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test
to verify KVM can save/restore nested VMX state when L1 is using
5-level paging, but L2 is not
- Clean up the guest paging code in anticipation of sharing the core
logic for nested EPT and nested NPT
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety
of rough edges in guest_memfd along the way
- Define a CLASS to automatically handle get+put when grabbing a
guest_memfd from a memslot to make it harder to leak references
- Enhance KVM selftests to make it easer to develop and debug
selftests like those added for guest_memfd NUMA support, e.g. where
test and/or KVM bugs often result in hard-to-debug SIGBUS errors
- Misc cleanups
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU
workqueue for irqfd cleanup
- Fix a goof in the dirty ring documentation
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
...
|
|
Root partitions running on MSHV currently attempt ACPI power-off, which
MSHV intercepts and triggers a Machine Check Exception (MCE), leading
to a kernel panic.
Root partitions panic with a trace similar to:
[ 81.306348] reboot: Power down
[ 81.314709] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 0: b2000000c0060001
[ 81.314711] mce: [Hardware Error]: TSC 3b8cb60a66 PPIN 11d98332458e4ea9
[ 81.314713] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1759339405 SOCKET 0 APIC 0 microcode ffffffff
[ 81.314715] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 81.314716] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 81.314717] Kernel panic - not syncing: Fatal machine check
To avoid this, configure the sleep state in the hypervisor and invoke
the HVCALL_ENTER_SLEEP_STATE hypercall as the final step in the shutdown
sequence. This ensures a clean and safe shutdown of the root partition.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Co-developed-by: Anatol Belski <anbelski@linux.microsoft.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
Configure sleep state information from ACPI in MSHV hypervisor using
a reboot notifier. This data allows the hypervisor to correctly power
off the host during shutdown.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Co-developed-by: Anatol Belski <anbelski@linux.microsoft.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Reviewed-by: Stansialv Kinsburskii <skinsburskii@linux.miscrosoft.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
Provide an interface for Virtual Machine Monitor like OpenVMM and its
use as OpenHCL paravisor to control VTL0 (Virtual trust Level).
Expose devices and support IOCTLs for features like VTL creation,
VTL0 memory management, context switch, making hypercalls,
mapping VTL0 address space to VTL2 userspace, getting new VMBus
messages and channel events in VTL2 etc.
Co-developed-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Co-developed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
Remove newlines and tabs from the annotation macros so the invoking code
can insert them as needed to match the style of the surrounding code.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/66305834c2eb78f082217611b756231ae9c0b555.1764694625.git.jpoimboe@kernel.org
|
|
Instead of manually annotating each __ex_table entry, just make the
section mergeable and store the entry size in the ELF section header.
Either way works for objtool create_fake_symbols(), this way produces
cleaner code generation.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/b858cb7891c1ba0080e22a9c32595e6c302435e2.1764694625.git.jpoimboe@kernel.org
|
|
Instead of manually annotating each .altinstructions entry, just make
the section mergeable and store the entry size in the ELF section
header.
Either way works for objtool create_fake_symbols(), this way produces
cleaner code generation.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/5ac04e6db5be6453dce8003a771ebb0c47b4cd7a.1764694625.git.jpoimboe@kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CPU feature updates from Dave Hansen:
"The biggest thing of note here is Linear Address Space Separation
(LASS). It represents the first time I can think of that the
upper=>kernel/lower=>user address space convention is actually
recognized by the hardware on x86. It ensures that userspace can not
even get the hardware to _start_ page walks for the kernel address
space. This, of course, is a really nice generic side channel defense.
This is really only a down payment on LASS support. There are still
some details to work out in its interaction with EFI calls and
vsyscall emulation. For now, LASS is disabled if either of those
features is compiled in (which is almost always the case).
There's also one straggler commit in here which converts an
under-utilized AMD CPU feature leaf into a generic Linux-defined leaf
so more feature can be packed in there.
Summary:
- Enable Linear Address Space Separation (LASS)
- Change X86_FEATURE leaf 17 from an AMD leaf to Linux-defined"
* tag 'x86_cpu_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Enable LASS during CPU initialization
selftests/x86: Update the negative vsyscall tests to expect a #GP
x86/traps: Communicate a LASS violation in #GP message
x86/kexec: Disable LASS during relocate kernel
x86/alternatives: Disable LASS when patching kernel code
x86/asm: Introduce inline memcpy and memset
x86/cpu: Add an LASS dependency on SMAP
x86/cpufeatures: Enumerate the LASS feature bits
x86/cpufeatures: Make X86_FEATURE leaf 17 Linux-specific
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry update from Dave Hansen:
"This one is pretty trivial: fix a badly-named FRED data structure
member"
* tag 'x86_entry_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fred: Fix 64bit identifier in fred_ss
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Dave Hansen:
"The most significant are some changes to ensure that symbols exported
for KVM are used only by KVM modules themselves, along with some
related cleanups.
In true x86/misc fashion, the other patch is completely unrelated and
just enhances an existing pr_warn() to make it clear to users how they
have tainted their kernel when something is mucking with MSRs.
Summary:
- Make MSR-induced taint easier for users to track down
- Restrict KVM-specific exports to KVM itself"
* tag 'x86_misc_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Restrict KVM-induced symbol exports to KVM modules where obvious/possible
x86/mm: Drop unnecessary export of "ptdump_walk_pgd_level_debugfs"
x86/mtrr: Drop unnecessary export of "mtrr_state"
x86/bugs: Drop unnecessary export of "x86_spec_ctrl_base"
x86/msr: Add CPU_OUT_OF_SPEC taint name to "unrecognized" pr_warn(msg)
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX updates from Dave HansenL
"The main content here is adding support for the new EUPDATESVN SGX
ISA. Before this, folks who updated microcode had to reboot before
enclaves could attest to the new microcode. The new functionality lets
them do this without a reboot.
The rest are some nice, but relatively mundane comment and kernel-doc
fixups.
Summary:
- Allow security version (SVN) updates so enclaves can attest to new
microcode
- Fix kernel docs typos"
* tag 'x86_sgx_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Fix a typo in the kernel-doc comment for enum sgx_attribute
x86/sgx: Remove superfluous asterisk from copyright comment in asm/sgx.h
x86/sgx: Document structs and enums with '@', not '%'
x86/sgx: Add kernel-doc descriptions for params passed to vDSO user handler
x86/sgx: Add a missing colon in kernel-doc markup for "struct sgx_enclave_run"
x86/sgx: Enable automatic SVN updates for SGX enclaves
x86/sgx: Implement ENCLS[EUPDATESVN]
x86/sgx: Define error codes for use by ENCLS[EUPDATESVN]
x86/cpufeatures: Add X86_FEATURE_SGX_EUPDATESVN feature flag
x86/sgx: Introduce functions to count the sgx_(vepc_)open()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Borislav Petkov:
- Use the proper accessors when reading CR3 as part of the page level
transitions (5-level to 4-level, the use case being kexec) so that
only the physical address in CR3 is picked up and not flags which are
above the physical mask shift
- Clean up and unify __phys_addr_symbol() definitions
* tag 'x86_mm_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/libstub: Fix page table access in 5-level to 4-level paging transition
x86/boot: Fix page table access in 5-level to 4-level paging transition
x86/mm: Unify __phys_addr_symbol()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Largely cleanups along with a change to save XSS to the GHCB
(Guest-Host Communication Block) in SEV-ES guests so that the
hypervisor can determine the guest's XSAVES buffer size properly
and thus support shadow stacks in AMD confidential guests
* tag 'x86_sev_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cc: Fix enum spelling to fix kernel-doc warnings
x86/boot: Drop unused sev_enable() fallback
x86/coco/sev: Convert has_cpuflag() to use cpu_feature_enabled()
x86/sev: Include XSS value in GHCB CPUID request
x86/boot: Move boot_*msr helpers to asm/shared/msr.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
- The mandatory pile of cleanups the cat drags in every merge window
* tag 'x86_cleanups_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Clean up whitespace in a20.c
x86/mm: Delete disabled debug code
x86/{boot,mtrr}: Remove unused function declarations
x86/percpu: Use BIT_WORD() and BIT_MASK() macros
x86/cpufeatures: Correct LKGS feature flag description
x86/idtentry: Add missing '*' to kernel-doc lines
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Add support for AMD's Smart Data Cache Injection feature which allows
for direct insertion of data from I/O devices into the L3 cache, thus
bypassing DRAM and saving its bandwidth; the resctrl side of the
feature allows the size of the L3 used for data injection to be
controlled
- Add Intel Clearwater Forest to the list of CPUs which support
Sub-NUMA clustering
- Other fixes and cleanups
* tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/resctrl: Update bit_usage to reflect io_alloc
fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks
fs/resctrl: Modify struct rdt_parse_data to pass mode and CLOSID
fs/resctrl: Introduce interface to display io_alloc CBMs
fs/resctrl: Add user interface to enable/disable io_alloc feature
fs/resctrl: Introduce interface to display "io_alloc" support
x86,fs/resctrl: Implement "io_alloc" enable/disable handlers
x86,fs/resctrl: Detect io_alloc feature
x86/resctrl: Add SDCIAE feature in the command line options
x86/cpufeatures: Add support for L3 Smart Data Cache Injection Allocation Enforcement
fs/resctrl: Consider sparse masks when initializing new group's allocation
x86/resctrl: Support Sub-NUMA Cluster (SNC) mode on Clearwater Forest
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loading updates from Borislav Petkov:
- Add microcode staging support on Intel: it moves the sole microcode
blobs loading to a non-critical path so that microcode loading
latencies are kept at minimum. The actual "directing" the hardware to
load microcode is the only step which is done on the critical path.
This scheme is also opportunistic as in: on a failure, the machinery
falls back to normal loading
- Add the capability to the AMD side of the loader to select one of two
per-family/model/stepping patches: one is pre-Entrysign and the other
is post-Entrysign; with the goal to take care of machines which
haven't updated their BIOS yet - something they should absolutely do
as this is the only proper Entrysign fix
- Other small cleanups and fixlets
* tag 'x86_microcode_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Mark early_parse_cmdline() as __init
x86/microcode/AMD: Select which microcode patch to load
x86/microcode/intel: Enable staging when available
x86/microcode/intel: Support mailbox transfer
x86/microcode/intel: Implement staging handler
x86/microcode/intel: Define staging state struct
x86/microcode/intel: Establish staging control logic
x86/microcode: Introduce staging step to reduce late-loading time
x86/cpu/topology: Make primary thread mask available with SMP=n
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- The second part of the AMD MCA interrupts rework after the
last-minute show-stopper from the last merge window was sorted out.
After this, the AMD MCA deferred errors, thresholding and corrected
errors interrupt handlers use common MCA code and are tightly
integrated into the core MCA code, thereby getting rid of
considerable duplication. All culminating into allowing CMCI error
thresholding storms to be detected at AMD too, using the common
infrastructure
- Add support for two new MCA bank bits on AMD Zen6 which denote
whether the error address logged is a system physical address, which
obviates the need for it to be translated before further error
recovery can be done
* tag 'ras_core_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Handle AMD threshold interrupt storms
x86/mce: Do not clear bank's poll bit in mce_poll_banks on AMD SMCA systems
x86/mce: Add support for physical address valid bit
x86/mce: Save and use APEI corrected threshold limit
x86/mce/amd: Define threshold restart function for banks
x86/mce/amd: Remove redundant reset_block()
x86/mce/amd: Support SMCA Corrected Error Interrupt
x86/mce/amd: Enable interrupt vectors once per-CPU on SMCA systems
x86/mce: Unify AMD DFR handler with MCA Polling
x86/mce: Unify AMD THR handler with MCA Polling
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq updates from Thomas Gleixner:
"A large overhaul of the restartable sequences and CID management:
The recent enablement of RSEQ in glibc resulted in regressions which
are caused by the related overhead. It turned out that the decision to
invoke the exit to user work was not really a decision. More or less
each context switch caused that. There is a long list of small issues
which sums up nicely and results in a 3-4% regression in I/O
benchmarks.
The other detail which caused issues due to extra work in context
switch and task migration is the CID (memory context ID) management.
It also requires to use a task work to consolidate the CID space,
which is executed in the context of an arbitrary task and results in
sporadic uncontrolled exit latencies.
The rewrite addresses this by:
- Removing deprecated and long unsupported functionality
- Moving the related data into dedicated data structures which are
optimized for fast path processing.
- Caching values so actual decisions can be made
- Replacing the current implementation with a optimized inlined
variant.
- Separating fast and slow path for architectures which use the
generic entry code, so that only fault and error handling goes into
the TIF_NOTIFY_RESUME handler.
- Rewriting the CID management so that it becomes mostly invisible in
the context switch path. That moves the work of switching modes
into the fork/exit path, which is a reasonable tradeoff. That work
is only required when a process creates more threads than the
cpuset it is allowed to run on or when enough threads exit after
that. An artificial thread pool benchmarks which triggers this did
not degrade, it actually improved significantly.
The main effect in migration heavy scenarios is that runqueue lock
held time and therefore contention goes down significantly"
* tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched/mmcid: Switch over to the new mechanism
sched/mmcid: Implement deferred mode change
irqwork: Move data struct to a types header
sched/mmcid: Provide CID ownership mode fixup functions
sched/mmcid: Provide new scheduler CID mechanism
sched/mmcid: Introduce per task/CPU ownership infrastructure
sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
sched/mmcid: Provide precomputed maximal value
sched/mmcid: Move initialization out of line
signal: Move MMCID exit out of sighand lock
sched/mmcid: Convert mm CID mask to a bitmap
cpumask: Cache num_possible_cpus()
sched/mmcid: Use cpumask_weighted_or()
cpumask: Introduce cpumask_weighted_or()
sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
sched/mmcid: Move scheduler code out of global header
sched: Fixup whitespace damage
sched/mmcid: Cacheline align MM CID storage
sched/mmcid: Use proper data structures
sched/mmcid: Revert the complex CID management
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scoped user access updates from Thomas Gleixner:
"Scoped user mode access and related changes:
- Implement the missing u64 user access function on ARM when
CONFIG_CPU_SPECTRE=n.
This makes it possible to access a 64bit value in generic code with
[unsafe_]get_user(). All other architectures and ARM variants
provide the relevant accessors already.
- Ensure that ASM GOTO jump label usage in the user mode access
helpers always goes through a local C scope label indirection
inside the helpers.
This is required because compilers are not supporting that a ASM
GOTO target leaves a auto cleanup scope. GCC silently fails to emit
the cleanup invocation and CLANG fails the build.
[ Editor's note: gcc-16 will have fixed the code generation issue
in commit f68fe3ddda4 ("eh: Invoke cleanups/destructors in asm
goto jumps [PR122835]"). But we obviously have to deal with clang
and older versions of gcc, so.. - Linus ]
This provides generic wrapper macros and the conversion of affected
architecture code to use them.
- Scoped user mode access with auto cleanup
Access to user mode memory can be required in hot code paths, but
if it has to be done with user controlled pointers, the access is
shielded with a speculation barrier, so that the CPU cannot
speculate around the address range check. Those speculation
barriers impact performance quite significantly.
This cost can be avoided by "masking" the provided pointer so it is
guaranteed to be in the valid user memory access range and
otherwise to point to a guaranteed unpopulated address space. This
has to be done without branches so it creates an address dependency
for the access, which the CPU cannot speculate ahead.
This results in repeating and error prone programming patterns:
if (can_do_masked_user_access())
from = masked_user_read_access_begin((from));
else if (!user_read_access_begin(from, sizeof(*from)))
return -EFAULT;
unsafe_get_user(val, from, Efault);
user_read_access_end();
return 0;
Efault:
user_read_access_end();
return -EFAULT;
which can be replaced with scopes and automatic cleanup:
scoped_user_read_access(from, Efault)
unsafe_get_user(val, from, Efault);
return 0;
Efault:
return -EFAULT;
- Convert code which implements the above pattern over to
scope_user.*.access(). This also corrects a couple of imbalanced
masked_*_begin() instances which are harmless on most
architectures, but prevent PowerPC from implementing the masking
optimization.
- Add a missing speculation barrier in copy_from_user_iter()"
* tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lib/strn*,uaccess: Use masked_user_{read/write}_access_begin when required
scm: Convert put_cmsg() to scoped user access
iov_iter: Add missing speculation barrier to copy_from_user_iter()
iov_iter: Convert copy_from_user_iter() to masked user access
select: Convert to scoped user access
x86/futex: Convert to scoped user access
futex: Convert to get/put_user_inline()
uaccess: Provide put/get_user_inline()
uaccess: Provide scoped user access regions
arm64: uaccess: Use unsafe wrappers for ASM GOTO
s390/uaccess: Use unsafe wrappers for ASM GOTO
riscv/uaccess: Use unsafe wrappers for ASM GOTO
powerpc/uaccess: Use unsafe wrappers for ASM GOTO
x86/uaccess: Use unsafe wrappers for ASM GOTO
uaccess: Provide ASM GOTO safe wrappers for unsafe_*_user()
ARM: uaccess: Implement missing __get_user_asm_dword()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull bug handling infrastructure updates from Ingo Molnar:
"Core updates:
- Improve WARN(), which has vararg printf like arguments, to work
with the x86 #UD based WARN-optimizing infrastructure by hiding the
format in the bug_table and replacing this first argument with the
address of the bug-table entry, while making the actual function
that's called a UD1 instruction (Peter Zijlstra)
- Introduce the CONFIG_DEBUG_BUGVERBOSE_DETAILED Kconfig switch (Ingo
Molnar, s390 support by Heiko Carstens)
Fixes and cleanups:
- bugs/s390: Remove private WARN_ON() implementation (Heiko Carstens)
- <asm/bugs.h>: Make i386 use GENERIC_BUG_RELATIVE_POINTERS (Peter
Zijlstra)"
* tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
x86/bugs: Make i386 use GENERIC_BUG_RELATIVE_POINTERS
x86/bug: Fix BUG_FORMAT vs KASLR
x86_64/bug: Inline the UD1
x86/bug: Implement WARN_ONCE()
x86_64/bug: Implement __WARN_printf()
x86/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED
x86/bug: Add BUG_FORMAT basics
bug: Allow architectures to provide __WARN_printf()
bug: Implement WARN_ON() using __WARN_FLAGS()
bug: Add report_bug_entry()
bug: Add BUG_FORMAT_ARGS infrastructure
bug: Clean up CONFIG_GENERIC_BUG_RELATIVE_POINTERS
bug: Add BUG_FORMAT infrastructure
x86: Rework __bug_table helpers
bugs/s390: Remove private WARN_ON() implementation
bugs/core: Reorganize fields in the first line of WARNING output, add ->comm[] output
bugs/sh: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/parisc: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/riscv: Concatenate 'cond_str' with '__FILE__' in __BUG_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/riscv: Pass in 'cond_str' to __BUG_FLAGS()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Scalability and load-balancing improvements:
- Enable scheduler feature NEXT_BUDDY (Mel Gorman)
- Reimplement NEXT_BUDDY to align with EEVDF goals (Mel Gorman)
- Skip sched_balance_running cmpxchg when balance is not due (Tim
Chen)
- Implement generic code for architecture specific sched domain NUMA
distances (Tim Chen)
- Optimize the NUMA distances of the sched-domains builds of Intel
Granite Rapids (GNR) and Clearwater Forest (CWF) platforms (Tim
Chen)
- Implement proportional newidle balance: a randomized algorithm that
runs newidle balancing proportional to its success rate. (Peter
Zijlstra)
Scheduler infrastructure changes:
- Implement the 'sched_change' scoped_guard() pattern for the entire
scheduler (Peter Zijlstra)
- More broadly utilize the sched_change guard (Peter Zijlstra)
- Add support to pick functions to take runqueue-flags (Joel
Fernandes)
- Provide and use set_need_resched_current() (Peter Zijlstra)
Fair scheduling enhancements:
- Forfeit vruntime on yield (Fernand Sieber)
- Only update stats for allowed CPUs when looking for dst group (Adam
Li)
CPU-core scheduling enhancements:
- Optimize core cookie matching check (Fernand Sieber)
Deadline scheduler fixes:
- Only set free_cpus for online runqueues (Doug Berger)
- Fix dl_server time accounting (Peter Zijlstra)
- Fix dl_server stop condition (Peter Zijlstra)
Proxy scheduling fixes:
- Yield the donor task (Fernand Sieber)
Fixes and cleanups:
- Fix do_set_cpus_allowed() locking (Peter Zijlstra)
- Fix migrate_disable_switch() locking (Peter Zijlstra)
- Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
(Hao Jia)
- Increase sched_tick_remote timeout (Phil Auld)
- sched/deadline: Use cpumask_weight_and() in dl_bw_cpus() (Shrikanth
Hegde)
- sched/deadline: Clean up select_task_rq_dl() (Shrikanth Hegde)"
* tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
sched: Provide and use set_need_resched_current()
sched/fair: Proportional newidle balance
sched/fair: Small cleanup to update_newidle_cost()
sched/fair: Small cleanup to sched_balance_newidle()
sched/fair: Revert max_newidle_lb_cost bump
sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
sched/fair: Enable scheduler feature NEXT_BUDDY
sched: Increase sched_tick_remote timeout
sched/fair: Have SD_SERIALIZE affect newidle balancing
sched/fair: Skip sched_balance_running cmpxchg when balance is not due
sched/deadline: Minor cleanup in select_task_rq_dl()
sched/deadline: Use cpumask_weight_and() in dl_bw_cpus
sched/deadline: Document dl_server
sched/deadline: Fix dl_server stop condition
sched/deadline: Fix dl_server time accounting
sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
sched/eevdf: Fix min_vruntime vs avg_vruntime
sched/core: Add comment explaining force-idle vruntime snapshots
sched/core: Optimize core cookie matching check
sched/proxy: Yield the donor task
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"Callchain support:
- Add support for deferred user-space stack unwinding for perf,
enabled on x86. (Peter Zijlstra, Steven Rostedt)
- unwind_user/x86: Enable frame pointer unwinding on x86 (Josh
Poimboeuf)
x86 PMU support and infrastructure:
- x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra)
- x86/insn,uprobes,alternative: Unify insn_is_nop() (Peter Zijlstra)
Intel PMU driver:
- Large series to prepare for and implement architectural PEBS
support for Intel platforms such as Clearwater Forest (CWF) and
Panther Lake (PTL). (Dapeng Mi, Kan Liang)
- Check dynamic constraints (Kan Liang)
- Optimize PEBS extended config (Peter Zijlstra)
- cstates:
- Remove PC3 support from LunarLake (Zhang Rui)
- Add Pantherlake support (Zhang Rui)
- Clearwater Forest support (Zide Chen)
AMD PMU driver:
- x86/amd: Check event before enable to avoid GPF (George Kennedy)
Fixes and cleanups:
- task_work: Fix NMI race condition (Peter Zijlstra)
- perf/x86: Fix NULL event access and potential PEBS record loss
(Dapeng Mi)
- Misc other fixes and cleanups (Dapeng Mi, Ingo Molnar, Peter
Zijlstra)"
* tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
perf/x86/intel: Fix and clean up intel_pmu_drain_arch_pebs() type use
perf/x86/intel: Optimize PEBS extended config
perf/x86/intel: Check PEBS dyn_constraints
perf/x86/intel: Add a check for dynamic constraints
perf/x86/intel: Add counter group support for arch-PEBS
perf/x86/intel: Setup PEBS data configuration and enable legacy groups
perf/x86/intel: Update dyn_constraint base on PEBS event precise level
perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
perf/x86/intel: Process arch-PEBS records or record fragments
perf/x86/intel/ds: Factor out PEBS group processing code to functions
perf/x86/intel/ds: Factor out PEBS record processing code to functions
perf/x86/intel: Initialize architectural PEBS
perf/x86/intel: Correct large PEBS flag check
perf/x86/intel: Replace x86_pmu.drain_pebs calling with static call
perf/x86: Fix NULL event access and potential PEBS record loss
perf/x86: Remove redundant is_x86_event() prototype
entry,unwind/deferred: Fix unwind_reset_info() placement
unwind_user/x86: Fix arch=um build
perf: Support deferred user unwind
unwind_user/x86: Teach FP unwind about start of function
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
- klp-build livepatch module generation (Josh Poimboeuf)
Introduce new objtool features and a klp-build script to generate
livepatch modules using a source .patch as input.
This builds on concepts from the longstanding out-of-tree kpatch
project which began in 2012 and has been used for many years to
generate livepatch modules for production kernels. However, this is a
complete rewrite which incorporates hard-earned lessons from 12+
years of maintaining kpatch.
Key improvements compared to kpatch-build:
- Integrated with objtool: Leverages objtool's existing control-flow
graph analysis to help detect changed functions.
- Works on vmlinux.o: Supports late-linked objects, making it
compatible with LTO, IBT, and similar.
- Simplified code base: ~3k fewer lines of code.
- Upstream: No more out-of-tree #ifdef hacks, far less cruft.
- Cleaner internals: Vastly simplified logic for
symbol/section/reloc inclusion and special section extraction.
- Robust __LINE__ macro handling: Avoids false positive binary diffs
caused by the __LINE__ macro by introducing a fix-patch-lines
script which injects #line directives into the source .patch to
preserve the original line numbers at compile time.
- Disassemble code with libopcodes instead of running objdump
(Alexandre Chartre)
- Disassemble support (-d option to objtool) by Alexandre Chartre,
which supports the decoding of various Linux kernel code generation
specials such as alternatives:
17ef: sched_balance_find_dst_group+0x62f mov 0x34(%r9),%edx
17f3: sched_balance_find_dst_group+0x633 | <alternative.17f3> | X86_FEATURE_POPCNT
17f3: sched_balance_find_dst_group+0x633 | call 0x17f8 <__sw_hweight64> | popcnt %rdi,%rax
17f8: sched_balance_find_dst_group+0x638 cmp %eax,%edx
... jump table alternatives:
1895: sched_use_asym_prio+0x5 test $0x8,%ch
1898: sched_use_asym_prio+0x8 je 0x18a9 <sched_use_asym_prio+0x19>
189a: sched_use_asym_prio+0xa | <jump_table.189a> | JUMP
189a: sched_use_asym_prio+0xa | jmp 0x18ae <sched_use_asym_prio+0x1e> | nop2
189c: sched_use_asym_prio+0xc mov $0x1,%eax
18a1: sched_use_asym_prio+0x11 and $0x80,%ecx
... exception table alternatives:
native_read_msr:
5b80: native_read_msr+0x0 mov %edi,%ecx
5b82: native_read_msr+0x2 | <ex_table.5b82> | EXCEPTION
5b82: native_read_msr+0x2 | rdmsr | resume at 0x5b84 <native_read_msr+0x4>
5b84: native_read_msr+0x4 shl $0x20,%rdx
.... x86 feature flag decoding (also see the X86_FEATURE_POPCNT
example in sched_balance_find_dst_group() above):
2faaf: start_thread_common.constprop.0+0x1f jne 0x2fba4 <start_thread_common.constprop.0+0x114>
2fab5: start_thread_common.constprop.0+0x25 | <alternative.2fab5> | X86_FEATURE_ALWAYS | X86_BUG_NULL_SEG
2fab5: start_thread_common.constprop.0+0x25 | jmp 0x2faba <.altinstr_aux+0x2f4> | jmp 0x4b0 <start_thread_common.constprop.0+0x3f> | nop5
2faba: start_thread_common.constprop.0+0x2a mov $0x2b,%eax
... NOP sequence shortening:
1048e2: snapshot_write_finalize+0xc2 je 0x104917 <snapshot_write_finalize+0xf7>
1048e4: snapshot_write_finalize+0xc4 nop6
1048ea: snapshot_write_finalize+0xca nop11
1048f5: snapshot_write_finalize+0xd5 nop11
104900: snapshot_write_finalize+0xe0 mov %rax,%rcx
104903: snapshot_write_finalize+0xe3 mov 0x10(%rdx),%rax
... and much more.
- Function validation tracing support (Alexandre Chartre)
- Various -ffunction-sections fixes (Josh Poimboeuf)
- Clang AutoFDO (Automated Feedback-Directed Optimizations) support
(Josh Poimboeuf)
- Misc fixes and cleanups (Borislav Petkov, Chen Ni, Dylan Hatch, Ingo
Molnar, John Wang, Josh Poimboeuf, Pankaj Raghav, Peter Zijlstra,
Thorsten Blum)
* tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
objtool: Fix segfault on unknown alternatives
objtool: Build with disassembly can fail when including bdf.h
objtool: Trim trailing NOPs in alternative
objtool: Add wide output for disassembly
objtool: Compact output for alternatives with one instruction
objtool: Improve naming of group alternatives
objtool: Add Function to get the name of a CPU feature
objtool: Provide access to feature and flags of group alternatives
objtool: Fix address references in alternatives
objtool: Disassemble jump table alternatives
objtool: Disassemble exception table alternatives
objtool: Print addresses with alternative instructions
objtool: Disassemble group alternatives
objtool: Print headers for alternatives
objtool: Preserve alternatives order
objtool: Add the --disas=<function-pattern> action
objtool: Do not validate IBT for .return_sites and .call_sites
objtool: Improve tracing of alternative instructions
objtool: Add functions to better name alternatives
objtool: Identify the different types of alternatives
...
|
|
Fix most (17) kernel-doc warnings in x86_init.h (except for struct
x86_init_ops). The changes are:
- fix struct member name typos
- add ending ':' to struct member names
- add some missing struct member descriptions
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251129002524.1196500-1-rdunlap@infradead.org
|
|
Linus figured less #ifdef is more better and making x86-32 use
GENERIC_BUG_RELATIVE_POINTERS removes one layer of macro magic from
the bug.h bits.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Encoding a relative NULL pointer doesn't work for KASLR, when the
whole kernel image gets shifted, the __bug_table and the target string
get shifted by the same amount and the relative offset is preserved.
However when the target is an absolute 0 value and the __bug_table
gets moved about, the end result in a pointer equivalent to
kaslr_offset(), not NULL.
Notably, this will generate SHN_UNDEF relocations, and Ard would
really like to not have those at all.
Use the empty string to denote no-string.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|