summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)AuthorFilesLines
2025-04-11x86/alternatives: Rename 'text_poke_flush()' to 'smp_text_poke_batch_flush()'Ingo Molnar1-3/+3
This name is actually actively confusing, because the simple text_poke*() APIs use MM-switching based code patching, while text_poke_flush() is part of the INT3 based text_poke_int3_*() machinery that is an additional layer of functionality on top of regular text_poke*() functionality. Rename it to smp_text_poke_batch_flush() to make it clear which layer it belongs to. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-15-mingo@kernel.org
2025-04-11x86/alternatives: Remove the confusing, inaccurate & unnecessary ↵Ingo Molnar1-14/+10
'temp_mm_state_t' abstraction So the temp_mm_state_t abstraction used by use_temporary_mm() and unuse_temporary_mm() is super confusing: - The whole machinery is about temporarily switching to the text_poke_mm utility MM that got allocated during bootup for text-patching purposes alone: temp_mm_state_t prev; /* * Loading the temporary mm behaves as a compiler barrier, which * guarantees that the PTE will be set at the time memcpy() is done. */ prev = use_temporary_mm(text_poke_mm); - Yet the value that gets saved in the temp_mm_state_t variable is not the temporary MM ... but the previous MM... - Ie. we temporarily put the non-temporary MM into a variable that has the temp_mm_state_t type. This makes no sense whatsoever. - The confusion continues in unuse_temporary_mm(): static inline void unuse_temporary_mm(temp_mm_state_t prev_state) Here we unuse an MM that is ... not the temporary MM, but the previous MM. :-/ Fix up all this confusion by removing the unnecessary layer of abstraction and using a bog-standard 'struct mm_struct *prev_mm' variable to save the MM to. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-14-mingo@kernel.org
2025-04-11x86/alternatives: Remove duplicate 'text_poke_early()' prototypeIngo Molnar1-1/+0
It's declared in <asm/text-patching.h> already. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-12-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'bp_desc' to 'int3_desc'Ingo Molnar1-6/+6
Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-11-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'poking_addr' to 'text_poke_mm_addr'Ingo Molnar1-8/+8
Put it into the text_poke_* namespace of <asm/text-patching.h>. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-10-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'poking_mm' to 'text_poke_mm'Ingo Molnar1-9/+9
Put it into the text_poke_* namespace of <asm/text-patching.h>. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-9-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'poke_int3_handler()' to 'smp_text_poke_int3_handler()'Ingo Molnar2-4/+4
All related functions in this subsystem already have a text_poke_int3_ prefix - add it to the trap handler as well. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-8-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'text_poke_bp()' to 'smp_text_poke_single()'Ingo Molnar5-9/+9
Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-7-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'text_poke_bp_batch()' to ↵Ingo Molnar1-6/+6
'smp_text_poke_batch_process()' Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-6-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'bp_refs' to 'text_poke_array_refs'Ingo Molnar1-7/+7
Make it clear that these reference counts lock access to text_poke_array. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-5-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'struct bp_patching_desc' to 'struct ↵Ingo Molnar1-4/+4
text_poke_int3_vec' Follow the INT3 text-poking nomenclature, and also adopt the 'vector' name for the entire object, instead of the rather opaque 'descriptor' naming. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-4-mingo@kernel.org
2025-04-11x86/alternatives: Document the text_poke_bp_batch() synchronization rules a ↵Peter Zijlstra1-0/+7
bit more Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20250411054105.2341982-3-mingo@kernel.org
2025-04-11x86/alternatives: Improve code-patching scalability by removing false ↵Eric Dumazet1-12/+18
sharing in poke_int3_handler() eBPF programs can be run 50,000,000 times per second on busy servers. Whenever /proc/sys/kernel/bpf_stats_enabled is turned off, hundreds of calls sites are patched from text_poke_bp_batch() and we see a huge loss of performance due to false sharing on bp_desc.refs lasting up to three seconds. 51.30% server_bin [kernel.kallsyms] [k] poke_int3_handler | |--46.45%--poke_int3_handler | exc_int3 | asm_exc_int3 | | | |--24.26%--cls_bpf_classify | | tcf_classify | | __dev_queue_xmit | | ip6_finish_output2 | | ip6_output | | ip6_xmit | | inet6_csk_xmit | | __tcp_transmit_skb Fix this by replacing bp_desc.refs with a per-cpu bp_refs. Before the patch, on a host with 240 cores (480 threads): $ sysctl -wq kernel.bpf_stats_enabled=0 text_poke_bp_batch(nr_entries=164) : Took 2655300 usec $ bpftool prog | grep run_time_ns ... 105: sched_cls name hn_egress tag 699fc5eea64144e3 gpl run_time_ns 3009063719 run_cnt 82757845 : average cost is 36 nsec per call After this patch: $ sysctl -wq kernel.bpf_stats_enabled=0 text_poke_bp_batch(nr_entries=164) : Took 702 usec $ bpftool prog | grep run_time_ns ... 105: sched_cls name hn_egress tag 699fc5eea64144e3 gpl run_time_ns 1928223019 run_cnt 67682728 : average cost is 28 nsec per call Ie. text-patching performance improved 3700x: from 2.65 seconds to 0.0007 seconds. Since the atomic_cond_read_acquire(refs, !VAL) spin-loop was not triggered even once in my tests, add an unlikely() annotation, because this appears to be the common case. [ mingo: Improved the changelog some more. ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20250411054105.2341982-2-mingo@kernel.org
2025-04-11x86/i8253: Call clockevent_i8253_disable() with interrupts disabledFernando Fernandez Mancera1-1/+2
There's a lockdep false positive warning related to i8253_lock: WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ... systemd-sleep/3324 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: ffffffffb2c23398 (i8253_lock){+.+.}-{2:2}, at: pcspkr_event+0x3f/0xe0 [pcspkr] ... ... which became HARDIRQ-irq-unsafe at: ... lock_acquire+0xd0/0x2f0 _raw_spin_lock+0x30/0x40 clockevent_i8253_disable+0x1c/0x60 pit_timer_init+0x25/0x50 hpet_time_init+0x46/0x50 x86_late_time_init+0x1b/0x40 start_kernel+0x962/0xa00 x86_64_start_reservations+0x24/0x30 x86_64_start_kernel+0xed/0xf0 common_startup_64+0x13e/0x141 ... Lockdep complains due pit_timer_init() using the lock in an IRQ-unsafe fashion, but it's a false positive, because there is no deadlock possible at that point due to init ordering: at the point where pit_timer_init() is called there is no other possible usage of i8253_lock because the system is still in the very early boot stage with no interrupts. But in any case, pit_timer_init() should disable interrupts before calling clockevent_i8253_disable() out of general principle, and to keep lockdep working even in this scenario. Use scoped_guard() for that, as suggested by Thomas Gleixner. [ mingo: Cleaned up the changelog. ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/Z-uwd4Bnn7FcCShX@gmail.com
2025-04-10x86/kexec: Invalidate GDT/IDT from relocate_kernel() instead of earlierDavid Woodhouse2-10/+9
Reduce the window during which exceptions are unhandled, by leaving the GDT/IDT in place all the way into the relocate_kernel() function, until the moment that %cr3 gets replaced. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250326142404.256980-4-dwmw2@infradead.org
2025-04-10x86/kexec: Add 8250 MMIO serial port outputDavid Woodhouse3-0/+42
This supports the same 32-bit MMIO-mapped 8250 as the early_printk code. It's not clear why the early_printk code supports this form and only this form; the actual runtime 8250_pci doesn't seem to support it. But having hacked up QEMU to expose such a device, early_printk does work with it, and now so does the kexec debug code. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250326142404.256980-3-dwmw2@infradead.org
2025-04-10x86/kexec: Add 8250 serial port outputDavid Woodhouse2-6/+39
If a serial port was configured for early_printk, use it for debug output from the relocate_kernel exception handler too. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250326142404.256980-2-dwmw2@infradead.org
2025-04-10x86/msr: Rename 'wrmsrl_cstar()' to 'wrmsrq_cstar()'Ingo Molnar1-3/+3
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'native_wrmsrl()' to 'native_wrmsrq()'Ingo Molnar2-2/+2
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'wrmsrl_amd_safe()' to 'wrmsrq_amd_safe()'Ingo Molnar1-2/+2
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl_amd_safe()' to 'rdmsrq_amd_safe()'Ingo Molnar1-2/+2
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'mce_wrmsrl()' to 'mce_wrmsrq()'Ingo Molnar1-6/+6
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'mce_rdmsrl()' to 'mce_rdmsrq()'Ingo Molnar2-17/+17
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'wrmsrl_on_cpu()' to 'wrmsrq_on_cpu()'Ingo Molnar1-1/+1
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl_on_cpu()' to 'rdmsrq_on_cpu()'Ingo Molnar2-4/+4
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'wrmsrl_safe_on_cpu()' to 'wrmsrq_safe_on_cpu()'Ingo Molnar1-1/+1
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl_safe_on_cpu()' to 'rdmsrq_safe_on_cpu()'Ingo Molnar1-3/+3
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'wrmsrl_safe()' to 'wrmsrq_safe()'Ingo Molnar6-15/+15
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl_safe()' to 'rdmsrq_safe()'Ingo Molnar10-23/+23
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'wrmsrl()' to 'wrmsrq()'Ingo Molnar29-129/+129
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl()' to 'rdmsrq()'Ingo Molnar33-93/+93
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Use u64 in rdmsrl_amd_safe() and wrmsrl_amd_safe()Ingo Molnar1-2/+2
Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-09Merge tag 'v6.15-rc1' into x86/mm, to pick up fixesIngo Molnar10-54/+87
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-09Merge tag 'v6.15-rc1' into x86/asm, to refresh the branchIngo Molnar30-753/+900
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-09x86/cacheinfo: Properly parse CPUID(0x80000006) L2/L3 associativityAhmed S. Darwish1-2/+8
Complete the AMD CPUID(4) emulation logic, which uses CPUID(0x80000006) for L2/L3 cache info and an assocs[] associativity mapping array, by adding entries for 3-way caches and 6-way caches. Properly handle the case where CPUID(0x80000006) returns an L2/L3 associativity of 9. This is not real associativity, but a marker to indicate that the respective L2/L3 cache information should be retrieved from CPUID(0x8000001d) instead. If such a marker is encountered, return early from legacy_amd_cpuid4(), thus effectively emulating an "invalid index" CPUID(4) response with a cache type of zero. When checking if CPUID(0x80000006) L2/L3 cache info output is valid, and given the associtivity marker 9 above, do not just check if the whole ECX/EDX register is zero. Rather, check if the associativity is zero or 9. An associativity of zero implies no L2/L3 cache, which make it the more correct check anyway vs. a zero check of the whole output register. Fixes: a326e948c538 ("x86, cacheinfo: Fixup L3 cache information for AMD multi-node processors") Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: x86-cpuid@lists.linux.dev Link: https://lore.kernel.org/r/20250409122233.1058601-3-darwi@linutronix.de
2025-04-09x86/cacheinfo: Properly parse CPUID(0x80000005) L1d/L1i associativityAhmed S. Darwish1-3/+6
For the AMD CPUID(4) emulation cache info logic, the same associativity mapping array, assocs[], is used for both CPUID(0x80000005) and CPUID(0x80000006). This is incorrect since per the AMD manuals, the mappings for CPUID(0x80000005) L1d/L1i associativity is: n = 0x1 -> 0xfe n n = 0xff fully associative while assocs[] maps these values to: n = 0x1, 0x2, 0x4 n n = 0x3, 0x7, 0x9 0 n = 0x6 8 n = 0x8 16 n = 0xa 32 n = 0xb 48 n = 0xc 64 n = 0xd 96 n = 0xe 128 n = 0xf fully associative which is only valid for CPUID(0x80000006). Parse CPUID(0x80000005) L1d/L1i associativity values as shown in the AMD manuals. Since the 0xffff literal is used to denote full associativity at the AMD CPUID(4)-emulation logic, define AMD_CPUID4_FULLY_ASSOCIATIVE for it instead of spreading that literal in more places. Mark the assocs[] mapping array as only valid for CPUID(0x80000006) L2/L3 cache information. Fixes: a326e948c538 ("x86, cacheinfo: Fixup L3 cache information for AMD multi-node processors") Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: x86-cpuid@lists.linux.dev Link: https://lore.kernel.org/r/20250409122233.1058601-2-darwi@linutronix.de
2025-04-09x86/cpu: Avoid running off the end of an AMD erratum tableDave Hansen1-0/+1
The NULL array terminator at the end of erratum_1386_microcode was removed during the switch from x86_cpu_desc to x86_cpu_id. This causes readers to run off the end of the array. Replace the NULL. Fixes: f3f325152673 ("x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id'") Reported-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
2025-04-09x86/bugs: Add RSB mitigation documentJosh Poimboeuf1-51/+13
Create a document to summarize hard-earned knowledge about RSB-related mitigations, with references, and replace the overly verbose yet incomplete comments with a reference to the document. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/ab73f4659ba697a974759f07befd41ae605e33dd.1744148254.git.jpoimboe@kernel.org
2025-04-09x86/bugs: Don't fill RSB on context switch with eIBRSJosh Poimboeuf1-12/+12
User->user Spectre v2 attacks (including RSB) across context switches are already mitigated by IBPB in cond_mitigation(), if enabled globally or if either the prev or the next task has opted in to protection. RSB filling without IBPB serves no purpose for protecting user space, as indirect branches are still vulnerable. User->kernel RSB attacks are mitigated by eIBRS. In which case the RSB filling on context switch isn't needed, so remove it. Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Reviewed-by: Amit Shah <amit.shah@amd.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/98cdefe42180358efebf78e3b80752850c7a3e1b.1744148254.git.jpoimboe@kernel.org
2025-04-09x86/bugs: Don't fill RSB on VMEXIT with eIBRS+retpolineJosh Poimboeuf1-4/+4
eIBRS protects against guest->host RSB underflow/poisoning attacks. Adding retpoline to the mix doesn't change that. Retpoline has a balanced CALL/RET anyway. So the current full RSB filling on VMEXIT with eIBRS+retpoline is overkill. Disable it or do the VMEXIT_LITE mitigation if needed. Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Reviewed-by: Amit Shah <amit.shah@amd.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Woodhouse <dwmw2@infradead.org> Link: https://lore.kernel.org/r/84a1226e5c9e2698eae1b5ade861f1b8bf3677dc.1744148254.git.jpoimboe@kernel.org
2025-04-09x86/bugs: Fix RSB clearing in indirect_branch_prediction_barrier()Josh Poimboeuf1-1/+0
IBPB is expected to clear the RSB. However, if X86_BUG_IBPB_NO_RET is set, that doesn't happen. Make indirect_branch_prediction_barrier() take that into account by calling write_ibpb() which clears RSB on X86_BUG_IBPB_NO_RET: /* Make sure IBPB clears return stack preductions too. */ FILL_RETURN_BUFFER %rax, RSB_CLEAR_LOOPS, X86_BUG_IBPB_NO_RET Note that, as of the previous patch, write_ibpb() also reads 'x86_pred_cmd' in order to use SBPB when applicable: movl _ASM_RIP(x86_pred_cmd), %eax Therefore that existing behavior in indirect_branch_prediction_barrier() is not lost. Fixes: 50e4b3b94090 ("x86/entry: Have entry_ibpb() invalidate return predictions") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/bba68888c511743d4cd65564d1fc41438907523f.1744148254.git.jpoimboe@kernel.org
2025-04-09x86/bugs: Rename entry_ibpb() to write_ibpb()Josh Poimboeuf1-3/+3
There's nothing entry-specific about entry_ibpb(). In preparation for calling it from elsewhere, rename it to write_ibpb(). Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/1e54ace131e79b760de3fe828264e26d0896e3ac.1744148254.git.jpoimboe@kernel.org
2025-04-09x86/early_printk: Use 'mmio32' for consistency, fix commentsAndy Shevchenko1-5/+5
First of all, using 'mmio' prevents proper implementation of 8-bit accessors. Second, it's simply inconsistent with uart8250 set of options. Rename it to 'mmio32'. While at it, remove rather misleading comment in the documentation. From now on mmio32 is self-explanatory and pciserial supports not only 32-bit MMIO accessors. Also, while at it, fix the comment for the "pciserial" case. The comment seems to be a copy'n'paste error when mentioning "serial" instead of "pciserial" (with double quotes). Fix this. With that, move it upper, so we don't calculate 'buf' twice. Fixes: 3181424aeac2 ("x86/early_printk: Add support for MMIO-based UARTs") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Denis Mukhin <dmukhin@ford.com> Link: https://lore.kernel.org/r/20250407172214.792745-1-andriy.shevchenko@linux.intel.com
2025-04-09x86/resctrl: Fix rdtgroup_mkdir()'s unlocked use of kernfs_node::nameJames Morse1-21/+27
Since 741c10b096bc ("kernfs: Use RCU to access kernfs_node::name.") a helper rdt_kn_name() that checks that rdtgroup_mutex is held has been used for all accesses to the kernfs node name. rdtgroup_mkdir() uses the name to determine if a valid monitor group is being created by checking the parent name is "mon_groups". This is done without holding rdtgroup_mutex, and now triggers the following warning: | WARNING: suspicious RCU usage | 6.15.0-rc1 #4465 Tainted: G E | ----------------------------- | arch/x86/kernel/cpu/resctrl/internal.h:408 suspicious rcu_dereference_check() usage! [...] | Call Trace: | <TASK> | dump_stack_lvl | lockdep_rcu_suspicious.cold | is_mon_groups | rdtgroup_mkdir | kernfs_iop_mkdir | vfs_mkdir | do_mkdirat | __x64_sys_mkdir | do_syscall_64 | entry_SYSCALL_64_after_hwframe Creating a control or monitor group calls mkdir_rdt_prepare(), which uses rdtgroup_kn_lock_live() to take the rdtgroup_mutex. To avoid taking and dropping the lock, move the check for the monitor group name and position into mkdir_rdt_prepare() so that it occurs under rdtgroup_mutex. Hoist is_mon_groups() earlier in the file. [ bp: Massage. ] Fixes: 741c10b096bc ("kernfs: Use RCU to access kernfs_node::name.") Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250407124637.2433230-1-james.morse@arm.com
2025-04-07x86/e820: Fix handling of subpage regions when calculating nosave ranges in ↵Myrrh Periwinkle1-9/+8
e820__register_nosave_regions() While debugging kexec/hibernation hangs and crashes, it turned out that the current implementation of e820__register_nosave_regions() suffers from multiple serious issues: - The end of last region is tracked by PFN, causing it to find holes that aren't there if two consecutive subpage regions are present - The nosave PFN ranges derived from holes are rounded out (instead of rounded in) which makes it inconsistent with how explicitly reserved regions are handled Fix this by: - Treating reserved regions as if they were holes, to ensure consistent handling (rounding out nosave PFN ranges is more correct as the kernel does not use partial pages) - Tracking the end of the last RAM region by address instead of pages to detect holes more precisely These bugs appear to have been introduced about ~18 years ago with the very first version of e820_mark_nosave_regions(), and its flawed assumptions were carried forward uninterrupted through various waves of rewrites and renames. [ mingo: Added Git archeology details, for kicks and giggles. ] Fixes: e8eff5ac294e ("[PATCH] Make swsusp avoid memory holes and reserved memory regions on x86_64") Reported-by: Roberto Ricci <io@r-ricci.it> Tested-by: Roberto Ricci <io@r-ricci.it> Signed-off-by: Myrrh Periwinkle <myrrhperiwinkle@qtmlabs.xyz> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Len Brown <len.brown@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250406-fix-e820-nosave-v3-1-f3787bc1ee1d@qtmlabs.xyz Closes: https://lore.kernel.org/all/Z4WFjBVHpndct7br@desktop0a/
2025-04-07x86/acpi: Don't limit CPUs to 1 for Xen PV guests due to disabled ACPIPetr Vaněk1-0/+11
Xen disables ACPI for PV guests in DomU, which causes acpi_mps_check() to return 1 when CONFIG_X86_MPPARSE is not set. As a result, the local APIC is disabled and the guest is later limited to a single vCPU, despite being configured with more. This regression was introduced in version 6.9 in commit 7c0edad3643f ("x86/cpu/topology: Rework possible CPU management"), which added an early check that limits CPUs to 1 if apic_is_disabled. Update the acpi_mps_check() logic to return 0 early when running as a Xen PV guest in DomU, preventing APIC from being disabled in this specific case and restoring correct multi-vCPU behaviour. Fixes: 7c0edad3643f ("x86/cpu/topology: Rework possible CPU management") Signed-off-by: Petr Vaněk <arkamar@atlas.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250407132445.6732-2-arkamar@atlas.cz
2025-04-07x86/microcode/AMD: Clean the cache if update did not load microcodeBoris Ostrovsky1-0/+7
If microcode did not get loaded there is no reason to keep it in the cache. Moreover, if loading failed it will not be possible to load an earlier version of microcode since the failed revision will always be selected from the cache on the next reload attempt. Since the failed revisions is not easily available at this point just clean the whole cache. It will be rebuilt later if needed. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250327230503.1850368-3-boris.ostrovsky@oracle.com
2025-04-06x86/cpuid: Add AMX and SPEC_CTRL dependenciesAndi Kleen1-0/+4
Add some missing dependencies to the CPUID dependency table: - All the AMX features depend on AMX_TILE - All the SPEC_CTRL features depend on SPEC_CTRL [ mingo: Keep the AMX part of the table grouped ... ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Ahmed S. Darwish <darwi@linutronix.de> Link: https://lore.kernel.org/r/20240924170128.2611854-1-ak@linux.intel.com
2025-04-06Merge tag 'timers-cleanups-2025-04-06' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A set of final cleanups for the timer subsystem: - Convert all del_timer[_sync]() instances over to the new timer_delete[_sync]() API and remove the legacy wrappers. Conversion was done with coccinelle plus some manual fixups as coccinelle chokes on scoped_guard(). - The final cleanup of the hrtimer_init() to hrtimer_setup() conversion. This has been delayed to the end of the merge window, so that all patches which have been merged through other trees are in mainline and all new users are catched. Doing this right before rc1 ensures that new code which is merged post rc1 is not introducing new instances of the original functionality" * tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tracing/timers: Rename the hrtimer_init event to hrtimer_setup hrtimers: Rename debug_init_on_stack() to debug_setup_on_stack() hrtimers: Rename debug_init() to debug_setup() hrtimers: Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper() hrtimers: Remove unnecessary NULL check in hrtimer_start_range_ns() hrtimers: Make callback function pointer private hrtimers: Merge __hrtimer_init() into __hrtimer_setup() hrtimers: Switch to use __htimer_setup() hrtimers: Delete hrtimer_init() treewide: Convert new and leftover hrtimer_init() users treewide: Switch/rename to timer_delete[_sync]()
2025-04-05treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner1-3/+3
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>