Age | Commit message (Collapse) | Author | Files | Lines |
|
commit b8c3a2502a205321fe66c356f4b70cabd8e1a5fc upstream.
The only difference between 5 and 6 is the new counters snapshotting
group, without the following counters snapshotting enabling patches,
it's impossible to utilize the feature in a PEBS record. It's safe to
share the same code path with format 5.
Add format 6, so the end user can at least utilize the legacy PEBS
features.
Fixes: a932aa0e868f ("perf/x86: Add Lunar Lake and Arrow Lake support")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241216204505.748363-1-kan.liang@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit aa5d2ca7c179c40669edb5e96d931bf9828dea3d upstream.
The released OCR and FRONTEND events utilized more bits on Lunar Lake
p-core. The corresponding mask in the extra_regs has to be extended to
unblock the extra bits.
Add a dedicated intel_lnc_extra_regs.
Fixes: a932aa0e868f ("perf/x86: Add Lunar Lake and Arrow Lake support")
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20241216160252.430858-1-kan.liang@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dc81e556f2a017d681251ace21bf06c126d5a192 upstream.
An indirect branch instruction sets the CPU indirect branch tracker
(IBT) into WAIT_FOR_ENDBRANCH (WFE) state and WFE stays asserted
across the instruction boundary. When the decoder finds an
inappropriate instruction while WFE is set ENDBR, the CPU raises a #CP
fault.
For the "kernel IBT no ENDBR" selftest where #CPs are deliberately
triggered, the WFE state of the interrupted context needs to be
cleared to let execution continue. Otherwise when the CPU resumes
from the instruction that just caused the previous #CP, another
missing-ENDBRANCH #CP is raised and the CPU enters a dead loop.
This is not a problem with IDT because it doesn't preserve WFE and
IRET doesn't set WFE. But FRED provides space on the entry stack
(in an expanded CS area) to save and restore the WFE state, thus the
WFE state is no longer clobbered, so software must clear it.
Clear WFE to avoid dead looping in ibt_clear_fred_wfe() and the
!ibt_fatal code path when execution is allowed to continue.
Clobbering WFE in any other circumstance is a security-relevant bug.
[ dhansen: changelog rewording ]
Fixes: a5f6c2ace997 ("x86/shstk: Add user control-protection fault handler")
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241113175934.3897541-1-xin%40zytor.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b6ccddd6fe1fd49c7a82b6fbed01cccad21a29c7 upstream.
From the perspective of the uncore PMU, the Clearwater Forest is the
same as the previous Sierra Forest. The only difference is the event
list, which will be supported in the perf tool later.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241211161146.235253-1-kan.liang@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c1474bb0b7cff4e8481095bd0618b8f6c2f0aeb4 ]
The branch instructions beq, bne, blt, bge, bltu, bgeu and jirl belong
to the format reg2i16, but the sequence of oprand is different for the
instruction jirl. So adjust the parameter order of emit_jirl() to make
it more readable correspond with the Instruction Set Architecture manual.
Here are the instruction formats:
beq rj, rd, offs16
bne rj, rd, offs16
blt rj, rd, offs16
bge rj, rd, offs16
bltu rj, rd, offs16
bgeu rj, rd, offs16
jirl rd, rj, offs16
Link: https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#branch-instructions
Suggested-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 55dc2f8f263448f1e6c7ef135d08e640d5a4826e ]
Since screen_info.lfb_base is a __u32 type, an above-4G address need an
ext_lfb_base to present its higher 32bits. In init_screen_info() we can
use __screen_info_lfb_base() to handle this case for reserving screen
info memory.
Signed-off-by: Xuefeng Zhao <zhaoxuefeng@loongson.cn>
Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn>
Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 05aa156e156ef3168e7ab8a68721945196495c17 ]
The mapping VMA address is saved in VAS window struct when the
paste address is mapped. This VMA address is used during migration
to unmap the paste address if the window is active. The paste
address mapping will be removed when the window is closed or with
the munmap(). But the VMA address in the VAS window is not updated
with munmap() which is causing invalid access during migration.
The KASAN report shows:
[16386.254991] BUG: KASAN: slab-use-after-free in reconfig_close_windows+0x1a0/0x4e8
[16386.255043] Read of size 8 at addr c00000014a819670 by task drmgr/696928
[16386.255096] CPU: 29 UID: 0 PID: 696928 Comm: drmgr Kdump: loaded Tainted: G B 6.11.0-rc5-nxgzip #2
[16386.255128] Tainted: [B]=BAD_PAGE
[16386.255148] Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.00 (NH1110_016) hv:phyp pSeries
[16386.255181] Call Trace:
[16386.255202] [c00000016b297660] [c0000000018ad0ac] dump_stack_lvl+0x84/0xe8 (unreliable)
[16386.255246] [c00000016b297690] [c0000000006e8a90] print_report+0x19c/0x764
[16386.255285] [c00000016b297760] [c0000000006e9490] kasan_report+0x128/0x1f8
[16386.255309] [c00000016b297880] [c0000000006eb5c8] __asan_load8+0xac/0xe0
[16386.255326] [c00000016b2978a0] [c00000000013f898] reconfig_close_windows+0x1a0/0x4e8
[16386.255343] [c00000016b297990] [c000000000140e58] vas_migration_handler+0x3a4/0x3fc
[16386.255368] [c00000016b297a90] [c000000000128848] pseries_migrate_partition+0x4c/0x4c4
...
[16386.256136] Allocated by task 696554 on cpu 31 at 16377.277618s:
[16386.256149] kasan_save_stack+0x34/0x68
[16386.256163] kasan_save_track+0x34/0x80
[16386.256175] kasan_save_alloc_info+0x58/0x74
[16386.256196] __kasan_slab_alloc+0xb8/0xdc
[16386.256209] kmem_cache_alloc_noprof+0x200/0x3d0
[16386.256225] vm_area_alloc+0x44/0x150
[16386.256245] mmap_region+0x214/0x10c4
[16386.256265] do_mmap+0x5fc/0x750
[16386.256277] vm_mmap_pgoff+0x14c/0x24c
[16386.256292] ksys_mmap_pgoff+0x20c/0x348
[16386.256303] sys_mmap+0xd0/0x160
...
[16386.256350] Freed by task 0 on cpu 31 at 16386.204848s:
[16386.256363] kasan_save_stack+0x34/0x68
[16386.256374] kasan_save_track+0x34/0x80
[16386.256384] kasan_save_free_info+0x64/0x10c
[16386.256396] __kasan_slab_free+0x120/0x204
[16386.256415] kmem_cache_free+0x128/0x450
[16386.256428] vm_area_free_rcu_cb+0xa8/0xd8
[16386.256441] rcu_do_batch+0x2c8/0xcf0
[16386.256458] rcu_core+0x378/0x3c4
[16386.256473] handle_softirqs+0x20c/0x60c
[16386.256495] do_softirq_own_stack+0x6c/0x88
[16386.256509] do_softirq_own_stack+0x58/0x88
[16386.256521] __irq_exit_rcu+0x1a4/0x20c
[16386.256533] irq_exit+0x20/0x38
[16386.256544] interrupt_async_exit_prepare.constprop.0+0x18/0x2c
...
[16386.256717] Last potentially related work creation:
[16386.256729] kasan_save_stack+0x34/0x68
[16386.256741] __kasan_record_aux_stack+0xcc/0x12c
[16386.256753] __call_rcu_common.constprop.0+0x94/0xd04
[16386.256766] vm_area_free+0x28/0x3c
[16386.256778] remove_vma+0xf4/0x114
[16386.256797] do_vmi_align_munmap.constprop.0+0x684/0x870
[16386.256811] __vm_munmap+0xe0/0x1f8
[16386.256821] sys_munmap+0x54/0x6c
[16386.256830] system_call_exception+0x1a0/0x4a0
[16386.256841] system_call_vectored_common+0x15c/0x2ec
[16386.256868] The buggy address belongs to the object at c00000014a819670
which belongs to the cache vm_area_struct of size 168
[16386.256887] The buggy address is located 0 bytes inside of
freed 168-byte region [c00000014a819670, c00000014a819718)
[16386.256915] The buggy address belongs to the physical page:
[16386.256928] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a81
[16386.256950] memcg:c0000000ba430001
[16386.256961] anon flags: 0x43ffff800000000(node=4|zone=0|lastcpupid=0x7ffff)
[16386.256975] page_type: 0xfdffffff(slab)
[16386.256990] raw: 043ffff800000000 c00000000501c080 0000000000000000 5deadbee00000001
[16386.257003] raw: 0000000000000000 00000000011a011a 00000001fdffffff c0000000ba430001
[16386.257018] page dumped because: kasan: bad access detected
This patch adds close() callback in vas_vm_ops vm_operations_struct
which will be executed during munmap() before freeing VMA. The VMA
address in the VAS window is set to NULL after holding the window
mmap_mutex.
Fixes: 37e6764895ef ("powerpc/pseries/vas: Add VAS migration handler")
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20241214051758.997759-1-haren@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 058387d9c6b70e225da82492e1e193635c3fac3f ]
Set the cache-line-size parameter of the L2 cache for each core to the
correct value of 64 bytes.
Previously, the L2 cache line size was incorrectly set to 128 bytes
for the Broadcom BCM2712. This causes validation tests for the
Performance Application Programming Interface (PAPI) tool to fail as
they depend on sysfs accurately reporting cache line sizes.
The correct value of 64 bytes is stated in the official documentation of
the ARM Cortex A-72, which is linked in the comments of
arm64/boot/dts/broadcom/bcm2712.dtsi as the source for cache-line-size.
Fixes: faa3381267d0 ("arm64: dts: broadcom: Add minimal support for Raspberry Pi 5")
Signed-off-by: Willow Cunningham <willow.e.cunningham@maine.edu>
Link: https://lore.kernel.org/r/20241007212954.214724-1-willow.e.cunningham@maine.edu
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 4d5163cba43fe96902165606fa54e1aecbbb32de upstream.
Drop KVM's arbitrary behavior of making DE_CFG.LFENCE_SERIALIZE read-only
for the guest, as rejecting writes can lead to guest crashes, e.g. Windows
in particular doesn't gracefully handle unexpected #GPs on the WRMSR, and
nothing in the AMD manuals suggests that LFENCE_SERIALIZE is read-only _if
it exists_.
KVM only allows LFENCE_SERIALIZE to be set, by the guest or host, if the
underlying CPU has X86_FEATURE_LFENCE_RDTSC, i.e. if LFENCE is guaranteed
to be serializing. So if the guest sets LFENCE_SERIALIZE, KVM will provide
the desired/correct behavior without any additional action (the guest's
value is never stuffed into hardware). And having LFENCE be serializing
even when it's not _required_ to be is a-ok from a functional perspective.
Fixes: 74a0e79df68a ("KVM: SVM: Disallow guest from changing userspace's MSR_AMD64_DE_CFG value")
Fixes: d1d93fa90f1a ("KVM: SVM: Add MSR-based feature support for serializing LFENCE")
Reported-by: Simon Pilkington <simonp.git@mailbox.org>
Closes: https://lore.kernel.org/all/52914da7-a97b-45ad-86a0-affdf8266c61@mailbox.org
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20241211172952.1477605-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9b42d1e8e4fe9dc631162c04caa69b0d1860b0f0 upstream.
Use is_64_bit_hypercall() instead of is_64_bit_mode() to detect a 64-bit
hypercall when completing said hypercall. For guests with protected state,
e.g. SEV-ES and SEV-SNP, KVM must assume the hypercall was made in 64-bit
mode as the vCPU state needed to detect 64-bit mode is unavailable.
Hacking the sev_smoke_test selftest to generate a KVM_HC_MAP_GPA_RANGE
hypercall via VMGEXIT trips the WARN:
------------[ cut here ]------------
WARNING: CPU: 273 PID: 326626 at arch/x86/kvm/x86.h:180 complete_hypercall_exit+0x44/0xe0 [kvm]
Modules linked in: kvm_amd kvm ... [last unloaded: kvm]
CPU: 273 UID: 0 PID: 326626 Comm: sev_smoke_test Not tainted 6.12.0-smp--392e932fa0f3-feat #470
Hardware name: Google Astoria/astoria, BIOS 0.20240617.0-0 06/17/2024
RIP: 0010:complete_hypercall_exit+0x44/0xe0 [kvm]
Call Trace:
<TASK>
kvm_arch_vcpu_ioctl_run+0x2400/0x2720 [kvm]
kvm_vcpu_ioctl+0x54f/0x630 [kvm]
__se_sys_ioctl+0x6b/0xc0
do_syscall_64+0x83/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
---[ end trace 0000000000000000 ]---
Fixes: b5aead0064f3 ("KVM: x86: Assume a 64-bit hypercall for guests with protected state")
Cc: stable@vger.kernel.org
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20241128004344.4072099-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bcc80dec91ee745b3d66f3e48f0ec2efdea97149 upstream.
read_hv_sched_clock_tsc() assumes that the Hyper-V clock counter is
bigger than the variable hv_sched_clock_offset, which is cached during
early boot, but depending on the timing this assumption may be false
when a hibernated VM starts again (the clock counter starts from 0
again) and is resuming back (Note: hv_init_tsc_clocksource() is not
called during hibernation/resume); consequently,
read_hv_sched_clock_tsc() may return a negative integer (which is
interpreted as a huge positive integer since the return type is u64)
and new kernel messages are prefixed with huge timestamps before
read_hv_sched_clock_tsc() grows big enough (which typically takes
several seconds).
Fix the issue by saving the Hyper-V clock counter just before the
suspend, and using it to correct the hv_sched_clock_offset in
resume. This makes hv tsc page based sched_clock continuous and ensures
that post resume, it starts from where it left off during suspend.
Override x86_platform.save_sched_clock_state and
x86_platform.restore_sched_clock_state routines to correct this as soon
as possible.
Note: if Invariant TSC is available, the issue doesn't happen because
1) we don't register read_hv_sched_clock_tsc() for sched clock:
See commit e5313f1c5404 ("clocksource/drivers/hyper-v: Rework
clocksource and sched clock setup");
2) the common x86 code adjusts TSC similarly: see
__restore_processor_state() -> tsc_verify_tsc_adjust(true) and
x86_platform.restore_sched_clock_state().
Cc: stable@vger.kernel.org
Fixes: 1349401ff1aa ("clocksource/drivers/hyper-v: Suspend/resume Hyper-V clocksource for hibernation")
Co-developed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20240917053917.76787-1-namjain@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240917053917.76787-1-namjain@linux.microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 41856638e6c4ed51d8aa9e54f70059d1e357b46e upstream.
With uncoupling of physical and virtual address spaces population of
the identity mapping was changed to use the type POPULATE_IDENTITY
instead of POPULATE_DIRECT. This breaks DirectMap accounting:
> cat /proc/meminfo
DirectMap4k: 55296 kB
DirectMap1M: 18446744073709496320 kB
Adjust all locations of update_page_count() in vmem.c to use
POPULATE_IDENTITY instead of POPULATE_DIRECT as well. With this
accounting is correct again:
> cat /proc/meminfo
DirectMap4k: 54264 kB
DirectMap1M: 8334336 kB
Fixes: c98d2ecae08f ("s390/mm: Uncouple physical vs virtual address spaces")
Cc: stable@vger.kernel.org
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit aef25be35d23ec768eed08bfcf7ca3cf9685bc28 upstream.
The Hexagon-specific constant extender optimization in LLVM may crash on
Linux kernel code [1], such as fs/bcache/btree_io.c after
commit 32ed4a620c54 ("bcachefs: Btree path tracepoints") in 6.12:
clang: llvm/lib/Target/Hexagon/HexagonConstExtenders.cpp:745: bool (anonymous namespace)::HexagonConstExtenders::ExtRoot::operator<(const HCE::ExtRoot &) const: Assertion `ThisB->getParent() == OtherB->getParent()' failed.
Stack dump:
0. Program arguments: clang --target=hexagon-linux-musl ... fs/bcachefs/btree_io.c
1. <eof> parser at end of file
2. Code generation
3. Running pass 'Function Pass Manager' on module 'fs/bcachefs/btree_io.c'.
4. Running pass 'Hexagon constant-extender optimization' on function '@__btree_node_lock_nopath'
Without assertions enabled, there is just a hang during compilation.
This has been resolved in LLVM main (20.0.0) [2] and backported to LLVM
19.1.0 but the kernel supports LLVM 13.0.1 and newer, so disable the
constant expander optimization using the '-mllvm' option when using a
toolchain that is not fixed.
Cc: stable@vger.kernel.org
Link: https://github.com/llvm/llvm-project/issues/99714 [1]
Link: https://github.com/llvm/llvm-project/commit/68df06a0b2998765cb0a41353fcf0919bbf57ddb [2]
Link: https://github.com/llvm/llvm-project/commit/2ab8d93061581edad3501561722ebd5632d73892 [3]
Reviewed-by: Brian Cain <bcain@quicinc.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1201f226c863b7da739f7420ddba818cedf372fc upstream.
Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initiliaization to
avoid the overead of CPUID during runtime. The offset, size, and metadata
for CPUID.0xD.[1..n] sub-leaves does not depend on XCR0 or XSS values, i.e.
is constant for a given CPU, and thus can be cached during module load.
On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point where
recomputing XSAVE offsets and sizes results in a 4x increase in latency of
nested VM-Enter and VM-Exit (nested transitions can trigger
xstate_required_size() multiple times per transition), relative to using
cached values. The issue is easily visible by running `perf top` while
triggering nested transitions: kvm_update_cpuid_runtime() shows up at a
whopping 50%.
As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test
and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested
roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald
Rapids (EMR) takes:
SKX 11650
ICX 22350
EMR 28850
Using cached values, the latency drops to:
SKX 6850
ICX 9000
EMR 7900
The underlying issue is that CPUID itself is slow on ICX, and comically
slow on EMR. The problem is exacerbated on CPUs which support XSAVES
and/or XSAVEC, as KVM invokes xstate_required_size() twice on each
runtime CPUID update, and because there are more supported XSAVE features
(CPUID for supported XSAVE feature sub-leafs is significantly slower).
SKX:
CPUID.0xD.2 = 348 cycles
CPUID.0xD.3 = 400 cycles
CPUID.0xD.4 = 276 cycles
CPUID.0xD.5 = 236 cycles
<other sub-leaves are similar>
EMR:
CPUID.0xD.2 = 1138 cycles
CPUID.0xD.3 = 1362 cycles
CPUID.0xD.4 = 1068 cycles
CPUID.0xD.5 = 910 cycles
CPUID.0xD.6 = 914 cycles
CPUID.0xD.7 = 1350 cycles
CPUID.0xD.8 = 734 cycles
CPUID.0xD.9 = 766 cycles
CPUID.0xD.10 = 732 cycles
CPUID.0xD.11 = 718 cycles
CPUID.0xD.12 = 734 cycles
CPUID.0xD.13 = 1700 cycles
CPUID.0xD.14 = 1126 cycles
CPUID.0xD.15 = 898 cycles
CPUID.0xD.16 = 716 cycles
CPUID.0xD.17 = 748 cycles
CPUID.0xD.18 = 776 cycles
Note, updating runtime CPUID information multiple times per nested
transition is itself a flaw, especially since CPUID is a mandotory
intercept on both Intel and AMD. E.g. KVM doesn't need to ensure emulated
CPUID state is up-to-date while running L2. That flaw will be fixed in a
future patch, as deferring runtime CPUID updates is more subtle than it
appears at first glance, the benefits aren't super critical to have once
the XSAVE issue is resolved, and caching CPUID output is desirable even if
KVM's updates are deferred.
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241211013302.1347853-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 03c7527e97f73081633d773f9f8c2373f9854b25 upstream.
Catalin reports that a hypervisor lying to a guest about the size
of the ASID field may result in unexpected issues:
- if the underlying HW does only supports 8 bit ASIDs, the ASID
field in a TLBI VAE1* operation is only 8 bits, and the HW will
ignore the other 8 bits
- if on the contrary the HW is 16 bit capable, the ASID field
in the same TLBI operation is always 16 bits, irrespective of
the value of TCR_ELx.AS.
This could lead to missed invalidations if the guest was lead to
assume that the HW had 8 bit ASIDs while they really are 16 bit wide.
In order to avoid any potential disaster that would be hard to debug,
prenent the migration between a host with 8 bit ASIDs to one with
wider ASIDs (the converse was obviously always forbidden). This is
also consistent with what we already do for VMIDs.
If it becomes absolutely mandatory to support such a migration path
in the future, we will have to trap and emulate all TLBIs, something
that nobody should look forward to.
Fixes: d5a32b60dc18 ("KVM: arm64: Allow userspace to change ID_AA64MMFR{0-2}_EL1")
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20241203190236.505759-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 282da38b465395c930687974627c24f47ddce5ff ]
The calculation determining whether to use three- or four-level paging
didn't account for KMSAN modules metadata. Include this metadata in the
virtual memory size calculation to ensure correct paging mode selection
and avoiding potentially unnecessary physical memory size limitations.
Fixes: 65ca73f9fb36 ("s390/mm: define KMSAN metadata for vmalloc and modules")
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5fa49dd8e521a42379e5e41fcf2c92edaaec0a8b ]
DEFINE_IPL_ATTR_STR_RW() macro produces "unsigned 'len' is never less
than zero." warning when sys_vmcmd_on_*_store() callbacks are defined.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412081614.5uel8F6W-lkp@intel.com/
Fixes: 247576bf624a ("s390/ipl: Do not accept z/VM CP diag X'008' cmds longer than max length")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ea6398a5af81e3e7fb3da5d261694d479a321fd9 ]
This doesn't cause a problem currently as HVIEN isn't used elsewhere
yet. Found by inspection.
Signed-off-by: Michael Neuling <michaelneuling@tenstorrent.com>
Fixes: 16b0bde9a37c ("RISC-V: KVM: Add perf sampling support for guests")
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20241127041840.419940-1-michaelneuling@tenstorrent.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 7fa0da5373685e7ed249af3fa317ab1e1ba8b0a6 upstream.
The hypercall page is no longer needed. It can be removed, as from the
Xen perspective it is optional.
But, from Linux's perspective, it removes naked RET instructions that
escape the speculative protections that Call Depth Tracking and/or
Untrain Ret are trying to achieve.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b1c2cb86f4a7861480ad54bb9a58df3cbebf8e92 upstream.
Call the Xen hypervisor via the new xen_hypercall_func static-call
instead of the hypercall page.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b4845bb6383821a9516ce30af3a27dc873e37fd4 upstream.
Add generic hypercall functions usable for all normal (i.e. not iret)
hypercalls. Depending on the guest type and the processor vendor
different functions need to be used due to the to be used instruction
for entering the hypervisor:
- PV guests need to use syscall
- HVM/PVH guests on Intel need to use vmcall
- HVM/PVH guests on AMD and Hygon need to use vmmcall
As PVH guests need to issue hypercalls very early during boot, there
is a 4th hypercall function needed for HVM/PVH which can be used on
Intel and AMD processors. It will check the vendor type and then set
the Intel or AMD specific function to use via static_call().
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a2796dff62d6c6bfc5fbebdf2bee0d5ac0438906 upstream.
Instead of jumping to the Xen hypercall page for doing the iret
hypercall, directly code the required sequence in xen-asm.S.
This is done in preparation of no longer using hypercall page at all,
as it has shown to cause problems with speculation mitigations.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0ef8047b737d7480a5d4c46d956e97c190f13050 upstream.
Add static_call_update_early() for updating static-call targets in
very early boot.
This will be needed for support of Xen guest type specific hypercall
functions.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit efbcd61d9bebb771c836a3b8bfced8165633db7c upstream.
In order to be able to differentiate between AMD and Intel based
systems for very early hypercalls without having to rely on the Xen
hypercall page, make get_cpu_vendor() non-static.
Refactor early_cpu_init() for the same reason by splitting out the
loop initializing cpu_devs() into an externally callable function.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6685f5d572c22e1003e7c0d089afe1c64340ab1f upstream.
commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits in
ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to guests,
but didn't add trap handling. A previous patch supplied the missing trap
handling.
Existing VMs that have the MPAM field of ID_AA64PFR0_EL1 set need to
be migratable, but there is little point enabling the MPAM CPU
interface on new VMs until there is something a guest can do with it.
Clear the MPAM field from the guest's ID_AA64PFR0_EL1 and on hardware
that supports MPAM, politely ignore the VMMs attempts to set this bit.
Guests exposed to this bug have the sanitised value of the MPAM field,
so only the correct value needs to be ignored. This means the field
can continue to be used to block migration to incompatible hardware
(between MPAM=1 and MPAM=5), and the VMM can't rely on the field
being ignored.
Signed-off-by: James Morse <james.morse@arm.com>
Co-developed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241030160317.2528209-7-joey.gouly@arm.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
[maz: adapted to lack of ID_FILTERED()]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b3431a8bb336cece8adc452437befa7d4534b2fd upstream.
flush_tlb_kernel_range() may use IPIs to flush the TLBs of all the
cores, which triggers the following warning when the irqs are disabled:
[ 3.455330] WARNING: CPU: 1 PID: 0 at kernel/smp.c:815 smp_call_function_many_cond+0x452/0x520
[ 3.456647] Modules linked in:
[ 3.457218] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.12.0-rc7-00010-g91d3de7240b8 #1
[ 3.457416] Hardware name: QEMU QEMU Virtual Machine, BIOS
[ 3.457633] epc : smp_call_function_many_cond+0x452/0x520
[ 3.457736] ra : on_each_cpu_cond_mask+0x1e/0x30
[ 3.457786] epc : ffffffff800b669a ra : ffffffff800b67c2 sp : ff2000000000bb50
[ 3.457824] gp : ffffffff815212b8 tp : ff6000008014f080 t0 : 000000000000003f
[ 3.457859] t1 : ffffffff815221e0 t2 : 000000000000000f s0 : ff2000000000bc10
[ 3.457920] s1 : 0000000000000040 a0 : ffffffff815221e0 a1 : 0000000000000001
[ 3.457953] a2 : 0000000000010000 a3 : 0000000000000003 a4 : 0000000000000000
[ 3.458006] a5 : 0000000000000000 a6 : ffffffffffffffff a7 : 0000000000000000
[ 3.458042] s2 : ffffffff815223be s3 : 00fffffffffff000 s4 : ff600001ffe38fc0
[ 3.458076] s5 : ff600001ff950d00 s6 : 0000000200000120 s7 : 0000000000000001
[ 3.458109] s8 : 0000000000000001 s9 : ff60000080841ef0 s10: 0000000000000001
[ 3.458141] s11: ffffffff81524812 t3 : 0000000000000001 t4 : ff60000080092bc0
[ 3.458172] t5 : 0000000000000000 t6 : ff200000000236d0
[ 3.458203] status: 0000000200000100 badaddr: ffffffff800b669a cause: 0000000000000003
[ 3.458373] [<ffffffff800b669a>] smp_call_function_many_cond+0x452/0x520
[ 3.458593] [<ffffffff800b67c2>] on_each_cpu_cond_mask+0x1e/0x30
[ 3.458625] [<ffffffff8000e4ca>] __flush_tlb_range+0x118/0x1ca
[ 3.458656] [<ffffffff8000e6b2>] flush_tlb_kernel_range+0x1e/0x26
[ 3.458683] [<ffffffff801ea56a>] kfence_protect+0xc0/0xce
[ 3.458717] [<ffffffff801e9456>] kfence_guarded_free+0xc6/0x1c0
[ 3.458742] [<ffffffff801e9d6c>] __kfence_free+0x62/0xc6
[ 3.458764] [<ffffffff801c57d8>] kfree+0x106/0x32c
[ 3.458786] [<ffffffff80588cf2>] detach_buf_split+0x188/0x1a8
[ 3.458816] [<ffffffff8058708c>] virtqueue_get_buf_ctx+0xb6/0x1f6
[ 3.458839] [<ffffffff805871da>] virtqueue_get_buf+0xe/0x16
[ 3.458880] [<ffffffff80613d6a>] virtblk_done+0x5c/0xe2
[ 3.458908] [<ffffffff8058766e>] vring_interrupt+0x6a/0x74
[ 3.458930] [<ffffffff800747d8>] __handle_irq_event_percpu+0x7c/0xe2
[ 3.458956] [<ffffffff800748f0>] handle_irq_event+0x3c/0x86
[ 3.458978] [<ffffffff800786cc>] handle_simple_irq+0x9e/0xbe
[ 3.459004] [<ffffffff80073934>] generic_handle_domain_irq+0x1c/0x2a
[ 3.459027] [<ffffffff804bf87c>] imsic_handle_irq+0xba/0x120
[ 3.459056] [<ffffffff80073934>] generic_handle_domain_irq+0x1c/0x2a
[ 3.459080] [<ffffffff804bdb76>] riscv_intc_aia_irq+0x24/0x34
[ 3.459103] [<ffffffff809d0452>] handle_riscv_irq+0x2e/0x4c
[ 3.459133] [<ffffffff809d923e>] call_on_irq_stack+0x32/0x40
So only flush the local TLB and let the lazy kfence page fault handling
deal with the faults which could happen when a core has an old protected
pte version cached in its TLB. That leads to potential inaccuracies which
can be tolerated when using kfence.
Fixes: 47513f243b45 ("riscv: Enable KFENCE for riscv64")
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241209074125.52322-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c796e187201242992d6d292bfeff41aadfdf3f29 upstream.
riscv uses fixmap addresses to map the dtb so we can't use __pa() which
is reserved for linear mapping addresses.
Fixes: b2473a359763 ("of/fdt: add dt_phys arg to early_init_dt_scan and early_init_dt_verify")
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20241209074508.53037-1-alexghiti@rivosinc.com
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 21f1b85c8912262adf51707e63614a114425eb10 upstream.
The vmemmap's, which is used for RV64 with SPARSEMEM_VMEMMAP, page
tables are populated using pmd (page middle directory) hugetables.
However, the pmd allocation is not using the generic mechanism used by
the VMA code (e.g. pmd_alloc()), or the RISC-V specific
create_pgd_mapping()/alloc_pmd_late(). Instead, the vmemmap page table
code allocates a page, and calls vmemmap_set_pmd(). This results in
that the pmd ctor is *not* called, nor would it make sense to do so.
Now, when tearing down a vmemmap page table pmd, the cleanup code
would unconditionally, and incorrectly call the pmd dtor, which
results in a crash (best case).
This issue was found when running the HMM selftests:
| tools/testing/selftests/mm# ./test_hmm.sh smoke
| ... # when unloading the test_hmm.ko module
| page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10915b
| flags: 0x1000000000000000(node=0|zone=1)
| raw: 1000000000000000 0000000000000000 dead000000000122 0000000000000000
| raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
| page dumped because: VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte)
| ------------[ cut here ]------------
| kernel BUG at include/linux/mm.h:3080!
| Kernel BUG [#1]
| Modules linked in: test_hmm(-) sch_fq_codel fuse drm drm_panel_orientation_quirks backlight dm_mod
| CPU: 1 UID: 0 PID: 514 Comm: modprobe Tainted: G W 6.12.0-00982-gf2a4f1682d07 #2
| Tainted: [W]=WARN
| Hardware name: riscv-virtio qemu/qemu, BIOS 2024.10 10/01/2024
| epc : remove_pgd_mapping+0xbec/0x1070
| ra : remove_pgd_mapping+0xbec/0x1070
| epc : ffffffff80010a68 ra : ffffffff80010a68 sp : ff20000000a73940
| gp : ffffffff827b2d88 tp : ff6000008785da40 t0 : ffffffff80fbce04
| t1 : 0720072007200720 t2 : 706d756420656761 s0 : ff20000000a73a50
| s1 : ff6000008915cff8 a0 : 0000000000000039 a1 : 0000000000000008
| a2 : ff600003fff0de20 a3 : 0000000000000000 a4 : 0000000000000000
| a5 : 0000000000000000 a6 : c0000000ffffefff a7 : ffffffff824469b8
| s2 : ff1c0000022456c0 s3 : ff1ffffffdbfffff s4 : ff6000008915c000
| s5 : ff6000008915c000 s6 : ff6000008915c000 s7 : ff1ffffffdc00000
| s8 : 0000000000000001 s9 : ff1ffffffdc00000 s10: ffffffff819a31f0
| s11: ffffffffffffffff t3 : ffffffff8000c950 t4 : ff60000080244f00
| t5 : ff60000080244000 t6 : ff20000000a73708
| status: 0000000200000120 badaddr: ffffffff80010a68 cause: 0000000000000003
| [<ffffffff80010a68>] remove_pgd_mapping+0xbec/0x1070
| [<ffffffff80fd238e>] vmemmap_free+0x14/0x1e
| [<ffffffff8032e698>] section_deactivate+0x220/0x452
| [<ffffffff8032ef7e>] sparse_remove_section+0x4a/0x58
| [<ffffffff802f8700>] __remove_pages+0x7e/0xba
| [<ffffffff803760d8>] memunmap_pages+0x2bc/0x3fe
| [<ffffffff02a3ca28>] dmirror_device_remove_chunks+0x2ea/0x518 [test_hmm]
| [<ffffffff02a3e026>] hmm_dmirror_exit+0x3e/0x1018 [test_hmm]
| [<ffffffff80102c14>] __riscv_sys_delete_module+0x15a/0x2a6
| [<ffffffff80fd020c>] do_trap_ecall_u+0x1f2/0x266
| [<ffffffff80fde0a2>] _new_vmalloc_restore_context_a0+0xc6/0xd2
| Code: bf51 7597 0184 8593 76a5 854a 4097 0029 80e7 2c00 (9002) 7597
| ---[ end trace 0000000000000000 ]---
| Kernel panic - not syncing: Fatal exception in interrupt
Add a check to avoid calling the pmd dtor, if the calling context is
vmemmap_free().
Fixes: c75a74f4ba19 ("riscv: mm: Add memory hotplugging support")
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20241120131203.1859787-1-bjorn@kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9f3de72a0c37005f897d69e4bdd59c25b8898447 upstream.
The PEBS kernel warnings can still be observed with the below case.
when the below commands are running in parallel for a while.
while true;
do
perf record --no-buildid -a --intr-regs=AX \
-e cpu/event=0xd0,umask=0x81/pp \
-c 10003 -o /dev/null ./triad;
done &
while true;
do
perf record -e 'cpu/mem-loads,ldlat=3/uP' -W -d -- ./dtlb
done
The commit b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing
PEBS_DATA_CFG") intends to flush the entire PEBS buffer before the
hardware is reprogrammed. However, it fails in the above case.
The first perf command utilizes the large PEBS, while the second perf
command only utilizes a single PEBS. When the second perf event is
added, only the n_pebs++. The intel_pmu_pebs_enable() is invoked after
intel_pmu_pebs_add(). So the cpuc->n_pebs == cpuc->n_large_pebs check in
the intel_pmu_drain_large_pebs() fails. The PEBS DS is not flushed.
The new PEBS event should not be taken into account when flushing the
existing PEBS DS.
The check is unnecessary here. Before the hardware is reprogrammed, all
the stale records must be drained unconditionally.
For single PEBS or PEBS-vi-pt, the DS must be empty. The drain_pebs()
can handle the empty case. There is no harm to unconditionally drain the
PEBS DS.
Fixes: b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241119135504.1463839-2-kan.liang@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d44d26987bb3df6d76556827097fc9ce17565cb8 upstream.
Since 135225a363ae timekeeping_cycles_to_ns() handles large offsets which
would lead to 64bit multiplication overflows correctly. It's also protected
against negative motion of the clocksource unconditionally, which was
exclusive to x86 before.
timekeeping_advance() handles large offsets already correctly.
That means the value of CONFIG_DEBUG_TIMEKEEPING which analyzed these cases
is very close to zero. Remove all of it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241031120328.536010148@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 48796104c864cf4dafa80bd8c2ce88f9c92a65ea upstream.
Prior to commit 0467cdde8c43 ("s390/pci: Sort PCI functions prior to
creating virtual busses") the IOMMU was initialized and the device was
registered as part of zpci_create_device() with the struct zpci_dev
freed if either resulted in an error. With that commit this was moved
into a separate function called zpci_add_device().
While this new function logs when adding failed, it expects the caller
not to use and to free the struct zpci_dev on error. This difference
between it and zpci_create_device() was missed while changing the
callers and the incompletely initialized struct zpci_dev may get used in
zpci_scan_configured_device in the error path. This then leads to
a crash due to the device not being registered with the zbus. It was
also not freed in this case. Fix this by handling the error return of
zpci_add_device(). Since in this case the zdev was not added to the
zpci_list it can simply be discarded and freed. Also make this more
explicit by moving the kref_init() into zpci_add_device() and document
that zpci_zdev_get()/zpci_zdev_put() must be used after adding.
Cc: stable@vger.kernel.org
Fixes: 0467cdde8c43 ("s390/pci: Sort PCI functions prior to creating virtual busses")
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
kvm_faultin_pfn()
Since 5.16 and prior to 6.13 KVM can't be used with FSDAX
guest memory (PMD pages). To reproduce the issue you need to reserve
guest memory with `memmap=` cmdline, create and mount FS in DAX mode
(tested both XFS and ext4), see doc link below. ndctl command for test:
ndctl create-namespace -v -e namespace1.0 --map=dev --mode=fsdax -a 2M
Then pass memory object to qemu like:
-m 8G -object memory-backend-file,id=ram0,size=8G,\
mem-path=/mnt/pmem/guestmem,share=on,prealloc=on,dump=off,align=2097152 \
-numa node,memdev=ram0,cpus=0-1
QEMU fails to run guest with error: kvm run failed Bad address
and there are two warnings in dmesg:
WARN_ON_ONCE(!page_count(page)) in kvm_is_zone_device_page() and
WARN_ON_ONCE(folio_ref_count(folio) <= 0) in try_grab_folio() (v6.6.63)
It looks like in the past assumption was made that pfn won't change from
faultin_pfn() to release_pfn_clean(), e.g. see
commit 4cd071d13c5c ("KVM: x86/mmu: Move calls to thp_adjust() down a level")
But kvm_page_fault structure made pfn part of mutable state, so
now release_pfn_clean() can take hugepage-adjusted pfn.
And it works for all cases (/dev/shm, hugetlb, devdax) except fsdax.
Apparently in fsdax mode faultin-pfn and adjusted-pfn may refer to
different folios, so we're getting get_page/put_page imbalance.
To solve this preserve faultin pfn in separate local variable
and pass it in kvm_release_pfn_clean().
Patch tested for all mentioned guest memory backends with tdp_mmu={0,1}.
No bug in upstream as it was solved fundamentally by
commit 8dd861cc07e2 ("KVM: x86/mmu: Put refcounted pages instead of blindly releasing pfns")
and related patch series.
Link: https://nvdimm.docs.kernel.org/2mib_fs_dax.html
Fixes: 2f6305dd5676 ("KVM: MMU: change kvm_tdp_mmu_map() arguments to kvm_page_fault")
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit aeb68937614f4aeceaaa762bd7f0212ce842b797 ]
Build 6.13-rc12 for x86_64 with gcc 14.2.1 fails with the error:
ld: vmlinux.o: in function `virtual_mapped':
linux/arch/x86/kernel/relocate_kernel_64.S:249:(.text+0x5915b): undefined reference to `saved_context_gdt_desc'
when CONFIG_KEXEC_JUMP is enabled.
This was introduced by commit 07fa619f2a40 ("x86/kexec: Restore GDT on
return from ::preserve_context kexec") which introduced a use of
saved_context_gdt_desc without a declaration for it.
Fix that by including asm/asm-offsets.h where saved_context_gdt_desc
is defined (indirectly in include/generated/asm-offsets.h which
asm/asm-offsets.h includes).
Fixes: 07fa619f2a40 ("x86/kexec: Restore GDT on return from ::preserve_context kexec")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Closes: https://lore.kernel.org/oe-kbuild-all/202411270006.ZyyzpYf8-lkp@intel.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f82e62d470cc990ebd9d691f931dd418e4e9cea9 ]
When enabling GICv4.1 in hip09, VMAPP fails to clear some caches during
the unmap operation, which can causes vSGIs to be lost.
To fix the issue, invalidate the related vPE cache through GICR_INVALLR
after VMOVP.
Suggested-by: Marc Zyngier <maz@kernel.org>
Co-developed-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit cf89c9434af122f28a3552e6f9cc5158c33ce50a ]
On some powermacs `escc` nodes are missing `#size-cells` properties,
which is deprecated and now triggers a warning at boot since commit
045b14ca5c36 ("of: WARN on deprecated #address-cells/#size-cells
handling").
For example:
Missing '#size-cells' in /pci@f2000000/mac-io@c/escc@13000
WARNING: CPU: 0 PID: 0 at drivers/of/base.c:133 of_bus_n_size_cells+0x98/0x108
Hardware name: PowerMac3,1 7400 0xc0209 PowerMac
...
Call Trace:
of_bus_n_size_cells+0x98/0x108 (unreliable)
of_bus_default_count_cells+0x40/0x60
__of_get_address+0xc8/0x21c
__of_address_to_resource+0x5c/0x228
pmz_init_port+0x5c/0x2ec
pmz_probe.isra.0+0x144/0x1e4
pmz_console_init+0x10/0x48
console_init+0xcc/0x138
start_kernel+0x5c4/0x694
As powermacs boot via prom_init it's possible to add the missing
properties to the device tree during boot, avoiding the warning. Note
that `escc-legacy` nodes are also missing `#size-cells` properties, but
they are skipped by the macio driver, so leave them alone.
Depends-on: 045b14ca5c36 ("of: WARN on deprecated #address-cells/#size-cells handling")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20241126025710.591683-1-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4fbd66d8254cedfd1218393f39d83b6c07a01917 ]
Fix the dtc warnings:
arch/mips/boot/dts/loongson/ls7a-pch.dtsi:68.16-416.5: Warning (interrupt_provider): /bus@10000000/pci@1a000000: '#interrupt-cells' found, but node is not an interrupt provider
arch/mips/boot/dts/loongson/ls7a-pch.dtsi:68.16-416.5: Warning (interrupt_provider): /bus@10000000/pci@1a000000: '#interrupt-cells' found, but node is not an interrupt provider
arch/mips/boot/dts/loongson/loongson64g_4core_ls7a.dtb: Warning (interrupt_map): Failed prerequisite 'interrupt_provider'
And a runtime warning introduced in commit 045b14ca5c36 ("of: WARN on
deprecated #address-cells/#size-cells handling"):
WARNING: CPU: 0 PID: 1 at drivers/of/base.c:106 of_bus_n_addr_cells+0x9c/0xe0
Missing '#address-cells' in /bus@10000000/pci@1a000000/pci_bridge@9,0
The fix is similar to commit d89a415ff8d5 ("MIPS: Loongson64: DTS: Fix PCIe
port nodes for ls7a"), which has fixed the issue for ls2k (despite its
subject mentions ls7a).
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 88fd2b70120d52c1010257d36776876941375490 ]
Commit bab1c299f3945ffe79 ("LoongArch: Fix sleeping in atomic context in
setup_tlb_handler()") changes the gfp flag from GFP_KERNEL to GFP_ATOMIC
for alloc_pages_node(). However, for PREEMPT_RT kernels we can still get
a "sleeping in atomic context" error:
[ 0.372259] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[ 0.372266] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
[ 0.372268] preempt_count: 1, expected: 0
[ 0.372270] RCU nest depth: 1, expected: 1
[ 0.372272] 3 locks held by swapper/1/0:
[ 0.372274] #0: 900000000c9f5e60 (&pcp->lock){+.+.}-{3:3}, at: get_page_from_freelist+0x524/0x1c60
[ 0.372294] #1: 90000000087013b8 (rcu_read_lock){....}-{1:3}, at: rt_spin_trylock+0x50/0x140
[ 0.372305] #2: 900000047fffd388 (&zone->lock){+.+.}-{3:3}, at: __rmqueue_pcplist+0x30c/0xea0
[ 0.372314] irq event stamp: 0
[ 0.372316] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[ 0.372322] hardirqs last disabled at (0): [<9000000005947320>] copy_process+0x9c0/0x26e0
[ 0.372329] softirqs last enabled at (0): [<9000000005947320>] copy_process+0x9c0/0x26e0
[ 0.372335] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 0.372341] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.12.0-rc7+ #1891
[ 0.372346] Hardware name: Loongson Loongson-3A5000-7A1000-1w-CRB/Loongson-LS3A5000-7A1000-1w-CRB, BIOS vUDK2018-LoongArch-V2.0.0-prebeta9 10/21/2022
[ 0.372349] Stack : 0000000000000089 9000000005a0db9c 90000000071519c8 9000000100388000
[ 0.372486] 900000010038b890 0000000000000000 900000010038b898 9000000007e53788
[ 0.372492] 900000000815bcc8 900000000815bcc0 900000010038b700 0000000000000001
[ 0.372498] 0000000000000001 4b031894b9d6b725 00000000055ec000 9000000100338fc0
[ 0.372503] 00000000000000c4 0000000000000001 000000000000002d 0000000000000003
[ 0.372509] 0000000000000030 0000000000000003 00000000055ec000 0000000000000003
[ 0.372515] 900000000806d000 9000000007e53788 00000000000000b0 0000000000000004
[ 0.372521] 0000000000000000 0000000000000000 900000000c9f5f10 0000000000000000
[ 0.372526] 90000000076f12d8 9000000007e53788 9000000005924778 0000000000000000
[ 0.372532] 00000000000000b0 0000000000000004 0000000000000000 0000000000070000
[ 0.372537] ...
[ 0.372540] Call Trace:
[ 0.372542] [<9000000005924778>] show_stack+0x38/0x180
[ 0.372548] [<90000000071519c4>] dump_stack_lvl+0x94/0xe4
[ 0.372555] [<900000000599b880>] __might_resched+0x1a0/0x260
[ 0.372561] [<90000000071675cc>] rt_spin_lock+0x4c/0x140
[ 0.372565] [<9000000005cbb768>] __rmqueue_pcplist+0x308/0xea0
[ 0.372570] [<9000000005cbed84>] get_page_from_freelist+0x564/0x1c60
[ 0.372575] [<9000000005cc0d98>] __alloc_pages_noprof+0x218/0x1820
[ 0.372580] [<900000000593b36c>] tlb_init+0x1ac/0x298
[ 0.372585] [<9000000005924b74>] per_cpu_trap_init+0x114/0x140
[ 0.372589] [<9000000005921964>] cpu_probe+0x4e4/0xa60
[ 0.372592] [<9000000005934874>] start_secondary+0x34/0xc0
[ 0.372599] [<900000000715615c>] smpboot_entry+0x64/0x6c
This is because in PREEMPT_RT kernels normal spinlocks are replaced by
rt spinlocks and rt_spin_lock() will cause sleeping. Fix it by disabling
NUMA optimization completely for PREEMPT_RT kernels.
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3b96b895127b7c0aed63d82c974b46340e8466c1 ]
Some computers with CPUs that lack Thunderbolt features use discrete
Thunderbolt chips to add Thunderbolt functionality. These Thunderbolt
chips are located within the chassis; between the Root Port labeled
ExternalFacingPort and the USB-C port.
These Thunderbolt PCIe devices should be labeled as fixed and trusted, as
they are built into the computer. Otherwise, security policies that rely on
those flags may have unintended results, such as preventing USB-C ports
from enumerating.
Detect the above scenario through the process of elimination.
1) Integrated Thunderbolt host controllers already have Thunderbolt
implemented, so anything outside their external facing Root Port is
removable and untrusted.
Detect them using the following properties:
- Most integrated host controllers have the "usb4-host-interface"
ACPI property, as described here:
https://learn.microsoft.com/en-us/windows-hardware/drivers/pci/dsd-for-pcie-root-ports#mapping-native-protocols-pcie-displayport-tunneled-through-usb4-to-usb4-host-routers
- Integrated Thunderbolt PCIe Root Ports before Alder Lake do not
have the "usb4-host-interface" ACPI property. Identify those by
their PCI IDs instead.
2) If a Root Port does not have integrated Thunderbolt capabilities, but
has the "ExternalFacingPort" ACPI property, that means the
manufacturer has opted to use a discrete Thunderbolt host controller
that is built into the computer.
This host controller can be identified by virtue of being located
directly below an external-facing Root Port that lacks integrated
Thunderbolt. Label it as trusted and fixed.
Everything downstream from it is untrusted and removable.
The "ExternalFacingPort" ACPI property is described here:
https://learn.microsoft.com/en-us/windows-hardware/drivers/pci/dsd-for-pcie-root-ports#identifying-externally-exposed-pcie-root-ports
Link: https://lore.kernel.org/r/20240910-trust-tbt-fix-v5-1-7a7a42a5f496@chromium.org
Suggested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Esther Shimanovich <eshimanovich@chromium.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c163e40af9b2331b2c629fd4ec8b703ed4d4ae39 ]
clocksource_delta() has two variants. One with a check for negative motion,
which is only selected by x86. This is a historic leftover as this function
was previously used in the time getter hot paths.
Since 135225a363ae timekeeping_cycles_to_ns() has unconditional protection
against this as a by-product of the protection against 64bit math overflow.
clocksource_delta() is only used in the clocksource watchdog and in
timekeeping_advance(). The extra conditional there is not hurting anyone.
Remove the config option and unconditionally prevent negative motion of the
readout.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241031120328.599430157@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit de20037e1b3c2f2ca97b8c12b8c7bca8abd509a7 ]
Warning at every leaking bits can cause a flood of message, triggering
various stall-warning mechanisms to fire, including CSD locks, which
makes the machine to be unusable.
Track the bits that are being leaked, and only warn when a new bit is
set.
That said, this patch will help with the following issues:
1) It will tell us which bits are being set, so, it is easy to
communicate it back to vendor, and to do a root-cause analyzes.
2) It avoid the machine to be unusable, because, worst case
scenario, the user gets less than 60 WARNs (one per unhandled bit).
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20241001141020.2620361-1-leitao@debian.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a0bd7dacbd51c632b8e2c0500b479af564afadf3 ]
CPU hotplug remove handling triggers the following function
call sequence:
CPUHP_AP_PERF_S390_SF_ONLINE --> s390_pmu_sf_offline_cpu()
...
CPUHP_AP_PERF_ONLINE --> perf_event_exit_cpu()
The s390 CPUMF sampling CPU hotplug handler invokes:
s390_pmu_sf_offline_cpu()
+--> cpusf_pmu_setup()
+--> setup_pmc_cpu()
+--> deallocate_buffers()
This function de-allocates all sampling data buffers (SDBs) allocated
for that CPU at event initialization. It also clears the
PMU_F_RESERVED bit. The CPU is gone and can not be sampled.
With the event still being active on the removed CPU, the CPU event
hotplug support in kernel performance subsystem triggers the
following function calls on the removed CPU:
perf_event_exit_cpu()
+--> perf_event_exit_cpu_context()
+--> __perf_event_exit_context()
+--> __perf_remove_from_context()
+--> event_sched_out()
+--> cpumsf_pmu_del()
+--> cpumsf_pmu_stop()
+--> hw_perf_event_update()
to stop and remove the event. During removal of the event, the
sampling device driver tries to read out the remaining samples from
the sample data buffers (SDBs). But they have already been freed
(and may have been re-assigned). This may lead to a use after free
situation in which case the samples are most likely invalid. In the
best case the memory has not been reassigned and still contains
valid data.
Remedy this situation and check if the CPU is still in reserved
state (bit PMU_F_RESERVED set). In this case the SDBs have not been
released an contain valid data. This is always the case when
the event is removed (and no CPU hotplug off occured).
If the PMU_F_RESERVED bit is not set, the SDB buffers are gone.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 25f39d3dcb48bbc824a77d16b3d977f0f3713cfe ]
Ensure that VFs used in isolation, that is with their parent PF not
visible to the configuration but with their RID exposed, are treated
compatibly with existing isolated VF use cases without exposed RID
including RoCE Express VFs. This allows creating configurations where
one LPAR manages PFs while their child VFs are used by other LPARs. This
gives the LPAR managing the PFs a role analogous to that of the
hypervisor in a typical use case of passing child VFs to guests.
Instead of creating a multifunction struct zpci_bus whenever a PCI
function with RID exposed is discovered only create such a bus for
configured physical functions and only consider multifunction busses
when searching for an existing bus. Additionally only set zdev->devfn to
the devfn part of the RID once the function is added to a multifunction
bus.
This also fixes probing of more than 7 such isolated VFs from the same
physical bus. This is because common PCI code in pci_scan_slot() only
looks for more functions when pdev->multifunction is set which somewhat
counter intutively is not the case for VFs.
Note that PFs are looked at before their child VFs is guaranteed because
we sort the zpci_list by RID ascending.
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 126034faaac5f356822c4a9bebfa75664da11056 ]
The newly introduced topology ID (TID) field in the CLP Query PCI
Function explicitly identifies groups of PCI functions whose RIDs belong
to the same (sub-)topology. When available use the TID instead of the
PCHID to match zPCI busses/domains for multi-function devices. Note that
currently only a single PCI bus per TID is supported. This change is
required because in future machines the PCHID will not identify a PCI
card but a specific port in the case of some multi-port NICs while from
a PCI point of view the entire card is a subtopology.
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0467cdde8c4320bbfdb31a8cff1277b202f677fc ]
Instead of relying on the observed but not architected firmware behavior
that PCI functions from the same card are listed in ascending RID order
in clp_list_pci() ensure this by sorting. To allow for sorting separate
the initial clp_list_pci() and creation of the virtual PCI busses.
Note that fundamentally in our per-PCI function hotplug design non RID
order of discovery is still possible. For example when the two PFs of
a two port NIC are hotplugged after initial boot and in descending RID
order. In this case the virtual PCI bus would be created by the second
PF using that PF's UID as domain number instead of that of the first PF.
Thus the domain number would then change from the UID of the second PF
to that of the first PF on reboot but there is really nothing we can do
about that since changing domain numbers at runtime seems even worse.
This only impacts the domain number as the RIDs are consistent and thus
even with just the second PF visible it will show up in the correct
position on the virtual bus.
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit d0ceea662d459726487030237689835fcc0483e5 upstream.
The set_p4d() and set_pgd() functions (in 4-level or 5-level page table setups
respectively) assume that the root page table is actually a 8KiB allocation,
with the userspace root immediately after the kernel root page table (so that
the former can enforce NX on on all the subordinate page tables, which are
actually shared).
However, users of the kernel_ident_mapping_init() code do not give it an 8KiB
allocation for its PGD. Both swsusp_arch_resume() and acpi_mp_setup_reset()
allocate only a single 4KiB page. The kexec code on x86_64 currently gets
away with it purely by chance, because it allocates 8KiB for its "control
code page" and then actually uses the first half for the PGD, then copies the
actual trampoline code into the second half only after the identmap code has
finished scribbling over it.
Fix this by defining a _PAGE_NOPTISHADOW bit (which can use the same bit as
_PAGE_SAVED_DIRTY since one is only for the PGD/P4D root and the other is
exclusively for leaf PTEs.). This instructs __pti_set_user_pgtbl() not to
write to the userspace 'shadow' PGD.
Strictly, the _PAGE_NOPTISHADOW bit doesn't need to be written out to the
actual page tables; since __pti_set_user_pgtbl() returns the value to be
written to the kernel page table, it could be filtered out. But there seems
to be no benefit to actually doing so.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/412c90a4df7aef077141d9f68d19cbe5602d6c6d.camel@infradead.org
Cc: stable@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 73da582a476ea6e3512f89f8ed57dfed945829a2 upstream.
The rework of possible CPUs management erroneously disabled SMP when the
IO/APIC is disabled either by the 'noapic' command line parameter or during
IO/APIC setup. SMP is possible without IO/APIC.
Remove the ioapic_is_disabled conditions from the relevant possible CPU
management code paths to restore the orgininal behaviour.
Fixes: 7c0edad3643f ("x86/cpu/topology: Rework possible CPU management")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241202145905.1482-1-ffmancera@riseup.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c9a4b55431e5220347881e148725bed69c84e037 upstream.
Under some conditions, MONITOR wakeups on Lunar Lake processors
can be lost, resulting in significant user-visible delays.
Add Lunar Lake to X86_BUG_MONITOR so that wake_up_idle_cpu()
always sends an IPI, avoiding this potential delay.
Reported originally here:
https://bugzilla.kernel.org/show_bug.cgi?id=219364
[ dhansen: tweak subject ]
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/a4aa8842a3c3bfdb7fe9807710eef159cbf0e705.1731463305.git.len.brown%40intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 07fa619f2a40c221ea27747a3323cabc59ab25eb upstream.
The restore_processor_state() function explicitly states that "the asm code
that gets us here will have restored a usable GDT". That wasn't true in the
case of returning from a ::preserve_context kexec. Make it so.
Without this, the kernel was depending on the called function to reload a
GDT which is appropriate for the kernel before returning.
Test program:
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <linux/kexec.h>
#include <linux/reboot.h>
#include <sys/reboot.h>
#include <sys/syscall.h>
int main (void)
{
struct kexec_segment segment = {};
unsigned char purgatory[] = {
0x66, 0xba, 0xf8, 0x03, // mov $0x3f8, %dx
0xb0, 0x42, // mov $0x42, %al
0xee, // outb %al, (%dx)
0xc3, // ret
};
int ret;
segment.buf = &purgatory;
segment.bufsz = sizeof(purgatory);
segment.mem = (void *)0x400000;
segment.memsz = 0x1000;
ret = syscall(__NR_kexec_load, 0x400000, 1, &segment, KEXEC_PRESERVE_CONTEXT);
if (ret) {
perror("kexec_load");
exit(1);
}
ret = syscall(__NR_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, LINUX_REBOOT_CMD_KEXEC);
if (ret) {
perror("kexec reboot");
exit(1);
}
printf("Success\n");
return 0;
}
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241205153343.3275139-2-dwmw2@infradead.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9677be09e5e4fbe48aeccb06ae3063c5eba331c3 upstream.
Linux remembers cpu_cachinfo::num_leaves per CPU, but x86 initializes all
CPUs from the same global "num_cache_leaves".
This is erroneous on systems such as Meteor Lake, where each CPU has a
distinct num_leaves value. Delete the global "num_cache_leaves" and
initialize num_leaves on each CPU.
init_cache_level() no longer needs to set num_leaves. Also, it never had to
set num_levels as it is unnecessary in x86. Keep checking for zero cache
leaves. Such condition indicates a bug.
[ bp: Cleanup. ]
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org # 6.3+
Link: https://lore.kernel.org/r/20241128002247.26726-3-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 594bfc4947c4fcabba1318d8384c61a29a6b89fb upstream.
Currently poe_set() doesn't initialize the temporary 'ctrl' variable,
and a SETREGSET call with a length of zero will leave this
uninitialized. Consequently an arbitrary value will be written back to
target->thread.por_el0, potentially leaking up to 64 bits of memory from
the kernel stack. The read is limited to a specific slot on the stack,
and the issue does not provide a write mechanism.
Fix this by initializing the temporary value before copying the regset
from userspace, as for other regsets (e.g. NT_PRSTATUS, NT_PRFPREG,
NT_ARM_SYSTEM_CALL). In the case of a zero-length write, the existing
contents of POR_EL1 will be retained.
Before this patch:
| # ./poe-test
| Attempting to write NT_ARM_POE::por_el0 = 0x900d900d900d900d
| SETREGSET(nt=0x40f, len=8) wrote 8 bytes
|
| Attempting to read NT_ARM_POE::por_el0
| GETREGSET(nt=0x40f, len=8) read 8 bytes
| Read NT_ARM_POE::por_el0 = 0x900d900d900d900d
|
| Attempting to write NT_ARM_POE (zero length)
| SETREGSET(nt=0x40f, len=0) wrote 0 bytes
|
| Attempting to read NT_ARM_POE::por_el0
| GETREGSET(nt=0x40f, len=8) read 8 bytes
| Read NT_ARM_POE::por_el0 = 0xffff8000839c3d50
After this patch:
| # ./poe-test
| Attempting to write NT_ARM_POE::por_el0 = 0x900d900d900d900d
| SETREGSET(nt=0x40f, len=8) wrote 8 bytes
|
| Attempting to read NT_ARM_POE::por_el0
| GETREGSET(nt=0x40f, len=8) read 8 bytes
| Read NT_ARM_POE::por_el0 = 0x900d900d900d900d
|
| Attempting to write NT_ARM_POE (zero length)
| SETREGSET(nt=0x40f, len=0) wrote 0 bytes
|
| Attempting to read NT_ARM_POE::por_el0
| GETREGSET(nt=0x40f, len=8) read 8 bytes
| Read NT_ARM_POE::por_el0 = 0x900d900d900d900d
Fixes: 175198199262 ("arm64/ptrace: add support for FEAT_POE")
Cc: <stable@vger.kernel.org> # 6.12.x
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241205121655.1824269-4-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|