kernel/linux.git/arch, branch v6.6.66

KVM: x86/mmu: Ensure that kvm_release_pfn_clean() takes exact pfn from kvm_faultin_pfn()

2024-12-14T19:00:21+00:00

Since 5.16 and prior to 6.13 KVM can't be used with FSDAX guest memory (PMD pages). To reproduce the issue you need to reserve guest memory with `memmap=` cmdline, create and mount FS in DAX mode (tested both XFS and ext4), see doc link below. ndctl command for test: ndctl create-namespace -v -e namespace1.0 --map=dev --mode=fsdax -a 2M Then pass memory object to qemu like: -m 8G -object memory-backend-file,id=ram0,size=8G,\ mem-path=/mnt/pmem/guestmem,share=on,prealloc=on,dump=off,align=2097152 \ -numa node,memdev=ram0,cpus=0-1 QEMU fails to run guest with error: kvm run failed Bad address and there are two warnings in dmesg: WARN_ON_ONCE(!page_count(page)) in kvm_is_zone_device_page() and WARN_ON_ONCE(folio_ref_count(folio) <= 0) in try_grab_folio() (v6.6.63) It looks like in the past assumption was made that pfn won't change from faultin_pfn() to release_pfn_clean(), e.g. see commit 4cd071d13c5c ("KVM: x86/mmu: Move calls to thp_adjust() down a level") But kvm_page_fault structure made pfn part of mutable state, so now release_pfn_clean() can take hugepage-adjusted pfn. And it works for all cases (/dev/shm, hugetlb, devdax) except fsdax. Apparently in fsdax mode faultin-pfn and adjusted-pfn may refer to different folios, so we're getting get_page/put_page imbalance. To solve this preserve faultin pfn in separate local variable and pass it in kvm_release_pfn_clean(). Patch tested for all mentioned guest memory backends with tdp_mmu={0,1}. No bug in upstream as it was solved fundamentally by commit 8dd861cc07e2 ("KVM: x86/mmu: Put refcounted pages instead of blindly releasing pfns") and related patch series. Link: https://nvdimm.docs.kernel.org/2mib_fs_dax.html Fixes: 2f6305dd5676 ("KVM: MMU: change kvm_tdp_mmu_map() arguments to kvm_page_fault") Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Reviewed-by: Sean Christopherson Signed-off-by: Nikolay Kuratov Signed-off-by: Greg Kroah-Hartman

x86: Fix build regression with CONFIG_KEXEC_JUMP enabled

2024-12-14T19:00:20+00:00

[ Upstream commit aeb68937614f4aeceaaa762bd7f0212ce842b797 ] Build 6.13-rc12 for x86_64 with gcc 14.2.1 fails with the error: ld: vmlinux.o: in function `virtual_mapped': linux/arch/x86/kernel/relocate_kernel_64.S:249:(.text+0x5915b): undefined reference to `saved_context_gdt_desc' when CONFIG_KEXEC_JUMP is enabled. This was introduced by commit 07fa619f2a40 ("x86/kexec: Restore GDT on return from ::preserve_context kexec") which introduced a use of saved_context_gdt_desc without a declaration for it. Fix that by including asm/asm-offsets.h where saved_context_gdt_desc is defined (indirectly in include/generated/asm-offsets.h which asm/asm-offsets.h includes). Fixes: 07fa619f2a40 ("x86/kexec: Restore GDT on return from ::preserve_context kexec") Signed-off-by: Damien Le Moal Acked-by: Borislav Petkov (AMD) Acked-by: David Woodhouse Closes: https://lore.kernel.org/oe-kbuild-all/202411270006.ZyyzpYf8-lkp@intel.com/ Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin

powerpc/prom_init: Fixup missing powermac #size-cells

2024-12-14T19:00:16+00:00

[ Upstream commit cf89c9434af122f28a3552e6f9cc5158c33ce50a ] On some powermacs `escc` nodes are missing `#size-cells` properties, which is deprecated and now triggers a warning at boot since commit 045b14ca5c36 ("of: WARN on deprecated #address-cells/#size-cells handling"). For example: Missing '#size-cells' in /pci@f2000000/mac-io@c/escc@13000 WARNING: CPU: 0 PID: 0 at drivers/of/base.c:133 of_bus_n_size_cells+0x98/0x108 Hardware name: PowerMac3,1 7400 0xc0209 PowerMac ... Call Trace: of_bus_n_size_cells+0x98/0x108 (unreliable) of_bus_default_count_cells+0x40/0x60 __of_get_address+0xc8/0x21c __of_address_to_resource+0x5c/0x228 pmz_init_port+0x5c/0x2ec pmz_probe.isra.0+0x144/0x1e4 pmz_console_init+0x10/0x48 console_init+0xcc/0x138 start_kernel+0x5c4/0x694 As powermacs boot via prom_init it's possible to add the missing properties to the device tree during boot, avoiding the warning. Note that `escc-legacy` nodes are also missing `#size-cells` properties, but they are skipped by the macio driver, so leave them alone. Depends-on: 045b14ca5c36 ("of: WARN on deprecated #address-cells/#size-cells handling") Signed-off-by: Michael Ellerman Reviewed-by: Rob Herring Signed-off-by: Madhavan Srinivasan Link: https://patch.msgid.link/20241126025710.591683-1-mpe@ellerman.id.au Signed-off-by: Sasha Levin

MIPS: Loongson64: DTS: Really fix PCIe port nodes for ls7a

2024-12-14T19:00:16+00:00

[ Upstream commit 4fbd66d8254cedfd1218393f39d83b6c07a01917 ] Fix the dtc warnings: arch/mips/boot/dts/loongson/ls7a-pch.dtsi:68.16-416.5: Warning (interrupt_provider): /bus@10000000/pci@1a000000: '#interrupt-cells' found, but node is not an interrupt provider arch/mips/boot/dts/loongson/ls7a-pch.dtsi:68.16-416.5: Warning (interrupt_provider): /bus@10000000/pci@1a000000: '#interrupt-cells' found, but node is not an interrupt provider arch/mips/boot/dts/loongson/loongson64g_4core_ls7a.dtb: Warning (interrupt_map): Failed prerequisite 'interrupt_provider' And a runtime warning introduced in commit 045b14ca5c36 ("of: WARN on deprecated #address-cells/#size-cells handling"): WARNING: CPU: 0 PID: 1 at drivers/of/base.c:106 of_bus_n_addr_cells+0x9c/0xe0 Missing '#address-cells' in /bus@10000000/pci@1a000000/pci_bridge@9,0 The fix is similar to commit d89a415ff8d5 ("MIPS: Loongson64: DTS: Fix PCIe port nodes for ls7a"), which has fixed the issue for ls2k (despite its subject mentions ls7a). Signed-off-by: Xi Ruoyao Signed-off-by: Thomas Bogendoerfer Signed-off-by: Sasha Levin

LoongArch: Fix sleeping in atomic context for PREEMPT_RT

2024-12-14T19:00:15+00:00

[ Upstream commit 88fd2b70120d52c1010257d36776876941375490 ] Commit bab1c299f3945ffe79 ("LoongArch: Fix sleeping in atomic context in setup_tlb_handler()") changes the gfp flag from GFP_KERNEL to GFP_ATOMIC for alloc_pages_node(). However, for PREEMPT_RT kernels we can still get a "sleeping in atomic context" error: [ 0.372259] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 [ 0.372266] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1 [ 0.372268] preempt_count: 1, expected: 0 [ 0.372270] RCU nest depth: 1, expected: 1 [ 0.372272] 3 locks held by swapper/1/0: [ 0.372274] #0: 900000000c9f5e60 (&pcp->lock){+.+.}-{3:3}, at: get_page_from_freelist+0x524/0x1c60 [ 0.372294] #1: 90000000087013b8 (rcu_read_lock){....}-{1:3}, at: rt_spin_trylock+0x50/0x140 [ 0.372305] #2: 900000047fffd388 (&zone->lock){+.+.}-{3:3}, at: __rmqueue_pcplist+0x30c/0xea0 [ 0.372314] irq event stamp: 0 [ 0.372316] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [ 0.372322] hardirqs last disabled at (0): [<9000000005947320>] copy_process+0x9c0/0x26e0 [ 0.372329] softirqs last enabled at (0): [<9000000005947320>] copy_process+0x9c0/0x26e0 [ 0.372335] softirqs last disabled at (0): [<0000000000000000>] 0x0 [ 0.372341] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.12.0-rc7+ #1891 [ 0.372346] Hardware name: Loongson Loongson-3A5000-7A1000-1w-CRB/Loongson-LS3A5000-7A1000-1w-CRB, BIOS vUDK2018-LoongArch-V2.0.0-prebeta9 10/21/2022 [ 0.372349] Stack : 0000000000000089 9000000005a0db9c 90000000071519c8 9000000100388000 [ 0.372486] 900000010038b890 0000000000000000 900000010038b898 9000000007e53788 [ 0.372492] 900000000815bcc8 900000000815bcc0 900000010038b700 0000000000000001 [ 0.372498] 0000000000000001 4b031894b9d6b725 00000000055ec000 9000000100338fc0 [ 0.372503] 00000000000000c4 0000000000000001 000000000000002d 0000000000000003 [ 0.372509] 0000000000000030 0000000000000003 00000000055ec000 0000000000000003 [ 0.372515] 900000000806d000 9000000007e53788 00000000000000b0 0000000000000004 [ 0.372521] 0000000000000000 0000000000000000 900000000c9f5f10 0000000000000000 [ 0.372526] 90000000076f12d8 9000000007e53788 9000000005924778 0000000000000000 [ 0.372532] 00000000000000b0 0000000000000004 0000000000000000 0000000000070000 [ 0.372537] ... [ 0.372540] Call Trace: [ 0.372542] [<9000000005924778>] show_stack+0x38/0x180 [ 0.372548] [<90000000071519c4>] dump_stack_lvl+0x94/0xe4 [ 0.372555] [<900000000599b880>] __might_resched+0x1a0/0x260 [ 0.372561] [<90000000071675cc>] rt_spin_lock+0x4c/0x140 [ 0.372565] [<9000000005cbb768>] __rmqueue_pcplist+0x308/0xea0 [ 0.372570] [<9000000005cbed84>] get_page_from_freelist+0x564/0x1c60 [ 0.372575] [<9000000005cc0d98>] __alloc_pages_noprof+0x218/0x1820 [ 0.372580] [<900000000593b36c>] tlb_init+0x1ac/0x298 [ 0.372585] [<9000000005924b74>] per_cpu_trap_init+0x114/0x140 [ 0.372589] [<9000000005921964>] cpu_probe+0x4e4/0xa60 [ 0.372592] [<9000000005934874>] start_secondary+0x34/0xc0 [ 0.372599] [<900000000715615c>] smpboot_entry+0x64/0x6c This is because in PREEMPT_RT kernels normal spinlocks are replaced by rt spinlocks and rt_spin_lock() will cause sleeping. Fix it by disabling NUMA optimization completely for PREEMPT_RT kernels. Signed-off-by: Huacai Chen Signed-off-by: Sasha Levin

PCI: Detect and trust built-in Thunderbolt chips

2024-12-14T19:00:14+00:00

[ Upstream commit 3b96b895127b7c0aed63d82c974b46340e8466c1 ] Some computers with CPUs that lack Thunderbolt features use discrete Thunderbolt chips to add Thunderbolt functionality. These Thunderbolt chips are located within the chassis; between the Root Port labeled ExternalFacingPort and the USB-C port. These Thunderbolt PCIe devices should be labeled as fixed and trusted, as they are built into the computer. Otherwise, security policies that rely on those flags may have unintended results, such as preventing USB-C ports from enumerating. Detect the above scenario through the process of elimination. 1) Integrated Thunderbolt host controllers already have Thunderbolt implemented, so anything outside their external facing Root Port is removable and untrusted. Detect them using the following properties: - Most integrated host controllers have the "usb4-host-interface" ACPI property, as described here: https://learn.microsoft.com/en-us/windows-hardware/drivers/pci/dsd-for-pcie-root-ports#mapping-native-protocols-pcie-displayport-tunneled-through-usb4-to-usb4-host-routers - Integrated Thunderbolt PCIe Root Ports before Alder Lake do not have the "usb4-host-interface" ACPI property. Identify those by their PCI IDs instead. 2) If a Root Port does not have integrated Thunderbolt capabilities, but has the "ExternalFacingPort" ACPI property, that means the manufacturer has opted to use a discrete Thunderbolt host controller that is built into the computer. This host controller can be identified by virtue of being located directly below an external-facing Root Port that lacks integrated Thunderbolt. Label it as trusted and fixed. Everything downstream from it is untrusted and removable. The "ExternalFacingPort" ACPI property is described here: https://learn.microsoft.com/en-us/windows-hardware/drivers/pci/dsd-for-pcie-root-ports#identifying-externally-exposed-pcie-root-ports Link: https://lore.kernel.org/r/20240910-trust-tbt-fix-v5-1-7a7a42a5f496@chromium.org Suggested-by: Mika Westerberg Signed-off-by: Esther Shimanovich Signed-off-by: Bjorn Helgaas Tested-by: Mika Westerberg Tested-by: Mario Limonciello Reviewed-by: Mika Westerberg Reviewed-by: Mario Limonciello Signed-off-by: Sasha Levin

perf/x86/amd: Warn only on new bits set

2024-12-14T18:59:59+00:00

[ Upstream commit de20037e1b3c2f2ca97b8c12b8c7bca8abd509a7 ] Warning at every leaking bits can cause a flood of message, triggering various stall-warning mechanisms to fire, including CSD locks, which makes the machine to be unusable. Track the bits that are being leaked, and only warn when a new bit is set. That said, this patch will help with the following issues: 1) It will tell us which bits are being set, so, it is easy to communicate it back to vendor, and to do a root-cause analyzes. 2) It avoid the machine to be unusable, because, worst case scenario, the user gets less than 60 WARNs (one per unhandled bit). Suggested-by: Paul E. McKenney Signed-off-by: Breno Leitao Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Sandipan Das Reviewed-by: Paul E. McKenney Link: https://lkml.kernel.org/r/20241001141020.2620361-1-leitao@debian.org Signed-off-by: Sasha Levin

s390/cpum_sf: Handle CPU hotplug remove during sampling

2024-12-14T18:59:58+00:00

[ Upstream commit a0bd7dacbd51c632b8e2c0500b479af564afadf3 ] CPU hotplug remove handling triggers the following function call sequence: CPUHP_AP_PERF_S390_SF_ONLINE --> s390_pmu_sf_offline_cpu() ... CPUHP_AP_PERF_ONLINE --> perf_event_exit_cpu() The s390 CPUMF sampling CPU hotplug handler invokes: s390_pmu_sf_offline_cpu() +--> cpusf_pmu_setup() +--> setup_pmc_cpu() +--> deallocate_buffers() This function de-allocates all sampling data buffers (SDBs) allocated for that CPU at event initialization. It also clears the PMU_F_RESERVED bit. The CPU is gone and can not be sampled. With the event still being active on the removed CPU, the CPU event hotplug support in kernel performance subsystem triggers the following function calls on the removed CPU: perf_event_exit_cpu() +--> perf_event_exit_cpu_context() +--> __perf_event_exit_context() +--> __perf_remove_from_context() +--> event_sched_out() +--> cpumsf_pmu_del() +--> cpumsf_pmu_stop() +--> hw_perf_event_update() to stop and remove the event. During removal of the event, the sampling device driver tries to read out the remaining samples from the sample data buffers (SDBs). But they have already been freed (and may have been re-assigned). This may lead to a use after free situation in which case the samples are most likely invalid. In the best case the memory has not been reassigned and still contains valid data. Remedy this situation and check if the CPU is still in reserved state (bit PMU_F_RESERVED set). In this case the SDBs have not been released an contain valid data. This is always the case when the event is removed (and no CPU hotplug off occured). If the PMU_F_RESERVED bit is not set, the SDB buffers are gone. Signed-off-by: Thomas Richter Reviewed-by: Hendrik Brueckner Signed-off-by: Heiko Carstens Signed-off-by: Sasha Levin

x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating userspace page tables

2024-12-14T18:59:58+00:00

commit d0ceea662d459726487030237689835fcc0483e5 upstream. The set_p4d() and set_pgd() functions (in 4-level or 5-level page table setups respectively) assume that the root page table is actually a 8KiB allocation, with the userspace root immediately after the kernel root page table (so that the former can enforce NX on on all the subordinate page tables, which are actually shared). However, users of the kernel_ident_mapping_init() code do not give it an 8KiB allocation for its PGD. Both swsusp_arch_resume() and acpi_mp_setup_reset() allocate only a single 4KiB page. The kexec code on x86_64 currently gets away with it purely by chance, because it allocates 8KiB for its "control code page" and then actually uses the first half for the PGD, then copies the actual trampoline code into the second half only after the identmap code has finished scribbling over it. Fix this by defining a _PAGE_NOPTISHADOW bit (which can use the same bit as _PAGE_SAVED_DIRTY since one is only for the PGD/P4D root and the other is exclusively for leaf PTEs.). This instructs __pti_set_user_pgtbl() not to write to the userspace 'shadow' PGD. Strictly, the _PAGE_NOPTISHADOW bit doesn't need to be written out to the actual page tables; since __pti_set_user_pgtbl() returns the value to be written to the kernel page table, it could be filtered out. But there seems to be no benefit to actually doing so. Suggested-by: Dave Hansen Signed-off-by: David Woodhouse Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/412c90a4df7aef077141d9f68d19cbe5602d6c6d.camel@infradead.org Cc: stable@kernel.org Cc: Linus Torvalds Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Rik van Riel Signed-off-by: Greg Kroah-Hartman

x86/kexec: Restore GDT on return from ::preserve_context kexec

2024-12-14T18:59:56+00:00

commit 07fa619f2a40c221ea27747a3323cabc59ab25eb upstream. The restore_processor_state() function explicitly states that "the asm code that gets us here will have restored a usable GDT". That wasn't true in the case of returning from a ::preserve_context kexec. Make it so. Without this, the kernel was depending on the called function to reload a GDT which is appropriate for the kernel before returning. Test program: #include #include #include #include #include #include #include #include int main (void) { struct kexec_segment segment = {}; unsigned char purgatory[] = { 0x66, 0xba, 0xf8, 0x03, // mov $0x3f8, %dx 0xb0, 0x42, // mov $0x42, %al 0xee, // outb %al, (%dx) 0xc3, // ret }; int ret; segment.buf = &purgatory; segment.bufsz = sizeof(purgatory); segment.mem = (void *)0x400000; segment.memsz = 0x1000; ret = syscall(__NR_kexec_load, 0x400000, 1, &segment, KEXEC_PRESERVE_CONTEXT); if (ret) { perror("kexec_load"); exit(1); } ret = syscall(__NR_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, LINUX_REBOOT_CMD_KEXEC); if (ret) { perror("kexec reboot"); exit(1); } printf("Success\n"); return 0; } Signed-off-by: David Woodhouse Signed-off-by: Ingo Molnar Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20241205153343.3275139-2-dwmw2@infradead.org Signed-off-by: Greg Kroah-Hartman