summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)AuthorFilesLines
2025-02-13um: convert irq_lock to raw spinlockJohannes Berg1-32/+47
Since this is deep in the architecture, and the code is called nested into other deep management code, this really needs to be a raw spinlock. Convert it. Link: https://patch.msgid.link/20250110125550.32479-8-johannes@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2025-02-13um: virtio_uml: use raw spinlockJohannes Berg1-4/+4
This is needed because at least in time-travel the code can be called directly from the deep architecture and IRQ handling code. Link: https://patch.msgid.link/20250110125550.32479-7-johannes@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2025-02-13um: virt-pci: don't use kmalloc()Johannes Berg1-96/+102
This code can be called deep in the IRQ handling, for example, and then cannot normally use kmalloc(). Have its own pre-allocated memory and use from there instead so this doesn't occur. Only in the (very rare) case of memcpy_toio() we'd still need to allocate memory. Link: https://patch.msgid.link/20250110125550.32479-6-johannes@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2025-02-13um: fix execve stub execution on old host OSsBenjamin Berg1-3/+13
The stub execution uses the somewhat new close_range and execveat syscalls. Of these two, the execveat call is essential, but the close_range call is more about stub process hygiene rather than safety (and its result is ignored). Replace both calls with a raw syscall as older machines might not have a recent enough kernel for close_range (with CLOSE_RANGE_CLOEXEC) or a libc that does not yet expose both of the syscalls. Fixes: 32e8eaf263d9 ("um: use execveat to create userspace MMs") Reported-by: Glenn Washburn <development@efficientek.com> Closes: https://lore.kernel.org/20250108022404.05e0de1e@crass-HP-ZBook-15-G2 Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250113094107.674738-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2025-02-13um: properly align signal stack on x86_64Benjamin Berg1-3/+5
The stack needs to be properly aligned so 16 byte memory accesses on the stack are correct. This was broken when introducing the dynamic math register sizing as the rounding was not moved appropriately. Fixes: 3f17fed21491 ("um: switch to regset API and depend on XSTATE") Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250107133509.265576-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2025-02-13um: avoid copying FP state from init_taskBenjamin Berg1-1/+9
The init_task instance of struct task_struct is statically allocated and does not contain the dynamic area for the userspace FP registers. As such, limit the copy to the valid area of init_task and fill the rest with zero. Note that the FP state is only needed for userspace, and as such it is entirely reasonable for init_task to not contain it. Reported-by: Brian Norris <briannorris@chromium.org> Closes: https://lore.kernel.org/Z1ySXmjZm-xOqk90@google.com Fixes: 3f17fed21491 ("um: switch to regset API and depend on XSTATE") Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241217202745.1402932-3-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2025-02-13um: add back support for FXSAVE registersBenjamin Berg2-3/+23
It was reported that qemu may not enable the XSTATE CPU extension, which is a requirement after commit 3f17fed21491 ("um: switch to regset API and depend on XSTATE"). Add a fallback to use FXSAVE (FP registers on x86_64 and XFP on i386) which is just a shorter version of the same data. The only difference is that the XSTATE magic should not be set in the signal frame. Note that this still drops support for the older i386 FP register layout as supporting this would require more backward compatibility to build a correct signal frame. Fixes: 3f17fed21491 ("um: switch to regset API and depend on XSTATE") Reported-by: SeongJae Park <sj@kernel.org> Closes: https://lore.kernel.org/r/20241203070218.240797-1-sj@kernel.org Tested-by: SeongJae Park <sj@kernel.org> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241204074827.1582917-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2025-02-12KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loopSean Christopherson7-11/+19
Move the conditional loading of hardware DR6 with the guest's DR6 value out of the core .vcpu_run() loop to fix a bug where KVM can load hardware with a stale vcpu->arch.dr6. When the guest accesses a DR and host userspace isn't debugging the guest, KVM disables DR interception and loads the guest's values into hardware on VM-Enter and saves them on VM-Exit. This allows the guest to access DRs at will, e.g. so that a sequence of DR accesses to configure a breakpoint only generates one VM-Exit. For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest) and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop. But for DR6, the guest's value doesn't need to be loaded into hardware for KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas VMX requires software to manually load the guest value, and so loading the guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done _inside_ the core run loop. Unfortunately, saving the guest values on VM-Exit is initiated by common x86, again outside of the core run loop. If the guest modifies DR6 (in hardware, when DR interception is disabled), and then the next VM-Exit is a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and clobber the guest's actual value. The bug shows up primarily with nested VMX because KVM handles the VMX preemption timer in the fastpath, and the window between hardware DR6 being modified (in guest context) and DR6 being read by guest software is orders of magnitude larger in a nested setup. E.g. in non-nested, the VMX preemption timer would need to fire precisely between #DB injection and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the window where hardware DR6 is "dirty" extends all the way from L1 writing DR6 to VMRESUME (in L1). L1's view: ========== <L1 disables DR interception> CPU 0/KVM-7289 [023] d.... 2925.640961: kvm_entry: vcpu 0 A: L1 Writes DR6 CPU 0/KVM-7289 [023] d.... 2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1 B: CPU 0/KVM-7289 [023] d.... 2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec D: L1 reads DR6, arch.dr6 = 0 CPU 0/KVM-7289 [023] d.... 2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0 CPU 0/KVM-7289 [023] d.... 2925.640976: kvm_entry: vcpu 0 L2 reads DR6, L1 disables DR interception CPU 0/KVM-7289 [023] d.... 2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216 CPU 0/KVM-7289 [023] d.... 2925.640983: kvm_entry: vcpu 0 CPU 0/KVM-7289 [023] d.... 2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0 L2 detects failure CPU 0/KVM-7289 [023] d.... 2925.640987: kvm_exit: vcpu 0 reason HLT L1 reads DR6 (confirms failure) CPU 0/KVM-7289 [023] d.... 2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0 L0's view: ========== L2 reads DR6, arch.dr6 = 0 CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216 CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216 L2 => L1 nested VM-Exit CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216 CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_entry: vcpu 23 CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_exit: vcpu 23 reason VMREAD CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_entry: vcpu 23 CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason VMREAD CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_entry: vcpu 23 L1 writes DR7, L0 disables DR interception CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007 CPU 23/KVM-5046 [001] d.... 3410.005613: kvm_entry: vcpu 23 L0 writes DR6 = 0 (arch.dr6) CPU 23/KVM-5046 [001] d.... 3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0 A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'> B: CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_entry: vcpu 23 C: L0 writes DR6 = 0 (arch.dr6) CPU 23/KVM-5046 [001] d.... 3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0 L1 => L2 nested VM-Enter CPU 23/KVM-5046 [001] d.... 3410.005616: kvm_exit: vcpu 23 reason VMRESUME L0 reads DR6, arch.dr6 = 0 Reported-by: John Stultz <jstultz@google.com> Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com Fixes: 375e28ffc0cf ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT") Fixes: d67668e9dd76 ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Tested-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: nSVM: Enter guest mode before initializing nested NPT MMUSean Christopherson2-6/+6
When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest mode prior to initializing the MMU for nested NPT so that guest_mode is set in the MMU's role. KVM's model is that all L2 MMUs are tagged with guest_mode, as the behavior of hypervisor MMUs tends to be significantly different than kernel MMUs. Practically speaking, the bug is relatively benign, as KVM only directly queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths that are optimizations (mmu_page_zap_pte() and shadow_mmu_try_split_huge_pages()). And while the role is incorprated into shadow page usage, because nested NPT requires KVM to be using NPT for L1, reusing shadow pages across L1 and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs will have direct=0. Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning of the flow instead of forcing guest_mode in the MMU, as nothing in nested_vmcb02_prepare_control() between the old and new locations touches TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present inconsistent vCPU state to the MMU. Fixes: 69cb877487de ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control") Cc: stable@vger.kernel.org Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Reject Hyper-V's SEND_IPI hypercalls if local APIC isn't in-kernelSean Christopherson1-1/+5
Advertise support for Hyper-V's SEND_IPI and SEND_IPI_EX hypercalls if and only if the local API is emulated/virtualized by KVM, and explicitly reject said hypercalls if the local APIC is emulated in userspace, i.e. don't rely on userspace to opt-in to KVM_CAP_HYPERV_ENFORCE_CPUID. Rejecting SEND_IPI and SEND_IPI_EX fixes a NULL-pointer dereference if Hyper-V enlightenments are exposed to the guest without an in-kernel local APIC: dump_stack+0xbe/0xfd __kasan_report.cold+0x34/0x84 kasan_report+0x3a/0x50 __apic_accept_irq+0x3a/0x5c0 kvm_hv_send_ipi.isra.0+0x34e/0x820 kvm_hv_hypercall+0x8d9/0x9d0 kvm_emulate_hypercall+0x506/0x7e0 __vmx_handle_exit+0x283/0xb60 vmx_handle_exit+0x1d/0xd0 vcpu_enter_guest+0x16b0/0x24c0 vcpu_run+0xc0/0x550 kvm_arch_vcpu_ioctl_run+0x170/0x6d0 kvm_vcpu_ioctl+0x413/0xb20 __se_sys_ioctl+0x111/0x160 do_syscal1_64+0x30/0x40 entry_SYSCALL_64_after_hwframe+0x67/0xd1 Note, checking the sending vCPU is sufficient, as the per-VM irqchip_mode can't be modified after vCPUs are created, i.e. if one vCPU has an in-kernel local APIC, then all vCPUs have an in-kernel local APIC. Reported-by: Dongjie Zou <zoudongjie@huawei.com> Fixes: 214ff83d4473 ("KVM: x86: hyperv: implement PV IPI send hypercalls") Fixes: 2bc39970e932 ("x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID") Cc: stable@vger.kernel.org Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20250118003454.2619573-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12powerpc/code-patching: Fix KASAN hit by not flagging text patching area as ↵Christophe Leroy1-1/+1
VM_ALLOC Erhard reported the following KASAN hit while booting his PowerMac G4 with a KASAN-enabled kernel 6.13-rc6: BUG: KASAN: vmalloc-out-of-bounds in copy_to_kernel_nofault+0xd8/0x1c8 Write of size 8 at addr f1000000 by task chronyd/1293 CPU: 0 UID: 123 PID: 1293 Comm: chronyd Tainted: G W 6.13.0-rc6-PMacG4 #2 Tainted: [W]=WARN Hardware name: PowerMac3,6 7455 0x80010303 PowerMac Call Trace: [c2437590] [c1631a84] dump_stack_lvl+0x70/0x8c (unreliable) [c24375b0] [c0504998] print_report+0xdc/0x504 [c2437610] [c050475c] kasan_report+0xf8/0x108 [c2437690] [c0505a3c] kasan_check_range+0x24/0x18c [c24376a0] [c03fb5e4] copy_to_kernel_nofault+0xd8/0x1c8 [c24376c0] [c004c014] patch_instructions+0x15c/0x16c [c2437710] [c00731a8] bpf_arch_text_copy+0x60/0x7c [c2437730] [c0281168] bpf_jit_binary_pack_finalize+0x50/0xac [c2437750] [c0073cf4] bpf_int_jit_compile+0xb30/0xdec [c2437880] [c0280394] bpf_prog_select_runtime+0x15c/0x478 [c24378d0] [c1263428] bpf_prepare_filter+0xbf8/0xc14 [c2437990] [c12677ec] bpf_prog_create_from_user+0x258/0x2b4 [c24379d0] [c027111c] do_seccomp+0x3dc/0x1890 [c2437ac0] [c001d8e0] system_call_exception+0x2dc/0x420 [c2437f30] [c00281ac] ret_from_syscall+0x0/0x2c --- interrupt: c00 at 0x5a1274 NIP: 005a1274 LR: 006a3b3c CTR: 005296c8 REGS: c2437f40 TRAP: 0c00 Tainted: G W (6.13.0-rc6-PMacG4) MSR: 0200f932 <VEC,EE,PR,FP,ME,IR,DR,RI> CR: 24004422 XER: 00000000 GPR00: 00000166 af8f3fa0 a7ee3540 00000001 00000000 013b6500 005a5858 0200f932 GPR08: 00000000 00001fe9 013d5fc8 005296c8 2822244c 00b2fcd8 00000000 af8f4b57 GPR16: 00000000 00000001 00000000 00000000 00000000 00000001 00000000 00000002 GPR24: 00afdbb0 00000000 00000000 00000000 006e0004 013ce060 006e7c1c 00000001 NIP [005a1274] 0x5a1274 LR [006a3b3c] 0x6a3b3c --- interrupt: c00 The buggy address belongs to the virtual mapping at [f1000000, f1002000) created by: text_area_cpu_up+0x20/0x190 The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x76e30 flags: 0x80000000(zone=2) raw: 80000000 00000000 00000122 00000000 00000000 00000000 ffffffff 00000001 raw: 00000000 page dumped because: kasan: bad access detected Memory state around the buggy address: f0ffff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f0ffff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >f1000000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ^ f1000080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f1000100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ================================================================== f8 corresponds to KASAN_VMALLOC_INVALID which means the area is not initialised hence not supposed to be used yet. Powerpc text patching infrastructure allocates a virtual memory area using get_vm_area() and flags it as VM_ALLOC. But that flag is meant to be used for vmalloc() and vmalloc() allocated memory is not supposed to be used before a call to __vmalloc_node_range() which is never called for that area. That went undetected until commit e4137f08816b ("mm, kasan, kmsan: instrument copy_from/to_kernel_nofault") The area allocated by text_area_cpu_up() is not vmalloc memory, it is mapped directly on demand when needed by map_kernel_page(). There is no VM flag corresponding to such usage, so just pass no flag. That way the area will be unpoisonned and usable immediately. Reported-by: Erhard Furtner <erhard_f@mailbox.org> Closes: https://lore.kernel.org/all/20250112135832.57c92322@yea/ Fixes: 37bc3e5fd764 ("powerpc/lib/code-patching: Use alternate map for patch_instruction()") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/06621423da339b374f48c0886e3a5db18e896be8.1739342693.git.christophe.leroy@csgroup.eu
2025-02-12x86/hyperv/vtl: Stop kernel from probing VTL0 low memoryNaman Jain1-0/+1
For Linux, running in Hyper-V VTL (Virtual Trust Level), kernel in VTL2 tries to access VTL0 low memory in probe_roms. This memory is not described in the e820 map. Initialize probe_roms call to no-ops during boot for VTL2 kernel to avoid this. The issue got identified in OpenVMM which detects invalid accesses initiated from kernel running in VTL2. Co-developed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Naman Jain <namjain@linux.microsoft.com> Tested-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Roman Kisel <romank@linux.microsoft.com> Link: https://lore.kernel.org/r/20250116061224.1701-1-namjain@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20250116061224.1701-1-namjain@linux.microsoft.com>
2025-02-11arm64: dts: rockchip: adjust SMMU interrupt type on rk3588Patrick Wildt1-8/+8
The SMMU architecture requires wired interrupts to be edge triggered, which does not align with the DT description for the RK3588. This leads to interrupt storms, as the SMMU continues to hold the pin high and only pulls it down for a short amount when issuing an IRQ. Update the DT description to be in line with the spec and perceived reality. Signed-off-by: Patrick Wildt <patrick@blueri.se> Fixes: cd81d3a0695c ("arm64: dts: rockchip: add rk3588 pcie and php IOMMUs") Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/Z6pxme2Chmf3d3uK@windev.fritz.box Signed-off-by: Heiko Stuebner <heiko@sntech.de>
2025-02-11arm64: dts: rockchip: disable IOMMU when running rk3588 in PCIe endpoint modeNiklas Cassel2-1/+4
Commit da92d3dfc871 ("arm64: dts: rockchip: enable the mmu600_pcie IOMMU on the rk3588 SoC") enabled the mmu600_pcie IOMMU, both in the normal case (when all PCIe controllers are running in Root Complex mode) and in the case when running the pcie3x4 PCIe controller in Endpoint mode. There have been no issues detected when running the PCIe controllers in Root Complex mode. During PCI probe time, we will add a SID to the IOMMU for each PCI device enumerated on the bus, including the root port itself. However, when running the pcie3x4 PCIe controller in Endpoint mode, we will only add a single SID to the IOMMU (the SID specified in the iommus DT property). The enablement of IOMMU in endpoint mode was verified on setup with two Rock 5b:s, where the BDF of the Root Complex has BDF (00:00.0). A Root Complex sending a TLP to the Endpoint will have Requester ID set to the BDF of the initiator. On the EP side, the Requester ID will then be used as the SID. This works fine if the Root Complex has a BDF that matches the iommus DT property, however, if the Root Complex has any other BDF, we will see something like: arm-smmu-v3 fc900000.iommu: event: C_BAD_STREAMID client: (unassigned sid) sid: 0x1600 ssid: 0x0 on the endpoint side. For PCIe controllers running in endpoint mode that always uses the incoming Requester ID as the SID, the iommus DT property simply isn't a viable solution. (Neither is iommu-map a viable solution, as there is no enumeration done on the endpoint side.) Thus, partly revert commit da92d3dfc871 ("arm64: dts: rockchip: enable the mmu600_pcie IOMMU on the rk3588 SoC") by disabling the PCI IOMMU when running the pcie3x4 PCIe controller in Endpoint mode. Since the PCI IOMMU is working as expected in the normal case, keep it enabled when running all PCIe controllers in Root Complex mode. Fixes: da92d3dfc871 ("arm64: dts: rockchip: enable the mmu600_pcie IOMMU on the rk3588 SoC") Signed-off-by: Niklas Cassel <cassel@kernel.org> Acked-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/20250207143900.2047949-2-cassel@kernel.org Signed-off-by: Heiko Stuebner <heiko@sntech.de>
2025-02-11s390/pci: Fix handling of isolated VFsNiklas Schnelle3-1/+28
In contrast to the commit message of the fixed commit VFs whose parent PF is not configured are not always isolated, that is put on their own PCI domain. This is because for VFs to be added to an existing PCI domain it is enough for that PCI domain to share the same topology ID or PCHID. Such a matching PCI domain without a parent PF may exist when a PF from the same PCI card created the domain with the VF being a child of a different, non accessible, PF. While not causing technical issues it makes the rules which VFs are isolated inconsistent. Fix this by explicitly checking that the parent PF exists on the PCI domain determined by the topology ID or PCHID before registering the VF. This works because a parent PF which is under control of this Linux instance must be enabled and configured at the point where its child VFs appear because otherwise SR-IOV could not have been enabled on the parent. Fixes: 25f39d3dcb48 ("s390/pci: Ignore RID for isolated VFs") Cc: stable@vger.kernel.org Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-02-11s390/pci: Pull search for parent PF out of zpci_iov_setup_virtfn()Niklas Schnelle1-14/+42
This creates a new zpci_iov_find_parent_pf() function which a future commit can use to find if a VF has a configured parent PF. Use zdev->rid instead of zdev->devfn such that the new function can be used before it has been decided if the RID will be exposed and zdev->devfn is set. Also handle the hypotheical case that the RID is not available but there is an otherwise matching zbus. Fixes: 25f39d3dcb48 ("s390/pci: Ignore RID for isolated VFs") Cc: stable@vger.kernel.org Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-02-11s390/bitops: Disable arch_test_bit() optimization for PROFILE_ALL_BRANCHESHeiko Carstens1-1/+5
With PROFILE_ALL_BRANCHES enabled gcc sometimes fails to handle __builtin_constant_p() correctly: In function 'arch_test_bit', inlined from 'node_state' at include/linux/nodemask.h:423:9, inlined from 'warn_if_node_offline' at include/linux/gfp.h:252:2, inlined from '__alloc_pages_node_noprof' at include/linux/gfp.h:267:2, inlined from 'alloc_pages_node_noprof' at include/linux/gfp.h:296:9, inlined from 'vm_area_alloc_pages.constprop' at mm/vmalloc.c:3591:11: >> arch/s390/include/asm/bitops.h:60:17: warning: 'asm' operand 2 probably does not match constraints 60 | asm volatile( | ^~~ >> arch/s390/include/asm/bitops.h:60:17: error: impossible constraint in 'asm' Therefore disable the optimization for this case. This is similar to commit 63678eecec57 ("s390/preempt: disable __preempt_count_add() optimization for PROFILE_ALL_BRANCHES") Fixes: b2bc1b1a77c0 ("s390/bitops: Provide optimized arch_test_bit()") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202502091912.xL2xTCGw-lkp@intel.com/ Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-02-11s390/configs: Remove CONFIG_LSMIlya Leoshkevich3-3/+0
s390 defconfig does not have BPF LSM, resulting in systemd[1]: bpf-restrict-fs: BPF LSM hook not enabled in the kernel, BPF LSM not supported. with the respective kernels. The other architectures do not explicitly set it, and the default values have BPF in them, so just drop it. Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-02-11x86/cpu/kvm: SRSO: Fix possible missing IBPB on VM-ExitPatrick Bellasi2-8/+16
In [1] the meaning of the synthetic IBPB flags has been redefined for a better separation of concerns: - ENTRY_IBPB -- issue IBPB on entry only - IBPB_ON_VMEXIT -- issue IBPB on VM-Exit only and the Retbleed mitigations have been updated to match this new semantics. Commit [2] was merged shortly before [1], and their interaction was not handled properly. This resulted in IBPB not being triggered on VM-Exit in all SRSO mitigation configs requesting an IBPB there. Specifically, an IBPB on VM-Exit is triggered only when X86_FEATURE_IBPB_ON_VMEXIT is set. However: - X86_FEATURE_IBPB_ON_VMEXIT is not set for "spec_rstack_overflow=ibpb", because before [1] having X86_FEATURE_ENTRY_IBPB was enough. Hence, an IBPB is triggered on entry but the expected IBPB on VM-exit is not. - X86_FEATURE_IBPB_ON_VMEXIT is not set also when "spec_rstack_overflow=ibpb-vmexit" if X86_FEATURE_ENTRY_IBPB is already set. That's because before [1] this was effectively redundant. Hence, e.g. a "retbleed=ibpb spec_rstack_overflow=bpb-vmexit" config mistakenly reports the machine still vulnerable to SRSO, despite an IBPB being triggered both on entry and VM-Exit, because of the Retbleed selected mitigation config. - UNTRAIN_RET_VM won't still actually do anything unless CONFIG_MITIGATION_IBPB_ENTRY is set. For "spec_rstack_overflow=ibpb", enable IBPB on both entry and VM-Exit and clear X86_FEATURE_RSB_VMEXIT which is made superfluous by X86_FEATURE_IBPB_ON_VMEXIT. This effectively makes this mitigation option similar to the one for 'retbleed=ibpb', thus re-order the code for the RETBLEED_MITIGATION_IBPB option to be less confusing by having all features enabling before the disabling of the not needed ones. For "spec_rstack_overflow=ibpb-vmexit", guard this mitigation setting with CONFIG_MITIGATION_IBPB_ENTRY to ensure UNTRAIN_RET_VM sequence is effectively compiled in. Drop instead the CONFIG_MITIGATION_SRSO guard, since none of the SRSO compile cruft is required in this configuration. Also, check only that the required microcode is present to effectively enabled the IBPB on VM-Exit. Finally, update the KConfig description for CONFIG_MITIGATION_IBPB_ENTRY to list also all SRSO config settings enabled by this guard. Fixes: 864bcaa38ee4 ("x86/cpu/kvm: Provide UNTRAIN_RET_VM") [1] Fixes: d893832d0e1e ("x86/srso: Add IBPB on VMEXIT") [2] Reported-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Patrick Bellasi <derkling@google.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-02-10KVM: arm64: Fix __pkvm_host_mkyoung_guest() return valueMarc Zyngier1-2/+1
Don't use an uninitialised stack variable, and just return 0 on the non-error path. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202502100911.8c9DbtKD-lkp@intel.com/ Reviewed-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-10powerpc/64s: Rewrite __real_pte() and __rpte_to_hidx() as static inlineChristophe Leroy1-2/+10
Rewrite __real_pte() and __rpte_to_hidx() as static inline in order to avoid following warnings/errors when building with 4k page size: CC arch/powerpc/mm/book3s64/hash_tlb.o arch/powerpc/mm/book3s64/hash_tlb.c: In function 'hpte_need_flush': arch/powerpc/mm/book3s64/hash_tlb.c:49:16: error: variable 'offset' set but not used [-Werror=unused-but-set-variable] 49 | int i, offset; | ^~~~~~ CC arch/powerpc/mm/book3s64/hash_native.o arch/powerpc/mm/book3s64/hash_native.c: In function 'native_flush_hash_range': arch/powerpc/mm/book3s64/hash_native.c:782:29: error: variable 'index' set but not used [-Werror=unused-but-set-variable] 782 | unsigned long hash, index, hidx, shift, slot; | ^~~~~ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501081741.AYFwybsq-lkp@intel.com/ Fixes: ff31e105464d ("powerpc/mm/hash64: Store the slot information at the right offset for hugetlb") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/e0d340a5b7bd478ecbf245d826e6ab2778b74e06.1736706263.git.christophe.leroy@csgroup.eu
2025-02-10powerpc/code-patching: Disable KASAN report during patching via temporary mmChristophe Leroy1-0/+2
Erhard reports the following KASAN hit on Talos II (power9) with kernel 6.13: [ 12.028126] ================================================================== [ 12.028198] BUG: KASAN: user-memory-access in copy_to_kernel_nofault+0x8c/0x1a0 [ 12.028260] Write of size 8 at addr 0000187e458f2000 by task systemd/1 [ 12.028346] CPU: 87 UID: 0 PID: 1 Comm: systemd Tainted: G T 6.13.0-P9-dirty #3 [ 12.028408] Tainted: [T]=RANDSTRUCT [ 12.028446] Hardware name: T2P9D01 REV 1.01 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV [ 12.028500] Call Trace: [ 12.028536] [c000000008dbf3b0] [c000000001656a48] dump_stack_lvl+0xbc/0x110 (unreliable) [ 12.028609] [c000000008dbf3f0] [c0000000006e2fc8] print_report+0x6b0/0x708 [ 12.028666] [c000000008dbf4e0] [c0000000006e2454] kasan_report+0x164/0x300 [ 12.028725] [c000000008dbf600] [c0000000006e54d4] kasan_check_range+0x314/0x370 [ 12.028784] [c000000008dbf640] [c0000000006e6310] __kasan_check_write+0x20/0x40 [ 12.028842] [c000000008dbf660] [c000000000578e8c] copy_to_kernel_nofault+0x8c/0x1a0 [ 12.028902] [c000000008dbf6a0] [c0000000000acfe4] __patch_instructions+0x194/0x210 [ 12.028965] [c000000008dbf6e0] [c0000000000ade80] patch_instructions+0x150/0x590 [ 12.029026] [c000000008dbf7c0] [c0000000001159bc] bpf_arch_text_copy+0x6c/0xe0 [ 12.029085] [c000000008dbf800] [c000000000424250] bpf_jit_binary_pack_finalize+0x40/0xc0 [ 12.029147] [c000000008dbf830] [c000000000115dec] bpf_int_jit_compile+0x3bc/0x930 [ 12.029206] [c000000008dbf990] [c000000000423720] bpf_prog_select_runtime+0x1f0/0x280 [ 12.029266] [c000000008dbfa00] [c000000000434b18] bpf_prog_load+0xbb8/0x1370 [ 12.029324] [c000000008dbfb70] [c000000000436ebc] __sys_bpf+0x5ac/0x2e00 [ 12.029379] [c000000008dbfd00] [c00000000043a228] sys_bpf+0x28/0x40 [ 12.029435] [c000000008dbfd20] [c000000000038eb4] system_call_exception+0x334/0x610 [ 12.029497] [c000000008dbfe50] [c00000000000c270] system_call_vectored_common+0xf0/0x280 [ 12.029561] --- interrupt: 3000 at 0x3fff82f5cfa8 [ 12.029608] NIP: 00003fff82f5cfa8 LR: 00003fff82f5cfa8 CTR: 0000000000000000 [ 12.029660] REGS: c000000008dbfe80 TRAP: 3000 Tainted: G T (6.13.0-P9-dirty) [ 12.029735] MSR: 900000000280f032 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI> CR: 42004848 XER: 00000000 [ 12.029855] IRQMASK: 0 GPR00: 0000000000000169 00003fffdcf789a0 00003fff83067100 0000000000000005 GPR04: 00003fffdcf78a98 0000000000000090 0000000000000000 0000000000000008 GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR12: 0000000000000000 00003fff836ff7e0 c000000000010678 0000000000000000 GPR16: 0000000000000000 0000000000000000 00003fffdcf78f28 00003fffdcf78f90 GPR20: 0000000000000000 0000000000000000 0000000000000000 00003fffdcf78f80 GPR24: 00003fffdcf78f70 00003fffdcf78d10 00003fff835c7239 00003fffdcf78bd8 GPR28: 00003fffdcf78a98 0000000000000000 0000000000000000 000000011f547580 [ 12.030316] NIP [00003fff82f5cfa8] 0x3fff82f5cfa8 [ 12.030361] LR [00003fff82f5cfa8] 0x3fff82f5cfa8 [ 12.030405] --- interrupt: 3000 [ 12.030444] ================================================================== Commit c28c15b6d28a ("powerpc/code-patching: Use temporary mm for Radix MMU") is inspired from x86 but unlike x86 is doesn't disable KASAN reports during patching. This wasn't a problem at the begining because __patch_mem() is not instrumented. Commit 465cabc97b42 ("powerpc/code-patching: introduce patch_instructions()") use copy_to_kernel_nofault() to copy several instructions at once. But when using temporary mm the destination is not regular kernel memory but a kind of kernel-like memory located in user address space. Because it is not in kernel address space it is not covered by KASAN shadow memory. Since commit e4137f08816b ("mm, kasan, kmsan: instrument copy_from/to_kernel_nofault") KASAN reports bad accesses from copy_to_kernel_nofault(). Here a bad access to user memory is reported because KASAN detects the lack of shadow memory and the address is below TASK_SIZE. Do like x86 in commit b3fd8e83ada0 ("x86/alternatives: Use temporary mm for text poking") and disable KASAN reports during patching when using temporary mm. Reported-by: Erhard Furtner <erhard_f@mailbox.org> Close: https://lore.kernel.org/all/20250201151435.48400261@yea/ Fixes: 465cabc97b42 ("powerpc/code-patching: introduce patch_instructions()") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/1c05b2a1b02ad75b981cfc45927e0b4a90441046.1738577687.git.christophe.leroy@csgroup.eu
2025-02-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds26-967/+1069
Pull kvm fixes from Paolo Bonzini: "ARM: - Correctly clean the BSS to the PoC before allowing EL2 to access it on nVHE/hVHE/protected configurations - Propagate ownership of debug registers in protected mode after the rework that landed in 6.14-rc1 - Stop pretending that we can run the protected mode without a GICv3 being present on the host - Fix a use-after-free situation that can occur if a vcpu fails to initialise the NV shadow S2 MMU contexts - Always evaluate the need to arm a background timer for fully emulated guest timers - Fix the emulation of EL1 timers in the absence of FEAT_ECV - Correctly handle the EL2 virtual timer, specially when HCR_EL2.E2H==0 s390: - move some of the guest page table (gmap) logic into KVM itself, inching towards the final goal of completely removing gmap from the non-kvm memory management code. As an initial set of cleanups, move some code from mm/gmap into kvm and start using __kvm_faultin_pfn() to fault-in pages as needed; but especially stop abusing page->index and page->lru to aid in the pgdesc conversion. x86: - Add missing check in the fix to defer starting the huge page recovery vhost_task - SRSO_USER_KERNEL_NO does not need SYNTHESIZED_F" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (31 commits) KVM: x86/mmu: Ensure NX huge page recovery thread is alive before waking KVM: remove kvm_arch_post_init_vm KVM: selftests: Fix spelling mistake "initally" -> "initially" kvm: x86: SRSO_USER_KERNEL_NO is not synthesized KVM: arm64: timer: Don't adjust the EL2 virtual timer offset KVM: arm64: timer: Correctly handle EL1 timer emulation when !FEAT_ECV KVM: arm64: timer: Always evaluate the need for a soft timer KVM: arm64: Fix nested S2 MMU structures reallocation KVM: arm64: Fail protected mode init if no vgic hardware is present KVM: arm64: Flush/sync debug state in protected mode KVM: s390: selftests: Streamline uc_skey test to issue iske after sske KVM: s390: remove the last user of page->index KVM: s390: move PGSTE softbits KVM: s390: remove useless page->index usage KVM: s390: move gmap_shadow_pgt_lookup() into kvm KVM: s390: stop using lists to keep track of used dat tables KVM: s390: stop using page->index for non-shadow gmaps KVM: s390: move some gmap shadowing functions away from mm/gmap.c KVM: s390: get rid of gmap_translate() KVM: s390: get rid of gmap_fault() ...
2025-02-09KVM: arm64: Simplify np-guest hypercallsQuentin Perret1-31/+38
When the handling of a guest stage-2 permission fault races with an MMU notifier, the faulting page might be gone from the guest's stage-2 by the point we attempt to call (p)kvm_pgtable_stage2_relax_perms(). In the normal KVM case, this leads to returning -EAGAIN which user_mem_abort() handles correctly by simply re-entering the guest. However, the pKVM hypercall implementation has additional logic to check the page state using __check_host_shared_guest() which gets confused with absence of a page mapped at the requested IPA and returns -ENOENT, hence breaking user_mem_abort() and hilarity ensues. Luckily, several of the hypercalls for managing the stage-2 page-table of NP guests have no effect on the pKVM ownership tracking (wrprotect, test_clear_young, mkyoung, and crucially relax_perms), so the extra state checking logic is in fact not strictly necessary. So, to fix the discrepancy between standard KVM and pKVM, let's just drop the superfluous __check_host_shared_guest() logic from those hypercalls and make the extra state checking a debug assertion dependent on CONFIG_NVHE_EL2_DEBUG as we already do for other transitions. Signed-off-by: Quentin Perret <qperret@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250207145438.1333475-3-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-09KVM: arm64: Improve error handling from check_host_shared_guest()Quentin Perret1-2/+2
The check_host_shared_guest() path expects to find a last-level valid PTE in the guest's stage-2 page-table. However, it checks the PTE's level before its validity, which makes it hard for callers to figure out what went wrong. To make error handling simpler, check the PTE's validity first. Signed-off-by: Quentin Perret <qperret@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250207145438.1333475-2-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-09Merge tag 'execve-v6.14-rc2' of ↵Linus Torvalds4-21/+6
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fix from Kees Cook: "This is an alpha-specific fix, but since it touched ELF I was asked to carry it. - alpha/elf: Fix misc/setarch test of util-linux by removing 32bit support (Eric W. Biederman)" * tag 'execve-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: alpha/elf: Fix misc/setarch test of util-linux by removing 32bit support
2025-02-08Merge tag 'x86-urgent-2025-02-08' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Ingo Molnar: "Fix a build regression on GCC 15 builds, caused by GCC changing the default C version that is overriden in the main Makefile but not in the x86 boot code Makefile" * tag 'x86-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Use '-std=gnu11' to fix build with GCC 15
2025-02-08perf/x86/intel: Ensure LBRs are disabled when a CPU is startingSean Christopherson2-2/+6
Explicitly clear DEBUGCTL.LBR when a CPU is starting, prior to purging the LBR MSRs themselves, as at least one system has been found to transfer control to the kernel with LBRs enabled (it's unclear whether it's a BIOS flaw or a CPU goof). Because the kernel preserves the original DEBUGCTL, even when toggling LBRs, leaving DEBUGCTL.LBR as is results in running with LBRs enabled at all times. Closes: https://lore.kernel.org/all/c9d8269bff69f6359731d758e3b1135dedd7cc61.camel@redhat.com Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20250131010721.470503-1-seanjc@google.com
2025-02-08perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAFKan Liang2-11/+35
The EAX of the CPUID Leaf 023H enumerates the mask of valid sub-leaves. To tell the availability of the sub-leaf 1 (enumerate the counter mask), perf should check the bit 1 (0x2) of EAS, rather than bit 0 (0x1). The error is not user-visible on bare metal. Because the sub-leaf 0 and the sub-leaf 1 are always available. However, it may bring issues in a virtualization environment when a VMM only enumerates the sub-leaf 0. Introduce the cpuid35_e?x to replace the macros, which makes the implementation style consistent. Fixes: eb467aaac21e ("perf/x86/intel: Support Architectural PerfMon Extension leaf") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20250129154820.3755948-3-kan.liang@linux.intel.com
2025-02-08perf/x86/intel: Clean up PEBS-via-PT on hybridKan Liang2-11/+9
The PEBS-via-PT feature is exposed for the e-core of some hybrid platforms, e.g., ADL and MTL. But it never works. $ dmesg | grep PEBS [ 1.793888] core: cpu_atom PMU driver: PEBS-via-PT $ perf record -c 1000 -e '{intel_pt/branch=0/, cpu_atom/cpu-cycles,aux-output/pp}' -C8 Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu_atom/cpu-cycles,aux-output/pp). "dmesg | grep -i perf" may provide additional information. The "PEBS-via-PT" is printed if the corresponding bit of per-PMU capabilities is set. Since the feature is supported by the e-core HW, perf sets the bit for e-core. However, for Intel PT, if a feature is not supported on all CPUs, it is not supported at all. The PEBS-via-PT event cannot be created successfully. The PEBS-via-PT is no longer enumerated on the latest hybrid platform. It will be deprecated on future platforms with Arch PEBS. Let's remove it from the existing hybrid platforms. Fixes: d9977c43bff8 ("perf/x86: Register hybrid PMUs") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250129154820.3755948-2-kan.liang@linux.intel.com
2025-02-08perf/x86/rapl: Fix the error checking orderDhananjay Ugwekar1-8/+4
After the commit b4943b8bfc41 ("perf/x86/rapl: Add core energy counter support for AMD CPUs"), the default "perf record"/"perf top" command is broken in systems where there isn't a PMU registered for type PERF_TYPE_RAW. This is due to the change in order of error checks in rapl_pmu_event_init() Due to which we return -EINVAL instead of -ENOENT, when we reach here from the fallback loop in perf_init_event(). Move the "PMU and event type match" back to the beginning of the function so that we return -ENOENT early on. Closes: https://lore.kernel.org/all/uv7mz6vew2bzgre5jdpmwldxljp5djzmuiksqdcdwipfm4zm7w@ribobcretidk/ Fixes: b4943b8bfc41 ("perf/x86/rapl: Add core energy counter support for AMD CPUs") Reported-by: Koichiro Den <koichiro.den@canonical.com> Signed-off-by: Dhananjay Ugwekar <dhananjay.ugwekar@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250129080513.30353-1-dhananjay.ugwekar@amd.com
2025-02-07arm64: cacheinfo: Avoid out-of-bounds write to cacheinfo arrayRadu Rendec1-5/+7
The loop that detects/populates cache information already has a bounds check on the array size but does not account for cache levels with separate data/instructions cache. Fix this by incrementing the index for any populated leaf (instead of any populated level). Fixes: 5d425c186537 ("arm64: kernel: add support for cpu cache information") Signed-off-by: Radu Rendec <rrendec@redhat.com> Link: https://lore.kernel.org/r/20250206174420.2178724-1-rrendec@redhat.com Signed-off-by: Will Deacon <will@kernel.org>
2025-02-07arm64: Handle .ARM.attributes section in linker scriptsNathan Chancellor2-0/+2
A recent LLVM commit [1] started generating an .ARM.attributes section similar to the one that exists for 32-bit, which results in orphan section warnings (or errors if CONFIG_WERROR is enabled) from the linker because it is not handled in the arm64 linker scripts. ld.lld: error: arch/arm64/kernel/vdso/vgettimeofday.o:(.ARM.attributes) is being placed in '.ARM.attributes' ld.lld: error: arch/arm64/kernel/vdso/vgetrandom.o:(.ARM.attributes) is being placed in '.ARM.attributes' ld.lld: error: vmlinux.a(lib/vsprintf.o):(.ARM.attributes) is being placed in '.ARM.attributes' ld.lld: error: vmlinux.a(lib/win_minmax.o):(.ARM.attributes) is being placed in '.ARM.attributes' ld.lld: error: vmlinux.a(lib/xarray.o):(.ARM.attributes) is being placed in '.ARM.attributes' Discard the new sections in the necessary linker scripts to resolve the warnings, as the kernel and vDSO do not need to retain it, similar to the .note.gnu.property section. Cc: stable@vger.kernel.org Fixes: b3e5d80d0c48 ("arm64/build: Warn on orphan section placement") Link: https://github.com/llvm/llvm-project/commit/ee99c4d4845db66c4daa2373352133f4b237c942 [1] Signed-off-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20250206-arm64-handle-arm-attributes-in-linker-script-v3-1-d53d169913eb@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2025-02-07genirq: Remove leading space from irq_chip::irq_print_chip() callbacksGeert Uytterhoeven1-1/+1
The space separator was factored out from the multiple chip name prints, but several irq_chip::irq_print_chip() callbacks still print a leading space. Remove the superfluous double spaces. Fixes: 9d9f204bdf7243bf ("genirq/proc: Add missing space separator back") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/893f7e9646d8933cd6786d5a1ef3eb076d263768.1738764803.git.geert+renesas@glider.be
2025-02-06Merge tag 'for-linus-6.14-rc2-tag' of ↵Linus Torvalds1-8/+3
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "Three fixes for xen_hypercall_hvm() that was introduced in the 6.13 cycle" * tag 'for-linus-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: remove unneeded dummy push from xen_hypercall_hvm() x86/xen: add FRAME_END to xen_hypercall_hvm() x86/xen: fix xen_hypercall_hvm() to not clobber %rbx
2025-02-06alpha/elf: Fix misc/setarch test of util-linux by removing 32bit supportEric W. Biederman4-21/+6
Richard Henderson <richard.henderson@linaro.org> writes[1]: > There was a Spec benchmark (I forget which) which was memory bound and ran > twice as fast with 32-bit pointers. > > I copied the idea from DEC to the ELF abi, but never did all the other work > to allow the toolchain to take advantage. > > Amusingly, a later Spec changed the benchmark data sets to not fit into a > 32-bit address space, specifically because of this. > > I expect one could delete the ELF bit and personality and no one would > notice. Not even the 10 remaining Alpha users. In [2] it was pointed out that parts of setarch weren't working properly on alpha because it has it's own SET_PERSONALITY implementation. In the discussion that followed Richard Henderson pointed out that the 32bit pointer support for alpha was never completed. Fix this by removing alpha's 32bit pointer support. As a bit of paranoia refuse to execute any alpha binaries that have the EF_ALPHA_32BIT flag set. Just in case someone somewhere has binaries that try to use alpha's 32bit pointer support. Link: https://lkml.kernel.org/r/CAFXwXrkgu=4Qn-v1PjnOR4SG0oUb9LSa0g6QXpBq4ttm52pJOQ@mail.gmail.com [1] Link: https://lkml.kernel.org/r/20250103140148.370368-1-glaubitz@physik.fu-berlin.de [2] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Link: https://lore.kernel.org/r/87y0zfs26i.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: Kees Cook <kees@kernel.org>
2025-02-06arm64: defconfig: Enable TISCI Interrupt Router and AggregatorVaishnav Achath1-0/+2
Enable TISCI Interrupt Router and Interrupt Aggregator drivers. These IPs are found in all TI K3 SoCs like J721E, AM62X and is required for core functionality like DMA, GPIO Interrupts which is necessary during boot, thus make them built-in. bloat-o-meter summary on vmlinux: add/remove: 460/1 grow/shrink: 4/0 up/down: 162483/-8 (162475) ... Total: Before=31615984, After=31778459, chg +0.51% These configs were previously selected for ARCH_K3 in respective Kconfigs till commit b8b26ae398c4 ("irqchip/ti-sci-inta : Add module build support") and commit 2d95ffaecbc2 ("irqchip/ti-sci-intr: Add module build support") dropped them and few driver configs (TI_K3_UDMA, TI_K3_RINGACC) dependent on these also got disabled due to this. While re-enabling the TI_SCI_INT_*_IRQCHIP configs, these configs with missing dependencies (which are already part of arm64 defconfig) also get re-enabled which explains the slightly larger size increase from the bloat-o-meter summary. Fixes: 2d95ffaecbc2 ("irqchip/ti-sci-intr: Add module build support") Fixes: b8b26ae398c4 ("irqchip/ti-sci-inta : Add module build support") Signed-off-by: Vaishnav Achath <vaishnav.a@ti.com> Tested-by: Dhruva Gole <d-gole@ti.com> Reviewed-by: Dhruva Gole <d-gole@ti.com> Link: https://lore.kernel.org/r/20250205062229.3869081-1-vaishnav.a@ti.com Signed-off-by: Nishanth Menon <nm@ti.com>
2025-02-05x86/xen: remove unneeded dummy push from xen_hypercall_hvm()Juergen Gross1-6/+0
Stack alignment of the kernel in 64-bit mode is 8, not 16, so the dummy push in xen_hypercall_hvm() for aligning the stack to 16 bytes can be removed. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-02-05x86/xen: add FRAME_END to xen_hypercall_hvm()Juergen Gross1-0/+1
xen_hypercall_hvm() is missing a FRAME_END at the end, add it. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202502030848.HTNTTuo9-lkp@intel.com/ Fixes: b4845bb63838 ("x86/xen: add central hypercall functions") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-02-05x86/xen: fix xen_hypercall_hvm() to not clobber %rbxJuergen Gross1-2/+2
xen_hypercall_hvm(), which is used when running as a Xen PVH guest at most only once during early boot, is clobbering %rbx. Depending on whether the caller relies on %rbx to be preserved across the call or not, this clobbering might result in an early crash of the system. This can be avoided by using an already saved register instead of %rbx. Fixes: b4845bb63838 ("x86/xen: add central hypercall functions") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-02-04KVM: x86/mmu: Ensure NX huge page recovery thread is alive before wakingSean Christopherson1-7/+26
When waking a VM's NX huge page recovery thread, ensure the thread is actually alive before trying to wake it. Now that the thread is spawned on-demand during KVM_RUN, a VM without a recovery thread is reachable via the related module params. BUG: kernel NULL pointer dereference, address: 0000000000000040 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:vhost_task_wake+0x5/0x10 Call Trace: <TASK> set_nx_huge_pages+0xcc/0x1e0 [kvm] param_attr_store+0x8a/0xd0 module_attr_store+0x1a/0x30 kernfs_fop_write_iter+0x12f/0x1e0 vfs_write+0x233/0x3e0 ksys_write+0x60/0xd0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f3b52710104 </TASK> Modules linked in: kvm_intel kvm CR2: 0000000000000040 Fixes: 931656b9e2ff ("kvm: defer huge page recovery vhost task to later") Cc: stable@vger.kernel.org Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250124234623.3609069-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-04KVM: remove kvm_arch_post_init_vmPaolo Bonzini1-6/+1
The only statement in a kvm_arch_post_init_vm implementation can be moved into the x86 kvm_arch_init_vm. Do so and remove all traces from architecture-independent code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-04Merge tag 'kvmarm-fixes-6.14-1' of ↵Paolo Bonzini5-45/+73
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.14, take #1 - Correctly clean the BSS to the PoC before allowing EL2 to access it on nVHE/hVHE/protected configurations - Propagate ownership of debug registers in protected mode after the rework that landed in 6.14-rc1 - Stop pretending that we can run the protected mode without a GICv3 being present on the host - Fix a use-after-free situation that can occur if a vcpu fails to initialise the NV shadow S2 MMU contexts - Always evaluate the need to arm a background timer for fully emulated guest timers - Fix the emulation of EL1 timers in the absence of FEAT_ECV - Correctly handle the EL2 virtual timer, specially when HCR_EL2.E2H==0
2025-02-04Merge tag 'kvm-s390-next-6.14-2' of ↵Paolo Bonzini18-908/+968
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD - some selftest fixes - move some kvm-related functions from mm into kvm - remove all usage of page->index and page->lru from kvm - fixes and cleanups for vsie
2025-02-04kvm: x86: SRSO_USER_KERNEL_NO is not synthesizedPaolo Bonzini1-1/+1
SYNTHESIZED_F() generally is used together with setup_force_cpu_cap(), i.e. when it makes sense to present the feature even if cpuid does not have it *and* the VM is not able to see the difference. For example, it can be used when mitigations on the host automatically protect the guest as well. The "SYNTHESIZED_F(SRSO_USER_KERNEL_NO)" line came in as a conflict resolution between the CPUID overhaul from the KVM tree and support for the feature in the x86 tree. Using it right now does not hurt, or make a difference for that matter, because there is no setup_force_cpu_cap(X86_FEATURE_SRSO_USER_KERNEL_NO). However, it is a little less future proof in case such a setup_force_cpu_cap() appears later, for a case where the kernel somehow is not vulnerable but the guest would have to apply the mitigation. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-04KVM: arm64: timer: Don't adjust the EL2 virtual timer offsetMarc Zyngier2-18/+13
The way we deal with the EL2 virtual timer is a bit odd. We try to cope with E2H being flipped, and adjust which offset applies to that timer depending on the current E2H value. But that's a complexity we shouldn't have to worry about. What we have to deal with is either E2H being RES1, in which case there is no offset, or E2H being RES0, and the virtual timer simply does not exist. Drop the adjusting of the timer offset, which makes things a bit simpler. At the same time, make sure that accessing the HV timer when E2H is RES0 results in an UNDEF in the guest. Suggested-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250204110050.150560-4-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-04KVM: arm64: timer: Correctly handle EL1 timer emulation when !FEAT_ECVMarc Zyngier1-20/+10
Both Wei-Lin Chang and Volodymyr Babchuk report that the way we handle the emulation of EL1 timers with NV is completely wrong, specially in the case of HCR_EL2.E2H==0. There are three problems in about as many lines of code: - With E2H==0, the EL1 timers are overwritten with the EL1 state, while they should actually contain the EL2 state (as per the timer map) - With E2H==1, we run the full EL1 timer emulation even when ECV is present, hiding a bug in timer_emulate() (see previous patch) - The comments are actively misleading, and say all the wrong things. This is only attributable to the code having been initially written for FEAT_NV, hacked up to handle FEAT_NV2 *in parallel*, and vaguely hacked again to be FEAT_NV2 only. Oh, and yours truly being a gold plated idiot. The fix is obvious: just delete most of the E2H==0 code, have a unified handling of the timers (because they really are E2H agnostic), and make sure we don't execute any of that when FEAT_ECV is present. Fixes: 4bad3068cfa9f ("KVM: arm64: nv: Sync nested timer state with FEAT_NV2") Reported-by: Wei-Lin Chang <r09922117@csie.ntu.edu.tw> Reported-by: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com> Link: https://lore.kernel.org/r/fqiqfjzwpgbzdtouu2pwqlu7llhnf5lmy4hzv5vo6ph4v3vyls@jdcfy3fjjc5k Link: https://lore.kernel.org/r/87frl51tse.fsf@epam.com Tested-by: Dmytro Terletskyi <dmytro_terletskyi@epam.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250204110050.150560-3-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-04KVM: arm64: timer: Always evaluate the need for a soft timerMarc Zyngier1-3/+1
When updating the interrupt state for an emulated timer, we return early and skip the setup of a soft timer that runs in parallel with the guest. While this is OK if we have set the interrupt pending, it is pretty wrong if the guest moved CVAL into the future. In that case, no timer is armed and the guest can wait for a very long time (it will take a full put/load cycle for the situation to resolve). This is specially visible with EDK2 running at EL2, but still using the EL1 virtual timer, which in that case is fully emulated. Any key-press takes ages to be captured, as there is no UART interrupt and EDK2 relies on polling from a timer... The fix is simply to drop the early return. If the timer interrupt is pending, we will still return early, and otherwise arm the soft timer. Fixes: 4d74ecfa6458b ("KVM: arm64: Don't arm a hrtimer for an already pending timer") Cc: stable@vger.kernel.org Tested-by: Dmytro Terletskyi <dmytro_terletskyi@epam.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250204110050.150560-2-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-04KVM: arm64: Fix nested S2 MMU structures reallocationMarc Zyngier1-4/+5
For each vcpu that userspace creates, we allocate a number of s2_mmu structures that will eventually contain our shadow S2 page tables. Since this is a dynamically allocated array, we reallocate the array and initialise the newly allocated elements. Once everything is correctly initialised, we adjust pointer and size in the kvm structure, and move on. But should that initialisation fail *and* the reallocation triggered a copy to another location, we end-up returning early, with the kvm structure still containing the (now stale) old pointer. Weeee! Cure it by assigning the pointer early, and use this to perform the initialisation. If everything succeeds, we adjust the size. Otherwise, we just leave the size as it was, no harm done, and the new memory is as good as the ol' one (we hope...). Fixes: 4f128f8e1aaac ("KVM: arm64: nv: Support multiple nested Stage-2 mmu structures") Reported-by: Alexander Potapenko <glider@google.com> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/r/20250204145554.774427-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-04arm64/hwcap: Remove stray references to SF8MMxMark Brown1-2/+0
Due to SME currently being disabled when removing the SF8MMx support it wasn't noticed that there were some stray references in the hwcap table, delete them. Fixes: 819935464cb2 ("arm64/hwcap: Describe 2024 dpISA extensions to userspace") Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250203-arm64-remove-sf8mmx-v1-1-6f1da3dbff82@kernel.org Signed-off-by: Will Deacon <will@kernel.org>