| Age | Commit message (Collapse) | Author | Files | Lines |
|
The arm64 xchg/cmpxchg() wrappers cast the arguments to (unsigned long)
prior to invoking the static inline functions implementing the
operation. Some restrictive type annotations (e.g. __bitwise) lead to
sparse warnings like below:
sparse warnings: (new ones prefixed by >>)
fs/crypto/bio.c:67:17: sparse: sparse: cast from restricted blk_status_t
>> fs/crypto/bio.c:67:17: sparse: sparse: cast to restricted blk_status_t
Force the casting in the arm64 xchg/cmpxchg() wrappers to silence
sparse.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602230947.uNRsPyBn-lkp@intel.com/
Link: https://lore.kernel.org/r/202602230947.uNRsPyBn-lkp@intel.com/
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
CONFIG_EFI_SBAT_FILE can be a relative path. When compiling using a different
output directory (O=) the build currently fails because it can't find the
filename set in CONFIG_EFI_SBAT_FILE:
arch/x86/boot/compressed/sbat.S: Assembler messages:
arch/x86/boot/compressed/sbat.S:6: Error: file not found: kernel.sbat
Add $(srctree) as include dir for sbat.o.
[ bp: Massage commit message. ]
Fixes: 61b57d35396a ("x86/efi: Implement support for embedding SBAT data for x86")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: <stable@kernel.org>
Link: https://patch.msgid.link/f4eda155b0cef91d4d316b4e92f5771cb0aa7187.1772047658.git.jstancek@redhat.com
|
|
With crash hotplug support enabled, additional memory is allocated to
the elfcorehdr kexec segment to accommodate resources added during
memory hotplug events. However, the kdump FDT is not updated with the
same size, which can result in elfcorehdr corruption in the kdump
kernel.
Update elf_headers_sz (the kimage member representing the size of the
elfcorehdr kexec segment) to reflect the total memory allocated for the
elfcorehdr segment instead of the elfcorehdr buffer size at the time of
kdump load. This allows of_kexec_alloc_and_setup_fdt() to reserve the
full elfcorehdr memory in the kdump FDT and prevents elfcorehdr
corruption.
Fixes: 849599b702ef8 ("powerpc/crash: add crash memory hotplug support")
Reviewed-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260227171801.2238847-1-sourabhjain@linux.ibm.com
|
|
Use explicit word-sized big-endian types for kexec and crash related
variables. This makes the endianness unambiguous and avoids type
mismatches that trigger sparse warnings.
The change addresses sparse warnings like below (seen on both 32-bit
and 64-bit builds):
CHECK ../arch/powerpc/kexec/core.c
sparse: expected unsigned int static [addressable] [toplevel] [usertype] crashk_base
sparse: got restricted __be32 [usertype]
sparse: warning: incorrect type in assignment (different base types)
sparse: expected unsigned int static [addressable] [toplevel] [usertype] crashk_size
sparse: got restricted __be32 [usertype]
sparse: warning: incorrect type in assignment (different base types)
sparse: expected unsigned long long static [addressable] [toplevel] mem_limit
sparse: got restricted __be32 [usertype]
sparse: warning: incorrect type in assignment (different base types)
sparse: expected unsigned int static [addressable] [toplevel] [usertype] kernel_end
sparse: got restricted __be32 [usertype]
No functional change intended.
Fixes: ea961a828fe7 ("powerpc: Fix endian issues in kexec and crash dump code")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512221405.VHPKPjnp-lkp@intel.com/
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251224151257.28672-1-sourabhjain@linux.ibm.com
|
|
Similar to other PowerMac mac-io devices, the media-bay node is missing the
"#size-cells" property.
Depends-on: commit 045b14ca5c36 ("of: WARN on deprecated #address-cells/#size-cells handling")
Reported-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251029174047.1620073-1-robh@kernel.org
|
|
These files are not included by anything and therefore don't get built or
tested.
There's also no upstream driver for the interlaken-lac stuff.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260128140222.1627203-1-robh@kernel.org
|
|
Test robot reports the following error with clang-16.0.6:
In file included from kernel/rseq.c:75:
include/linux/rseq_entry.h:141:3: error: invalid operand for instruction
unsafe_get_user(offset, &ucs->post_commit_offset, efault);
^
include/linux/uaccess.h:608:2: note: expanded from macro 'unsafe_get_user'
arch_unsafe_get_user(x, ptr, local_label); \
^
arch/powerpc/include/asm/uaccess.h:518:2: note: expanded from macro 'arch_unsafe_get_user'
__get_user_size_goto(__gu_val, __gu_addr, sizeof(*(p)), e); \
^
arch/powerpc/include/asm/uaccess.h:284:2: note: expanded from macro '__get_user_size_goto'
__get_user_size_allowed(x, ptr, size, __gus_retval); \
^
arch/powerpc/include/asm/uaccess.h:275:10: note: expanded from macro '__get_user_size_allowed'
case 8: __get_user_asm2(x, (u64 __user *)ptr, retval); break; \
^
arch/powerpc/include/asm/uaccess.h:258:4: note: expanded from macro '__get_user_asm2'
" li %1+1,0\n" \
^
<inline asm>:7:5: note: instantiated into assembly here
li 31+1,0
^
1 error generated.
On PPC32, for 64 bits vars a pair of registers is used. Usually the
lower register in the pair is the high part and the higher register is
the low part. GCC uses r3/r4 ... r11/r12 ... r14/r15 ... r30/r31
In older kernel code inline assembly was using %1 and %1+1 to represent
64 bits values. However here it looks like clang uses r31 as high part,
allthough r32 doesn't exist hence the error.
Allthoug %1+1 should work, most places now use %L1 instead of %1+1, so
let's do the same here.
With that change, the build doesn't fail anymore and a disassembly shows
clang uses r17/r18 and r31/r14 pair when GCC would have used r16/r17 and
r30/r31:
Disassembly of section .fixup:
00000000 <.fixup>:
0: 38 a0 ff f2 li r5,-14
4: 3a 20 00 00 li r17,0
8: 3a 40 00 00 li r18,0
c: 48 00 00 00 b c <.fixup+0xc>
c: R_PPC_REL24 .text+0xbc
10: 38 a0 ff f2 li r5,-14
14: 3b e0 00 00 li r31,0
18: 39 c0 00 00 li r14,0
1c: 48 00 00 00 b 1c <.fixup+0x1c>
1c: R_PPC_REL24 .text+0x144
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602021825.otcItxGi-lkp@intel.com/
Fixes: c20beffeec3c ("powerpc/uaccess: Use flexible addressing with __put_user()/__get_user()")
Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Acked-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/8ca3a657a650e497a96bfe7acde2f637dadab344.1770103646.git.chleroy@kernel.org
|
|
Today there are two PTE formats for e500:
- The 64 bits format, used
- On 64 bits kernel
- On 32 bits kernel with 64 bits physical addresses
- On 32 bits kernel with support of huge pages
- The 32 bits format, used in other cases
Maintaining two PTE formats means unnecessary maintenance burden
because every change needs to be implemented and tested for both
formats.
Remove the 32 bits PTE format. The memory usage increase due to
larger PTEs is minimal (approx. 0,1% of memory).
This also means that from now on huge pages are supported also
with 32 bits physical addresses.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/04a658209ea78dcc0f3dbde6b2c29cf1939adfe9.1767721208.git.chleroy@kernel.org
|
|
Rename KVM_SUPPORTED_TD_ATTRS to KVM_SUPPORTED_TDX_TD_ATTRS to include
"TDX" in the name, making it clear that it pertains to TDX.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260303030335.766779-5-xiaoyao.li@intel.com
|
|
The macros TDX_ATTR_* and DEF_TDX_ATTR_* are related to TD attributes,
which are TD-scope attributes. Naming them as TDX_ATTR_* can be somewhat
confusing and might mislead people into thinking they are TDX global
things.
Rename TDX_ATTR_* to TDX_TD_ATTR_* to explicitly clarify they are
TD-scope things.
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260303030335.766779-4-xiaoyao.li@intel.com
|
|
There are definitions of TD attributes bits inside asm/shared/tdx.h as
TDX_ATTR_*.
Remove KVM's definitions and use the ones in asm/shared/tdx.h
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260303030335.766779-3-xiaoyao.li@intel.com
|
|
The TD scoped TDCS attributes are defined by bit positions. In the guest
side of the TDX code, the 'tdx_attributes' string array holds pretty
print names for these attributes, which are generated via macros and
defines. Today these pretty print names are only used to print the
attribute names to dmesg.
Unfortunately there is a typo in the define for the migratable bit.
Change the defines TDX_ATTR_MIGRTABLE* to TDX_ATTR_MIGRATABLE*. Update
the sole user, the tdx_attributes array, to use the fixed name.
Since these defines control the string printed to dmesg, the change is
user visible. But the risk of breakage is almost zero since it is not
exposed in any interface expected to be consumed programmatically.
Fixes: 564ea84c8c14 ("x86/tdx: Dump attributes and TD_CTLS on boot")
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260303030335.766779-2-xiaoyao.li@intel.com
|
|
In kvm_check_and_inject_events(), kvm_deliver_exception_payload() is
called for pending #DB exceptions. However, shortly after, the
per-vendor inject_exception callbacks are made. Both
vmx_inject_exception() and svm_inject_exception() unconditionally call
kvm_deliver_exception_payload(), so the call in
kvm_check_and_inject_events() is redundant.
Note that the extra call for pending #DB exceptions is harmless, as
kvm_deliver_exception_payload() clears exception.has_payload after the
first call.
The call in kvm_check_and_inject_events() was added in commit
f10c729ff965 ("kvm: vmx: Defer setting of DR6 until #DB delivery"). At
that point, the call was likely needed because svm_queue_exception()
checked whether an exception for L2 is intercepted by L1 before calling
kvm_deliver_exception_payload(), as SVM did not have a
check_nested_events callback. Since DR6 is updated before the #DB
intercept in SVM (unlike VMX), it was necessary to deliver the DR6
payload before calling svm_queue_exception().
After that, commit 7c86663b68ba ("KVM: nSVM: inject exceptions via
svm_check_nested_events") added a check_nested_events callback for SVM,
which checked for L1 intercepts for L2's exceptions, and delivered the
the payload appropriately before the intercept. At that point,
svm_queue_exception() started calling kvm_deliver_exception_payload()
unconditionally, and the call to kvm_deliver_exception_payload() from
its caller became redundant.
No functional change intended.
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260302154249.784529-1-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Skip the OSVW RDMSRs if the current CPU doesn't enumerate support for the
MSRs. In practice, checking only the boot CPU's capabilities is
sufficient, as the RDMSRs should fault when unsupported, but there's no
downside to being more precise, and checking only the boot CPU _looks_
wrong given the rather odd semantics of the MSRs. E.g. if a CPU doesn't
support OVSW, then KVM must assume all errata are present.
Link: https://patch.msgid.link/20251113231420.1695919-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Elide the OSVW variable updates if the current CPU's set of errata are a
subset of the errata tracked in the global values, i.e. if no update is
needed. There's no danger of under-reporting errata due to bailing early
as KVM is purely reducing the set of "known fixed" errata. I.e. a racing
update on a different CPU with _more_ errata doesn't change anything if
the current CPU has the same or fewer errata relative to the status quo.
If another CPU is writing osvw_len, then "len" is guaranteed to be larger
than the new osvw_len and so the osvw_len update would be skipped anyways.
If another CPU is setting new bits in osvw_status, then "status" is
guaranteed to be a subset of the new osvw_status and the bitwise-OR would
be an effective nop anyways.
Link: https://patch.msgid.link/20251113231420.1695919-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the initialization of the global OSVW variables to a helper function
so that svm_enable_virtualization_cpu() isn't polluted with a pile of what
is effectively legacy code.
No functional change intended.
Link: https://patch.msgid.link/20251113231420.1695919-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't bother reading the OSVW MSRs if osvw_len is already zero, i.e. if
KVM is already treating all errata as present, in which case the positive
path of the if-statement is one giant nop.
Opportunistically update the comment to more thoroughly explain how the
MSRs work and why the code does what it does.
Link: https://patch.msgid.link/20251113231420.1695919-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Guard writes to the global osvw_status and osvw_len variables with a
spinlock to ensure enabling virtualization on multiple CPUs in parallel
doesn't effectively drop any writes due to writing back stale data. Don't
bother taking the lock when the boot CPU doesn't support the feature, as
that check is constant for all CPUs, i.e. racing writes will always write
the same value (zero).
Note, the bug was inadvertently "fixed" by commit 9a798b1337af ("KVM:
Register cpuhp and syscore callbacks when enabling hardware"), which
effectively serialized calls to enable virtualization due to how the cpuhp
framework "brings up" CPU. But KVM shouldn't rely on the mechanics of
cphup to provide serialization.
Link: https://patch.msgid.link/20251113231420.1695919-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When splitting hugepages in the TDP MMU, don't zero the new page table on
allocation since tdp_mmu_split_huge_page() is guaranteed to write every
entry and thus every byte.
Unless someone peeks at the memory between allocating the page table and
writing the child SPTEs, no functional change intended.
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260218210820.2828896-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Function register_restart_handler() is deprecated. Using this new API
removes our need to keep and manage a struct notifier_block.
Signed-off-by: Andrew Davis <afd@ti.com>
Message-ID: <20260302193543.275197-3-afd@ti.com>
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
|
|
Function register_restart_handler() is deprecated. Using this new API
removes our need to keep and manage a struct notifier_block.
Signed-off-by: Andrew Davis <afd@ti.com>
Message-ID: <20260302193543.275197-2-afd@ti.com>
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
|
|
Function register_restart_handler() is deprecated. Using this new API
removes our need to keep and manage a struct notifier_block.
Signed-off-by: Andrew Davis <afd@ti.com>
Message-ID: <20260302193543.275197-1-afd@ti.com>
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
|
|
The initial LASS enabling has been deferred to much later during boot,
and EFI runtime services now run with LASS temporarily disabled. This
removes LASS from the path of all EFI services.
Remove the LASS restriction on EFI config, as the two can now coexist.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Link: https://patch.msgid.link/20260120234730.2215498-4-sohil.mehta@intel.com
|
|
Ideally, EFI runtime services should switch to kernel virtual addresses
after SetVirtualAddressMap(). However, firmware implementations are
known to be buggy in this regard and continue to access physical
addresses. The kernel maintains a 1:1 mapping of all runtime services
code and data regions to avoid breaking such firmware.
LASS enforcement relies on bit 63 of the virtual address, which would
block such accesses to the lower half. Unfortunately, not doing anything
could lead to #GP faults when users update to a kernel with LASS
enabled.
One option is to use a STAC/CLAC pair to temporarily disable LASS data
enforcement. However, there is no guarantee that the stray accesses
would only touch data and not perform instruction fetches. Also, relying
on the AC bit would depend on the runtime calls preserving RFLAGS, which
is highly unlikely in practice.
Instead, use the big hammer and switch off the entire LASS mechanism
temporarily by clearing CR4.LASS. Runtime services are called in the
context of efi_mm, which has explicitly unmapped any memory EFI isn't
allowed to touch (including userspace). So, do this right after
switching to efi_mm to avoid any security impact.
Some runtime services can be invoked during boot when LASS isn't active.
Use a global variable (similar to efi_mm) to save and restore the
correct CR4.LASS state. The runtime calls are serialized with the
efi_runtime_lock, so no concurrency issues are expected.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Link: https://patch.msgid.link/20260120234730.2215498-3-sohil.mehta@intel.com
|
|
LASS blocks any kernel access to the lower half of the virtual address
space. Unfortunately, some EFI accesses happen during boot with bit 63
cleared, which causes a #GP fault when LASS is enabled.
Notably, the SetVirtualAddressMap() call can only happen in EFI physical
mode. Also, EFI_BOOT_SERVICES_CODE/_DATA could be accessed even after
ExitBootServices(). The boot services memory is truly freed during
efi_free_boot_services() after SVAM has completed.
To prevent EFI from tripping LASS, at a minimum, LASS enabling must be
deferred until EFI has completely finished entering virtual mode
(including freeing boot services memory). Moving setup_lass() to
arch_cpu_finalize_init() would do the trick, but that would make the
implementation very fragile. Something else might come in the future
that would need the LASS enabling to be moved again.
In general, security features such as LASS provide limited value before
userspace is up. They aren't necessary during early boot while only
trusted ring0 code is executing. Introduce a generic late initcall to
defer activating some CPU features until userspace is enabled.
For now, only move the LASS CR4 programming to this initcall. As APs are
already up by the time late initcalls run, some extra steps are needed
to enable LASS on all CPUs. Use a CPU hotplug callback instead of
on_each_cpu() or smp_call_function(). This ensures that LASS is enabled
on every CPU that is currently online as well as any future CPUs that
come online later. Note, even though hotplug callbacks run with
preemption enabled, cr4_set_bits() would disable interrupts while
updating CR4.
Keep the existing logic in place to clear the LASS feature bits early.
setup_clear_cpu_cap() must be called before boot_cpu_data is finalized
and alternatives are patched. Eventually, the entire setup_lass() logic
can go away once the restrictions based on vsyscall emulation and EFI
are removed.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Link: https://patch.msgid.link/20260120234730.2215498-2-sohil.mehta@intel.com
|
|
The A64_MOV macro unconditionally uses ADD Rd, Rn, #0 to implement
register moves. While functionally correct, this is not the canonical
encoding when both operands are general-purpose registers.
On AArch64, MOV has two aliases depending on the operand registers:
- MOV <Xd|SP>, <Xn|SP> → ADD <Xd|SP>, <Xn|SP>, #0
- MOV <Xd>, <Xn> → ORR <Xd>, XZR, <Xn>
The ADD form is required when the stack pointer is involved (as ORR
does not accept SP), while the ORR form is the preferred encoding for
general-purpose registers.
The ORR encoding is also measurably faster on modern microarchitectures.
A microbenchmark [1] comparing dependent chains of MOV (ORR) vs ADD #0
on an ARM Neoverse-V2 (72-core, 3.4 GHz) shows:
=== mov (ORR Xd, XZR, Xn) ===
run1 cycles/op=0.749859456
run2 cycles/op=0.749991250
run3 cycles/op=0.749601847
avg cycles/op=0.749817518
=== add0 (ADD Xd, Xn, #0) ===
run1 cycles/op=1.004777689
run2 cycles/op=1.004558266
run3 cycles/op=1.004806559
avg cycles/op=1.004714171
The ORR form completes in ~0.75 cycles/op vs ~1.00 cycles/op for ADD #0,
a ~25% improvement. This is likely because the CPU's register renaming
hardware can eliminate ORR-based moves, while ADD #0 must go through the
ALU pipeline.
Update A64_MOV to select the appropriate encoding at JIT time:
use ADD when either register is A64_SP, and ORR (via
aarch64_insn_gen_move_reg()) otherwise.
Update verifier_private_stack selftests to expect "mov x7, x0" instead
of "add x7, x0, #0x0" in the JITed instruction checks, matching the
new ORR-based encoding.
[1] https://github.com/puranjaymohan/scripts/blob/main/arm64/bench/run_mov_vs_add0.sh
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260225134339.2723288-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Implement BPF_TRACE_FSESSION support for s390. The logic here is similar
to what we did in x86_64.
In order to simply the logic, we factor out the function invoke_bpf() for
fentry and fexit.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260224092208.1395085-3-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Introduce a helper to store 64-bit immediate on the trampoline stack with
a help of a register.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20260224092208.1395085-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Implementing BPF version of preempt_count() requires accessing lowcore
from BPF. Since lowcore can be relocated, open-coding
(struct lowcore *)0 does not work, so add a kfunc.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20260217160813.100855-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Recent changes replaced the use of no_64bit_msi with msi_addr_mask, which
is now expected to be initialized to DMA_BIT_MASK(64) during PCI device
setup. On SPARC systems, this initialization was inadvertently missed for
devices instantiated from device tree nodes, leaving msi_addr_mask unset
for OF-created pci_dev instances. As a result, MSI address validation fails
during probe, causing affected devices to fail initialization.
Initialize pdev->msi_addr_mask to DMA_BIT_MASK(64) in of_create_pci_dev()
so that MSI address validation succeeds and PCI device probing works as
expected.
Fixes: 386ced19e9a3 ("PCI/MSI: Convert the boolean no_64bit_msi flag to a DMA address mask")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Han Gao <gaohan@iscas.ac.cn> # SPARC Enterprise T5220
Tested-by: Nathaniel Roach <nroach44@nroach44.id.au> # SPARC T5-2
Reviewed-by: Vivian Wang <wangruikang@iscas.ac.cn>
Link: https://patch.msgid.link/20260220070239.1693303-3-nilay@linux.ibm.com
|
|
Recent changes replaced the use of no_64bit_msi with msi_addr_mask. As a
result, msi_addr_mask is now expected to be initialized to DMA_BIT_MASK(64)
when a pci_dev is set up. However, this initialization was missed on
powerpc due to differences in the device initialization path compared to
other (x86) architecture. Due to this, now PCI device probe method fails on
powerpc system.
On powerpc systems, struct pci_dev instances are created from device tree
nodes via of_create_pci_dev(). Because msi_addr_mask was not initialized
there, it remained zero. Later, during MSI setup, msi_verify_entries()
validates the programmed MSI address against pdev->msi_addr_mask. Since the
mask was not set correctly, the validation fails, causing PCI driver probe
failures for devices on powerpc systems.
Initialize pdev->msi_addr_mask to DMA_BIT_MASK(64) in of_create_pci_dev()
so that MSI address validation succeeds and device probe works as expected.
Fixes: 386ced19e9a3 ("PCI/MSI: Convert the boolean no_64bit_msi flag to a DMA address mask")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Tested-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Vivian Wang <wangruikang@iscas.ac.cn>
Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260220070239.1693303-2-nilay@linux.ibm.com
|
|
The __stackleak_poison() inline assembly comes with a "count" operand where
the "d" constraint is used. "count" is used with the exrl instruction and
"d" means that the compiler may allocate any register from 0 to 15.
If the compiler would allocate register 0 then the exrl instruction would
not or the value of "count" into the executed instruction - resulting in a
stackframe which is only partially poisoned.
Use the correct "a" constraint, which excludes register 0 from register
allocation.
Fixes: 2a405f6bb3a5 ("s390/stackleak: provide fast __stackleak_poison() implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Link: https://lore.kernel.org/r/20260302133500.1560531-4-hca@linux.ibm.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
The inline assembly constraint for the "bytes" operand is "d" for all xor()
inline assemblies. "d" means that any register from 0 to 15 can be used. If
the compiler would use register 0 then the exrl instruction would not or
the value of "bytes" into the executed instruction - resulting in an
incorrect result.
However all the xor() inline assemblies make hard-coded use of register 0,
and it is correctly listed in the clobber list, so that this cannot happen.
Given that this is quite subtle use the better "a" constraint, which
excludes register 0 from register allocation in any case.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Link: https://lore.kernel.org/r/20260302133500.1560531-3-hca@linux.ibm.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
The inline assembly constraints for xor_xc_2() are incorrect. "bytes",
"p1", and "p2" are input operands, while all three of them are modified
within the inline assembly. Given that the function consists only of this
inline assembly it seems unlikely that this may cause any problems, however
fix this in any case.
Fixes: 2cfc5f9ce7f5 ("s390/xor: optimized xor routing using the XC instruction")
Cc: stable@vger.kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Link: https://lore.kernel.org/r/20260302133500.1560531-2-hca@linux.ibm.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
xor_xc_5() contains a larl 1,2f that is not used by the asm and is not
declared as a clobber. This can corrupt a compiler-allocated value in %r1
and lead to miscompilation. Remove the instruction.
Fixes: 745600ed6965 ("s390/lib: Use exrl instead of ex in xor functions")
Cc: stable@vger.kernel.org
Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
After commit e6e094e053af75 ("x86/acpi, x86/boot: Take RSDP address from
boot params if available"), the RSDP address can be passed in boot
params. Therefore, store the RSDP address in start_info page into boot
params in the PVH entry instead of registering a different callback.
This removes an absolute reference during the PVH entry and is more
standardized.
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <76675c4d49d3a8f72252076812ef8f22276230c2.1772282441.git.houwenlong.hwl@antgroup.com>
|
|
The function xen_flush_tlb_others() was renamed xen_flush_tlb_multi()
by commit 4ce94eabac16 ("x86/mm/tlb: Flush remote and local TLBs
concurrently"). Update the comment accordingly.
Signed-off-by: kexinsun <kexinsun@smail.nju.edu.cn>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260224022424.1718-1-kexinsun@smail.nju.edu.cn>
|
|
After commit 47ffe0578aee ("x86/pvh: Add 64bit relocation page tables"),
the PVH entry uses a new set of page tables instead of the
preconstructed page tables in head64.S. Since those preconstructed page
tables are only used in XENPV now and XENPV does not actually need the
preconstructed identity page tables directly, they can be filled in
xen_setup_kernel_pagetable(). Therefore, build the identity mapping page
table dynamically to remove the preconstructed page tables and make the
code cleaner.
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: "Borislav Petkov (AMD)" <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <453981eae7e8158307f971d1632d5023adbe03c3.1769074722.git.houwenlong.hwl@antgroup.com>
|
|
The nx_huge_pages parameter is stored as an int (initialized to -1 to
indicate auto mode), but get_nx_huge_pages() calls param_get_bool()
which expects a bool pointer.
This causes UBSAN to report "load of value 255 is not a valid value for
type '_Bool'" when the parameter is read via sysfs during a narrow time
window.
The issue occurs during module load: the module parameter is registered
and its sysfs file becomes readable before the kvm_mmu_x86_module_init()
function runs:
1. Module load begins, static variable initialized to -1
2. mod_sysfs_setup() creates /sys/module/kvm/parameters/nx_huge_pages
3. (Parameter readable, value = -1)
4. do_init_module() runs kvm_x86_init()
5. kvm_mmu_x86_module_init() resolves -1 to bool
If userspace (e.g., sos report) reads the parameter during step 3,
param_get_bool() dereferences the int as a bool, triggering the UBSAN
warning.
Fix that by properly reading and converting the -1 value into an 'auto'
string.
Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20260225145050.2350278-3-gal@nvidia.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The avic parameter is stored as an int to support the special value -1
(AVIC_AUTO_MODE), but the cited commit changed it from bool to int while
keeping param_get_bool() as the getter function.
This causes UBSAN to report "load of value 255 is not a valid value for
type '_Bool'" when the parameter is read via sysfs.
The issue happens in two scenarios:
1. During module load: There's a time window between when module
parameters are registered, and when avic_hardware_setup() runs to
resolve the value, where the value is -1.
2. On non-AMD systems: On non-AMD hardware, the kvm_is_svm_supported()
check returns early. The avic_hardware_setup() function never runs,
so avic remains -1.
Fix that by implementing a getter function that properly reads and
converts the -1 value into a string.
Triggered by sos report:
UBSAN: invalid-load in kernel/params.c:323:33
load of value 255 is not a valid value for type '_Bool'
CPU: 0 UID: 0 PID: 4667 Comm: sos Not tainted 6.19.0-rc5net_mlx5_1e86836 #1 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x69/0xa0
ubsan_epilogue+0x5/0x2b
__ubsan_handle_load_invalid_value.cold+0x47/0x4c
? lock_acquire+0x219/0x2c0
param_get_bool.cold+0xf/0x14
param_attr_show+0x51/0x80
module_attr_show+0x19/0x30
sysfs_kf_seq_show+0xac/0xf0
seq_read_iter+0x100/0x410
copy_splice_read+0x1b4/0x360
splice_direct_to_actor+0xbd/0x270
? wait_for_space+0xb0/0xb0
do_splice_direct+0x72/0xb0
? propagate_umount+0x870/0x870
do_sendfile+0x3a3/0x470
__x64_sys_sendfile64+0x5e/0xe0
do_syscall_64+0x70/0x8c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: ca2967de5a5b ("KVM: SVM: Enable AVIC by default for Zen4+ if x2AVIC is support")
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://patch.msgid.link/20260225145050.2350278-2-gal@nvidia.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add helpers to fill kvm_run for userspace MMIO exits to deduplicate a
variety of code, and to allow for a cleaner return path in
emulator_read_write().
Opportunistically add a KVM_BUG_ON() to ensure the caller is limiting the
length of a single MMIO access to 8 bytes (the largest size userspace is
prepared to handled, as the ABI was baked before things like MOVDQ came
along).
No functional change intended.
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Michael Roth <michael.roth@amd.com>
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
sideways
Kill the VM instead of the host kernel if KVM botches I/O and/or MMIO
handling. There is zero danger to the host or guest, i.e. panicking the
host isn't remotely justified.
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename the ops and helpers to read/write guest memory to clarify that they
do exactly that, i.e. aren't generic emulation flows and don't do anything
related to emulated MMIO.
Opportunistically add comments to explain the flow, e.g. it's not exactly
obvious that KVM deliberately treats "failed" accesses to guest memory as
emulated MMIO.
No functional change intended.
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fold emulator_write_phys() into write_emulate() to drop a superfluous
wrapper, and to provide more symmetry between the read and write paths.
No functional change intended.
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that SEV-ES invokes vcpu_mmio_{read,write}() directly, bury the
read vs. write emulator ops in the dedicated emulator callbacks so that
they are colocated with their usage, and to make it harder for
non-emulator code to use hooks that are intended for the emulator.
Opportunistically rename the structures to get rid of the long-standing
"emultor" typo.
No functional change intended.
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Dedup kvm_sev_es_mmio_{read,write}() into a single API, as the "cost" of
plumbing in a boolean is largely negligible since KVM pulls out a boolean
for ops->write anyways, and consolidating the APIs will allow for
additional cleanups.
No functional change intended.
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Dedup the SEV-ES emulated MMIO code by using the read vs. write emulator
ops to handle the few differences between reads and writes.
Opportunistically tweak the comment about fragments to call out that KVM
should verify that userspace can actually handle MMIO requests that cross
page boundaries. Unlike emulated MMIO, the request is made in the GPA
space, not the GVA space, i.e. emulation across page boundaries can work
generically, at least in theory.
No functional change intended.
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a sanity check to ensure KVM doesn't use an on-stack variable when
handling an MMIO request for an SEV-ES guest. The source/destination
for SEV-ES MMIO should _always_ be the #VMGEXIT scratch area.
Opportunistically update the comment in the completion side of things
to clarify that frag->data doesn't need to be copied anywhere, and the
VMEGEXIT is trap-like (the current comment doesn't clarify *how* RIP is
advanced).
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the invocation of MMIO write tracepoint into vcpu_mmio_write() and
drop its largely-useless wrapper to cull pointless code and to make the
code symmetrical with respect to vcpu_mmio_read().
No functional change intended.
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Open code the differences in read vs. write userspace MMIO exits instead
of burying three lines of code behind indirect callbacks, as splitting the
logic makes it extremely hard to track that KVM's handling of reads vs.
write is _significantly_ different. Add a comment to explain why the
semantics are different, and how on earth an MMIO write ends up triggering
an exit to userspace.
No functional change intended.
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|