| Age | Commit message (Collapse) | Author | Files | Lines |
|
KVM allows VMMs to specify the maximum possible APIC ID for a virtual
machine through KVM_CAP_MAX_VCPU_ID capability so as to limit data
structures related to APIC/x2APIC. Utilize the same to set the AVIC
physical max index in the VMCB, similar to VMX. This helps hardware
limit the number of entries to be scanned in the physical APIC ID table
speeding up IPI broadcasts for virtual machines with smaller number of
vCPUs.
Unlike VMX, SVM AVIC requires a single page to be allocated for the
Physical APIC ID table and the Logical APIC ID table, so retain the
existing approach of allocating those during VM init.
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/adb07ccdb3394cd79cb372ba6bcc69a4e4d4ef54.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
checks
Add an off-by-default param, "warn_on_missed_cc", to have KVM WARN on a
missed VMX Consistency Check on nested VM-Enter, specifically so that KVM
developers and maintainers can more easily detect missing checks. KVM's
goal/intent is that KVM detect *all* VM-Fail conditions in software, as
relying on hardware leads to false passes when KVM's nested support is a
subset of hardware support, e.g. see commit 095686e6fcb4 ("KVM: nVMX:
Check vmcs12->guest_ia32_debugctl on nested VM-Enter").
With one notable exception, KVM now detects all VM-Fail scenarios for
which there is known test coverage, i.e. KVM developers can enable the
param and expect a clean run, and thus can use the param to detect missed
checks, e.g. when enabling new features, when writing new tests, etc.
The one exception is an unfortunate consistency check on vTPR. Because
the vTPR for L2 comes from the virtual APIC page provided by L1, L2's vTPR
is fully writable at all times, i.e. is inherently subject to TOCTOU
issues with respect to checks in software versus consumption in hardware.
Further complicating matters is KVM's deferred handling of vmcs12 pages
when loading nested state; KVM flat out cannot check vTPR during
KVM_SET_NESTED_STATE without breaking setups that do on-demand paging,
e.g. for live migration and/or live update.
To fudge around the vTPR issue, add a "late" controls check for vTPR and
also treat an invalid virtual APIC as VM-Fail, but gate the check on
warn_on_missed_cc being enabled to avoid unwanted false positives, i.e. to
avoid breaking KVM in production.
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250919005955.1366256-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Remove nested_early_check and all associated code, as it's quite obviously
not being used or tested (it's been broken for 4+ years without a single
bug report). More importantly, KVM's software-based consistency checks
have matured since the option to do hardware-based checks was added; KVM
appears to be missing only _one_ consistency check, on vTPR. And even
*more* importantly, that consistency check can't be prevented by an early
hardware check due to L1 being able to modify the virtual APIC at any
time, i.e. there's an inherent TOCTOU flaw that could cause KVM to "miss"
a consistency check VM-Fail, regardless of whether the check is performed
by software or by hardware.
In other words, KVM _must_ be able to unwind from a late VM-Fail (which
was a big motivation for doing early checks). I.e. now that KVM provides
(almost) all necessary consistency checks, what's really needed is a way
to detect missing checks in KVM, not a way to avoid having to unwind from
a late VM-Fail. And that can be done much more simply, e.g. by an simple
module param to guard a WARN (which, sadly, must be off-by-default to
avoid splats due to the aforementioned TOCTOU issue).
For all intents and purposes, this reverts commit 52017608da33 ("KVM:
nVMX: add option to perform early consistency checks via H/W").
Link: https://lore.kernel.org/r/20250919005955.1366256-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
If KVM is doing "early" nested VM-Enter consistency checks and TSC scaling
is supported, stuff vmcs02's TSC Multiplier early on to avoid getting a
false positive VM-Fail due to trying to do VM-Enter with TSC_MULTIPLIER=0.
To minimize complexity around L1 vs. L2 TSC, KVM sets the actual TSC
Multiplier rather late during VM-Entry, i.e. may have '0' at the time of
early consistency checks.
If vmcs12 has TSC Scaling enabled, use the multiplier from vmcs12 so that
nested early checks actually check vmcs12 state, otherwise throw in an
arbitrary value of '1' (anything non-zero is legal).
Fixes: d041b5ea9335 ("KVM: nVMX: Enable nested TSC scaling")
Link: https://lore.kernel.org/r/20250919005955.1366256-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a missing consistency check on the TSC Multiplier being '0'. Per the
SDM:
If the "use TSC scaling" VM-execution control is 1, the TSC-multiplier
must not be zero.
Fixes: d041b5ea9335 ("KVM: nVMX: Enable nested TSC scaling")
Link: https://lore.kernel.org/r/20250919005955.1366256-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a missing consistency check on the TPR Threshold. Per the SDM
If the "use TPR shadow" VM-execution control is 1 and the "virtual-
interrupt delivery" VM-execution control is 0, bits 31:4 of the TPR
threshold VM-execution control field must be 0.
Note, nested_vmx_check_tpr_shadow_controls() bails early if "use TPR
shadow" is 0.
Link: https://lore.kernel.org/r/20250919005955.1366256-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use the role for the to-be-loaded/invalidated EPT root to compute the
root's level and A/D enablement instead of pulling the information from
the vCPU (e.g. by passing in the root level and querying vmcs12). Not
making unnecessary assumptions about the root will allow invalidating
arbitrary EPT roots (which sadly requires a full EPTP) at any given time.
No functional change intended (the end result should be the same).
Link: https://lore.kernel.org/r/20250919005955.1366256-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the helpers to get/query a dummy root from mmu_internal.h to spte.h
so that VMX can detect and handle dummy roots when constructing EPTPs.
This will allow using the root's role to build the EPTP instead of pulling
equivalent information out of the vCPU structure.
No functional change intended.
Link: https://lore.kernel.org/r/20250919005955.1366256-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Hardcode the dummy EPTP used for "early" consistency checks as there's no
need to use 5-level EPT based on the guest.MAXPHYADDR (the EPTP just needs
to be valid, it's never truly consumed).
This will allow breaking construct_eptp()'s dependency on having access to
the vCPU, which in turn will (much further in the future) allow for eliding
per-root TLB flushes when a vCPU is migrated between pCPUs (a flush is
need if and only if that particular pCPU hasn't already flushed the vCPU's
roots).
Link: https://lore.kernel.org/r/20250919005955.1366256-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move construct_eptp() further up in vmx.c so that it's above
vmx_flush_tlb_current(), its "first" user in vmx.c. This will allow a
future patch to opportunistically make construct_eptp() local to vmx.c.
No functional change intended.
Link: https://lore.kernel.org/r/20250919005955.1366256-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Correct the reserved memory size for TF-A to 512K, as it was mistakenly
marked as 500K. Update the reserved memory node accordingly.
Fixes: 8517204c982b ("arm64: dts: qcom: ipq5424: Add reserved memory for TF-A")
Signed-off-by: Kathiravan Thirumoorthy <kathiravan.thirumoorthy@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251014-tfa-reserved-mem-v1-1-48c82033c8a7@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
th1520 support Zfh ISA extension.
It supports the same RISC-V extensions as SG2042.
commit cb074bed1186 ("riscv: dts: sophgo: add zfh for sg2042")
Signed-off-by: Han Gao <rabenda.cn@gmail.com>
Reviewed-by: Drew Fustini <fustini@kernel.org>
Signed-off-by: Drew Fustini <fustini@kernel.org>
|
|
Existing rv64 hardware conforms to the rva20 profile.
Ziccrse is an additional extension required by the rva20 profile, so
th1520 has this extension.
Signed-off-by: Han Gao <rabenda.cn@gmail.com>
Reviewed-by: Drew Fustini <fustini@kernel.org>
Signed-off-by: Drew Fustini <fustini@kernel.org>
|
|
The th1520 support xtheadvector [1] so it can be included in the
devicetree. Also include vlenb for the cpu. And set vlenb=16 [2].
This can be tested by passing the "mitigations=off" kernel parameter.
Link: https://lore.kernel.org/linux-riscv/20241113-xtheadvector-v11-4-236c22791ef9@rivosinc.com/ [1]
Link: https://lore.kernel.org/linux-riscv/aCO44SAoS2kIP61r@ghost/ [2]
Signed-off-by: Han Gao <rabenda.cn@gmail.com>
Reviewed-by: Drew Fustini <fustini@kernel.org>
Signed-off-by: Drew Fustini <fustini@kernel.org>
|
|
We intend that EL0 exception handlers unmask all DAIF exceptions
before calling exit_to_user_mode().
When completing single-step of a suspended breakpoint, we do not call
local_daif_restore(DAIF_PROCCTX) before calling exit_to_user_mode(),
leaving all DAIF exceptions masked.
When pseudo-NMIs are not in use this is benign.
When pseudo-NMIs are in use, this is unsound. At this point interrupts
are masked by both DAIF.IF and PMR_EL1, and subsequent irq flag
manipulation may not work correctly. For example, a subsequent
local_irq_enable() within exit_to_user_mode_loop() will only unmask
interrupts via PMR_EL1 (leaving those masked via DAIF.IF), and
anything depending on interrupts being unmasked (e.g. delivery of
signals) will not work correctly.
This was detected by CONFIG_ARM64_DEBUG_PRIORITY_MASKING.
Move the call to `try_step_suspended_breakpoints()` outside of the check
so that interrupts can be unmasked even if we don't call the step handler.
Fixes: 0ac7584c08ce ("arm64: debug: split single stepping exception entry")
Cc: <stable@vger.kernel.org> # 6.17
Signed-off-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
[catalin.marinas@arm.com: added Mark's rewritten commit log and some whitespace]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The GIC CDEOI system instruction requires the Rt field to be set to 0b11111
otherwise the instruction behaviour becomes CONSTRAINED UNPREDICTABLE.
Currenly, its usage is encoded as a system register write, with a constant
0 value:
write_sysreg_s(0, GICV5_OP_GIC_CDEOI)
While compiling with GCC, the 0 constant value, through these asm
constraints and modifiers ('x' modifier and 'Z' constraint combo):
asm volatile(__msr_s(r, "%x0") : : "rZ" (__val));
forces the compiler to issue the XZR register for the MSR operation (ie
that corresponds to Rt == 0b11111) issuing the right instruction encoding.
Unfortunately LLVM does not yet understand that modifier/constraint
combo so it ends up issuing a different register from XZR for the MSR
source, which in turns means that it encodes the GIC CDEOI instruction
wrongly and the instruction behaviour becomes CONSTRAINED UNPREDICTABLE
that we must prevent.
Add a conditional to write_sysreg_s() macro that detects whether it
is passed a constant 0 value and issues an MSR write with XZR as source
register - explicitly doing what the asm modifier/constraint is meant to
achieve through constraints/modifiers, fixing the LLVM compilation issue.
Fixes: 7ec80fb3f025 ("irqchip/gic-v5: Add GICv5 PPI support")
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Cc: Sascha Bischoff <sascha.bischoff@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
When executing kvm_riscv_vcpu_aia_has_interrupts, the vCPU may have
migrated and the IMSIC VS-file have not been updated yet, currently
the HGEIP CSR should be read from the imsic->vsfile_cpu ( the pCPU
before migration ) via on_each_cpu_mask, but this will trigger an
IPI call and repeated IPI within a period of time is expensive in
a many-core systems.
Just let the vCPU execute and update the correct IMSIC VS-file via
kvm_riscv_vcpu_aia_imsic_update may be a simple solution.
Fixes: 4cec89db80ba ("RISC-V: KVM: Move HGEI[E|P] CSR access to IMSIC virtualization")
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Reviewed-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Anup Patel <anup@brainfault.org>
Tested-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20251016012659.82998-1-fangyu.yu@linux.alibaba.com
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
Add eeprom device node for PRot module FRU.
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
|
|
Enable AMD APML related features
- add amd sbrmi node for SoC power reading
- add amd sbtsi node for SoC temperature reading
- rename the P0_I3C_APML_ALERT_L GPIO to align with the naming
convention expected by the AMD APML tool
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
|
|
Add GPIO line name for userspace control or monitoring
- Add leak-related line names to report chassis leak event
- Add debug-card-mux to control debug card access
- Add FM_MAIN_PWREN_RMC_EN_ISO_R to receive RMC power control signal
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
|
|
Add a 'bmc_ready_noled' LED on GPIOB3 with GPIO_TRANSITORY to ensure its
state resets on BMC reboot.
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
|
|
Add the mctp-controller property and MCTP node to enable frontend NIC
management via PLDM over MCTP.
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
|
|
add power monitor and temperature sensors for extension boards in bus 6,
8, 10 and 13.
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
|
|
Add missing blank lines between DT nodes to follow the devicetree coding
style and improve readability.
No functional changes.
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
|
|
"mac3" controller was removed from the initial version of fuji-data64
dts because the rgmii setting is incorrect, but dropping mac3 leads to
regression in the existing fuji platform, because fuji.dts simply
includes fuji-data64.dts.
This patch adds mac3 back to fuji-data64.dts to fix the fuji regression[1],
and rgmii settings need to be fixed later.
Fixes: b0f294fdfc3e ("ARM: dts: aspeed: facebook-fuji: Include facebook-fuji-data64.dts")
Link: https://lore.kernel.org/all/79ddc7b9-ef26-4959-9a16-aa4e006eb145@roeck-us.net/ [1]
Signed-off-by: Tao Ren <rentao.bupt@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
|
|
Add device tree for the Meta (Facebook) Yosemite5 compute node,
based on the AST2600 BMC.
The Yosemite5 platform provides monitoring of voltages, power,
temperatures, and other critical parameters across the motherboard,
CXL board, E1.S expansion board, and NIC components. The BMC also
logs relevant events and performs appropriate system actions in
response to abnormal conditions.
Signed-off-by: Kevin Tung <kevin.tung.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
|
|
== Background ==
ENCLS[EUPDATESVN] is a new SGX instruction [1] which allows enclave
attestation to include information about updated microcode SVN without a
reboot. Before an EUPDATESVN operation can be successful, all SGX memory
(aka. EPC) must be marked as “unused” in the SGX hardware metadata
(aka.EPCM). This requirement ensures that no compromised enclave can
survive the EUPDATESVN procedure and provides an opportunity to generate
new cryptographic assets.
== Solution ==
Attempt to execute ENCLS[EUPDATESVN] every time the first file descriptor
is obtained via sgx_(vepc_)open(). In the most common case the microcode
SVN is already up-to-date, and the operation succeeds without updating SVN.
Note: while in such cases the underlying crypto assets are regenerated, it
does not affect enclaves' visible keys obtained via EGETKEY instruction.
If it fails with any other error code than SGX_INSUFFICIENT_ENTROPY, this
is considered unexpected and the *open() returns an error. This should not
happen in practice.
On contrary, SGX_INSUFFICIENT_ENTROPY might happen due to a pressure on the
system's DRNG (RDSEED) and therefore the *open() can be safely retried to
allow normal enclave operation.
[1] Runtime Microcode Updates with Intel Software Guard Extensions,
https://cdrdv2.intel.com/v1/dl/getContent/648682
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Nataliia Bondarevska <bondarn@google.com>
|
|
All running enclaves and cryptographic assets (such as internal SGX
encryption keys) are assumed to be compromised whenever an SGX-related
microcode update occurs. To mitigate this assumed compromise the new
supervisor SGX instruction ENCLS[EUPDATESVN] can generate fresh
cryptographic assets.
Before executing EUPDATESVN, all SGX memory must be marked as unused. This
requirement ensures that no potentially compromised enclave survives the
update and allows the system to safely regenerate cryptographic assets.
Add the method to perform ENCLS[EUPDATESVN]. However, until the follow up
patch that wires calling sgx_update_svn() from sgx_inc_usage_count(), this
code is not reachable.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Nataliia Bondarevska <bondarn@google.com>
|
|
Add error codes for ENCLS[EUPDATESVN], then SGX CPUSVN update process can
know the execution state of EUPDATESVN and notify userspace.
EUPDATESVN will be called when no active SGX users is guaranteed. Only add
the error codes that can legally happen. E.g., it could also fail due to
"SGX not ready" when there's SGX users but it wouldn't happen in this
implementation.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Nataliia Bondarevska <bondarn@google.com>
|
|
Add a flag indicating whenever ENCLS[EUPDATESVN] SGX instruction is
supported. This will be used by SGX driver to perform CPU SVN updates.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Nataliia Bondarevska <bondarn@google.com>
|
|
Currently, when SGX is compromised and the microcode update fix is applied,
the machine needs to be rebooted to invalidate old SGX crypto-assets and
make SGX be in an updated safe state. It's not friendly for the cloud.
To avoid having to reboot, a new ENCLS[EUPDATESVN] is introduced to update
SGX environment at runtime. This process needs to be done when there's no
SGX users to make sure no compromised enclaves can survive from the update
and allow the system to regenerate crypto-assets.
For now there's no counter to track the active SGX users of host enclave
and virtual EPC. Introduce such counter mechanism so that the EUPDATESVN
can be done only when there's no SGX users.
Define placeholder functions sgx_inc/dec_usage_count() that are used to
increment and decrement such a counter. Also, wire the call sites for
these functions. Encapsulate the current sgx_(vepc_)open() to
__sgx_(vepc_)open() to make the new sgx_(vepc_)open() easy to read.
The definition of the counter itself and the actual implementation of
sgx_inc/dec_usage_count() functions come next.
Note: The EUPDATESVN, which may fail, will be done in
sgx_inc_usage_count(). Make it return 'int' to make subsequent patches
which implement EUPDATESVN easier to review. For now it always returns
success.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Nataliia Bondarevska <bondarn@google.com>
|
|
Fix kernel-doc warnings by adding the missing '*' to each line.
Warning: include/asm/idtentry.h:395 bad line: when raised from kernel mode
Warning: include/asm/idtentry.h:405 bad line: when raised from user mode
Since this is in a kernel-doc block, these lines need a leading
" *" on each line to prevent the warnings.
Fixes: a13644f3a53d ("x86/entry/64: Add entry code for #VC handler")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
To set all 64 bits in the mask on a 32-bit system, the constant must
have type `unsigned long long`.
Fixes: 6b1e8ba4bac4 ("RISC-V: KVM: Use bitmap for irqs_pending and irqs_pending_mask")
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20251016001714.3889380-1-samuel.holland@sifive.com
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
A compile warning slipped through:
arch/x86/kernel/smpboot.c:548:5: warning: no previous prototype for function 'arch_sched_node_distance' [-Wmissing-prototypes]
Fixes: 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Mark the RTL8211F PHY as a wakeup source for the Jetson Xavier NX.
This allows the reworked RTL8211F driver to know that the PHY is
wired to wakeup capable hardware, and thus to expose WoL capabilities.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
|
|
It is possible for Granite Rapids (GNR) and Clearwater Forest
(CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3)
is enabled, each die will become a separate NUMA node in the package
with different distances between dies within the same package.
For example, on GNR, we see the following numa distances for a 2 socket
system with 3 dies per socket:
package 1 package2
----------------
| |
--------- ---------
| 0 | | 3 |
--------- ---------
| |
--------- ---------
| 1 | | 4 |
--------- ---------
| |
--------- ---------
| 2 | | 5 |
--------- ---------
| |
----------------
node distances:
node 0 1 2 3 4 5
0: 10 15 17 21 28 26
1: 15 10 15 23 26 23
2: 17 15 10 26 23 21
3: 21 28 26 10 15 17
4: 23 26 23 15 10 15
5: 26 23 21 17 15 10
The node distances above led to 2 problems:
1. Asymmetric routes taken between nodes in different packages led to
asymmetric scheduler domain perspective depending on which node you
are on. Current scheduler code failed to build domains properly with
asymmetric distances.
2. Multiple remote distances to respective tiles on remote package create
too many levels of domain hierarchies grouping different nodes between
remote packages.
For example, the above GNR topology lead to NUMA domains below:
Sched domains from the perspective of a CPU in node 0, where the number
in bracket represent node number.
NUMA-level 1 [0,1] [2]
NUMA-level 2 [0,1,2] [3]
NUMA-level 3 [0,1,2,3] [5]
NUMA-level 4 [0,1,2,3,5] [4]
Sched domains from the perspective of a CPU in node 4
NUMA-level 1 [4] [3,5]
NUMA-level 2 [3,4,5] [0,2]
NUMA-level 3 [0,2,3,4,5] [1]
Scheduler group peers for load balancing from the perspective of CPU 0
and 4 are different. Improper task could be chosen for load balancing
between groups such as [0,2,3,4,5] [1]. Ideally you should choose nodes
in 0 or 2 that are in same package as node 1 first. But instead tasks
in the remote package node 3, 4, 5 could be chosen with an equal chance
and could lead to excessive remote package migrations and imbalance of
load between packages. We should not group partial remote nodes and
local nodes together.
Simplify the remote distances for CWF and GNR for the purpose of
sched domains building, which maintains symmetry and leads to a more
reasonable load balance hierarchy.
The sched domains from the perspective of a CPU in node 0 NUMA-level 1
is now
NUMA-level 1 [0,1] [2]
NUMA-level 2 [0,1,2] [3,4,5]
The sched domains from the perspective of a CPU in node 4 NUMA-level 1
is now
NUMA-level 1 [4] [3,5]
NUMA-level 2 [3,4,5] [0,1,2]
We have the same balancing perspective from node 0 or node 4. Loads are
now balanced equally between packages.
Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
|
|
Use the new-found freedom of allowing variable declarions inside
for() to simplify the for_each_insn_prefix() iterator to no longer
need an external temporary.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Both uprobes and alternatives have insn_is_nop() variants, unify them
and make sure insn_is_nop() works for both x86_64 and i386.
Specifically, uprobe must not compare userspace instructions to kernel
nops as that does not work right in the compat case.
For the uprobe case we therefore must recognise common 32bit and 64bit
nops. Because uprobe will consume the instruction as a nop, it must
not mistakenly claim a non-nop instruction to be a nop. Eg. 'REX.b3
NOP' is 'xchg %r8,%rax' - not a nop.
For the kernel case similar constraints apply, is it used to optimize
NOPs by replacing strings of short(er) nops with longer nops. Must not
claim an instruction is a nop if it really isn't. Not recognising a
nop is non-fatal.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
On AMD machines cpuc->events[idx] can become NULL in a subtle race
condition with NMI->throttle->x86_pmu_stop().
Check event for NULL in amd_pmu_enable_all() before enable to avoid a GPF.
This appears to be an AMD only issue.
Syzkaller reported a GPF in amd_pmu_enable_all.
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 13.143
msecs
Oops: general protection fault, probably for non-canonical address
0xdffffc0000000034: 0000 PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7]
CPU: 0 UID: 0 PID: 328415 Comm: repro_36674776 Not tainted 6.12.0-rc1-syzk
RIP: 0010:x86_pmu_enable_event (arch/x86/events/perf_event.h:1195
arch/x86/events/core.c:1430)
RSP: 0018:ffff888118009d60 EFLAGS: 00010012
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000034 RSI: 0000000000000000 RDI: 00000000000001a0
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
R13: ffff88811802a440 R14: ffff88811802a240 R15: ffff8881132d8601
FS: 00007f097dfaa700(0000) GS:ffff888118000000(0000) GS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000200001c0 CR3: 0000000103d56000 CR4: 00000000000006f0
Call Trace:
<IRQ>
amd_pmu_enable_all (arch/x86/events/amd/core.c:760 (discriminator 2))
x86_pmu_enable (arch/x86/events/core.c:1360)
event_sched_out (kernel/events/core.c:1191 kernel/events/core.c:1186
kernel/events/core.c:2346)
__perf_remove_from_context (kernel/events/core.c:2435)
event_function (kernel/events/core.c:259)
remote_function (kernel/events/core.c:92 (discriminator 1)
kernel/events/core.c:72 (discriminator 1))
__flush_smp_call_function_queue (./arch/x86/include/asm/jump_label.h:27
./include/linux/jump_label.h:207 ./include/trace/events/csd.h:64
kernel/smp.c:135 kernel/smp.c:540)
__sysvec_call_function_single (./arch/x86/include/asm/jump_label.h:27
./include/linux/jump_label.h:207
./arch/x86/include/asm/trace/irq_vectors.h:99 arch/x86/kernel/smp.c:272)
sysvec_call_function_single (arch/x86/kernel/smp.c:266 (discriminator 47)
arch/x86/kernel/smp.c:266 (discriminator 47))
</IRQ>
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
When the kernel leaves to userspace via syscall_restore_rfi(), the
W bit is not set in the new PSW. This doesn't cause any problems
because there's no 64 bit userspace for parisc. Simple static binaries
are usually loaded at addresses way below the 32 bit limit so the W bit
doesn't matter.
Fix this by setting the W bit when TIF_32BIT is not set.
Signed-off-by: Sven Schnelle <svens@stackframe.org>
Cc: stable@vger.kernel.org
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
Enable support for the CEC interface of the Synopsys DesignWare HDMI QP
IP block.
This is used by all boards based on RK3588 & RK3576 SoCs.
Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
|
|
The S5_RESET_STATUS register is parsed on boot and printed to kmsg.
However, this could sometimes be misleading and lead to users wasting a
lot of time on meaningless debugging for two reasons:
* Some bits are never cleared by hardware. It's the software's
responsibility to clear them as per the Processor Programming Reference
(see [1]).
* Some rare hardware-initiated platform resets do not update the
register at all.
In both cases, a previous reboot could leave its trace in the register,
resulting in users seeing unrelated reboot reasons while debugging random
reboots afterward.
Write the read value back to the register in order to clear all reason bits
since they are write-1-to-clear while the others must be preserved.
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206537#attach_303991
[ bp: Massage commit message. ]
Fixes: ab8131028710 ("x86/CPU/AMD: Print the reason for the last reset")
Signed-off-by: Rong Zhang <i@rong.moe>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/all/20250913144245.23237-1-i@rong.moe/
|
|
Modern AMD CPUs do not support segment limit checks in 64-bit mode
(i.e. EFER.LMSLE must be zero). Do not allow a guest to set EFER.LMSLE
on a CPU that requires the bit to be zero.
For backwards compatibility, allow EFER.LMSLE to be set on CPUs that
support segment limit checks in 64-bit mode, even though KVM's
implementation of the feature is incomplete (e.g. KVM's emulator does
not enforce segment limits in 64-bit mode).
Fixes: eec4b140c924 ("KVM: SVM: Allow EFER.LMSLE to be set with nested svm")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20251001001529.1119031-3-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Inspired by commit c065b6046b34 ("Use CONFIG_EXT4_FS instead of
CONFIG_EXT3_FS in all of the defconfigs") I looked around for any other
left-over EXT3 config options, and found some old defconfig files still
mentioned CONFIG_EXT3_DEFAULTS_TO_ORDERED.
That config option was removed a decade ago in commit c290ea01abb7 ("fs:
Remove ext3 filesystem driver"). It had a good run, but let's remove it
for good.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bug fixes from Ted Ts'o:
- Fix regression caused by removing CONFIG_EXT3_FS when testing some
very old defconfigs
- Avoid a BUG_ON when opening a file on a maliciously corrupted file
system
- Avoid mm warnings when freeing a very large orphan file metadata
- Avoid a theoretical races between metadata writeback and checkpoints
(it's very hard to hit in practice, since the race requires that the
writeback take a very long time)
* tag 'ext4_for_linus-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
Use CONFIG_EXT4_FS instead of CONFIG_EXT3_FS in all of the defconfigs
ext4: free orphan info with kvfree
ext4: detect invalid INLINE_DATA + EXTENTS flag combination
ext4, doc: fix and improve directory hash tree description
ext4: wait for ongoing I/O to complete before freeing blocks
jbd2: ensure that all ongoing I/O complete before freeing blocks
|
|
With staging support implemented, enable it when the CPU reports the
feature.
[ bp: Sort in the MSR properly. ]
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/20250320234104.8288-1-chang.seok.bae@intel.com
|
|
The functions for sending microcode data and retrieving the next offset
were previously placeholders, as they need to handle a specific mailbox
format.
While the kernel supports similar mailboxes, none of them are compatible
with this one. Attempts to share code led to unnecessary complexity, so
add a dedicated implementation instead.
[ bp: Sort the include properly. ]
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/20250320234104.8288-1-chang.seok.bae@intel.com
|
|
Previously, per-package staging invocations and their associated state
data were established. The next step is to implement the actual staging
handler according to the specified protocol. Below are key aspects to
note:
(a) Each staging process must begin by resetting the staging hardware.
(b) The staging hardware processes up to a page-sized chunk of the
microcode image per iteration, requiring software to submit data
incrementally.
(c) Once a data chunk is processed, the hardware responds with an
offset in the image for the next chunk.
(d) The offset may indicate completion or request retransmission of an
already transferred chunk. As long as the total transferred data
remains within the predefined limit (twice the image size),
retransmissions should be acceptable.
Incorporate them in the handler, while data transmission and mailbox
format handling are implemented separately.
[ bp: Sort the headers in a reversed name-length order. ]
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/20250320234104.8288-1-chang.seok.bae@intel.com
|
|
Define a staging_state struct to simplify function prototypes by consolidating
relevant data, instead of passing multiple local variables.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/20250320234104.8288-1-chang.seok.bae@intel.com
|
|
When microcode staging is initiated, operations are carried out through
an MMIO interface. Each package has a unique interface specified by the
IA32_MCU_STAGING_MBOX_ADDR MSR, which maps to a set of 32-bit registers.
Prepare staging with the following steps:
1. Ensure the microcode image is 32-bit aligned to match the MMIO
register size.
2. Identify each MMIO interface based on its per-package scope.
3. Invoke the staging function for each identified interface, which
will be implemented separately.
[ bp: Improve error logging. ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/all/871pznq229.ffs@tglx
|