Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FRED support from Thomas Gleixner:
"Support for x86 Fast Return and Event Delivery (FRED).
FRED is a replacement for IDT event delivery on x86 and addresses most
of the technical nightmares which IDT exposes:
1) Exception cause registers like CR2 need to be manually preserved
in nested exception scenarios.
2) Hardware interrupt stack switching is suboptimal for nested
exceptions as the interrupt stack mechanism rewinds the stack on
each entry which requires a massive effort in the low level entry
of #NMI code to handle this.
3) No hardware distinction between entry from kernel or from user
which makes establishing kernel context more complex than it needs
to be especially for unconditionally nestable exceptions like NMI.
4) NMI nesting caused by IRET unconditionally reenabling NMIs, which
is a problem when the perf NMI takes a fault when collecting a
stack trace.
5) Partial restore of ESP when returning to a 16-bit segment
6) Limitation of the vector space which can cause vector exhaustion
on large systems.
7) Inability to differentiate NMI sources
FRED addresses these shortcomings by:
1) An extended exception stack frame which the CPU uses to save
exception cause registers. This ensures that the meta information
for each exception is preserved on stack and avoids the extra
complexity of preserving it in software.
2) Hardware interrupt stack switching is non-rewinding if a nested
exception uses the currently interrupt stack.
3) The entry points for kernel and user context are separate and GS
BASE handling which is required to establish kernel context for
per CPU variable access is done in hardware.
4) NMIs are now nesting protected. They are only reenabled on the
return from NMI.
5) FRED guarantees full restore of ESP
6) FRED does not put a limitation on the vector space by design
because it uses a central entry points for kernel and user space
and the CPUstores the entry type (exception, trap, interrupt,
syscall) on the entry stack along with the vector number. The
entry code has to demultiplex this information, but this removes
the vector space restriction.
The first hardware implementations will still have the current
restricted vector space because lifting this limitation requires
further changes to the local APIC.
7) FRED stores the vector number and meta information on stack which
allows having more than one NMI vector in future hardware when the
required local APIC changes are in place.
The series implements the initial FRED support by:
- Reworking the existing entry and IDT handling infrastructure to
accomodate for the alternative entry mechanism.
- Expanding the stack frame to accomodate for the extra 16 bytes FRED
requires to store context and meta information
- Providing FRED specific C entry points for events which have
information pushed to the extended stack frame, e.g. #PF and #DB.
- Providing FRED specific C entry points for #NMI and #MCE
- Implementing the FRED specific ASM entry points and the C code to
demultiplex the events
- Providing detection and initialization mechanisms and the necessary
tweaks in context switching, GS BASE handling etc.
The FRED integration aims for maximum code reuse vs the existing IDT
implementation to the extent possible and the deviation in hot paths
like context switching are handled with alternatives to minimalize the
impact. The low level entry and exit paths are seperate due to the
extended stack frame and the hardware based GS BASE swichting and
therefore have no impact on IDT based systems.
It has been extensively tested on existing systems and on the FRED
simulation and as of now there are no outstanding problems"
* tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
x86/fred: Fix init_task thread stack pointer initialization
MAINTAINERS: Add a maintainer entry for FRED
x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
x86/fred: Invoke FRED initialization code to enable FRED
x86/fred: Add FRED initialization functions
x86/syscall: Split IDT syscall setup code into idt_syscall_init()
KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
x86/traps: Add sysvec_install() to install a system interrupt handler
x86/fred: FRED entry/exit and dispatch code
x86/fred: Add a machine check entry stub for FRED
x86/fred: Add a NMI entry stub for FRED
x86/fred: Add a debug fault entry stub for FRED
x86/idtentry: Incorporate definitions/declarations of the FRED entries
x86/fred: Make exc_page_fault() work for FRED
x86/fred: Allow single-step trap and NMI when starting a new task
x86/fred: No ESPFIX needed when FRED is enabled
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
"Rework of APIC enumeration and topology evaluation.
The current implementation has a couple of shortcomings:
- It fails to handle hybrid systems correctly.
- The APIC registration code which handles CPU number assignents is
in the middle of the APIC code and detached from the topology
evaluation.
- The various mechanisms which enumerate APICs, ACPI, MPPARSE and
guest specific ones, tweak global variables as they see fit or in
case of XENPV just hack around the generic mechanisms completely.
- The CPUID topology evaluation code is sprinkled all over the vendor
code and reevaluates global variables on every hotplug operation.
- There is no way to analyze topology on the boot CPU before bringing
up the APs. This causes problems for infrastructure like PERF which
needs to size certain aspects upfront or could be simplified if
that would be possible.
- The APIC admission and CPU number association logic is
incomprehensible and overly complex and needs to be kept around
after boot instead of completing this right after the APIC
enumeration.
This update addresses these shortcomings with the following changes:
- Rework the CPUID evaluation code so it is common for all vendors
and provides information about the APIC ID segments in a uniform
way independent of the number of segments (Thread, Core, Module,
..., Die, Package) so that this information can be computed instead
of rewriting global variables of dubious value over and over.
- A few cleanups and simplifcations of the APIC, IO/APIC and related
interfaces to prepare for the topology evaluation changes.
- Seperation of the parser stages so the early evaluation which tries
to find the APIC address can be seperately overridden from the late
evaluation which enumerates and registers the local APIC as further
preparation for sanitizing the topology evaluation.
- A new registration and admission logic which
- encapsulates the inner workings so that parsers and guest logic
cannot longer fiddle in it
- uses the APIC ID segments to build topology bitmaps at
registration time
- provides a sane admission logic
- allows to detect the crash kernel case, where CPU0 does not run
on the real BSP, automatically. This is required to prevent
sending INIT/SIPI sequences to the real BSP which would reset
the whole machine. This was so far handled by a tedious command
line parameter, which does not even work in nested crash
scenarios.
- Associates CPU number after the enumeration completed and
prevents the late registration of APICs, which was somehow
tolerated before.
- Converting all parsers and guest enumeration mechanisms over to the
new interfaces.
This allows to get rid of all global variable tweaking from the
parsers and enumeration mechanisms and sanitizes the XEN[PV]
handling so it can use CPUID evaluation for the first time.
- Mopping up existing sins by taking the information from the APIC ID
segment bitmaps.
This evaluates hybrid systems correctly on the boot CPU and allows
for cleanups and fixes in the related drivers, e.g. PERF.
The series has been extensively tested and the minimal late fallout
due to a broken ACPI/MADT table has been addressed by tightening the
admission logic further"
* tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
x86/topology: Ignore non-present APIC IDs in a present package
x86/apic: Build the x86 topology enumeration functions on UP APIC builds too
smp: Provide 'setup_max_cpus' definition on UP too
smp: Avoid 'setup_max_cpus' namespace collision/shadowing
x86/bugs: Use fixed addressing for VERW operand
x86/cpu/topology: Get rid of cpuinfo::x86_max_cores
x86/cpu/topology: Provide __num_[cores|threads]_per_package
x86/cpu/topology: Rename topology_max_die_per_package()
x86/cpu/topology: Rename smp_num_siblings
x86/cpu/topology: Retrieve cores per package from topology bitmaps
x86/cpu/topology: Use topology logical mapping mechanism
x86/cpu/topology: Provide logical pkg/die mapping
x86/cpu/topology: Simplify cpu_mark_primary_thread()
x86/cpu/topology: Mop up primary thread mask handling
x86/cpu/topology: Use topology bitmaps for sizing
x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT
x86/xen/smp_pv: Count number of vCPUs early
x86/cpu/topology: Assign hotpluggable CPUIDs during init
x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug
x86/topology: Add a mechanism to track topology via APIC IDs
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull clocksource updates from Thomas Gleixner:
"Updates for timekeeping and PTP core.
The cross-timestamp mechanism which allows to correlate hardware
clocks uses clocksource pointers for describing the correlation.
That's suboptimal as drivers need to obtain the pointer, which
requires needless exports and exposing internals. This can all be
completely avoided by assigning clocksource IDs and using them for
describing the correlated clock source.
So this adds clocksource IDs to all clocksources in the tree which can
be exposed to this mechanism and removes the pointer and now needless
exports.
A related improvement for the core and the correlation handling has
not made it this time, but is expected to get ready for the next
round"
* tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kvmclock: Unexport kvmclock clocksource
treewide: Remove system_counterval_t.cs, which is never read
timekeeping: Evaluate system_counterval_t.cs_id instead of .cs
ptp/kvm, arm_arch_timer: Set system_counterval_t.cs_id to constant
x86/kvm, ptp/kvm: Add clocksource ID, set system_counterval_t.cs_id
x86/tsc: Add clocksource ID, set system_counterval_t.cs_id
timekeeping: Add clocksource ID to struct system_counterval_t
x86/tsc: Correct kernel-doc notation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
"Updates for the MSI interrupt subsystem and initial RISC-V MSI
support.
The core changes have been adopted from previous work which converted
ARM[64] to the new per device MSI domain model, which was merged to
support multiple MSI domain per device. The ARM[64] changes are being
worked on too, but have not been ready yet. The core and platform-MSI
changes have been split out to not hold up RISC-V and to avoid that
RISC-V builds on the scheduled for removal interfaces.
The core support provides new interfaces to handle wire to MSI bridges
in a straight forward way and introduces new platform-MSI interfaces
which are built on top of the per device MSI domain model.
Once ARM[64] is converted over the old platform-MSI interfaces and the
related ugliness in the MSI core code will be removed.
The actual MSI parts for RISC-V were finalized late and have been
post-poned for the next merge window.
Drivers:
- Add a new driver for the Andes hart-level interrupt controller
- Rework the SiFive PLIC driver to prepare for MSI suport
- Expand the RISC-V INTC driver to support the new RISC-V AIA
controller which provides the basis for MSI on RISC-V
- A few fixup for the fallout of the core changes"
* tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
irqchip/riscv-intc: Fix low-level interrupt handler setup for AIA
x86/apic/msi: Use DOMAIN_BUS_GENERIC_MSI for HPET/IO-APIC domain search
genirq/matrix: Dynamic bitmap allocation
irqchip/riscv-intc: Add support for RISC-V AIA
irqchip/sifive-plic: Improve locking safety by using irqsave/irqrestore
irqchip/sifive-plic: Parse number of interrupts and contexts early in plic_probe()
irqchip/sifive-plic: Cleanup PLIC contexts upon irqdomain creation failure
irqchip/sifive-plic: Use riscv_get_intc_hwnode() to get parent fwnode
irqchip/sifive-plic: Use devm_xyz() for managed allocation
irqchip/sifive-plic: Use dev_xyz() in-place of pr_xyz()
irqchip/sifive-plic: Convert PLIC driver into a platform driver
irqchip/riscv-intc: Introduce Andes hart-level interrupt controller
irqchip/riscv-intc: Allow large non-standard interrupt number
genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens
irqchip/imx-intmux: Handle pure domain searches correctly
genirq/msi: Provide MSI_FLAG_PARENT_PM_DEV
genirq/irqdomain: Reroute device MSI create_mapping
genirq/msi: Provide allocation/free functions for "wired" MSI interrupts
genirq/msi: Optionally use dev->fwnode for device domain
genirq/msi: Provide DOMAIN_BUS_WIRED_TO_MSI
...
|
|
RFDS is a CPU vulnerability that may allow userspace to infer kernel
stale data previously used in floating point registers, vector registers
and integer registers. RFDS only affects certain Intel Atom processors.
Intel released a microcode update that uses VERW instruction to clear
the affected CPU buffers. Unlike MDS, none of the affected cores support
SMT.
Add RFDS bug infrastructure and enable the VERW based mitigation by
default, that clears the affected buffers just before exiting to
userspace. Also add sysfs reporting and cmdline parameter
"reg_file_data_sampling" to control the mitigation.
For details see:
Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
|
|
Currently MMIO Stale Data mitigation for CPUs not affected by MDS/TAA is
to only deploy VERW at VMentry by enabling mmio_stale_data_clear static
branch. No mitigation is needed for kernel->user transitions. If such
CPUs are also affected by RFDS, its mitigation may set
X86_FEATURE_CLEAR_CPU_BUF to deploy VERW at kernel->user and VMentry.
This could result in duplicate VERW at VMentry.
Fix this by disabling mmio_stale_data_clear static branch when
X86_FEATURE_CLEAR_CPU_BUF is enabled.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
|
|
KVM VMX changes for 6.9:
- Fix a bug where KVM would report stale/bogus exit qualification information
when exiting to userspace due to an unexpected VM-Exit while the CPU was
vectoring an exception.
- Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.
- Clean up the logic for massaging the passthrough MSR bitmaps when userspace
changes its MSR filter.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.9
* Set reserved bits as zero in CPUCFG.
* Start SW timer only when vcpu is blocking.
* Do not restart SW timer when it is expired.
* Remove unnecessary CSR register saving during enter guest.
|
|
https://github.com/kvm-x86/linux into HEAD
KVM GUEST_MEMFD fixes for 6.8:
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
avoid creating ABI that KVM can't sanely support.
- Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
clear that such VMs are purely a development and testing vehicle, and
come with zero guarantees.
- Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
is to support confidential VMs with deterministic private memory (SNP
and TDX) only in the TDP MMU.
- Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
|
|
Call this function unconditionally so that we can populate an empty DTB
on platforms that don't boot with a firmware provided or builtin DTB.
There's no harm in calling unflatten_device_tree() unconditionally here.
If there isn't a non-NULL 'initial_boot_params' pointer then
unflatten_device_tree() returns early.
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: H. Peter Anvin <hpa@zytor.com>
Tested-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Link: https://lore.kernel.org/r/20240217010557.2381548-5-sboyd@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
Instrumenting sev.c and mem_encrypt_identity.c with KMSAN will result in
a triple-faulting kernel. Some of the code is invoked too early during
boot, before KMSAN is ready.
Disable KMSAN instrumentation for the two translation units.
[ bp: Massage commit message. ]
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240308044401.1120395-1-changbin.du@huawei.com
|
|
As TOP_OF_KERNEL_STACK_PADDING was defined as 0 on x86_64, it went
unnoticed that the initialization of the .sp field in INIT_THREAD and some
calculations in the low level startup code do not take the padding into
account.
FRED enabled kernels require a 16 byte padding, which means that the init
task initialization and the low level startup code use the wrong stack
offset.
Subtract TOP_OF_KERNEL_STACK_PADDING in all affected places to adjust for
this.
Fixes: 65c9cc9e2c14 ("x86/fred: Reserve space for the FRED stack frame")
Fixes: 3adee777ad0d ("x86/smpboot: Remove initial_stack on 64-bit")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Closes: https://lore.kernel.org/oe-lkp/202402262159.183c2a37-lkp@intel.com
Link: https://lore.kernel.org/r/20240304083333.449322-1-xin@zytor.com
|
|
With the instruction decoder, we are now able to decode and recognize
instructions with opcode extensions. There are more instructions in
these groups that can be boosted:
Group 2: ROL, ROR, RCL, RCR, SHL/SAL, SHR, SAR
Group 3: TEST, NOT, NEG, MUL, IMUL, DIV, IDIV
Group 4: INC, DEC (byte operation)
Group 5: INC, DEC (word/doubleword/quadword operation)
These instructions are not boosted previously because there are reserved
opcodes within the groups, e.g., group 2 with ModR/M.nnn == 110 is
unmapped. As a result, kprobes attached to them requires two int3 traps
as being non-boostable also prevents jump-optimization.
Some simple tests on QEMU show that after boosting and jump-optimization
a single kprobe on these instructions with an empty pre-handler runs 10x
faster (~1000 cycles vs. ~100 cycles).
Since these instructions are mostly ALU operations and do not touch
special registers like RIP, let's boost them so that we get the
performance benefit.
Link: https://lore.kernel.org/all/20240204031300.830475-4-jinghao7@illinois.edu/
Signed-off-by: Jinghao Jia <jinghao7@illinois.edu>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Both INT (INT n, INT1, INT3, INTO) and UD (UD0, UD1, UD2) serve special
purposes in the kernel, e.g., INT3 is used by KGDB and UD2 is involved
in LLVM-KCFI instrumentation. At the same time, attaching kprobes on
these instructions (particularly UD) will pollute the stack trace dumped
in the kernel ring buffer, since the exception is triggered in the copy
buffer rather than the original location.
Check for INT and UD in can_probe and reject any kprobes trying to
attach to these instructions.
Link: https://lore.kernel.org/all/20240204031300.830475-3-jinghao7@illinois.edu/
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Jinghao Jia <jinghao7@illinois.edu>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Both can_probe and can_boost have int return type but are using int as
boolean in their context.
Refactor both functions to make them actually return boolean.
Link: https://lore.kernel.org/all/20240204031300.830475-2-jinghao7@illinois.edu/
Signed-off-by: Jinghao Jia <jinghao7@illinois.edu>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Borislav reported that one of his systems has a broken MADT table which
advertises eight present APICs and 24 non-present APICs in the same
package.
The non-present ones are considered hot-pluggable by the topology
evaluation code, which is obviously bogus as there is no way to hot-plug
within the same package.
As the topology evaluation code accounts for hot-pluggable CPUs in a
package, the maximum number of cores per package is computed wrong, which
in turn causes the uncore performance counter driver to access non-existing
MSRs. It will probably confuse other entities which rely on the maximum
number of cores and threads per package too.
Cure this by ignoring hot-pluggable APIC IDs within a present package.
In theory it would be reasonable to just do this unconditionally, but then
there is this thing called reality^Wvirtualization which ruins
everything. Virtualization is the only existing user of "physical" hotplug
and the virtualization tools allow the above scenario. Whether that is
actually in use or not is unknown.
As it can be argued that the virtualization case is not affected by the
issues which exposed the reported problem, allow the bogosity if the kernel
determined that it is running in a VM for now.
Fixes: 89b0f15f408f ("x86/cpu/topology: Get rid of cpuinfo::x86_max_cores")
Reported-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/87a5nbvccx.ffs@tglx
|
|
According to x86 spec ([1] and [2]), MWAIT hint_address[7:4] plus 1 is
the corresponding C-state, and 0xF means C0.
ACPI C-state table usually only contains C1+, but nothing prevents ACPI
firmware from presenting a C-state (maybe C1+) but using MWAIT address C0
(i.e., 0xF in ACPI FFH MWAIT hint address). And if this is the case, Linux
erroneously treat this cstate as C16, while actually this should be valid
C0 instead of C16, as per the specifications.
Since ACPI firmware is out of Linux kernel scope, fix the kernel handling
of 0xF ->(to) C0 in this situation. This is found when a tweaked ACPI
C-state table is presented by Qemu to VM.
Also modify the intel_idle case for code consistency.
[1]. Intel SDM Vol 2, Table 4-11. MWAIT Hints
Register (EAX): "Value of 0 means C1; 1 means C2 and so on
Value of 01111B means C0".
[2]. AMD manual Vol 3, MWAIT: "The processor C-state is EAX[7:4]+1, so to
request C0 is to place the value F in EAX[7:4] and to request C1 is to
place the value 0 in EAX[7:4].".
Signed-off-by: He Rongguang <herongguang@linux.alibaba.com>
[ rjw: Subject and changelog edits, whitespace fixups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
As there are some AMD processors which only support CPPC V2 firmware and
BIOS implementation, the amd_pstate driver will be failed to load when
system booting with below kernel warning message:
[ 0.477523] amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
To make the amd_pstate driver can be loaded on those TR40 processors, it
needs to match x86_model from 0x30 to 0x7F for family 17H.
With the change, the system can load amd_pstate driver as expected.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Reported-by: Gino Badouri <badouri.g@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218171
Fixes: fbd74d1689 ("ACPI: CPPC: Fix enabling CPPC on AMD systems with shared memory")
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
In preparation for implementing rigorous build time checks to enforce
that only code that can support it will be called from the early 1:1
mapping of memory, move SEV init code that is called in this manner to
the .head.text section.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-19-ardb+git@google.com
|
|
The secondary startup code is used on the primary boot path as well, but
in this case, the initial part runs from a 1:1 mapping, until an
explicit cross-jump is made to the kernel virtual mapping of the same
code.
On the secondary boot path, this jump is pointless as the code already
executes from the mapping targeted by the jump. So combine this
cross-jump with the jump from startup_64() into the common boot path.
This simplifies the execution flow, and clearly separates code that runs
from a 1:1 mapping from code that runs from the kernel virtual mapping.
Note that this requires a page table switch, so hoist the CR3 assignment
into startup_64() as well. And since absolute symbol references will no
longer be permitted in .head.text once we enable the associated build
time checks, a RIP-relative memory operand is used in the JMP
instruction, referring to an absolute constant in the .init.rodata
section.
Given that the secondary startup code does not require a special
placement inside the executable, move it to the .text section.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-15-ardb+git@google.com
|
|
Determining the address of the initial page table to program into CR3
involves:
- taking the physical address
- adding the SME encryption mask
On the primary entry path, the code is mapped using a 1:1 virtual to
physical translation, so the physical address can be taken directly
using a RIP-relative LEA instruction.
On the secondary entry path, the address can be obtained by taking the
offset from the virtual kernel base (__START_kernel_map) and adding the
physical kernel base.
This is implemented in a slightly confusing way, so clean this up.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-14-ardb+git@google.com
|
|
Assigning the 5-level paging related global variables from the earliest
C code using explicit references that use the 1:1 translation of memory
is unnecessary, as the startup code itself does not rely on them to
create the initial page tables, and this is all it should be doing. So
defer these assignments to the primary C entry code that executes via
the ordinary kernel virtual mapping.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-13-ardb+git@google.com
|
|
When paging is enabled, the CR4.PAE and CR4.LA57 control bits cannot be
changed, and so they can simply be preserved rather than reason about
whether or not they need to be set. CR4.MCE should be preserved unless
the kernel was built without CONFIG_X86_MCE, in which case it must be
cleared.
CR4.PSE should be set explicitly, regardless of whether or not it was
set before.
CR4.PGE is set explicitly, and then cleared and set again after
programming CR3 in order to flush TLB entries based on global
translations. This makes the first assignment redundant, and can
therefore be omitted. So clear PGE by omitting it from the preserve
mask, and set it again explicitly after switching to the new page
tables.
[ bp: Document the exact operation of CR4.PGE ]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-12-ardb+git@google.com
|
|
The idle routine selection is done on every CPU bringup operation and
has a guard in place which is effective after the first invocation,
which is a pointless exercise.
Invoke it once on the boot CPU and mark the related functions __init.
The guard check has to stay as xen_set_default_idle() runs early.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/87edcu6vaq.ffs@tglx
|
|
The return value is truly boolean. Make it so.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240229142248.518723854@linutronix.de
|
|
Updating the static call for x86_idle() from idle_setup() is
counter-intuitive.
Let select_idle_routine() handle it like the other idle choices, which
allows to simplify the idle selection later on.
While at it rewrite comments and return a proper error code and not -1.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240229142248.455616019@linutronix.de
|
|
Clean up the code to make it readable. No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240229142248.392017685@linutronix.de
|
|
amd_e400_idle(), the idle routine for AMD CPUs which are affected by
erratum 400 violates the RCU constraints by invoking tick_broadcast_enter()
and tick_broadcast_exit() after the core code has marked RCU non-idle. The
functions can end up in lockdep or tracing, which rightfully triggers a
RCU warning.
The core code provides now a static branch conditional invocation of the
broadcast functions.
Remove amd_e400_idle(), enforce default_idle() and enable the static branch
on affected CPUs to cure this.
[ bp: Fold in a fix for a IS_ENABLED() check fail missing a "CONFIG_"
prefix which tglx spotted. ]
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/877cim6sis.ffs@tglx
|
|
Sparse complains rightfully about the usage of EXPORT_SYMBOL_GPL() for per
CPU variables:
callthunks.c:346:20: sparse: warning: incorrect type in initializer (different address spaces)
callthunks.c:346:20: sparse: expected void const [noderef] __percpu *__vpp_verify
callthunks.c:346:20: sparse: got unsigned long long *
Use EXPORT_PER_CPU_SYMBOL_GPL() instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.841915535@linutronix.de
|
|
Sparse rightfully complains:
bugs.c:71:9: sparse: warning: incorrect type in initializer (different address spaces)
bugs.c:71:9: sparse: expected void const [noderef] __percpu *__vpp_verify
bugs.c:71:9: sparse: got unsigned long long *
The reason is that x86_spec_ctrl_current which is a per CPU variable is
exported with EXPORT_SYMBOL_GPL().
Use EXPORT_PER_CPU_SYMBOL_GPL() instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.732288812@linutronix.de
|
|
On UP builds Sparse complains rightfully about accesses to cpu_info with
per CPU accessors:
cacheinfo.c:282:30: sparse: warning: incorrect type in initializer (different address spaces)
cacheinfo.c:282:30: sparse: expected void const [noderef] __percpu *__vpp_verify
cacheinfo.c:282:30: sparse: got unsigned int *
The reason is that on UP builds cpu_info which is a per CPU variable on SMP
is mapped to boot_cpu_info which is a regular variable. There is a hideous
accessor cpu_data() which tries to hide this, but it's not sufficient as
some places require raw accessors and generates worse code than the regular
per CPU accessors.
Waste sizeof(struct x86_cpuinfo) memory on UP and provide the per CPU
cpu_info unconditionally. This requires to update the CPU info on the boot
CPU as SMP does. (Ab)use the weakly defined smp_prepare_boot_cpu() function
and implement exactly that.
This allows to use regular per CPU accessors uncoditionally and paves the
way to remove the cpu_data() hackery.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.622511517@linutronix.de
|
|
There is no point in having seven architectures implementing the same empty
stub.
Provide a weak function in the init code and remove the stubs.
This also allows to utilize the function on UP which is required to
sanitize the per CPU handling on X86 UP.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.567671691@linutronix.de
|
|
To clean up the per CPU insanity of UP which causes sparse to be rightfully
unhappy and prevents the usage of the generic per CPU accessors on cpu_info
it is necessary to include <linux/percpu.h> into <asm/msr.h>.
Including <linux/percpu.h> into <asm/msr.h> is impossible because it ends
up in header dependency hell. The problem is that <asm/processor.h>
includes <asm/msr.h>. The inclusion of <linux/percpu.h> results in a
compile fail where the compiler cannot longer handle an include in
<asm/cpufeature.h> which references boot_cpu_data which is
defined in <asm/processor.h>.
The only reason why <asm/msr.h> is included in <asm/processor.h> are the
set/get_debugctlmsr() inlines. They are defined there because <asm/processor.h>
is such a nice dump ground for everything. In fact they belong obviously
into <asm/debugreg.h>.
Move them to <asm/debugreg.h> and fix up the resulting damage which is just
exposing the reliance on random include chains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.454678686@linutronix.de
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The HV_REGISTER_ are used as arguments to hv_set/get_register(), which
delegate to arch-specific mechanisms for getting/setting synthetic
Hyper-V MSRs.
On arm64, HV_REGISTER_ defines are synthetic VP registers accessed via
the get/set vp registers hypercalls. The naming matches the TLFS
document, although these register names are not specific to arm64.
However, on x86 the prefix HV_REGISTER_ indicates Hyper-V MSRs accessed
via rdmsrl()/wrmsrl(). This is not consistent with the TLFS doc, where
HV_REGISTER_ is *only* used for used for VP register names used by
the get/set register hypercalls.
To fix this inconsistency and prevent future confusion, change the
arch-generic aliases used by callers of hv_set/get_register() to have
the prefix HV_MSR_ instead of HV_REGISTER_.
Use the prefix HV_X64_MSR_ for the x86-only Hyper-V MSRs. On x86, the
generic HV_MSR_'s point to the corresponding HV_X64_MSR_.
Move the arm64 HV_REGISTER_* defines to the asm-generic hyperv-tlfs.h,
since these are not specific to arm64. On arm64, the generic HV_MSR_'s
point to the corresponding HV_REGISTER_.
While at it, rename hv_get/set_registers() and related functions to
hv_get/set_msr(), hv_get/set_nested_msr(), etc. These are only used for
Hyper-V MSRs and this naming makes that clear.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1708440933-27125-1-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1708440933-27125-1-git-send-email-nunodasneves@linux.microsoft.com>
|
|
SETUP_RNG_SEED in setup_data is supplied by kexec and should
not be reserved in the e820 map.
Doing so reserves 16 bytes of RAM when booting with kexec.
(16 bytes because data->len is zeroed by parse_setup_data so only
sizeof(setup_data) is reserved.)
When kexec is used repeatedly, each boot adds two entries in the
kexec-provided e820 map as the 16-byte range splits a larger
range of usable memory. Eventually all of the 128 available entries
get used up. The next split will result in losing usable memory
as the new entries cannot be added to the e820 map.
Fixes: 68b8e9713c8e ("x86/setup: Use rng seeds from setup_data")
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/ZbmOjKnARGiaYBd5@dwarf.suse.cz
|
|
x86_64 zero extends 32-bit operations, so for 64-bit operands,
XORL r32,r32 is functionally equal to XORQ r64,r64, but avoids
a REX prefix byte when legacy registers are used.
Slightly smaller code generated, no change in functionality.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240124103859.611372-1-ubizjak@gmail.com
|
|
It is, and will be even more useful in the future, to dump the SEV
features enabled according to SEV_STATUS. Do so:
[ 0.542753] Memory Encryption Features active: AMD SEV SEV-ES SEV-SNP
[ 0.544425] SEV: Status: SEV SEV-ES SEV-SNP DebugSwap
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20240219094216.GAZdMieDHKiI8aaP3n@fat_crate.local
|
|
startup_gdt[]
Instead of loading a duplicate GDT just for early boot, load the kernel
GDT from its physical address.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20240226220544.70769-1-brgerst@gmail.com
|
|
IS_ENABLED(CONFIG_SMP) is unnecessary here: smp_processor_id() should
always return zero on UP, and arch_cpu_is_offline() reduces to
!(cpu == 0), so this is a statically false condition on UP.
Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201094604.3918141-1-xin@zytor.com
|
|
Conflicts:
arch/x86/kernel/cpu/common.c
arch/x86/kernel/cpu/intel.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
These functions are mostly pointless on UP, but nevertheless the
64-bit UP APIC build already depends on the existence of
topology_apply_cmdline_limits_early(), which caused a build bug,
resolve it by making them available under CONFIG_X86_LOCAL_APIC,
as their prototypes already are.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
In commit c1d171a00294 ("x86: randomize brk"), arch_randomize_brk() was
defined to use a 32MB range (13 bits of entropy), but was never increased
when moving to 64-bit. The default arch_randomize_brk() uses 32MB for
32-bit tasks, and 1GB (18 bits of entropy) for 64-bit tasks.
Update x86_64 to match the entropy used by arm64 and other 64-bit
architectures.
Reported-by: y0un9n132@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jiri Kosina <jkosina@suse.com>
Closes: https://lore.kernel.org/linux-hardening/CA+2EKTVLvc8hDZc+2Yhwmus=dzOUG5E4gV7ayCbu0MPJTZzWkw@mail.gmail.com/
Link: https://lore.kernel.org/r/20240217062545.1631668-1-keescook@chromium.org
|
|
The vDSO (and its initial randomization) was introduced in commit 2aae950b21e4
("x86_64: Add vDSO for x86-64 with gettimeofday/clock_gettime/getcpu"), but
had very low entropy. The entropy was improved in commit 394f56fe4801
("x86_64, vdso: Fix the vdso address randomization algorithm"), but there
is still improvement to be made.
In principle there should not be executable code at a low entropy offset
from the stack, since the stack and executable code having separate
randomization is part of what makes ASLR stronger.
Remove the only executable code near the stack region and give the vDSO
the same randomized base as other mmap mappings including the linker
and other shared objects. This results in higher entropy being provided
and there's little to no advantage in separating this from the existing
executable code there. This is already how other architectures like
arm64 handle the vDSO.
As an side, while it's sensible for userspace to reserve the initial mmap
base as a region for executable code with a random gap for other mmap
allocations, along with providing randomization within that region, there
isn't much the kernel can do to help due to how dynamic linkers load the
shared objects.
This was extracted from the PaX RANDMMAP feature.
[kees: updated commit log with historical details and other tweaks]
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Closes: https://github.com/KSPP/linux/issues/280
Link: https://lore.kernel.org/r/20240210091827.work.233-kees@kernel.org
|
|
Commit 344da544f177 ("x86/nmi: Print reasons why backtrace NMIs are
ignored") creates a super nice framework to diagnose NMIs.
Every time nmi_exc() is called, it increments a per_cpu counter
(nsp->idt_nmi_seq). At its exit, it also increments the same counter. By
reading this counter it can be seen how many times that function was called
(dividing by 2), and, if the function is still being executed, by checking
the idt_nmi_seq's least significant bit.
On the check side (nmi_backtrace_stall_check()), that variable is queried
to check if the NMI is still being executed, but, there is a mistake in the
bitwise operation. That code wants to check if the least significant bit of
the idt_nmi_seq is set or not, but does the opposite, and checks for all
the other bits, which will always be true after the first exc_nmi()
executed successfully.
This appends the misleading string to the dump "(CPU currently in NMI
handler function)"
Fix it by checking the least significant bit, and if it is set, append the
string.
Fixes: 344da544f177 ("x86/nmi: Print reasons why backtrace NMIs are ignored")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240207165237.1048837-1-leitao@debian.org
|
|
MKTME repurposes the high bit of physical address to key id for encryption
key and, even though MAXPHYADDR in CPUID[0x80000008] remains the same,
the valid bits in the MTRR mask register are based on the reduced number
of physical address bits.
detect_tme() in arch/x86/kernel/cpu/intel.c detects TME and subtracts
it from the total usable physical bits, but it is called too late.
Move the call to early_init_intel() so that it is called in setup_arch(),
before MTRRs are setup.
This fixes boot on TDX-enabled systems, which until now only worked with
"disable_mtrr_cleanup". Without the patch, the values written to the
MTRRs mask registers were 52-bit wide (e.g. 0x000fffff_80000800) and
the writes failed; with the patch, the values are 46-bit wide, which
matches the reduced MAXPHYADDR that is shown in /proc/cpuinfo.
Reported-by: Zixi Chen <zixchen@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240131230902.1867092-3-pbonzini%40redhat.com
|
|
In commit fbf6449f84bf ("x86/sev-es: Set x86_virt_bits to the correct
value straight away, instead of a two-phase approach"), the initialization
of c->x86_phys_bits was moved after this_cpu->c_early_init(c). This is
incorrect because early_init_amd() expected to be able to reduce the
value according to the contents of CPUID leaf 0x8000001f.
Fortunately, the bug was negated by init_amd()'s call to early_init_amd(),
which does reduce x86_phys_bits in the end. However, this is very
late in the boot process and, most notably, the wrong value is used for
x86_phys_bits when setting up MTRRs.
To fix this, call get_cpu_address_sizes() as soon as X86_FEATURE_CPUID is
set/cleared, and c->extended_cpuid_level is retrieved.
Fixes: fbf6449f84bf ("x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240131230902.1867092-2-pbonzini%40redhat.com
|
|
early_top_pgt[] is assigned from code that executes from a 1:1 mapping
so it cannot use a plain access from C. Replace the use of
fixup_pointer() with RIP_REL_REF(), which is better and simpler.
For legibility and to align with the code that populates the lower page
table levels, statically initialize the root level page table with an
entry pointing to level3_kernel_pgt[], and overwrite it when needed to
enable 5-level paging.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-24-ardb+git@google.com
|
|
The early statically allocated page tables are populated from code that
executes from a 1:1 mapping so it cannot use plain accesses from C.
Replace the use of fixup_pointer() with RIP_REL_REF(), which is better
and simpler.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-23-ardb+git@google.com
|
|
'__supported_pte_mask' is accessed from code that executes from a 1:1
mapping so it cannot use a plain access from C. Replace the use of
fixup_pointer() with RIP_REL_REF(), which is better and simpler.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-22-ardb+git@google.com
|