summaryrefslogtreecommitdiff
path: root/arch/x86/include
AgeCommit message (Collapse)AuthorFilesLines
2023-03-22virt/coco/sev-guest: Add throttling awarenessDionna Glaze1-1/+2
commit 72f7754dcf31c87c92c0c353dcf747814cc5ce10 upstream. A potentially malicious SEV guest can constantly hammer the hypervisor using this driver to send down requests and thus prevent or at least considerably hinder other guests from issuing requests to the secure processor which is a shared platform resource. Therefore, the host is permitted and encouraged to throttle such guest requests. Add the capability to handle the case when the hypervisor throttles excessive numbers of requests issued by the guest. Otherwise, the VM platform communication key will be disabled, preventing the guest from attesting itself. Realistically speaking, a well-behaved guest should not even care about throttling. During its lifetime, it would end up issuing a handful of requests which the hardware can easily handle. This is more to address the case of a malicious guest. Such guest should get throttled and if its VMPCK gets disabled, then that's its own wrongdoing and perhaps that guest even deserves it. To the implementation: the hypervisor signals with SNP_GUEST_REQ_ERR_BUSY that the guest requests should be throttled. That error code is returned in the upper 32-bit half of exitinfo2 and this is part of the GHCB spec v2. So the guest is given a throttling period of 1 minute in which it retries the request every 2 seconds. This is a good default but if it turns out to not pan out in practice, it can be tweaked later. For safety, since the encryption algorithm in GHCBv2 is AES_GCM, control must remain in the kernel to complete the request with the current sequence number. Returning without finishing the request allows the guest to make another request but with different message contents. This is IV reuse, and breaks cryptographic protections. [ bp: - Rewrite commit message and do a simplified version. - The stable tags are supposed to denote that a cleanup should go upfront before backporting this so that any future fixes to this can preserve the sanity of the backporter(s). ] Fixes: d5af44dde546 ("x86/sev: Provide support for SNP guest request NAEs") Signed-off-by: Dionna Glaze <dionnaglaze@google.com> Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@kernel.org> # d6fd48eff750 ("virt/coco/sev-guest: Check SEV_SNP attribute at probe time") Cc: <stable@kernel.org> # 970ab823743f (" virt/coco/sev-guest: Simplify extended guest request handling") Cc: <stable@kernel.org> # c5a338274bdb ("virt/coco/sev-guest: Remove the disable_vmpck label in handle_guest_request()") Cc: <stable@kernel.org> # 0fdb6cc7c89c ("virt/coco/sev-guest: Carve out the request issuing logic into a helper") Cc: <stable@kernel.org> # d25bae7dc7b0 ("virt/coco/sev-guest: Do some code style cleanups") Cc: <stable@kernel.org> # fa4ae42cc60a ("virt/coco/sev-guest: Convert the sw_exit_info_2 checking to a switch-case") Link: https://lore.kernel.org/r/20230214164638.1189804-2-dionnaglaze@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-22KVM: SVM: Fix a benign off-by-one bug in AVIC physical table maskSean Christopherson1-5/+7
commit 3ec7a1b2743c07c45f4a0c508114f6cb410ddef3 upstream. Define the "physical table max index mask" as bits 8:0, not 9:0. x2AVIC currently supports a max of 512 entries, i.e. the max index is 511, and the inputs to GENMASK_ULL() are inclusive. The bug is benign as bit 9 is reserved and never set by KVM, i.e. KVM is just clearing bits that are guaranteed to be zero. Note, as of this writing, APM "Rev. 3.39-October 2022" incorrectly states that bits 11:8 are reserved in Table B-1. VMCB Layout, Control Area. I.e. that table wasn't updated when x2AVIC support was added. Opportunistically fix the comment for the max AVIC ID to align with the code, and clean up comment formatting too. Fixes: 4d1d7942e36a ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode") Cc: stable@vger.kernel.org Cc: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <20230207002156.521736-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17KVM: x86: Move guts of kvm_arch_init() to standalone helperSean Christopherson1-0/+3
[ Upstream commit 4f8396b96a9fc672964842fe7adbe8ddca8a3adf ] Move the guts of kvm_arch_init() to a new helper, kvm_x86_vendor_init(), so that VMX can do _all_ arch and vendor initialization before calling kvm_init(). Calling kvm_init() must be the _very_ last step during init, as kvm_init() exposes /dev/kvm to userspace, i.e. allows creating VMs. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Stable-dep-of: e32b120071ea ("KVM: VMX: Do _all_ initialization before exposing /dev/kvm to userspace") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11x86/resctl: fix scheduler confusion with 'current'Linus Torvalds1-6/+6
commit 7fef099702527c3b2c5234a2ea6a24411485a13a upstream. The implementation of 'current' on x86 is very intentionally special: it is a very common thing to look up, and it uses 'this_cpu_read_stable()' to get the current thread pointer efficiently from per-cpu storage. And the keyword in there is 'stable': the current thread pointer never changes as far as a single thread is concerned. Even if when a thread is preempted, or moved to another CPU, or even across an explicit call 'schedule()' that thread will still have the same value for 'current'. It is, after all, the kernel base pointer to thread-local storage. That's why it's stable to begin with, but it's also why it's important enough that we have that special 'this_cpu_read_stable()' access for it. So this is all done very intentionally to allow the compiler to treat 'current' as a value that never visibly changes, so that the compiler can do CSE and combine multiple different 'current' accesses into one. However, there is obviously one very special situation when the currently running thread does actually change: inside the scheduler itself. So the scheduler code paths are special, and do not have a 'current' thread at all. Instead there are _two_ threads: the previous and the next thread - typically called 'prev' and 'next' (or prev_p/next_p) internally. So this is all actually quite straightforward and simple, and not all that complicated. Except for when you then have special code that is run in scheduler context, that code then has to be aware that 'current' isn't really a valid thing. Did you mean 'prev'? Did you mean 'next'? In fact, even if then look at the code, and you use 'current' after the new value has been assigned to the percpu variable, we have explicitly told the compiler that 'current' is magical and always stable. So the compiler is quite free to use an older (or newer) value of 'current', and the actual assignment to the percpu storage is not relevant even if it might look that way. Which is exactly what happened in the resctl code, that blithely used 'current' in '__resctrl_sched_in()' when it really wanted the new process state (as implied by the name: we're scheduling 'into' that new resctl state). And clang would end up just using the old thread pointer value at least in some configurations. This could have happened with gcc too, and purely depends on random compiler details. Clang just seems to have been more aggressive about moving the read of the per-cpu current_task pointer around. The fix is trivial: just make the resctl code adhere to the scheduler rules of using the prev/next thread pointer explicitly, instead of using 'current' in a situation where it just wasn't valid. That same code is then also used outside of the scheduler context (when a thread resctl state is explicitly changed), and then we will just pass in 'current' as that pointer, of course. There is no ambiguity in that case. The fix may be trivial, but noticing and figuring out what went wrong was not. The credit for that goes to Stephane Eranian. Reported-by: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/ Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/ Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: Stephane Eranian <eranian@google.com> Tested-by: Babu Moger <babu.moger@amd.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-10x86/microcode/AMD: Add a @cpu parameter to the reloading functionsBorislav Petkov (AMD)2-4/+4
commit a5ad92134bd153a9ccdcddf09a95b088f36c3cce upstream. Will be used in a subsequent change. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230130161709.11615-3-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-10x86/crash: Disable virt in core NMI crash handler to avoid double shootdownSean Christopherson1-0/+2
commit 26044aff37a5455b19a91785086914fd33053ef4 upstream. Disable virtualization in crash_nmi_callback() and rework the emergency_vmx_disable_all() path to do an NMI shootdown if and only if a shootdown has not already occurred. NMI crash shootdown fundamentally can't support multiple invocations as responding CPUs are deliberately put into halt state without unblocking NMIs. But, the emergency reboot path doesn't have any work of its own, it simply cares about disabling virtualization, i.e. so long as a shootdown occurred, emergency reboot doesn't care who initiated the shootdown, or when. If "crash_kexec_post_notifiers" is specified on the kernel command line, panic() will invoke crash_smp_send_stop() and result in a second call to nmi_shootdown_cpus() during native_machine_emergency_restart(). Invoke the callback _before_ disabling virtualization, as the current VMCS needs to be cleared before doing VMXOFF. Note, this results in a subtle change in ordering between disabling virtualization and stopping Intel PT on the responding CPUs. While VMX and Intel PT do interact, VMXOFF and writes to MSR_IA32_RTIT_CTL do not induce faults between one another, which is all that matters when panicking. Harden nmi_shootdown_cpus() against multiple invocations to try and capture any such kernel bugs via a WARN instead of hanging the system during a crash/dump, e.g. prior to the recent hardening of register_nmi_handler(), re-registering the NMI handler would trigger a double list_add() and hang the system if CONFIG_BUG_ON_DATA_CORRUPTION=y. list_add double add: new=ffffffff82220800, prev=ffffffff8221cfe8, next=ffffffff82220800. WARNING: CPU: 2 PID: 1319 at lib/list_debug.c:29 __list_add_valid+0x67/0x70 Call Trace: __register_nmi_handler+0xcf/0x130 nmi_shootdown_cpus+0x39/0x90 native_machine_emergency_restart+0x1c9/0x1d0 panic+0x237/0x29b Extract the disabling logic to a common helper to deduplicate code, and to prepare for doing the shootdown in the emergency reboot path if SVM is supported. Note, prior to commit ed72736183c4 ("x86/reboot: Force all cpus to exit VMX root if VMX is supported"), nmi_shootdown_cpus() was subtly protected against a second invocation by a cpu_vmx_enabled() check as the kdump handler would disable VMX if it ran first. Fixes: ed72736183c4 ("x86/reboot: Force all cpus to exit VMX root if VMX is supported") Cc: stable@vger.kernel.org Reported-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/all/20220427224924.592546-2-gpiccoli@igalia.com Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20221130233650.1404148-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-10x86/virt: Force GIF=1 prior to disabling SVM (for reboot flows)Sean Christopherson1-1/+15
commit 6a3236580b0b1accc3976345e723104f74f6f8e6 upstream. Set GIF=1 prior to disabling SVM to ensure that INIT is recognized if the kernel is disabling SVM in an emergency, e.g. if the kernel is about to jump into a crash kernel or may reboot without doing a full CPU RESET. If GIF is left cleared, the new kernel (or firmware) will be unabled to awaken APs. Eat faults on STGI (due to EFER.SVME=0) as it's possible that SVM could be disabled via NMI shootdown between reading EFER.SVME and executing STGI. Link: https://lore.kernel.org/all/cbcb6f35-e5d7-c1c9-4db9-fe5cc4de579a@amd.com Cc: stable@vger.kernel.org Cc: Andrew Cooper <Andrew.Cooper3@citrix.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20221130233650.1404148-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-10x86/bugs: Reset speculation control settings on initBreno Leitao1-0/+4
[ Upstream commit 0125acda7d76b943ca55811df40ed6ec0ecf670f ] Currently, x86_spec_ctrl_base is read at boot time and speculative bits are set if Kconfig items are enabled. For example, IBRS is enabled if CONFIG_CPU_IBRS_ENTRY is configured, etc. These MSR bits are not cleared if the mitigations are disabled. This is a problem when kexec-ing a kernel that has the mitigation disabled from a kernel that has the mitigation enabled. In this case, the MSR bits are not cleared during the new kernel boot. As a result, this might have some performance degradation that is hard to pinpoint. This problem does not happen if the machine is (hard) rebooted because the bit will be cleared by default. [ bp: Massage. ] Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20221128153148.1129350-1-leitao@debian.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10x86/fpu: Don't set TIF_NEED_FPU_LOAD for PF_IO_WORKER threadsJens Axboe1-1/+1
[ Upstream commit cb3ea4b7671b7cfbac3ee609976b790aebd0bbda ] We don't set it on PF_KTHREAD threads as they never return to userspace, and PF_IO_WORKER threads are identical in that regard. As they keep running in the kernel until they die, skip setting the FPU flag on them. More of a cosmetic thing that was found while debugging and issue and pondering why the FPU flag is set on these threads. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/560c844c-f128-555b-40c6-31baff27537f@kernel.dk Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10cpuidle, intel_idle: Fix CPUIDLE_FLAG_INIT_XSTATEPeter Zijlstra2-3/+3
[ Upstream commit 821ad23d0eaff73ef599ece39ecc77482df20a8c ] Fix instrumentation bugs objtool found: vmlinux.o: warning: objtool: intel_idle_s2idle+0xd5: call to fpu_idle_fpregs() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle_xstate+0x11: call to fpu_idle_fpregs() leaves .noinstr.text section vmlinux.o: warning: objtool: fpu_idle_fpregs+0x9: call to xfeatures_in_use() leaves .noinstr.text section Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230112195540.494977795@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10x86/microcode: Check CPU capabilities after late microcode update correctlyAshok Raj1-0/+1
[ Upstream commit c0dd9245aa9e25a697181f6085692272c9ec61bc ] The kernel caches each CPU's feature bits at boot in an x86_capability[] structure. However, the capabilities in the BSP's copy can be turned off as a result of certain command line parameters or configuration restrictions, for example the SGX bit. This can cause a mismatch when comparing the values before and after the microcode update. Another example is X86_FEATURE_SRBDS_CTRL which gets added only after microcode update: # --- cpuid.before 2023-01-21 14:54:15.652000747 +0100 # +++ cpuid.after 2023-01-21 14:54:26.632001024 +0100 # @@ -10,7 +10,7 @@ CPU: # 0x00000004 0x04: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000 # 0x00000005 0x00: eax=0x00000040 ebx=0x00000040 ecx=0x00000003 edx=0x11142120 # 0x00000006 0x00: eax=0x000027f7 ebx=0x00000002 ecx=0x00000001 edx=0x00000000 # - 0x00000007 0x00: eax=0x00000000 ebx=0x029c6fbf ecx=0x40000000 edx=0xbc002400 # + 0x00000007 0x00: eax=0x00000000 ebx=0x029c6fbf ecx=0x40000000 edx=0xbc002e00 ^^^ and which proves for a gazillionth time that late loading is a bad bad idea. microcode_check() is called after an update to report any previously cached CPUID bits which might have changed due to the update. Therefore, store the cached CPU caps before the update and compare them with the CPU caps after the microcode update has succeeded. Thus, the comparison is done between the CPUID *hardware* bits before and after the upgrade instead of using the cached, possibly runtime modified values in BSP's boot_cpu_data copy. As a result, false warnings about CPUID bits changes are avoided. [ bp: - Massage. - Add SRBDS_CTRL example. - Add kernel-doc. - Incorporate forgotten review feedback from dhansen. ] Fixes: 1008c52c09dc ("x86/CPU: Add a microcode loader callback") Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230109153555.4986-3-ashok.raj@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10x86/microcode: Add a parameter to microcode_check() to store CPU capabilitiesAshok Raj1-1/+1
[ Upstream commit ab31c74455c64e69342ddab21fd9426fcbfefde7 ] Add a parameter to store CPU capabilities before performing a microcode update so that CPU capabilities can be compared before and after update. [ bp: Massage. ] Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230109153555.4986-2-ashok.raj@intel.com Stable-dep-of: c0dd9245aa9e ("x86/microcode: Check CPU capabilities after late microcode update correctly") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-25x86/alternatives: Introduce int3_emulate_jcc()Peter Zijlstra1-0/+31
commit db7adcfd1cec4e95155e37bc066fddab302c6340 upstream. Move the kprobe Jcc emulation into int3_emulate_jcc() so it can be used by more code -- specifically static_call() will need this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20230123210607.057678245@infradead.org Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-14Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-0/+1
Pull kvm fixes from Paolo Bonzini: "Certain AMD processors are vulnerable to a cross-thread return address predictions bug. When running in SMT mode and one of the sibling threads transitions out of C0 state, the other thread gets access to twice as many entries in the RSB, but unfortunately the predictions of the now-halted logical processor are not purged. Therefore, the executing processor could speculatively execute from locations that the now-halted processor had trained the RSB on. The Spectre v2 mitigations cover the Linux kernel, as it fills the RSB when context switching to the idle thread. However, KVM allows a VMM to prevent exiting guest mode when transitioning out of C0 using the KVM_CAP_X86_DISABLE_EXITS capability can be used by a VMM to change this behavior. To mitigate the cross-thread return address predictions bug, a VMM must not be allowed to override the default behavior to intercept C0 transitions. These patches introduce a KVM module parameter that, if set, will prevent the user from disabling the HLT, MWAIT and CSTATE exits" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: Documentation/hw-vuln: Add documentation for Cross-Thread Return Predictions KVM: x86: Mitigate the cross-thread return address predictions bug x86/speculation: Identify processors vulnerable to SMT RSB predictions
2023-02-10x86/speculation: Identify processors vulnerable to SMT RSB predictionsTom Lendacky1-0/+1
Certain AMD processors are vulnerable to a cross-thread return address predictions bug. When running in SMT mode and one of the sibling threads transitions out of C0 state, the other sibling thread could use return target predictions from the sibling thread that transitioned out of C0. The Spectre v2 mitigations cover the Linux kernel, as it fills the RSB when context switching to the idle thread. However, KVM allows a VMM to prevent exiting guest mode when transitioning out of C0. A guest could act maliciously in this situation, so create a new x86 BUG that can be used to detect if the processor is vulnerable. Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <91cec885656ca1fcd4f0185ce403a53dd9edecb7.1675956146.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-02-08x86/cpu: Add Lunar Lake MKan Liang1-0/+2
Intel confirmed the existence of this CPU in Q4'2022 earnings presentation. Add the CPU model number. [ dhansen: Merging these as soon as possible makes it easier on all the folks developing model-specific features. ] Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230208172340.158548-1-tony.luck%40intel.com
2023-01-31x86/debug: Fix stack recursion caused by wrongly ordered DR7 accessesJoerg Roedel1-2/+24
In kernels compiled with CONFIG_PARAVIRT=n, the compiler re-orders the DR7 read in exc_nmi() to happen before the call to sev_es_ist_enter(). This is problematic when running as an SEV-ES guest because in this environment the DR7 read might cause a #VC exception, and taking #VC exceptions is not safe in exc_nmi() before sev_es_ist_enter() has run. The result is stack recursion if the NMI was caused on the #VC IST stack, because a subsequent #VC exception in the NMI handler will overwrite the stack frame of the interrupted #VC handler. As there are no compiler barriers affecting the ordering of DR7 reads/writes, make the accesses to this register volatile, forbidding the compiler to re-order them. [ bp: Massage text, make them volatile too, to make sure some aggressive compiler optimization pass doesn't discard them. ] Fixes: 315562c9af3d ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler") Reported-by: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230127035616.508966-1-aik@amd.com
2023-01-20acpi: Fix suspend with Xen PVJuergen Gross1-0/+8
Commit f1e525009493 ("x86/boot: Skip realmode init code when running as Xen PV guest") missed one code path accessing real_mode_header, leading to dereferencing NULL when suspending the system under Xen: [ 348.284004] PM: suspend entry (deep) [ 348.289532] Filesystems sync: 0.005 seconds [ 348.291545] Freezing user space processes ... (elapsed 0.000 seconds) done. [ 348.292457] OOM killer disabled. [ 348.292462] Freezing remaining freezable tasks ... (elapsed 0.104 seconds) done. [ 348.396612] printk: Suspending console(s) (use no_console_suspend to debug) [ 348.749228] PM: suspend devices took 0.352 seconds [ 348.769713] ACPI: EC: interrupt blocked [ 348.816077] BUG: kernel NULL pointer dereference, address: 000000000000001c [ 348.816080] #PF: supervisor read access in kernel mode [ 348.816081] #PF: error_code(0x0000) - not-present page [ 348.816083] PGD 0 P4D 0 [ 348.816086] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 348.816089] CPU: 0 PID: 6764 Comm: systemd-sleep Not tainted 6.1.3-1.fc32.qubes.x86_64 #1 [ 348.816092] Hardware name: Star Labs StarBook/StarBook, BIOS 8.01 07/03/2022 [ 348.816093] RIP: e030:acpi_get_wakeup_address+0xc/0x20 Fix that by adding an optional acpi callback allowing to skip setting the wakeup address, as in the Xen PV case this will be handled by the hypervisor anyway. Fixes: f1e525009493 ("x86/boot: Skip realmode init code when running as Xen PV guest") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/all/20230117155724.22940-1-jgross%40suse.com
2023-01-19x86/sev: Add SEV-SNP guest feature negotiation supportNikunj A Dadhania2-0/+26
The hypervisor can enable various new features (SEV_FEATURES[1:63]) and start a SNP guest. Some of these features need guest side implementation. If any of these features are enabled without it, the behavior of the SNP guest will be undefined. It may fail booting in a non-obvious way making it difficult to debug. Instead of allowing the guest to continue and have it fail randomly later, detect this early and fail gracefully. The SEV_STATUS MSR indicates features which the hypervisor has enabled. While booting, SNP guests should ascertain that all the enabled features have guest side implementation. In case a feature is not implemented in the guest, the guest terminates booting with GHCB protocol Non-Automatic Exit(NAE) termination request event, see "SEV-ES Guest-Hypervisor Communication Block Standardization" document (currently at https://developer.amd.com/wp-content/resources/56421.pdf), section "Termination Request". Populate SW_EXITINFO2 with mask of unsupported features that the hypervisor can easily report to the user. More details in the AMD64 APM Vol 2, Section "SEV_STATUS MSR". [ bp: - Massage. - Move snp_check_features() call to C code. Note: the CC:stable@ aspect here is to be able to protect older, stable kernels when running on newer hypervisors. Or not "running" but fail reliably and in a well-defined manner instead of randomly. ] Fixes: cbd3d4f7c4e5 ("x86/sev: Check SEV-SNP features support") Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230118061943.534309-1-nikunj@amd.com
2023-01-12KVM: x86/xen: Avoid deadlock by adding kvm->arch.xen.xen_lock leaf node lockDavid Woodhouse1-0/+1
In commit 14243b387137a ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery") the clever version of me left some helpful notes for those who would come after him: /* * For the irqfd workqueue, using the main kvm->lock mutex is * fine since this function is invoked from kvm_set_irq() with * no other lock held, no srcu. In future if it will be called * directly from a vCPU thread (e.g. on hypercall for an IPI) * then it may need to switch to using a leaf-node mutex for * serializing the shared_info mapping. */ mutex_lock(&kvm->lock); In commit 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") the other version of me ran straight past that comment without reading it, and introduced a potential deadlock by taking vcpu->mutex and kvm->lock in the wrong order. Solve this as originally suggested, by adding a leaf-node lock in the Xen state rather than using kvm->lock for it. Fixes: 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20230111180651.14394-4-dwmw2@infradead.org> [Rebase, add docs. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-03x86/insn: Avoid namespace clash by separating instruction decoder MMIO type ↵Jason A. Donenfeld1-9/+9
from MMIO trace type Both <linux/mmiotrace.h> and <asm/insn-eval.h> define various MMIO_ enum constants, whose namespace overlaps. Rename the <asm/insn-eval.h> ones to have a INSN_ prefix, so that the headers can be used from the same source file. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230101162910.710293-2-Jason@zx2c4.com
2022-12-17Merge tag 'x86_mm_for_6.2_v2' of ↵Linus Torvalds12-188/+60
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Dave Hansen: "New Feature: - Randomize the per-cpu entry areas Cleanups: - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it - Move to "native" set_memory_rox() helper - Clean up pmd_get_atomic() and i386-PAE - Remove some unused page table size macros" * tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) x86/mm: Ensure forced page table splitting x86/kasan: Populate shadow for shared chunk of the CPU entry area x86/kasan: Add helpers to align shadow addresses up and down x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area x86/mm: Recompute physical address for every page of per-CPU CEA mapping x86/mm: Rename __change_page_attr_set_clr(.checkalias) x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias() x86/mm: Untangle __change_page_attr_set_clr(.checkalias) x86/mm: Add a few comments x86/mm: Fix CR3_ADDR_MASK x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros mm: Convert __HAVE_ARCH_P..P_GET to the new style mm: Remove pointless barrier() after pmdp_get_lockless() x86/mm/pae: Get rid of set_64bit() x86_64: Remove pointless set_64bit() usage x86/mm/pae: Be consistent with pXXp_get_and_clear() x86/mm/pae: Use WRITE_ONCE() x86/mm/pae: Don't (ab)use atomic64 mm/gup: Fix the lockless PMD access ...
2022-12-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds7-84/+224
Pull kvm updates from Paolo Bonzini: "ARM64: - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a97d: "Fix a number of issues with MTE, such as races on the tags being initialised vs the PG_mte_tagged flag as well as the lack of support for VM_SHARED when KVM is involved. Patches from Catalin Marinas and Peter Collingbourne"). - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the no-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. s390: - Second batch of the lazy destroy patches - First batch of KVM changes for kernel virtual != physical address support - Removal of a unused function x86: - Allow compiling out SMM support - Cleanup and documentation of SMM state save area format - Preserve interrupt shadow in SMM state save area - Respond to generic signals during slow page faults - Fixes and optimizations for the non-executable huge page errata fix. - Reprogram all performance counters on PMU filter change - Cleanups to Hyper-V emulation and tests - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest running on top of a L1 Hyper-V hypervisor) - Advertise several new Intel features - x86 Xen-for-KVM: - Allow the Xen runstate information to cross a page boundary - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured - Add support for 32-bit guests in SCHEDOP_poll - Notable x86 fixes and cleanups: - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0). - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02. - Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64. - Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID. - Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency. - Advertise (on AMD) that the SMM_CTL MSR is not supported - Remove unnecessary exports Generic: - Support for responding to signals during page faults; introduces new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks Selftests: - Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal. - Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message. - Introduce actual atomics for clear/set_bit() in selftests - Add support for pinning vCPUs in dirty_log_perf_test. - Rename the so called "perf_util" framework to "memstress". - Add a lightweight psuedo RNG for guest use, and use it to randomize the access pattern and write vs. read percentage in the memstress tests. - Add a common ucall implementation; code dedup and pre-work for running SEV (and beyond) guests in selftests. - Provide a common constructor and arch hook, which will eventually be used by x86 to automatically select the right hypercall (AMD vs. Intel). - A bunch of added/enabled/fixed selftests for ARM64, covering memslots, breakpoints, stage-2 faults and access tracking. - x86-specific selftest changes: - Clean up x86's page table management. - Clean up and enhance the "smaller maxphyaddr" test, and add a related test to cover generic emulation failure. - Clean up the nEPT support checks. - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values. - Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that tiggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl(). Documentation: - Remove deleted ioctls from documentation - Clean up the docs for the x86 MSR filter. - Various fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits) KVM: x86: Add proper ReST tables for userspace MSR exits/flags KVM: selftests: Allocate ucall pool from MEM_REGION_DATA KVM: arm64: selftests: Align VA space allocator with TTBR0 KVM: arm64: Fix benign bug with incorrect use of VA_BITS KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow KVM: x86: Advertise that the SMM_CTL MSR is not supported KVM: x86: remove unnecessary exports KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic" tools: KVM: selftests: Convert clear/set_bit() to actual atomics tools: Drop "atomic_" prefix from atomic test_and_set_bit() tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests perf tools: Use dedicated non-atomic clear/set bit helpers tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers KVM: arm64: selftests: Enable single-step without a "full" ucall() KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself KVM: Remove stale comment about KVM_REQ_UNHALT KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR KVM: Reference to kvm_userspace_memory_region in doc and comments KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl ...
2022-12-15x86/mm: Fix CR3_ADDR_MASKKirill A. Shutemov1-1/+1
The mask must not include bits above physical address mask. These bits are reserved and can be used for other things. Bits 61 and 62 are used for Linear Address Masking. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20221109165140.9137-2-kirill.shutemov%40linux.intel.com
2022-12-15x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macrosPasha Tatashin1-9/+3
Other architectures and the common mm/ use P*D_MASK, and P*D_SIZE. Remove the duplicated P*D_PAGE_MASK and P*D_PAGE_SIZE which are only used in x86/*. Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/20220516185202.604654-1-tatashin@google.com
2022-12-15x86/mm/pae: Get rid of set_64bit()Peter Zijlstra2-39/+12
Recognise that set_64bit() is a special case of our previously introduced pxx_xchg64(), so use that and get rid of set_64bit(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114425.233481884%40infradead.org
2022-12-15x86_64: Remove pointless set_64bit() usagePeter Zijlstra1-5/+0
The use of set_64bit() in X86_64 only code is pretty pointless, seeing how it's a direct assignment. Remove all this nonsense. [nathanchance: unbreak irte] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114425.168036718%40infradead.org
2022-12-15x86/mm/pae: Be consistent with pXXp_get_and_clear()Peter Zijlstra1-50/+17
Given that ptep_get_and_clear() uses cmpxchg8b, and that should be by far the most common case, there's no point in having an optimized variant for pmd/pud. Introduce the pxx_xchg64() helper to implement the common logic once. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114425.103392961%40infradead.org
2022-12-15x86/mm/pae: Use WRITE_ONCE()Peter Zijlstra1-6/+6
Disallow write-tearing, that would be really unfortunate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114425.038102604%40infradead.org
2022-12-15x86/mm/pae: Don't (ab)use atomic64Peter Zijlstra1-5/+4
PAE implies CX8, write readable code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114424.971450128%40infradead.org
2022-12-15mm: Fix pmd_read_atomic()Peter Zijlstra1-56/+0
AFAICT there's no reason to do anything different than what we do for PTEs. Make it so (also affects SH). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114424.711181252%40infradead.org
2022-12-15x86/mm/pae: Make pmd_t similar to pte_tPeter Zijlstra4-31/+23
Instead of mucking about with at least 2 different ways of fudging it, do the same thing we do for pte_t. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114424.580310787%40infradead.org
2022-12-15x86/mm: Implement native set_memory_rox()Peter Zijlstra1-0/+3
Provide a native implementation of set_memory_rox(), avoiding the double set_memory_ro();set_memory_x(); calls. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2022-12-15x86/mm: Randomize per-cpu entry areaPeter Zijlstra2-5/+7
Seth found that the CPU-entry-area; the piece of per-cpu data that is mapped into the userspace page-tables for kPTI is not subject to any randomization -- irrespective of kASLR settings. On x86_64 a whole P4D (512 GB) of virtual address space is reserved for this structure, which is plenty large enough to randomize things a little. As such, use a straight forward randomization scheme that avoids duplicates to spread the existing CPUs over the available space. [ bp: Fix le build. ] Reported-by: Seth Jenkins <sethjenkins@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-12-15x86/kasan: Map shadow for percpu pages on demandAndrey Ryabinin1-0/+3
KASAN maps shadow for the entire CPU-entry-area: [CPU_ENTRY_AREA_BASE, CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE] This will explode once the per-cpu entry areas are randomized since it will increase CPU_ENTRY_AREA_MAP_SIZE to 512 GB and KASAN fails to allocate shadow for such big area. Fix this by allocating KASAN shadow only for really used cpu entry area addresses mapped by cea_map_percpu_pages() Thanks to the 0day folks for finding and reporting this to be an issue. [ dhansen: tweak changelog since this will get committed before peterz's actual cpu-entry-area randomization ] Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Yujie Liu <yujie.liu@intel.com> Cc: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/r/202210241508.2e203c3d-yujie.liu@intel.com
2022-12-15Merge tag 'x86_core_for_v6.2' of ↵Linus Torvalds16-104/+413
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Borislav Petkov: - Add the call depth tracking mitigation for Retbleed which has been long in the making. It is a lighterweight software-only fix for Skylake-based cores where enabling IBRS is a big hammer and causes a significant performance impact. What it basically does is, it aligns all kernel functions to 16 bytes boundary and adds a 16-byte padding before the function, objtool collects all functions' locations and when the mitigation gets applied, it patches a call accounting thunk which is used to track the call depth of the stack at any time. When that call depth reaches a magical, microarchitecture-specific value for the Return Stack Buffer, the code stuffs that RSB and avoids its underflow which could otherwise lead to the Intel variant of Retbleed. This software-only solution brings a lot of the lost performance back, as benchmarks suggest: https://lore.kernel.org/all/20220915111039.092790446@infradead.org/ That page above also contains a lot more detailed explanation of the whole mechanism - Implement a new control flow integrity scheme called FineIBT which is based on the software kCFI implementation and uses hardware IBT support where present to annotate and track indirect branches using a hash to validate them - Other misc fixes and cleanups * tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits) x86/paravirt: Use common macro for creating simple asm paravirt functions x86/paravirt: Remove clobber bitmask from .parainstructions x86/debug: Include percpu.h in debugreg.h to get DECLARE_PER_CPU() et al x86/cpufeatures: Move X86_FEATURE_CALL_DEPTH from bit 18 to bit 19 of word 11, to leave space for WIP X86_FEATURE_SGX_EDECCSSA bit x86/Kconfig: Enable kernel IBT by default x86,pm: Force out-of-line memcpy() objtool: Fix weak hole vs prefix symbol objtool: Optimize elf_dirty_reloc_sym() x86/cfi: Add boot time hash randomization x86/cfi: Boot time selection of CFI scheme x86/ibt: Implement FineIBT objtool: Add --cfi to generate the .cfi_sites section x86: Add prefix symbols for function padding objtool: Add option to generate prefix symbols objtool: Avoid O(bloody terrible) behaviour -- an ode to libelf objtool: Slice up elf_create_section_symbol() kallsyms: Revert "Take callthunks into account" x86: Unconfuse CONFIG_ and X86_FEATURE_ namespaces x86/retpoline: Fix crash printing warning x86/paravirt: Fix a !PARAVIRT build warning ...
2022-12-14Merge tag 'mm-stable-2022-12-13' of ↵Linus Torvalds3-11/+17
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - More userfaultfs work from Peter Xu - Several convert-to-folios series from Sidhartha Kumar and Huang Ying - Some filemap cleanups from Vishal Moola - David Hildenbrand added the ability to selftest anon memory COW handling - Some cpuset simplifications from Liu Shixin - Addition of vmalloc tracing support by Uladzislau Rezki - Some pagecache folioifications and simplifications from Matthew Wilcox - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series should have been in the non-MM tree, my bad - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages - Peter Xu utilized the PTE marker code for handling swapin errors - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand - zram support for multiple compression streams from Sergey Senozhatsky - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway - Mel Gorman altered the page allocator so that local IRQs can remnain enabled during per-cpu page allocations - Vishal Moola removed the try_to_release_page() wrapper - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache - David Hildenbrand did some cleanup and repair work on KSM COW breaking - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range() - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages() - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting - David Hildenbrand has fixed some of our MM selftests for 32-bit machines - Many singleton patches, as usual * tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits) mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio mm: mmu_gather: allow more than one batch of delayed rmaps mm: fix typo in struct pglist_data code comment kmsan: fix memcpy tests mm: add cond_resched() in swapin_walk_pmd_entry() mm: do not show fs mm pc for VM_LOCKONFAULT pages selftests/vm: ksm_functional_tests: fixes for 32bit selftests/vm: cow: fix compile warning on 32bit selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem mm,thp,rmap: fix races between updates of subpages_mapcount mm: memcg: fix swapcached stat accounting mm: add nodes= arg to memory.reclaim mm: disable top-tier fallback to reclaim on proactive reclaim selftests: cgroup: make sure reclaim target memcg is unprotected selftests: cgroup: refactor proactive reclaim code to reclaim_until() mm: memcg: fix stale protection of reclaim target memcg mm/mmap: properly unaccount memory on mas_preallocate() failure omfs: remove ->writepage jfs: remove ->writepage ...
2022-12-14Merge tag 'x86_paravirt_for_v6.2' of ↵Linus Torvalds1-49/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 paravirt update from Borislav Petkov: - Simplify paravirt patching machinery by removing the now unused clobber mask * tag 'x86_paravirt_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/paravirt: Remove clobber bitmask from .parainstructions
2022-12-14Merge tag 'x86_microcode_for_v6.2' of ↵Linus Torvalds3-4/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 microcode and IFS updates from Borislav Petkov: "The IFS (In-Field Scan) stuff goes through tip because the IFS driver uses the same structures and similar functionality as the microcode loader and it made sense to route it all through this branch so that there are no conflicts. - Add support for multiple testing sequences to the Intel In-Field Scan driver in order to be able to run multiple different test patterns. Rework things and remove the BROKEN dependency so that the driver can be enabled (Jithu Joseph) - Remove the subsys interface usage in the microcode loader because it is not really needed - A couple of smaller fixes and cleanups" * tag 'x86_microcode_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) x86/microcode/intel: Do not retry microcode reloading on the APs x86/microcode/intel: Do not print microcode revision and processor flags platform/x86/intel/ifs: Add missing kernel-doc entry Revert "platform/x86/intel/ifs: Mark as BROKEN" Documentation/ABI: Update IFS ABI doc platform/x86/intel/ifs: Add current_batch sysfs entry platform/x86/intel/ifs: Remove reload sysfs entry platform/x86/intel/ifs: Add metadata validation platform/x86/intel/ifs: Use generic microcode headers and functions platform/x86/intel/ifs: Add metadata support x86/microcode/intel: Use a reserved field for metasize x86/microcode/intel: Add hdr_type to intel_microcode_sanity_check() x86/microcode/intel: Reuse microcode_sanity_check() x86/microcode/intel: Use appropriate type in microcode_sanity_check() x86/microcode/intel: Reuse find_matching_signature() platform/x86/intel/ifs: Remove memory allocation from load path platform/x86/intel/ifs: Remove image loading during init platform/x86/intel/ifs: Return a more appropriate error code platform/x86/intel/ifs: Remove unused selection x86/microcode: Drop struct ucode_cpu_info.valid ...
2022-12-14Merge tag 'x86_cpu_for_v6.2' of ↵Linus Torvalds9-153/+175
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu updates from Borislav Petkov: - Split MTRR and PAT init code to accomodate at least Xen PV and TDX guests which do not get MTRRs exposed but only PAT. (TDX guests do not support the cache disabling dance when setting up MTRRs so they fall under the same category) This is a cleanup work to remove all the ugly workarounds for such guests and init things separately (Juergen Gross) - Add two new Intel CPUs to the list of CPUs with "normal" Energy Performance Bias, leading to power savings - Do not do bus master arbitration in C3 (ARB_DISABLE) on modern Centaur CPUs * tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits) x86/mtrr: Make message for disabled MTRRs more descriptive x86/pat: Handle TDX guest PAT initialization x86/cpuid: Carve out all CPUID functionality x86/cpu: Switch to cpu_feature_enabled() for X86_FEATURE_XENPV x86/cpu: Remove X86_FEATURE_XENPV usage in setup_cpu_entry_area() x86/cpu: Drop 32-bit Xen PV guest code in update_task_stack() x86/cpu: Remove unneeded 64-bit dependency in arch_enter_from_user_mode() x86/cpufeatures: Add X86_FEATURE_XENPV to disabled-features.h x86/acpi/cstate: Optimize ARB_DISABLE on Centaur CPUs x86/mtrr: Simplify mtrr_ops initialization x86/cacheinfo: Switch cache_ap_init() to hotplug callback x86: Decouple PAT and MTRR handling x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init() x86/mtrr: Let cache_aps_delayed_init replace mtrr_aps_delayed_init x86/mtrr: Get rid of __mtrr_enabled bool x86/mtrr: Simplify mtrr_bp_init() x86/mtrr: Remove set_all callback from struct mtrr_ops x86/mtrr: Disentangle MTRR init from PAT init x86/mtrr: Move cache control code to cacheinfo.c x86/mtrr: Split MTRR-specific handling from cache dis/enabling ...
2022-12-14Merge tag 'x86_boot_for_v6.2' of ↵Linus Torvalds2-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Borislav Petkov: "A of early boot cleanups and fixes. - Do some spring cleaning to the compressed boot code by moving the EFI mixed-mode code to a separate compilation unit, the AMD memory encryption early code where it belongs and fixing up build dependencies. Make the deprecated EFI handover protocol optional with the goal of removing it at some point (Ard Biesheuvel) - Skip realmode init code on Xen PV guests as it is not needed there - Remove an old 32-bit PIC code compiler workaround" * tag 'x86_boot_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Remove x86_32 PIC using %ebx workaround x86/boot: Skip realmode init code when running as Xen PV guest x86/efi: Make the deprecated EFI handover protocol optional x86/boot/compressed: Only build mem_encrypt.S if AMD_MEM_ENCRYPT=y x86/boot/compressed: Adhere to calling convention in get_sev_encryption_bit() x86/boot/compressed: Move startup32_check_sev_cbit() out of head_64.S x86/boot/compressed: Move startup32_check_sev_cbit() into .text x86/boot/compressed: Move startup32_load_idt() out of head_64.S x86/boot/compressed: Move startup32_load_idt() into .text section x86/boot/compressed: Pull global variable reference into startup32_load_idt() x86/boot/compressed: Avoid touching ECX in startup32_set_idt_entry() x86/boot/compressed: Simplify IDT/GDT preserve/restore in the EFI thunk x86/boot/compressed, efi: Merge multiple definitions of image_offset into one x86/boot/compressed: Move efi32_pe_entry() out of head_64.S x86/boot/compressed: Move efi32_entry out of head_64.S x86/boot/compressed: Move efi32_pe_entry into .text section x86/boot/compressed: Move bootargs parsing out of 32-bit startup code x86/boot/compressed: Move 32-bit entrypoint code into .text section x86/boot/compressed: Rename efi_thunk_64.S to efi-mixed.S
2022-12-14Merge tag 'efi-next-for-v6.2' of ↵Linus Torvalds1-32/+77
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI updates from Ard Biesheuvel: "Another fairly sizable pull request, by EFI subsystem standards. Most of the work was done by me, some of it in collaboration with the distro and bootloader folks (GRUB, systemd-boot), where the main focus has been on removing pointless per-arch differences in the way EFI boots a Linux kernel. - Refactor the zboot code so that it incorporates all the EFI stub logic, rather than calling the decompressed kernel as a EFI app. - Add support for initrd= command line option to x86 mixed mode. - Allow initrd= to be used with arbitrary EFI accessible file systems instead of just the one the kernel itself was loaded from. - Move some x86-only handling and manipulation of the EFI memory map into arch/x86, as it is not used anywhere else. - More flexible handling of any random seeds provided by the boot environment (i.e., systemd-boot) so that it becomes available much earlier during the boot. - Allow improved arch-agnostic EFI support in loaders, by setting a uniform baseline of supported features, and adding a generic magic number to the DOS/PE header. This should allow loaders such as GRUB or systemd-boot to reduce the amount of arch-specific handling substantially. - (arm64) Run EFI runtime services from a dedicated stack, and use it to recover from synchronous exceptions that might occur in the firmware code. - (arm64) Ensure that we don't allocate memory outside of the 48-bit addressable physical range. - Make EFI pstore record size configurable - Add support for decoding CXL specific CPER records" * tag 'efi-next-for-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (43 commits) arm64: efi: Recover from synchronous exceptions occurring in firmware arm64: efi: Execute runtime services from a dedicated stack arm64: efi: Limit allocations to 48-bit addressable physical region efi: Put Linux specific magic number in the DOS header efi: libstub: Always enable initrd command line loader and bump version efi: stub: use random seed from EFI variable efi: vars: prohibit reading random seed variables efi: random: combine bootloader provided RNG seed with RNG protocol output efi/cper, cxl: Decode CXL Error Log efi/cper, cxl: Decode CXL Protocol Error Section efi: libstub: fix efi_load_initrd_dev_path() kernel-doc comment efi: x86: Move EFI runtime map sysfs code to arch/x86 efi: runtime-maps: Clarify purpose and enable by default for kexec efi: pstore: Add module parameter for setting the record size efi: xen: Set EFI_PARAVIRT for Xen dom0 boot on all architectures efi: memmap: Move manipulation routines into x86 arch tree efi: memmap: Move EFI fake memmap support into x86 arch tree efi: libstub: Undeprecate the command line initrd loader efi: libstub: Add mixed mode support to command line initrd loader efi: libstub: Permit mixed mode return types other than efi_status_t ...
2022-12-13Merge tag 'pull-elfcore' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull elf coredumping updates from Al Viro: "Unification of regset and non-regset sides of ELF coredump handling. Collecting per-thread register values is the only thing that needs to be ifdefed there..." * tag 'pull-elfcore' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [elf] get rid of get_note_info_size() [elf] unify regset and non-regset cases [elf][non-regset] use elf_core_copy_task_regs() for dumper as well [elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs argument) elf_core_copy_task_regs(): task_pt_regs is defined everywhere [elf][regset] simplify thread list handling in fill_note_info() [elf][regset] clean fill_note_info() a bit kill extern of vsyscall32_sysctl kill coredump_params->regs kill signal_pt_regs()
2022-12-13Merge tag 'random-6.2-rc1-for-linus' of ↵Linus Torvalds1-13/+1
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: - Replace prandom_u32_max() and various open-coded variants of it, there is now a new family of functions that uses fast rejection sampling to choose properly uniformly random numbers within an interval: get_random_u32_below(ceil) - [0, ceil) get_random_u32_above(floor) - (floor, U32_MAX] get_random_u32_inclusive(floor, ceil) - [floor, ceil] Coccinelle was used to convert all current users of prandom_u32_max(), as well as many open-coded patterns, resulting in improvements throughout the tree. I'll have a "late" 6.1-rc1 pull for you that removes the now unused prandom_u32_max() function, just in case any other trees add a new use case of it that needs to converted. According to linux-next, there may be two trivial cases of prandom_u32_max() reintroductions that are fixable with a 's/.../.../'. So I'll have for you a final conversion patch doing that alongside the removal patch during the second week. This is a treewide change that touches many files throughout. - More consistent use of get_random_canary(). - Updates to comments, documentation, tests, headers, and simplification in configuration. - The arch_get_random*_early() abstraction was only used by arm64 and wasn't entirely useful, so this has been replaced by code that works in all relevant contexts. - The kernel will use and manage random seeds in non-volatile EFI variables, refreshing a variable with a fresh seed when the RNG is initialized. The RNG GUID namespace is then hidden from efivarfs to prevent accidental leakage. These changes are split into random.c infrastructure code used in the EFI subsystem, in this pull request, and related support inside of EFISTUB, in Ard's EFI tree. These are co-dependent for full functionality, but the order of merging doesn't matter. - Part of the infrastructure added for the EFI support is also used for an improvement to the way vsprintf initializes its siphash key, replacing an sleep loop wart. - The hardware RNG framework now always calls its correct random.c input function, add_hwgenerator_randomness(), rather than sometimes going through helpers better suited for other cases. - The add_latent_entropy() function has long been called from the fork handler, but is a no-op when the latent entropy gcc plugin isn't used, which is fine for the purposes of latent entropy. But it was missing out on the cycle counter that was also being mixed in beside the latent entropy variable. So now, if the latent entropy gcc plugin isn't enabled, add_latent_entropy() will expand to a call to add_device_randomness(NULL, 0), which adds a cycle counter, without the absent latent entropy variable. - The RNG is now reseeded from a delayed worker, rather than on demand when used. Always running from a worker allows it to make use of the CPU RNG on platforms like S390x, whose instructions are too slow to do so from interrupts. It also has the effect of adding in new inputs more frequently with more regularity, amounting to a long term transcript of random values. Plus, it helps a bit with the upcoming vDSO implementation (which isn't yet ready for 6.2). - The jitter entropy algorithm now tries to execute on many different CPUs, round-robining, in hopes of hitting even more memory latencies and other unpredictable effects. It also will mix in a cycle counter when the entropy timer fires, in addition to being mixed in from the main loop, to account more explicitly for fluctuations in that timer firing. And the state it touches is now kept within the same cache line, so that it's assured that the different execution contexts will cause latencies. * tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits) random: include <linux/once.h> in the right header random: align entropy_timer_state to cache line random: mix in cycle counter when jitter timer fires random: spread out jitter callback to different CPUs random: remove extraneous period and add a missing one in comments efi: random: refresh non-volatile random seed when RNG is initialized vsprintf: initialize siphash key using notifier random: add back async readiness notifier random: reseed in delayed work rather than on-demand random: always mix cycle counter in add_latent_entropy() hw_random: use add_hwgenerator_randomness() for early entropy random: modernize documentation comment on get_random_bytes() random: adjust comment to account for removed function random: remove early archrandom abstraction random: use random.trust_{bootloader,cpu} command line option only stackprotector: actually use get_random_canary() stackprotector: move get_random_canary() into stackprotector.h treewide: use get_random_u32_inclusive() when possible treewide: use get_random_u32_{above,below}() instead of manual loop treewide: use get_random_u32_below() instead of deprecated function ...
2022-12-13Merge tag 'x86_cache_for_6.2' of ↵Linus Torvalds2-11/+18
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cache resource control updates from Dave Hansen: "These declare the resource control (rectrl) MSRs a bit more normally and clean up an unnecessary structure member: - Remove unnecessary arch_has_empty_bitmaps structure memory - Move rescrtl MSR defines into msr-index.h, like normal MSRs" * tag 'x86_cache_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/resctrl: Move MSR defines into msr-index.h x86/resctrl: Remove arch_has_empty_bitmaps
2022-12-13Merge tag 'x86_tdx_for_6.2' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 tdx updates from Dave Hansen: "This includes a single chunk of new functionality for TDX guests which allows them to talk to the trusted TDX module software and obtain an attestation report. This report can then be used to prove the trustworthiness of the guest to a third party and get access to things like storage encryption keys" * tag 'x86_tdx_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: selftests/tdx: Test TDX attestation GetReport support virt: Add TDX guest driver x86/tdx: Add a wrapper to get TDREPORT0 from the TDX Module
2022-12-13Merge tag 'x86_sgx_for_6.2' of ↵Linus Torvalds2-7/+27
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 sgx updates from Dave Hansen: "The biggest deal in this series is support for a new hardware feature that allows enclaves to detect and mitigate single-stepping attacks. There's also a minor performance tweak and a little piece of the kmap_atomic() -> kmap_local() transition. Summary: - Introduce a new SGX feature (Asynchrounous Exit Notification) for bare-metal enclaves and KVM guests to mitigate single-step attacks - Increase batching to speed up enclave release - Replace kmap/kunmap_atomic() calls" * tag 'x86_sgx_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sgx: Replace kmap/kunmap_atomic() calls KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest x86/sgx: Allow enclaves to use Asynchrounous Exit Notification x86/sgx: Reduce delay and interference of enclave release
2022-12-13Merge tag 'x86-misc-2022-12-10' of ↵Linus Torvalds5-17/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 updates from Thomas Gleixner: "Updates for miscellaneous x86 areas: - Reserve a new boot loader type for barebox which is usally used on ARM and MIPS, but can also be utilized as EFI payload on x86 to provide watchdog-supervised boot up. - Consolidate the native and compat 32bit signal handling code and split the 64bit version out into a separate source file - Switch the ESPFIX random usage to get_random_long()" * tag 'x86-misc-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/espfix: Use get_random_long() rather than archrandom x86/signal/64: Move 64-bit signal code to its own file x86/signal/32: Merge native and compat 32-bit signal code x86/signal: Add ABI prefixes to frame setup functions x86/signal: Merge get_sigframe() x86: Remove __USER32_DS signal/compat: Remove compat_sigset_t override x86/signal: Remove sigset_t parameter from frame setup functions x86/signal: Remove sig parameter from frame setup functions Documentation/x86/boot: Reserve type_of_loader=13 for barebox
2022-12-12Merge remote-tracking branch 'kvm/queue' into HEADPaolo Bonzini2-8/+2
x86 Xen-for-KVM: * Allow the Xen runstate information to cross a page boundary * Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured * add support for 32-bit guests in SCHEDOP_poll x86 fixes: * One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0). * Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02. * Clean up the MSR filter docs. * Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64. * Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID. * Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency. * Advertise (on AMD) that the SMM_CTL MSR is not supported * Remove unnecessary exports Selftests: * Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal. * Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that tiggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl(). * Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message. * Introduce actual atomics for clear/set_bit() in selftests Documentation: * Remove deleted ioctls from documentation * Various fixes
2022-12-12Merge tag 'x86-apic-2022-12-10' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic update from Thomas Gleixner: "A set of changes for the x86 APIC code: - Handle the case where x2APIC is enabled and locked by the BIOS on a kernel with CONFIG_X86_X2APIC=n gracefully. Instead of a panic which does not make it to the graphical console during very early boot, simply disable the local APIC completely and boot with the PIC and very limited functionality, which allows to diagnose the issue - Convert x86 APIC device tree bindings to YAML - Extend x86 APIC device tree bindings to configure interrupt delivery mode and handle this in during init. This allows to boot with device tree on platforms which lack a legacy PIC" * tag 'x86-apic-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/of: Add support for boot time interrupt delivery mode configuration x86/of: Replace printk(KERN_LVL) with pr_lvl() dt-bindings: x86: apic: Introduce new optional bool property for lapic dt-bindings: x86: apic: Convert Intel's APIC bindings to YAML schema x86/of: Remove unused early_init_dt_add_memory_arch() x86/apic: Handle no CONFIG_X86_X2APIC on systems with x2APIC enabled by BIOS