Age | Commit message (Collapse) | Author | Files | Lines |
|
This commit in sched/urgent moved the cfs_rq_is_decayed() function:
a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
and this fresh commit in sched/core modified it in the old location:
9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are")
Merge the two variants.
Conflicts:
kernel/sched/fair.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Previously, to detect nested virtualization enlightenment support,
we were using HV_X64_ENLIGHTENED_VMCS_RECOMMENDED feature bit of
HYPERV_CPUID_ENLIGHTMENT_INFO.EAX CPUID as docuemented in TLFS:
"Bit 14: Recommend a nested hypervisor using the enlightened VMCS
interface. Also indicates that additional nested enlightenments
may be available (see leaf 0x4000000A)".
Enlightened VMCS, however, is an Intel only feature so the above
detection method doesn't work for AMD. So, use the
HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS.EAX CPUID information ("The
maximum input value for hypervisor CPUID information.") and this
works for both AMD and Intel.
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Message-Id: <43b25ff21cd2d9a51582033c9bdd895afefac056.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
xa_destroy() needs to be called to destroy a virtual EPC's page array
before calling kfree() to free the virtual EPC. Currently it is not
called so add the missing xa_destroy().
Fixes: 540745ddbc70 ("x86/sgx: Introduce virtual EPC for use by KVM guests")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Yang Zhong <yang.zhong@intel.com>
Link: https://lkml.kernel.org/r/20210615101639.291929-1-kai.huang@intel.com
|
|
As a result of TSX deprecation, some processors always abort TSX
transactions by default after a microcode update.
When TSX feature cannot be used it is better to hide it. Clear CPUID.RTM
and CPUID.HLE bits when TSX transactions always abort.
[ bp: Massage commit message and comments. ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Link: https://lkml.kernel.org/r/5209b3d72ffe5bd3cafdcc803f5b883f785329c3.1623704845.git-series.pawan.kumar.gupta@linux.intel.com
|
|
When an instruction is fetched from user-space, segmentation needs to
be taken into account. This means that getting the linear address of an
instruction can fail. Hardware would raise a #GP exception in that case,
but the #VC exception handler would emulate it as a page-fault.
The insn_fetch_from_user*() functions now provide the relevant
information in case of a failure. Use that and propagate a #GP when the
linear address of an instruction to fetch could not be calculated.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210614135327.9921-7-joro@8bytes.org
|
|
The error reporting from the insn_fetch_from_user*() functions is not
very verbose. Extend it to include information on whether the linear
RIP could not be calculated or whether the memory access faulted.
This will be used in the SEV-ES code to propagate the correct
exception depending on what went wrong during instruction fetch.
[ bp: Massage comments. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210614135327.9921-6-joro@8bytes.org
|
|
The runtime #VC handler is not "early" anymore. Fix the copy&paste error
and remove that word from the error message.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210614135327.9921-2-joro@8bytes.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc fixes:
- Fix the NMI watchdog on ancient Intel CPUs
- Remove a misguided, NMI-unsafe KASAN callback from the NMI-safe
irq_work path used by perf.
- Fix uncore events on Ice Lake servers.
- Someone booted maxcpus=1 on an SNB-EP, and the uncore driver
emitted warnings and was probably buggy. Fix it.
- KCSAN found a genuine data race in the core perf code. Somewhat
ironically the bug was introduced through a recent race fix. :-/
In our defense, the new race window was much more narrow. Fix it"
* tag 'perf-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs
irq_work: Make irq_work_queue() NMI-safe again
perf/x86/intel/uncore: Fix M2M event umask for Ice Lake server
perf/x86/intel/uncore: Fix a kernel WARNING triggered by maxcpus=1
perf: Fix data race between pin_count increment/decrement
|
|
Fix the following kernel-doc warning:
arch/x86/kernel/cpu/sgx/encl.c:392: warning: Function parameter \
or member 'ref' not described in 'sgx_encl_release'
[ bp: Massage commit message. ]
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210609035510.2083694-1-chenxiaosong2@huawei.com
|
|
The following commit:
3a4ac121c2ca ("x86/perf: Add hardware performance events support for Zhaoxin CPU.")
Got the old-style NMI watchdog logic wrong and broke it for basically every
Intel CPU where it was active. Which is only truly old CPUs, so few people noticed.
On CPUs with perf events support we turn off the old-style NMI watchdog, so it
was pretty pointless to add the logic for X86_VENDOR_ZHAOXIN to begin with ... :-/
Anyway, the fix is to restore the old logic and add a 'break'.
[ mingo: Wrote a new changelog. ]
Fixes: 3a4ac121c2ca ("x86/perf: Add hardware performance events support for Zhaoxin CPU.")
Signed-off-by: CodyYao-oc <CodyYao-oc@zhaoxin.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210607025335.9643-1-CodyYao-oc@zhaoxin.com
|
|
If access_ok() or fpregs_soft_set() fails in __fpu__restore_sig() then the
function just returns but does not clear the FPU state as it does for all
other fatal failures.
Clear the FPU state for these failures as well.
Fixes: 72a671ced66d ("x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/87mtryyhhz.ffs@nanos.tec.linutronix.de
|
|
git://anongit.freedesktop.org/drm/drm-intel into drm-next
Cross-subsystem Changes:
- x86/gpu: add JasperLake to gen11 early quirks
(Although the patch lacks the Ack info, it has been Acked by Borislav)
Driver Changes:
- General DMC improves (Anusha)
- More ADL-P enabling (Vandita, Matt, Jose, Mika, Anusha, Imre, Lucas, Jani, Manasi, Ville, Stanislav)
- Introduce MBUS relative dbuf offset (Ville)
- PSR fixes and improvements (Gwan, Jose, Ville)
- Re-enable LTTPR non-transparent LT mode for DPCD_REV < 1.4 (Ville)
- Remove duplicated declarations (Shaokun, Wan)
- Check HDMI sink deep color capabilities during .mode_valid (Ville)
- Fix display flicker screan related to console and FBC (Chris)
- Remaining conversions of GRAPHICS_VER (Lucas)
- Drop invalid FIXME (Jose)
- Fix bigjoiner check in dsc_disable (Vandita)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/YMEy2Ew82BeL/hDK@intel.com
|
|
copy_user_to_xstate() uses __copy_from_user(), which provides a negligible
speedup. Fortunately, both call sites are at least almost correct.
__fpu__restore_sig() checks access_ok() with xstate_sigframe_size()
length and ptrace regset access uses fpu_user_xstate_size. These should
be valid upper bounds on the length, so, at worst, this would cause
spurious failures and not accesses to kernel memory.
Nonetheless, this is far more fragile than necessary and none of these
callers are in a hotpath.
Use copy_from_user() instead.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Link: https://lkml.kernel.org/r/20210608144346.140254130@linutronix.de
|
|
Both Intel and AMD consider it to be architecturally valid for XRSTOR to
fail with #PF but nonetheless change the register state. The actual
conditions under which this might occur are unclear [1], but it seems
plausible that this might be triggered if one sibling thread unmaps a page
and invalidates the shared TLB while another sibling thread is executing
XRSTOR on the page in question.
__fpu__restore_sig() can execute XRSTOR while the hardware registers
are preserved on behalf of a different victim task (using the
fpu_fpregs_owner_ctx mechanism), and, in theory, XRSTOR could fail but
modify the registers.
If this happens, then there is a window in which __fpu__restore_sig()
could schedule out and the victim task could schedule back in without
reloading its own FPU registers. This would result in part of the FPU
state that __fpu__restore_sig() was attempting to load leaking into the
victim task's user-visible state.
Invalidate preserved FPU registers on XRSTOR failure to prevent this
situation from corrupting any state.
[1] Frequent readers of the errata lists might imagine "complex
microarchitectural conditions".
Fixes: 1d731e731c4c ("x86/fpu: Add a fastpath to __fpu__restore_sig()")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210608144345.758116583@linutronix.de
|
|
The non-compacted slowpath uses __copy_from_user() and copies the entire
user buffer into the kernel buffer, verbatim. This means that the kernel
buffer may now contain entirely invalid state on which XRSTOR will #GP.
validate_user_xstate_header() can detect some of that corruption, but that
leaves the onus on callers to clear the buffer.
Prior to XSAVES support, it was possible just to reinitialize the buffer,
completely, but with supervisor states that is not longer possible as the
buffer clearing code split got it backwards. Fixing that is possible but
not corrupting the state in the first place is more robust.
Avoid corruption of the kernel XSAVE buffer by using copy_user_to_xstate()
which validates the XSAVE header contents before copying the actual states
to the kernel. copy_user_to_xstate() was previously only called for
compacted-format kernel buffers, but it works for both compacted and
non-compacted forms.
Using it for the non-compacted form is slower because of multiple
__copy_from_user() operations, but that cost is less important than robust
code in an already slow path.
[ Changelog polished by Dave Hansen ]
Fixes: b860eb8dce59 ("x86/fpu/xstate: Define new functions for clearing fpregs and xstates")
Reported-by: syzbot+2067e764dbcd10721e2e@syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210608144345.611833074@linutronix.de
|
|
It does so unconditionally too, on Intel and AMD machines, to work
around BIOS bugs, as confirmed by Microsoft folks (see Link for full
details).
Reflow the paragraph, while at it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/MWHPR21MB159330952629D36EEDE706B3D7379@MWHPR21MB1593.namprd21.prod.outlook.com
|
|
Let's reserve JSL stolen memory for graphics.
JasperLake is a gen11 platform which is compatible with
ICL/EHL changes.
This was missed in commit 24ea098b7c0d ("drm/i915/jsl: Split
EHL/JSL platform info and PCI ids")
V2:
- Added maintainer list in cc
- Added patch ref in commit message
V1:
- Added Cc: x86@kernel.org
Fixes: 24ea098b7c0d ("drm/i915/jsl: Split EHL/JSL platform info and PCI ids")
Cc: <stable@vger.kernel.org> # v5.11+
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210608053411.394166-1-tejaskumarx.surendrakumar.upadhyay@intel.com
|
|
The entire memory range under 1M is unconditionally reserved in
setup_arch(), so there is no need for crash_reserve_low_1M() anymore.
Remove this function.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210601075354.5149-4-rppt@kernel.org
|
|
The CONFIG_X86_RESERVE_LOW build time and reservelow= command line option
allowed to control the amount of memory under 1M that would be reserved at
boot to avoid using memory that can be potentially clobbered by BIOS.
Since the entire range under 1M is always reserved there is no need for
these options anymore and they can be removed.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210601075354.5149-3-rppt@kernel.org
|
|
Pick up dependent changes in order to base further cleanups ontop.
Signed-off-by: Borislav Petkov <bp@suse.de>
|
|
There are BIOSes that are known to corrupt the memory under 1M, or more
precisely under 640K because the memory above 640K is anyway reserved
for the EGA/VGA frame buffer and BIOS.
To prevent usage of the memory that will be potentially clobbered by the
kernel, the beginning of the memory is always reserved. The exact size
of the reserved area is determined by CONFIG_X86_RESERVE_LOW build time
and the "reservelow=" command line option. The reserved range may be
from 4K to 640K with the default of 64K. There are also configurations
that reserve the entire 1M range, like machines with SandyBridge graphic
devices or systems that enable crash kernel.
In addition to the potentially clobbered memory, EBDA of unknown size may
be as low as 128K and the memory above that EBDA start is also reserved
early.
It would have been possible to reserve the entire range under 1M unless for
the real mode trampoline that must reside in that area.
To accommodate placement of the real mode trampoline and keep the memory
safe from being clobbered by BIOS, reserve the first 64K of RAM before
memory allocations are possible and then, after the real mode trampoline
is allocated, reserve the entire range from 0 to 1M.
Update trim_snb_memory() and reserve_real_mode() to avoid redundant
reservations of the same memory range.
Also make sure the memory under 1M is not getting freed by
efi_free_boot_services().
[ bp: Massage commit message and comments. ]
Fixes: a799c2bd29d1 ("x86/setup: Consolidate early memory reservations")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Hugh Dickins <hughd@google.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213177
Link: https://lkml.kernel.org/r/20210601075354.5149-2-rppt@kernel.org
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Up until now the assumption was that an alternative patching site would
have some instructions at the beginning and trailing single-byte NOPs
(0x90) padding. Therefore, the patching machinery would go and optimize
those single-byte NOPs into longer ones.
However, this assumption is broken on 32-bit when code like
hv_do_hypercall() in hyperv_init() would use the ratpoline speculation
killer CALL_NOSPEC. The 32-bit version of that macro would align certain
insns to 16 bytes, leading to the compiler issuing a one or more
single-byte NOPs, depending on the holes it needs to fill for alignment.
That would lead to the warning in optimize_nops() to fire:
------------[ cut here ]------------
Not a NOP at 0xc27fb598
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:211 optimize_nops.isra.13
due to that function verifying whether all of the following bytes really
are single-byte NOPs.
Therefore, carve out the NOP padding into a separate function and call
it for each NOP range beginning with a single-byte NOP.
Fixes: 23c1ad538f4f ("x86/alternatives: Optimize optimize_nops()")
Reported-by: Richard Narron <richard@aaazen.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213301
Link: https://lkml.kernel.org/r/20210601212125.17145-1-bp@alien8.de
|
|
While digesting the XSAVE-related horrors which got introduced with
the supervisor/user split, the recent addition of ENQCMD-related
functionality got on the radar and turned out to be similarly broken.
update_pasid(), which is only required when X86_FEATURE_ENQCMD is
available, is invoked from two places:
1) From switch_to() for the incoming task
2) Via a SMP function call from the IOMMU/SMV code
#1 is half-ways correct as it hacks around the brokenness of get_xsave_addr()
by enforcing the state to be 'present', but all the conditionals in that
code are completely pointless for that.
Also the invocation is just useless overhead because at that point
it's guaranteed that TIF_NEED_FPU_LOAD is set on the incoming task
and all of this can be handled at return to user space.
#2 is broken beyond repair. The comment in the code claims that it is safe
to invoke this in an IPI, but that's just wishful thinking.
FPU state of a running task is protected by fregs_lock() which is
nothing else than a local_bh_disable(). As BH-disabled regions run
usually with interrupts enabled the IPI can hit a code section which
modifies FPU state and there is absolutely no guarantee that any of the
assumptions which are made for the IPI case is true.
Also the IPI is sent to all CPUs in mm_cpumask(mm), but the IPI is
invoked with a NULL pointer argument, so it can hit a completely
unrelated task and unconditionally force an update for nothing.
Worse, it can hit a kernel thread which operates on a user space
address space and set a random PASID for it.
The offending commit does not cleanly revert, but it's sufficient to
force disable X86_FEATURE_ENQCMD and to remove the broken update_pasid()
code to make this dysfunctional all over the place. Anything more
complex would require more surgery and none of the related functions
outside of the x86 core code are blatantly wrong, so removing those
would be overkill.
As nothing enables the PASID bit in the IA32_XSS MSR yet, which is
required to make this actually work, this cannot result in a regression
except for related out of tree train-wrecks, but they are broken already
today.
Fixes: 20f0afd1fb3d ("x86/mmu: Allocate/free a PASID")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/87mtsd6gr9.ffs@nanos.tec.linutronix.de
|
|
Kprobes has a counter 'nmissed', that is used to count the number of
times a probe handler was not called. This generally happens when we hit
a kprobe while handling another kprobe.
However, if one of the probe handlers causes a fault, we are currently
incrementing 'nmissed'. The comment in fault handler indicates that this
can be used to account faults taken by the probe handlers. But, this has
never been the intention as is evident from the comment above 'nmissed'
in 'struct kprobe':
/*count the number of times this probe was temporarily disarmed */
unsigned long nmissed;
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20210601120150.672652-1-naveen.n.rao@linux.vnet.ibm.com
|
|
For easier inspection which bytes have changed.
For example:
feat: 7*32+12, old: (__x86_indirect_thunk_r10+0x0/0x20 (ffffffff81c02480) len: 17), repl: (ffffffff897813aa, len: 17)
ffffffff81c02480: old_insn: 41 ff e2 90 90 90 90 90 90 90 90 90 90 90 90 90 90
ffffffff897813aa: rpl_insn: e8 07 00 00 00 f3 90 0f ae e8 eb f9 4c 89 14 24 c3
ffffffff81c02480: final_insn: e8 07 00 00 00 f3 90 0f ae e8 eb f9 4c 89 14 24 c3
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210601193713.16190-1-bp@alien8.de
|
|
Microsoft Hypervisor expects the logical processor index to be the same
as CPU's index during logical processor creation. Using cpu_physical_id
confuses hypervisor's scheduler. That causes the root partition not boot
when core scheduler is used.
This patch removes the call to cpu_physical_id and uses the CPU index
directly for bringing up logical processor. This scheme works for both
classic scheduler and core scheduler.
Fixes: 333abaf5abb3 (x86/hyperv: implement and use hv_smp_prepare_cpus)
Signed-off-by: Praveen Kumar <kumarpraveen@linux.microsoft.com>
Link: https://lore.kernel.org/r/20210531074046.113452-1-kumarpraveen@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
AMD and Hygon CPUs have a CPUID bit for RAPL. Drop the fam17h suffix as
it is stale already.
Make use of this instead of a model check to work more nicely in virtual
environments where RAPL typically isn't available.
[ bp: drop the ../cpu/powerflags.c hunk which is superfluous as the
"rapl" bit name appears already in flags. ]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210514135920.16093-1-andrew.cooper3@citrix.com
|
|
The reason for kprobe::fault_handler(), as given by their comment:
* We come here because instructions in the pre/post
* handler caused the page_fault, this could happen
* if handler tries to access user space by
* copy_from_user(), get_user() etc. Let the
* user-specified handler try to fix it first.
Is just plain bad. Those other handlers are ran from non-preemptible
context and had better use _nofault() functions. Also, there is no
upstream usage of this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210525073213.561116662@infradead.org
|
|
There are machines out there with added value crap^WBIOS which provide an
SMI handler for the local APIC thermal sensor interrupt. Out of reset,
the BSP on those machines has something like 0x200 in that APIC register
(timestamps left in because this whole issue is timing sensitive):
[ 0.033858] read lvtthmr: 0x330, val: 0x200
which means:
- bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled
- bits [10:8] have 010b which means SMI delivery mode.
Now, later during boot, when the kernel programs the local APIC, it
soft-disables it temporarily through the spurious vector register:
setup_local_APIC:
...
/*
* If this comes from kexec/kcrash the APIC might be enabled in
* SPIV. Soft disable it before doing further initialization.
*/
value = apic_read(APIC_SPIV);
value &= ~APIC_SPIV_APIC_ENABLED;
apic_write(APIC_SPIV, value);
which means (from the SDM):
"10.4.7.2 Local APIC State After It Has Been Software Disabled
...
* The mask bits for all the LVT entries are set. Attempts to reset these
bits will be ignored."
And this happens too:
[ 0.124111] APIC: Switch to symmetric I/O mode setup
[ 0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
[ 0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0
This results in CPU 0 soft lockups depending on the placement in time
when the APIC soft-disable happens. Those soft lockups are not 100%
reproducible and the reason for that can only be speculated as no one
tells you what SMM does. Likely, it confuses the SMM code that the APIC
is disabled and the thermal interrupt doesn't doesn't fire at all,
leading to CPU 0 stuck in SMM forever...
Now, before
4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
due to how the APIC_LVTTHMR was read before APIC initialization in
mcheck_intel_therm_init(), it would read the value with the mask bit 16
clear and then intel_init_thermal() would replicate it onto the APs and
all would be peachy - the thermal interrupt would remain enabled.
But that commit moved that reading to a later moment in
intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
late and with its interrupt mask bit set.
Thus, revert back to the old behavior of reading the thermal LVT
register before the APIC gets initialized.
Fixes: 4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
Reported-by: James Feeney <james@nurealm.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/YKIqDdFNaXYd39wz@zn.tnic
|
|
Hygon systems support the MONITOR/MWAIT instructions and these can be
used for ACPI C1 in the same way as on AMD and Intel systems.
The BIOS declares a C1 state in _CST to use FFH and CPUID_Fn00000005_EDX
is non-zero on Hygon systems.
Allow ffh_cstate_init() to succeed on Hygon systems to default using FFH
MWAIT instead of HALT for ACPI C1.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210528081417.31474-1-puwen@hygon.cn
|
|
PIC interrupts do not support affinity setting and they can end up on
any online CPU. Therefore, it's required to mark the associated vectors
as system-wide reserved. Otherwise, the corresponding irq descriptors
are copied to the secondary CPUs but the vectors are not marked as
assigned or reserved. This works correctly for the IO/APIC case.
When the IO/APIC is disabled via config, kernel command line or lack of
enumeration then all legacy interrupts are routed through the PIC, but
nothing marks them as system-wide reserved vectors.
As a consequence, a subsequent allocation on a secondary CPU can result in
allocating one of these vectors, which triggers the BUG() in
apic_update_vector() because the interrupt descriptor slot is not empty.
Imran tried to work around that by marking those interrupts as allocated
when a CPU comes online. But that's wrong in case that the IO/APIC is
available and one of the legacy interrupts, e.g. IRQ0, has been switched to
PIC mode because then marking them as allocated will fail as they are
already marked as system vectors.
Stay consistent and update the legacy vectors after attempting IO/APIC
initialization and mark them as system vectors in case that no IO/APIC is
available.
Fixes: 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment")
Reported-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210519233928.2157496-1-imran.f.khan@oracle.com
|
|
When BIOS reports memory errors to Linux using the ACPI/APEI
error reporting method Linux creates a "struct mce" to pass
to the normal reporting code path.
The constructed record doesn't include a value for the "misc"
field of the structure, and so mce_usable_address() says this
record doesn't include a valid address.
Net result is that functions like uc_decode_notifier() will
just ignore this record instead of taking action to offline
a page.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210527222846.931851-1-tony.luck@intel.com
|
|
Add the (HWID, MCATYPE) tuples and names for new SMCA bank types.
Also, add their respective error descriptions to the MCE decoding module
edac_mce_amd. Also while at it, optimize the string names for some SMCA
banks.
[ bp: Drop repeated comments, explain why UMC_V2 is a separate entry. ]
Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20210526164601.66228-1-nchatrad@amd.com
|
|
drm/i915 is extremely on fire without the below revert from -rc3:
commit 293837b9ac8d3021657f44c9d7a14948ec01c5d0
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed May 19 05:55:57 2021 -1000
Revert "i915: fix remap_io_sg to verify the pgprot"
Backmerge so we don't have a too wide bisect window for anything
that's a more involved workload than booting the driver.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Fix how SEV handles MMIO accesses by forwarding potential page faults
instead of killing the machine and by using the accessors with the
exact functionality needed when accessing memory.
- Fix a confusion with Clang LTO compiler switches passed to the it
- Handle the case gracefully when VMGEXIT has been executed in
userspace
* tag 'x86_urgent_for_v5.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev-es: Use __put_user()/__get_user() for data accesses
x86/sev-es: Forward page-faults which happen during emulation
x86/sev-es: Don't return NULL from sev_es_get_ghcb()
x86/build: Fix location of '-plugin-opt=' flags
x86/sev-es: Invalidate the GHCB after completing VMGEXIT
x86/sev-es: Move sev_es_put_ghcb() in prep for follow on patch
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo fix from Eric Biederman:
"During the merge window an issue with si_perf and the siginfo ABI came
up. The alpha and sparc siginfo structure layout had changed with the
addition of SIGTRAP TRAP_PERF and the new field si_perf.
The reason only alpha and sparc were affected is that they are the
only architectures that use si_trapno.
Looking deeper it was discovered that si_trapno is used for only a few
select signals on alpha and sparc, and that none of the other
_sigfault fields past si_addr are used at all. Which means technically
no regression on alpha and sparc.
While the alignment concerns might be dismissed the abuse of si_errno
by SIGTRAP TRAP_PERF does have the potential to cause regressions in
existing userspace.
While we still have time before userspace starts using and depending
on the new definition siginfo for SIGTRAP TRAP_PERF this set of
changes cleans up siginfo_t.
- The si_trapno field is demoted from magic alpha and sparc status
and made an ordinary union member of the _sigfault member of
siginfo_t. Without moving it of course.
- si_perf is replaced with si_perf_data and si_perf_type ending the
abuse of si_errno.
- Unnecessary additions to signalfd_siginfo are removed"
* 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo
signal: Deliver all of the siginfo perf data in _perf
signal: Factor force_sig_perf out of perf_sigtrap
signal: Implement SIL_FAULT_TRAPNO
siginfo: Move si_trapno inside the union inside _si_fault
|
|
These files contain private set_gdt() functions which are only used to
invalid the gdt; machine_kexec_64.c also contains a set_idt()
function to invalidate the idt.
phys_to_virt(0) *really* doesn't make any sense for creating an
invalid GDT. A NULL pointer (virtual 0) makes a lot more sense;
although neither will allow any actual memory reference, a NULL
pointer stands out more.
Replace these calls with native_[gi]dt_invalidate().
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210519212154.511983-7-hpa@zytor.com
|
|
There is no reason to specify any specific address to idt_invalidate(). It
looks mostly like an artifact of unifying code done differently by
accident. The most "sensible" address to set here is a NULL pointer -
virtual address zero, just as a visual marker.
This also makes it possible to mark the struct desc_ptr in idt_invalidate()
as static const.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210519212154.511983-5-hpa@zytor.com
|
|
This is required to support Zen3 APUs in k10temp.
Signed-off-by: David Bartley <andareed@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Wei Huang <wei.huang2@amd.com>
Link: https://lkml.kernel.org/r/20210520174130.94954-1-andareed@gmail.com
|
|
git://anongit.freedesktop.org/drm/drm-intel into drm-next
Core Changes:
- drm: Rename DP_PSR_SELECTIVE_UPDATE to better mach eDP spec (Jose).
Driver Changes:
- Display plane clock rates fixes and improvements (Ville).
- Uninint DMC FW loader state during shutdown (Imre).
- Convert snprintf to sysfs_emit (Xuezhi).
- Fix invalid access to ACPI _DSM objects (Takashi).
- A big refactor around how i915 addresses the graphics
and display IP versions. (Matt, Lucas).
- Backlight fix (Lyude).
- Display watermark and DBUF fixes (Ville).
- HDCP fix (Anshuman).
- Improve cases where display is not available (Jose).
- Defeature PSR2 for RKL and ALD-S (Jose).
- VLV DSI panel power fixes and improvements (Hans).
- display-12 workaround (Jose).
- Fix modesetting (Imre).
- Drop redundant address-of op before lttpr_common_caps array (Imre).
- Fix compiler checks (Jose, Jason).
- GLK display fixes (Ville).
- Fix error code returns (Dan).
- eDP novel: back again to slow and wide link training everywhere (Kai-Heng).
- Abstract DMC FW path (Rodrigo).
- Preparation and changes for upcoming
XeLPD display IP (Jose, Matt, Ville, Juha-Pekka, Animesh).
- Fix comment typo in DSI code (zuoqilin).
- Simplify CCS and UV plane alignment handling (Imre).
- PSR Fixes on TGL (Gwan-gyeong, Jose).
- Add intel_dp_hdcp.h and rename init (Jani).
- Move crtc and dpll declarations around (Jani).
- Fix pre-skl DP AUX precharge length (Ville).
- Remove stray newlines from random files (Ville).
- crtc->index and intel_crtc+drm_crtc pointer clean-up (Ville).
- Add frontbuffer tracking tracepoints (Ville).
- ADL-S PCI ID updates (Anand).
- Use unique backlight device names (Jani).
- A few clean-ups on i915/audio (Jani).
- Use intel_framebuffer instead of drm one on intel_fb functions (Imre).
- Add the missing MC CCS/XYUV8888 format support on display >= 12 (Imre).
- Nuke display error state (Ville).
- ADL-P initial enablement patches
starting to land (Clint, Imre, Jose, Umesh, Vandita, Mika).
- Display clean-up around VBT and the strap bits (Lucas).
- Try YCbCr420 color when RGB fails (Werner).
- More PSR fixes and improvements (Jose).
- Other generic display code clean-up (Jose, Ville).
- Use correct downstream caps for check Src-Ctl mode for PCON (Ankit).
- Disable HiZ Raw Stall Optimization on broken gen7 (Simon).
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/YKVioeu0JkUAlR7y@intel.com
|
|
The put_user() and get_user() functions do checks on the address which is
passed to them. They check whether the address is actually a user-space
address and whether its fine to access it. They also call might_fault()
to indicate that they could fault and possibly sleep.
All of these checks are neither wanted nor needed in the #VC exception
handler, which can be invoked from almost any context and also for MMIO
instructions from kernel space on kernel memory. All the #VC handler
wants to know is whether a fault happened when the access was tried.
This is provided by __put_user()/__get_user(), which just do the access
no matter what. Also add comments explaining why __get_user() and
__put_user() are the best choice here and why it is safe to use them
in this context. Also explain why copy_to/from_user can't be used.
In addition, also revert commit
7024f60d6552 ("x86/sev-es: Handle string port IO to kernel memory properly")
because using __get_user()/__put_user() fixes the same problem while
the above commit introduced several problems:
1) It uses access_ok() which is only allowed in task context.
2) It uses memcpy() which has no fault handling at all and is
thus unsafe to use here.
[ bp: Fix up commit ID of the reverted commit above. ]
Fixes: f980f9c31a92 ("x86/sev-es: Compile early handler code into kernel image")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org # v5.10+
Link: https://lkml.kernel.org/r/20210519135251.30093-4-joro@8bytes.org
|
|
When emulating guest instructions for MMIO or IOIO accesses, the #VC
handler might get a page-fault and will not be able to complete. Forward
the page-fault in this case to the correct handler instead of killing
the machine.
Fixes: 0786138c78e7 ("x86/sev-es: Add a Runtime #VC Exception Handler")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org # v5.10+
Link: https://lkml.kernel.org/r/20210519135251.30093-3-joro@8bytes.org
|
|
sev_es_get_ghcb() is called from several places but only one of them
checks the return value. The reaction to returning NULL is always the
same: calling panic() and kill the machine.
Instead of adding checks to all call sites, move the panic() into the
function itself so that it will no longer return NULL.
Fixes: 0786138c78e7 ("x86/sev-es: Add a Runtime #VC Exception Handler")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org # v5.10+
Link: https://lkml.kernel.org/r/20210519135251.30093-2-joro@8bytes.org
|
|
The kernel pushes context on to the userspace stack to prepare for the
user's signal handler. When the user has supplied an alternate signal
stack, via sigaltstack(2), it is easy for the kernel to verify that the
stack size is sufficient for the current hardware context.
Check if writing the hardware context to the alternate stack will exceed
it's size. If yes, then instead of corrupting user-data and proceeding with
the original signal handler, an immediate SIGSEGV signal is delivered.
Refactor the stack pointer check code from on_sig_stack() and use the new
helper.
While the kernel allows new source code to discover and use a sufficient
alternate signal stack size, this check is still necessary to protect
binaries with insufficient alternate signal stack size from data
corruption.
Fixes: c2bc11f10a39 ("x86, AVX-512: Enable AVX-512 States Context Switch")
Reported-by: Florian Weimer <fweimer@redhat.com>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20210518200320.17239-6-chang.seok.bae@intel.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=153531
|
|
Historically, signal.h defines MINSIGSTKSZ (2KB) and SIGSTKSZ (8KB), for
use by all architectures with sigaltstack(2). Over time, the hardware state
size grew, but these constants did not evolve. Today, literal use of these
constants on several architectures may result in signal stack overflow, and
thus user data corruption.
A few years ago, the ARM team addressed this issue by establishing
getauxval(AT_MINSIGSTKSZ). This enables the kernel to supply a value
at runtime that is an appropriate replacement on current and future
hardware.
Add getauxval(AT_MINSIGSTKSZ) support to x86, analogous to the support
added for ARM in
94b07c1f8c39 ("arm64: signal: Report signal frame size to userspace via auxv").
Also, include a documentation to describe x86-specific auxiliary vectors.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20210518200320.17239-4-chang.seok.bae@intel.com
|
|
Signal frames do not have a fixed format and can vary in size when a number
of things change: supported XSAVE features, 32 vs. 64-bit apps, etc.
Add support for a runtime method for userspace to dynamically discover
how large a signal stack needs to be.
Introduce a new variable, max_frame_size, and helper functions for the
calculation to be used in a new user interface. Set max_frame_size to a
system-wide worst-case value, instead of storing multiple app-specific
values.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H.J. Lu <hjl.tools@gmail.com>
Link: https://lkml.kernel.org/r/20210518200320.17239-3-chang.seok.bae@intel.com
|
|
Don't abuse si_errno and deliver all of the perf data in _perf member
of siginfo_t.
Note: The data field in the perf data structures in a u64 to allow a
pointer to be encoded without needed to implement a 32bit and 64bit
version of the same structure. There already exists a 32bit and 64bit
versions siginfo_t, and the 32bit version can not include a 64bit
member as it only has 32bit alignment. So unsigned long is used in
siginfo_t instead of a u64 as unsigned long can encode a pointer on
all architectures linux supports.
v1: https://lkml.kernel.org/r/m11rarqqx2.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210503203814.25487-10-ebiederm@xmission.com
v3: https://lkml.kernel.org/r/20210505141101.11519-11-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-4-ebiederm@xmission.com
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
It turns out that linux uses si_trapno very sparingly, and as such it
can be considered extra information for a very narrow selection of
signals, rather than information that is present with every fault
reported in siginfo.
As such move si_trapno inside the union inside of _si_fault. This
results in no change in placement, and makes it eaiser
to extend _si_fault in the future as this reduces the number of
special cases. In particular with si_trapno included in the union it
is no longer a concern that the union must be pointer aligned on most
architectures because the union follows immediately after si_addr
which is a pointer.
This change results in a difference in siginfo field placement on
sparc and alpha for the fields si_addr_lsb, si_lower, si_upper,
si_pkey, and si_perf. These architectures do not implement the
signals that would use si_addr_lsb, si_lower, si_upper, si_pkey, and
si_perf. Further these architecture have not yet implemented the
userspace that would use si_perf.
The point of this change is in fact to correct these placement issues
before sparc or alpha grow userspace that cares. This change was
discussed[1] and the agreement is that this change is currently safe.
[1]: https://lkml.kernel.org/r/CAK8P3a0+uKYwL1NhY6Hvtieghba2hKYGD6hcKx5n8=4Gtt+pHA@mail.gmail.com
Acked-by: Marco Elver <elver@google.com>
v1: https://lkml.kernel.org/r/m1tunns7yf.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210505141101.11519-5-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-1-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
A bus lock can be thousands of cycles slower than atomic operation within
one cache line. It also disrupts performance on other cores. Malicious
users can generate multiple bus locks to degrade the whole system
performance.
The current mitigation is to kill the offending process, but for certain
scenarios it's desired to identify and throttle the offending application.
Add a system wide rate limit for bus locks. When the system detects bus
locks at a rate higher than N/sec (where N can be set by the kernel boot
argument in the range [1..1000]) any task triggering a bus lock will be
forced to sleep for at least 20ms until the overall system rate of bus
locks drops below the threshold.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210419214958.4035512-3-fenghua.yu@intel.com
|