summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)AuthorFilesLines
2019-12-05x86/fpu: Don't cache access to fpu_fpregs_owner_ctxSebastian Andrzej Siewior1-1/+1
commit 59c4bd853abcea95eccc167a7d7fd5f1a5f47b98 upstream. The state/owner of the FPU is saved to fpu_fpregs_owner_ctx by pointing to the context that is currently loaded. It never changed during the lifetime of a task - it remained stable/constant. After deferred FPU registers loading until return to userland was implemented, the content of fpu_fpregs_owner_ctx may change during preemption and must not be cached. This went unnoticed for some time and was now noticed, in particular since gcc 9 is caching that load in copy_fpstate_to_sigframe() and reusing it in the retry loop: copy_fpstate_to_sigframe() load fpu_fpregs_owner_ctx and save on stack fpregs_lock() copy_fpregs_to_sigframe() /* failed */ fpregs_unlock() *** PREEMPTION, another uses FPU, changes fpu_fpregs_owner_ctx *** fault_in_pages_writeable() /* succeed, retry */ fpregs_lock() __fpregs_load_activate() fpregs_state_valid() /* uses fpu_fpregs_owner_ctx from stack */ copy_fpregs_to_sigframe() /* succeeds, random FPU content */ This is a comparison of the assembly produced by gcc 9, without vs with this patch: | # arch/x86/kernel/fpu/signal.c:173: if (!access_ok(buf, size)) | cmpq %rdx, %rax # tmp183, _4 | jb .L190 #, |-# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |-#APP |-# 512 "arch/x86/include/asm/fpu/internal.h" 1 |- movq %gs:fpu_fpregs_owner_ctx,%rax #, pfo_ret__ |-# 0 "" 2 |-#NO_APP |- movq %rax, -88(%rbp) # pfo_ret__, %sfp … |-# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |- movq -88(%rbp), %rcx # %sfp, pfo_ret__ |- cmpq %rcx, -64(%rbp) # pfo_ret__, %sfp |+# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |+#APP |+# 512 "arch/x86/include/asm/fpu/internal.h" 1 |+ movq %gs:fpu_fpregs_owner_ctx(%rip),%rax # fpu_fpregs_owner_ctx, pfo_ret__ |+# 0 "" 2 |+# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |+#NO_APP |+ cmpq %rax, -64(%rbp) # pfo_ret__, %sfp Use this_cpu_read() instead this_cpu_read_stable() to avoid caching of fpu_fpregs_owner_ctx during preemption points. The Fixes: tag points to the commit where deferred FPU loading was added. Since this commit, the compiler is no longer allowed to move the load of fpu_fpregs_owner_ctx somewhere else / outside of the locked section. A task preemption will change its value and stale content will be observed. [ bp: Massage. ] Debugged-by: Austin Clements <austin@google.com> Debugged-by: David Chase <drchase@golang.org> Debugged-by: Ian Lance Taylor <ian@airs.com> Fixes: 5f409e20b7945 ("x86/fpu: Defer FPU state load until return to userspace") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Borislav Petkov <bp@suse.de> Cc: Aubrey Li <aubrey.li@intel.com> Cc: Austin Clements <austin@google.com> Cc: Barret Rhoden <brho@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Chase <drchase@golang.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: ian@airs.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Bleecher Snyder <josharian@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191128085306.hxfa2o3knqtu4wfn@linutronix.de Link: https://bugzilla.kernel.org/show_bug.cgi?id=205663 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-05ARM: dts: stm32: Fix CAN RAM mapping on stm32mp157cChristophe Roullier1-2/+2
[ Upstream commit 9df50c2e16de7fd739d11d37303afec9e573b46f ] Split the 10Kbytes CAN message RAM to be able to use simultaneously FDCAN1 and FDCAN2 instances. First 5Kbytes are allocated to FDCAN1 and last 5Kbytes are used for FDCAN2. To do so, set the offset to 0x1400 in mram-cfg for FDCAN2. Fixes: d44d6e021301 ("ARM: dts: stm32: change CAN RAM mapping on stm32mp157c") Signed-off-by: Christophe Roullier <christophe.roullier@st.com> Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05x86/tsc: Respect tsc command line paraemeter for clocksource_tsc_earlyMichael Zhivich1-0/+3
[ Upstream commit 63ec58b44fcc05efd1542045abd7faf056ac27d9 ] The introduction of clocksource_tsc_early broke the functionality of "tsc=reliable" and "tsc=nowatchdog" command line parameters, since clocksource_tsc_early is unconditionally registered with CLOCK_SOURCE_MUST_VERIFY and thus put on the watchdog list. This can cause the TSC to be declared unstable during boot: clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc-early' as unstable because the skew is too large: clocksource: 'refined-jiffies' wd_now: fffb7018 wd_last: fffb6e9d mask: ffffffff clocksource: 'tsc-early' cs_now: 68a6a7070f6a0 cs_last: 68a69ab6f74d6 mask: ffffffffffffffff tsc: Marking TSC unstable due to clocksource watchdog The corresponding elapsed times are cs_nsec=1224152026 and wd_nsec=378942392, so the watchdog differs from TSC by 0.84 seconds. This happens when HPET is not available and jiffies are used as the TSC watchdog instead and the jiffies update is not happening due to lost timer interrupts in periodic mode, which can happen e.g. with expensive debug mechanisms enabled or under massive overload conditions in virtualized environments. Before the introduction of the early TSC clocksource the command line parameters "tsc=reliable" and "tsc=nowatchdog" could be used to work around this issue. Restore the behaviour by disabling the watchdog if requested on the kernel command line. [ tglx: Clarify changelog ] Fixes: aa83c45762a24 ("x86/tsc: Introduce early tsc clocksource") Signed-off-by: Michael Zhivich <mzhivich@akamai.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191024175945.14338-1-mzhivich@akamai.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05arm64: dts: zii-ultra: fix ARM regulator GPIO handleLucas Stach1-1/+1
[ Upstream commit f852497c9a07ec9913bb3f3db5f096a8e2ab7e03 ] The GPIO handle is referencing the wrong GPIO, so the voltage did not actually change as intended. The pinmux is already correct, so just correct the GPIO number. Fixes: 4a13b3bec3b4 (arm64: dts: imx: add Zii Ultra board support) Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05x86/resctrl: Prevent NULL pointer dereference when reading mondataXiaochen Shen1-0/+4
[ Upstream commit 26467b0f8407cbd628fa5b7bcfd156e772004155 ] When a mon group is being deleted, rdtgrp->flags is set to RDT_DELETED in rdtgroup_rmdir_mon() firstly. The structure of rdtgrp will be freed until rdtgrp->waitcount is dropped to 0 in rdtgroup_kn_unlock() later. During the window of deleting a mon group, if an application calls rdtgroup_mondata_show() to read mondata under this mon group, 'rdtgrp' returned from rdtgroup_kn_lock_live() is a NULL pointer when rdtgrp->flags is RDT_DELETED. And then 'rdtgrp' is passed in this path: rdtgroup_mondata_show() --> mon_event_read() --> mon_event_count(). Thus it results in NULL pointer dereference in mon_event_count(). Check 'rdtgrp' in rdtgroup_mondata_show(), and return -ENOENT immediately when reading mondata during the window of deleting a mon group. Fixes: d89b7379015f ("x86/intel_rdt/cqm: Add mon_data") Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Fenghua Yu <fenghua.yu@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: pei.p.jia@intel.com Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/1572326702-27577-1-git-send-email-xiaochen.shen@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05powerpc/bpf: Fix tail call implementationEric Dumazet1-0/+13
[ Upstream commit 7de086909365cd60a5619a45af3f4152516fd75c ] We have seen many crashes on powerpc hosts while loading bpf programs. The problem here is that bpf_int_jit_compile() does a first pass to compute the program length. Then it allocates memory to store the generated program and calls bpf_jit_build_body() a second time (and a third time later) What I have observed is that the second bpf_jit_build_body() could end up using few more words than expected. If bpf_jit_binary_alloc() put the space for the program at the end of the allocated page, we then write on a non mapped memory. It appears that bpf_jit_emit_tail_call() calls bpf_jit_emit_common_epilogue() while ctx->seen might not be stable. Only after the second pass we can be sure ctx->seen wont be changed. Trying to avoid a second pass seems quite complex and probably not worth it. Fixes: ce0761419faef ("powerpc/bpf: Implement support for tail calls") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20191101033444.143741-1-edumazet@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05ARM: dts: sun8i-a83t-tbs-a711: Fix WiFi resume from suspendOndrej Jirman1-0/+1
[ Upstream commit e614f341253f8541baf0230a8dc6a016b544b1e2 ] Without enabling keep-power-in-suspend, we can't wake the device up using WOL packet, and the log is flooded with these messages on resume: sunxi-mmc 1c10000.mmc: send stop command failed sunxi-mmc 1c10000.mmc: data error, sending stop command sunxi-mmc 1c10000.mmc: send stop command failed sunxi-mmc 1c10000.mmc: data error, sending stop command So to make the WiFi really a wakeup-source, we need to keep it powered during suspend. Fixes: 0e23372080def7 ("arm: dts: sun8i: Add the TBS A711 tablet devicetree") Signed-off-by: Ondrej Jirman <megous@megous.com> Signed-off-by: Maxime Ripard <mripard@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05arm64: dts: imx8mm: fix compatible string for sdmaShengjiu Wang1-3/+3
[ Upstream commit e346ff93f02b1ba81e976d4e67ec56582dbdf7f1 ] SDMA in i.MX8MM should use same configuration as i.MX8MQ So need to change compatible string to be "fsl,imx8mq-sdma". Fixes: a05ea40eb384 ("arm64: dts: imx: Add i.mx8mm dtsi support") Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05ARM: dts: imx6qdl-sabreauto: Fix storm of accelerometer interruptsFabio Estevam1-0/+8
[ Upstream commit 7e5d0bf6afcc7bd72f78e7f33570e2e0945624f0 ] Since commit a211b8c55f3c ("ARM: dts: imx6qdl-sabreauto: Add sensors") a storm of accelerometer interrupts is seen: [ 114.211283] irq 260: nobody cared (try booting with the "irqpoll" option) [ 114.218108] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.4 #1 [ 114.223960] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) [ 114.230531] [<c0112858>] (unwind_backtrace) from [<c010cdc8>] (show_stack+0x10/0x14) [ 114.238301] [<c010cdc8>] (show_stack) from [<c0c1aa1c>] (dump_stack+0xd8/0x110) [ 114.245644] [<c0c1aa1c>] (dump_stack) from [<c0193594>] (__report_bad_irq+0x30/0xc0) [ 114.253417] [<c0193594>] (__report_bad_irq) from [<c01933ac>] (note_interrupt+0x108/0x298) [ 114.261707] [<c01933ac>] (note_interrupt) from [<c018ffe4>] (handle_irq_event_percpu+0x70/0x80) [ 114.270433] [<c018ffe4>] (handle_irq_event_percpu) from [<c019002c>] (handle_irq_event+0x38/0x5c) [ 114.279326] [<c019002c>] (handle_irq_event) from [<c019438c>] (handle_level_irq+0xc8/0x154) [ 114.287701] [<c019438c>] (handle_level_irq) from [<c018eda0>] (generic_handle_irq+0x20/0x34) [ 114.296166] [<c018eda0>] (generic_handle_irq) from [<c0534214>] (mxc_gpio_irq_handler+0x30/0xf0) [ 114.304975] [<c0534214>] (mxc_gpio_irq_handler) from [<c0534334>] (mx3_gpio_irq_handler+0x60/0xb0) [ 114.313955] [<c0534334>] (mx3_gpio_irq_handler) from [<c018eda0>] (generic_handle_irq+0x20/0x34) [ 114.322762] [<c018eda0>] (generic_handle_irq) from [<c018f3ac>] (__handle_domain_irq+0x64/0xe0) [ 114.331485] [<c018f3ac>] (__handle_domain_irq) from [<c05215a8>] (gic_handle_irq+0x4c/0xa8) [ 114.339862] [<c05215a8>] (gic_handle_irq) from [<c0101a70>] (__irq_svc+0x70/0x98) [ 114.347361] Exception stack(0xc1301ec0 to 0xc1301f08) [ 114.352435] 1ec0: 00000001 00000006 00000000 c130c340 00000001 c130f688 9785636d c13ea2e8 [ 114.360635] 1ee0: 9784907d 0000001a eaf99d78 0000001a 00000000 c1301f10 c0182b00 c0878de4 [ 114.368830] 1f00: 20000013 ffffffff [ 114.372349] [<c0101a70>] (__irq_svc) from [<c0878de4>] (cpuidle_enter_state+0x168/0x5f4) [ 114.380464] [<c0878de4>] (cpuidle_enter_state) from [<c08792ac>] (cpuidle_enter+0x28/0x38) [ 114.388751] [<c08792ac>] (cpuidle_enter) from [<c015ef9c>] (do_idle+0x224/0x2a8) [ 114.396168] [<c015ef9c>] (do_idle) from [<c015f3b8>] (cpu_startup_entry+0x18/0x20) [ 114.403765] [<c015f3b8>] (cpu_startup_entry) from [<c1200e54>] (start_kernel+0x43c/0x500) [ 114.411958] handlers: [ 114.414302] [<a01028b8>] irq_default_primary_handler threaded [<fd7a3b08>] mma8452_interrupt [ 114.422974] Disabling IRQ #260 CPU0 CPU1 .... 260: 100001 0 gpio-mxc 31 Level mma8451 The MMA8451 interrupt triggers as low level, so the GPIO6_IO31 pin needs to activate its pull up, otherwise it will stay always at low level generating multiple interrupts. The current device tree does not configure the IOMUX for this pin, so it uses whathever comes configured from the bootloader. The IOMUXC_SW_PAD_CTL_PAD_EIM_BCLK register value comes as 0x8000 from the bootloader, which has PKE bit cleared, hence disabling the pull-up. Instead of relying on a previous configuration from the bootloader, configure the GPIO6_IO31 pin with pull-up enabled in order to fix this problem. Fixes: a211b8c55f3c ("ARM: dts: imx6qdl-sabreauto: Add sensors") Signed-off-by: Fabio Estevam <festevam@gmail.com> Reviewed-By: Leonard Crestez <leonard.crestez@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05arm64: dts: ls1028a: fix a compatible issueYuantian Tang1-1/+1
[ Upstream commit 7eb3894b2fac978f811684e3ccb3cb0ad7820bef ] The I2C multiplexer used on ls1028aqds is PCA9547, not PCA9847. If the wrong compatible was used, this chip will not be able to be probed correctly and hence fail to work. Signed-off-by: Yuantian Tang <andy.tang@nxp.com> Acked-by: Li Yang <leoyang.li@nxp.com> Fixes: 8897f3255c9c ("arm64: dts: Add support for NXP LS1028A SoC") Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-29KVM: PPC: Book3S HV: Flush link stack on guest exit to host kernelMichael Ellerman3-0/+41
commit af2e8c68b9c5403f77096969c516f742f5bb29e0 upstream. On some systems that are vulnerable to Spectre v2, it is up to software to flush the link stack (return address stack), in order to protect against Spectre-RSB. When exiting from a guest we do some house keeping and then potentially exit to C code which is several stack frames deep in the host kernel. We will then execute a series of returns without preceeding calls, opening up the possiblity that the guest could have poisoned the link stack, and direct speculative execution of the host to a gadget of some sort. To prevent this we add a flush of the link stack on exit from a guest. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29powerpc/book3s64: Fix link stack flush on context switchMichael Ellerman4-4/+54
commit 39e72bf96f5847ba87cc5bd7a3ce0fed813dc9ad upstream. In commit ee13cb249fab ("powerpc/64s: Add support for software count cache flush"), I added support for software to flush the count cache (indirect branch cache) on context switch if firmware told us that was the required mitigation for Spectre v2. As part of that code we also added a software flush of the link stack (return address stack), which protects against Spectre-RSB between user processes. That is all correct for CPUs that activate that mitigation, which is currently Power9 Nimbus DD2.3. What I got wrong is that on older CPUs, where firmware has disabled the count cache, we also need to flush the link stack on context switch. To fix it we create a new feature bit which is not set by firmware, which tells us we need to flush the link stack. We set that when firmware tells us that either of the existing Spectre v2 mitigations are enabled. Then we adjust the patching code so that if we see that feature bit we enable the link stack flush. If we're also told to flush the count cache in software then we fall through and do that also. On the older CPUs we don't need to do do the software count cache flush, firmware has disabled it, so in that case we patch in an early return after the link stack flush. The naming of some of the functions is awkward after this patch, because they're called "count cache" but they also do link stack. But we'll fix that up in a later commit to ease backporting. This is the fix for CVE-2019-18660. Reported-by: Anthony Steinhauser <asteinhauser@google.com> Fixes: ee13cb249fab ("powerpc/64s: Add support for software count cache flush") Cc: stable@vger.kernel.org # v4.4+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29powerpc/64s: support nospectre_v2 cmdline optionChristopher M. Riedl1-3/+16
commit d8f0e0b073e1ec52a05f0c2a56318b47387d2f10 upstream. Add support for disabling the kernel implemented spectre v2 mitigation (count cache flush on context switch) via the nospectre_v2 and mitigations=off cmdline options. Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190524024647.381-1-cmr@informatik.wtf Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/entry/32: Fix FIXUP_ESPFIX_STACK with user CR3Andy Lutomirski1-3/+18
commit 4a13b0e3e10996b9aa0b45a764ecfe49f6fcd360 upstream. UNWIND_ESPFIX_STACK needs to read the GDT, and the GDT mapping that can be accessed via %fs is not mapped in the user pagetables. Use SGDT to find the cpu_entry_area mapping and read the espfix offset from that instead. Reported-and-tested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make ↵Ingo Molnar3-10/+14
the CPU_ENTRY_AREA_PAGES assert precise commit 05b042a1944322844eaae7ea596d5f154166d68a upstream. When two recent commits that increased the size of the 'struct cpu_entry_area' were merged in -tip, the 32-bit defconfig build started failing on the following build time assert: ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_189’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE arch/x86/mm/cpu_entry_area.c:189:2: note: in expansion of macro ‘BUILD_BUG_ON’ In function ‘setup_cpu_entry_area_ptes’, Which corresponds to the following build time assert: BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE); The purpose of this assert is to sanity check the fixed-value definition of CPU_ENTRY_AREA_PAGES arch/x86/include/asm/pgtable_32_types.h: #define CPU_ENTRY_AREA_PAGES (NR_CPUS * 41) The '41' is supposed to match sizeof(struct cpu_entry_area)/PAGE_SIZE, which value we didn't want to define in such a low level header, because it would cause dependency hell. Every time the size of cpu_entry_area is changed, we have to adjust CPU_ENTRY_AREA_PAGES accordingly - and this assert is checking that constraint. But the assert is both imprecise and buggy, primarily because it doesn't include the single readonly IDT page that is mapped at CPU_ENTRY_AREA_BASE (which begins at a PMD boundary). This bug was hidden by the fact that by accident CPU_ENTRY_AREA_PAGES is defined too large upstream (v5.4-rc8): #define CPU_ENTRY_AREA_PAGES (NR_CPUS * 40) While 'struct cpu_entry_area' is 155648 bytes, or 38 pages. So we had two extra pages, which hid the bug. The following commit (not yet upstream) increased the size to 40 pages: x86/iopl: ("Restrict iopl() permission scope") ... but increased CPU_ENTRY_AREA_PAGES only 41 - i.e. shortening the gap to just 1 extra page. Then another not-yet-upstream commit changed the size again: 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Which increased the cpu_entry_area size from 38 to 39 pages, but didn't change CPU_ENTRY_AREA_PAGES (kept it at 40). This worked fine, because we still had a page left from the accidental 'reserve'. But when these two commits were merged into the same tree, the combined size of cpu_entry_area grew from 38 to 40 pages, while CPU_ENTRY_AREA_PAGES finally caught up to 40 as well. Which is fine in terms of functionality, but the assert broke: BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE); because CPU_ENTRY_AREA_MAP_SIZE is the total size of the area, which is 1 page larger due to the IDT page. To fix all this, change the assert to two precise asserts: BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE); BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE); This takes the IDT page into account, and also connects the size-based define of CPU_ENTRY_AREA_TOTAL_SIZE with the address-subtraction based define of CPU_ENTRY_AREA_MAP_SIZE. Also clean up some of the names which made it rather confusing: - 'CPU_ENTRY_AREA_TOT_SIZE' wasn't actually the 'total' size of the cpu-entry-area, but the per-cpu array size, so rename this to CPU_ENTRY_AREA_ARRAY_SIZE. - Introduce CPU_ENTRY_AREA_TOTAL_SIZE that _is_ the total mapping size, with the IDT included. - Add comments where '+1' denotes the IDT mapping - it wasn't obvious and took me about 3 hours to decode... Finally, because this particular commit is actually applied after this patch: 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Fix the CPU_ENTRY_AREA_PAGES value from 40 pages to the correct 39 pages. All future commits that change cpu_entry_area will have to adjust this value precisely. As a side note, we should probably attempt to remove CPU_ENTRY_AREA_PAGES and derive its value directly from the structure, without causing header hell - but that is an adventure for another day! :-) Fixes: 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/entry/32: Fix NMI vs ESPFIXPeter Zijlstra1-12/+41
commit 895429076512e9d1cf5428181076299c90713159 upstream. When the NMI lands on an ESPFIX_SS, we are on the entry stack and must swizzle, otherwise we'll run do_nmi() on the entry stack, which is BAD. Also, similar to the normal exception path, we need to correct the ESPFIX magic before leaving the entry stack, otherwise pt_regs will present a non-flat stack pointer. Tested by running sigreturn_32 concurrent with perf-record. Fixes: e5862d0515ad ("x86/entry/32: Leave the kernel via trampoline stack") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/entry/32: Unwind the ESPFIX stack earlier on exception entryAndy Lutomirski1-14/+16
commit a1a338e5b6fe9e0a39c57c232dc96c198bb53e47 upstream. Right now, we do some fancy parts of the exception entry path while SS might have a nonzero base: we fill in regs->ss and regs->sp, and we consider switching to the kernel stack. This results in regs->ss and regs->sp referring to a non-flat stack and it may result in overflowing the entry stack. The former issue means that we can try to call iret_exc on a non-flat stack, which doesn't work. Tested with selftests/x86/sigreturn_32. Fixes: 45d7b255747c ("x86/entry/32: Enter the kernel via trampoline stack") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/entry/32: Move FIXUP_FRAME after pushing %fs in SAVE_ALLAndy Lutomirski1-31/+35
commit 82cb8a0b1d8d07817b5d59f7fa1438e1fceafab2 upstream. This will allow us to get percpu access working before FIXUP_FRAME, which will allow us to unwind ESPFIX earlier. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/entry/32: Use %ss segment where requiredAndy Lutomirski1-5/+14
commit 4c4fd55d3d59a41ddfa6ecba7e76928921759f43 upstream. When re-building the IRET frame we use %eax as an destination %esp, make sure to then also match the segment for when there is a nonzero SS base (ESPFIX). [peterz: Changelog and minor edits] Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/entry/32: Fix IRET exceptionPeter Zijlstra1-1/+1
commit 40ad2199580e248dce2a2ebb722854180c334b9e upstream. As reported by Lai, the commit 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") wrecked the IRET EXTABLE entry by making .Lirq_return not point at IRET. Fix this by placing IRET_FRAME in RESTORE_REGS, to mirror how FIXUP_FRAME is part of SAVE_ALL. Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") Reported-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/cpu_entry_area: Add guard page for entry stack on 32bitThomas Gleixner1-1/+5
commit 880a98c339961eaa074393e3a2117cbe9125b8bb upstream. The entry stack in the cpu entry area is protected against overflow by the readonly GDT on 64-bit, but on 32-bit the GDT needs to be writeable and therefore does not trigger a fault on stack overflow. Add a guard page. Fixes: c482feefe1ae ("x86/entry/64: Make cpu_entry_area.tss read-only") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/pti/32: Size initial_page_table correctlyThomas Gleixner1-0/+10
commit f490e07c53d66045d9d739e134145ec9b38653d3 upstream. Commit 945fd17ab6ba ("x86/cpu_entry_area: Sync cpu_entry_area to initial_page_table") introduced the sync for the initial page table for 32bit. sync_initial_page_table() uses clone_pgd_range() which does the update for the kernel page table. If PTI is enabled it also updates the user space page table counterpart, which is assumed to be in the next page after the target PGD. At this point in time 32-bit did not have PTI support, so the user space page table update was not taking place. The support for PTI on 32-bit which was introduced later on, did not take that into account and missed to add the user space counter part for the initial page table. As a consequence sync_initial_page_table() overwrites any data which is located in the page behing initial_page_table causing random failures, e.g. by corrupting doublefault_tss and wreckaging the doublefault handler on 32bit. Fix it by adding a "user" page table right after initial_page_table. Fixes: 7757d607c6b3 ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joerg Roedel <jroedel@suse.de> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/doublefault/32: Fix stack canaries in the double fault handlerAndy Lutomirski1-0/+3
commit 3580d0b29cab08483f84a16ce6a1151a1013695f upstream. The double fault TSS was missing GS setup, which is needed for stack canaries to work. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/xen/32: Simplify ring check in xen_iret_crit_fixup()Jan Beulich1-11/+4
commit 922eea2ce5c799228d9ff1be9890e6873ce8fff6 upstream. This can be had with two instead of six insns, by just checking the high CS.RPL bit. Also adjust the comment - there would be no #GP in the mentioned cases, as there's no segment limit violation or alike. Instead there'd be #PF, but that one reports the target EIP of said branch, not the address of the branch insn itself. Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lkml.kernel.org/r/a5986837-01eb-7bf8-bf42-4d3084d6a1f5@suse.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/xen/32: Make xen_iret_crit_fixup() independent of frame layoutJan Beulich2-55/+33
commit 29b810f5a5ec127d3143770098e05981baa3eb77 upstream. Now that SS:ESP always get saved by SAVE_ALL, this also needs to be accounted for in xen_iret_crit_fixup(). Otherwise the old_ax value gets interpreted as EFLAGS, and hence VM86 mode appears to be active all the time, leading to random "vm86_32: no user_vm86: BAD" log messages alongside processes randomly crashing. Since following the previous model (sitting after SAVE_ALL) would further complicate the code _and_ retain the dependency of xen_iret_crit_fixup() on frame manipulations done by entry_32.S, switch things around and do the adjustment ahead of SAVE_ALL. Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Stable Team <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/32d8713d-25a7-84ab-b74b-aa3e88abce6b@suse.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/stackframe/32: Repair 32-bit Xen PVJan Beulich2-2/+14
commit 81ff2c37f9e5d77593928df0536d86443195fd64 upstream. Once again RPL checks have been introduced which don't account for a 32-bit kernel living in ring 1 when running in a PV Xen domain. The case in FIXUP_FRAME has been preventing boot. Adjust BUG_IF_WRONG_CR3 as well to guard against future uses of the macro on a code path reachable when running in PV mode under Xen; I have to admit that I stopped at a certain point trying to figure out whether there are present ones. Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Stable Team <stable@vger.kernel.org> Link: https://lore.kernel.org/r/0fad341f-b7f5-f859-d55d-f0084ee7087e@suse.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/speculation: Fix redundant MDS mitigation messageWaiman Long1-0/+13
commit cd5a2aa89e847bdda7b62029d94e95488d73f6b2 upstream. Since MDS and TAA mitigations are inter-related for processors that are affected by both vulnerabilities, the followiing confusing messages can be printed in the kernel log: MDS: Vulnerable MDS: Mitigation: Clear CPU buffers To avoid the first incorrect message, defer the printing of MDS mitigation after the TAA mitigation selection has been done. However, that has the side effect of printing TAA mitigation first before MDS mitigation. [ bp: Check box is affected/mitigations are disabled first before printing and massage. ] Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Mark Gross <mgross@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191115161445.30809-3-longman@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/speculation: Fix incorrect MDS/TAA mitigation statusWaiman Long1-2/+15
commit 64870ed1b12e235cfca3f6c6da75b542c973ff78 upstream. For MDS vulnerable processors with TSX support, enabling either MDS or TAA mitigations will enable the use of VERW to flush internal processor buffers at the right code path. IOW, they are either both mitigated or both not. However, if the command line options are inconsistent, the vulnerabilites sysfs files may not report the mitigation status correctly. For example, with only the "mds=off" option: vulnerabilities/mds:Vulnerable; SMT vulnerable vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable The mds vulnerabilities file has wrong status in this case. Similarly, the taa vulnerability file will be wrong with mds mitigation on, but taa off. Change taa_select_mitigation() to sync up the two mitigation status and have them turned off if both "mds=off" and "tsx_async_abort=off" are present. Update documentation to emphasize the fact that both "mds=off" and "tsx_async_abort=off" have to be specified together for processors that are affected by both TAA and MDS to be effective. [ bp: Massage and add kernel-parameters.txt change too. ] Fixes: 1b42f017415b ("x86/speculation/taa: Add mitigation for TSX Async Abort") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: linux-doc@vger.kernel.org Cc: Mark Gross <mgross@linux.intel.com> Cc: <stable@vger.kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191115161445.30809-2-longman@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29x86/insn: Fix awk regexp warningsAlexander Kapshuk1-2/+2
commit 700c1018b86d0d4b3f1f2d459708c0cdf42b521d upstream. gawk 5.0.1 generates the following regexp warnings: GEN /home/sasha/torvalds/tools/objtool/arch/x86/lib/inat-tables.c awk: ../arch/x86/tools/gen-insn-attr-x86.awk:260: warning: regexp escape sequence `\:' is not a known regexp operator awk: ../arch/x86/tools/gen-insn-attr-x86.awk:350: (FILENAME=../arch/x86/lib/x86-opcode-map.txt FNR=41) warning: regexp escape sequence `\&' is not a known regexp operator Ealier versions of gawk are not known to generate these warnings. The gawk manual referenced below does not list characters ':' and '&' as needing escaping, so 'unescape' them. See https://www.gnu.org/software/gawk/manual/html_node/Escape-Sequences.html for more info. Running diff on the output generated by the script before and after applying the patch reported no differences. [ bp: Massage commit message. ] [ Caught the respective tools header discrepancy. ] Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Alexander Kapshuk <alexander.kapshuk@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190924044659.3785-1-alexander.kapshuk@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29ARM: 8904/1: skip nomap memblocks while finding the lowmem/highmem boundaryChester Lin1-0/+3
commit 1d31999cf04c21709f72ceb17e65b54a401330da upstream. adjust_lowmem_bounds() checks every memblocks in order to find the boundary between lowmem and highmem. However some memblocks could be marked as NOMAP so they are not used by kernel, which should be skipped while calculating the boundary. Signed-off-by: Chester Lin <clin@suse.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-24arm64: uaccess: Ensure PAN is re-enabled after unhandled uaccess faultPavel Tatashin4-0/+4
commit 94bb804e1e6f0a9a77acf20d7c70ea141c6c821e upstream. A number of our uaccess routines ('__arch_clear_user()' and '__arch_copy_{in,from,to}_user()') fail to re-enable PAN if they encounter an unhandled fault whilst accessing userspace. For CPUs implementing both hardware PAN and UAO, this bug has no effect when both extensions are in use by the kernel. For CPUs implementing hardware PAN but not UAO, this means that a kernel using hardware PAN may execute portions of code with PAN inadvertently disabled, opening us up to potential security vulnerabilities that rely on userspace access from within the kernel which would usually be prevented by this mechanism. In other words, parts of the kernel run the same way as they would on a CPU without PAN implemented/emulated at all. For CPUs not implementing hardware PAN and instead relying on software emulation via 'CONFIG_ARM64_SW_TTBR0_PAN=y', the impact is unfortunately much worse. Calling 'schedule()' with software PAN disabled means that the next task will execute in the kernel using the page-table and ASID of the previous process even after 'switch_mm()', since the actual hardware switch is deferred until return to userspace. At this point, or if there is a intermediate call to 'uaccess_enable()', the page-table and ASID of the new process are installed. Sadly, due to the changes introduced by KPTI, this is not an atomic operation and there is a very small window (two instructions) where the CPU is configured with the page-table of the old task and the ASID of the new task; a speculative access in this state is disastrous because it would corrupt the TLB entries for the new task with mappings from the previous address space. As Pavel explains: | I was able to reproduce memory corruption problem on Broadcom's SoC | ARMv8-A like this: | | Enable software perf-events with PERF_SAMPLE_CALLCHAIN so userland's | stack is accessed and copied. | | The test program performed the following on every CPU and forking | many processes: | | unsigned long *map = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE, | MAP_SHARED | MAP_ANONYMOUS, -1, 0); | map[0] = getpid(); | sched_yield(); | if (map[0] != getpid()) { | fprintf(stderr, "Corruption detected!"); | } | munmap(map, PAGE_SIZE); | | From time to time I was getting map[0] to contain pid for a | different process. Ensure that PAN is re-enabled when returning after an unhandled user fault from our uaccess routines. Cc: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Cc: <stable@vger.kernel.org> Fixes: 338d4f49d6f7 ("arm64: kernel: Add support for Privileged Access Never") Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> [will: rewrote commit message] Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-20x86/quirks: Disable HPET on Intel Coffe Lake platformsKai-Heng Feng1-0/+2
commit fc5db58539b49351e76f19817ed1102bf7c712d0 upstream. Some Coffee Lake platforms have a skewed HPET timer once the SoCs entered PC10, which in consequence marks TSC as unstable because HPET is used as watchdog clocksource for TSC. Harry Pan tried to work around it in the clocksource watchdog code [1] thereby creating a circular dependency between HPET and TSC. This also ignores the fact, that HPET is not only unsuitable as watchdog clocksource on these systems, it becomes unusable in general. Disable HPET on affected platforms. Suggested-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203183 Link: https://lore.kernel.org/lkml/20190516090651.1396-1-harry.pan@intel.com/ [1] Link: https://lkml.kernel.org/r/20191016103816.30650-1-kai.heng.feng@canonical.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-20KVM: MMU: Do not treat ZONE_DEVICE pages as being reservedSean Christopherson1-4/+4
commit a78986aae9b2988f8493f9f65a587ee433e83bc3 upstream. Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and instead manually handle ZONE_DEVICE on a case-by-case basis. For things like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal pages, e.g. put pages grabbed via gup(). But for flows such as setting A/D bits or shifting refcounts for transparent huge pages, KVM needs to to avoid processing ZONE_DEVICE pages as the flows in question lack the underlying machinery for proper handling of ZONE_DEVICE pages. This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup() when running a KVM guest backed with /dev/dax memory, as KVM straight up doesn't put any references to ZONE_DEVICE pages acquired by gup(). Note, Dan Williams proposed an alternative solution of doing put_page() on ZONE_DEVICE pages immediately after gup() in order to simplify the auditing needed to ensure is_zone_device_page() is called if and only if the backing device is pinned (via gup()). But that approach would break kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til unmap() when accessing guest memory, unlike KVM's secondary MMU, which coordinates with mmu_notifier invalidations to avoid creating stale page references, i.e. doesn't rely on pages being pinned. [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl Reported-by: Adam Borowski <kilobyte@angband.pl> Analyzed-by: David Hildenbrand <david@redhat.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12kvm: x86: mmu: Recovery of shattered NX large pagesJunaid Shahid4-0/+148
commit 1aa9b9572b10529c2e64e2b8f44025d86e124308 upstream. The page table pages corresponding to broken down large pages are zapped in FIFO order, so that the large page can potentially be recovered, if it is not longer being used for execution. This removes the performance penalty for walking deeper EPT page tables. By default, one large page will last about one hour once the guest reaches a steady state. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12kvm: mmu: ITLB_MULTIHIT mitigationPaolo Bonzini5-13/+181
commit b8e8c8303ff28c61046a4d0f6ea99aea609a7dc0 upstream. With some Intel processors, putting the same virtual address in the TLB as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit and cause the processor to issue a machine check resulting in a CPU lockup. Unfortunately when EPT page tables use huge pages, it is possible for a malicious guest to cause this situation. Add a knob to mark huge pages as non-executable. When the nx_huge_pages parameter is enabled (and we are using EPT), all huge pages are marked as NX. If the guest attempts to execute in one of those pages, the page is broken down into 4K pages, which are then marked executable. This is not an issue for shadow paging (except nested EPT), because then the host is in control of TLB flushes and the problematic situation cannot happen. With nested EPT, again the nested guest can cause problems shadow and direct EPT is treated in the same way. [ tglx: Fixup default to auto and massage wording a bit ] Originally-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12kvm: x86, powerpc: do not allow clearing largepages debugfs entryPaolo Bonzini2-7/+7
commit 833b45de69a6016c4b0cebe6765d526a31a81580 upstream. The largepages debugfs entry is incremented/decremented as shadow pages are created or destroyed. Clearing it will result in an underflow, which is harmless to KVM but ugly (and could be misinterpreted by tools that use debugfs information), so make this particular statistic read-only. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: kvm-ppc@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12x86/cpu: Add Tremont to the cpu vulnerability whitelistPawan Gupta1-0/+2
commit cad14885a8d32c1c0d8eaa7bf5c0152a22b6080e upstream. Add the new cpu family ATOM_TREMONT_D to the cpu vunerability whitelist. ATOM_TREMONT_D is not affected by X86_BUG_ITLB_MULTIHIT. ATOM_TREMONT_D might have mitigations against other issues as well, but only the ITLB multihit mitigation is confirmed at this point. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12x86/bugs: Add ITLB_MULTIHIT bug infrastructureVineela Tummalapalli4-29/+55
commit db4d30fbb71b47e4ecb11c4efa5d8aad4b03dfae upstream. Some processors may incur a machine check error possibly resulting in an unrecoverable CPU lockup when an instruction fetch encounters a TLB multi-hit in the instruction TLB. This can occur when the page size is changed along with either the physical address or cache type. The relevant erratum can be found here: https://bugzilla.kernel.org/show_bug.cgi?id=205195 There are other processors affected for which the erratum does not fully disclose the impact. This issue affects both bare-metal x86 page tables and EPT. It can be mitigated by either eliminating the use of large pages or by using careful TLB invalidations when changing the page size in the page tables. Just like Spectre, Meltdown, L1TF and MDS, a new bit has been allocated in MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will be set on CPUs which are mitigated against this issue. Signed-off-by: Vineela Tummalapalli <vineela.tummalapalli@intel.com> Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUsJosh Poimboeuf1-4/+0
commit 012206a822a8b6ac09125bfaa210a95b9eb8f1c1 upstream. For new IBRS_ALL CPUs, the Enhanced IBRS check at the beginning of cpu_bugs_smt_update() causes the function to return early, unintentionally skipping the MDS and TAA logic. This is not a problem for MDS, because there appears to be no overlap between IBRS_ALL and MDS-affected CPUs. So the MDS mitigation would be disabled and nothing would need to be done in this function anyway. But for TAA, the TAA_MSG_SMT string will never get printed on Cascade Lake and newer. The check is superfluous anyway: when 'spectre_v2_enabled' is SPECTRE_V2_IBRS_ENHANCED, 'spectre_v2_user' is always SPECTRE_V2_USER_NONE, and so the 'spectre_v2_user' switch statement handles it appropriately by doing nothing. So just remove the check. Fixes: 1b42f017415b ("x86/speculation/taa: Add mitigation for TSX Async Abort") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12x86/tsx: Add config options to set tsx=on|off|autoMichal Hocko2-6/+61
commit db616173d787395787ecc93eef075fa975227b10 upstream. There is a general consensus that TSX usage is not largely spread while the history shows there is a non trivial space for side channel attacks possible. Therefore the tsx is disabled by default even on platforms that might have a safe implementation of TSX according to the current knowledge. This is a fair trade off to make. There are, however, workloads that really do benefit from using TSX and updating to a newer kernel with TSX disabled might introduce a noticeable regressions. This would be especially a problem for Linux distributions which will provide TAA mitigations. Introduce config options X86_INTEL_TSX_MODE_OFF, X86_INTEL_TSX_MODE_ON and X86_INTEL_TSX_MODE_AUTO to control the TSX feature. The config setting can be overridden by the tsx cmdline options. [ bp: Text cleanups from Josh. ] Suggested-by: Borislav Petkov <bpetkov@suse.de> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12x86/tsx: Add "auto" option to the tsx= cmdline parameterPawan Gupta1-1/+6
commit 7531a3596e3272d1f6841e0d601a614555dc6b65 upstream. Platforms which are not affected by X86_BUG_TAA may want the TSX feature enabled. Add "auto" option to the TSX cmdline parameter. When tsx=auto disable TSX when X86_BUG_TAA is present, otherwise enable TSX. More details on X86_BUG_TAA can be found here: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html [ bp: Extend the arg buffer to accommodate "auto\0". ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12kvm/x86: Export MDS_NO=0 to guests when TSX is enabledPawan Gupta1-0/+19
commit e1d38b63acd843cfdd4222bf19a26700fd5c699e upstream. Export the IA32_ARCH_CAPABILITIES MSR bit MDS_NO=0 to guests on TSX Async Abort(TAA) affected hosts that have TSX enabled and updated microcode. This is required so that the guests don't complain, "Vulnerable: Clear CPU buffers attempted, no microcode" when the host has the updated microcode to clear CPU buffers. Microcode update also adds support for MSR_IA32_TSX_CTRL which is enumerated by the ARCH_CAP_TSX_CTRL bit in IA32_ARCH_CAPABILITIES MSR. Guests can't do this check themselves when the ARCH_CAP_TSX_CTRL bit is not exported to the guests. In this case export MDS_NO=0 to the guests. When guests have CPUID.MD_CLEAR=1, they deploy MDS mitigation which also mitigates TAA. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12x86/speculation/taa: Add sysfs reporting for TSX Async AbortPawan Gupta1-0/+23
commit 6608b45ac5ecb56f9e171252229c39580cc85f0f upstream. Add the sysfs reporting file for TSX Async Abort. It exposes the vulnerability and the mitigation state similar to the existing files for the other hardware vulnerabilities. Sysfs file path is: /sys/devices/system/cpu/vulnerabilities/tsx_async_abort Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Reviewed-by: Mark Gross <mgross@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12x86/speculation/taa: Add mitigation for TSX Async AbortPawan Gupta6-2/+137
commit 1b42f017415b46c317e71d41c34ec088417a1883 upstream. TSX Async Abort (TAA) is a side channel vulnerability to the internal buffers in some Intel processors similar to Microachitectural Data Sampling (MDS). In this case, certain loads may speculatively pass invalid data to dependent operations when an asynchronous abort condition is pending in a TSX transaction. This includes loads with no fault or assist condition. Such loads may speculatively expose stale data from the uarch data structures as in MDS. Scope of exposure is within the same-thread and cross-thread. This issue affects all current processors that support TSX, but do not have ARCH_CAP_TAA_NO (bit 8) set in MSR_IA32_ARCH_CAPABILITIES. On CPUs which have their IA32_ARCH_CAPABILITIES MSR bit MDS_NO=0, CPUID.MD_CLEAR=1 and the MDS mitigation is clearing the CPU buffers using VERW or L1D_FLUSH, there is no additional mitigation needed for TAA. On affected CPUs with MDS_NO=1 this issue can be mitigated by disabling the Transactional Synchronization Extensions (TSX) feature. A new MSR IA32_TSX_CTRL in future and current processors after a microcode update can be used to control the TSX feature. There are two bits in that MSR: * TSX_CTRL_RTM_DISABLE disables the TSX sub-feature Restricted Transactional Memory (RTM). * TSX_CTRL_CPUID_CLEAR clears the RTM enumeration in CPUID. The other TSX sub-feature, Hardware Lock Elision (HLE), is unconditionally disabled with updated microcode but still enumerated as present by CPUID(EAX=7).EBX{bit4}. The second mitigation approach is similar to MDS which is clearing the affected CPU buffers on return to user space and when entering a guest. Relevant microcode update is required for the mitigation to work. More details on this approach can be found here: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html The TSX feature can be controlled by the "tsx" command line parameter. If it is force-enabled then "Clear CPU buffers" (MDS mitigation) is deployed. The effective mitigation state can be read from sysfs. [ bp: - massage + comments cleanup - s/TAA_MITIGATION_TSX_DISABLE/TAA_MITIGATION_TSX_DISABLED/g - Josh. - remove partial TAA mitigation in update_mds_branch_idle() - Josh. - s/tsx_async_abort_cmdline/tsx_async_abort_parse_cmdline/g ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12x86/cpu: Add a "tsx=" cmdline option with TSX disabled by defaultPawan Gupta5-1/+149
commit 95c5824f75f3ba4c9e8e5a4b1a623c95390ac266 upstream. Add a kernel cmdline parameter "tsx" to control the Transactional Synchronization Extensions (TSX) feature. On CPUs that support TSX control, use "tsx=on|off" to enable or disable TSX. Not specifying this option is equivalent to "tsx=off". This is because on certain processors TSX may be used as a part of a speculative side channel attack. Carve out the TSX controlling functionality into a separate compilation unit because TSX is a CPU feature while the TSX async abort control machinery will go to cpu/bugs.c. [ bp: - Massage, shorten and clear the arg buffer. - Clarifications of the tsx= possible options - Josh. - Expand on TSX_CTRL availability - Pawan. ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12x86/cpu: Add a helper function x86_read_arch_cap_msr()Pawan Gupta2-4/+13
commit 286836a70433fb64131d2590f4bf512097c255e1 upstream. Add a helper function to read the IA32_ARCH_CAPABILITIES MSR. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Reviewed-by: Mark Gross <mgross@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12x86/msr: Add the IA32_TSX_CTRL MSRPawan Gupta1-0/+5
commit c2955f270a84762343000f103e0640d29c7a96f3 upstream. Transactional Synchronization Extensions (TSX) may be used on certain processors as part of a speculative side channel attack. A microcode update for existing processors that are vulnerable to this attack will add a new MSR - IA32_TSX_CTRL to allow the system administrator the option to disable TSX as one of the possible mitigations. The CPUs which get this new MSR after a microcode upgrade are the ones which do not set MSR_IA32_ARCH_CAPABILITIES.MDS_NO (bit 5) because those CPUs have CPUID.MD_CLEAR, i.e., the VERW implementation which clears all CPU buffers takes care of the TAA case as well. [ Note that future processors that are not vulnerable will also support the IA32_TSX_CTRL MSR. ] Add defines for the new IA32_TSX_CTRL MSR and its bits. TSX has two sub-features: 1. Restricted Transactional Memory (RTM) is an explicitly-used feature where new instructions begin and end TSX transactions. 2. Hardware Lock Elision (HLE) is implicitly used when certain kinds of "old" style locks are used by software. Bit 7 of the IA32_ARCH_CAPABILITIES indicates the presence of the IA32_TSX_CTRL MSR. There are two control bits in IA32_TSX_CTRL MSR: Bit 0: When set, it disables the Restricted Transactional Memory (RTM) sub-feature of TSX (will force all transactions to abort on the XBEGIN instruction). Bit 1: When set, it disables the enumeration of the RTM and HLE feature (i.e. it will make CPUID(EAX=7).EBX{bit4} and CPUID(EAX=7).EBX{bit11} read as 0). The other TSX sub-feature, Hardware Lock Elision (HLE), is unconditionally disabled by the new microcode but still enumerated as present by CPUID(EAX=7).EBX{bit4}, unless disabled by IA32_TSX_CTRL_MSR[1] - TSX_CTRL_CPUID_CLEAR. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Reviewed-by: Mark Gross <mgross@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12arm64: errata: Update stale commentThierry Reding1-2/+2
[ Upstream commit 7a292b6c7c9c35afee01ce3b2248f705869d0ff1 ] Commit 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack thereof") renamed the caller of the install_bp_hardening_cb() function but forgot to update a comment, which can be confusing when trying to follow the code flow. Fixes: 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack thereof") Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-12ARM: dts: stm32: change joystick pinctrl definition on stm32mp157c-ev1Amelie Delaunay1-1/+0
[ Upstream commit f4d6e0f79bcde7810890563bac8e0f3479fe6d03 ] Pins used for joystick are all configured as input. "push-pull" is not a valid setting for an input pin. Fixes: a502b343ebd0 ("pinctrl: stmfx: update pinconf settings") Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com> Signed-off-by: Amelie Delaunay <amelie.delaunay@st.com> Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-12timekeeping/vsyscall: Update VDSO data unconditionallyHuacai Chen1-7/+0
[ Upstream commit 52338415cf4d4064ae6b8dd972dadbda841da4fa ] The update of the VDSO data is depending on __arch_use_vsyscall() returning True. This is a leftover from the attempt to map the features of various architectures 1:1 into generic code. The usage of __arch_use_vsyscall() in the actual vsyscall implementations got dropped and replaced by the requirement for the architecture code to return U64_MAX if the global clocksource is not usable in the VDSO. But the __arch_use_vsyscall() check in the update code stayed which causes the VDSO data to be stale or invalid when an architecture actually implements that function and returns False when the current clocksource is not usable in the VDSO. As a consequence the VDSO implementations of clock_getres(), time(), clock_gettime(CLOCK_.*_COARSE) operate on invalid data and return bogus information. Remove the __arch_use_vsyscall() check from the VDSO update function and update the VDSO data unconditionally. [ tglx: Massaged changelog and removed the now useless implementations in asm-generic/ARM64/MIPS ] Fixes: 44f57d788e7deecb50 ("timekeeping: Provide a generic update_vsyscall() implementation") Signed-off-by: Huacai Chen <chenhc@lemote.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Burton <paul.burton@mips.com> Cc: linux-mips@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1571887709-11447-1-git-send-email-chenhc@lemote.com Signed-off-by: Sasha Levin <sashal@kernel.org>