path: root/arch
Age | Commit message | Author | Files | Lines (-/+)
2019-10-17 | x86/asm: Fix MWAITX C-state hint value | Janakarajan Natarajan | 2 files, -3/+3
commit 454de1e7d970d6bc567686052329e4814842867c upstream. As per "AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions", MWAITX EAX[7:4]+1 specifies the optional hint of the optimized C-state. For C0 state, EAX[7:4] should be set to 0xf. Currently, a value of 0xf is set for EAX[3:0] instead of EAX[7:4]. Fix this by changing MWAITX_DISABLE_CSTATES from 0xf to 0xf0. This hasn't had any implications so far because setting reserved bits in EAX is simply ignored by the CPU. [ bp: Fixup comment in delay_mwaitx() and massage. ] Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "x86@kernel.org" <x86@kernel.org> Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191007190011.4859-1-Janakarajan.Natarajan@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
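A minimal sketch of the encoding described above (the constant name is from the patch; the wrapper and the exact MWAITX_ECX_TIMER_ENABLE bit are illustrative assumptions):

    /*
     * MWAITX takes the C-state hint in EAX[7:4], where EAX[7:4]+1 is the
     * targeted C-state. Requesting C0 ("disable C-states") therefore needs
     * 0xf in bits 7:4, i.e. 0xf0, not 0xf, which lands in the reserved
     * bits EAX[3:0] and is silently ignored by the CPU.
     */
    #define MWAITX_DISABLE_CSTATES  0xf0U       /* previously 0xf */
    #define MWAITX_ECX_TIMER_ENABLE (1U << 1)   /* assumed bit position */

    static inline void mwaitx_wait(unsigned long loops)
    {
            /* eax = C-state hint, ebx = max wait in TSC cycles, ecx = timer enable */
            asm volatile(".byte 0x0f, 0x01, 0xfb"   /* mwaitx */
                         :: "a" (MWAITX_DISABLE_CSTATES),
                            "b" (loops),
                            "c" (MWAITX_ECX_TIMER_ENABLE));
    }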
2019-10-17 | arm64/sve: Fix wrong free for task->thread.sve_state | Masayoshi Mizuma | 1 file, -17/+15
commit 4585fc59c0e813188d6a4c5de1f6976fce461fc2 upstream. A system with the SVE feature crashed because the memory pointed to by task->thread.sve_state was destroyed by someone. That is because sve_state is freed while forking the child process. The child process holds the same sve_state pointer as the parent, because the child's task_struct is copied from the parent's one. If copy_process() fails with an error somewhere, for example in copy_creds(), then the sve_state is freed even though the parent is still alive. The flow is as follows:

  copy_process
    p = dup_task_struct
      => arch_dup_task_struct
         *dst = *src;  // copy the entire region
    :
    retval = copy_creds
    if (retval < 0)
      goto bad_fork_free;
    :
  bad_fork_free:
    ...
    delayed_free_task(p);
      => free_task
         => arch_release_task_struct
            => fpsimd_release_task
               => __sve_free
                  => kfree(task->thread.sve_state);
                     // frees the parent's sve_state

Move setting the child's sve_state = NULL and clearing the TIF_SVE flag into arch_dup_task_struct() so that the child doesn't free the parent's one. There is no need to wait until copy_process() to clear TIF_SVE for dst, because the thread flags for dst are initialized already by copying the src task_struct. This change simplifies the code, so get rid of comments that are no longer needed. As a note, arm64 used to have thread_info on the stack, so it would not have been possible to clear TIF_SVE until the stack was initialized. Since commit c02433dd6de3 ("arm64: split thread_info from task stack"), the thread_info is part of the task, so it is valid to modify the flag from arch_dup_task_struct(). Cc: stable@vger.kernel.org # 4.15.x- Fixes: bc0ee4760364 ("arm64/sve: Core task context handling") Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reported-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Suggested-by: Dave Martin <Dave.Martin@arm.com> Reviewed-by: Dave Martin <Dave.Martin@arm.com> Tested-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
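A sketch of where the clearing now lives, per the description above (simplified; the real arch_dup_task_struct() does more than this):

    int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
    {
            /* dst starts as a byte copy of src, including the sve_state pointer */
            *dst = *src;

            /*
             * Drop the inherited pointer before any copy_process() error path
             * can free it; the child allocates its own buffer on first SVE use.
             */
            dst->thread.sve_state = NULL;
            clear_tsk_thread_flag(dst, TIF_SVE);

            return 0;
    }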
2019-10-17 | arm64: topology: Use PPTT to determine if PE is a thread | Jeremy Linton | 1 file, -4/+15
Commit 98dc19902a0b2e5348e43d6a2c39a0a7d0fc639e upstream. ACPI 6.3 adds a thread flag to represent if a CPU/PE is actually a thread. Given that the MPIDR_MT bit may not represent this information consistently on homogeneous machines, we should prefer the PPTT flag if it's available. Signed-off-by: Jeremy Linton <jeremy.linton@arm.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Reviewed-by: Robert Richter <rrichter@marvell.com> [will: made acpi_cpu_is_threaded() return 'bool'] Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-17 | MIPS: elf_hwcap: Export userspace ASEs | Jiaxun Yang | 2 files, -0/+44
commit 38dffe1e4dde1d3174fdce09d67370412843ebb5 upstream. A Golang developer reported that MIPS hwcap isn't reflecting instructions that the processor actually supports, so programs can't apply optimized code at runtime. Thus we export the ASEs that can be used in userspace programs. Reported-by: Meng Zhuo <mengzhuo1203@gmail.com> Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: linux-mips@vger.kernel.org Cc: Paul Burton <paul.burton@mips.com> Cc: <stable@vger.kernel.org> # 4.14+ Signed-off-by: Paul Burton <paul.burton@mips.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
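On the userspace side, the exported bits can then be probed through the auxiliary vector; a small sketch (the exact HWCAP_MIPS_* bit values live in arch/mips/include/uapi/asm/hwcap.h, and the value used here is an assumption):

    #include <stdio.h>
    #include <sys/auxv.h>

    #define HWCAP_MIPS_MSA (1 << 1)   /* assumed bit position */

    int main(void)
    {
            unsigned long hwcap = getauxval(AT_HWCAP);

            /* Pick an optimized code path only when the kernel reports the ASE. */
            if (hwcap & HWCAP_MIPS_MSA)
                    printf("MSA available: selecting vectorized path\n");
            return 0;
    }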
2019-10-17 | MIPS: Disable Loongson MMI instructions for kernel build | Paul Burton | 2 files, -0/+5
commit 2f2b4fd674cadd8c6b40eb629e140a14db4068fd upstream. GCC 9.x automatically enables support for Loongson MMI instructions when using some -march= flags, and then errors out when -msoft-float is specified with:

  cc1: error: ‘-mloongson-mmi’ must be used with ‘-mhard-float’

The kernel shouldn't be using these MMI instructions anyway, just as it doesn't use floating point instructions. Explicitly disable them in order to fix the build with GCC 9.x. Signed-off-by: Paul Burton <paul.burton@mips.com> Fixes: 3702bba5eb4f ("MIPS: Loongson: Add GCC 4.4 support for Loongson2E") Fixes: 6f7a251a259e ("MIPS: Loongson: Add basic Loongson 2F support") Fixes: 5188129b8c9f ("MIPS: Loongson-3: Improve -march option and move it to Platform") Cc: Huacai Chen <chenhc@lemote.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: stable@vger.kernel.org # v2.6.32+ Cc: linux-mips@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-17 | USB: rio500: Remove Rio 500 kernel driver | Bastien Nocera | 7 files, -7/+0
commit 015664d15270a112c2371d812f03f7c579b35a73 upstream. The Rio500 kernel driver has not been used by Rio500 owners since 2001, not long after the rio500 project added support for a user-space USB stack through the very first versions of usbdevfs and then libusb. Support for the kernel driver was removed from the upstream utilities in 2008: https://gitlab.freedesktop.org/hadess/rio500/commit/943f624ab721eb8281c287650fcc9e2026f6f5db Cc: Cesar Miquel <miquel@df.uba.ar> Signed-off-by: Bastien Nocera <hadess@hadess.net> Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/6251c17584d220472ce882a3d9c199c401a51a71.camel@hadess.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | riscv: Avoid interrupts being erroneously enabled in handle_exception() | Vincent Chen | 1 file, -1/+5
[ Upstream commit c82dd6d078a2bb29d41eda032bb96d05699a524d ] When the handle_exception function addresses an exception, interrupts are unconditionally enabled after finishing the context save. However, this may erroneously enable the interrupts if they were disabled before entering handle_exception. For example, one of the WARN_ON() conditions is satisfied in the scheduling code, where interrupts are disabled and rq.lock is locked. The WARN_ON triggers a break exception and handle_exception enables interrupts before entering the do_trap_break function. During the procedure, if a timer interrupt is pending, it will be taken when interrupts are enabled. In this case, it may cause a deadlock if rq.lock is locked again in the timer ISR. Hence, handle_exception() should only enable interrupts when the state of sstatus.SPIE is 1. This patch is tested on the HiFive Unleashed board. Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Palmer Dabbelt <palmer@sifive.com> [paul.walmsley@sifive.com: updated to apply] Fixes: bcae803a21317 ("RISC-V: Enable IRQ during exception handling") Cc: David Abdurachmanov <david.abdurachmanov@sifive.com> Cc: stable@vger.kernel.org Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-11 | KVM: nVMX: Fix consistency check on injected exception error code | Sean Christopherson | 1 file, -1/+1
[ Upstream commit 567926cca99ba1750be8aae9c4178796bf9bb90b ] Current versions of Intel's SDM incorrectly state that "bits 31:15 of the VM-Entry exception error-code field" must be zero. In reality, bits 31:16 must be zero, i.e. error codes are 16-bit values. The bogus error code check manifests as an unexpected VM-Entry failure due to an invalid code field (error number 7) in L1, e.g. when injecting a #GP with error_code=0x9f00. Nadav previously reported the bug[*], both to KVM and Intel, and fixed the associated kvm-unit-test. [*] https://patchwork.kernel.org/patch/11124749/ Reported-by: Nadav Amit <namit@vmware.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
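The corrected check boils down to a 16-bit validity test; a sketch (the helper name is illustrative, not the exact KVM function):

    /* Exception error codes are 16-bit values: bits 31:16 must be zero. */
    static bool vm_entry_error_code_valid(u32 error_code)
    {
            return !(error_code & GENMASK(31, 16));   /* was GENMASK(31, 15) */
    }

With the old mask, a legitimate error code such as 0x9f00 (bit 15 set) was rejected and the VM-Entry failed.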
2019-10-11 | x86/purgatory: Disable the stackleak GCC plugin for the purgatory | Arvind Sankar | 1 file, -0/+1
[ Upstream commit ca14c996afe7228ff9b480cf225211cc17212688 ] Since commit: b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS") kexec breaks if GCC_PLUGIN_STACKLEAK=y is enabled, as the purgatory contains undefined references to stackleak_track_stack. Attempting to load a kexec kernel results in this failure:

  kexec: Undefined symbol: stackleak_track_stack
  kexec-bzImage64: Loading purgatory failed

Fix this by disabling the stackleak plugin for the purgatory. Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS") Link: https://lkml.kernel.org/r/20190923171753.GA2252517@rani.riverdale.lan Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-11 | DTS: ARM: gta04: introduce legacy spi-cs-high to make display work again | H. Nikolaus Schaller | 1 file, -0/+1
commit f1f028ff89cb0d37db299d48e7b2ce19be040d52 upstream. commit 6953c57ab172 "gpio: of: Handle SPI chipselect legacy bindings" did introduce logic to centrally handle the legacy spi-cs-high property in combination with cs-gpios. This assumes that the polarity of the CS has to be inverted if spi-cs-high is missing, even and especially if non-legacy GPIO_ACTIVE_HIGH is specified. The DTS for the GTA04 was originally introduced under the assumption that there is no need for spi-cs-high if the gpio is defined with proper polarity GPIO_ACTIVE_HIGH. This was not a problem until gpiolib changed the interpretation of GPIO_ACTIVE_HIGH and missing spi-cs-high. The effect is that the missing spi-cs-high is now interpreted as CS being low (despite GPIO_ACTIVE_HIGH), which turns off the SPI interface when the panel is to be programmed by the panel driver. Therefore, we have to add the redundant and legacy spi-cs-high property to properly activate CS. Cc: stable@vger.kernel.org Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | libnvdimm/altmap: Track namespace boundaries in altmap | Aneesh Kumar K.V | 1 file, -1/+16
commit cf387d9644d8c78721cf9b77af9f67bb5b04da16 upstream. With a PFN_MODE_PMEM namespace, the memmap area is allocated from the device area. Some architectures map the memmap area with a large page size. On architectures like ppc64, a 16MB page used for the memmap mapping can map 262144 pfns, i.e. a namespace size of 16G. When populating the memmap region with 16MB pages from the device area, make sure the allocated space is not used to map resources outside this namespace. Such usage of the device area would prevent a namespace destroy. Add the resource end pfn to the altmap and use it to check whether a memmap area allocation would map pfns outside the namespace. On ppc64, in such a case we fall back to allocation from regular memory. This fixes the kernel crash reported below:

  [ 132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
  [ 133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
  [ 133.464760] Faulting instruction address: 0xc00000000007580c
  [ 133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
  [ 133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  .....
  [ 133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
  [ 133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
  [ 133.464910] Call Trace:
  [ 133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
  [ 133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
  [ 133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
  [ 133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
  [ 133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
  [ 133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
  [ 133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
  [ 133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
  [ 133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
  [ 133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
  [ 133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
  [ 133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
  [ 133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
  [ 133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250

djbw: Aneesh notes that this crash can likely be triggered in any kernel that supports 'papr_scm', so flagging that commit for -stable consideration. Fixes: b5beae5e224f ("powerpc/pseries: Add driver for PAPR SCM regions") Cc: <stable@vger.kernel.org> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Pankaj Gupta <pagupta@redhat.com> Tested-by: Santosh Sivaraj <santosh@fossix.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Link: https://lore.kernel.org/r/20190910062826.10041-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
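A sketch of the boundary check this adds (the end_pfn field is from the patch; the helper shape and names are assumptions):

    /*
     * Refuse to carve memmap backing out of the device area when the
     * allocation would map pfns beyond this namespace; the caller then
     * falls back to allocating from regular memory instead.
     */
    static bool altmap_alloc_fits(struct vmem_altmap *altmap,
                                  unsigned long pfn, unsigned long nr_pfns)
    {
            if (altmap->end_pfn && pfn + nr_pfns > altmap->end_pfn)
                    return false;
            return true;
    }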
2019-10-11 | MIPS: Treat Loongson Extensions as ASEs | Jiaxun Yang | 4 files, -0/+30
commit d2f965549006acb865c4638f1f030ebcefdc71f6 upstream. Recently, binutils split the Loongson-3 Extensions into four ASEs: MMI, CAM, EXT, EXT2. This patch does the same thing in the kernel and exposes them in cpuinfo so applications can probe supported ASEs at runtime. Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Huacai Chen <chenhc@lemote.com> Cc: Yunqiang Su <ysu@wavecomp.com> Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Paul Burton <paul.burton@mips.com> Cc: linux-mips@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9 | Aneesh Kumar K.V | 5 files, -22/+134
commit 047e6575aec71d75b765c22111820c4776cd1c43 upstream. On POWER9, under some circumstances, a broadcast TLB invalidation will fail to invalidate the ERAT cache on some threads when there are parallel mtpidr/mtlpidr happening on other threads of the same core. This can cause stores to continue to go to a page after it's unmapped. The workaround is to force an ERAT flush using PID=0 or LPID=0 tlbie flush. This additional TLB flush will cause the ERAT cache invalidation. Since we are using PID=0 or LPID=0, we don't get filtered out by the TLB snoop filtering logic. We need to still follow this up with another tlbie to take care of store vs tlbie ordering issue explained in commit: a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9"). The presence of ERAT cache implies we can still get new stores and they may miss store queue marking flush. Cc: stable@vger.kernel.org Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-3-aneesh.kumar@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/mm: Fix an Oops in kasan_mmu_init() | Christophe Leroy | 1 file, -1/+14
commit cbd18991e24fea2c31da3bb117c83e4a3538cd11 upstream.

  Uncompressing Kernel Image ... OK
  Loading Device Tree to 01ff7000, end 01fff74f ... OK
  [ 0.000000] printk: bootconsole [udbg0] enabled
  [ 0.000000] BUG: Unable to handle kernel data access at 0xf818c000
  [ 0.000000] Faulting instruction address: 0xc0013c7c
  [ 0.000000] Thread overran stack, or stack corrupted
  [ 0.000000] Oops: Kernel access of bad area, sig: 11 [#1]
  [ 0.000000] BE PAGE_SIZE=16K PREEMPT
  [ 0.000000] Modules linked in:
  [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.3.0-rc4-s3k-dev-00743-g5abe4a3e8fd3-dirty #2080
  [ 0.000000] NIP: c0013c7c LR: c0013310 CTR: 00000000
  [ 0.000000] REGS: c0c5ff38 TRAP: 0300 Not tainted (5.3.0-rc4-s3k-dev-00743-g5abe4a3e8fd3-dirty)
  [ 0.000000] MSR: 00001032 <ME,IR,DR,RI> CR: 99033955 XER: 80002100
  [ 0.000000] DAR: f818c000 DSISR: 82000000
  [ 0.000000] GPR00: c0013310 c0c5fff0 c0ad6ac0 c0c600c0 f818c031 82000000 00000000 ffffffff
  [ 0.000000] GPR08: 00000000 f1f1f1f1 c0013c2c c0013304 99033955 00400008 00000000 07ff9598
  [ 0.000000] GPR16: 00000000 07ffb94c 00000000 00000000 00000000 00000000 00000000 f818cfb2
  [ 0.000000] GPR24: 00000000 00000000 00001000 ffffffff 00000000 c07dbf80 00000000 f818c000
  [ 0.000000] NIP [c0013c7c] do_page_fault+0x50/0x904
  [ 0.000000] LR [c0013310] handle_page_fault+0xc/0x38
  [ 0.000000] Call Trace:
  [ 0.000000] Instruction dump:
  [ 0.000000] be010080 91410014 553fe8fe 3d40c001 3d20f1f1 7d800026 394a3c2c 3fffe000
  [ 0.000000] 6129f1f1 900100c4 9181007c 91410018 <913f0000> 3d2001f4 6129f4f4 913f0004

Don't map the early shadow page read-only yet when creating the new page tables for the real shadow memory, otherwise the memblock allocations that immediately follow to create the real shadow pages that are about to replace the early shadow page trigger a page fault if they fall into the region being worked on at the moment. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Fixes: 2edb16efc899 ("powerpc/32: Add KASAN support") Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/fe86886fb8db44360417cee0dc515ad47ca6ef72.1566382750.git.christophe.leroy@c-s.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/mm: Add a helper to select PAGE_KERNEL_RO or PAGE_READONLY | Christophe Leroy | 1 file, -8/+13
commit 4c0f5d1eb4072871c34530358df45f05ab80edd6 upstream. In a couple of places there is a need to select whether read-only protection of shadow pages is performed with PAGE_KERNEL_RO or with PAGE_READONLY. Add a helper to avoid duplicating the choice. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/9f33f44b9cd741c4a02b3dce7b8ef9438fe2cd2a.1566382750.git.christophe.leroy@c-s.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag | Aneesh Kumar K.V | 5 files, -9/+9
commit 09ce98cacd51fcd0fa0af2f79d1e1d3192f4cbb0 upstream. Rename the #define to indicate this is related to the store vs tlbie ordering issue. In the next patch, we will be adding another feature flag that is used to handle the ERAT flush vs tlbie ordering issue. Fixes: a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions | Aneesh Kumar K.V | 1 file, -2/+28
commit 677733e296b5c7a37c47da391fc70a43dc40bd67 upstream. The store ordering vs tlbie issue mentioned in commit a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9") is fixed for Nimbus 2.3 and Cumulus 1.3 revisions, so we don't need to apply the fixup when running on them. We can only do this on PowerNV: on a pseries guest with KVM we still don't support redoing the feature fixup after migration, so there we should keep all the workarounds enabled, because we can possibly migrate between DD 2.3 and DD 2.2. Fixes: a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-1-aneesh.kumar@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/kasan: Fix shadow area set up for modules. | Christophe Leroy | 1 file, -1/+1
commit 663c0c9496a69f80011205ba3194049bcafd681d upstream. When loading modules, from time to time an Oops is encountered during the init of shadow area for globals. This is due to the last page not always being mapped depending on the exact distance between the start and the end of the shadow area and the alignment with the page addresses. Fix this by aligning the starting address with the page address. Fixes: 2edb16efc899 ("powerpc/32: Add KASAN support") Cc: stable@vger.kernel.org # v5.2+ Reported-by: Erhard F. <erhard_f@mailbox.org> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/4f887e9b77d0d725cbb52035c7ece485c1c5fc14.1565361881.git.christophe.leroy@c-s.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/kasan: Fix parallel loading of modules. | Christophe Leroy | 1 file, -2/+19
commit 45ff3c55958542c3b76075d59741297b8cb31cbb upstream. Parallel loading of modules may lead to bad setup of shadow page table entries. First, let's align modules so that two modules never share the same shadow page. Second, ensure that two modules cannot allocate two page tables for the same PMD entry at the same time. This is done by using init_mm.page_table_lock in the same way as __pte_alloc_kernel(). Fixes: 2edb16efc899 ("powerpc/32: Add KASAN support") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/c97284f912128cbc3f2fe09d68e90e65fb3e6026.1565361876.git.christophe.leroy@c-s.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/powernv/ioda: Fix race in TCE level allocation | Alexey Kardashevskiy | 1 file, -5/+13
commit 56090a3902c80c296e822d11acdb6a101b322c52 upstream. pnv_tce() returns a pointer to a TCE entry and originally a TCE table would be pre-allocated. For the default case of 2GB window the table needs only a single level and that is fine. However if more levels are requested, it is possible to get a race when 2 threads want a pointer to a TCE entry from the same page of TCEs. This adds cmpxchg to handle the race. Note that once TCE is non-zero, it cannot become zero again. Fixes: a68bd1267b72 ("powerpc/powernv/ioda: Allocate indirect TCE levels on demand") CC: stable@vger.kernel.org # v4.19+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190718051139.74787-2-aik@ozlabs.ru Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
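The race closes with the usual "allocate, publish with cmpxchg, free the loser" pattern; a generic sketch of the idea (the alloc/free helpers are assumptions, not the exact pnv_tce() code):

    static __be64 *tce_level_get(__be64 *parent, int idx)
    {
            u64 entry = be64_to_cpu(READ_ONCE(parent[idx]));

            if (!entry) {
                    __be64 *new = alloc_tce_level();   /* assumed helper */

                    /* Publish atomically: exactly one thread's pointer wins. */
                    if (cmpxchg(&parent[idx], cpu_to_be64(0),
                                cpu_to_be64(__pa(new))) != cpu_to_be64(0))
                            free_tce_level(new);       /* lost the race */
                    entry = be64_to_cpu(READ_ONCE(parent[idx]));
            }
            /* Once a TCE entry is non-zero, it never goes back to zero. */
            return __va(entry);
    }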
2019-10-11 | powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt() | Gautham R. Shenoy | 2 files, -3/+14
commit c784be435d5dae28d3b03db31753dd7a18733f0c upstream. The calls to arch_add_memory()/arch_remove_memory() are always made with the read-side cpu_hotplug_lock acquired via memory_hotplug_begin(). On pSeries, arch_add_memory()/arch_remove_memory() eventually call resize_hpt() which in turn calls stop_machine() which acquires the read-side cpu_hotplug_lock again, thereby resulting in the recursive acquisition of this lock. In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system lockup during a memory hotplug operation because cpus_read_lock() is a per-cpu rwsem read, which, in the fast-path (in the absence of the writer, which in our case is a CPU-hotplug operation) simply increments the read_count on the semaphore. Thus a recursive read in the fast-path doesn't cause any problems.

However, we can hit this problem in practice if there is a concurrent CPU-hotplug operation in progress which is waiting to acquire the write-side of the lock. This will cause the second recursive read to block until the writer finishes, while the writer itself is blocked because the first read holds the lock. Thus both the reader as well as the writer fail to make any progress, thereby blocking both CPU-hotplug as well as memory-hotplug operations.

  Memory-Hotplug                              CPU-Hotplug
  CPU 0                                       CPU 1
  ------                                      ------
  1. down_read(cpu_hotplug_lock.rw_sem)
     [memory_hotplug_begin]
                                              2. down_write(cpu_hotplug_lock.rw_sem)
                                                 [cpu_up/cpu_down]
  3. down_read(cpu_hotplug_lock.rw_sem)
     [stop_machine()]

Lockdep complains as follows in these code-paths:

  swapper/0/1 is trying to acquire lock:
  (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x2c/0x60

  but task is already holding lock:
  (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50

  other info that might help us debug this:
   Possible unsafe locking scenario:
         CPU0
         ----
    lock(cpu_hotplug_lock.rw_sem);
    lock(cpu_hotplug_lock.rw_sem);
   *** DEADLOCK ***
   May be due to missing lock nesting notation

  3 locks held by swapper/0/1:
   #0: (____ptrval____) (&dev->mutex){....}, at: __driver_attach+0x12c/0x1b0
   #1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
   #2: (____ptrval____) (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x54/0x1a0

  stack backtrace:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
  Call Trace:
    dump_stack+0xe8/0x164 (unreliable)
    __lock_acquire+0x1110/0x1c70
    lock_acquire+0x240/0x290
    cpus_read_lock+0x64/0xf0
    stop_machine+0x2c/0x60
    pseries_lpar_resize_hpt+0x19c/0x2c0
    resize_hpt_for_hotplug+0x70/0xd0
    arch_add_memory+0x58/0xfc
    devm_memremap_pages+0x5e8/0x8f0
    pmem_attach_disk+0x764/0x830
    nvdimm_bus_probe+0x118/0x240
    really_probe+0x230/0x4b0
    driver_probe_device+0x16c/0x1e0
    __driver_attach+0x148/0x1b0
    bus_for_each_dev+0x90/0x130
    driver_attach+0x34/0x50
    bus_add_driver+0x1a8/0x360
    driver_register+0x108/0x170
    __nd_driver_register+0xd0/0xf0
    nd_pmem_driver_init+0x34/0x48
    do_one_initcall+0x1e0/0x45c
    kernel_init_freeable+0x540/0x64c
    kernel_init+0x2c/0x160
    ret_from_kernel_thread+0x5c/0x68

Fix this issue by
 1) requiring all the calls to pseries_lpar_resize_hpt() to be made with cpu_hotplug_lock held;
 2) in pseries_lpar_resize_hpt(), invoking stop_machine_cpuslocked() as a consequence of 1);
 3) to satisfy 1), in hpt_order_set(), calling mmu_hash_ops.resize_hpt() with cpu_hotplug_lock held.

Fixes: dbcf929c0062 ("powerpc/pseries: Add support for hash table resizing") Cc: stable@vger.kernel.org # v4.11+ Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1557906352-29048-1-git-send-email-ego@linux.vnet.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
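The shape of the fix, sketched (function names are from the commit; the bodies are simplified):

    /* 1) + 2): callers hold cpu_hotplug_lock, so use the _cpuslocked variant */
    static int pseries_lpar_resize_hpt(unsigned long shift)
    {
            struct hpt_resize_state state = { .shift = shift };

            lockdep_assert_cpus_held();   /* new locking rule for all callers */
            return stop_machine_cpuslocked(pseries_lpar_resize_hpt_commit,
                                           &state, NULL);
    }

    /* 3): the debugfs path (hpt_order_set()) takes the lock explicitly */
    cpus_read_lock();
    rc = mmu_hash_ops.resize_hpt(target_order);
    cpus_read_unlock();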
2019-10-11 | powerpc/powernv: Restrict OPAL symbol map to only be readable by root | Andrew Donnellan | 1 file, -4/+7
commit e7de4f7b64c23e503a8c42af98d56f2a7462bd6d upstream. Currently the OPAL symbol map is globally readable, which seems bad as it contains physical addresses. Restrict it to root. Fixes: c8742f85125d ("powerpc/powernv: Expose OPAL firmware symbol map") Cc: stable@vger.kernel.org # v3.19+ Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190503075253.22798-1-ajd@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/ptdump: Fix addresses display on PPC32 | Christophe Leroy | 1 file, -1/+1
commit 7c7a532ba3fc51bf9527d191fb410786c1fdc73c upstream. Commit 453d87f6a8ae ("powerpc/mm: Warn if W+X pages found on boot") wrongly changed KERN_VIRT_START from 0 to PAGE_OFFSET, leading to a shift in the displayed addresses. Let's revert that change to resync walk_pagetables()'s addr val and pgd_t pointer for PPC32. Fixes: 453d87f6a8ae ("powerpc/mm: Warn if W+X pages found on boot") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/eb4d626514e22f85814830012642329018ef6af9.1565786091.git.christophe.leroy@c-s.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/32s: Fix boot failure with DEBUG_PAGEALLOC without KASAN. | Christophe Leroy | 2 files, -0/+11
commit 9d6d712fbf7766f21c838940eebcd7b4d476c5e6 upstream. When KASAN is selected, the definitive hash table has to be set up later, but there is already an early temporary one. When KASAN is not selected, there is no early hash table, so the setup of the definitive hash table cannot be delayed. Fixes: 72f208c6a8f7 ("powerpc/32s: move hash code patching out of MMU_init_hw()") Cc: stable@vger.kernel.org # v5.2+ Reported-by: Jonathan Neuschafer <j.neuschaefer@gmx.net> Tested-by: Jonathan Neuschafer <j.neuschaefer@gmx.net> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/b7860c5e1e784d6b96ba67edf47dd6cbc2e78ab6.1565776892.git.christophe.leroy@c-s.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/603: Fix handling of the DIRTY flag | Christophe Leroy | 1 file, -2/+2
commit 415480dce2ef03bb8335deebd2f402f475443ce0 upstream. If a page is already mapped RW without the DIRTY flag, the DIRTY flag is never set and a TLB store miss exception is taken forever. This is easily reproduced with the following app:

    #include <sys/mman.h>

    int main(void)
    {
            volatile char *ptr = mmap(0, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);

            *ptr = *ptr;
            return 0;
    }

When the DIRTY flag is not set, bail out of the TLB miss handler and take a minor page fault which will set the DIRTY flag. Fixes: f8b58c64eaef ("powerpc/603: let's handle PAGE_DIRTY directly") Cc: stable@vger.kernel.org # v5.1+ Reported-by: Doug Crawford <doug.crawford@intelight-its.com> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/80432f71194d7ee75b2f5043ecf1501cf1cca1f3.1566196646.git.christophe.leroy@c-s.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/mce: Schedule work from irq_work | Santosh Sivaraj | 1 file, -1/+10
commit b5bda6263cad9a927e1a4edb7493d542da0c1410 upstream. schedule_work() cannot be called from MCE exception context as MCE can interrupt even in interrupt disabled context. Fixes: 733e4a4c4467 ("powerpc/mce: hookup memory_failure for UE errors") Cc: stable@vger.kernel.org # v4.15+ Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Santosh Sivaraj <santosh@fossix.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190820081352.8641-2-santosh@fossix.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | powerpc/mce: Fix MCE handling for huge pages | Balbir Singh | 1 file, -6/+13
commit 99ead78afd1128bfcebe7f88f3b102fb2da09aee upstream. The current code would fail on huge page addresses, since the shift would be incorrect. Use the correct page shift value returned by __find_linux_pte() to get the correct physical address. The code is more generic and can handle both regular and compound pages. Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors") Signed-off-by: Balbir Singh <bsingharora@gmail.com> [arbab@linux.ibm.com: Fixup pseries_do_memory_failure()] Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Santosh Sivaraj <santosh@fossix.org> Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190820081352.8641-3-santosh@fossix.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
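A sketch of the shift-aware translation (__find_linux_pte() is existing powerpc API; the surrounding function is simplified):

    static unsigned long mce_addr_to_phys(pgd_t *pgdir, unsigned long addr)
    {
            unsigned int shift;
            pte_t *pte = __find_linux_pte(pgdir, addr, NULL, &shift);

            if (!pte)
                    return ULONG_MAX;
            if (!shift)
                    shift = PAGE_SHIFT;   /* regular page */

            /* keep the offset bits below the actual mapping size */
            return (pte_pfn(*pte) << PAGE_SHIFT) +
                   (addr & ((1UL << shift) - 1));
    }

With the old code the offset was always taken modulo PAGE_SIZE, which computes the wrong physical address inside a compound (huge) page mapping.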
2019-10-11 | powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race | Paul Mackerras | 5 files, -23/+108
commit da15c03b047dca891d37b9f4ef9ca14d84a6484f upstream. Testing has revealed the existence of a race condition where a XIVE interrupt being shut down can be in one of the XIVE interrupt queues (of which there are up to 8 per CPU, one for each priority) at the point where free_irq() is called. If this happens, the interrupt fetch from the queue can return an interrupt number which has been shut down. This can lead to various symptoms:

- irq_to_desc(irq) can be NULL. In this case, no end-of-interrupt function gets called, with the result that the CPU's elevated interrupt priority (numerically lowered CPPR) never gets reset. That then means that the CPU stops processing interrupts, causing device timeouts and other errors in various device drivers.

- The irq descriptor or related data structures can be in the process of being freed as the interrupt code is using them. This typically leads to crashes due to bad pointer dereferences.

This race is basically what commit 62e0468650c3 ("genirq: Add optional hardware synchronization for shutdown", 2019-06-28) is intended to fix, given a get_irqchip_state() method for the interrupt controller being used. It works by polling the interrupt controller when an interrupt is being freed until the controller says it is not pending.

With XIVE, the PQ bits of the interrupt source indicate the state of the interrupt source, and in particular the P bit goes from 0 to 1 at the point where the hardware writes an entry into the interrupt queue that this interrupt is directed towards. Normally, the code will then process the interrupt and do an end-of-interrupt (EOI) operation which will reset PQ to 00 (assuming another interrupt hasn't been generated in the meantime). However, there are situations where the code resets P even though a queue entry exists (for example, by setting PQ to 01, which disables the interrupt source), and also situations where the code leaves P at 1 after removing the queue entry (for example, this is done for escalation interrupts so they cannot fire again until they are explicitly re-enabled).

The code already has a 'saved_p' flag for the interrupt source which indicates that a queue entry exists, although it isn't maintained consistently. This patch adds a 'stale_p' flag to indicate that P has been left at 1 after processing a queue entry, and adds code to set and clear saved_p and stale_p as necessary to maintain a consistent indication of whether a queue entry may or may not exist. With this, we can implement xive_get_irqchip_state() by looking at stale_p, saved_p and the ESB PQ bits for the interrupt.

There is some additional code to handle escalation interrupts properly, because they are enabled and disabled in KVM assembly code, which does not have access to the xive_irq_data struct for the escalation interrupt. Hence, stale_p may be incorrect when the escalation interrupt is freed in kvmppc_xive_{,native_}cleanup_vcpu(). Fortunately, we can fix it up by looking at vcpu->arch.xive_esc_on, with some careful attention to barriers in order to ensure the correct result if xive_esc_irq() races with kvmppc_xive_cleanup_vcpu().

Finally, this adds code to make noise on the console (pr_crit and WARN_ON(1)) if we find an interrupt queue entry for an interrupt which does not have a descriptor. While this won't catch the race reliably, if it does get triggered it will be an indication that the race is occurring and needs to be debugged.
Fixes: 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE interrupt controller") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190813100648.GE9567@blackberry Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | KVM: X86: Fix userspace set invalid CR4 | Wanpeng Li | 1 file, -17/+21
commit 3ca94192278ca8de169d78c085396c424be123b3 upstream. Reported by syzkaller:

  WARNING: CPU: 0 PID: 6544 at /home/kernel/data/kvm/arch/x86/kvm//vmx/vmx.c:4689 handle_desc+0x37/0x40 [kvm_intel]
  CPU: 0 PID: 6544 Comm: a.out Tainted: G OE 5.3.0-rc4+ #4
  RIP: 0010:handle_desc+0x37/0x40 [kvm_intel]
  Call Trace:
   vmx_handle_exit+0xbe/0x6b0 [kvm_intel]
   vcpu_enter_guest+0x4dc/0x18d0 [kvm]
   kvm_arch_vcpu_ioctl_run+0x407/0x660 [kvm]
   kvm_vcpu_ioctl+0x3ad/0x690 [kvm]
   do_vfs_ioctl+0xa2/0x690
   ksys_ioctl+0x6d/0x80
   __x64_sys_ioctl+0x1a/0x20
   do_syscall_64+0x74/0x720
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

When CR4.UMIP is set, the guest should have the UMIP cpuid flag. The current kvm set_sregs function doesn't perform such a check when userspace supplies sregs values. SECONDARY_EXEC_DESC is enabled on writes to CR4.UMIP in vmx_set_cr4 even though the guest doesn't have the UMIP cpuid flag. The testcase triggers the handle_desc warning when executing the ltr instruction, since the guest's architectural CR4 doesn't set UMIP. This patch fixes it by adding a valid CR4 and CPUID combination check in __set_sregs. syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=138efb99600000 Reported-by: syzbot+0f1819555fbdce992df9@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
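The added validation amounts to cross-checking CR4 against guest CPUID before accepting the sregs; a simplified sketch (upstream routes this through a generic CR4-validity helper rather than a UMIP-only check):

    static int kvm_check_sregs_cr4(struct kvm_vcpu *vcpu,
                                   const struct kvm_sregs *sregs)
    {
            /*
             * Reject CR4 bits the guest's CPUID does not advertise, e.g.
             * CR4.UMIP without the UMIP feature bit; otherwise vmx_set_cr4()
             * would enable SECONDARY_EXEC_DESC for a guest that cannot
             * legitimately have it.
             */
            if ((sregs->cr4 & X86_CR4_UMIP) &&
                !guest_cpuid_has(vcpu, X86_FEATURE_UMIP))
                    return -EINVAL;
            return 0;
    }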
2019-10-11 | KVM: PPC: Book3S HV: Don't lose pending doorbell request on migration on P9 | Paul Mackerras | 1 file, -1/+8
commit ff42df49e75f053a8a6b4c2533100cdcc23afe69 upstream. On POWER9, when userspace reads the value of the DPDES register on a vCPU, it is possible for 0 to be returned although there is a doorbell interrupt pending for the vCPU. This can lead to a doorbell interrupt being lost across migration. If the guest kernel uses doorbell interrupts for IPIs, then it could malfunction because of the lost interrupt. This happens because a newly-generated doorbell interrupt is signalled by setting vcpu->arch.doorbell_request to 1; the DPDES value in vcpu->arch.vcore->dpdes is not updated, because it can only be updated when holding the vcpu mutex, in order to avoid races. To fix this, we OR in vcpu->arch.doorbell_request when reading the DPDES value. Cc: stable@vger.kernel.org # v4.13+ Fixes: 579006944e0d ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
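The read side of the fix, sketched (field names are from the description above):

    static u64 kvmppc_read_dpdes(struct kvm_vcpu *vcpu)
    {
            /*
             * vcore->dpdes is only updated under the vcpu mutex, so a freshly
             * signalled doorbell may exist only in doorbell_request; OR it in
             * so migration does not lose the pending interrupt.
             */
            return vcpu->arch.vcore->dpdes | vcpu->arch.doorbell_request;
    }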
2019-10-11 | KVM: PPC: Book3S HV: Check for MMU ready on piggybacked virtual cores | Paul Mackerras | 1 file, -5/+10
commit d28eafc5a64045c78136162af9d4ba42f8230080 upstream. When we are running multiple vcores on the same physical core, they could be from different VMs and so it is possible that one of the VMs could have its arch.mmu_ready flag cleared (for example by a concurrent HPT resize) when we go to run it on a physical core. We currently check the arch.mmu_ready flag for the primary vcore but not the flags for the other vcores that will be run alongside it. This adds that check, and also a check when we select the secondary vcores from the preempted vcores list. Cc: stable@vger.kernel.org # v4.14+ Fixes: 38c53af85306 ("KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts | Paul Mackerras | 1 file, -13/+23
commit 959c5d5134786b4988b6fdd08e444aa67d1667ed upstream. Escalation interrupts are interrupts sent to the host by the XIVE hardware when it has an interrupt to deliver to a guest VCPU but that VCPU is not running anywhere in the system. Hence we disable the escalation interrupt for the VCPU being run when we enter the guest and re-enable it when the guest does an H_CEDE hypercall indicating it is idle.

It is possible that an escalation interrupt gets generated just as we are entering the guest. In that case the escalation interrupt may be using a queue entry in one of the interrupt queues, and that queue entry may not have been processed when the guest exits with an H_CEDE. The existing entry code detects this situation and does not clear the vcpu->arch.xive_esc_on flag as an indication that there is a pending queue entry (if the queue entry gets processed, xive_esc_irq() will clear the flag). There is a comment in the code saying that if the flag is still set on H_CEDE, we have to abort the cede rather than re-enabling the escalation interrupt, lest we end up with two occurrences of the escalation interrupt in the interrupt queue.

However, the exit code doesn't do that; it aborts the cede in the sense that vcpu->arch.ceded gets cleared, but it still enables the escalation interrupt by setting the source's PQ bits to 00. Instead we need to set the PQ bits to 10, indicating that an interrupt has been triggered. We also need to avoid setting vcpu->arch.xive_esc_on in this case (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because xive_esc_irq() will run at some point and clear it, and if we race with that we may end up with an incorrect result (i.e. xive_esc_on set when the escalation interrupt has just been handled).

It is extremely unlikely that having two queue entries would cause observable problems; theoretically it could cause queue overflow, but the CPU would have to have thousands of interrupts targeted to it for that to be possible. However, this fix will also make it possible to determine accurately whether there is an unhandled escalation interrupt in the queue, which will be needed by the following patch. Fixes: 9b9b13a6d153 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190813100349.GD9567@blackberry Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device | Paul Mackerras | 3 files, -1/+15
commit 8d4ba9c931bc384bcc6889a43915aaaf19d3e499 upstream. At present, when running a guest on POWER9 using HV KVM but not using an in-kernel interrupt controller (XICS or XIVE), for example if QEMU is run with the kernel_irqchip=off option, the guest entry code goes ahead and tries to load the guest context into the XIVE hardware, even though no context has been set up. To fix this, we check that the "CAM word" is non-zero before pushing it to the hardware. The CAM word is initialized to a non-zero value in kvmppc_xive_connect_vcpu() and kvmppc_xive_native_connect_vcpu(), and is now cleared in kvmppc_xive_{,native_}cleanup_vcpu. Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") Cc: stable@vger.kernel.org # v4.12+ Reported-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190813100100.GC9567@blackberry Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | KVM: PPC: Book3S HV: XIVE: Free escalation interrupts before disabling the VP | Cédric Le Goater | 2 files, -13/+17
commit 237aed48c642328ff0ab19b63423634340224a06 upstream. When a vCPU is brought down, the XIVE VP (Virtual Processor) is first disabled and then the event notification queues are freed. When freeing the queues, we check for possible escalation interrupts and free them also. But when a XIVE VP is disabled, the underlying XIVE ENDs also are disabled in OPAL. When an END (Event Notification Descriptor) is disabled, its ESB pages (ESn and ESe) are disabled and loads return all 1s, which means that any access on the ESB page of the escalation interrupt will return invalid values. When an interrupt is freed, the shutdown handler computes a 'saved_p' field from the value returned by a load in xive_do_source_set_mask(). This value is incorrect for escalation interrupts for the reason described above. This has no impact on Linux/KVM today because we don't make use of it, but future changes will introduce a xive_get_irqchip_state() handler. This handler will use the 'saved_p' field to return the state of an interrupt, and with 'saved_p' being incorrect, a softlockup will occur. Fix the vCPU cleanup sequence by first freeing the escalation interrupts if any, then disabling the XIVE VP, and finally freeing the queues. Fixes: 90c73795afa2 ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode") Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | KVM: PPC: Book3S: Enable XIVE native capability only if OPAL has required functions | Paul Mackerras | 6 files, -4/+21
commit 2ad7a27deaf6d78545d97ab80874584f6990360e upstream. There are some POWER9 machines where the OPAL firmware does not support the OPAL_XIVE_GET_QUEUE_STATE and OPAL_XIVE_SET_QUEUE_STATE calls. The impact of this is that a guest using XIVE natively will not be able to be migrated successfully. On the source side, the get_attr operation on the KVM native device for the KVM_DEV_XIVE_GRP_EQ_CONFIG attribute will fail; on the destination side, the set_attr operation for the same attribute will fail. This adds tests for the existence of the OPAL get/set queue state functions, and if they are not supported, the XIVE-native KVM device is not created and the KVM_CAP_PPC_IRQ_XIVE capability returns false. Userspace can then either provide a software emulation of XIVE, or else tell the guest that it does not have a XIVE controller available to it. Cc: stable@vger.kernel.org # v5.2+ Fixes: 3fab2d10588e ("KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode") Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | KVM: s390: fix __insn32_query() inline assembly | Heiko Carstens | 1 file, -3/+3
commit b1c41ac3ce569b04644bb1e3fd28926604637da3 upstream. The inline assembly constraints of __insn32_query() tell the compiler that only the first byte of "query" is being written to. Intended was probably that 32 bytes are written to. Fix and simplify the code and just use a "memory" clobber. Fixes: d668139718a9 ("KVM: s390: provide query function for instructions returning 32 byte") Cc: stable@vger.kernel.org # v5.2+ Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
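The fixed inline assembly, roughly as described (a sketch following the commit's description; the .insn plumbing is simplified):

    static void __insn32_query(unsigned int opcode, u8 *query)
    {
            register unsigned long r0 asm("0") = 0;   /* query function */
            register unsigned long r1 asm("1") = (unsigned long)query;

            asm volatile(
                    "       .insn   rrf,%[opc] << 16,2,4,6,0\n"
                    :
                    : "d" (r0), "a" (r1), [opc] "i" (opcode)
                    : "cc", "memory");   /* all 32 bytes of *query change */
    }

Compared with the broken version, the "=m" (*query) output operand, which told the compiler only the first byte was written, is gone, and the "memory" clobber covers the full 32-byte store.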
2019-10-11 | s390/topology: avoid firing events before kobjs are created | Vasily Gorbik | 1 file, -1/+2
commit f3122a79a1b0a113d3aea748e0ec26f2cb2889de upstream. arch_update_cpu_topology is first called from:
  kernel_init_freeable->sched_init_smp->sched_init_domains
even before CPUs have been registered in:
  kernel_init_freeable->do_one_initcall->s390_smp_init

Do not trigger kobject_uevent change events until cpu devices are actually created. Fixes the following kasan findings:

  BUG: KASAN: global-out-of-bounds in kobject_uevent_env+0xb40/0xee0
  Read of size 8 at addr 0000000000000020 by task swapper/0/1

  BUG: KASAN: global-out-of-bounds in kobject_uevent_env+0xb36/0xee0
  Read of size 8 at addr 0000000000000018 by task swapper/0/1

  CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B
  Hardware name: IBM 3906 M04 704 (LPAR)
  Call Trace:
  ([<0000000143c6db7e>] show_stack+0x14e/0x1a8)
   [<0000000145956498>] dump_stack+0x1d0/0x218
   [<000000014429fb4c>] print_address_description+0x64/0x380
   [<000000014429f630>] __kasan_report+0x138/0x168
   [<0000000145960b96>] kobject_uevent_env+0xb36/0xee0
   [<0000000143c7c47c>] arch_update_cpu_topology+0x104/0x108
   [<0000000143df9e22>] sched_init_domains+0x62/0xe8
   [<000000014644c94a>] sched_init_smp+0x3a/0xc0
   [<0000000146433a20>] kernel_init_freeable+0x558/0x958
   [<000000014599002a>] kernel_init+0x22/0x160
   [<00000001459a71d4>] ret_from_fork+0x28/0x30
   [<00000001459a71dc>] kernel_thread_starter+0x0/0x10

Cc: stable@vger.kernel.org Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11 | KVM: s390: Test for bad access register and size at the start of S390_MEM_OP | Thomas Huth | 1 file, -1/+1
commit a13b03bbb4575b350b46090af4dfd30e735aaed1 upstream. If the KVM_S390_MEM_OP ioctl is called with an access register >= 16, then there is certainly a bug in the calling userspace application. We check for wrong access registers, but only if the vCPU was already in the access register mode before (i.e. the SIE block has recorded it). The check is also buried somewhere deep in the calling chain (in the function ar_translation()), so this is somewhat hard to find. It's better to always report an error to the userspace in case this field is set wrong, and it's safer in the KVM code if we block wrong values here early instead of relying on a check somewhere deep down the calling chain, so let's add another check to kvm_s390_guest_mem_op() directly. We also should check that the "size" is non-zero here (thanks to Janosch Frank for the hint!). If we do not check the size, we could call vmalloc() with this 0 value, and this will cause a kernel warning. Signed-off-by: Thomas Huth <thuth@redhat.com> Link: https://lkml.kernel.org/r/20190829122517.31042-1-thuth@redhat.com Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
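A sketch of the early validation (simplified; upstream folds this into the existing flag checks in kvm_s390_guest_mem_op()):

    static int kvm_s390_mem_op_check(const struct kvm_s390_mem_op *mop)
    {
            if (mop->ar >= NUM_ACRS)   /* only access registers 0..15 exist */
                    return -EINVAL;
            if (mop->size == 0)        /* would otherwise reach vmalloc(0) */
                    return -EINVAL;
            return 0;
    }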
2019-10-11 | s390/process: avoid potential reading of freed stack | Vasily Gorbik | 1 file, -6/+16
commit 8769f610fe6d473e5e8e221709c3ac402037da6c upstream. With THREAD_INFO_IN_TASK (which is selected on s390) task's stack usage is refcounted and should always be protected by get/put when touching other task's stack to avoid race conditions with task's destruction code. Fixes: d5c352cdd022 ("s390: move thread_info into task_struct") Cc: stable@vger.kernel.org # v4.10+ Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
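The pattern the fix applies, sketched on a get_wchan()-style walker (the stack walk itself is elided and left as a comment):

    static unsigned long sketch_get_wchan(struct task_struct *p)
    {
            unsigned long ip = 0;

            if (!p || p == current || p->state == TASK_RUNNING)
                    return 0;
            if (!try_get_task_stack(p))   /* stack already freed: bail out */
                    return 0;

            /* ... walk p's kernel stack frames to find the blocking ip ... */

            put_task_stack(p);   /* drop the reference; stack may now be freed */
            return ip;
    }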
2019-10-07 | KVM: hyperv: Fix Direct Synthetic timers assert an interrupt w/o lapic_in_kernel | Wanpeng Li | 1 file, -2/+10
commit a073d7e3ad687a7ef32b65affe80faa7ce89bf92 upstream. Reported by syzkaller:

  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault: 0000 [#1] PREEMPT SMP KASAN
  RIP: 0010:__apic_accept_irq+0x46/0x740 arch/x86/kvm/lapic.c:1029
  Call Trace:
   kvm_apic_set_irq+0xb4/0x140 arch/x86/kvm/lapic.c:558
   stimer_notify_direct arch/x86/kvm/hyperv.c:648 [inline]
   stimer_expiration arch/x86/kvm/hyperv.c:659 [inline]
   kvm_hv_process_stimers+0x594/0x1650 arch/x86/kvm/hyperv.c:686
   vcpu_enter_guest+0x2b2a/0x54b0 arch/x86/kvm/x86.c:7896
   vcpu_run+0x393/0xd40 arch/x86/kvm/x86.c:8152
   kvm_arch_vcpu_ioctl_run+0x636/0x900 arch/x86/kvm/x86.c:8360
   kvm_vcpu_ioctl+0x6cf/0xaf0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2765

The testcase programs HV_X64_MSR_STIMERn_CONFIG/HV_X64_MSR_STIMERn_COUNT while there is no lapic in the kernel, and the counter values are small enough that kvm_hv_process_stimers() injects this already-expired timer interrupt into the guest through the in-kernel lapic, which triggers the NULL dereference. This patch fixes it by not advertising direct mode synthetic timers and by discarding the injection when the lapic is not in the kernel. syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=1752fe0a600000 Reported-by: syzbot+dff25ee91f0c7d5c1695@syzkaller.appspotmail.com Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
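A sketch of the guard in the direct-timer expiration path (the stimer_to_vcpu() conversion and the config field layout approximate the KVM code and are assumptions):

    static int stimer_notify_direct(struct kvm_vcpu_hv_stimer *stimer)
    {
            struct kvm_vcpu *vcpu = stimer_to_vcpu(stimer);   /* assumed helper */
            struct kvm_lapic_irq irq = {
                    .delivery_mode = APIC_DM_FIXED,
                    .vector = stimer->config.apic_vector,
            };

            if (!lapic_in_kernel(vcpu))   /* no in-kernel lapic: discard */
                    return -ENOENT;
            return !kvm_apic_set_irq(vcpu, &irq, NULL);
    }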
2019-10-07 | arm: use STACK_TOP when computing mmap base address | Alexandre Ghiti | 1 file, -2/+2
[ Upstream commit 86e568e9c0525fc40e76d827212d5e9721cf7504 ] The mmap base address must be computed with respect to the stack top address; using TASK_SIZE is wrong since STACK_TOP and TASK_SIZE are not equivalent. Link: http://lkml.kernel.org/r/20190730055113.23635-8-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: James Hogan <jhogan@kernel.org> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-07 | arm: properly account for stack randomization and stack guard gap | Alexandre Ghiti | 1 file, -2/+12
[ Upstream commit af0f4297286f13a75edf93677b1fb2fc16c412a7 ] This commit takes care of stack randomization and stack guard gap when computing mmap base address and checks if the task asked for randomization. This fixes the problem uncovered and not fixed for arm here: https://lkml.kernel.org/r/20170622200033.25714-1-riel@redhat.com Link: http://lkml.kernel.org/r/20190730055113.23635-7-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: James Hogan <jhogan@kernel.org> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
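The combined computation, essentially as the commit describes it (MIN_GAP/MAX_GAP and stack_guard_gap are existing kernel symbols; the mips entry below applies the same shape, and the arm64 entry below adds the same PF_RANDOMIZE gate):

    static unsigned long mmap_base(unsigned long rnd, struct rlimit *rlim_stack)
    {
            unsigned long gap = rlim_stack->rlim_cur;
            unsigned long pad = stack_guard_gap;

            /* Account for stack randomization if necessary */
            if (current->flags & PF_RANDOMIZE)
                    pad += (STACK_RND_MASK << PAGE_SHIFT);

            /* Values close to RLIM_INFINITY can overflow. */
            if (gap + pad > gap)
                    gap += pad;

            if (gap < MIN_GAP)
                    gap = MIN_GAP;
            else if (gap > MAX_GAP)
                    gap = MAX_GAP;

            return PAGE_ALIGN(STACK_TOP - gap - rnd);
    }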
2019-10-07 | mips: properly account for stack randomization and stack guard gap | Alexandre Ghiti | 1 file, -2/+12
[ Upstream commit b1f61b5bde3a1f50392c97b4c8513d1b8efb1cf2 ] This commit takes care of stack randomization and stack guard gap when computing mmap base address and checks if the task asked for randomization. This fixes the problem uncovered and not fixed for arm here: https://lkml.kernel.org/r/20170622200033.25714-1-riel@redhat.com Link: http://lkml.kernel.org/r/20190730055113.23635-10-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Paul Burton <paul.burton@mips.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: James Hogan <jhogan@kernel.org> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-07arm64: consider stack randomization for mmap base only when necessaryAlexandre Ghiti1-1/+5
[ Upstream commit e8d54b62c55ab6201de6d195fc2c276294c1f6ae ] Do not offset the mmap base address for stack randomization if the current task does not want randomization. Note that x86 already implements this behaviour. Link: http://lkml.kernel.org/r/20190730055113.23635-4-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: James Hogan <jhogan@kernel.org> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
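The core of the arm64 change is the same PF_RANDOMIZE test shown for the arm entry above, applied to arm64's existing pad computation (a sketch; previously the slack was added unconditionally):

            /* Account for stack randomization only if the task wants it. */
            if (current->flags & PF_RANDOMIZE)
                    pad += (STACK_RND_MASK << PAGE_SHIFT);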
2019-10-07hypfs: Fix error number left in struct pointer memberDavid Howells1-4/+5
[ Upstream commit b54c64f7adeb241423cd46598f458b5486b0375e ] In hypfs_fill_super(), if hypfs_create_update_file() fails, sbi->update_file is left holding an error number. This is then passed to hypfs_kill_super(), which doesn't check for it. Fix this by not setting sbi->update_file until after we've checked for an error. Fixes: 24bbb1faf3f0 ("[PATCH] s390_hypfs filesystem") Signed-off-by: David Howells <dhowells@redhat.com> cc: Martin Schwidefsky <schwidefsky@de.ibm.com> cc: Heiko Carstens <heiko.carstens@de.ibm.com> cc: linux-s390@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
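The usual shape of this kind of fix, as a sketch (names taken from the commit message; the root_dentry argument and surrounding code are assumptions):

            struct dentry *update_file;

            update_file = hypfs_create_update_file(root_dentry);
            if (IS_ERR(update_file))
                    return PTR_ERR(update_file);
            /* Publish the pointer only once it is known to be valid, so
             * hypfs_kill_super() never sees an ERR_PTR() in update_file. */
            sbi->update_file = update_file;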
2019-10-07ARM: 8903/1: ensure that usable memory in bank 0 starts from a PMD-aligned addressMike Rapoport1-0/+16
[ Upstream commit 00d2ec1e6bd82c0538e6dd3e4a4040de93ba4fef ] The calculation of memblock_limit in adjust_lowmem_bounds() assumes that bank 0 starts from a PMD-aligned address. However, the beginning of the first bank may be NOMAP memory, and the start of usable memory will then not be aligned to a PMD boundary. In such a case, memblock_limit will be set to the end of the NOMAP region, which will prevent any memblock allocations. Mark the region between the end of the NOMAP area and the next PMD-aligned address as NOMAP as well, so that the usable memory starts at a PMD-aligned address. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
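A sketch of the idea (memblock_mark_nomap(), IS_ALIGNED() and round_up() are existing kernel helpers; the exact placement inside adjust_lowmem_bounds() is an assumption):

            if (!IS_ALIGNED(block_start, PMD_SIZE)) {
                    phys_addr_t len = round_up(block_start, PMD_SIZE) - block_start;

                    /* Extend the NOMAP area up to the next PMD boundary so
                     * usable memory starts PMD-aligned and memblock_limit is
                     * no longer pinned to the end of the NOMAP region. */
                    memblock_mark_nomap(block_start, len);
            }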
2019-10-07ARM: 8905/1: Emit __gnu_mcount_nc when using Clang 10.0.0 or newerNathan Chancellor2-1/+5
[ Upstream commit b0fe66cf095016e0b238374c10ae366e1f087d11 ] Currently, multi_v7_defconfig + CONFIG_FUNCTION_TRACER fails to build with clang: arm-linux-gnueabi-ld: kernel/softirq.o: in function `_local_bh_enable': softirq.c:(.text+0x504): undefined reference to `mcount' arm-linux-gnueabi-ld: kernel/softirq.o: in function `__local_bh_enable_ip': softirq.c:(.text+0x58c): undefined reference to `mcount' arm-linux-gnueabi-ld: kernel/softirq.o: in function `do_softirq': softirq.c:(.text+0x6c8): undefined reference to `mcount' arm-linux-gnueabi-ld: kernel/softirq.o: in function `irq_enter': softirq.c:(.text+0x75c): undefined reference to `mcount' arm-linux-gnueabi-ld: kernel/softirq.o: in function `irq_exit': softirq.c:(.text+0x840): undefined reference to `mcount' arm-linux-gnueabi-ld: kernel/softirq.o:softirq.c:(.text+0xa50): more undefined references to `mcount' follow clang can emit a working mcount symbol, __gnu_mcount_nc, when '-meabi gnu' is passed to it. Until r369147 in LLVM, this was broken and caused the kernel not to boot with '-pg' because the calling convention was not correct. Always build with '-meabi gnu' when using clang but ensure that '-pg' (which is added with CONFIG_FUNCTION_TRACER and its prereq CONFIG_HAVE_FUNCTION_TRACER) cannot be added with it unless this is fixed (which means using clang 10.0.0 and newer). Link: https://github.com/ClangBuiltLinux/linux/issues/35 Link: https://bugs.llvm.org/show_bug.cgi?id=33845 Link: https://github.com/llvm/llvm-project/commit/16fa8b09702378bacfa3d07081afe6b353b99e60 Reviewed-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Stefan Agner <stefan@agner.ch> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-07ARM: 8875/1: Kconfig: default to AEABI w/ ClangNick Desaulniers1-2/+3
[ Upstream commit a05b9608456e0d4464c6f7ca8572324ace57a3f4 ] Clang produces references to __aeabi_uidivmod and __aeabi_idivmod for arm-linux-gnueabi and arm-linux-gnueabihf targets incorrectly when AEABI is not selected (such as when OABI_COMPAT is selected). While this means that OABI userspaces won't be able to upgrade to kernels built with Clang, it means that boards that don't enable AEABI, such as s3c2410_defconfig, will stop failing to link in KernelCI when built with Clang. Link: https://github.com/ClangBuiltLinux/linux/issues/482 Link: https://groups.google.com/forum/#!msg/clang-built-linux/yydsAAux5hk/GxjqJSW-AQAJ Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
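A stand-alone illustration of where such references come from (not kernel code): when the target CPU lacks a hardware divide instruction, integer division on ARM EABI targets is lowered to calls to runtime helpers:

    unsigned int udiv(unsigned int a, unsigned int b)
    {
            return a / b;        /* lowered to a call to __aeabi_uidiv */
    }

    unsigned int umod(unsigned int a, unsigned int b)
    {
            return a % b;        /* lowered to a call to __aeabi_uidivmod */
    }

If the kernel is built without AEABI, nothing provides those helpers at link time, hence the failures.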
2019-10-07ARM: 8898/1: mm: Don't treat faults reported from cache maintenance as writesWill Deacon2-2/+3
[ Upstream commit 834020366da9ab3fb87d1eb9a3160eb22dbed63a ] Translation faults arising from cache maintenance instructions are rather unhelpfully reported with an FSR value where the WnR field is set to 1, indicating that the faulting access was a write. Since cache maintenance instructions on 32-bit ARM do not require any particular permissions, this can cause our private 'cacheflush' system call to fail spuriously if a translation fault is generated due to page aging when targeting a read-only VMA. In this situation, we will return -EFAULT to userspace, although this is unfortunately suppressed by the popular '__builtin___clear_cache()' intrinsic provided by GCC, which returns void. Although it's tempting to write this off as a userspace issue, we can actually do a little bit better on CPUs that support LPAE, even if the short-descriptor format is in use. On these CPUs, cache maintenance faults additionally set the CM field in the FSR, which we can use to suppress the write permission checks in the page fault handler and succeed in performing cache maintenance to read-only areas even in the presence of a translation fault. Reported-by: Orion Hodson <oth@google.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
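The resulting check can be sketched as follows (bit positions per the short-descriptor DFSR layout on LPAE-capable CPUs; treat them as assumptions here rather than the exact patch):

    #define FSR_WRITE       (1 << 11)       /* WnR: faulting access was a write */
    #define FSR_CM          (1 << 13)       /* fault raised by cache maintenance */

    /* Cache maintenance needs no write permission, so only treat the
     * fault as a write when the CM bit is clear. */
    static inline bool is_write_fault(unsigned int fsr)
    {
            return (fsr & FSR_WRITE) && !(fsr & FSR_CM);
    }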
2019-10-07mips/atomic: Fix smp_mb__{before,after}_atomic()Peter Zijlstra4-29/+45
[ Upstream commit 42344113ba7a1ed7b5654cd5270af0d5698d8521 ] Recent probing at the Linux Kernel Memory Model uncovered a 'surprise'. Strongly ordered architectures where the atomic RmW primitive implies full memory ordering and smp_mb__{before,after}_atomic() are a simple barrier() (such as MIPS without WEAK_REORDERING_BEYOND_LLSC) fail for: *x = 1; atomic_inc(u); smp_mb__after_atomic(); r0 = *y; Because, while the atomic_inc() implies memory order, it (surprisingly) does not provide a compiler barrier. This then allows the compiler to re-order like so: atomic_inc(u); *x = 1; smp_mb__after_atomic(); r0 = *y; Which the CPU is then allowed to re-order (under TSO rules) like: atomic_inc(u); r0 = *y; *x = 1; And this very much was not intended. Therefore strengthen the atomic RmW ops to include a compiler barrier. Reported-by: Andrea Parri <andrea.parri@amarulasolutions.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paul Burton <paul.burton@mips.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
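An illustrative LL/SC increment showing where the strengthening lands (a sketch, not the kernel's exact asm): the "memory" clobber is what makes the operation a compiler barrier as well as an atomic op, so a store like *x = 1 can no longer be reordered across it by the compiler.

    static inline void atomic_inc_sketch(int *v)
    {
            int tmp;

            __asm__ __volatile__(
            "1:     ll      %0, %1          \n"
            "       addiu   %0, %0, 1       \n"
            "       sc      %0, %1          \n"
            "       beqz    %0, 1b          \n"
            : "=&r" (tmp), "+m" (*v)
            : /* no inputs */
            : "memory");    /* the added compiler barrier */
    }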