Age | Commit message (Collapse) | Author | Files | Lines |
|
commit e1c3743e1a20647c53b719dbf28b48f45d23f2cd upstream.
On a signal handler return, the user could set a context with MSR[TS] bits
set, and these bits would be copied to task regs->msr.
At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set,
several __get_user() are called and then a recheckpoint is executed.
This is a problem since a page fault (in kernel space) could happen when
calling __get_user(). If it happens, the process MSR[TS] bits were
already set, but recheckpoint was not executed, and SPRs are still invalid.
The page fault can cause the current process to be de-scheduled, with
MSR[TS] active and without tm_recheckpoint() being called. More
importantly, without TEXASR[FS] bit set also.
Since TEXASR might not have the FS bit set, and when the process is
scheduled back, it will try to reclaim, which will be aborted because of
the CPU is not in the suspended state, and, then, recheckpoint. This
recheckpoint will restore thread->texasr into TEXASR SPR, which might be
zero, hitting a BUG_ON().
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
cpu 0xb: Vector: 700 (Program Check) at [c00000041f1576d0]
pc: c000000000054550: restore_gprs+0xb0/0x180
lr: 0000000000000000
sp: c00000041f157950
msr: 8000000100021033
current = 0xc00000041f143000
paca = 0xc00000000fb86300 softe: 0 irq_happened: 0x01
pid = 1021, comm = kworker/11:1
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
enter ? for help
[c00000041f157b30] c00000000001bc3c tm_recheckpoint.part.11+0x6c/0xa0
[c00000041f157b70] c00000000001d184 __switch_to+0x1e4/0x4c0
[c00000041f157bd0] c00000000082eeb8 __schedule+0x2f8/0x990
[c00000041f157cb0] c00000000082f598 schedule+0x48/0xc0
[c00000041f157ce0] c0000000000f0d28 worker_thread+0x148/0x610
[c00000041f157d80] c0000000000f96b0 kthread+0x120/0x140
[c00000041f157e30] c00000000000c0e0 ret_from_kernel_thread+0x5c/0x7c
This patch simply delays the MSR[TS] set, so, if there is any page fault in
the __get_user() section, it does not have regs->msr[TS] set, since the TM
structures are still invalid, thus avoiding doing TM operations for
in-kernel exceptions and possible process reschedule.
With this patch, the MSR[TS] will only be set just before recheckpointing
and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
invalid state.
Other than that, if CONFIG_PREEMPT is set, there might be a preemption just
after setting MSR[TS] and before tm_recheckpoint(), thus, this block must
be atomic from a preemption perspective, thus, calling
preempt_disable/enable() on this code.
It is not possible to move tm_recheckpoint to happen earlier, because it is
required to get the checkpointed registers from userspace, with
__get_user(), thus, the only way to avoid this undesired behavior is
delaying the MSR[TS] set.
The 32-bits signal handler seems to be safe this current issue, but, it
might be exposed to the preemption issue, thus, disabling preemption in
this chunk of code.
Changes from v2:
* Run the critical section with preempt_disable.
Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals")
Cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit aea447141c7e7824b81b49acd1bc785506fba46e upstream.
The powerpc kernel uses setjmp which causes a warning when building
with clang:
In file included from arch/powerpc/xmon/xmon.c:51:
./arch/powerpc/include/asm/setjmp.h:15:13: error: declaration of
built-in function 'setjmp' requires inclusion of the header <setjmp.h>
[-Werror,-Wbuiltin-requires-header]
extern long setjmp(long *);
^
./arch/powerpc/include/asm/setjmp.h:16:13: error: declaration of
built-in function 'longjmp' requires inclusion of the header <setjmp.h>
[-Werror,-Wbuiltin-requires-header]
extern void longjmp(long *, long);
^
This *is* the header and we're not using the built-in setjump but
rather the one in arch/powerpc/kernel/misc.S. As the compiler warning
does not make sense, it for the files where setjmp is used.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
[mpe: Move subdir-ccflags in xmon/Makefile to not clobber -Werror]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 78e7b15e17ac175e7eed9e21c6f92d03d3b0a6fa upstream.
The arch_teardown_msi_irqs() function assumes that controller ops
pointers were already checked in arch_setup_msi_irqs(), but this
assumption is wrong: arch_teardown_msi_irqs() can be called even when
arch_setup_msi_irqs() returns an error (-ENOSYS).
This can happen in the following scenario:
- msi_capability_init() calls pci_msi_setup_msi_irqs()
- pci_msi_setup_msi_irqs() returns -ENOSYS
- msi_capability_init() notices the error and calls free_msi_irqs()
- free_msi_irqs() calls pci_msi_teardown_msi_irqs()
This is easier to see when CONFIG_PCI_MSI_IRQ_DOMAIN is not set and
pci_msi_setup_msi_irqs() and pci_msi_teardown_msi_irqs() are just
aliases to arch_setup_msi_irqs() and arch_teardown_msi_irqs().
The call to free_msi_irqs() upon pci_msi_setup_msi_irqs() failure
seems legit, as it does additional cleanup; e.g.
list_del(&entry->list) and kfree(entry) inside free_msi_irqs() do
happen (MSI descriptors are allocated before pci_msi_setup_msi_irqs()
is called and need to be cleaned up if that fails).
Fixes: 6b2fd7efeb88 ("PCI/MSI/PPC: Remove arch_msi_check_device()")
Cc: stable@vger.kernel.org # v3.18+
Signed-off-by: Radu Rendec <radu.rendec@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f9bc28aedfb5bbd572d2d365f3095c1becd7209b ]
If an error occurs during an unplug operation, it's possible for
eeh_dump_dev_log() to be called when edev->pdn is null, which
currently leads to dereferencing a null pointer.
Handle this by skipping the error log for those devices.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b851ba02a6f3075f0f99c60c4bc30a4af80cf428 ]
The recent module relocation overflow crash demonstrated that we
have no range checking on REL32 relative relocations. This patch
implements a basic check, the same kernel that previously oopsed
and rebooted now continues with some of these errors when loading
the module:
module_64: x_tables: REL32 527703503449812 out of range!
Possibly other relocations (ADDR32, REL16, TOC16, etc.) should also have
overflow checks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit daf00ae71dad8aa05965713c62558aeebf2df48e ]
commit b96672dd840f ("powerpc: Machine check interrupt is a non-
maskable interrupt") added a call to nmi_enter() at the beginning of
machine check restart exception handler. Due to that, in_interrupt()
always returns true regardless of the state before entering the
exception, and die() panics even when the system was not already in
interrupt.
This patch calls nmi_exit() before calling die() in order to restore
the interrupt state we had before calling nmi_enter()
Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 96dc89d526ef77604376f06220e3d2931a0bfd58 ]
Current we store the userspace r1 to PACATMSCRATCH before finally
saving it to the thread struct.
In theory an exception could be taken here (like a machine check or
SLB miss) that could write PACATMSCRATCH and hence corrupt the
userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but
others do.
We've never actually seen this happen but it's theoretically
possible. Either way, the code is fragile as it is.
This patch saves r1 to the kernel stack (which can't fault) before we
turn MSR[RI] back on. PACATMSCRATCH is still used but only with
MSR[RI] off. We then copy r1 from the kernel stack to the thread
struct once we have MSR[RI] back on.
Suggested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit cf13435b730a502e814c63c84d93db131e563f5f ]
When we treclaim we store the userspace checkpointed r13 to a scratch
SPR and then later save the scratch SPR to the user thread struct.
Unfortunately, this doesn't work as accessing the user thread struct
can take an SLB fault and the SLB fault handler will write the same
scratch SPRG that now contains the userspace r13.
To fix this, we store r13 to the kernel stack (which can't fault)
before we access the user thread struct.
Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen
as a random userspace segfault with r13 looking like a kernel address.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8950329c4a64c6d3ca0bc34711a1afbd9ce05657 ]
Memory reservation for crashkernel could fail if there are holes around
kdump kernel offset (128M). Fail gracefully in such cases and print an
error message.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: David Gibson <dgibson@redhat.com>
Reviewed-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 78ee9946371f5848ddfc88ab1a43867df8f17d83 ]
Because rfi_flush_fallback runs immediately before the return to
userspace it currently runs with the user r1 (stack pointer). This
means if we oops in there we will report a bad kernel stack pointer in
the exception entry path, eg:
Bad kernel stack pointer 7ffff7150e40 at c0000000000023b4
Oops: Bad kernel stack pointer, sig: 6 [#1]
LE SMP NR_CPUS=32 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1246 Comm: klogd Not tainted 4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3 #7
NIP: c0000000000023b4 LR: 0000000010053e00 CTR: 0000000000000040
REGS: c0000000fffe7d40 TRAP: 4100 Not tainted (4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3)
MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000
CFAR: c00000000000bac8 IRQMASK: c0000000f1e66a80
GPR00: 0000000002000000 00007ffff7150e40 00007fff93a99900 0000000000000020
...
NIP [c0000000000023b4] rfi_flush_fallback+0x34/0x80
LR [0000000010053e00] 0x10053e00
Although the NIP tells us where we were, and the TRAP number tells us
what happened, it would still be nicer if we could report the actual
exception rather than barfing about the stack pointer.
We an do that fairly simply by loading the kernel stack pointer on
entry and restoring the user value before returning. That way we see a
regular oops such as:
Unrecoverable exception 4100 at c00000000000239c
Oops: Unrecoverable exception, sig: 6 [#1]
LE SMP NR_CPUS=32 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1251 Comm: klogd Not tainted 4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty #40
NIP: c00000000000239c LR: 0000000010053e00 CTR: 0000000000000040
REGS: c0000000f1e17bb0 TRAP: 4100 Not tainted (4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty)
MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000
CFAR: c00000000000bac8 IRQMASK: 0
...
NIP [c00000000000239c] rfi_flush_fallback+0x3c/0x80
LR [0000000010053e00] 0x10053e00
Call Trace:
[c0000000f1e17e30] [c00000000000b9e4] system_call+0x5c/0x70 (unreliable)
Note this shouldn't make the kernel stack pointer vulnerable to a
meltdown attack, because it should be flushed from the cache before we
return to userspace. The user r1 value will be in the cache, because
we load it in the return path, but that is harmless.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1bd6a1c4b80a28d975287630644e6b47d0f977a5 upstream.
Crash memory ranges is an array of memory ranges of the crashing kernel
to be exported as a dump via /proc/vmcore file. The size of the array
is set based on INIT_MEMBLOCK_REGIONS, which works alright in most cases
where memblock memory regions count is less than INIT_MEMBLOCK_REGIONS
value. But this count can grow beyond INIT_MEMBLOCK_REGIONS value since
commit 142b45a72e22 ("memblock: Add array resizing support").
On large memory systems with a few DLPAR operations, the memblock memory
regions count could be larger than INIT_MEMBLOCK_REGIONS value. On such
systems, registering fadump results in crash or other system failures
like below:
task: c00007f39a290010 ti: c00000000b738000 task.ti: c00000000b738000
NIP: c000000000047df4 LR: c0000000000f9e58 CTR: c00000000010f180
REGS: c00000000b73b570 TRAP: 0300 Tainted: G L X (4.4.140+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22004484 XER: 20000000
CFAR: c000000000008500 DAR: 000007a450000000 DSISR: 40000000 SOFTE: 0
...
NIP [c000000000047df4] smp_send_reschedule+0x24/0x80
LR [c0000000000f9e58] resched_curr+0x138/0x160
Call Trace:
resched_curr+0x138/0x160 (unreliable)
check_preempt_curr+0xc8/0xf0
ttwu_do_wakeup+0x38/0x150
try_to_wake_up+0x224/0x4d0
__wake_up_common+0x94/0x100
ep_poll_callback+0xac/0x1c0
__wake_up_common+0x94/0x100
__wake_up_sync_key+0x70/0xa0
sock_def_readable+0x58/0xa0
unix_stream_sendmsg+0x2dc/0x4c0
sock_sendmsg+0x68/0xa0
___sys_sendmsg+0x2cc/0x2e0
__sys_sendmsg+0x5c/0xc0
SyS_socketcall+0x36c/0x3f0
system_call+0x3c/0x100
as array index overflow is not checked for while setting up crash memory
ranges causing memory corruption. To resolve this issue, dynamically
allocate memory for crash memory ranges and resize it incrementally,
in units of pagesize, on hitting array size limit.
Fixes: 2df173d9e85d ("fadump: Initialize elfcore header and add PT_LOAD program headers.")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
[mpe: Just use PAGE_SIZE directly, fixup variable placement]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e4ccb1dae6bdef228d729c076c38161ef6e7ca34 ]
New binutils generate the following warning
AS arch/powerpc/kernel/head_8xx.o
arch/powerpc/kernel/head_8xx.S: Assembler messages:
arch/powerpc/kernel/head_8xx.S:916: Warning: invalid register expression
This patch fixes it.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit eae5f709a4d738c52b6ab636981755d76349ea9e ]
__printf is useful to verify format and arguments. Fix arg mismatch
reported by gcc, remove the following warnings (with W=1):
arch/powerpc/kernel/prom_init.c:1467:31: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1471:31: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1504:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1505:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1506:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1507:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1508:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1509:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:1975:39: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘unsigned int’
arch/powerpc/kernel/prom_init.c:1986:27: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:2567:38: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:2567:46: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 3 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:2569:38: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
arch/powerpc/kernel/prom_init.c:2569:46: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 3 has type ‘long unsigned int’
The patch also include arg mismatch fix for case with #define DEBUG_PROM
(warning not listed here).
This patch fix also the following warnings revealed by checkpatch:
WARNING: Prefer using '"%s...", __func__' to using 'alloc_up', this function's name, in a string
#101: FILE: arch/powerpc/kernel/prom_init.c:1235:
+ prom_debug("alloc_up(%lx, %lx)\n", size, align);
and
WARNING: Prefer using '"%s...", __func__' to using 'alloc_down', this function's name, in a string
#138: FILE: arch/powerpc/kernel/prom_init.c:1278:
+ prom_debug("alloc_down(%lx, %lx, %s)\n", size, align,
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c89ca593220931c150cffda24b4d4ccf82f13fc8 ]
The header file <linux/syscalls.h> was missing from the includes. Fix the
following warning, treated as error with W=1:
arch/powerpc/kernel/pci_32.c:286:6: error: no previous prototype for ‘sys_pciconfig_iobase’ [-Werror=missing-prototypes]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 46d4be41b987a6b2d25a2ebdd94cafb44e21d6c5 ]
Correct two cases where eeh_pcid_get() is used to reference the driver's
module but the reference is dropped before the driver pointer is used.
In eeh_rmv_device() also refactor a little so that only two calls to
eeh_pcid_put() are needed, rather than three and the reference isn't
taken at all if it wasn't needed.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b03897cf318dfc47de33a7ecbc7655584266f034 upstream.
On 64-bit servers, SPRN_SPRG3 and its userspace read-only mirror
SPRN_USPRG3 are used as userspace VDSO write and read registers
respectively.
SPRN_SPRG3 is lost when we enter stop4 and above, and is currently not
restored. As a result, any read from SPRN_USPRG3 returns zero on an
exit from stop4 (Power9 only) and above.
Thus in this situation, on POWER9, any call from sched_getcpu() always
returns zero, as on powerpc, we call __kernel_getcpu() which relies
upon SPRN_USPRG3 to report the CPU and NUMA node information.
Fix this by restoring SPRN_SPRG3 on wake up from a deep stop state
with the sprg_vdso value that is cached in PACA.
Fixes: e1c1cfed5432 ("powerpc/powernv: Save/Restore additional SPRs for stop4 cpuidle")
Cc: stable@vger.kernel.org # v4.14+
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 722cde76d68e8cc4f3de42e71c82fd40dea4f7b9 upstream.
Unregister fadump on kexec down path otherwise the fadump registration
in new kexec-ed kernel complains that fadump is already registered.
This makes new kernel to continue using fadump registered by previous
kernel which may lead to invalid vmcore generation. Hence this patch
fixes this issue by un-registering fadump in fadump_cleanup() which is
called during kexec path so that new kernel can register fadump with
new valid values.
Fixes: b500afff11f6 ("fadump: Invalidate registration and release reserved memory for general use.")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cd6ef7eebf171bfcba7dc2df719c2a4958775040 upstream.
Back when we first introduced the DAWR, in commit 4ae7ebe9522a
("powerpc: Change hardware breakpoint to allow longer ranges"), we
screwed up the constraint making it a 1024 byte boundary rather than a
512. This makes the check overly permissive. Fortunately GDB is the
only real user and it always did they right thing, so we never
noticed.
This fixes the constraint to 512 bytes.
Fixes: 4ae7ebe9522a ("powerpc: Change hardware breakpoint to allow longer ranges")
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4f7c06e26ec9cf7fe9f0c54dc90079b6a4f4b2c3 upstream.
In commit e2a800beaca1 ("powerpc/hw_brk: Fix off by one error when
validating DAWR region end") we fixed setting the DAWR end point to
its max value via PPC_PTRACE_SETHWDEBUG. Unfortunately we broke
PTRACE_SET_DEBUGREG when setting a 512 byte aligned breakpoint.
PTRACE_SET_DEBUGREG currently sets the length of the breakpoint to
zero (memset() in hw_breakpoint_init()). This worked with
arch_validate_hwbkpt_settings() before the above patch was applied but
is now broken if the breakpoint is 512byte aligned.
This sets the length of the breakpoint to 8 bytes when using
PTRACE_SET_DEBUGREG.
Fixes: e2a800beaca1 ("powerpc/hw_brk: Fix off by one error when validating DAWR region end")
Cc: stable@vger.kernel.org # v3.11+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 91d06971881f71d945910de128658038513d1b24 upstream.
Currently we do not have an isync, or any other context synchronizing
instruction prior to the slbie/slbmte in _switch() that updates the
SLB entry for the kernel stack.
However that is not correct as outlined in the ISA.
From Power ISA Version 3.0B, Book III, Chapter 11, page 1133:
"Changing the contents of ... the contents of SLB entries ... can
have the side effect of altering the context in which data
addresses and instruction addresses are interpreted, and in which
instructions are executed and data accesses are performed.
...
These side effects need not occur in program order, and therefore
may require explicit synchronization by software.
...
The synchronizing instruction before the context-altering
instruction ensures that all instructions up to and including that
synchronizing instruction are fetched and executed in the context
that existed before the alteration."
And page 1136:
"For data accesses, the context synchronizing instruction before the
slbie, slbieg, slbia, slbmte, tlbie, or tlbiel instruction ensures
that all preceding instructions that access data storage have
completed to a point at which they have reported all exceptions
they will cause."
We're not aware of any bugs caused by this, but it should be fixed
regardless.
Add the missing isync when updating kernel stack SLB entry.
Cc: stable@vger.kernel.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Flesh out change log with more ISA text & explanation]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit aa0ab02ba992eb956934b21373e0138211486ddd upstream.
On the 8xx, the page size is set in the PMD entry and applies to
all pages of the page table pointed by the said PMD entry.
When an app has some regular pages allocated (e.g. see below) and tries
to mmap() a huge page at a hint address covered by the same PMD entry,
the kernel accepts the hint allthough the 8xx cannot handle different
page sizes in the same PMD entry.
10000000-10001000 r-xp 00000000 00:0f 2597 /root/malloc
10010000-10011000 rwxp 00000000 00:0f 2597 /root/malloc
mmap(0x10080000, 524288, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|0x40000, -1, 0) = 0x10080000
This results the app remaining forever in do_page_fault()/hugetlb_fault()
and when interrupting that app, we get the following warning:
[162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4
[162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W 4.14.6 #85
[162980.035744] task: c67e2c00 task.stack: c668e000
[162980.035783] NIP: c000fe18 LR: c00e1eec CTR: c00f90c0
[162980.035830] REGS: c668fc20 TRAP: 0700 Tainted: G W (4.14.6)
[162980.035854] MSR: 00029032 <EE,ME,IR,DR,RI> CR: 24044224 XER: 20000000
[162980.036003]
[162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 00000010 c6869410 10080000 00000000 77fb4000
[162980.036003] GPR08: ffff0001 0683c001 00000000 ffffff80 44028228 10018a34 00004008 418004fc
[162980.036003] GPR16: c668e000 00040100 c668e000 c06c0000 c668fe78 c668e000 c6835ba0 c668fd48
[162980.036003] GPR24: 00000000 73ffffff 74000000 00000001 77fb4000 100fffff 10100000 10100000
[162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4
[162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150
[162980.036861] Call Trace:
[162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable)
[162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150
[162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4
[162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8
[162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c
[162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc
[162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614
[162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244
[162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88
[162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4
[162980.037781] Instruction dump:
[162980.037821] 7fdff378 81370000 54a3463a 80890020 7d24182e 7c841a14 712a0004 4082ff94
[162980.038014] 2f890000 419e0010 712a0ff0 408200e0 <0fe00000> 54a9000a 7f984840 419d0094
[162980.038216] ---[ end trace c0ceeca8e7a5800a ]---
[162980.038754] BUG: non-zero nr_ptes on freeing mm: 1
[162985.363322] BUG: non-zero nr_ptes on freeing mm: -1
In order to fix this, this patch uses the address space "slices"
implemented for BOOK3S/64 and enhanced to support PPC32 by the
preceding patch.
This patch modifies the context.id on the 8xx to be in the range
[1:16] instead of [0:15] in order to identify context.id == 0 as
not initialised contexts as done on BOOK3S
This patch activates CONFIG_PPC_MM_SLICES when CONFIG_HUGETLB_PAGE is
selected for the 8xx
Alltough we could in theory have as many slices as PMD entries, the
current slices implementation limits the number of low slices to 16.
This limitation is not preventing us to fix the initial issue allthough
it is suboptimal. It will be cured in a subsequent patch.
Fixes: 4b91428699477 ("powerpc/8xx: Implement support of hugepages")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d40b6768e45bd9213139b2d91d30c7692b6007b1 ]
system_reset_exception does most of its own crash handling now,
invoking the debugger or crash dumps if they are registered. If not,
then it goes through to die() to print stack traces, and then is
supposed to panic (according to comments).
However after die() prints oopses, it does its own handling which
doesn't allow system_reset_exception to panic (e.g., it may just
kill the current process). This patch causes sreset exceptions to
return from die after it prints messages but before acting.
This also stops die from invoking the debugger on 0x100 crashes.
system_reset_exception similarly calls the debugger. It had been
thought this was harmless (because if the debugger was disabled,
neither call would fire, and if it was enabled the first call
would return). However in some cases like xmon 'X' command, the
debugger returns 0, which currently causes it to be entered
again (first in system_reset_exception, then in die), which is
confusing.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c1b25a17d24925b0961c319cfc3fd7e1dc778914 ]
POWER8 restores AMOR when waking from deep sleep, but POWER9 does not,
because it does not go through the subcore restore.
Have POWER9 restore it in core restore.
Fixes: ee97b6b99f42 ("powerpc/mm/radix: Setup AMOR in HV mode to allow key 0")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 709b973c844c0b4d115ac3a227a2e5a68722c912 ]
The function get_user() can sleep while trying to fetch instruction
from user address space and causes the following warning from the
scheduler.
BUG: sleeping function called from invalid context
Though interrupts get enabled back but it happens bit later after
get_user() is called. This change moves enabling these interrupts
earlier covering the function get_user(). While at this, lets check
for kernel mode and crash as this interrupt should not have been
triggered from the kernel context.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a048a07d7f4535baa4cbad6bc024f175317ab938 upstream.
On some CPUs we can prevent a vulnerability related to store-to-load
forwarding by preventing store forwarding between privilege domains,
by inserting a barrier in kernel entry and exit paths.
This is known to be the case on at least Power7, Power8 and Power9
powerpc CPUs.
Barriers must be inserted generally before the first load after moving
to a higher privilege, and after the last store before moving to a
lower privilege, HV and PR privilege transitions must be protected.
Barriers are added as patch sections, with all kernel/hypervisor entry
points patched, and the exit points to lower privilge levels patched
similarly to the RFI flush patching.
Firmware advertisement is not implemented yet, so CPU flush types
are hard coded.
Thanks to Michal Suchánek for bug fixes and review.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michal Suchánek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 501a78cbc17c329fabf8e9750a1e9ab810c88a0e upstream.
The recent LPM changes to setup_rfi_flush() are causing some section
mismatch warnings because we removed the __init annotation on
setup_rfi_flush():
The function setup_rfi_flush() references
the function __init ppc64_bolted_size().
the function __init memblock_alloc_base().
The references are actually in init_fallback_flush(), but that is
inlined into setup_rfi_flush().
These references are safe because:
- only pseries calls setup_rfi_flush() at runtime
- pseries always passes L1D_FLUSH_FALLBACK at boot
- so the fallback flush area will always be allocated
- so the check in init_fallback_flush() will always return early:
/* Only allocate the fallback flush area once (at boot time). */
if (l1d_flush_fallback_area)
return;
- and therefore we won't actually call the freed init routines.
We should rework the code to make it safer by default rather than
relying on the above, but for now as a quick-fix just add a __ref
annotation to squash the warning.
Fixes: abf110f3e1ce ("powerpc/rfi-flush: Make it possible to call setup_rfi_flush() again")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e7347a86830f38dc3e40c8f7e28c04412b12a2e7 upstream.
This moves the definition of the default security feature flags
(i.e., enabled by default) closer to the security feature flags.
This can be used to restore current flags to the default flags.
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d6fbe1c55c55c6937cbea3531af7da84ab7473c3 upstream.
Add a definition for cpu_show_spectre_v2() to override the generic
version. This has several permuations, though in practice some may not
occur we cater for any combination.
The most verbose is:
Mitigation: Indirect branch serialisation (kernel only), Indirect
branch cache disabled, ori31 speculation barrier enabled
We don't treat the ori31 speculation barrier as a mitigation on its
own, because it has to be *used* by code in order to be a mitigation
and we don't know if userspace is doing that. So if that's all we see
we say:
Vulnerable, ori31 speculation barrier enabled
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 56986016cb8cd9050e601831fe89f332b4e3c46e upstream.
Add a definition for cpu_show_spectre_v1() to override the generic
version. Currently this just prints "Not affected" or "Vulnerable"
based on the firmware flag.
Although the kernel does have array_index_nospec() in a few places, we
haven't yet audited all the powerpc code to see where it's necessary,
so for now we don't list that as a mitigation.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ff348355e9c72493947be337bb4fae4fc1a41eba upstream.
Now that we have the security feature flags we can make the
information displayed in the "meltdown" file more informative.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8ad33041563a10b34988800c682ada14b2612533 upstream.
This landed in setup_64.c for no good reason other than we had nowhere
else to put it. Now that we have a security-related file, that is a
better place for it so move it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9a868f634349e62922c226834aa23e3d1329ae7f upstream.
This commit adds security feature flags to reflect the settings we
receive from firmware regarding Spectre/Meltdown mitigations.
The feature names reflect the names we are given by firmware on bare
metal machines. See the hostboot source for details.
Arguably these could be firmware features, but that then requires them
to be read early in boot so they're available prior to asm feature
patching, but we don't actually want to use them for patching. We may
also want to dynamically update them in future, which would be
incompatible with the way firmware features work (at the moment at
least). So for now just make them separate flags.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0063d61ccfc011f379a31acaeba6de7c926fed2c upstream.
Currently the rfi-flush messages print 'Using <type> flush' for all
enabled_flush_types, but that is not necessarily true -- as now the
fallback flush is always enabled on pseries, but the fixup function
overwrites its nop/branch slot with other flush types, if available.
So, replace the 'Using <type> flush' messages with '<type> flush is
available'.
Also, print the patched flush types in the fixup function, so users
can know what is (not) being used (e.g., the slower, fallback flush,
or no flush type at all if flush is disabled via the debugfs switch).
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit abf110f3e1cea40f5ea15e85f5d67c39c14568a7 upstream.
For PowerVM migration we want to be able to call setup_rfi_flush()
again after we've migrated the partition.
To support that we need to check that we're not trying to allocate the
fallback flush area after memblock has gone away (i.e., boot-time only).
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1e2a9fc7496955faacbbed49461d611b704a7505 upstream.
rfi_flush_enable() includes a check to see if we're already
enabled (or disabled), and in that case does nothing.
But that means calling setup_rfi_flush() a 2nd time doesn't actually
work, which is a bit confusing.
Move that check into the debugfs code, where it really belongs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bdcb1aefc5b3f7d0f1dc8b02673602bca2ff7a4b upstream.
The fallback RFI flush is used when firmware does not provide a way
to flush the cache. It's a "displacement flush" that evicts useful
data by displacing it with an uninteresting buffer.
The flush has to take care to work with implementation specific cache
replacment policies, so the recipe has been in flux. The initial
slow but conservative approach is to touch all lines of a congruence
class, with dependencies between each load. It has since been
determined that a linear pattern of loads without dependencies is
sufficient, and is significantly faster.
Measuring the speed of a null syscall with RFI fallback flush enabled
gives the relative improvement:
P8 - 1.83x
P9 - 1.75x
The flush also becomes simpler and more adaptable to different cache
geometries.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit faf37c44a105f3608115785f17cbbf3500f8bc71 upstream.
Clear the PCR (Processor Compatibility Register) on boot to ensure we
are not running in a compatibility mode.
We've seen this cause problems when a crash (and kdump) occurs while
running compat mode guests. The kdump kernel then runs with the PCR
set and causes problems. The symptom in the kdump kernel (also seen in
petitboot after fast-reboot) is early userspace programs taking
sigills on newer instructions (seen in libc).
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: stable@vger.kernel.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 349524bc0da698ec77f2057cf4a4948eb6349265 upstream.
This causes warnings from cpufreq mutex code. This is also rather
unnecessary and ineffective. If we really want to prevent concurrent
unplug, we could take the unplug read lock but I don't see this being
critical.
Fixes: cd77b5ce208c ("powerpc/powernv/cpufreq: Fix the frequency read by /proc/cpuinfo")
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f0295e047fcf52ccb42561fb7de6942f5201b676 upstream.
The current EEH callbacks can race with a driver unbind. This can
result in a backtraces like this:
EEH: Frozen PHB#0-PE#1fc detected
EEH: PE location: S000009, PHB location: N/A
CPU: 2 PID: 2312 Comm: kworker/u258:3 Not tainted 4.15.6-openpower1 #2
Workqueue: nvme-wq nvme_reset_work [nvme]
Call Trace:
dump_stack+0x9c/0xd0 (unreliable)
eeh_dev_check_failure+0x420/0x470
eeh_check_failure+0xa0/0xa4
nvme_reset_work+0x138/0x1414 [nvme]
process_one_work+0x1ec/0x328
worker_thread+0x2e4/0x3a8
kthread+0x14c/0x154
ret_from_kernel_thread+0x5c/0xc8
nvme nvme1: Removing after probe failure status: -19
<snip>
cpu 0x23: Vector: 300 (Data Access) at [c000000ff50f3800]
pc: c0080000089a0eb0: nvme_error_detected+0x4c/0x90 [nvme]
lr: c000000000026564: eeh_report_error+0xe0/0x110
sp: c000000ff50f3a80
msr: 9000000000009033
dar: 400
dsisr: 40000000
current = 0xc000000ff507c000
paca = 0xc00000000fdc9d80 softe: 0 irq_happened: 0x01
pid = 782, comm = eehd
Linux version 4.15.6-openpower1 (smc@smc-desktop) (gcc version 6.4.0 (Buildroot 2017.11.2-00008-g4b6188e)) #2 SM P Tue Feb 27 12:33:27 PST 2018
enter ? for help
eeh_report_error+0xe0/0x110
eeh_pe_dev_traverse+0xc0/0xdc
eeh_handle_normal_event+0x184/0x4c4
eeh_handle_event+0x30/0x288
eeh_event_handler+0x124/0x170
kthread+0x14c/0x154
ret_from_kernel_thread+0x5c/0xc8
The first part is an EEH (on boot), the second half is the resulting
crash. nvme probe starts the nvme_reset_work() worker thread. This
worker thread starts touching the device which see a device error
(EEH) and hence queues up an event in the powerpc EEH worker
thread. nvme_reset_work() then continues and runs
nvme_remove_dead_ctrl_work() which results in unbinding the driver
from the device and hence releases all resources. At the same time,
the EEH worker thread starts doing the EEH .error_detected() driver
callback, which no longer works since the resources have been freed.
This fixes the problem in the same way the generic PCIe AER code (in
drivers/pci/pcie/aer/aerdrv_core.c) does. It makes the EEH code hold
the device_lock() while performing the driver EEH callbacks and
associated code. This ensures either the callbacks are no longer
register, or if they are registered the driver will not be removed
from underneath us.
This has been broken forever. The EEH call backs were first introduced
in 2005 (in 77bd7415610) but it's not clear if a lock was needed back
then.
Fixes: 77bd74156101 ("[PATCH] powerpc: PCI Error Recovery: PPC64 core recovery routines")
Cc: stable@vger.kernel.org # v2.6.16+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4552d128c26e0f0f27a5bd2fadc24092b8f6c1d7 ]
The die() oops path contains a serializing lock to prevent oops
messages from being interleaved. In the case of a system reset
initiated oops (e.g., qemu nmi command), __die was being called
which lacks that synchronisation and oops reports could be
interleaved across CPUs.
A recent patch 4388c9b3a6ee7 ("powerpc: Do not send system reset
request through the oops path") changed this to __die to avoid
the debugger() call, but there is no real harm to calling it twice
if the first time fell through. So go back to using die() here.
This was observed to fix the problem.
Fixes: 4388c9b3a6ee7 ("powerpc: Do not send system reset request through the oops path")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 13a83eac373c49c0a081cbcd137e79210fe78acd upstream.
On boot we save the configuration space of PCIe bridges. We do this so
when we get an EEH event and everything gets reset that we can restore
them.
Unfortunately we save this state before we've enabled the MMIO space
on the bridges. Hence if we have to reset the bridge when we come back
MMIO is not enabled and we end up taking an PE freeze when the driver
starts accessing again.
This patch forces the memory/MMIO and bus mastering on when restoring
bridges on EEH. Ideally we'd do this correctly by saving the
configuration space writes later, but that will have to come later in
a larger EEH rewrite. For now we have this simple fix.
The original bug can be triggered on a boston machine by doing:
echo 0x8000000000000000 > /sys/kernel/debug/powerpc/PCI0001/err_injct_outbound
On boston, this PHB has a PCIe switch on it. Without this patch,
you'll see two EEH events, 1 expected and 1 the failure we are fixing
here. The second EEH event causes the anything under the PHB to
disappear (i.e. the i40e eth).
With this patch, only 1 EEH event occurs and devices properly recover.
Fixes: 652defed4875 ("powerpc/eeh: Check PCIe link after reset")
Cc: stable@vger.kernel.org # v3.11+
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bf8a1abc3ddbd6e9a8312ea7d96e5dd89c140f18 upstream.
kexec_file_load() on powerpc doesn't support kdump kernels yet, so it
returns -ENOTSUPP in that case.
I've recently learned that this errno is internal to the kernel and
isn't supposed to be exposed to userspace. Therefore, change to
-EOPNOTSUPP which is defined in an uapi header.
This does indeed make kexec-tools happier. Before the patch, on
ppc64le:
# ~bauermann/src/kexec-tools/build/sbin/kexec -s -p /boot/vmlinuz
kexec_file_load failed: Unknown error 524
After the patch:
# ~bauermann/src/kexec-tools/build/sbin/kexec -s -p /boot/vmlinuz
kexec_file_load failed: Operation not supported
Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()")
Cc: stable@vger.kernel.org # v4.10+
Reported-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Reviewed-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e6e133c47e6bd4d5dac05b35d06634a8e5648615 upstream.
Michael Ellerman reported the following call trace when running
ftracetest:
BUG: using __this_cpu_write() in preemptible [00000000] code: ftracetest/6178
caller is opt_pre_handler+0xc4/0x110
CPU: 1 PID: 6178 Comm: ftracetest Not tainted 4.15.0-rc7-gcc6x-gb2cd1df #1
Call Trace:
[c0000000f9ec39c0] [c000000000ac4304] dump_stack+0xb4/0x100 (unreliable)
[c0000000f9ec3a00] [c00000000061159c] check_preemption_disabled+0x15c/0x170
[c0000000f9ec3a90] [c000000000217e84] opt_pre_handler+0xc4/0x110
[c0000000f9ec3af0] [c00000000004cf68] optimized_callback+0x148/0x170
[c0000000f9ec3b40] [c00000000004d954] optinsn_slot+0xec/0x10000
[c0000000f9ec3e30] [c00000000004bae0] kretprobe_trampoline+0x0/0x10
This is showing up since OPTPROBES is now enabled with CONFIG_PREEMPT.
trampoline_probe_handler() considers itself to be a special kprobe
handler for kretprobes. In doing so, it expects to be called from
kprobe_handler() on a trap, and re-enables preemption before returning a
non-zero return value so as to suppress any subsequent processing of the
trap by the kprobe_handler().
However, with optprobes, we don't deal with special handlers (we ignore
the return code) and just try to re-enable preemption causing the above
trace.
To address this, modify trampoline_probe_handler() to not be special.
The only additional processing done in kprobe_handler() is to emulate
the instruction (in this case, a 'nop'). We adjust the value of
regs->nip for the purpose and delegate the job of re-enabling
preemption and resetting current kprobe to the probe handlers
(kprobe_handler() or optimized_callback()).
Fixes: 8a2d71a3f273 ("powerpc/kprobes: Disable preemption before invoking probe handler for optprobes")
Cc: stable@vger.kernel.org # v4.15+
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a57ac411832384eb93df4bfed2bf644c4089720e upstream.
Presently the dt_cpu_ftrs restore_cpu will only add bits to the LPCR
for secondaries, but some bits must be removed (e.g., UPRT for HPT).
Not clearing these bits on secondaries causes checkstops when booting
with disable_radix.
restore_cpu can not just set LPCR, because it is also called by the
idle wakeup code which relies on opal_slw_set_reg to restore the value
of LPCR, at least on P8 which does not save LPCR to stack in the idle
code.
Fix this by including a mask of bits to clear from LPCR as well, which
is used by restore_cpu.
This is a little messy now, but it's a minimal fix that can be
backported. Longer term, the idle SPR save/restore code can be
reworked to completely avoid calls to restore_cpu, then restore_cpu
would be able to unconditionally set LPCR to match boot processor
environment.
Fixes: 5a61ef74f269f ("powerpc/64s: Support new device tree binding for discovering CPU features")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 52396500f97c53860164debc7d4f759077853423 upstream.
The SLB bad address handler's trap number fixup does not preserve the
low bit that indicates nonvolatile GPRs have not been saved. This
leads save_nvgprs to skip saving them, and subsequent functions and
return from interrupt will think they are saved.
This causes kernel branch-to-garbage debugging to not have correct
registers, can also cause userspace to have its registers clobbered
after a segfault.
Fixes: f0f558b131db ("powerpc/mm: Preserve CFAR value on SLB miss caused by access to bogus address")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
irq_happened
commit ff6781fd1bb404d8a551c02c35c70cec1da17ff1 upstream.
force_external_irq_replay() can be called in the do_IRQ path with
interrupts hard enabled and soft disabled if may_hard_irq_enable() set
MSR[EE]=1. It updates local_paca->irq_happened with a load, modify,
store sequence. If a maskable interrupt hits during this sequence, it
will go to the masked handler to be marked pending in irq_happened.
This update will be lost when the interrupt returns and the store
instruction executes. This can result in unpredictable latencies,
timeouts, lockups, etc.
Fix this by ensuring hard interrupts are disabled before modifying
irq_happened.
This could cause any maskable asynchronous interrupt to get lost, but
it was noticed on P9 SMP system doing RDMA NVMe target over 100GbE,
so very high external interrupt rate and high IPI rate. The hang was
bisected down to enabling doorbell interrupts for IPIs. These provided
an interrupt type that could run at high rates in the do_IRQ path,
stressing the race.
Fixes: 1d607bb3bd60 ("powerpc/irq: Add mechanism to force a replay of interrupts")
Cc: stable@vger.kernel.org # v4.8+
Reported-by: Carol L. Soto <clsoto@us.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit acb1feab320e38588fccc568e3767761f494976f ]
When an interrupt is returning to a soft-disabled context (which can
happen for non-maskable interrupts or synchronous interrupts), it goes
through the motions of soft-disabling again, including calling
TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).
This is not necessary, because we must already be soft-disabled in the
interrupt context, it also may be causing crashes in the irq tracing
code to re-enter as an nmi. Replace it with a warning to ensure that
soft-interrupts are still disabled.
Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state getting out of sync")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b9eab08d012fa093947b230f9a87257c27fb829b ]
When attempting to load a livepatch module, I got the following error:
module_64: patch_module: Expect noop after relocate, got 3c820000
The error was triggered by the following code in
unregister_netdevice_queue():
14c: 00 00 00 48 b 14c <unregister_netdevice_queue+0x14c>
14c: R_PPC64_REL24 net_set_todo
150: 00 00 82 3c addis r4,r2,0
GCC didn't insert a nop after the branch to net_set_todo() because it's
a sibling call, so it never returns. The nop isn't needed after the
branch in that case.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-and-tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f23ab3efb1b30cc5c5ef5ae4ef294ed467f30675 upstream.
Commit 398a719d34a1 ("powerpc/mm: Update bits used to skip hash_page")
mistakenly dropped the DSISR_DABRMATCH bit from the mask of bit tested
to skip trying to hash a page.
As a result, the DABR matches would no longer be detected.
This adds it back. We open code it in the 2 places where it matters
rather than fold it into DSISR_BAD_FAULT_32S/64S because this isn't
technically a bad fault and while we would never hit it with the
current code, I prefer if page_fault_is_bad() didn't trigger on these.
Fixes: 398a719d34a1 ("powerpc/mm: Update bits used to skip hash_page")
Cc: stable@vger.kernel.org # v4.14
Tested-by: Pedro Miraglia Franco de Carvalho <pedromfc@br.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 236003e6b5443c45c18e613d2b0d776a9f87540e upstream.
Expose the state of the RFI flush (enabled/disabled) via debugfs, and
allow it to be enabled/disabled at runtime.
eg: $ cat /sys/kernel/debug/powerpc/rfi_flush
1
$ echo 0 > /sys/kernel/debug/powerpc/rfi_flush
$ cat /sys/kernel/debug/powerpc/rfi_flush
0
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|