summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-11-29powerpc: select CPUMASK_OFFSTACK if NR_CPUS >= 8192Nicholas Piggin1-0/+1
Some core kernel code starts to go beyond the 2048 byte stack size warning at NR_CPUS=8192, so select CPUMASK_OFFSTACK in that case. x86 does similarly for very large NR_CPUS. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105035042.1398309-2-npiggin@gmail.com
2021-11-29powerpc: remove cpu_online_cores_map functionNicholas Piggin3-41/+8
This function builds the cores online map with on-stack cpumasks which can cause high stack usage with large NR_CPUS. It is not used in any performance sensitive paths, so instead just check for first thread sibling. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105035042.1398309-1-npiggin@gmail.com
2021-11-29Revert "powerpc/code-patching: Improve verification of patchability"Michael Ellerman3-2/+6
This reverts commit 8b8a8f0ab3f5519e45c526f826a655817486c5bb. As reported[1] by Sachin this causes problems with ftrace, and it also causes the code patching selftests to fail as reported[2] by Stephen. So revert it for now. 1: https://lore.kernel.org/linuxppc-dev/3668743C-09DF-4673-B15C-2FFE2A57F7D7@linux.vnet.ibm.com/ 2: https://lore.kernel.org/linuxppc-dev/20211126161747.1f7795b0@canb.auug.org.au/ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2021-11-25powerpc/watchdog: Fix wd_smp_last_reset_tb reportingNicholas Piggin1-4/+4
wd_smp_last_reset_tb now gets reset by watchdog_smp_panic() as part of marking CPUs stuck and removing them from the pending mask before it begins any printing. This causes last reset times reported to be off. Fix this by reading it into a local variable before it gets reset. Fixes: 76521c4b0291 ("powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211125103346.1188958-1-npiggin@gmail.com
2021-11-25powerpc/microwatt: Make microwatt_get_random_darn() staticMichael Ellerman1-1/+1
Make microwatt_get_random_darn() static, because it can be. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211118004415.1706863-1-mpe@ellerman.id.au
2021-11-25powerpc/watchdog: read TB close to where it is usedNicholas Piggin1-12/+14
When taking watchdog actions, printing messages, comparing and re-setting wd_smp_last_reset_tb, etc., read TB close to the point of use and under wd_smp_lock or printing lock (if applicable). This should keep timebase mostly monotonic with kernel log messages, and could prevent (in theory) a laggy CPU updating wd_smp_last_reset_tb to something a long way in the past, and causing other CPUs to appear to be stuck. These additional TB reads are all slowpath (lockup has been detected), so performance does not matter. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211110025056.2084347-5-npiggin@gmail.com
2021-11-25powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipiNicholas Piggin1-19/+74
There is a deadlock with the console_owner lock and the wd_smp_lock: CPU x takes the console_owner lock CPU y takes a watchdog timer interrupt and takes __wd_smp_lock CPU x takes a soft-NMI interrupt, detects deadlock, spins on __wd_smp_lock CPU y detects deadlock, tries to print something and spins on console_owner -> deadlock Change the watchdog locking scheme so wd_smp_lock protects the watchdog internal data, but "reporting" (printing, issuing NMI IPIs, taking any action outside of watchdog) uses a non-waiting exclusion. If a CPU detects a problem but can not take the reporting lock, it just returns because something else is already reporting. It will try again at some point. Typically hard lockup watchdog report usefulness is not impacted due to failure to spewing a large enough amount of data in as short a time as possible, but by messages getting garbled. Laurent debugged this and found the deadlock, and this patch is based on his general approach to avoid expensive operations while holding the lock. With the addition of the reporting exclusion. Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> [np: rework to add reporting exclusion update changelog] Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211110025056.2084347-4-npiggin@gmail.com
2021-11-25powerpc/watchdog: tighten non-atomic read-modify-write accessNicholas Piggin1-10/+26
Most updates to wd_smp_cpus_pending are under lock except the watchdog interrupt bit clear. This can race with non-atomic RMW updates to the mask under lock, which can happen in two instances: Firstly, if another CPU detects this one is stuck, removes it from the mask, mask becomes empty and is re-filled with non-atomic stores. This is okay because it would re-fill the mask with this CPU's bit clear anyway (because this CPU is now stuck), so it doesn't matter that the bit clear update got "lost". Add a comment for this. Secondly, if another CPU detects a different CPU is stuck and removes it from the pending mask with a non-atomic store to bytes which also include the bit of this CPU. This case can result in the bit clear being lost and the end result being the bit is set. This should be so rare it hardly matters, but to make things simpler to reason about just avoid the non-atomic access for that case. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211110025056.2084347-3-npiggin@gmail.com
2021-11-25powerpc/watchdog: Fix missed watchdog reset due to memory ordering raceNicholas Piggin1-1/+40
It is possible for all CPUs to miss the pending cpumask becoming clear, and then nobody resetting it, which will cause the lockup detector to stop working. It will eventually expire, but watchdog_smp_panic will avoid doing anything if the pending mask is clear and it will never be reset. Order the cpumask clear vs the subsequent test to close this race. Add an extra check for an empty pending mask when the watchdog fires and finds its bit still clear, to try to catch any other possible races or bugs here and keep the watchdog working. The extra test in arch_touch_nmi_watchdog is required to prevent the new warning from firing off. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Debugged-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211110025056.2084347-2-npiggin@gmail.com
2021-11-25powerpc/prom_init: Fix improper check of prom_getprop()Peiwei Hu1-1/+1
prom_getprop() can return PROM_ERROR. Binary operator can not identify it. Fixes: 94d2dde738a5 ("[POWERPC] Efika: prune fixups and make them more carefull") Signed-off-by: Peiwei Hu <jlu.hpw@foxmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/tencent_BA28CC6897B7C95A92EB8C580B5D18589105@qq.com
2021-11-25powerpc/rtas: rtas_busy_delay_time() kernel-docNathan Lynch1-2/+19
Provide API documentation for rtas_busy_delay_time(), explaining why we return the same value for 9900 and -2. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211117060259.957178-3-nathanl@linux.ibm.com
2021-11-25powerpc/rtas: rtas_busy_delay() improvementsNathan Lynch2-8/+68
Generally RTAS cannot block, and in PAPR it is required to return control to the OS within a few tens of microseconds. In order to support operations which may take longer to complete, many RTAS primitives can return intermediate -2 ("busy") or 990x ("extended delay") values, which indicate that the OS should reattempt the same call with the same arguments at some point in the future. Current versions of PAPR are less than clear about this, but the intended meanings of these values in more detail are: RTAS_BUSY (-2): RTAS has suspended a potentially long-running operation in order to meet its latency obligation and give the OS the opportunity to perform other work. RTAS can resume making progress as soon as the OS reattempts the call. RTAS_EXTENDED_DELAY_{MIN...MAX} (9900-9905): RTAS must wait for an external event to occur or for internal contention to resolve before it can complete the requested operation. The value encodes a non-binding hint as to roughly how long the OS should wait before calling again, but the OS is allowed to reattempt the call sooner or even immediately. Linux of course must take its own CPU scheduling obligations into account when handling these statuses; e.g. a task which receives an RTAS_BUSY status should check whether to reschedule before it attempts the RTAS call again to avoid starving other tasks. rtas_busy_delay() is a helper function that "consumes" a busy or extended delay status. Common usage: int rc; do { rc = rtas_call(rtas_token("some-function"), ...); } while (rtas_busy_delay(rc)); /* convert rc to Linux error value, etc */ If rc is a busy or extended delay status, the caller can rely on rtas_busy_delay() to perform an appropriate sleep or reschedule and return nonzero. Other statuses are handled normally by the caller. The current implementation of rtas_busy_delay() both oversleeps and overuses the CPU: * It performs msleep() for all 990x and even when no delay is suggested (-2), but this is understood to actually sleep for two jiffies minimum in practice (20ms with HZ=100). 9900 (1ms) and 9901 (10ms) appear to be the most common extended delay statuses, and the oversleeping measurably lengthens DLPAR operations, which perform many RTAS calls. * It does not sleep on 990x unless need_resched() is true, causing code like the loop above to needlessly retry, wasting CPU time. Alter the logic to align better with the intended meanings: * When passed RTAS_BUSY, perform cond_resched() and return without sleeping. The caller should reattempt immediately * Always sleep when passed an extended delay status, using usleep_range() for precise shorter sleeps. Limit the sleep time to one second even though there are higher architected values. Change rtas_busy_delay()'s return type to bool to better reflect its usage, and add kernel-doc. rtas_busy_delay_time() is unchanged, even though it "incorrectly" returns 1 for RTAS_BUSY. There are users of that API with open-coded delay loops in sensitive contexts that will have to be taken on an individual basis. Brief results for addition and removal of 5GB memory on a small P9 PowerVM partition follow. Load was generated with stress-ng --cpu N. For add, elapsed time is greatly reduced without significant change in the number of RTAS calls or time spent on CPU. For remove, elapsed time is modestly reduced, with significant reductions in RTAS calls and time spent on CPU. With no competing workload (- before, + after): Performance counter stats for 'bash -c echo "memory add count 20" > /sys/kernel/dlpar' (10 runs): - 1,935 probe:rtas_call # 0.003 M/sec ( +- 0.22% ) - 609.99 msec task-clock # 0.183 CPUs utilized ( +- 0.19% ) + 1,956 probe:rtas_call # 0.003 M/sec ( +- 0.17% ) + 618.56 msec task-clock # 0.278 CPUs utilized ( +- 0.11% ) - 3.3322 +- 0.0670 seconds time elapsed ( +- 2.01% ) + 2.2222 +- 0.0416 seconds time elapsed ( +- 1.87% ) Performance counter stats for 'bash -c echo "memory remove count 20" > /sys/kernel/dlpar' (10 runs): - 6,224 probe:rtas_call # 0.008 M/sec ( +- 2.57% ) - 750.36 msec task-clock # 0.190 CPUs utilized ( +- 2.01% ) + 843 probe:rtas_call # 0.003 M/sec ( +- 0.12% ) + 250.66 msec task-clock # 0.068 CPUs utilized ( +- 0.17% ) - 3.9394 +- 0.0890 seconds time elapsed ( +- 2.26% ) + 3.678 +- 0.113 seconds time elapsed ( +- 3.07% ) With all CPUs 100% busy (- before, + after): Performance counter stats for 'bash -c echo "memory add count 20" > /sys/kernel/dlpar' (10 runs): - 2,979 probe:rtas_call # 0.003 M/sec ( +- 0.12% ) - 1,096.62 msec task-clock # 0.105 CPUs utilized ( +- 0.10% ) + 2,981 probe:rtas_call # 0.003 M/sec ( +- 0.22% ) + 1,095.26 msec task-clock # 0.154 CPUs utilized ( +- 0.21% ) - 10.476 +- 0.104 seconds time elapsed ( +- 1.00% ) + 7.1124 +- 0.0865 seconds time elapsed ( +- 1.22% ) Performance counter stats for 'bash -c echo "memory remove count 20" > /sys/kernel/dlpar' (10 runs): - 2,702 probe:rtas_call # 0.004 M/sec ( +- 4.00% ) - 722.71 msec task-clock # 0.067 CPUs utilized ( +- 2.41% ) + 1,246 probe:rtas_call # 0.003 M/sec ( +- 0.25% ) + 487.73 msec task-clock # 0.049 CPUs utilized ( +- 0.20% ) - 10.829 +- 0.163 seconds time elapsed ( +- 1.51% ) + 9.9887 +- 0.0866 seconds time elapsed ( +- 0.87% ) Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211117060259.957178-2-nathanl@linux.ibm.com
2021-11-25powerpc/pseries: delete scanlogNathan Lynch5-202/+0
Remove the pseries scanlog driver. This code supports functions from Power4-era servers that are not present on targets currently supported by arch/powerpc. System manuals from this time have this description: Scan Dump data is a set of chip data that the service processor gathers after a system malfunction. It consists of chip scan rings, chip trace arrays, and Scan COM (SCOM) registers. This data is stored in the scan-log partition of the system’s Nonvolatile Random Access Memory (NVRAM). PowerVM partition firmware development doesn't recognize the associated function call or property, and they don't see any references to them in their codebase. It seems to have been specific to non-virtualized pseries. References: Historical Linux commit from February 2003 (interesting to note this seems to be the source of non-GPL exports for rtas_call etc): https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=f92e361842d5251e50562b09664082dcbd0548bb IntelliStation and pSeries docs which refer to the feature: http://ps-2.retropc.se/basil.holloway/ALL%20PDF/380635.pdf http://ps-2.kev009.com/rs6000/manuals/p/p615-6C3-6E3/6C3_and_6E3_Users_Guide_SA38-0629.pdf Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210920173203.1800475-1-nathanl@linux.ibm.com
2021-11-25powerpc/rtas: kernel-doc fixesNathan Lynch1-4/+5
Fix the following issues reported by kernel-doc: $ scripts/kernel-doc -v -none arch/powerpc/kernel/rtas.c arch/powerpc/kernel/rtas.c:810: info: Scanning doc for function rtas_activate_firmware arch/powerpc/kernel/rtas.c:818: warning: contents before sections arch/powerpc/kernel/rtas.c:841: info: Scanning doc for function rtas_call_reentrant arch/powerpc/kernel/rtas.c:893: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Find a specific pseries error log in an RTAS extended event log. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211116215806.928235-1-nathanl@linux.ibm.com
2021-11-25powerpc/code-patching: Improve verification of patchabilityChristophe Leroy3-6/+2
Today, patch_instruction() assumes that it is called exclusively on valid addresses, and only checks that it is not called on an init address after init section has been freed. Improve verification by calling kernel_text_address() instead. kernel_text_address() already includes a verification of initmem release. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/bc683d499a411730504b132a924de0ccc2ef1f79.1636971137.git.christophe.leroy@csgroup.eu
2021-11-25powerpc/tsi108: make EXPORT_SYMBOL follow its function immediatelyJason Wang1-2/+1
EXPORT_SYMBOL(foo); should immediately follow its function/variable. Signed-off-by: Jason Wang <wangborong@cdjrlc.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211114115616.493815-1-wangborong@cdjrlc.com
2021-11-25bpf ppc32: Access only if addr is kernel addressHari Bathini1-0/+34
With KUAP enabled, any kernel code which wants to access userspace needs to be surrounded by disable-enable KUAP. But that is not happening for BPF_PROBE_MEM load instruction. Though PPC32 does not support read protection, considering the fact that PTR_TO_BTF_ID (which uses BPF_PROBE_MEM mode) could either be a valid kernel pointer or NULL but should never be a pointer to userspace address, execute BPF_PROBE_MEM load only if addr is kernel address, otherwise set dst_reg=0 and move on. This will catch NULL, valid or invalid userspace pointers. Only bad kernel pointer will be handled by BPF exception table. [Alexei suggested for x86] Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211012123056.485795-9-hbathini@linux.ibm.com
2021-11-25bpf ppc32: Add BPF_PROBE_MEM support for JITHari Bathini3-0/+36
BPF load instruction with BPF_PROBE_MEM mode can cause a fault inside kernel. Append exception table for such instructions within BPF program. Unlike other archs which uses extable 'fixup' field to pass dest_reg and nip, BPF exception table on PowerPC follows the generic PowerPC exception table design, where it populates both fixup and extable sections within BPF program. fixup section contains 3 instructions, first 2 instructions clear dest_reg (lower & higher 32-bit registers) and last instruction jumps to next instruction in the BPF code. extable 'insn' field contains relative offset of the instruction and 'fixup' field contains relative offset of the fixup entry. Example layout of BPF program with extable present: +------------------+ | | | | 0x4020 -->| lwz r28,4(r4) | | | | | 0x40ac -->| lwz r3,0(r24) | | lwz r4,4(r24) | | | | | |------------------| 0x4278 -->| li r28,0 | \ | li r27,0 | | fixup entry | b 0x4024 | / 0x4284 -->| li r4,0 | | li r3,0 | | b 0x40b4 | |------------------| 0x4290 -->| insn=0xfffffd90 | \ extable entry | fixup=0xffffffe4 | / 0x4298 -->| insn=0xfffffe14 | | fixup=0xffffffe8 | +------------------+ (Addresses shown here are chosen random, not real) Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211012123056.485795-8-hbathini@linux.ibm.com
2021-11-25bpf ppc64: Access only if addr is kernel addressRavi Bangoria1-0/+26
On PPC64 with KUAP enabled, any kernel code which wants to access userspace needs to be surrounded by disable-enable KUAP. But that is not happening for BPF_PROBE_MEM load instruction. So, when BPF program tries to access invalid userspace address, page-fault handler considers it as bad KUAP fault: Kernel attempted to read user page (d0000000) - exploit attempt? (uid: 0) Considering the fact that PTR_TO_BTF_ID (which uses BPF_PROBE_MEM mode) could either be a valid kernel pointer or NULL but should never be a pointer to userspace address, execute BPF_PROBE_MEM load only if addr is kernel address, otherwise set dst_reg=0 and move on. This will catch NULL, valid or invalid userspace pointers. Only bad kernel pointer will be handled by BPF exception table. [Alexei suggested for x86] Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211012123056.485795-7-hbathini@linux.ibm.com
2021-11-25bpf ppc64: Add BPF_PROBE_MEM support for JITRavi Bangoria4-9/+80
BPF load instruction with BPF_PROBE_MEM mode can cause a fault inside kernel. Append exception table for such instructions within BPF program. Unlike other archs which uses extable 'fixup' field to pass dest_reg and nip, BPF exception table on PowerPC follows the generic PowerPC exception table design, where it populates both fixup and extable sections within BPF program. fixup section contains two instructions, first instruction clears dest_reg and 2nd jumps to next instruction in the BPF code. extable 'insn' field contains relative offset of the instruction and 'fixup' field contains relative offset of the fixup entry. Example layout of BPF program with extable present: +------------------+ | | | | 0x4020 -->| ld r27,4(r3) | | | | | 0x40ac -->| lwz r3,0(r4) | | | | | |------------------| 0x4280 -->| li r27,0 | \ fixup entry | b 0x4024 | / 0x4288 -->| li r3,0 | | b 0x40b0 | |------------------| 0x4290 -->| insn=0xfffffd90 | \ extable entry | fixup=0xffffffec | / 0x4298 -->| insn=0xfffffe14 | | fixup=0xffffffec | +------------------+ (Addresses shown here are chosen random, not real) Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211012123056.485795-6-hbathini@linux.ibm.com
2021-11-25powerpc/ppc-opcode: introduce PPC_RAW_BRANCH() macroHari Bathini2-1/+3
Define and use PPC_RAW_BRANCH() macro instead of open coding it. This macro is used while adding BPF_PROBE_MEM support. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211012123056.485795-5-hbathini@linux.ibm.com
2021-11-25bpf powerpc: refactor JIT compiler codeHari Bathini2-27/+37
Refactor powerpc LDX JITing code to simplify adding BPF_PROBE_MEM support. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211012123056.485795-4-hbathini@linux.ibm.com
2021-11-25bpf powerpc: Remove extra_pass from bpf_jit_build_body()Ravi Bangoria4-8/+8
In case of extra_pass, usual JIT passes are always skipped. So, extra_pass is always false while calling bpf_jit_build_body() and can be removed. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211012123056.485795-3-hbathini@linux.ibm.com
2021-11-25bpf powerpc: Remove unused SEEN_STACKRavi Bangoria1-2/+1
SEEN_STACK is unused on PowerPC. Remove it. Also, have SEEN_TAILCALL use 0x40000000. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211012123056.485795-2-hbathini@linux.ibm.com
2021-11-25powerpc/eeh: Use a goto for recovery failuresOliver O'Halloran1-48/+45
The EEH recovery logic in eeh_handle_normal_event() has some pretty strange flow control. If we remove all the actual recovery logic we're left with the following skeleton: if (result != PCI_ERS_RESULT_DISCONNECT) { ... } if (result != PCI_ERS_RESULT_DISCONNECT) { ... } if (result == PCI_ERS_RESULT_NONE) { ... } if (result == PCI_ERS_RESULT_CAN_RECOVER) { ... } if (result == PCI_ERS_RESULT_CAN_RECOVER) { ... } if (result == PCI_ERS_RESULT_NEED_RESET) { ... } if ((result == PCI_ERS_RESULT_RECOVERED) || (result == PCI_ERS_RESULT_NONE)) { ... goto out; } /* * unsuccessful recovery / PCI_ERS_RESULT_DISCONECTED * handling is here. */ ... out: ... Most of the "if () { ... }" blocks above change "result" to PCI_ERS_RESULT_DISCONNECTED if an error occurs in that recovery step. This makes the control flow a bit confusing since it breaks the early-exit pattern that is generally used in Linux. In any case we end up handling the error in the final else block so why not just jump there directly? Doing so also allows us to de-indent a bunch of code. No functional changes. [dja: rebase on top of linux-next + my preceeding refactor, move clearing the EEH_DEV_NO_HANDLER bit above the first goto so that it is always clear in the error handler code as it was before.] Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211015070628.1331635-2-dja@axtens.net
2021-11-25powerpc/eeh: Small refactor of eeh_handle_normal_event()Daniel Axtens1-34/+35
The control flow of eeh_handle_normal_event() is a bit tricky. Break out one of the error handling paths - rather than be in an else block, we'll make it part of the regular body of the function and put a 'goto out;' in the true limb of the if. Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211015070628.1331635-1-dja@axtens.net
2021-11-25powerpc/xive: Add a debugfs toggle for save-restoreCédric Le Goater3-1/+3
On POWER10, the automatic "save & restore" of interrupt context is always available. Provide a way to deactivate it for tests or performance. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105102636.1016378-11-clg@kaod.org
2021-11-25powerpc/xive: Add a kernel parameter for StoreEOICédric Le Goater2-0/+19
StoreEOI is activated by default on platforms supporting the feature (POWER10) and will be used as soon as firmware advertises its availability. The kernel parameter provides a way to deactivate its use. It can be still be reactivated through debugfs. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105102636.1016378-10-clg@kaod.org
2021-11-25powerpc/xive: Add a debugfs toggle for StoreEOICédric Le Goater1-3/+14
It can be used to deactivate temporarily StoreEOI for tests or performance on platforms supporting the feature (POWER10) Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105102636.1016378-9-clg@kaod.org
2021-11-25powerpc/xive: Add a debugfs file to dump EQsCédric Le Goater1-0/+37
The XIVE driver under Linux uses a single interrupt priority and only one event queue is configured per CPU. Expose the contents under a 'xive/eqs/cpuX' debugfs file. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105102636.1016378-8-clg@kaod.org
2021-11-25powerpc/xive: Rename the 'cpus' debugfs file to 'ipis'Cédric Le Goater1-20/+7
and remove the EQ entries output which is not very useful since only the next two events of the queue are taken into account. We will improve the dump of the EQ in the next patches. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105102636.1016378-7-clg@kaod.org
2021-11-25powerpc/xive: Change the debugfs file 'xive' into a directoryCédric Le Goater1-11/+25
Use a 'cpus' file to dump CPU states and 'interrupts' to dump IRQ states. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105102636.1016378-6-clg@kaod.org
2021-11-25powerpc/xive: Introduce xive_core_debugfs_create()Cédric Le Goater1-3/+15
and fix some compile issues when !CONFIG_DEBUG_FS. Signed-off-by: Cédric Le Goater <clg@kaod.org> [mpe: Add empty stub to fix !CONFIG_DEBUG_FS build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105102636.1016378-5-clg@kaod.org
2021-11-25powerpc/xive: Activate StoreEOI on P10Cédric Le Goater2-0/+3
StoreEOI (the capability to EOI with a store) requires load-after-store ordering in some cases to be reliable. P10 introduced a new offset for load operations to enforce correct ordering and the XIVE driver has the required support since kernel 5.8, commit b1f9be9392f0 ("powerpc/xive: Enforce load-after-store ordering when StoreEOI is active") Since skiboot v7, StoreEOI support is advertised on P10 with a new flag on the PowerNV platform. See skiboot commit 4bd7d84afe46 ("xive/p10: Introduce a new OPAL_XIVE_IRQ_STORE_EOI2 flag"). When detected, activate the feature. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105102636.1016378-4-clg@kaod.org
2021-11-25powerpc/xive: Introduce an helper to print out interrupt characteristicsCédric Le Goater1-27/+27
and extend output of debugfs and xmon with addresses of the ESB management and trigger pages. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105102636.1016378-3-clg@kaod.org
2021-11-25powerpc/xive: Replace pr_devel() by pr_debug() to ease debugCédric Le Goater2-33/+34
These routines are not on hot code paths and pr_debug() is easier to activate. Also add a '0x' prefix to hex printed values (HW IRQ number). Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211105102636.1016378-2-clg@kaod.org
2021-11-25powerpc/powernv: Remove POWER9 PVR version check for entry and uaccess flushesNicholas Piggin1-3/+7
These aren't necessarily POWER9 only, and it's not to say some new vulnerability may not get discovered on other processors for which we would like the flexibility of having the workaround enabled by firmware. Remove the restriction that the workarounds only apply to POWER9. However POWER7 and POWER8 are not affected, and they may not have older firmware that does not advertise this, so clear these workarounds manually. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Joel Stanley <joel@jms.id.au> [mpe: Incorporate changes from Nick, reword comment slightly.] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210503130243.891868-5-npiggin@gmail.com
2021-11-25powerpc/btext: add missing of_node_putJulia Lawall1-1/+3
for_each_node_by_type performs an of_node_get on each iteration, so a break out of the loop requires an of_node_put. A simplified version of the semantic patch that fixes this problem is as follows (http://coccinelle.lip6.fr): // <smpl> @@ local idexpression n; expression e; @@ for_each_node_by_type(n,...) { ... ( of_node_put(n); | e = n | + of_node_put(n); ? break; ) ... } ... when != n // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1448051604-25256-6-git-send-email-Julia.Lawall@lip6.fr
2021-11-25powerpc/cell: add missing of_node_putJulia Lawall1-0/+1
for_each_node_by_name performs an of_node_get on each iteration, so a break out of the loop requires an of_node_put. A simplified version of the semantic patch that fixes this problem is as follows (http://coccinelle.lip6.fr): // <smpl> @@ expression e,e1; local idexpression n; @@ for_each_node_by_name(n, e1) { ... when != of_node_put(n) when != e = n ( return n; | + of_node_put(n); ? return ...; ) ... } // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1448051604-25256-7-git-send-email-Julia.Lawall@lip6.fr
2021-11-25powerpc/powernv: add missing of_node_putJulia Lawall1-0/+1
for_each_compatible_node performs an of_node_get on each iteration, so a break out of the loop requires an of_node_put. A simplified version of the semantic patch that fixes this problem is as follows (http://coccinelle.lip6.fr): // <smpl> @@ local idexpression n; expression e; @@ for_each_compatible_node(n,...) { ... ( of_node_put(n); | e = n | + of_node_put(n); ? break; ) ... } ... when != n // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1448051604-25256-4-git-send-email-Julia.Lawall@lip6.fr
2021-11-25powerpc/6xx: add missing of_node_putJulia Lawall1-0/+1
for_each_compatible_node performs an of_node_get on each iteration, so a break out of the loop requires an of_node_put. A simplified version of the semantic patch that fixes this problem is as follows (http://coccinelle.lip6.fr): // <smpl> @@ expression e; local idexpression n; @@ @@ local idexpression n; expression e; @@ for_each_compatible_node(n,...) { ... ( of_node_put(n); | e = n | + of_node_put(n); ? break; ) ... } ... when != n // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1448051604-25256-2-git-send-email-Julia.Lawall@lip6.fr
2021-11-25Merge branch 'topic/ppc-kvm' into nextMichael Ellerman30-714/+1555
This merge's Nick's big P9 KVM series, original cover letter follows: KVM: PPC: Book3S HV P9: entry/exit optimisations This reduces radix guest full entry/exit latency on POWER9 and POWER10 by 2x. Nested HV guests should see smaller improvements in their L1 entry/exit, but this is also combined with most L0 speedups also applying to nested entry. nginx localhost throughput test in a SMP nested guest is improved about 10% (in a direct guest it doesn't change much because it uses XIVE for IPIs) when L0 and L1 are patched. It does this in several main ways: - Rearrange code to optimise SPR accesses. Mainly, avoid scoreboard stalls. - Test SPR values to avoid mtSPRs where possible. mtSPRs are expensive. - Reduce mftb. mftb is expensive. - Demand fault certain facilities to avoid saving and/or restoring them (at the cost of fault when they are used, but this is mitigated over a number of entries, like the facilities when context switching processes). PM, TM, and EBB so far. - Defer some sequences that are made just in case a guest is interrupted in the middle of a critical section to the case where the guest is scheduled on a different CPU, rather than every time (at the cost of an extra IPI in this case). Namely the tlbsync sequence for radix with GTSE, which is very expensive. - Reduce locking, barriers, atomics related to the vcpus-per-vcore > 1 handling that the P9 path does not require.
2021-11-24KVM: PPC: Book3S HV P9: Remove subcore HMI handlingNicholas Piggin5-9/+67
On POWER9 and newer, rather than the complex HMI synchronisation and subcore state, have each thread un-apply the guest TB offset before calling into the early HMI handler. This allows the subcore state to be avoided, including subcore enter / exit guest, which includes an expensive divide that shows up slightly in profiles. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-54-npiggin@gmail.com
2021-11-24KVM: PPC: Book3S HV P9: Stop using vc->dpdesNicholas Piggin3-12/+22
The P9 path uses vc->dpdes only for msgsndp / SMT emulation. This adds an ordering requirement between vcpu->doorbell_request and vc->dpdes for no real benefit. Use vcpu->doorbell_request directly. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-53-npiggin@gmail.com
2021-11-24KVM: PPC: Book3S HV P9: Tidy kvmppc_create_dtl_entryNicholas Piggin1-25/+35
This goes further to removing vcores from the P9 path. Also avoid the memset in favour of explicitly initialising all fields. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-52-npiggin@gmail.com
2021-11-24KVM: PPC: Book3S HV P9: Remove most of the vcore logicNicholas Piggin1-62/+85
The P9 path always uses one vcpu per vcore, so none of the vcore, locks, stolen time, blocking logic, shared waitq, etc., is required. Remove most of it. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-51-npiggin@gmail.com
2021-11-24KVM: PPC: Book3S HV P9: Avoid cpu_in_guest atomics on entry and exitNicholas Piggin3-19/+22
cpu_in_guest is set to determine if a CPU needs to be IPI'ed to exit the guest and notice the need_tlb_flush bit. This can be implemented as a global per-CPU pointer to the currently running guest instead of per-guest cpumasks, saving 2 atomics per entry/exit. P7/8 doesn't require cpu_in_guest, nor does a nested HV (only the L0 does), so move it to the P9 HV path. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-50-npiggin@gmail.com
2021-11-24KVM: PPC: Book3S HV P9: Add unlikely annotation for !mmu_readyNicholas Piggin1-1/+1
The mmu will almost always be ready. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-49-npiggin@gmail.com
2021-11-24KVM: PPC: Book3S HV P9: Avoid changing MSR[RI] in entry and exitNicholas Piggin1-27/+23
kvm_hstate.in_guest provides the equivalent of MSR[RI]=0 protection, and it covers the existing MSR[RI]=0 section in late entry and early exit, so clearing and setting MSR[RI] in those cases does not actually do anything useful. Remove the RI manipulation and replace it with comments. Make the in_guest memory accesses a bit closer to a proper critical section pattern. This speeds up guest entry/exit performance. This also removes the MSR[RI] warnings which aren't very interesting and would cause crashes if they hit due to causing an interrupt in non-recoverable code. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-48-npiggin@gmail.com
2021-11-24KVM: PPC: Book3S HV P9: Optimise hash guest SLB savingNicholas Piggin1-4/+18
slbmfee/slbmfev instructions are very expensive, moreso than a regular mfspr instruction, so minimising them significantly improves hash guest exit performance. The slbmfev is only required if slbmfee found a valid SLB entry. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-47-npiggin@gmail.com