summaryrefslogtreecommitdiff
path: root/arch/powerpc
AgeCommit message (Collapse)AuthorFilesLines
2018-11-21Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP"Christophe Leroy3-47/+34
commit cc4ebf5c0a3440ed0a32d25c55ebdb6ce5f3c0bc upstream. This reverts commit 4f94b2c7462d9720b2afa7e8e8d4c19446bb31ce. That commit was buggy, as it used rlwinm instead of rlwimi. Instead of fixing that bug, we revert the previous commit in order to reduce the dependency between L1 entries and L2 entries Fixes: 4f94b2c7462d9 ("powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP") Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21powerpc/memtrace: Remove memory in chunksRashmica Gupta1-5/+16
[ Upstream commit 3f7daf3d7582dc6628ac40a9045dd1bbd80c5f35 ] When hot-removing memory release_mem_region_adjustable() splits iomem resources if they are not the exact size of the memory being hot-deleted. Adding this memory back to the kernel adds a new resource. Eg a node has memory 0x0 - 0xfffffffff. Hot-removing 1GB from 0xf40000000 results in the single resource 0x0-0xfffffffff being split into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff. When we hot-add the memory back we now have three resources: 0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff. This is an issue if we try to remove some memory that overlaps resources. Eg when trying to remove 2GB at address 0xf40000000, release_mem_region_adjustable() fails as it expects the chunk of memory to be within the boundaries of a single resource. We then get the warning: "Unable to release resource" and attempting to use memtrace again gives us this error: "bash: echo: write error: Resource temporarily unavailable" This patch makes memtrace remove memory in chunks that are always the same size from an address that is always equal to end_of_memory - n*size, for some n. So hotremoving and hotadding memory of different sizes will now not attempt to remove memory that spans multiple resources. Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21powerpc/boot: Ensure _zimage_start is a weak symbolJoel Stanley1-1/+3
[ Upstream commit ee9d21b3b3583712029a0db65a4b7c081d08d3b3 ] When building with clang crt0's _zimage_start is not marked weak, which breaks the build when linking the kernel image: $ objdump -t arch/powerpc/boot/crt0.o |grep _zimage_start$ 0000000000000058 g .text 0000000000000000 _zimage_start ld: arch/powerpc/boot/wrapper.a(crt0.o): in function '_zimage_start': (.text+0x58): multiple definition of '_zimage_start'; arch/powerpc/boot/pseries-head.o:(.text+0x0): first defined here Clang requires the .weak directive to appear after the symbol is declared. The binutils manual says: This directive sets the weak attribute on the comma separated list of symbol names. If the symbols do not already exist, they will be created. So it appears this is different with clang. The only reference I could see for this was an OpenBSD mailing list post[1]. Changing it to be after the declaration fixes building with Clang, and still works with GCC. $ objdump -t arch/powerpc/boot/crt0.o |grep _zimage_start$ 0000000000000058 w .text 0000000000000000 _zimage_start Reported to clang as https://bugs.llvm.org/show_bug.cgi?id=38921 [1] https://groups.google.com/forum/#!topic/fa.openbsd.tech/PAgKKen2YCY Signed-off-by: Joel Stanley <joel@jms.id.au> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21powerpc/mm: Don't report hugepage tables as memory leaks when using kmemleakChristophe Leroy1-0/+3
[ Upstream commit 803d690e68f0c5230183f1a42c7d50a41d16e380 ] When a process allocates a hugepage, the following leak is reported by kmemleak. This is a false positive which is due to the pointer to the table being stored in the PGD as physical memory address and not virtual memory pointer. unreferenced object 0xc30f8200 (size 512): comm "mmap", pid 374, jiffies 4872494 (age 627.630s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<e32b68da>] huge_pte_alloc+0xdc/0x1f8 [<9e0df1e1>] hugetlb_fault+0x560/0x8f8 [<7938ec6c>] follow_hugetlb_page+0x14c/0x44c [<afbdb405>] __get_user_pages+0x1c4/0x3dc [<b8fd7cd9>] __mm_populate+0xac/0x140 [<3215421e>] vm_mmap_pgoff+0xb4/0xb8 [<c148db69>] ksys_mmap_pgoff+0xcc/0x1fc [<4fcd760f>] ret_from_syscall+0x0/0x38 See commit a984506c542e2 ("powerpc/mm: Don't report PUDs as memory leaks when using kmemleak") for detailed explanation. To fix that, this patch tells kmemleak to ignore the allocated hugepage table. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21powerpc/nohash: fix undefined behaviour when testing page size supportDaniel Axtens1-0/+3
[ Upstream commit f5e284803a7206d43e26f9ffcae5de9626d95e37 ] When enumerating page size definitions to check hardware support, we construct a constant which is (1U << (def->shift - 10)). However, the array of page size definitions is only initalised for various MMU_PAGE_* constants, so it contains a number of 0-initialised elements with def->shift == 0. This means we end up shifting by a very large number, which gives the following UBSan splat: ================================================================================ UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21 shift exponent 4294967286 is too large for 32-bit type 'unsigned int' CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-00045-ga604f927b012-dirty #6 Call Trace: [c00000000101bc20] [c000000000a13d54] .dump_stack+0xa8/0xec (unreliable) [c00000000101bcb0] [c0000000004f20a8] .ubsan_epilogue+0x18/0x64 [c00000000101bd30] [c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4 [c00000000101be20] [c000000000d21760] .early_init_mmu+0x1b4/0x5a0 [c00000000101bf10] [c000000000d1ba28] .early_setup+0x100/0x130 [c00000000101bf90] [c000000000000528] start_here_multiplatform+0x68/0x80 ================================================================================ Fix this by first checking if the element exists (shift != 0) before constructing the constant. Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21powerpc/eeh: Fix possible null deref in eeh_dump_dev_log()Sam Bobroff1-0/+5
[ Upstream commit f9bc28aedfb5bbd572d2d365f3095c1becd7209b ] If an error occurs during an unplug operation, it's possible for eeh_dump_dev_log() to be called when edev->pdn is null, which currently leads to dereferencing a null pointer. Handle this by skipping the error log for those devices. Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21powerpc/Makefile: Fix PPC_BOOK3S_64 ASFLAGSJoel Stanley1-1/+5
[ Upstream commit 960e30029863db95ec79a71009272d4661db5991 ] Ever since commit 15a3204d24a3 ("powerpc/64s: Set assembler machine type to POWER4") we force -mpower4 to be passed to the assembler irrespective of the CFLAGS used (for Book3s 64). When building a powerpc64 kernel with clang, clang will not add -many to the assembler flags, so any instructions that the compiler has generated that are not available on power4 will cause an error: /usr/bin/as -a64 -mppc64 -mlittle-endian -mpower8 \ -I ./arch/powerpc/include -I ./arch/powerpc/include/generated \ -I ./include -I ./arch/powerpc/include/uapi \ -I ./arch/powerpc/include/generated/uapi -I ./include/uapi \ -I ./include/generated/uapi -I arch/powerpc -I arch/powerpc \ -maltivec -mpower4 -o init/do_mounts.o /tmp/do_mounts-3b0a3d.s /tmp/do_mounts-51ce54.s:748: Error: unrecognized opcode: `isel' GCC does include -many, so the GCC driven gas call will succeed: as -v -I ./arch/powerpc/include -I ./arch/powerpc/include/generated -I ./include -I ./arch/powerpc/include/uapi -I ./arch/powerpc/include/generated/uapi -I ./include/uapi -I ./include/generated/uapi -I arch/powerpc -I arch/powerpc -a64 -mpower8 -many -mlittle -maltivec -mpower4 -o init/do_mounts.o Note that isel is power7 and above for IBM CPUs. GCC only generates it for Power9 and above, but the above test was run against the clang generated assembly. Peter Bergner explains: When using -many -mpower4, gas will first try and find a matching power4 mnemonic and failing that, it will then allow any valid mnemonic that gas knows about. GCC's use of -many predates me though. IIRC, Alan looked at trying to remove it, but I forget why he didn't. Could be either a gcc or gas issue at the time. I'm not sure whether issue still exists or not. He and I have modified how gas works internally a fair amount since he tried removing gcc use of -many. I will also note that when using -many, gas will choose the first mnemonic that matches in the mnemonic table and we have (mostly) sorted the table so that server mnemonics show up earlier in the table than other mnemonics, so they'll be seen/chosen first. By explicitly setting -many we can build with Clang and GCC while retaining the -mpower4 option. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21powerpc/mm: fix always true/false warning in slice.cChristophe Leroy1-7/+14
[ Upstream commit 37e9c674e7e6f445e12cb1151017bd4bacdd1e2d ] This patch fixes the following warnings (obtained with make W=1). arch/powerpc/mm/slice.c: In function 'slice_range_to_mask': arch/powerpc/mm/slice.c:73:12: error: comparison is always true due to limited range of data type [-Werror=type-limits] if (start < SLICE_LOW_TOP) { ^ arch/powerpc/mm/slice.c:81:20: error: comparison is always false due to limited range of data type [-Werror=type-limits] if ((start + len) > SLICE_LOW_TOP) { ^ arch/powerpc/mm/slice.c: In function 'slice_mask_for_free': arch/powerpc/mm/slice.c:136:17: error: comparison is always true due to limited range of data type [-Werror=type-limits] if (high_limit <= SLICE_LOW_TOP) ^ arch/powerpc/mm/slice.c: In function 'slice_check_range_fits': arch/powerpc/mm/slice.c:185:12: error: comparison is always true due to limited range of data type [-Werror=type-limits] if (start < SLICE_LOW_TOP) { ^ arch/powerpc/mm/slice.c:195:39: error: comparison is always false due to limited range of data type [-Werror=type-limits] if (SLICE_NUM_HIGH && ((start + len) > SLICE_LOW_TOP)) { ^ arch/powerpc/mm/slice.c: In function 'slice_scan_available': arch/powerpc/mm/slice.c:306:11: error: comparison is always true due to limited range of data type [-Werror=type-limits] if (addr < SLICE_LOW_TOP) { ^ arch/powerpc/mm/slice.c: In function 'get_slice_psize': arch/powerpc/mm/slice.c:709:11: error: comparison is always true due to limited range of data type [-Werror=type-limits] if (addr < SLICE_LOW_TOP) { ^ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21powerpc/mm: Fix page table dump to work on RadixMichael Ellerman1-3/+9
[ Upstream commit 0d923962ab69c27cca664a2d535e90ef655110ca ] When we're running on Book3S with the Radix MMU enabled the page table dump currently prints the wrong addresses because it uses the wrong start address. Fix it to use PAGE_OFFSET rather than KERN_VIRT_START. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21powerpc/64/module: REL32 relocation range checkNicholas Piggin1-1/+8
[ Upstream commit b851ba02a6f3075f0f99c60c4bc30a4af80cf428 ] The recent module relocation overflow crash demonstrated that we have no range checking on REL32 relative relocations. This patch implements a basic check, the same kernel that previously oopsed and rebooted now continues with some of these errors when loading the module: module_64: x_tables: REL32 527703503449812 out of range! Possibly other relocations (ADDR32, REL16, TOC16, etc.) should also have overflow checks. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21powerpc/traps: restore recoverability of machine_check interruptsChristophe Leroy1-2/+7
[ Upstream commit daf00ae71dad8aa05965713c62558aeebf2df48e ] commit b96672dd840f ("powerpc: Machine check interrupt is a non- maskable interrupt") added a call to nmi_enter() at the beginning of machine check restart exception handler. Due to that, in_interrupt() always returns true regardless of the state before entering the exception, and die() panics even when the system was not already in interrupt. This patch calls nmi_exit() before calling die() in order to restore the interrupt state we had before calling nmi_enter() Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13powerpc/64s/hash: Do not use PPC_INVALIDATE_ERAT on CPUs before POWER9Nicholas Piggin2-2/+9
commit bc276ecba132caccb1fda5863a652c15def2b8c6 upstream. PPC_INVALIDATE_ERAT is slbia IH=7 which is a new variant introduced with POWER9, and the result is undefined on earlier CPUs. Commits 7b9f71f974 ("powerpc/64s: POWER9 machine check handler") and d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9") caused POWER7/8 code to use this instruction. Remove it. An ERAT flush can be made by invalidatig the SLB, but before POWER9 that requires a flush and rebolt. Fixes: 7b9f71f974 ("powerpc/64s: POWER9 machine check handler") Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9") Cc: stable@vger.kernel.org # v4.11+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13powerpc/tm: Fix HFSCR bit for no suspend caseMichael Neuling1-6/+12
commit dd9a8c5a87395b6f05552c3b44e42fdc95760552 upstream. Currently on P9N DD2.1 we end up taking infinite TM facility unavailable exceptions on the first TM usage by userspace. In the special case of TM no suspend (P9N DD2.1), Linux is told TM is off via CPU dt-ftrs but told to (partially) use it via OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED. So HFSCR[TM] will be off from dt-ftrs but we need to turn it on for the no suspend case. This patch fixes this by enabling HFSCR TM in this case. Cc: stable@vger.kernel.org # 4.15+ Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13powerpc/msi: Fix compile error on mpc83xxChristophe Leroy1-0/+7
commit 0f99153def98134403c9149128e59d3e1786cf04 upstream. mpic_get_primary_version() is not defined when not using MPIC. The compile error log like: arch/powerpc/sysdev/built-in.o: In function `fsl_of_msi_probe': fsl_msi.c:(.text+0x150c): undefined reference to `fsl_mpic_primary_get_version' Signed-off-by: Jia Hongtao <hongtao.jia@freescale.com> Signed-off-by: Scott Wood <scottwood@freescale.com> Reported-by: Radu Rendec <radu.rendec@gmail.com> Fixes: 807d38b73b6 ("powerpc/mpic: Add get_version API both for internal and external use") Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-04powerpc/numa: Skip onlining a offline node in kdump pathSrikar Dronamraju1-2/+3
[ Upstream commit ac1788cc7da4ce54edcfd2e499afdb0a23d5c41d ] With commit 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot"), kdump kernel on shared LPAR may crash. The necessary conditions are - Shared LPAR with at least 2 nodes having memory and CPUs. - Memory requirement for kdump kernel must be met by the first N-1 nodes where there are at least N nodes with memory and CPUs. Example numactl of such a machine. $ numactl -H available: 5 nodes (0,2,5-7) node 0 cpus: node 0 size: 0 MB node 0 free: 0 MB node 2 cpus: node 2 size: 255 MB node 2 free: 189 MB node 5 cpus: 24 25 26 27 28 29 30 31 node 5 size: 4095 MB node 5 free: 4024 MB node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 node 6 size: 6353 MB node 6 free: 5998 MB node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 node 7 size: 7640 MB node 7 free: 7164 MB node distances: node 0 2 5 6 7 0: 10 40 40 40 40 2: 40 10 40 40 40 5: 40 40 10 40 40 6: 40 40 40 10 20 7: 40 40 40 20 10 Steps to reproduce. 1. Load / start kdump service. 2. Trigger a kdump (for example : echo c > /proc/sysrq-trigger) When booting a kdump kernel with 2048M: kexec: Starting switchover sequence. I'm in purgatory Using 1TB segments hash-mmu: Initializing hash mmu with SLB Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018 Found initrd at 0xc000000009e70000:0xc00000000ae554b4 Using pSeries machine description ----------------------------------------------------- ppc64_pft_size = 0x1e phys_mem_size = 0x88000000 dcache_bsize = 0x80 icache_bsize = 0x80 cpu_features = 0x000000ff8f5d91a7 possible = 0x0000fbffcf5fb1a7 always = 0x0000006f8b5c91a1 cpu_user_features = 0xdc0065c2 0xef000000 mmu_features = 0x7c006001 firmware_features = 0x00000007c45bfc57 htab_hash_mask = 0x7fffff physical_start = 0x8000000 ----------------------------------------------------- numa: NODE_DATA [mem 0x87d5e300-0x87d67fff] numa: NODE_DATA(0) on node 6 numa: NODE_DATA [mem 0x87d54600-0x87d5e2ff] Top of RAM: 0x88000000, Total RAM: 0x88000000 Memory hole size: 0MB Zone ranges: DMA [mem 0x0000000000000000-0x0000000087ffffff] DMA32 empty Normal empty Movable zone start for each node Early memory node ranges node 6: [mem 0x0000000000000000-0x0000000087ffffff] Could not find start_pfn for node 0 Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000] On node 0 totalpages: 0 Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff] On node 6 totalpages: 34816 Unable to handle kernel paging request for data at address 0x00000060 Faulting instruction address: 0xc000000008703a54 Oops: Kernel access of bad area, sig: 11 [#1] LE SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1 NIP: c000000008703a54 LR: c000000008703a38 CTR: 0000000000000000 REGS: c00000000b673440 TRAP: 0380 Not tainted (4.19.0-rc5-master+) MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 24022022 XER: 20000002 CFAR: c0000000086fc238 IRQMASK: 0 GPR00: c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000 GPR04: 0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080 GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000220 GPR12: 0000000000002200 c000000009e51400 0000000000000000 0000000000000008 GPR16: 0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000 GPR20: c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008 GPR24: 0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78 GPR28: c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400 NIP [c000000008703a54] bus_add_device+0x84/0x1e0 LR [c000000008703a38] bus_add_device+0x68/0x1e0 Call Trace: [c00000000b6736c0] [c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable) [c00000000b673740] [c000000008700194] device_add+0x454/0x7c0 [c00000000b673800] [c00000000872e660] __register_one_node+0xb0/0x240 [c00000000b673860] [c00000000839a6bc] __try_online_node+0x12c/0x180 [c00000000b673900] [c00000000839b978] try_online_node+0x58/0x90 [c00000000b673930] [c0000000080846d8] find_and_online_cpu_nid+0x158/0x190 [c00000000b673a10] [c0000000080848a0] numa_update_cpu_topology+0x190/0x580 [c00000000b673c00] [c000000008d3f2e4] smp_cpus_done+0x94/0x108 [c00000000b673c70] [c000000008d5c00c] smp_init+0x174/0x19c [c00000000b673d00] [c000000008d346b8] kernel_init_freeable+0x1e0/0x450 [c00000000b673dc0] [c0000000080102e8] kernel_init+0x28/0x160 [c00000000b673e30] [c00000000800b65c] ret_from_kernel_thread+0x5c/0x80 Instruction dump: 60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c e8bf0050 e93e0098 3b9f0010 2fa50000 <e8690060> 38630018 419e0114 7f84e378 ---[ end trace 593577668c2daa65 ]--- However a regular kernel with 4096M (2048 gets reserved for crash kernel) boots properly. Unlike regular kernels, which mark all available nodes as online, kdump kernel only marks just enough nodes as online and marks the rest as offline at boot. However kdump kernel boots with all available CPUs. With Commit 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot"), all CPUs are onlined on their respective nodes at boot time. try_online_node() tries to online the offline nodes but fails as all needed subsystems are not yet initialized. As part of fix, detect and skip early onlining of a offline node. Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot") Reported-by: Pavithra Prakash <pavrampu@in.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-10-20powerpc/numa: Use associativity if VPHN hcall is successfulSrikar Dronamraju1-1/+3
[ Upstream commit 2483ef056f6e42f61cd266452e2841165dfe1b5c ] Currently associativity is used to lookup node-id even if the preceding VPHN hcall failed. However this can cause CPU to be made part of the wrong node, (most likely to be node 0). This is because VPHN is not enabled on KVM guests. With 2ea6263 ("powerpc/topology: Get topology for shared processors at boot"), associativity is used to set to the wrong node. Hence KVM guest topology is broken. For example : A 4 node KVM guest before would have reported. [root@localhost ~]# numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 node 0 size: 1746 MB node 0 free: 1604 MB node 1 cpus: 4 5 6 7 node 1 size: 2044 MB node 1 free: 1765 MB node 2 cpus: 8 9 10 11 node 2 size: 2044 MB node 2 free: 1837 MB node 3 cpus: 12 13 14 15 node 3 size: 2044 MB node 3 free: 1903 MB node distances: node 0 1 2 3 0: 10 40 40 40 1: 40 10 40 40 2: 40 40 10 40 3: 40 40 40 10 Would now report: [root@localhost ~]# numactl -H available: 4 nodes (0-3) node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 node 0 size: 1746 MB node 0 free: 1244 MB node 1 cpus: node 1 size: 2044 MB node 1 free: 2032 MB node 2 cpus: 1 node 2 size: 2044 MB node 2 free: 2028 MB node 3 cpus: node 3 size: 2044 MB node 3 free: 2032 MB node distances: node 0 1 2 3 0: 10 40 40 40 1: 40 10 40 40 2: 40 40 10 40 3: 40 40 40 10 Fix this by skipping associativity lookup if the VPHN hcall failed. Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot") Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-20powerpc/tm: Avoid possible userspace r1 corruption on reclaimMichael Neuling1-1/+8
[ Upstream commit 96dc89d526ef77604376f06220e3d2931a0bfd58 ] Current we store the userspace r1 to PACATMSCRATCH before finally saving it to the thread struct. In theory an exception could be taken here (like a machine check or SLB miss) that could write PACATMSCRATCH and hence corrupt the userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but others do. We've never actually seen this happen but it's theoretically possible. Either way, the code is fragile as it is. This patch saves r1 to the kernel stack (which can't fault) before we turn MSR[RI] back on. PACATMSCRATCH is still used but only with MSR[RI] off. We then copy r1 from the kernel stack to the thread struct once we have MSR[RI] back on. Suggested-by: Breno Leitao <leitao@debian.org> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-20powerpc/tm: Fix userspace r13 corruptionMichael Neuling1-2/+9
[ Upstream commit cf13435b730a502e814c63c84d93db131e563f5f ] When we treclaim we store the userspace checkpointed r13 to a scratch SPR and then later save the scratch SPR to the user thread struct. Unfortunately, this doesn't work as accessing the user thread struct can take an SLB fault and the SLB fault handler will write the same scratch SPRG that now contains the userspace r13. To fix this, we store r13 to the kernel stack (which can't fault) before we access the user thread struct. Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen as a random userspace segfault with r13 looking like a kernel address. Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18KVM: PPC: Book3S HV: Avoid crash from THP collapse during radix page faultPaul Mackerras1-0/+10
commit 6579804c431712d56956a63b1a01509441cc6800 upstream. Commit 71d29f43b633 ("KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size", 2018-09-11) added a call to __find_linux_pte() and a dereference of the returned PTE pointer to the radix page fault path in the common case where the page is normal system memory. Previously, __find_linux_pte() was only called for mappings to physical addresses which don't have a page struct (e.g. memory-mapped I/O) or where the page struct is marked as reserved memory. This exposes us to the possibility that the returned PTE pointer could be NULL, for example in the case of a concurrent THP collapse operation. Dereferencing the returned NULL pointer causes a host crash. To fix this, we check for NULL, and if it is NULL, we retry the operation by returning to the guest, with the expectation that it will generate the same page fault again (unless of course it has been fixed up by another CPU in the meantime). Fixes: 71d29f43b633 ("KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18mm: Preserve _PAGE_DEVMAP across mprotect() callsJan Kara1-2/+2
commit 4628a64591e6cee181237060961e98c615c33966 upstream. Currently _PAGE_DEVMAP bit is not preserved in mprotect(2) calls. As a result we will see warnings such as: BUG: Bad page map in process JobWrk0013 pte:800001803875ea25 pmd:7624381067 addr:00007f0930720000 vm_flags:280000f9 anon_vma: (null) mapping:ffff97f2384056f0 index:0 file:457-000000fe00000030-00000009-000000ca-00000001_2001.fileblock fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap [xfs] readpage: (null) CPU: 3 PID: 15848 Comm: JobWrk0013 Tainted: G W 4.12.14-2.g7573215-default #1 SLE12-SP4 (unreleased) Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018 Call Trace: dump_stack+0x5a/0x75 print_bad_pte+0x217/0x2c0 ? enqueue_task_fair+0x76/0x9f0 _vm_normal_page+0xe5/0x100 zap_pte_range+0x148/0x740 unmap_page_range+0x39a/0x4b0 unmap_vmas+0x42/0x90 unmap_region+0x99/0xf0 ? vma_gap_callbacks_rotate+0x1a/0x20 do_munmap+0x255/0x3a0 vm_munmap+0x54/0x80 SyS_munmap+0x1d/0x30 do_syscall_64+0x74/0x150 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 ... when mprotect(2) gets used on DAX mappings. Also there is a wide variety of other failures that can result from the missing _PAGE_DEVMAP flag when the area gets used by get_user_pages() later. Fix the problem by including _PAGE_DEVMAP in a set of flags that get preserved by mprotect(2). Fixes: 69660fd797c3 ("x86, mm: introduce _PAGE_DEVMAP") Fixes: ebd31197931d ("powerpc/mm: Add devmap support for ppc64") Cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping sizeNicholas Piggin1-54/+37
[ Upstream commit 71d29f43b6332badc5598c656616a62575e83342 ] THP paths can defer splitting compound pages until after the actual remap and TLB flushes to split a huge PMD/PUD. This causes radix partition scope page table mappings to get out of synch with the host qemu page table mappings. This results in random memory corruption in the guest when running with THP. The easiest way to reproduce is use KVM balloon to free up a lot of memory in the guest and then shrink the balloon to give the memory back, while some work is being done in the guest. Cc: David Gibson <david@gibson.dropbear.id.au> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: kvm-ppc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13powerpc/lib: fix book3s/32 boot failure due to code patchingChristophe Leroy1-8/+12
commit b45ba4a51cde29b2939365ef0c07ad34c8321789 upstream. Commit 51c3c62b58b3 ("powerpc: Avoid code patching freed init sections") accesses 'init_mem_is_free' flag too early, before the kernel is relocated. This provokes early boot failure (before the console is active). As it is not necessary to do this verification that early, this patch moves the test into patch_instruction() instead of __patch_instruction(). This modification also has the advantage of avoiding unnecessary remappings. Fixes: 51c3c62b58b3 ("powerpc: Avoid code patching freed init sections") Cc: stable@vger.kernel.org # 4.13+ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13powerpc: Avoid code patching freed init sectionsMichael Neuling3-0/+9
commit 51c3c62b58b357e8d35e4cc32f7b4ec907426fe3 upstream. This stops us from doing code patching in init sections after they've been freed. In this chain: kvm_guest_init() -> kvm_use_magic_page() -> fault_in_pages_readable() -> __get_user() -> __get_user_nocheck() -> barrier_nospec(); We have a code patching location at barrier_nospec() and kvm_guest_init() is an init function. This whole chain gets inlined, so when we free the init section (hence kvm_guest_init()), this code goes away and hence should no longer be patched. We seen this as userspace memory corruption when using a memory checker while doing partition migration testing on powervm (this starts the code patching post migration via /sys/kernel/mobility/migration). In theory, it could also happen when using /sys/kernel/debug/powerpc/barrier_nospec. Cc: stable@vger.kernel.org # 4.13+ Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-10KVM: PPC: Book3S HV: Don't truncate HPTE index in xlate functionPaul Mackerras1-1/+1
[ Upstream commit 46dec40fb741f00f1864580130779aeeaf24fb3d ] This fixes a bug which causes guest virtual addresses to get translated to guest real addresses incorrectly when the guest is using the HPT MMU and has more than 256GB of RAM, or more specifically has a HPT larger than 2GB. This has showed up in testing as a failure of the host to emulate doorbell instructions correctly on POWER9 for HPT guests with more than 256GB of RAM. The bug is that the HPTE index in kvmppc_mmu_book3s_64_hv_xlate() is stored as an int, and in forming the HPTE address, the index gets shifted left 4 bits as an int before being signed-extended to 64 bits. The simple fix is to make the variable a long int, matching the return type of kvmppc_hv_find_lock_hpte(), which is what calculates the index. Fixes: 697d3899dcb4 ("KVM: PPC: Implement MMIO emulation support for Book3S HV guests") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-04powerpc/pseries: Fix unitialized timer reset on migrationMichael Bringmann1-1/+2
commit 8604895a34d92f5e186ceb931b0d1b384030ea3d upstream. After migration of a powerpc LPAR, the kernel executes code to update the system state to reflect new platform characteristics. Such changes include modifications to device tree properties provided to the system by PHYP. Property notifications received by the post_mobility_fixup() code are passed along to the kernel in general through a call to of_update_property() which in turn passes such events back to all modules through entries like the '.notifier_call' function within the NUMA module. When the NUMA module updates its state, it resets its event timer. If this occurs after a previous call to stop_topology_update() or on a system without VPHN enabled, the code runs into an unitialized timer structure and crashes. This patch adds a safety check along this path toward the problem code. An example crash log is as follows. ibmvscsi 30000081: Re-enabling adapter! ------------[ cut here ]------------ kernel BUG at kernel/time/timer.c:958! Oops: Exception in kernel mode, sig: 5 [#1] LE SMP NR_CPUS=2048 NUMA pSeries Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179 ... NIP mod_timer+0x4c/0x400 LR reset_topology_timer+0x40/0x60 Call Trace: 0xc0000003f9407830 (unreliable) reset_topology_timer+0x40/0x60 dt_update_callback+0x100/0x120 notifier_call_chain+0x90/0x100 __blocking_notifier_call_chain+0x60/0x90 of_property_notify+0x90/0xd0 of_update_property+0x104/0x150 update_dt_property+0xdc/0x1f0 pseries_devicetree_update+0x2d0/0x510 post_mobility_fixup+0x7c/0xf0 migration_store+0xa4/0xc0 kobj_attr_store+0x30/0x60 sysfs_kf_write+0x64/0xa0 kernfs_fop_write+0x16c/0x240 __vfs_write+0x40/0x200 vfs_write+0xc8/0x240 ksys_write+0x5c/0x100 system_call+0x58/0x6c Fixes: 5d88aa85c00b ("powerpc/pseries: Update CPU maps when device tree is updated") Cc: stable@vger.kernel.org # v3.10+ Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-04powerpc/pkeys: Fix reading of ibm, processor-storage-keys propertyThiago Jung Bauermann1-1/+1
commit c716a25b9b70084e1144f77423f5aedd772ea478 upstream. scan_pkey_feature() uses of_property_read_u32_array() to read the ibm,processor-storage-keys property and calls be32_to_cpu() on the value it gets. The problem is that of_property_read_u32_array() already returns the value converted to the CPU byte order. The value of pkeys_total ends up more or less sane because there's a min() call in pkey_initialize() which reduces pkeys_total to 32. So in practice the kernel ignores the fact that the hypervisor reserved one key for itself (the device tree advertises 31 keys in my test VM). This is wrong, but the effect in practice is that when a process tries to allocate the 32nd key, it gets an -EINVAL error instead of -ENOSPC which would indicate that there aren't any keys available Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-04powerpc: fix csum_ipv6_magic() on little endian platformsChristophe Leroy1-0/+3
commit 85682a7e3b9c664995ad477520f917039afdc330 upstream. On little endian platforms, csum_ipv6_magic() keeps len and proto in CPU byte order. This generates a bad results leading to ICMPv6 packets from other hosts being dropped by powerpc64le platforms. In order to fix this, len and proto should be converted to network byte order ie bigendian byte order. However checksumming 0x12345678 and 0x56341278 provide the exact same result so it is enough to rotate the sum of len and proto by 1 byte. PPC32 only support bigendian so the fix is needed for PPC64 only Fixes: e9c4943a107b ("powerpc: Implement csum_ipv6_magic in assembly") Reported-by: Jianlin Shi <jishi@redhat.com> Reported-by: Xin Long <lucien.xin@gmail.com> Cc: <stable@vger.kernel.org> # 4.18+ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-04KVM: PPC: Book3S HV: Fix guest r11 corruption with POWER9 TM workaroundsMichael Neuling1-2/+2
commit f14040bca89258b8a1c71e2112e430462172ce93 upstream. When we come into the softpatch handler (0x1500), we use r11 to store the HSRR0 for later use by the denorm handler. We also use the softpatch handler for the TM workarounds for POWER9. Unfortunately, in kvmppc_interrupt_hv we later store r11 out to the vcpu assuming it's still what we got from userspace. This causes r11 to be corrupted in the VCPU and hence when we restore the guest, we get a corrupted r11. We've seen this when running TM tests inside guests on P9. This fixes the problem by only touching r11 in the denorm case. Fixes: 4bb3c7a020 ("KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9") Cc: <stable@vger.kernel.org> # 4.17+ Test-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-04powerpc/powernv/ioda2: Reduce upper limit for DMA window sizeAlexey Kardashevskiy1-1/+1
[ Upstream commit d3d4ffaae439981e1e441ebb125aa3588627c5d8 ] We use PHB in mode1 which uses bit 59 to select a correct DMA window. However there is mode2 which uses bits 59:55 and allows up to 32 DMA windows per a PE. Even though documentation does not clearly specify that, it seems that the actual hardware does not support bits 59:55 even in mode1, in other words we can create a window as big as 1<<58 but DMA simply won't work. This reduces the upper limit from 59 to 55 bits to let the userspace know about the hardware limits. Fixes: 7aafac11e3 "powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested" Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-04powerpc/kdump: Handle crashkernel memory reservation failureHari Bathini1-1/+6
[ Upstream commit 8950329c4a64c6d3ca0bc34711a1afbd9ce05657 ] Memory reservation for crashkernel could fail if there are holes around kdump kernel offset (128M). Fail gracefully in such cases and print an error message. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Tested-by: David Gibson <dgibson@redhat.com> Reviewed-by: Dave Young <dyoung@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-26KVM: PPC: Book3S: Fix matching of hardware and emulated TCE tablesAlexey Kardashevskiy1-3/+2
[ Upstream commit 76346cd93a5eca33700f82685d56172dd65d4c0a ] When attaching a hardware table to LIOBN in KVM, we match table parameters such as page size, table offset and table size. However the tables are created via very different paths - VFIO and KVM - and the VFIO path goes through the platform code which has minimum TCE page size requirement (which is 4K but since we allocate memory by pages and cannot avoid alignment anyway, we align to 64k pages for powernv_defconfig). So when we match the tables, one might be bigger that the other which means the hardware table cannot get attached to LIOBN and DMA mapping fails. This removes the table size alignment from the guest visible table. This does not affect the memory allocation which is still aligned - kvmppc_tce_pages() takes care of this. This relaxes the check we do when attaching tables to allow the hardware table be bigger than the guest visible table. Ideally we want the KVM table to cover the same space as the hardware table does but since the hardware table may use multiple levels, and all levels must use the same table size (IODA2 design), the area it can actually cover might get very different from the window size which the guest requested, even though the guest won't map it all. Fixes: ca1fc489cf "KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages" Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-26KVM: PPC: Book3S HV: Add of_node_put() in success pathNicholas Mc Guire1-0/+2
[ Upstream commit 51eaa08f029c7343df846325d7cf047be8b96e81 ] The call to of_find_compatible_node() is returning a pointer with incremented refcount so it must be explicitly decremented after the last use. As here it is only being used for checking of node presence but the result is not actually used in the success path it can be dropped immediately. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Fixes: commit f725758b899f ("KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-26powerpc/powernv: opal_put_chars partial write fixNicholas Piggin1-1/+1
[ Upstream commit bd90284cc6c1c9e8e48c8eadd0c79574fcce0b81 ] The intention here is to consume and discard the remaining buffer upon error. This works if there has not been a previous partial write. If there has been, then total_len is no longer total number of bytes to copy. total_len is always "bytes left to copy", so it should be added to written bytes. This code may not be exercised any more if partial writes will not be hit, but this is a small bugfix before a larger change. Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19powerpc/mm: Don't report PUDs as memory leaks when using kmemleakMichael Ellerman1-2/+21
[ Upstream commit a984506c542e26b31cbb446438f8439fa2253b2e ] Paul Menzel reported that kmemleak was producing reports such as: unreferenced object 0xc0000000f8b80000 (size 16384): comm "init", pid 1, jiffies 4294937416 (age 312.240s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000d997deb7>] __pud_alloc+0x80/0x190 [<0000000087f2e8a3>] move_page_tables+0xbac/0xdc0 [<00000000091e51c2>] shift_arg_pages+0xc0/0x210 [<00000000ab88670c>] setup_arg_pages+0x22c/0x2a0 [<0000000060871529>] load_elf_binary+0x41c/0x1648 [<00000000ecd9d2d4>] search_binary_handler.part.11+0xbc/0x280 [<0000000034e0cdd7>] __do_execve_file.isra.13+0x73c/0x940 [<000000005f953a6e>] sys_execve+0x58/0x70 [<000000009700a858>] system_call+0x5c/0x70 Indicating that a PUD was being leaked. However what's really happening is that kmemleak is not able to recognise the references from the PGD to the PUD, because they are not fully qualified pointers. We can confirm that in xmon, eg: Find the task struct for pid 1 "init": 0:mon> P task_struct ->thread.ksp PID PPID S P CMD c0000001fe7c0000 c0000001fe803960 1 0 S 13 systemd Dump virtual address 0 to find the PGD: 0:mon> dv 0 c0000001fe7c0000 pgd @ 0xc0000000f8b01000 Dump the memory of the PGD: 0:mon> d c0000000f8b01000 c0000000f8b01000 00000000f8b90000 0000000000000000 |................| c0000000f8b01010 0000000000000000 0000000000000000 |................| c0000000f8b01020 0000000000000000 0000000000000000 |................| c0000000f8b01030 0000000000000000 00000000f8b80000 |................| ^^^^^^^^^^^^^^^^ There we can see the reference to our supposedly leaked PUD. But because it's missing the leading 0xc, kmemleak won't recognise it. We can confirm it's still in use by translating an address that is mapped via it: 0:mon> dv 7fff94000000 c0000001fe7c0000 pgd @ 0xc0000000f8b01000 pgdp @ 0xc0000000f8b01038 = 0x00000000f8b80000 <-- pudp @ 0xc0000000f8b81ff8 = 0x00000000037c4000 pmdp @ 0xc0000000037c5ca0 = 0x00000000fbd89000 ptep @ 0xc0000000fbd89000 = 0xc0800001d5ce0386 Maps physical address = 0x00000001d5ce0000 Flags = Accessed Dirty Read Write The fix is fairly simple. We need to tell kmemleak to ignore PUD allocations and never report them as leaks. We can also tell it not to scan the PGD, because it will never find pointers in there. However it will still notice if we allocate a PGD and then leak it. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19powerpc/pseries: fix EEH recovery of some IOV devicesSam Bobroff1-8/+17
[ Upstream commit b87b9cf4935325c98522823caeddd333022a1c62 ] EEH recovery currently fails on pSeries for some IOV capable PCI devices, if CONFIG_PCI_IOV is on and the hypervisor doesn't provide certain device tree properties for the device. (Found on an IOV capable device using the ipr driver.) Recovery fails in pci_enable_resources() at the check on r->parent, because r->flags is set and r->parent is not. This state is due to sriov_init() setting the start, end and flags members of the IOV BARs but the parent not being set later in pseries_pci_fixup_iov_resources(), because the "ibm,open-sriov-vf-bar-info" property is missing. Correct this by zeroing the resource flags for IOV BARs when they can't be configured (this is the same method used by sriov_init() and __pci_read_base()). VFs cleared this way can't be enabled later, because that requires another device tree property, "ibm,number-of-configurable-vfs" as well as support for the RTAS function "ibm_map_pes". These are all part of hypervisor support for IOV and it seems unlikely that a hypervisor would ever partially, but not fully, support it. (None are currently provided by QEMU/KVM.) Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com> Reviewed-by: Bryant G. Ly <bryantly@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19powerpc/4xx: Fix error return path in ppc4xx_msi_probe()Guenter Roeck1-21/+30
[ Upstream commit 6e0495c2e8ac39b1aad0a4588fe64413ce9028c0 ] An arbitrary error in ppc4xx_msi_probe() quite likely results in a crash similar to the following, seen after dma_alloc_coherent() returned an error. Unable to handle kernel paging request for data at address 0x00000000 Faulting instruction address: 0xc001bff0 Oops: Kernel access of bad area, sig: 11 [#1] BE Canyonlands Modules linked in: CPU: 0 PID: 1 Comm: swapper Tainted: G W 4.18.0-rc6-00010-gff33d1030a6c #1 NIP: c001bff0 LR: c001c418 CTR: c01faa7c REGS: cf82db40 TRAP: 0300 Tainted: G W (4.18.0-rc6-00010-gff33d1030a6c) MSR: 00029000 <CE,EE,ME> CR: 28002024 XER: 00000000 DEAR: 00000000 ESR: 00000000 GPR00: c001c418 cf82dbf0 cf828000 cf8de400 00000000 00000000 000000c4 000000c4 GPR08: c0481ea4 00000000 00000000 000000c4 22002024 00000000 c00025e8 00000000 GPR16: 00000000 00000000 00000000 00000000 00000000 00000000 c0492380 0000004a GPR24: 00029000 0000000c 00000000 cf8de410 c0494d60 c0494d60 cf8bebc0 00000001 NIP [c001bff0] ppc4xx_of_msi_remove+0x48/0xa0 LR [c001c418] ppc4xx_msi_probe+0x294/0x3b8 Call Trace: [cf82dbf0] [00029000] 0x29000 (unreliable) [cf82dc10] [c001c418] ppc4xx_msi_probe+0x294/0x3b8 [cf82dc70] [c0209fbc] platform_drv_probe+0x40/0x9c [cf82dc90] [c0208240] driver_probe_device+0x2a8/0x350 [cf82dcc0] [c0206204] bus_for_each_drv+0x60/0xac [cf82dcf0] [c0207e88] __device_attach+0xe8/0x160 [cf82dd20] [c02071e0] bus_probe_device+0xa0/0xbc [cf82dd40] [c02050c8] device_add+0x404/0x5c4 [cf82dd90] [c0288978] of_platform_device_create_pdata+0x88/0xd8 [cf82ddb0] [c0288b70] of_platform_bus_create+0x134/0x220 [cf82de10] [c0288bcc] of_platform_bus_create+0x190/0x220 [cf82de70] [c0288cf4] of_platform_bus_probe+0x98/0xec [cf82de90] [c0449650] __machine_initcall_canyonlands_ppc460ex_device_probe+0x38/0x54 [cf82dea0] [c0002404] do_one_initcall+0x40/0x188 [cf82df00] [c043daec] kernel_init_freeable+0x130/0x1d0 [cf82df30] [c0002600] kernel_init+0x18/0x104 [cf82df40] [c000c23c] ret_from_kernel_thread+0x14/0x1c Instruction dump: 90010024 813d0024 2f890000 83c30058 41bd0014 48000038 813d0024 7f89f800 409d002c 813e000c 57ea103a 3bff0001 <7c69502e> 2f830000 419effe0 4803b26d ---[ end trace 8cf551077ecfc42a ]--- Fix it up. Specifically, - Return valid error codes from ppc4xx_setup_pcieh_hw(), have it clean up after itself, and only access hardware after all possible error conditions have been handled. - Use devm_kzalloc() instead of kzalloc() in ppc4xx_msi_probe() Signed-off-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19powerpc/powernv: Fix concurrency issue with npu->mmio_atsd_usageReza Arbab1-2/+3
[ Upstream commit 9eab9901b015f489199105c470de1ffc337cfabb ] We've encountered a performance issue when multiple processors stress {get,put}_mmio_atsd_reg(). These functions contend for mmio_atsd_usage, an unsigned long used as a bitmask. The accesses to mmio_atsd_usage are done using test_and_set_bit_lock() and clear_bit_unlock(). As implemented, both of these will require a (successful) stwcx to that same cache line. What we end up with is thread A, attempting to unlock, being slowed by other threads repeatedly attempting to lock. A's stwcx instructions fail and retry because the memory reservation is lost every time a different thread beats it to the punch. There may be a long-term way to fix this at a larger scale, but for now resolve the immediate problem by gating our call to test_and_set_bit_lock() with one to test_bit(), which is obviously implemented without using a store. Fixes: 1ab66d1fbada ("powerpc/powernv: Introduce address translation services for Nvlink2") Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Acked-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19KVM: PPC: Book3S HV: Use correct pagesize in kvm_unmap_radix()Paul Mackerras1-3/+3
commit c066fafc595eef5ae3c83ae3a8305956b8c3ef15 upstream. Since commit e641a317830b ("KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix", 2017-10-26), kvm_unmap_radix() computes the number of PAGE_SIZEd pages being unmapped and passes it to kvmppc_update_dirty_map(), which expects to be passed the page size instead. Consequently it will only mark one system page dirty even when a large page (for example a THP page) is being unmapped. The consequence of this is that part of the THP page might not get copied during live migration, resulting in memory corruption for the guest. This fixes it by computing and passing the page size in kvm_unmap_radix(). Cc: stable@vger.kernel.org # v4.15+ Fixes: e641a317830b (KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix) Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-15powerpc/pseries: Avoid using the size greater than RTAS_ERROR_LOG_MAX.Mahesh Salgaonkar1-1/+1
[ Upstream commit 74e96bf44f430cf7a01de19ba6cf49b361cdfd6e ] The global mce data buffer that used to copy rtas error log is of 2048 (RTAS_ERROR_LOG_MAX) bytes in size. Before the copy we read extended_log_length from rtas error log header, then use max of extended_log_length and RTAS_ERROR_LOG_MAX as a size of data to be copied. Ideally the platform (phyp) will never send extended error log with size > 2048. But if that happens, then we have a risk of buffer overrun and corruption. Fix this by using min_t instead. Fixes: d368514c3097 ("powerpc: Fix corruption when grabbing FWNMI data") Reported-by: Michal Suchanek <msuchanek@suse.com> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-15powerpc/64s: Make rfi_flush_fallback a little more robustMichael Ellerman1-0/+6
[ Upstream commit 78ee9946371f5848ddfc88ab1a43867df8f17d83 ] Because rfi_flush_fallback runs immediately before the return to userspace it currently runs with the user r1 (stack pointer). This means if we oops in there we will report a bad kernel stack pointer in the exception entry path, eg: Bad kernel stack pointer 7ffff7150e40 at c0000000000023b4 Oops: Bad kernel stack pointer, sig: 6 [#1] LE SMP NR_CPUS=32 NUMA PowerNV Modules linked in: CPU: 0 PID: 1246 Comm: klogd Not tainted 4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3 #7 NIP: c0000000000023b4 LR: 0000000010053e00 CTR: 0000000000000040 REGS: c0000000fffe7d40 TRAP: 4100 Not tainted (4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3) MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000 CFAR: c00000000000bac8 IRQMASK: c0000000f1e66a80 GPR00: 0000000002000000 00007ffff7150e40 00007fff93a99900 0000000000000020 ... NIP [c0000000000023b4] rfi_flush_fallback+0x34/0x80 LR [0000000010053e00] 0x10053e00 Although the NIP tells us where we were, and the TRAP number tells us what happened, it would still be nicer if we could report the actual exception rather than barfing about the stack pointer. We an do that fairly simply by loading the kernel stack pointer on entry and restoring the user value before returning. That way we see a regular oops such as: Unrecoverable exception 4100 at c00000000000239c Oops: Unrecoverable exception, sig: 6 [#1] LE SMP NR_CPUS=32 NUMA PowerNV Modules linked in: CPU: 0 PID: 1251 Comm: klogd Not tainted 4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty #40 NIP: c00000000000239c LR: 0000000010053e00 CTR: 0000000000000040 REGS: c0000000f1e17bb0 TRAP: 4100 Not tainted (4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty) MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000 CFAR: c00000000000bac8 IRQMASK: 0 ... NIP [c00000000000239c] rfi_flush_fallback+0x3c/0x80 LR [0000000010053e00] 0x10053e00 Call Trace: [c0000000f1e17e30] [c00000000000b9e4] system_call+0x5c/0x70 (unreliable) Note this shouldn't make the kernel stack pointer vulnerable to a meltdown attack, because it should be flushed from the cache before we return to userspace. The user r1 value will be in the cache, because we load it in the return path, but that is harmless. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-15powerpc/platforms/85xx: fix t1042rdb_diu.c build errors & warningRandy Dunlap1-0/+4
[ Upstream commit f5daf77a55ef0e695cc90c440ed6503073ac5e07 ] Fix build errors and warnings in t1042rdb_diu.c by adding header files and MODULE_LICENSE(). ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: warning: data definition has no type or storage class early_initcall(t1042rdb_diu_init); ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: error: type defaults to 'int' in declaration of 'early_initcall' [-Werror=implicit-int] ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: warning: parameter names (without types) in function declaration and WARNING: modpost: missing MODULE_LICENSE() in arch/powerpc/platforms/85xx/t1042rdb_diu.o Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Scott Wood <oss@buserror.net> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-15powerpc: Fix size calculation using resource_size()Dan Carpenter1-1/+1
[ Upstream commit c42d3be0c06f0c1c416054022aa535c08a1f9b39 ] The problem is the the calculation should be "end - start + 1" but the plus one is missing in this calculation. Fixes: 8626816e905e ("powerpc: add support for MPIC message register API") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-15powerpc/uaccess: Enable get_user(u64, *p) on 32-bitMichael Ellerman1-3/+10
[ Upstream commit f7a6947cd49b7ff4e03f1b4f7e7b223003d752ca ] Currently if you build a 32-bit powerpc kernel and use get_user() to load a u64 value it will fail to build with eg: kernel/rseq.o: In function `rseq_get_rseq_cs': kernel/rseq.c:123: undefined reference to `__get_user_bad' This is hitting the check in __get_user_size() that makes sure the size we're copying doesn't exceed the size of the destination: #define __get_user_size(x, ptr, size, retval) do { retval = 0; __chk_user_ptr(ptr); if (size > sizeof(x)) (x) = __get_user_bad(); Which doesn't immediately make sense because the size of the destination is u64, but it's not really, because __get_user_check() etc. internally create an unsigned long and copy into that: #define __get_user_check(x, ptr, size) ({ long __gu_err = -EFAULT; unsigned long __gu_val = 0; The problem being that on 32-bit unsigned long is not big enough to hold a u64. We can fix this with a trick from hpa in the x86 code, we statically check the type of x and set the type of __gu_val to either unsigned long or unsigned long long. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-15powerpc/topology: Get topology for shared processors at bootSrikar Dronamraju3-10/+20
[ Upstream commit 2ea62630681027c455117aa471ea3ab8bb099ead ] On a shared LPAR, Phyp will not update the CPU associativity at boot time. Just after the boot system does recognize itself as a shared LPAR and trigger a request for correct CPU associativity. But by then the scheduler would have already created/destroyed its sched domains. This causes - Broken load balance across Nodes causing islands of cores. - Performance degradation esp if the system is lightly loaded - dmesg to wrongly report all CPUs to be in Node 0. - Messages in dmesg saying borken topology. - With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain"), can cause rcu stalls at boot up. The sched_domains_numa_masks table which is used to generate cpumasks is only created at boot time just before creating sched domains and never updated. Hence, its better to get the topology correct before the sched domains are created. For example on 64 core Power 8 shared LPAR, dmesg reports Brought up 512 CPUs Node 0 CPUs: 0-511 Node 1 CPUs: Node 2 CPUs: Node 3 CPUs: Node 4 CPUs: Node 5 CPUs: Node 6 CPUs: Node 7 CPUs: Node 8 CPUs: Node 9 CPUs: Node 10 CPUs: Node 11 CPUs: ... BUG: arch topology borken the DIE domain not a subset of the NUMA domain BUG: arch topology borken the DIE domain not a subset of the NUMA domain numactl/lscpu output will still be correct with cores spreading across all nodes: Socket(s): 64 NUMA node(s): 12 Model: 2.0 (pvr 004d 0200) Model name: POWER8 (architected), altivec supported Hypervisor vendor: pHyp Virtualization type: para L1d cache: 64K L1i cache: 32K NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471 NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479 NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487 NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495 NUMA node4 CPU(s): 208-215,304-311,400-407,496-503 NUMA node5 CPU(s): 168-175,264-271,360-367,456-463 NUMA node6 CPU(s): 128-135,224-231,320-327,416-423 NUMA node7 CPU(s): 136-143,232-239,328-335,424-431 NUMA node8 CPU(s): 216-223,312-319,408-415,504-511 NUMA node9 CPU(s): 144-151,240-247,336-343,432-439 NUMA node10 CPU(s): 152-159,248-255,344-351,440-447 NUMA node11 CPU(s): 160-167,256-263,352-359,448-455 Currently on this LPAR, the scheduler detects 2 levels of Numa and created numa sched domains for all CPUs, but it finds a single DIE domain consisting of all CPUs. Hence it deletes all numa sched domains. To address this, detect the shared processor and update topology soon after CPUs are setup so that correct topology is updated just before scheduler creates sched domain. With the fix, dmesg reports: numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471 numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479 numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487 numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495 numa: Node 4 CPUs: 208-215 304-311 400-407 496-503 numa: Node 5 CPUs: 168-175 264-271 360-367 456-463 numa: Node 6 CPUs: 128-135 224-231 320-327 416-423 numa: Node 7 CPUs: 136-143 232-239 328-335 424-431 numa: Node 8 CPUs: 216-223 312-319 408-415 504-511 numa: Node 9 CPUs: 144-151 240-247 336-343 432-439 numa: Node 10 CPUs: 152-159 248-255 344-351 440-447 numa: Node 11 CPUs: 160-167 256-263 352-359 448-455 and lscpu also reports: Socket(s): 64 NUMA node(s): 12 Model: 2.0 (pvr 004d 0200) Model name: POWER8 (architected), altivec supported Hypervisor vendor: pHyp Virtualization type: para L1d cache: 64K L1i cache: 32K NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471 NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479 NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487 NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495 NUMA node4 CPU(s): 208-215,304-311,400-407,496-503 NUMA node5 CPU(s): 168-175,264-271,360-367,456-463 NUMA node6 CPU(s): 128-135,224-231,320-327,416-423 NUMA node7 CPU(s): 136-143,232-239,328-335,424-431 NUMA node8 CPU(s): 216-223,312-319,408-415,504-511 NUMA node9 CPU(s): 144-151,240-247,336-343,432-439 NUMA node10 CPU(s): 152-159,248-255,344-351,440-447 NUMA node11 CPU(s): 160-167,256-263,352-359,448-455 Reported-by: Manjunatha H R <manjuhr1@in.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [mpe: Trim / format change log] Tested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pagesPaul Mackerras1-7/+10
commit 8cfbdbdc24815417a3ab35101ccf706b9a23ff17 upstream. Commit 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the pinned physical page", 2018-07-17) added some checks to ensure that guest DMA mappings don't attempt to map more than the guest is entitled to access. However, errors in the logic mean that legitimate guest requests to map pages for DMA are being denied in some situations. Specifically, if the first page of the range passed to mm_iommu_get() is mapped with a normal page, and subsequent pages are mapped with transparent huge pages, we end up with mem->pageshift == 0. That means that the page size checks in mm_iommu_ua_to_hpa() and mm_iommu_up_to_hpa_rm() will always fail for every page in that region, and thus the guest can never map any memory in that region for DMA, typically leading to a flood of error messages like this: qemu-system-ppc64: VFIO_MAP_DMA: -22 qemu-system-ppc64: vfio_dma_map(0x10005f47780, 0x800000000000000, 0x10000, 0x7fff63ff0000) = -22 (Invalid argument) The logic errors in mm_iommu_get() are: (a) use of 'ua' not 'ua + (i << PAGE_SHIFT)' in the find_linux_pte() call (meaning that find_linux_pte() returns the pte for the first address in the range, not the address we are currently up to); (b) use of 'pageshift' as the variable to receive the hugepage shift returned by find_linux_pte() - for a normal page this gets set to 0, leading to us setting mem->pageshift to 0 when we conclude that the pte returned by find_linux_pte() didn't match the page we were looking at; (c) comparing 'compshift', which is a page order, i.e. log base 2 of the number of pages, with 'pageshift', which is a log base 2 of the number of bytes. To fix these problems, this patch introduces 'cur_ua' to hold the current user address and uses that in the find_linux_pte() call; introduces 'pteshift' to hold the hugepage shift found by find_linux_pte(); and compares 'pteshift' with 'compshift + PAGE_SHIFT' rather than 'compshift'. The patch also moves the local_irq_restore to the point after the PTE pointer returned by find_linux_pte() has been dereferenced because otherwise the PTE could change underneath us, and adds a check to avoid doing the find_linux_pte() call once mem->pageshift has been reduced to PAGE_SHIFT, as an optimization. Fixes: 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the pinned physical page") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09powerpc/powernv/pci: Work around races in PCI bridge enablingBenjamin Herrenschmidt1-0/+37
commit db2173198b9513f7add8009f225afa1f1c79bcc6 upstream. The generic code is racy when multiple children of a PCI bridge try to enable it simultaneously. This leads to drivers trying to access a device through a not-yet-enabled bridge, and this EEH errors under various circumstances when using parallel driver probing. There is work going on to fix that properly in the PCI core but it will take some time. x86 gets away with it because (outside of hotplug), the BIOS enables all the bridges at boot time. This patch does the same thing on powernv by enabling all bridges that have child devices at boot time, thus avoiding subsequent races. It's suitable for backporting to stable and distros, while the proper PCI fix will probably be significantly more invasive. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09powerpc64/ftrace: Include ftrace.h needed for enable/disable callsLuke Dashjr1-0/+1
commit d6ee76d3d37d156c479348821574b6f99d6472a1 upstream. this_cpu_disable_ftrace and this_cpu_enable_ftrace are inlines in ftrace.h Without it included, the build fails. Fixes: a4bc64d305af ("powerpc64/ftrace: Disable ftrace during kvm entry/exit") Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org> Acked-by: Naveen N. Rao <naveen.n.rao at linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09powerpc/nohash: fix pte_access_permitted()Christophe Leroy1-6/+3
commit 810e9f86f36f59f1d6f6710220c49afe0c705f38 upstream. Commit 5769beaf180a8 ("powerpc/mm: Add proper pte access check helper for other platforms") replaced generic pte_access_permitted() by an arch specific one. The generic one is defined as (pte_present(pte) && (!(write) || pte_write(pte))) The arch specific one is open coded checking that _PAGE_USER and _PAGE_WRITE (_PAGE_RW) flags are set, but lacking to check that _PAGE_RO and _PAGE_PRIVILEGED are unset, leading to a useless test on targets like the 8xx which defines _PAGE_RW and _PAGE_USER as 0. Commit 5fa5b16be5b31 ("powerpc/mm/hugetlb: Use pte_access_permitted for hugetlb access check") replaced some tests performed with pte helpers by a call to pte_access_permitted(), leading to the same issue. This patch rewrites powerpc/nohash pte_access_permitted() using pte helpers. Fixes: 5769beaf180a8 ("powerpc/mm: Add proper pte access check helper for other platforms") Fixes: 5fa5b16be5b31 ("powerpc/mm/hugetlb: Use pte_access_permitted for hugetlb access check") Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09powerpc/pkeys: Preallocate execute-only keyRam Pai1-45/+18
commit a4fcc877d4e18b5efe26e93f08f0cfd4e278c7d9 upstream. execute-only key is allocated dynamically. This is a problem. When a thread implicitly creates an execute-only key, and resets the UAMOR for that key, the UAMOR value does not percolate to all the other threads. Any other thread may ignorantly change the permissions on the key. This can cause the key to be not execute-only for that thread. Preallocate the execute-only key and ensure that no thread can change the permission of the key, by resetting the corresponding bit in UAMOR. Fixes: 5586cf61e108 ("powerpc: introduce execute-only pkey") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09powerpc/pkeys: Fix calculation of total pkeys.Ram Pai1-1/+1
commit fe6a2804e65969a574377bdb3605afb79e6091a9 upstream. Total number of pkeys calculation is off by 1. Fix it. Fixes: 4fb158f65ac5 ("powerpc: track allocation status of all pkeys") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>