path: root/arch/powerpc/mm
2020-08-26  powerpc: Allow 4224 bytes of stack expansion for the signal frame  (Michael Ellerman; 1 file, -2/+5)

commit 63dee5df43a31f3844efabc58972f0a206ca4534 upstream.

We have powerpc specific logic in our page fault handling to decide if an access to an unmapped address below the stack pointer should expand the stack VMA. The code was originally added in 2004 "ported from 2.4".

The rough logic is that the stack is allowed to grow to 1MB with no extra checking. Over 1MB the access must be within 2048 bytes of the stack pointer, or be from a user instruction that updates the stack pointer. The 2048 byte allowance below the stack pointer is there to cover the 288 byte "red zone" as well as the "about 1.5kB" needed by the signal delivery code.

Unfortunately since then the signal frame has expanded, and is now 4224 bytes on 64-bit kernels with transactional memory enabled. This means if a process has consumed more than 1MB of stack, and its stack pointer lies less than 4224 bytes from the next page boundary, signal delivery will fault when trying to expand the stack and the process will see a SEGV.

The total size of the signal frame is the size of struct rt_sigframe (which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on 64-bit).

The 2048 byte allowance was correct until 2008 as the signal frame was:

    struct rt_sigframe {
        struct ucontext           uc;             /*     0  1440 */
        /* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
        long unsigned int         _unused[2];     /*  1440    16 */
        unsigned int              tramp[6];       /*  1456    24 */
        struct siginfo *          pinfo;          /*  1480     8 */
        void *                    puc;            /*  1488     8 */
        struct siginfo            info;           /*  1496   128 */
        /* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
        char                      abigap[288];    /*  1624   288 */

        /* size: 1920, cachelines: 15, members: 7 */
        /* padding: 8 */
    };

    1920 + 128 = 2048

Then in commit ce48b2100785 ("powerpc: Add VSX context save/restore, ptrace and signal support") (Jul 2008) the signal frame expanded to 2304 bytes:

    struct rt_sigframe {
        struct ucontext           uc;             /*     0  1696 */    <--
        /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
        long unsigned int         _unused[2];     /*  1696    16 */
        unsigned int              tramp[6];       /*  1712    24 */
        struct siginfo *          pinfo;          /*  1736     8 */
        void *                    puc;            /*  1744     8 */
        struct siginfo            info;           /*  1752   128 */
        /* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
        char                      abigap[288];    /*  1880   288 */

        /* size: 2176, cachelines: 17, members: 7 */
        /* padding: 8 */
    };

    2176 + 128 = 2304

At this point we should have been exposed to the bug, though as far as I know it was never reported. I no longer have a system old enough to easily test on.

Then in 2010 commit 320b2b8de126 ("mm: keep a guard page below a grow-down stack segment") caused our stack expansion code to never trigger, as there was always a VMA found for a write up to PAGE_SIZE below r1.

That meant the bug was hidden as we continued to expand the signal frame in commit 2b0a576d15e0 ("powerpc: Add new transactional memory state to the signal context") (Feb 2013):

    struct rt_sigframe {
        struct ucontext           uc;             /*     0  1696 */
        /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
        struct ucontext           uc_transact;    /*  1696  1696 */    <--
        /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
        long unsigned int         _unused[2];     /*  3392    16 */
        unsigned int              tramp[6];       /*  3408    24 */
        struct siginfo *          pinfo;          /*  3432     8 */
        void *                    puc;            /*  3440     8 */
        struct siginfo            info;           /*  3448   128 */
        /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
        char                      abigap[288];    /*  3576   288 */

        /* size: 3872, cachelines: 31, members: 8 */
        /* padding: 8 */
        /* last cacheline: 32 bytes */
    };

    3872 + 128 = 4000

And commit 573ebfa6601f ("powerpc: Increase stack redzone for 64-bit userspace to 512 bytes") (Feb 2014):

    struct rt_sigframe {
        struct ucontext           uc;             /*     0  1696 */
        /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
        struct ucontext           uc_transact;    /*  1696  1696 */
        /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
        long unsigned int         _unused[2];     /*  3392    16 */
        unsigned int              tramp[6];       /*  3408    24 */
        struct siginfo *          pinfo;          /*  3432     8 */
        void *                    puc;            /*  3440     8 */
        struct siginfo            info;           /*  3448   128 */
        /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
        char                      abigap[512];    /*  3576   512 */    <--

        /* size: 4096, cachelines: 32, members: 8 */
        /* padding: 8 */
    };

    4096 + 128 = 4224

Then finally in 2017, commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") exposed us to the existing bug, because it changed the stack VMA to be the correct/real size, meaning our stack expansion code is now triggered.

Fix it by increasing the allowance to 4224 bytes.

Hard-coding 4224 is obviously unsafe against future expansions of the signal frame in the same way as the existing code. We can't easily use sizeof() because the signal frame structure is not in a header. We will either fix that, or rip out all the custom stack expansion checking logic entirely.

Fixes: ce48b2100785 ("powerpc: Add VSX context save/restore, ptrace and signal support")
Cc: stable@vger.kernel.org # v2.6.27+
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724092528.1578671-2-mpe@ellerman.id.au
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
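For illustration, a minimal sketch of the adjusted heuristic; the function shape and names are assumed from the commit text, not quoted from the upstream diff:

    /* uregs is the userspace register state; gpr[1] is the stack pointer (r1). */
    static bool bad_stack_expansion(struct pt_regs *uregs, unsigned long address)
    {
            /* 4224 = sizeof(struct rt_sigframe) + __SIGNAL_FRAMESIZE on 64-bit */
            return address + 4224 < uregs->gpr[1];  /* was: address + 2048 */
    }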
2020-04-24  powerpc/fsl_booke: Avoid creating duplicate tlb1 entry  (Laurentiu Tudor; 1 file, -1/+11)
[ Upstream commit aa4113340ae6c2811e046f08c2bc21011d20a072 ] In the current implementation, the call to loadcam_multi() is wrapped between switch_to_as1() and restore_to_as0() calls so, when it tries to create its own temporary AS=1 TLB1 entry, it ends up duplicating the existing one created by switch_to_as1(). Add a check to skip creating the temporary entry if already running in AS=1. Fixes: d9e1831a4202 ("powerpc/85xx: Load all early TLB entries at once") Cc: stable@vger.kernel.org # v4.4+ Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com> Acked-by: Scott Wood <oss@buserror.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200123111914.2565-1-laurentiu.tudor@nxp.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-12  powerpc: Ensure that swiotlb buffer is allocated from low memory  (Mike Rapoport; 1 file, -0/+8)
[ Upstream commit 8fabc623238e68b3ac63c0dd1657bf86c1fa33af ] Some powerpc platforms (e.g. 85xx) limit DMA-able memory way below 4G. If a system has more physical memory than this limit, the swiotlb buffer is not addressable because it is allocated from memblock using top-down mode. Force memblock to bottom-up mode before calling swiotlb_init() to ensure that the swiotlb buffer is DMA-able. Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191204123524.22919-1-rppt@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
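The change as described amounts to a few lines in the powerpc memory init path; a minimal sketch, assuming the memblock/swiotlb APIs of that era:

    #ifdef CONFIG_SWIOTLB
            /*
             * Some platforms (e.g. 85xx) limit DMA-able memory well below 4G.
             * Switch memblock to bottom-up mode so the buffer allocated by
             * swiotlb_init() lands in DMA-able memory.
             */
            memblock_set_bottom_up(true);
            swiotlb_init(0);
    #endif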
2020-01-04  powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning  (Aneesh Kumar K.V; 1 file, -0/+1)

[ Upstream commit 16f6b67cf03cb43db7104acb2ca877bdc2606c92 ] With large memory (8TB and more) hotplug, we can get soft lockup warnings as below. These were caused by a long loop without any explicit cond_resched which is a problem for !PREEMPT kernels. Avoid this using cond_resched() while inserting hash page table entries. We already do similar cond_resched() in __add_pages(), see commit f64ac5e6e306 ("mm, memory_hotplug: add scheduling point to __add_pages").

    rcu: 3-....: (24002 ticks this GP) idle=13e/1/0x4000000000000002 softirq=722/722 fqs=12001
    (t=24003 jiffies g=4285 q=2002)
    NMI backtrace for cpu 3
    CPU: 3 PID: 3870 Comm: ndctl Not tainted 5.3.0-197.18-default+ #2
    Call Trace:
      dump_stack+0xb0/0xf4 (unreliable)
      nmi_cpu_backtrace+0x124/0x130
      nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
      arch_trigger_cpumask_backtrace+0x28/0x3c
      rcu_dump_cpu_stacks+0xf8/0x154
      rcu_sched_clock_irq+0x878/0xb40
      update_process_times+0x48/0x90
      tick_sched_handle.isra.16+0x4c/0x80
      tick_sched_timer+0x68/0xe0
      __hrtimer_run_queues+0x180/0x430
      hrtimer_interrupt+0x110/0x300
      timer_interrupt+0x108/0x2f0
      decrementer_common+0x114/0x120
      --- interrupt: 901 at arch_add_memory+0xc0/0x130
          LR = arch_add_memory+0x74/0x130
      memremap_pages+0x494/0x650
      devm_memremap_pages+0x3c/0xa0
      pmem_attach_disk+0x188/0x750
      nvdimm_bus_probe+0xac/0x2c0
      really_probe+0x148/0x570
      driver_probe_device+0x19c/0x1d0
      device_driver_attach+0xcc/0x100
      bind_store+0x134/0x1c0
      drv_attr_store+0x44/0x60
      sysfs_kf_write+0x64/0x90
      kernfs_fop_write+0x1a0/0x270
      __vfs_write+0x3c/0x70
      vfs_write+0xd0/0x260
      ksys_write+0xdc/0x130
      system_call+0x5c/0x68

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191001084656.31277-1-aneesh.kumar@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
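The shape of the fix, as a sketch (loop structure assumed, not the verbatim diff):

    for (vaddr = vstart; vaddr < vend; vaddr += step) {
            /* ... compute the vpn and insert the hash PTE for vaddr ... */
            cond_resched();  /* scheduling point for !PREEMPT kernels */
    }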
2020-01-04  powerpc/pseries: Don't fail hash page table insert for bolted mapping  (Aneesh Kumar K.V; 1 file, -1/+8)

[ Upstream commit 75838a3290cd4ebbd1f567f310ba04b6ef017ce4 ] If the hypervisor returned H_PTEG_FULL for H_ENTER hcall, retry a hash page table insert by removing a random entry from the group. After some runtime, it is very well possible to find all the 8 hash page table entry slots in the hpte group used for mapping. Don't fail a bolted entry insert in that case. With Storage Class Memory a user can find this error easily since a namespace enable/disable is equivalent to memory add/remove. This results in failures as reported below:

    $ ndctl create-namespace -r region1 -t pmem -m devdax -a 65536 -s 100M
    libndctl: ndctl_dax_enable: dax1.3: failed to enable
    Error: namespace1.2: failed to enable
    failed to create namespace: No such device or address

In the kernel log we find the details as below:

    Unable to create mapping for hot added memory 0xc000042006000000..0xc00004200d000000: -1
    dax_pmem: probe of dax1.3 failed with error -14

This indicates that we failed to create a bolted hash table entry for the direct-map address backing the namespace. We also observe failures such that not all namespaces will be enabled with the ndctl enable-namespace all command. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191024093542.29777-2-aneesh.kumar@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
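A sketch of the retry logic described above; helper and argument names are assumed from the commit text:

    slot = mmu_hash_ops.hpte_insert(hpte_group, vpn, paddr, tprot,
                                    HPTE_V_BOLTED, psize, psize, ssize);
    if (slot == -1) {
            /* The group is full; a bolted mapping must not fail, so
             * evict one entry from the group and try again. */
            mmu_hash_ops.hpte_remove(hpte_group);
            slot = mmu_hash_ops.hpte_insert(hpte_group, vpn, paddr, tprot,
                                            HPTE_V_BOLTED, psize, psize, ssize);
    }
    BUG_ON(slot == -1);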
2019-12-05  powerpc/mm: Make NULL pointer deferences explicit on bad page faults.  (Christophe Leroy; 1 file, -8/+9)

[ Upstream commit 49a502ea23bf9dec47f8f3c3960909ff409cd1bb ] Like several other arches, including x86, this patch makes it explicit that a bad page fault is a NULL pointer dereference when the fault address is lower than PAGE_SIZE. In the meantime, this patch makes all bad_page_fault() messages shorter so that they remain on one single line. And it prefixes them by "BUG: " so that they get easily grepped. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> [mpe: Avoid pr_cont()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
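A sketch of the reworked report in bad_page_fault(), with the wording taken from the commit description (surrounding context assumed):

    switch (TRAP(regs)) {
    case 0x300:
    case 0x380:
            pr_alert("BUG: %s at 0x%08lx\n",
                     regs->dar < PAGE_SIZE ? "Kernel NULL pointer dereference"
                                           : "Unable to handle kernel data access",
                     regs->dar);
            break;
    }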
2019-12-05  powerpc/book3s/32: fix number of bats in p/v_block_mapped()  (Christophe Leroy; 1 file, -2/+2)
[ Upstream commit e93ba1b7eb5b188c749052df7af1c90821c5f320 ] This patch fixes the loop in p_block_mapped() and v_block_mapped() to scan the entire bat_addrs[] array. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
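The fix is essentially a loop bound. A sketch of v_block_mapped() with the corrected bound (field names assumed):

    static phys_addr_t v_block_mapped(unsigned long va)
    {
            int b;

            for (b = 0; b < ARRAY_SIZE(bat_addrs); ++b)  /* was a fixed bound */
                    if (va >= bat_addrs[b].start && va < bat_addrs[b].limit)
                            return bat_addrs[b].phys + (va - bat_addrs[b].start);
            return 0;
    }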
2019-11-25  powerpc/64s/hash: Fix stab_rr off by one initialization  (Nicholas Piggin; 1 file, -1/+1)

[ Upstream commit 09b4438db13fa83b6219aee5993711a2aa2a0c64 ] This causes SLB allocation to start 1 beyond the start of the SLB. There is no real problem because after it wraps it starts behaving properly, it's just surprising to see when looking at SLB traces. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
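A sketch of the one-line change, assuming the round-robin pointer is pre-incremented before use (context assumed, not quoted from the diff):

    /* Start one below the first allocatable entry after the bolted ones. */
    get_paca()->stab_rr = SLB_NUM_BOLTED - 1;  /* was: SLB_NUM_BOLTED */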
2019-09-21  powerpc/mm/radix: Use the right page size for vmemmap mapping  (Aneesh Kumar K.V; 1 file, -9/+7)

commit 89a3496e0664577043666791ec07fb731d57c950 upstream. We use mmu_vmemmap_psize to find the page size for mapping the vmemmap area. With radix translation, we are suboptimally setting this value to PAGE_SIZE. We do check for 2M page size support and update mmu_vmemmap_psize to use hugepage size, but we suboptimally reset the value to PAGE_SIZE in radix__early_init_mmu(). This resulted in always mapping the vmemmap area with 64K page size. Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31  powerpc/numa: improve control of topology updates  (Nathan Lynch; 1 file, -6/+12)
[ Upstream commit 2d4d9b308f8f8dec68f6dbbff18c68ec7c6bd26f ] When booted with "topology_updates=no", or when "off" is written to /proc/powerpc/topology_updates, NUMA reassignments are inhibited for PRRN and VPHN events. However, migration and suspend unconditionally re-enable reassignments via start_topology_update(). This is incoherent. Check the topology_updates_enabled flag in start/stop_topology_update() so that callers of those APIs need not be aware of whether reassignments are enabled. This allows the administrative decision on reassignments to remain in force across migrations and suspensions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
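A sketch of the guard described above, moved inside the API itself (flag name from the commit text; return value assumed):

    int start_topology_update(void)
    {
            /* Honour the administrative setting even across migration/suspend. */
            if (!topology_updates_enabled)
                    return 0;

            /* ... arm the PRRN / VPHN update machinery as before ... */
            return 1;
    }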
2019-04-17  powerpc/fsl: Flush the branch predictor at each kernel entry (64bit)  (Diana Craciun; 1 file, -0/+7)

commit 10c5e83afd4a3f01712d97d3bb1ae34d5b74a185 upstream. In order to protect against speculation attacks on indirect branches, the branch predictor is flushed at kernel entry to protect against the following situations:

 - userspace process attacking another userspace process
 - userspace process attacking the kernel

Basically, when the privilege level changes (i.e. the kernel is entered), the branch predictor state is flushed. Signed-off-by: Diana Craciun <diana.craciun@nxp.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-17  powerpc: Avoid code patching freed init sections  (Michael Neuling; 1 file, -0/+2)

commit 51c3c62b58b357e8d35e4cc32f7b4ec907426fe3 upstream. This stops us from doing code patching in init sections after they've been freed. In this chain:

    kvm_guest_init() ->
      kvm_use_magic_page() ->
        fault_in_pages_readable() ->
          __get_user() ->
            __get_user_nocheck() ->
              barrier_nospec();

We have a code patching location at barrier_nospec() and kvm_guest_init() is an init function. This whole chain gets inlined, so when we free the init section (hence kvm_guest_init()), this code goes away and hence should no longer be patched. We have seen this as userspace memory corruption when using a memory checker while doing partition migration testing on powervm (this starts the code patching post migration via /sys/kernel/mobility/migration). In theory, it could also happen when using /sys/kernel/debug/powerpc/barrier_nospec. Cc: stable@vger.kernel.org # 4.13+ Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
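A sketch of the added guard (exact placement in patch_instruction() assumed):

    int patch_instruction(unsigned int *addr, unsigned int instr)
    {
            /* Never patch an instruction that lives in a freed init section. */
            if (init_mem_is_free && init_section_contains(addr, 4))
                    return 0;

            /* ... existing patching path ... */
    }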
2019-04-05  powerpc/pseries: Perform full re-add of CPU for topology update post-migration  (Nathan Fontenot; 1 file, -8/+1)

[ Upstream commit 81b61324922c67f73813d8a9c175f3c153f6a1c6 ] On pseries systems, performing a partition migration can result in altering the nodes a CPU is assigned to on the destination system. For example, pre-migration on the source system CPUs are in nodes 1 and 3, post-migration on the destination system CPUs are in nodes 2 and 3. Handling the node change for a CPU can cause corruption in the slab cache if we hit a timing where a CPU's node is changed while cache_reap() is invoked. The corruption occurs because the slab cache code appears to rely on the CPU and slab cache pages being on the same node. The current dynamic updating of a CPU's node done in arch/powerpc/mm/numa.c does not prevent us from hitting this scenario. Changing the device tree property update notification handler that recognizes an affinity change for a CPU to do a full DLPAR remove and add of the CPU instead of dynamically changing its node resolves this issue. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael W. Bringmann <mwb@linux.vnet.ibm.com> Tested-by: Michael W. Bringmann <mwb@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-01  powerpc/numa: Suppress "VPHN is not supported" messages  (Satheesh Rajendran; 1 file, -1/+1)

[ Upstream commit 437ccdc8ce629470babdda1a7086e2f477048cbd ] When the VPHN function is not supported, the kernel prints the message 'VPHN function not supported. Disabling polling...' during cpu hotplug events. Currently it prints on every hotplug event; this floods dmesg when a KVM guest tries to hotplug a huge number of vcpus. Print it once and suppress further kernel prints. Signed-off-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
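The change reduces to swapping the print for its once-only variant; a minimal sketch:

    /* Warn once rather than on every hotplug event. */
    printk_once(KERN_INFO "VPHN is not supported. Disabling polling...\n");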
2018-11-21  powerpc/nohash: fix undefined behaviour when testing page size support  (Daniel Axtens; 1 file, -0/+3)

[ Upstream commit f5e284803a7206d43e26f9ffcae5de9626d95e37 ] When enumerating page size definitions to check hardware support, we construct a constant which is (1U << (def->shift - 10)). However, the array of page size definitions is only initialised for various MMU_PAGE_* constants, so it contains a number of 0-initialised elements with def->shift == 0. This means we end up shifting by a very large number, which gives the following UBSan splat:

    ================================================================================
    UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21
    shift exponent 4294967286 is too large for 32-bit type 'unsigned int'
    CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-00045-ga604f927b012-dirty #6
    Call Trace:
    [c00000000101bc20] [c000000000a13d54] .dump_stack+0xa8/0xec (unreliable)
    [c00000000101bcb0] [c0000000004f20a8] .ubsan_epilogue+0x18/0x64
    [c00000000101bd30] [c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4
    [c00000000101be20] [c000000000d21760] .early_init_mmu+0x1b4/0x5a0
    [c00000000101bf10] [c000000000d1ba28] .early_setup+0x100/0x130
    [c00000000101bf90] [c000000000000528] start_here_multiplatform+0x68/0x80
    ================================================================================

Fix this by first checking if the element exists (shift != 0) before constructing the constant. Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
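A sketch of the guarded loop (the hardware-support bitmap and flag names are assumptions based on the description):

    for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
            struct mmu_psize_def *def = &mmu_psize_defs[psize];

            if (!def->shift)  /* skip 0-initialised entries: the shift would be UB */
                    continue;
            if (tlb1ps & (1U << (def->shift - 10)))
                    def->flags |= MMU_PAGE_SIZE_DIRECT;
    }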
2018-08-03  powerpc/64s: Fix compiler store ordering to SLB shadow area  (Nicholas Piggin; 1 file, -4/+4)
[ Upstream commit 926bc2f100c24d4842b3064b5af44ae964c1d81c ] The stores to update the SLB shadow area must be made as they appear in the C code, so that the hypervisor does not see an entry with mismatched vsid and esid. Use WRITE_ONCE for this. GCC has been observed to elide the first store to esid in the update, which means that if the hypervisor interrupts the guest after storing to vsid, it could see an entry with old esid and new vsid, which may possibly result in memory corruption. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
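A sketch of the ordered update (field and helper names assumed from the description):

    /* WRITE_ONCE keeps the three stores in program order, so the
     * hypervisor never sees a mismatched esid/vsid pair. */
    WRITE_ONCE(p->save_area[index].esid, 0);
    WRITE_ONCE(p->save_area[index].vsid, cpu_to_be64(mk_vsid_data(ea, ssize, flags)));
    WRITE_ONCE(p->save_area[index].esid, cpu_to_be64(mk_esid_data(ea, ssize, index)));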
2018-05-30  powerpc/numa: Ensure nodes initialized for hotplug  (Michael Bringmann; 1 file, -10/+37)

[ Upstream commit ea05ba7c559c8e5a5946c3a94a2a266e9a6680a6 ] This patch fixes some problems encountered at runtime with configurations that support memory-less nodes, or that hot-add CPUs into nodes that are memoryless during system execution after boot. The problems of interest include:

 * Nodes known to powerpc to be memoryless at boot, but to have CPUs in them, are allowed to be 'possible' and 'online'. Memory allocations for those nodes are taken from another node that does have memory until and if memory is hot-added to the node.

 * Nodes which have no resources assigned at boot, but which may still be referenced subsequently by affinity or associativity attributes, are kept in the list of 'possible' nodes for powerpc. Hot-add of memory or CPUs to the system can reference these nodes and bring them online instead of redirecting the references to one of the set of nodes known to have memory at boot.

Note that this software operates under the context of CPU hotplug. We are not doing memory hotplug in this code, but rather updating the kernel's CPU topology (i.e. arch_update_cpu_topology / numa_update_cpu_topology). We are initializing a node that may be used by CPUs or memory before it can be referenced as invalid by a CPU hotplug operation. CPU hotplug operations are protected by a range of APIs including cpu_maps_update_begin/cpu_maps_update_done, cpus_read/write_lock / cpus_read/write_unlock, device locks, and more. Memory hotplug operations, including try_online_node, are protected by mem_hotplug_begin/mem_hotplug_done, device locks, and more. In the case of CPUs being hot-added to a previously memoryless node, the try_online_node operation occurs wholly within the CPU locks with no overlap. Using HMC hot-add/hot-remove operations, we have been able to add and remove CPUs to any possible node without failures. HMC operations involve a degree of self-serialization, though. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30  powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes  (Michael Bringmann; 1 file, -3/+34)

[ Upstream commit a346137e9142b039fd13af2e59696e3d40c487ef ] On powerpc systems which allow 'hot-add' of CPU or memory resources, it may occur that the new resources are to be inserted into nodes that were not used for these resources at bootup. In the kernel, any node that is used must be defined and initialized. These empty nodes may occur when:

 * Dedicated vs. shared resources. Shared resources require information such as the VPHN hcall for CPU assignment to nodes. Associativity decisions made based on dedicated resource rules, such as associativity properties in the device tree, may vary from decisions made using the values returned by the VPHN hcall.

 * memoryless nodes at boot. Nodes need to be defined as 'possible' at boot for operation with other code modules. Previously, the powerpc code would limit the set of possible nodes to those which have memory assigned at boot, and were thus online. Subsequent add/remove of CPUs or memory would only work with this subset of possible nodes.

 * memoryless nodes with CPUs at boot. Due to the previous restriction on nodes, nodes that had CPUs but no memory were being collapsed into other nodes that did have memory at boot. In practice this meant that the node assignment presented by the runtime kernel differed from the affinity and associativity attributes presented by the device tree or VPHN hcalls. Nodes that might be known to the pHyp were not 'possible' in the runtime kernel because they did not have memory at boot.

This patch ensures that sufficient nodes are defined to support configuration requirements after boot, as well as at boot. This patch set fixes a couple of problems:

 * Nodes known to powerpc to be memoryless at boot, but to have CPUs in them, are allowed to be 'possible' and 'online'. Memory allocations for those nodes are taken from another node that does have memory until and if memory is hot-added to the node.

 * Nodes which have no resources assigned at boot, but which may still be referenced subsequently by affinity or associativity attributes, are kept in the list of 'possible' nodes for powerpc. Hot-add of memory or CPUs to the system can reference these nodes and bring them online instead of redirecting to one of the set of nodes that were known to have memory at boot.

This patch extracts the value of the lowest domain level (number of allocable resources) from the device tree property "ibm,max-associativity-domains" to use as the maximum number of nodes to setup as possibly available in the system. This new setting will override the instruction:

    nodes_and(node_possible_map, node_possible_map, node_online_map);

presently seen in the function arch/powerpc/mm/numa.c:initmem_init(). If the "ibm,max-associativity-domains" property is not present at boot, no operation will be performed to define or enable additional nodes, or enable the above 'nodes_and()'. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22  powerpc/nohash: Fix use of mmu_has_feature() in setup_initial_memory_limit()  (Michael Ellerman; 1 file, -1/+1)

[ Upstream commit 4868e3508d1934d28961f940ed6b9f1e347ab52c ] setup_initial_memory_limit() is called from early_init_devtree(), which runs prior to feature patching. If the kernel is built with CONFIG_JUMP_LABEL=y and CONFIG_JUMP_LABEL_FEATURE_CHECKS=y then we will potentially get the wrong value. If we also have CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG=y we get a warning and backtrace:

    Warning! mmu_has_feature() used prior to jump label init!
    CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc4-gccN-next-20170331-g6af2434 #1
    Call Trace:
    [c000000000fc3d50] [c000000000a26c30] .dump_stack+0xa8/0xe8 (unreliable)
    [c000000000fc3de0] [c00000000002e6b8] .setup_initial_memory_limit+0xa4/0x104
    [c000000000fc3e60] [c000000000d5c23c] .early_init_devtree+0xd0/0x2f8
    [c000000000fc3f00] [c000000000d5d3b0] .early_setup+0x90/0x11c
    [c000000000fc3f90] [c000000000000520] start_here_multiplatform+0x68/0x80

Fix it by using early_mmu_has_feature(). Fixes: c12e6f24d413 ("powerpc: Add option to use jump label for mmu_has_feature()") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
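A sketch of the one-line substitution; the surrounding RMA sizing shown here is assumed context, not quoted from the diff:

    /* The early variant is safe before jump-label/feature patching. */
    if (early_mmu_has_feature(MMU_FTR_TYPE_FSL_E))  /* was: mmu_has_feature() */
            ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);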
2018-03-22  powerpc: Avoid taking a data miss on every userspace instruction miss  (Anton Blanchard; 1 file, -1/+1)

[ Upstream commit a7a9dcd882a67b68568868b988289fce5ffd8419 ] Early on in do_page_fault() we call store_updates_sp(), regardless of the type of exception. For an instruction miss this doesn't make sense, because we only use this information to detect if a data miss is the result of a stack expansion instruction or not. Worse still, it results in a data miss within every userspace instruction miss handler, because we try and load the very instruction we are about to install a pte for! A simple exec microbenchmark runs 6% faster on POWER8 with this fix:

    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            unsigned long left = atol(argv[1]);
            char leftstr[16];

            if (left-- == 0)
                    return 0;

            sprintf(leftstr, "%ld", left);
            execlp(argv[0], argv[0], leftstr, NULL);
            perror("exec failed\n");

            return 0;
    }

Pass the number of iterations on the command line (eg 10000) and time how long it takes to execute. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22  powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout  (Aneesh Kumar K.V; 1 file, -0/+18)

[ Upstream commit a525108cf1cc14651602d678da38fa627a76a724 ] Without this, if firmware reports 1MB page size support we will crash trying to use 1MB as hugetlb page size:

    echo 300 > /sys/kernel/mm/hugepages/hugepages-1024kB/nr_hugepages

    kernel BUG at ./arch/powerpc/include/asm/hugetlb.h:19!
    .....
    ....
    [c0000000e2c27b30] c00000000029dae8 .hugetlb_fault+0x638/0xda0
    [c0000000e2c27c30] c00000000026fb64 .handle_mm_fault+0x844/0x1d70
    [c0000000e2c27d70] c00000000004805c .do_page_fault+0x3dc/0x7c0
    [c0000000e2c27e30] c00000000000ac98 handle_page_fault+0x10/0x30

With the fix, we don't enable 1MB as hugepage size:

    bash-4.2# cd /sys/kernel/mm/hugepages/
    bash-4.2# ls
    hugepages-16384kB  hugepages-16777216kB

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14  powerpc/64: Invalidate process table caching after setting process table  (Paul Mackerras; 1 file, -0/+4)
[ Upstream commit 7a70d7288c926ae88e0c773fbb506aa374e99c2d ] The POWER9 MMU reads and caches entries from the process table. When we kexec from one kernel to another, the second kernel sets its process table pointer but doesn't currently do anything to make the CPU invalidate any cached entries from the old process table. This adds a tlbie (TLB invalidate entry) instruction with parameters to invalidate caching of the process table after the new process table is installed. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-10  powerpc/mm: Fix memory hotplug BUG() on radix  (Reza Arbab; 2 files, -2/+20)

[ Upstream commit 32b53c012e0bfe20b2745962a89db0dc72ef3270 ] Memory hotplug is leading to hash page table calls, even on radix:

    arch_add_memory
      create_section_mapping
        htab_bolt_mapping
          BUG_ON(!ppc_md.hpte_insert);

To fix, refactor {create,remove}_section_mapping() into hash__ and radix__ variants. Leave the radix versions stubbed for now. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-08  powerpc/64: Don't try to use radix MMU under a hypervisor  (Paul Mackerras; 1 file, -0/+33)
[ Upstream commit 18569c1f134e1c5c88228f043c09678ae6052b7c ] Currently, if the kernel is running on a POWER9 processor under a hypervisor, it will try to use the radix MMU even though it doesn't have the necessary code to use radix under a hypervisor (it doesn't negotiate use of radix, and it doesn't do the H_REGISTER_PROC_TBL hcall). The result is that the guest kernel will crash when it tries to turn on the MMU. This fixes it by looking for the /chosen/ibm,architecture-vec-5 property, and if it exists, clears the radix MMU feature bit, before we decide whether to initialize for radix or HPT. This property is created by the hypervisor as a result of the guest calling the ibm,client-architecture-support method to indicate its capabilities, so it will indicate whether the hypervisor agreed to us using radix. Systems without a hypervisor may have this property also (for example, skiboot creates it), so we check the HV bit in the MSR to see whether we are running as a guest or not. If we are in hypervisor mode, then we can do whatever we like including using the radix MMU. The reason for using this property is that in future, when we have support for using radix under a hypervisor, we will need to check this property to see whether the hypervisor agreed to us using radix. Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-28  powerpc/mm/radix: Properly clear process table entry  (Benjamin Herrenschmidt; 1 file, -3/+9)
commit c6bb0b8d426a8cf865ca9c8a532cc3a2927cfceb upstream. On radix, the process table entry we want to clear when destroying a context is entry 0, not entry 1. This has no *immediate* consequence on Power9, but it can cause other bugs to become worse. Fixes: 7e381c0ff618 ("powerpc/mm/radix: Add mmu context handling callback for radix") Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-24  mm: larger stack guard gap, between vmas  (Hugh Dickins; 3 files, -4/+4)

commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream. Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs like gid_t buffer[NGROUPS_MAX] which is 256kB, or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunately. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [wt: backport to 4.11: adjust context] [wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide] Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
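A minimal sketch of the vm_start_gap() helper as described above (the default 1MB gap, in page units, lives in stack_guard_gap):

    static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
    {
            unsigned long vm_start = vma->vm_start;

            if (vma->vm_flags & VM_GROWSDOWN) {
                    vm_start -= stack_guard_gap;
                    if (vm_start > vma->vm_start)  /* guard against underflow */
                            vm_start = 0;
            }
            return vm_start;
    }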
2017-05-25  powerpc/iommu: Do not call PageTransHuge() on tail pages  (Alexey Kardashevskiy; 1 file, -2/+2)

commit e889e96e98e8da97bd39e46b7253615eabe14397 upstream. The CMA pages migration code does not support compound pages at the moment so it performs a few tests before proceeding to actual page migration. One of the tests - PageTransHuge() - has VM_BUG_ON_PAGE(PageTail()) as it is designed to be called on head pages only. Since we also test for PageCompound(), and it contains PageTail() and PageHead(), we can simplify the check by leaving just PageCompound() and therefore avoid a possible VM_BUG_ON_PAGE. Fixes: 2e5bbb5461f1 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-12  powerpc/mm: Add missing global TLB invalidate if cxl is active  (Frederic Barrat; 1 file, -2/+5)
commit 88b1bf7268f56887ca88eb09c6fb0f4fc970121a upstream. Commit 4c6d9acce1f4 ("powerpc/mm: Add hooks for cxl") converted local TLB invalidates to global if the cxl driver is active. This is necessary because the CAPP snoops invalidations to forward them to the PSL on the cxl adapter. However one path was forgotten. native_flush_hash_range() still does local TLB invalidates, as found out the hard way recently. This patch fixes it by following the same logic as previously: if the cxl driver is active, the local TLB invalidates are 'upgraded' to global. Fixes: 4c6d9acce1f4 ("powerpc/mm: Add hooks for cxl") Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
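A sketch of the upgraded condition in the batched flush path (surrounding condition assumed from similar flush paths):

    /* tlbiel is only safe when no cxl contexts need CAPP snooping. */
    if (mmu_has_feature(MMU_FTR_TLBIEL) &&
        mmu_psize_defs[psize].tlbiel && local && !cxl_ctx_in_use()) {
            /* ... batched local invalidates ... */
    }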
2017-03-22  powerpc/mm: Fix build break when CMA=n && SPAPR_TCE_IOMMU=y  (Michael Ellerman; 1 file, -1/+1)

[ Upstream commit a05ef161cdd22faccffe06f21fc8f1e249565385 ] Currently the build breaks if CMA=n and SPAPR_TCE_IOMMU=y:

    arch/powerpc/mm/mmu_context_iommu.c: In function ‘mm_iommu_get’:
    arch/powerpc/mm/mmu_context_iommu.c:193:42: error: ‘MIGRATE_CMA’ undeclared (first use in this function)
       if (get_pageblock_migratetype(page) == MIGRATE_CMA) {
                                              ^~~~~~~~~~~

Fix it by using the existing is_migrate_cma_page(), which evaluates to false when CMA=n. Fixes: 2e5bbb5461f1 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22  powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown  (Alexey Kardashevskiy; 2 files, -14/+1)

[ Upstream commit 4b6fad7097f883335b6d9627c883cb7f276d94c9 ] At the moment the userspace tool is expected to request pinning of the entire guest RAM when the VFIO IOMMU SPAPR v2 driver is present. When the userspace process finishes, all the pinned pages need to be put; this is done as a part of the userspace memory context (MM) destruction which happens on the very last mmdrop(). This approach has a problem: the MM of the userspace process may live longer than the userspace process itself, as kernel threads use the MM of whatever userspace process was last running on the CPU where the kernel thread is scheduled. If this happens, the MM remains referenced until that exact kernel thread wakes up again and releases the very last reference to the MM; on an idle system this can take even hours. This moves preregistered regions tracking from MM to VFIO; instead of using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is added so each container releases regions which it has pre-registered. This changes the userspace interface to return EBUSY if a memory region is already registered in a container. However it should not have any practical effect as the only userspace tool available now does register memory region once per container anyway. As tce_iommu_register_pages/tce_iommu_unregister_pages are called under container->lock, this does not need additional locking. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22  powerpc/iommu: Stop using @current in mm_iommu_xxx  (Alexey Kardashevskiy; 1 file, -29/+17)
[ Upstream commit d7baee6901b34c4895eb78efdbf13a49079d7404 ] This changes mm_iommu_xxx helpers to take mm_struct as a parameter instead of getting it from @current which in some situations may not have a valid reference to mm. This changes helpers to receive @mm and moves all references to @current to the caller, including checks for !current and !current->mm; checks in mm_iommu_preregistered() are removed as there is no caller yet. This moves the mm_iommu_adjust_locked_vm() call to the caller as it receives mm_iommu_table_group_mem_t but it needs mm. This should cause no behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22  powerpc/iommu: Pass mm_struct to init/cleanup helpers  (Alexey Kardashevskiy; 2 files, -6/+7)

[ Upstream commit 88f54a3581eb9deaa3bd1aade40aef266d782385 ] We are going to get rid of @current references in mmu_context_book3s64.c and cache mm_struct in the VFIO container. Since mm_context_t does not have reference counting, we will be using mm_struct, which does have the reference counter. This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather than mm_context_t (which is embedded into mm). This should not cause any behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-23  powerpc/64: Disable use of radix under a hypervisor  (Paul Mackerras; 1 file, -1/+2)
commit 3f91a89d424a79f8082525db5a375e438887bb3e upstream. Currently, if the kernel is running on a POWER9 processor under a hypervisor, it may try to use the radix MMU even though it doesn't have the necessary code to do so (it doesn't negotiate use of radix, and it doesn't do the H_REGISTER_PROC_TBL hcall). If the hypervisor supports both radix and HPT, then it will set up the guest to use HPT (since the guest doesn't request radix in the CAS call), but if the radix feature bit is set in the ibm,pa-features property (which is valid, since ibm,pa-features is defined to represent the capabilities of the processor) the guest will try to use radix, resulting in a crash when it turns the MMU on. This makes the minimal fix for the current code, which is to disable radix unless we are running in hypervisor mode. Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-15  powerpc/mm/radix: Update ERAT flushes when invalidating TLB  (Benjamin Herrenschmidt; 1 file, -5/+1)

commit 90c1e3c2fafec57fcb55b5d69bcf293b1a5fc8b3 upstream. Three tiny changes to the ERAT flushing logic: First don't make it depend on DD1. It hasn't been decided yet, but we might run DD2 in a mode that also requires explicit flushes for performance reasons, so make it unconditional. We also add a missing isync, and finally remove the flush from _tlbiel_va as it is only necessary for congruence-class invalidations (PID, LPID and full TLB), not targeted invalidations. Fixes: 96ed1fe511a8 ("powerpc/mm/radix: Invalidate ERAT on tlbiel for POWER9 DD1") Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-09  powerpc/mm: Use the correct pointer when setting a 2MB pte  (Reza Arbab; 1 file, -2/+2)

commit a0615a16f7d0ceb5804d295203c302d496d8ee91 upstream. When setting a 2MB pte, radix__map_kernel_page() is using the address

    ptep = (pte_t *)pudp;

Fix this conversion to use pmdp instead. Use pmdp_ptep() to do this instead of casting the pointer. Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines") Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
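A sketch of the corrected conversion for the 2MB case (surrounding control flow assumed):

    if (map_page_size == PMD_SIZE) {
            ptep = pmdp_ptep(pmdp);  /* was: ptep = (pte_t *)pudp; */
            goto set_the_pte;
    }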
2017-01-19  powerpc/mm: Correct process and partition table max size  (Suraj Jitindar Singh; 1 file, -2/+2)

commit 555c16328ae6d75a90e234eac9b51998d68f185b upstream. Version 3.00 of the ISA states that the PATS (partition table size) field of the PTCR (partition table control register) and the PRTS (process table size) field of the partition table entry must both be less than or equal to 24. However the actual size of the partition and process tables is equal to 2 to the power of 12 plus the PATS and PRTS fields, respectively. This means that the max allowable size of each of these tables is 2^36 or 64GB for both. Thus when checking the size shift for each we should be checking for values greater than 36 instead of the current check for shifts larger than 24 and 23. Fixes: 2bfd65e45e877fb5704730244da67c748d28a1b8 Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19  powerpc/64: Simplify adaptation to new ISA v3.00 HPTE format  (Paul Mackerras; 1 file, -6/+24)
commit 6b243fcfb5f1e16bcf732e6f86a63f8af5b59a9f upstream. This changes the way that we support the new ISA v3.00 HPTE format. Instead of adapting everything that uses HPTE values to handle either the old format or the new format, depending on which CPU we are on, we now convert explicitly between old and new formats if necessary in the low-level routines that actually access HPTEs in memory. This limits the amount of code that needs to know about the new format and makes the conversions explicit. This is OK because the old format contains all the information that is in the new format. This also fixes operation under a hypervisor, because the H_ENTER hypercall (and other hypercalls that deal with HPTEs) will continue to require the HPTE value to be supplied in the old format. At present the kernel will not boot in HPT mode on POWER9 under a hypervisor. This fixes and partially reverts commit 50de596de8be ("powerpc/mm/hash: Add support for Power9 Hash", 2016-04-29). Fixes: 50de596de8be ("powerpc/mm/hash: Add support for Power9 Hash") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-29  powerpc/mm: Fix lazy icache flush on pre-POWER5  (Benjamin Herrenschmidt; 2 files, -3/+3)

On 64-bit CPUs with no-execute support and non-snooping icache, such as 970 or POWER4, we have a software mechanism to ensure coherency of the cache (using exec faults when needed). This was broken due to a logic error when the code was rewritten from assembly to C. Previously the assembly code did:

    BEGIN_FTR_SECTION
            mr      r4,r30
            mr      r5,r7
            bl      hash_page_do_lazy_icache
    END_FTR_SECTION(CPU_FTR_NOEXECUTE|CPU_FTR_COHERENT_ICACHE, CPU_FTR_NOEXECUTE)

Which tests that:

    (cpu_features & (NOEXECUTE | COHERENT_ICACHE)) == NOEXECUTE

Which says that the current cpu does have NOEXECUTE, but does not have COHERENT_ICACHE. Fixes: 91f1da99792a ("powerpc/mm: Convert 4k hash insert to C") Fixes: 89ff725051d1 ("powerpc/mm: Convert __hash_page_64K to C") Fixes: a43c0eb8364c ("powerpc/mm: Convert 4k insert from asm to C") Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Change log verbosification] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25  powerpc/mm: Fixup kernel read only mapping  (Aneesh Kumar K.V; 1 file, -2/+6)

With commit e58e87adc8bf9 ("powerpc/mm: Update _PAGE_KERNEL_RO") we started using the ppp value 0b110 to map kernel readonly. But that facility was only added as part of ISA 2.04. For earlier ISA versions the only supported ppp bit value for a readonly mapping is 0b011. (This implies both user and kernel get mapped using the same ppp bit value for readonly mapping.) Update the code such that for earlier architecture versions we use ppp value 0b011 for readonly mapping. We don't differentiate between power5+ and power5 here and apply the new ppp bits only from power6 (ISA 2.05). This keeps the changes minimal. This fixes an issue with PS3 spu usage reported at https://lkml.kernel.org/r/rep.1421449714.geoff@infradead.org Fixes: e58e87adc8bf9 ("powerpc/mm: Update _PAGE_KERNEL_RO") Cc: stable@vger.kernel.org # v4.7+ Tested-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-18  powerpc/mm: Fix missing update of HID register on secondary CPUs  (Aneesh Kumar K.V; 2 files, -0/+8)

We need to update the HID register on secondary CPUs as well, for the selected MMU mode. Fixes: ad410674f560 ("powerpc/mm: Update the HID bit when switching from radix to hash") Reported-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-18  powerpc/mm/radix: Invalidate ERAT on tlbiel for POWER9 DD1  (Michael Neuling; 1 file, -0/+4)
On POWER9 DD1, when we do a local TLB invalidate we also need to explicitly invalidate the ERAT. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-27  powerpc/mm/radix: Use tlbiel only if we ever ran on the current cpu  (Aneesh Kumar K.V; 1 file, -4/+4)

Before this patch, we used tlbiel if we had only ever run on this core. That was mostly derived from the nohash usage of the same. But that is incorrect: ISA 3.0 clarifies tlbiel such that "All TLB entries that have all of the following properties are made invalid on the thread executing the tlbiel instruction", ie. tlbiel only invalidates TLB entries on the current thread. So if the mm has been used on any other thread (aka. cpu) then we must broadcast the invalidate. This bug could lead to invalid TLB entries if a program runs on multiple threads of a core. Hence use tlbiel only if we have only ever run on the current cpu. Fixes: 1a472c9dba6b ("powerpc/mm/radix: Add tlbflush routines") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-19  powerpc: Fix numa topology console print  (Aneesh Kumar K.V; 1 file, -5/+5)

With recent update to printk, we get console output like below:

    [    0.550639] Brought up 160 CPUs
    [    0.550718]  Node 0 CPUs:
    [    0.550721]  0
    [    0.550754] -39
    [    0.550794]  Node 1 CPUs:
    [    0.550798]  40
    [    0.550817] -79
    [    0.550856]  Node 16 CPUs:
    [    0.550860]  80
    [    0.550880] -119
    [    0.550917]  Node 17 CPUs:
    [    0.550923]  120
    [    0.550942] -159

Fix this by properly using pr_cont(), ie. KERN_CONT. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-19  powerpc/mm: Drop dump_numa_memory_topology()  (Michael Ellerman; 1 file, -36/+0)

At boot we dump the NUMA memory topology in dump_numa_memory_topology(), at KERN_DEBUG level, resulting in output like:

    Node 0 Memory: 0x0-0x100000000
    Node 1 Memory: 0x100000000-0x200000000

Which is nice enough, but immediately after that we iterate over each node and call setup_node_data(), which also prints out the node ranges, at KERN_INFO, giving eg:

    numa: Initmem setup node 0 [mem 0x00000000-0xffffffff]
    numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff]

Additionally dump_numa_memory_topology() does not use KERN_CONT correctly, resulting in split output lines on recent kernels. So drop dump_numa_memory_topology() as superfluous chatter. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-19  powerpc/mm: Prevent unlikely crash in copro_calculate_slb()  (Frederic Barrat; 1 file, -0/+2)
If a cxl adapter faults on an invalid address for a kernel context, we may enter copro_calculate_slb() with a NULL mm pointer (kernel context) and an effective address which looks like a user address. Which will cause a crash when dereferencing mm. It is clearly an AFU bug, but there's no reason to crash either. So return an error, so that cxl can ack the interrupt with an address error. Fixes: 73d16a6e0e51 ("powerpc/cell: Move data segment faulting code out of cell platform") Cc: stable@vger.kernel.org # v3.18+ Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-15  Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild  (Linus Torvalds; 1 file, -0/+3)

Pull kbuild updates from Michal Marek:

 - EXPORT_SYMBOL for asm source by Al Viro.

   This does bring a regression, because genksyms no longer generates checksums for these symbols (CONFIG_MODVERSIONS). Nick Piggin is working on a patch to fix this.

   Plus, we are talking about functions like strcpy(), which rarely change prototypes.

 - Fixes for PPC fallout of the above by Stephen Rothwell and Nick Piggin

 - fixdep speedup by Alexey Dobriyan.

 - preparatory work by Nick Piggin to allow architectures to build with -ffunction-sections, -fdata-sections and --gc-sections

 - CONFIG_THIN_ARCHIVES support by Stephen Rothwell

 - fix for filenames with colons in the initramfs source by me.

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (22 commits)
  initramfs: Escape colons in depfile
  ppc: there is no clear_pages to export
  powerpc/64: whitelist unresolved modversions CRCs
  kbuild: -ffunction-sections fix for archs with conflicting sections
  kbuild: add arch specific post-link Makefile
  kbuild: allow archs to select link dead code/data elimination
  kbuild: allow architectures to use thin archives instead of ld -r
  kbuild: Regenerate genksyms lexer
  kbuild: genksyms fix for typeof handling
  fixdep: faster CONFIG_ search
  ia64: move exports to definitions
  sparc32: debride memcpy.S a bit
  [sparc] unify 32bit and 64bit string.h
  sparc: move exports to definitions
  ppc: move exports to definitions
  arm: move exports to definitions
  s390: move exports to definitions
  m68k: move exports to definitions
  alpha: move exports to actual definitions
  x86: move exports to actual definitions
  ...
2016-10-14  Merge tag 'powerpc-4.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux  (Linus Torvalds; 1 file, -1/+1)

Pull more powerpc updates from Michael Ellerman:
 "Some more powerpc updates for 4.9:

  Freescale updates from Scott Wood:
   - qbman support (a prerequisite for datapath drivers such as ethernet)
   - a PCI DMA fix+improvement
   - reset handler changes
   - more 8xx optimizations
   - some cleanups and fixes.

  Fixes:
   - selftests/powerpc: Add missing binaries to .gitignores (Michael Ellerman)
   - selftests/powerpc: Fix build break caused by EXPORT_SYMBOL changes (Michael Ellerman)
   - powerpc/pseries: Fix stack corruption in htpe code (Laurent Dufour)
   - powerpc/64s: Fix power4_fixup_nap placement (Nicholas Piggin)
   - powerpc/64: Fix incorrect return value from __copy_tofrom_user (Paul Mackerras)
   - powerpc/mm/hash64: Fix might_have_hea() check (Michael Ellerman)

  Other:
   - MAINTAINERS: Remove myself from PA Semi entries (Olof Johansson)
   - MAINTAINERS: Drop separate pseries entry (Michael Ellerman)
   - MAINTAINERS: Update powerpc website & add selftests (Michael Ellerman)"

* tag 'powerpc-4.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (35 commits)
  powerpc/mm/hash64: Fix might_have_hea() check
  powerpc/64: Fix incorrect return value from __copy_tofrom_user
  powerpc/64s: Fix power4_fixup_nap placement
  powerpc/pseries: Fix stack corruption in htpe code
  selftests/powerpc: Fix build break caused by EXPORT_SYMBOL changes
  MAINTAINERS: Update powerpc website & add selftests
  MAINTAINERS: Drop separate pseries entry
  MAINTAINERS: Remove myself from PA Semi entries
  selftests/powerpc: Add missing binaries to .gitignores
  arch/powerpc: Add CONFIG_FSL_DPAA to corenetXX_smp_defconfig
  soc/qman: Add self-test for QMan driver
  soc/bman: Add self-test for BMan driver
  soc/fsl: Introduce DPAA 1.x QMan device driver
  soc/fsl: Introduce DPAA 1.x BMan device driver
  powerpc/8xx: make user addr DTLB miss the short path
  powerpc/8xx: Move additional DTLBMiss handlers out of exception area
  powerpc/8xx: use r3 to scratch CR in ITLBmiss
  soc/fsl/qe: fix gpio save_regs functions
  powerpc/8xx: add dedicated machine check handler
  powerpc/8xx: add system_reset_exception
  ...
2016-10-12powerpc/mm/hash64: Fix might_have_hea() checkMichael Ellerman1-1/+1
In commit 2b4e3ad8f579 ("powerpc/mm/hash64: Don't test for machine type to detect HEA special case") we changed the logic in might_have_hea() to check FW_FEATURE_SPLPAR rather than machine_is(pseries). However the check was incorrectly negated, leading to crashes on machines with HEA adapters, such as: mm: Hashing failure ! EA=0xd000080080004040 access=0x800000000000000e current=NetworkManager trap=0x300 vsid=0x13d349c ssize=1 base psize=2 psize 2 pte=0xc0003cc033e701ae Unable to handle kernel paging request for data at address 0xd000080080004040 Call Trace: .ehea_create_cq+0x148/0x340 [ehea] (unreliable) .ehea_up+0x258/0x1200 [ehea] .ehea_open+0x44/0x1a0 [ehea] ... Fix it by removing the negation. Fixes: 2b4e3ad8f579 ("powerpc/mm/hash64: Don't test for machine type to detect HEA special case") Cc: stable@vger.kernel.org # v4.8+ Reported-by: Denis Kirjanov <kda@linux-powerpc.org> Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
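The fix is a one-character change; a sketch of the corrected check based on the commit description, using the standard powerpc feature-test helpers the message refers to (surrounding hash MMU code abbreviated):

    static bool might_have_hea(void)
    {
            /*
             * HEA adapters hang off the GX bus, found only on pre-POWER8
             * SPLPAR (pseries) machines. The broken version read
             * !firmware_has_feature(FW_FEATURE_SPLPAR), inverting the
             * test and sending real HEA machines down the no-HEA path.
             */
            return !cpu_has_feature(CPU_FTR_ARCH_207S) &&
                    firmware_has_feature(FW_FEATURE_SPLPAR);
    }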
2016-10-08Merge tag 'powerpc-4.9-1' of ↵Linus Torvalds11-49/+218
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights: - Major rework of Book3S 64-bit exception vectors (Nicholas Piggin) - Use gas sections for arranging exception vectors et al. - Large set of TM cleanups and selftests (Cyril Bur) - Enable transactional memory (TM) lazily for userspace (Cyril Bur) - Support for XZ compression in the zImage wrapper (Oliver O'Halloran) - Add support for bpf constant blinding (Naveen N. Rao) - Beginnings of upstream support for PA Semi Nemo motherboards (Darren Stevens) Fixes: - Ensure .mem(init|exit).text are within _stext/_etext (Michael Ellerman) - xmon: Don't use ld on 32-bit (Michael Ellerman) - vdso64: Use double word compare on pointers (Anton Blanchard) - powerpc/nvram: Fix an incorrect partition merge (Pan Xinhui) - powerpc: Fix usage of _PAGE_RO in hugepage (Christophe Leroy) - powerpc/mm: Update FORCE_MAX_ZONEORDER range to allow hugetlb w/4K (Aneesh Kumar K.V) - Fix memory leak in queue_hotplug_event() error path (Andrew Donnellan) - Replay hypervisor maintenance interrupt first (Nicholas Piggin) Various performance optimisations (Anton Blanchard): - Align hot loops of memset() and backwards_memcpy() - During context switch, check before setting mm_cpumask - Remove static branch prediction in atomic{, 64}_add_unless - Only disable HAVE_EFFICIENT_UNALIGNED_ACCESS on POWER7 little endian - Set default CPU type to POWER8 for little endian builds Cleanups & features: - Sparse fixes/cleanups (Daniel Axtens) - Preserve CFAR value on SLB miss caused by access to bogus address (Paul Mackerras) - Radix MMU fixups for POWER9 (Aneesh Kumar K.V) - Support for setting used_(vsr|vr|spe) in sigreturn path (for CRIU) (Simon Guo) - Optimise syscall entry for virtual, relocatable case (Nicholas Piggin) - Optimise MSR handling in exception handling (Nicholas Piggin) - Support for kexec with Radix MMU (Benjamin Herrenschmidt) - powernv EEH fixes (Russell Currey) - Surprise PCI hotplug support for powernv (Gavin Shan) - Endian/sparse fixes for powernv PCI (Gavin Shan) - Defconfig updates (Anton Blanchard) - KVM: PPC: Book3S HV: Migrate pinned pages out of CMA (Balbir Singh) - cxl: Flush PSL cache before resetting the adapter (Frederic Barrat) - cxl: replace loop with for_each_child_of_node(), remove unneeded of_node_put() (Andrew Donnellan) - Fix HV facility unavailable to use correct handler (Nicholas Piggin) - Remove unnecessary syscall trampoline (Nicholas Piggin) - fadump: Fix build break when CONFIG_PROC_VMCORE=n (Michael Ellerman) - Quieten EEH message when no adapters are found (Anton Blanchard) - powernv: Add PHB register dump debugfs handle (Russell Currey) - Use kprobe blacklist for exception handlers & asm functions (Nicholas Piggin) - Document the syscall ABI (Nicholas Piggin) - MAINTAINERS: Update cxl maintainers (Michael Neuling) - powerpc: Remove all usages of NO_IRQ (Michael Ellerman) Minor cleanups: - Andrew Donnellan, Christophe Leroy, Colin Ian King, Cyril Bur, Frederic Barrat, Pan Xinhui, PrasannaKumar Muralidharan, Rui Teng, Simon Guo" * tag 'powerpc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (156 commits) powerpc/bpf: Add support for bpf constant blinding powerpc/bpf: Implement support for tail calls powerpc/bpf: Introduce accessors for using the tmp local stack space powerpc/fadump: Fix build break when CONFIG_PROC_VMCORE=n powerpc: tm: Enable transactional memory (TM) lazily for userspace powerpc/tm: Add TM Unavailable Exception powerpc: Remove do_load_up_transact_{fpu,altivec} powerpc: tm: Rename transct_(*) to ck(\1)_state powerpc: tm: Always use fp_state and vr_state to store live registers selftests/powerpc: Add checks for transactional VSXs in signal contexts selftests/powerpc: Add checks for transactional VMXs in signal contexts selftests/powerpc: Add checks for transactional FPUs in signal contexts selftests/powerpc: Add checks for transactional GPRs in signal contexts selftests/powerpc: Check that signals always get delivered selftests/powerpc: Add TM tcheck helpers in C selftests/powerpc: Allow tests to extend their kill timeout selftests/powerpc: Introduce GPR asm helper header file selftests/powerpc: Move VMX stack frame macros to header file selftests/powerpc: Rework FPU stack placement macros and move to header file selftests/powerpc: Check for VSX preservation across userspace preemption ...
2016-10-06Merge tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-40/+57
Pull KVM updates from Radim Krčmář: "All architectures: - move `make kvmconfig` stubs from x86 - use 64 bits for debugfs stats ARM: - Important fixes for not using an in-kernel irqchip - handle SError exceptions and present them to guests if appropriate - proxying of GICV access at EL2 if guest mappings are unsafe - GICv3 on AArch32 on ARMv8 - preparations for GICv3 save/restore, including ABI docs - cleanups and a few optimizations MIPS: - A couple of fixes in preparation for supporting MIPS EVA host kernels - MIPS SMP host & TLB invalidation fixes PPC: - Fix the bug which caused guests to falsely report lockups - other minor fixes - a small optimization s390: - Lazy enablement of runtime instrumentation - up to 255 CPUs for nested guests - rework of machine check delivery - cleanups and fixes x86: - IOMMU part of AMD's AVIC for vmexit-less interrupt delivery - Hyper-V TSC page - per-vcpu tsc_offset in debugfs - accelerated INS/OUTS in nVMX - cleanups and fixes" * tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (140 commits) KVM: MIPS: Drop dubious EntryHi optimisation KVM: MIPS: Invalidate TLB by regenerating ASIDs KVM: MIPS: Split kernel/user ASID regeneration KVM: MIPS: Drop other CPU ASIDs on guest MMU changes KVM: arm/arm64: vgic: Don't flush/sync without a working vgic KVM: arm64: Require in-kernel irqchip for PMU support KVM: PPC: Book3s PR: Allow access to unprivileged MMCR2 register KVM: PPC: Book3S PR: Support 64kB page size on POWER8E and POWER8NVL KVM: PPC: Book3S: Remove duplicate setting of the B field in tlbie KVM: PPC: BookE: Fix a sanity check KVM: PPC: Book3S HV: Take out virtual core piggybacking code KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread ARM: gic-v3: Work around definition of gic_write_bpr1 KVM: nVMX: Fix the NMI IDT-vectoring handling KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive KVM: nVMX: Fix reload apic access page warning kvmconfig: add virtio-gpu to config fragment config: move x86 kvm_guest.config to a common location arm64: KVM: Remove duplicating init code for setting VMID ARM: KVM: Support vgic-v3 ...