summaryrefslogtreecommitdiff
path: root/arch/powerpc/mm/pgtable-radix.c
AgeCommit message (Collapse)AuthorFilesLines
2017-08-31powerpc/mm/radix: Prettify mapped memory range print outMichael Ellerman1-1/+6
When we map memory at boot we print out the ranges of real addresses that we mapped and the page size that was used. Currently it's a bit ugly: Mapped range 0x0 - 0x2000000000 with 0x40000000 Mapped range 0x200000000000 - 0x202000000000 with 0x40000000 Pad the addresses so they line up, and print the page size using actual units, eg: Mapped 0x0000000000000000-0x0000000001200000 with 64.0 KiB pages Mapped 0x0000000001200000-0x0000000040000000 with 2.00 MiB pages Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31powerpc/mm/radix: Add pr_fmt() to pgtable-radix.cMichael Ellerman1-0/+4
Make the printks look a bit nicer by adding a prefix. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-23Merge branch 'fixes' into nextMichael Ellerman1-1/+33
There's a non-trivial dependency between some commits we want to put in next and the KVM prefetch work around that went into fixes. So merge fixes into next.
2017-08-17powerpc/mm: Don't send IPI to all cpus on THP updatesAneesh Kumar K.V1-4/+4
Now that we made sure that lockless walk of linux page table is mostly limitted to current task(current->mm->pgdir) we can update the THP update sequence to only send IPI to CPUs on which this task has run. This helps in reducing the IPI overload on systems with large number of CPUs. WRT kvm even though kvm is walking page table with vpc->arch.pgdir, it is done only on secondary CPUs and in that case we have primary CPU added to task's mm cpumask. Sending an IPI to primary will force the secondary to do a vm exit and hence this mm cpumask usage is safe here. WRT CAPI, we still end up walking linux page table with capi context MM. For now the pte lookup serialization sends an IPI to all CPUs in CPI is in use. We can further improve this by adding the CAPI interrupt handling CPU to task mm cpumask. That will be done in a later patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10powerpc/mm: Properly invalidate when setting process table baseSuraj Jitindar Singh1-2/+6
The host process table base is stored in the partition table by calling the function native_register_process_table(). Currently this just sets the entry in memory and is missing a subsequent cache invalidation instruction. Any update to the partition table should be followed by a cache invalidation instruction specifying invalidation of the caching of any partition table entries (RIC = 2, PRS = 0). We already have a function to update the partition table with the required cache invalidation instructions - mmu_partition_table_set_entry(). Update the native_register_process_table() function to call mmu_partition_table_set_entry(), this ensures all appropriate invalidation will be performed. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Use a local for patb0 to clean it up slightly] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-08powerpc/mm/book3s64: Make KERN_IO_START a variableMichael Ellerman1-0/+1
Currently KERN_IO_START is defined as: #define KERN_IO_START (KERN_VIRT_START + (KERN_VIRT_SIZE >> 1)) Although it looks like a constant, both the components are actually variables, to allow us to have a different value between Radix and Hash with a single kernel. However that still requires both Radix and Hash to place the kernel IO region at the same location relative to the start and end of the kernel virtual region (namely 1/2 way through it), and we'd like to change that. So split KERN_IO_START out into its own variable, and initialise it for Radix and Hash. In the medium term we should be able to reconsolidate this, by doing a more involved rearrangement of the location of the regions. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-02powerpc/mm/radix: Avoid flushing the PWC on every flush_tlb_rangeBenjamin Herrenschmidt1-1/+4
We do that because it's used by THP pmd collapsing, so use instead a dedicated flush function. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-26powerpc/mm/radix: Workaround prefetch issue with KVMBenjamin Herrenschmidt1-1/+33
There's a somewhat architectural issue with Radix MMU and KVM. When coming out of a guest with AIL (Alternate Interrupt Location, ie, MMU enabled), we start executing hypervisor code with the PID register still containing whatever the guest has been using. The problem is that the CPU can (and will) then start prefetching or speculatively load from whatever host context has that same PID (if any), thus bringing translations for that context into the TLB, which Linux doesn't know about. This can cause stale translations and subsequent crashes. Fixing this in a way that is neither racy nor a huge performance impact is difficult. We could just make the host invalidations always use broadcast forms but that would hurt single threaded programs for example. We chose to fix it instead by partitioning the PID space between guest and host. This is possible because today Linux only use 19 out of the 20 bits of PID space, so existing guests will work if we make the host use the top half of the 20 bits space. We additionally add support for a property to indicate to Linux the size of the PID register which will be useful if we eventually have processors with a larger PID space available. There is still an issue with malicious guests purposefully setting the PID register to a value in the hosts PID range. Hopefully future HW can prevent that, but in the meantime, we handle it with a pair of kludges: - On the way out of a guest, before we clear the current VCPU in the PACA, we check the PID and if it's outside of the permitted range we flush the TLB for that PID. - When context switching, if the mm is "new" on that CPU (the corresponding bit was set for the first time in the mm cpumask), we check if any sibling thread is in KVM (has a non-NULL VCPU pointer in the PACA). If that is the case, we also flush the PID for that CPU (core). This second part is needed to handle the case where a process is migrated (or starts a new pthread) on a sibling thread of the CPU coming out of KVM, as there's a window where stale translations can exist before we detect it and flush them out. A future optimization could be added by keeping track of whether the PID has ever been used and avoid doing that for completely fresh PIDs. We could similarily mark PIDs that have been the subject of a global invalidation as "fresh". But for now this will do. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop unneeded include of kvm_book3s_asm.h] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-18powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=yMichael Ellerman1-0/+8
Currently even with STRICT_KERNEL_RWX we leave the __init text marked executable after init, which is bad. Add a hook to mark it NX (no-execute) before we free it, and implement it for radix and hash. Note that we use __init_end as the end address, not _einittext, because overlaps_kernel_text() uses __init_end, because there are additional executable sections other than .init.text between __init_begin and __init_end. Tested on radix and hash with: 0:mon> p $__init_begin *** 400 exception occurred Fixes: 1e0fc9d1eb2b ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-18powerpc/mm/radix: Refactor radix__mark_rodata_ro()Michael Ellerman1-5/+15
Move the core logic into a helper, so we can use it for changing permissions other than _PAGE_WRITE. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-04powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for RadixBalbir Singh1-1/+66
The Radix linear mapping code (create_physical_mapping()) tries to use the largest page size it can at each step. Currently the only reason it steps down to a smaller page size is if the start addr is unaligned (never happens in practice), or the end of memory is not aligned to a huge page boundary. To support STRICT_RWX we need to break the mapping at __init_begin, so that the text and rodata prior to that can be marked R_X and the regular pages after can be marked RW. Having done that we can now implement mark_rodata_ro() for Radix, knowing that we won't need to split any mappings. Signed-off-by: Balbir Singh <bsingharora@gmail.com> [mpe: Split down to PAGE_SIZE, not 2MB, rewrite change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-03powerpc/mm/radix: Fix execute permissions for interrupt_vectorsBalbir Singh1-1/+2
Commit 9abcc981de97 ("powerpc/mm/radix: Only add X for pages overlapping kernel text") changed the linear mapping on Radix to only mark the kernel text executable. However if the kernel is run relocated, for example as a kdump kernel, then the exception vectors are split from the kernel text, ie. they remain at real address 0. We tend to get away with it, because the kernel itself will usually be below 1G, which means the 1G page at 0-1G is marked executable and everything works OK. However if the kernel is loaded above 1G, or the system has less than 1G in total (meaning we can't use a 1G page), then the exception vectors will not be marked executable and the kernel will fail to boot. Fix it by also checking if the address range overlaps the exception vectors when deciding if we should add PAGE_KERNEL_X. Fixes: 9abcc981de97 ("powerpc/mm/radix: Only add X for pages overlapping kernel text") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Balbir Singh <bsingharora@gmail.com> [mpe: Combine with the existing check, rewrite change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-02powerpc/mm: Add devmap support for ppc64Oliver O'Halloran1-1/+2
Add support for the devmap bit on PTEs and PMDs for PPC64 Book3S. This is used to differentiate device backed memory from transparent huge pages since they are handled in more or less the same manner by the core mm code. Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-23powerpc/mm: Trace tlbie(l) instructionsBalbir Singh1-0/+5
Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate Entry (Local)) instructions. The tlbie instruction has changed over the years, so not all versions accept the same operands. Use the ISA v3 field operands because they are the most verbose, we may change them in future. Example output: qemu-system-ppc-5371 [016] 1412.369519: tlbie: tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0 Signed-off-by: Balbir Singh <bsingharora@gmail.com> [mpe: Add some missing trace_tlbie()s, reword change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-15powerpc/mm/radix: Only add X for pages overlapping kernel textMichael Ellerman1-3/+11
Currently we map the whole linear mapping with PAGE_KERNEL_X. Instead we should check if the page overlaps the kernel text and only then add PAGE_KERNEL_X. Note that we still use 1G pages if they're available, so this will typically still result in a 1G executable page at KERNELBASE. So this fix is primarily useful for catching stray branches to high linear mapping addresses. Without this patch, we can execute at 1G in xmon using: 0:mon> m c000000040000000 c000000040000000 00 l c000000040000000 00000000 01006038 c000000040000004 00000000 2000804e c000000040000008 00000000 x 0:mon> di c000000040000000 c000000040000000 38600001 li r3,1 c000000040000004 4e800020 blr 0:mon> p c000000040000000 return value is 0x1 After we get a 400 as expected: 0:mon> p c000000040000000 *** 400 exception occurred Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com>
2017-03-07Merge tag 'powerpc-4.11-3' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "Five fairly small fixes for things that went in this cycle. A fairly large patch to rework the CAS logic on Power9, necessitated by a late change to the firmware API, and we can't boot without it. Three fixes going to stable, allowing more instructions to be emulated on LE, fixing a boot crash on 32-bit Freescale BookE machines, and the OPAL XICS workaround. And a patch from me to sort the selects under CONFIG PPC. Annoying churn, but worth it in the long run, and best for it to go in now to avoid conflicts. Thanks to: Alexey Kardashevskiy, Anton Blanchard, Balbir Singh, Gautham R. Shenoy, Laurentiu Tudor, Nicholas Piggin, Paul Mackerras, Ravi Bangoria, Sachin Sant, Shile Zhang, Suraj Jitindar Singh" * tag 'powerpc-4.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc: Sort the selects under CONFIG_PPC powerpc/64: Fix L1D cache shape vector reporting L1I values powerpc/64: Avoid panic during boot due to divide by zero in init_cache_info() powerpc: Update to new option-vector-5 format for CAS powerpc: Parse the command line before calling CAS powerpc/xics: Work around limitations of OPAL XICS priority handling powerpc/64: Fix checksum folding in csum_add() powerpc/powernv: Fix opal tracepoints with JUMP_LABEL=n powerpc/booke: Fix boot crash due to null hugepd powerpc: Fix compiling a BE kernel with a powerpc64le toolchain selftest/powerpc: Fix false failures for skipped tests powerpc/powernv: Fix bug due to labeling ambiguity in power_enter_stop powerpc/64: Invalidate process table caching after setting process table powerpc: emulate_step() tests for load/store instructions powerpc: Emulation support for load/store instructions on LE
2017-03-03powerpc/64: Invalidate process table caching after setting process tablePaul Mackerras1-0/+4
The POWER9 MMU reads and caches entries from the process table. When we kexec from one kernel to another, the second kernel sets its process table pointer but doesn't currently do anything to make the CPU invalidate any cached entries from the old process table. This adds a tlbie (TLB invalidate entry) instruction with parameters to invalidate caching of the process table after the new process table is installed. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-02sched/headers: Prepare to remove the <linux/mm_types.h> dependency from ↵Ingo Molnar1-1/+1
<linux/sched.h> Update code that relied on sched.h including various MM types for them. This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-22Merge tag 'powerpc-4.11-1' of ↵Linus Torvalds1-39/+222
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Support for direct mapped LPC on POWER9, giving Linux direct access to devices that may be on there such as a UART. - Memory hotplug support for the Power9 Radix MMU. - Add new AUX vectors describing the processor's cache geometry, to be used by glibc. - The ability for a guest to ask the hypervisor to resize the guest's hash table, and in addition support for doing so automatically when memory is hotplugged into/out-of the guest. This allows the hash table to be sized based on the current memory usage of the guest, rather than the maximum possible memory usage. - Implementation of optprobes (kprobe optimisation) for powerpc. In addition there's the topic branch shared with the KVM tree, which includes support for guests to use the Radix MMU on Power9. Thanks to: Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T, Anton Blanchard, Benjamin Herrenschmidt, Chris Packham, Daniel Axtens, Daniel Borkmann, David Gibson, Finn Thain, Gautham R. Shenoy, Gavin Shan, Greg Kurz, Joel Stanley, John Allen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Ravi Bangoria, Reza Arbab, Shailendra Singh, Vaibhav Jain, Wei Yongjun" * tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (129 commits) powerpc/mm/radix: Skip ptesync in pte update helpers powerpc/mm/radix: Use ptep_get_and_clear_full when clearing pte for full mm powerpc/mm/radix: Update pte update sequence for pte clear case powerpc/mm: Update PROTFAULT handling in the page fault path powerpc/xmon: Fix data-breakpoint powerpc/mm: Fix build break with BOOK3S_64=n and MEMORY_HOTPLUG=y powerpc/mm: Fix build break when CMA=n && SPAPR_TCE_IOMMU=y powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n powerpc/pseries: Fix typo in parameter description powerpc/kprobes: Remove kprobe_exceptions_notify() kprobes: Introduce weak variant of kprobe_exceptions_notify() powerpc/ftrace: Fix confusing help text for DISABLE_MPROFILE_KERNEL powerpc/powernv: Fix opal_exit tracepoint opcode powerpc: Add a prototype for mcount() so it can be versioned powerpc: Drop GPL from of_node_to_nid() export to match other arches powerpc/kprobes: Optimize kprobe in kretprobe_trampoline() powerpc/kprobes: Implement Optprobes powerpc/kprobes: Fixes for kprobe_lookup_name() on BE powerpc: Add helper to check if offset is within relative branch range powerpc/bpf: Introduce __PPC_SH64() ...
2017-02-14Merge branch 'topic/ppc-kvm' into nextMichael Ellerman1-0/+2
Merge the topic branch we're sharing with the kvm-ppc tree.
2017-01-31powerpc/64: Enable use of radix MMU under hypervisor on POWER9Paul Mackerras1-0/+2
To use radix as a guest, we first need to tell the hypervisor via the ibm,client-architecture call first that we support POWER9 and architecture v3.00, and that we can do either radix or hash and that we would like to choose later using an hcall (the H_REGISTER_PROC_TBL hcall). Then we need to check whether the hypervisor agreed to us using radix. We need to do this very early on in the kernel boot process before any of the MMU initialization is done. If the hypervisor doesn't agree, we can't use radix and therefore clear the radix MMU feature bit. Later, when we have set up our process table, which points to the radix tree for each process, we need to install that using the H_REGISTER_PROC_TBL hcall. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31powerpc/mm: unstub radix__vmemmap_remove_mapping()Reza Arbab1-1/+28
Use remove_pagetable() and friends for radix vmemmap removal. We do not require the special-case handling of vmemmap done in the x86 versions of these functions. This is because vmemmap_free() has already freed the mapped pages, and calls us with an aligned address range. So, add a few failsafe WARNs, but otherwise the code to remove physical mappings is already sufficient for vmemmap. Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31powerpc/mm: add radix__remove_section_mapping()Reza Arbab1-0/+133
Tear down and free the four-level page tables of physical mappings during memory hotremove. Borrow the basic structure of remove_pagetable() and friends from the identically-named x86 functions. Reduce the frequency of tlb flushes and page_table_lock spinlocks by only doing them in the outermost function. There was some question as to whether the locking is needed at all. Leave it for now, but we could consider dropping it. Memory must be offline to be removed, thus not in use. So there shouldn't be the sort of concurrent page walking activity here that might prompt us to use RCU. Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31powerpc/mm: add radix__create_section_mapping()Reza Arbab1-0/+7
Wire up memory hotplug page mapping for radix. Share the mapping function already used by radix_init_pgtable(). Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31powerpc/mm: refactor radix physical page mappingReza Arbab1-38/+50
Move the page mapping code in radix_init_pgtable() into a separate function that will also be used for memory hotplug. The current goto loop progressively decreases its mapping size as it covers the tail of a range whose end is unaligned. Change this to a for loop which can do the same for both ends of the range. Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-30powerpc/powernv: Initialise nest mmuAlistair Popple1-0/+2
POWER9 contains an off core mmu called the nest mmu (NMMU). This is used by other hardware units on the chip to translate virtual addresses into real addresses. The unit attempting an address translation provides the majority of the context required for the translation request except for the base address of the partition table (ie. the PTCR) which needs to be programmed into the NMMU. This patch adds a call to OPAL to set the PTCR for the nest mmu in opal_init(). Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-30powerpc/mm: Use the correct pointer when setting a 2MB pteReza Arbab1-2/+2
When setting a 2MB pte, radix__map_kernel_page() is using the address ptep = (pte_t *)pudp; Fix this conversion to use pmdp instead. Use pmdp_ptep() to do this instead of casting the pointer. Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines") Cc: stable@vger.kernel.org # v4.7+ Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-16Merge tag 'powerpc-4.10-1' of ↵Linus Torvalds1-2/+38
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Support for the kexec_file_load() syscall, which is a prereq for secure and trusted boot. - Prevent kernel execution of userspace on P9 Radix (similar to SMEP/PXN). - Sort the exception tables at build time, to save time at boot, and store them as relative offsets to save space in the kernel image & memory. - Allow building the kernel with thin archives, which should allow us to build an allyesconfig once some other fixes land. - Build fixes to allow us to correctly rebuild when changing the kernel endian from big to little or vice versa. - Plumbing so that we can avoid doing a full mm TLB flush on P9 Radix. - Initial stack protector support (-fstack-protector). - Support for dumping the radix (aka. Linux) and hash page tables via debugfs. - Fix an oops in cxl coredump generation when cxl_get_fd() is used. - Freescale updates from Scott: "Highlights include 8xx hugepage support, qbman fixes/cleanup, device tree updates, and some misc cleanup." - Many and varied fixes and minor enhancements as always. Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz, Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin, Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain" [ And thanks to Michael, who took time off from a new baby to get this pull request done. - Linus ] * tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits) powerpc/fsl/dts: add FMan node for t1042d4rdb powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb powerpc/fsl/dts: add QMan and BMan nodes on t1024 powerpc/fsl/dts: add QMan and BMan nodes on t1023 soc/fsl/qman: test: use DEFINE_SPINLOCK() powerpc/fsl-lbc: use DEFINE_SPINLOCK() powerpc/8xx: Implement support of hugepages powerpc: get hugetlbpage handling more generic powerpc: port 64 bits pgtable_cache to 32 bits powerpc/boot: Request no dynamic linker for boot wrapper soc/fsl/bman: Use resource_size instead of computation soc/fsl/qe: use builtin_platform_driver powerpc/fsl_pmc: use builtin_platform_driver powerpc/83xx/suspend: use builtin_platform_driver powerpc/ftrace: Fix the comments for ftrace_modify_code powerpc/perf: macros for power9 format encoding powerpc/perf: power9 raw event format encoding powerpc/perf: update attribute_group data structure powerpc/perf: factor out the event format field powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown ...
2016-12-14Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-12/+6
Pull KVM updates from Paolo Bonzini: "Small release, the most interesting stuff is x86 nested virt improvements. x86: - userspace can now hide nested VMX features from guests - nested VMX can now run Hyper-V in a guest - support for AVX512_4VNNIW and AVX512_FMAPS in KVM - infrastructure support for virtual Intel GPUs. PPC: - support for KVM guests on POWER9 - improved support for interrupt polling - optimizations and cleanups. s390: - two small optimizations, more stuff is in flight and will be in 4.11. ARM: - support for the GICv3 ITS on 32bit platforms" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits) arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest KVM: arm/arm64: timer: Check for properly initialized timer on init KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs KVM: x86: Handle the kthread worker using the new API KVM: nVMX: invvpid handling improvements KVM: nVMX: check host CR3 on vmentry and vmexit KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry KVM: nVMX: propagate errors from prepare_vmcs02 KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation KVM: nVMX: support restore of VMX capability MSRs KVM: nVMX: generate non-true VMX MSRs based on true versions KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs. KVM: x86: Add kvm_skip_emulated_instruction and use it. KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12 KVM: VMX: Reorder some skip_emulated_instruction calls KVM: x86: Add a return value to kvm_emulate_cpuid KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h ...
2016-11-26powerpc/mm/radix: Prevent kernel execution of user spaceBalbir Singh1-0/+22
ISA 3 defines new encoded access authority that allows instruction access prevention in privileged mode and allows normal access to problem state. This patch just enables IAMR (Instruction Authority Mask Register), enabling AMR would require more work. I've tested this with a buggy driver and a simple payload. The payload is specific to the build I've tested. mpe: Also tested with LKDTM: # echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT lkdtm: Performing direct entry EXEC_USERSPACE lkdtm: attempting ok execution at c0000000005bf560 lkdtm: attempting bad execution at 00003fff8d940000 Unable to handle kernel paging request for instruction fetch Faulting instruction address: 0x3fff8d940000 Oops: Kernel access of bad area, sig: 11 [#1] NIP: 00003fff8d940000 LR: c0000000005bfa58 CTR: 00003fff8d940000 REGS: c0000000f1fcf900 TRAP: 0400 Not tainted (4.9.0-rc5-compiler_gcc-6.2.0-00109-g956dbc06232a) MSR: 9000000010009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48002222 XER: 00000000 ... Call Trace: lkdtm_EXEC_USERSPACE+0x104/0x120 (unreliable) lkdtm_do_action+0x3c/0x80 direct_entry+0x100/0x1b0 full_proxy_write+0x94/0x100 __vfs_write+0x3c/0x1b0 vfs_write+0xcc/0x230 SyS_write+0x60/0x110 system_call+0x38/0xfc Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25powerpc/mm/radix: Setup AMOR in HV mode to allow key 0Balbir Singh1-0/+14
Setup AMOR (Authority Mask Override Register) in HV mode so that the host and guest kernel can in turn setup IAMR. This allows us to enable key 0 in a following patch. Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-24Merge branch 'topic/ppc-kvm' into nextMichael Ellerman1-12/+6
Merge the topic branch we're sharing with the kvm-ppc tree.
2016-11-23powerpc/64: Provide functions for accessing POWER9 partition tablePaul Mackerras1-12/+6
POWER9 requires the host to set up a partition table, which is a table in memory indexed by logical partition ID (LPID) which contains the pointers to page tables and process tables for the host and each guest. This factors out the initialization of the partition table into a single function. This code was previously duplicated between hash_utils_64.c and pgtable-radix.c. This provides a function for setting a partition table entry, which is used in early MMU initialization, and will be used by KVM whenever a guest is created. This function includes a tlbie instruction which will flush all TLB entries for the LPID and all caches of the partition table entry for the LPID, across the system. This also moves a call to memblock_set_current_limit(), which was in radix_init_partition_table(), but has nothing to do with the partition table. By analogy with the similar code for hash, the call gets moved to near the end of radix__early_init_mmu(). It now gets called when running as a guest, whereas previously it would only be called if the kernel is running as the host. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-18powerpc/mm: Fix missing update of HID register on secondary CPUsAneesh Kumar K.V1-0/+4
We need to update on secondaries for the selected MMU mode. Fixes: ad410674f560 ("powerpc/mm: Update the HID bit when switching from radix to hash") Reported-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-17powerpc/mm: Correct process and partition table max sizeSuraj Jitindar Singh1-2/+2
Version 3.00 of the ISA states that the PATS (partition table size) field of the PTCR (partition table control register) and the PRTS (process table size) field of the partition table entry must both be less than or equal to 24. However the actual size of the partition and process tables is equal to 2 to the power of 12 plus the PATS and PRTS fields, respectively. This means that the max allowable size of each of these tables is 2^36 or 64GB for both. Thus when checking the size shift for each we should be checking for values of greater than 36 instead of the current check for shifts larger than 24 and 23. Fixes: 2bfd65e45e877fb5704730244da67c748d28a1b8 Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-17powerpc/mm: Fix typo in radix encodings printBalbir Singh1-1/+1
Rename "sift" to "shift". Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-23powerpc/64/kexec: Fix MMU cleanup on radixBenjamin Herrenschmidt1-0/+12
Just using the hash ops won't work anymore since radix will have NULL in there. Instead create an mmu_cleanup_all() function which will do the right thing based on the MMU mode. For Radix, for now I clear UPRT and the PTCR, effectively switching back to Radix with no partition table setup. Currently set it to NULL on BookE thought it might be a good idea to wipe the TLB there (Scott ?) Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-13powerpc/mm: Update the HID bit when switching from radix to hashAneesh Kumar K.V1-0/+28
Power9 DD1 requires to update the hid0 register when switching from hash to radix. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-04powerpc/mm: Move register_process_table() out of ppc_mdMichael Ellerman1-2/+2
We want to initialise register_process_table() before ppc_md is setup, so that it can be called as part of MMU init (at least on Radix ATM). That no longer works because probe_machine() requires that ppc_md be empty before it's called, and we now do probe_machine() much later. So make register_process_table a global for now. It will probably move into a mmu_radix_ops struct at some point in the future. This was broken by me when applying commit 7025776ed1eb "powerpc/mm: Move hash table ops to a separate structure" due to conflicts with other patches. Fixes: 7025776ed1eb ("powerpc/mm: Move hash table ops to a separate structure") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01powerpc/mm: Do radix device tree scanning earlierMichael Ellerman1-2/+1
Like we just did for hash, split the device tree scanning parts out and call them from mmu_early_init_devtree(). Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21powerpc/64: Move MMU backend selection out of platform codeBenjamin Herrenschmidt1-0/+1
We move it into early_mmu_init() based on firmware features. For PS3, we have to move the setting of these into early_init_devtree(). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17powerpc/mm/radix: Update machine call back to support new HCALL.Aneesh Kumar K.V1-3/+6
This update the machine dep callback such that we can use the same callback to register process table. The interface is updated such that we can easily call H_REGISTER_PROC_TBL hcall. The HCALL itself is introduced in a later patch. No functionality change introduced by this patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17powerpc/mm: Print formation regarding the the MMU modeAneesh Kumar K.V1-1/+2
This helps in easily identifying the MMU mode with which the kernel is operating. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17powerpc/mm/radix: Update LPCR HR bit as per ISAAneesh Kumar K.V1-2/+2
PowerISA 3.0 requires the MMU mode (radix vs. hash) of the hypervisor to be mirrored in the LPCR register, in addition to the partition table. This is done to avoid fetching from the table when deciding, among other things, how to perform transitions to HV mode on some interrupts. So let's set it up appropriately Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-30powerpc: Initialise pci_io_base as early as possibleDarren Stevens1-0/+5
Commit d6a9996e84ac ("powerpc/mm: vmalloc abstraction in preparation for radix") turned kernel memory and IO addresses from #defined constants to variables initialised at runtime. On PA6T (pasemi) systems the setup_arch() machine call initialises the onboard PCI-e root-ports, and uses pci_io_base to do this, which is now before its value has been set, resulting in a panic early in boot before console IO is initialised. Move the pci_io_base initialisation to the same place as vmalloc ranges are set (hash__early_init_mmu()/radix__early_init_mmu()) - this is the earliest possible place we can initialise it. Fixes: d6a9996e84ac ("powerpc/mm: vmalloc abstraction in preparation for radix") Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de> Signed-off-by: Darren Stevens <darren@stevens-zone.net> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Add #ifdef CONFIG_PCI, massage change log slightly] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-17powerpc/mm/radix: Update Radix tree size as per ISA 3.0Aneesh Kumar K.V1-6/+3
ISA 3.0 updated it to be encoded as Radix tree size = 2^(RTS + 31). We have it encoded as 2^(RTS + 28). Add a helper with the correct encoding and use it instead of opencoding. Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines") Reviewed-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-01powerpc/mm/radix: Update LPCR only if it is powernvAneesh Kumar K.V1-13/+10
LPCR cannot be updated when running in guest mode. Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm/radix: Add radix THP callbacksAneesh Kumar K.V1-0/+117
The deposited pgtable_t is a pte fragment hence we cannot use page->lru for linking then together. We use the first two 64 bits for pte fragment as list_head type to link all deposited fragments together. On withdraw we properly zero then out. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm: pte_frag abstractionAneesh Kumar K.V1-0/+5
In this patch we make the number of pte fragments per level 4 page table page a variable. Radix level 4 table size is 256 bytes and hence we can have 256 fragments per level 4 page. We don't update the fragment count in this patch. We need to do performance measurements to find the right value for fragment count. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm: vmalloc abstraction in preparation for radixAneesh Kumar K.V1-0/+7
The vmalloc range differs between hash and radix config. Hence make VMALLOC_START and related constants a variable which will be runtime initialized depending on whether hash or radix mode is active. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Fix missing init of ioremap_bot in pgtable_64.c for ppc64e] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>