summaryrefslogtreecommitdiff
path: root/arch/powerpc/platforms
AgeCommit message (Collapse)AuthorFilesLines
2019-04-30powerpc/64s: Reimplement book3s idle code in CNicholas Piggin2-174/+690
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation speific HV idle code to the powernv platform code. Book3S assembly stubs are kept in common code and used only to save the stack frame and non-volatile GPRs before executing architected idle instructions, and restoring the stack and reloading GPRs then returning to C after waking from idle. The complex logic dealing with threads and subcores, locking, SPRs, HMIs, timebase resync, etc., is all done in C which makes it more maintainable. This is not a strict translation to C code, there are some significant differences: - Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs, but saves and restores them itself. - The optimisation where EC=ESL=0 idle modes did not have to save GPRs or change MSR is restored, because it's now simple to do. ESL=1 sleeps that do not lose GPRs can use this optimization too. - KVM secondary entry and cede is now more of a call/return style rather than branchy. nap_state_lost is not required because KVM always returns via NVGPR restoring path. - KVM secondary wakeup from offline sequence is moved entirely into the offline wakeup, which avoids a hwsync in the normal idle wakeup path. Performance measured with context switch ping-pong on different threads or cores, is possibly improved a small amount, 1-3% depending on stop state and core vs thread test for shallow states. Deep states it's in the noise compared with other latencies. KVM improvements: - Idle sleepers now always return to caller rather than branch out to KVM first. - This allows optimisations like very fast return to caller when no state has been lost. - KVM no longer requires nap_state_lost because it controls NVGPR save/restore itself on the way in and out. - The heavy idle wakeup KVM request check can be moved out of the normal host idle code and into the not-performance-critical offline code. - KVM nap code now returns from where it is called, which makes the flow a bit easier to follow. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Squash the KVM changes in] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-30powerpc/powernv/npu: Use pci_dev_id() helperHeiner Kallweit1-8/+6
Use new helper pci_dev_id() to simplify the code. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2019-04-29powerpc/pseries: Track LMB nid instead of using device treeNathan Fontenot1-9/+8
When removing memory we need to remove the memory from the node it was added to instead of looking up the node it should be in in the device tree. During testing we have seen scenarios where the affinity for a LMB changes due to a partition migration or PRRN event. In these cases the node the LMB exists in may not match the node the device tree indicates it belongs in. This can lead to a system crash when trying to DLPAR remove the LMB after a migration or PRRN event. The current code looks up the node in the device tree to remove the LMB from, the crash occurs when we try to offline this node and it does not have any data, i.e. node_data[nid] == NULL. 36:mon> e cpu 0x36: Vector: 300 (Data Access) at [c0000001828b7810] pc: c00000000036d08c: try_offline_node+0x2c/0x1b0 lr: c0000000003a14ec: remove_memory+0xbc/0x110 sp: c0000001828b7a90 msr: 800000000280b033 dar: 9a28 dsisr: 40000000 current = 0xc0000006329c4c80 paca = 0xc000000007a55200 softe: 0 irq_happened: 0x01 pid = 76926, comm = kworker/u320:3 36:mon> t [link register ] c0000000003a14ec remove_memory+0xbc/0x110 [c0000001828b7a90] c00000000006a1cc arch_remove_memory+0x9c/0xd0 (unreliable) [c0000001828b7ad0] c0000000003a14e0 remove_memory+0xb0/0x110 [c0000001828b7b20] c0000000000c7db4 dlpar_remove_lmb+0x94/0x160 [c0000001828b7b60] c0000000000c8ef8 dlpar_memory+0x7e8/0xd10 [c0000001828b7bf0] c0000000000bf828 handle_dlpar_errorlog+0xf8/0x160 [c0000001828b7c60] c0000000000bf8cc pseries_hp_work_fn+0x3c/0xa0 [c0000001828b7c90] c000000000128cd8 process_one_work+0x298/0x5a0 [c0000001828b7d20] c000000000129068 worker_thread+0x88/0x620 [c0000001828b7dc0] c00000000013223c kthread+0x1ac/0x1c0 [c0000001828b7e30] c00000000000b45c ret_from_kernel_thread+0x5c/0x80 To resolve this we need to track the node a LMB belongs to when it is added to the system so we can remove it from that node instead of the node that the device tree indicates it should belong to. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-23powerpc/512x: mark clocks as big endianJonas Gorski1-3/+6
These clocks' registers are accessed as big endian, so mark them as such. Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2019-04-21powerpc/mm/hash: Rename KERNEL_REGION_ID to LINEAR_MAP_REGION_IDAneesh Kumar K.V1-1/+1
The region actually point to linear map. Rename the #define to clarify thati. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/mm/hash64: Map all the kernel regions in the same 0xc rangeAneesh Kumar K.V1-2/+2
This patch maps vmalloc, IO and vmemap regions in the 0xc address range instead of the current 0xd and 0xf range. This brings the mapping closer to radix translation mode. With hash 64K page size each of this region is 512TB whereas with 4K config we are limited by the max page table range of 64TB and hence there regions are of 16TB size. The kernel mapping is now: On 4K hash kernel_region_map_size = 16TB kernel vmalloc start = 0xc000100000000000 kernel IO start = 0xc000200000000000 kernel vmemmap start = 0xc000300000000000 64K hash, 64K radix and 4k radix: kernel_region_map_size = 512TB kernel vmalloc start = 0xc008000000000000 kernel IO start = 0xc00a000000000000 kernel vmemmap start = 0xc00c000000000000 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/32s: Implement Kernel Userspace Access ProtectionChristophe Leroy1-0/+1
This patch implements Kernel Userspace Access Protection for book3s/32. Due to limitations of the processor page protection capabilities, the protection is only against writing. read protection cannot be achieved using page protection. The previous patch modifies the page protection so that RW user pages are RW for Key 0 and RO for Key 1, and it sets Key 0 for both user and kernel. This patch changes userspace segment registers are set to Ku 0 and Ks 1. When kernel needs to write to RW pages, the associated segment register is then changed to Ks 0 in order to allow write access to the kernel. In order to avoid having the read all segment registers when locking/unlocking the access, some data is kept in the thread_struct and saved on stack on exceptions. The field identifies both the first unlocked segment and the first segment following the last unlocked one. When no segment is unlocked, it contains value 0. As the hash_page() function is not able to easily determine if a protfault is due to a bad kernel access to userspace, protfaults need to be handled by handle_page_fault when KUAP is set. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> [mpe: Drop allow_read/write_to/from_user() as they're now in kup.h, and adapt allow_user_access() to do nothing when to == NULL] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/32s: Implement Kernel Userspace Execution Prevention.Christophe Leroy1-0/+1
To implement Kernel Userspace Execution Prevention, this patch sets NX bit on all user segments on kernel entry and clears NX bit on all user segments on kernel exit. Note that powerpc 601 doesn't have the NX bit, so KUEP will not work on it. A warning is displayed at startup. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/8xx: Add Kernel Userspace Access ProtectionChristophe Leroy1-0/+1
This patch adds Kernel Userspace Access Protection on the 8xx. When a page is RO or RW, it is set RO or RW for Key 0 and NA for Key 1. Up to now, the User group is defined with Key 0 for both User and Supervisor. By changing the group to Key 0 for User and Key 1 for Supervisor, this patch prevents the Kernel from being able to access user data. At exception entry, the kernel saves SPRN_MD_AP in the regs struct, and reapply the protection. At exception exit it restores SPRN_MD_AP with the value saved on exception entry. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> [mpe: Drop allow_read/write_to/from_user() as they're now in kup.h] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/8xx: Add Kernel Userspace Execution PreventionChristophe Leroy1-0/+1
This patch adds Kernel Userspace Execution Prevention on the 8xx. When a page is Executable, it is set Executable for Key 0 and NX for Key 1. Up to now, the User group is defined with Key 0 for both User and Supervisor. By changing the group to Key 0 for User and Key 1 for Supervisor, this patch prevents the Kernel from being able to execute user code. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/32: Prepare for Kernel Userspace Access ProtectionChristophe Leroy1-1/+1
This patch adds ASM macros for saving, restoring and checking the KUAP state, and modifies setup_32 to call them on exceptions from kernel. The macros are defined as empty by default for when CONFIG_PPC_KUAP is not selected and/or for platforms which don't handle (yet) KUAP. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/64s: Implement KUAP for Radix MMUMichael Ellerman1-0/+8
Kernel Userspace Access Prevention utilises a feature of the Radix MMU which disallows read and write access to userspace addresses. By utilising this, the kernel is prevented from accessing user data from outside of trusted paths that perform proper safety checks, such as copy_{to/from}_user() and friends. Userspace access is disabled from early boot and is only enabled when performing an operation like copy_{to/from}_user(). The register that controls this (AMR) does not prevent userspace from accessing itself, so there is no need to save and restore when entering and exiting userspace. When entering the kernel from the kernel we save AMR and if it is not blocking user access (because eg. we faulted doing a user access) we reblock user access for the duration of the exception (ie. the page fault) and then restore the AMR when returning back to the kernel. This feature can be tested by using the lkdtm driver (CONFIG_LKDTM=y) and performing the following: # (echo ACCESS_USERSPACE) > [debugfs]/provoke-crash/DIRECT If enabled, this should send SIGSEGV to the thread. We also add paranoid checking of AMR in switch and syscall return under CONFIG_PPC_KUAP_DEBUG. Co-authored-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/mm/radix: Use KUEP API for Radix MMURussell Currey1-0/+1
Execution protection already exists on radix, this just refactors the radix init to provide the KUEP setup function instead. Thus, the only functional change is that it can now be disabled. Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc: Add a framework for Kernel Userspace Access ProtectionChristophe Leroy1-0/+12
This patch implements a framework for Kernel Userspace Access Protection. Then subarches will have the possibility to provide their own implementation by providing setup_kuap() and allow/prevent_user_access(). Some platforms will need to know the area accessed and whether it is accessed from read, write or both. Therefore source, destination and size and handed over to the two functions. mpe: Rename to allow/prevent rather than unlock/lock, and add read/write wrappers. Drop the 32-bit code for now until we have an implementation for it. Add kuap to pt_regs for 64-bit as well as 32-bit. Don't split strings, use pr_crit_ratelimited(). Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc: Add skeleton for Kernel Userspace Execution PreventionChristophe Leroy1-0/+12
This patch adds a skeleton for Kernel Userspace Execution Prevention. Then subarches implementing it have to define CONFIG_PPC_HAVE_KUEP and provide setup_kuep() function. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> [mpe: Don't split strings, use pr_crit_ratelimited()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/powernv: Squash sparse warnings in opal-call.cAndrew Donnellan1-0/+2
sparse complains a lot about opal-call.c: arch/powerpc/platforms/powernv/opal-call.c:128:1: warning: symbol 'opal_invalid_call' was not declared. Should it be static? arch/powerpc/platforms/powernv/opal-call.c:129:1: warning: symbol 'opal_console_write' was not declared. Should it be static? arch/powerpc/platforms/powernv/opal-call.c:130:1: warning: symbol 'opal_console_read' was not declared. Should it be static? Those symbols are forward declared in opal.h, but we can't include that because the function signatures in opal.h are different. So instead, just add an extra forward declaration to the OPAL_CALL macro to shut sparse up. Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/8xx: Fix possible device node reference leakWen Yang1-2/+1
The call to of_find_compatible_node() returns a node pointer with refcount incremented thus it must be explicitly decremented after the last usage. irq_domain_add_linear() also calls of_node_get() to increase refcount, so irq_domain() will not be affected when it is released. Detected by coccinelle. Fixes: a8db8cf0d894 ("irq_domain: Replace irq_alloc_host() with revmap-specific initializers") Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr> Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Peng Hao <peng.hao2@zte.com.cn> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/pseries: hwpoison the pages upon hitting UEGanesh Goudar1-0/+83
Add support to hwpoison the pages upon hitting machine check exception. This patch queues the address where UE is hit to percpu array and schedules work to plumb it into memory poison infrastructure. Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> [mpe: Combine #ifdefs, drop PPC_BIT8(), and empty inline stub] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/83xx: Add missing of_node_put() after of_device_is_available()Julia Lawall1-1/+3
Add an of_node_put() when a tested device node is not available. Fixes: c026c98739c7e ("powerpc/83xx: Do not configure or probe disabled FSL DR USB controllers") Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/pseries/pmem: Fix a set but not used valueQian Cai1-2/+1
The commit 4c5d87db4978 ("powerpc/pseries: PAPR persistent memory support") set a local variable "count" in dlpar_hp_pmem() but never use it. arch/powerpc/platforms/pseries/pmem.c: In function 'dlpar_hp_pmem': arch/powerpc/platforms/pseries/pmem.c:109:6: warning: variable 'count' set but not used Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/pseries/iommu: Fix set but not used valuesQian Cai1-8/+5
The commit b7d6bf4fdd47 ("powerpc/pseries/pci: Remove obsolete SW invalidate") left 2 variables unused. arch/powerpc/platforms/pseries/iommu.c:108:17: warning: variable 'tces' set but not used __be64 *tcep, *tces; ^~~~ arch/powerpc/platforms/pseries/iommu.c:132:17: warning: variable 'tces' set but not used __be64 *tcep, *tces; ^~~~ Also, the commit 68c0449ea16d ("powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation") set "ranges" in ddw_memory_hotplug_max() but never use it. arch/powerpc/platforms/pseries/iommu.c: In function 'ddw_memory_hotplug_max': arch/powerpc/platforms/pseries/iommu.c:948:7: warning: variable 'ranges' set but not used int ranges, n_mem_addr_cells, n_mem_size_cells, len; ^~~~~~ Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/mm: move warning from resize_hpt_for_hotplug()Laurent Vivier1-1/+2
resize_hpt_for_hotplug() reports a warning when it cannot resize the hash page table ("Unable to resize hash page table to target order") but in some cases it's not a problem and can make user thinks something has not worked properly. This patch moves the warning to arch_remove_memory() to only report the problem when it is needed. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/pseries/mce: Improve array initialization.Mahesh Salgaonkar1-26/+26
This is a follow up to the patch that fixed misleading print for TLB mutlihit due to wrongly populated mc_err_types[] array. Convert all the static array initialization to '[x] = val' style for better readability of array indexing and avoid any further confusion. Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/embedded6xx: Remove unused functions holly_power_off and holly_haltMathieu Malaterre1-12/+0
Silence the following warnings triggered using W=1: arch/powerpc/platforms/embedded6xx/holly.c:236:6: error: no previous prototype for 'holly_power_off' arch/powerpc/platforms/embedded6xx/holly.c:243:6: error: no previous prototype for 'holly_halt' Signed-off-by: Mathieu Malaterre <malat@debian.org> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/embedded6xx: Make some functions staticMathieu Malaterre1-3/+4
In commit cb9e4d10c448 ("[POWERPC] Add support for 750CL Holly board") new functions were added. Since most of these functions can be made static, make it so. Both holly_power_off and holly_halt functions were not changed since they are unused, making them static would have triggered the following warning (treated as error): arch/powerpc/platforms/embedded6xx/holly.c:244:13: error: 'holly_halt' defined but not used Silence the following warnings triggered using W=1: arch/powerpc/platforms/embedded6xx/holly.c:47:5: error: no previous prototype for 'holly_exclude_device' arch/powerpc/platforms/embedded6xx/holly.c:190:6: error: no previous prototype for 'holly_show_cpuinfo' arch/powerpc/platforms/embedded6xx/holly.c:196:17: error: no previous prototype for 'holly_restart' Signed-off-by: Mathieu Malaterre <malat@debian.org> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-17powerpc/mm/radix: Make Radix require HUGETLB_PAGEMichael Ellerman1-1/+1
Joel reported weird crashes using skiroot_defconfig, in his case we jumped into an NX page: kernel tried to execute exec-protected page (c000000002bff4f0) - exploit attempt? (uid: 0) BUG: Unable to handle kernel instruction fetch Faulting instruction address: 0xc000000002bff4f0 Looking at the disassembly, we had simply branched to that address: c000000000c001bc 49fff335 bl c000000002bff4f0 But that didn't match the original kernel image: c000000000c001bc 4bfff335 bl c000000000bff4f0 <kobject_get+0x8> When STRICT_KERNEL_RWX is enabled, and we're using the radix MMU, we call radix__change_memory_range() late in boot to change page protections. We do that both to mark rodata read only and also to mark init text no-execute. That involves walking the kernel page tables, and clearing _PAGE_WRITE or _PAGE_EXEC respectively. With radix we may use hugepages for the linear mapping, so the code in radix__change_memory_range() uses eg. pmd_huge() to test if it has found a huge mapping, and if so it stops the page table walk and changes the PMD permissions. However if the kernel is built without HUGETLBFS support, pmd_huge() is just a #define that always returns 0. That causes the code in radix__change_memory_range() to incorrectly interpret the PMD value as a pointer to a PTE page rather than as a PTE at the PMD level. We can see this using `dv` in xmon which also uses pmd_huge(): 0:mon> dv c000000000000000 pgd @ 0xc000000001740000 pgdp @ 0xc000000001740000 = 0x80000000ffffb009 pudp @ 0xc0000000ffffb000 = 0x80000000ffffa009 pmdp @ 0xc0000000ffffa000 = 0xc00000000000018f <- this is a PTE ptep @ 0xc000000000000100 = 0xa64bb17da64ab07d <- kernel text The end result is we treat the value at 0xc000000000000100 as a PTE and clear _PAGE_WRITE or _PAGE_EXEC, potentially corrupting the code at that address. In Joel's specific case we cleared the sign bit in the offset of the branch, causing a backward branch to turn into a forward branch which caused us to branch into a non-executable page. However the exact nature of the crash depends on kernel version, compiler version, and other factors. We need to fix radix__change_memory_range() to not use accessors that depend on HUGETLBFS, but we also have radix memory hotplug code that uses pmd_huge() etc that will also need fixing. So for now just disallow the broken combination of Radix with HUGETLBFS disabled. The only defconfig we have that is affected is skiroot_defconfig, so turn on HUGETLBFS there so that it still gets Radix. Fixes: 566ca99af026 ("powerpc/mm/radix: Add dummy radix_enabled()") Cc: stable@vger.kernel.org # v4.7+ Reported-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-11powerpc/xive: add OPAL extensions for the XIVE native exploitation supportCédric Le Goater1-0/+3
The support for XIVE native exploitation mode in Linux/KVM needs a couple more OPAL calls to get and set the state of the XIVE internal structures being used by a sPAPR guest. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-29powerpc/pseries/mce: Fix misleading print for TLB mutlihitMahesh Salgaonkar1-0/+1
On pseries, TLB multihit are reported as D-Cache Multihit. This is because the wrongly populated mc_err_types[] array. Per PAPR, TLB error type is 0x04 and mc_err_types[4] points to "D-Cache" instead of "TLB" string. Fixup the mc_err_types[] array. Machine check error type per PAPR: 0x00 = Uncorrectable Memory Error (UE) 0x01 = SLB error 0x02 = ERAT Error 0x04 = TLB error 0x05 = D-Cache error 0x07 = I-Cache error Fixes: 8f0b80561f21 ("powerpc/pseries: Display machine check error details.") Cc: stable@vger.kernel.org # v4.20+ Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-27powerpc/pseries/energy: Use OF accessor functions to read ibm,drc-indexesGautham R. Shenoy1-9/+18
In cpu_to_drc_index() in the case when FW_FEATURE_DRC_INFO is absent, we currently use of_read_property() to obtain the pointer to the array corresponding to the property "ibm,drc-indexes". The elements of this array are of type __be32, but are accessed without any conversion to the OS-endianness, which is buggy on a Little Endian OS. Fix this by using of_property_read_u32_index() accessor function to safely read the elements of the array. Fixes: e83636ac3334 ("pseries/drc-info: Search DRC properties for CPU indexes") Cc: stable@vger.kernel.org # v4.16+ Reported-by: Pavithra R. Prakash <pavrampu@in.ibm.com> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [mpe: Make the WARN_ON a WARN_ON_ONCE so it's not retriggerable] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-16Merge tag 'devdax-for-5.1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull device-dax updates from Dan Williams: "New device-dax infrastructure to allow persistent memory and other "reserved" / performance differentiated memories, to be assigned to the core-mm as "System RAM". Some users want to use persistent memory as additional volatile memory. They are willing to cope with potential performance differences, for example between DRAM and 3D Xpoint, and want to use typical Linux memory management apis rather than a userspace memory allocator layered over an mmap() of a dax file. The administration model is to decide how much Persistent Memory (pmem) to use as System RAM, create a device-dax-mode namespace of that size, and then assign it to the core-mm. The rationale for device-dax is that it is a generic memory-mapping driver that can be layered over any "special purpose" memory, not just pmem. On subsequent boots udev rules can be used to restore the memory assignment. One implication of using pmem as RAM is that mlock() no longer keeps data off persistent media. For this reason it is recommended to enable NVDIMM Security (previously merged for 5.0) to encrypt pmem contents at rest. We considered making this recommendation an actively enforced requirement, but in the end decided to leave it as a distribution / administrator policy to allow for emulation and test environments that lack security capable NVDIMMs. Summary: - Replace the /sys/class/dax device model with /sys/bus/dax, and include a compat driver so distributions can opt-in to the new ABI. - Allow for an alternative driver for the device-dax address-range - Introduce the 'kmem' driver to hotplug / assign a device-dax address-range to the core-mm. - Arrange for the device-dax target-node to be onlined so that the newly added memory range can be uniquely referenced by numa apis" NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because we currently have special - and very annoying rules in the kernel about accessing PMEM only with the "MC safe" accessors, because machine checks inside the regular repeat string copy functions can be fatal in some (not described) circumstances. And apparently the PMEM modules can cause that a lot more than regular RAM. The argument is that this happens because PMEM doesn't necessarily get scrubbed at boot like RAM does, but that is planned to be added for the user space tooling. Quoting Dan from another email: "The exposure can be reduced in the volatile-RAM case by scanning for and clearing errors before it is onlined as RAM. The userspace tooling for that can be in place before v5.1-final. There's also runtime notifications of errors via acpi_nfit_uc_error_notify() from background scrubbers on the DIMM devices. With that mechanism the kernel could proactively clear newly discovered poison in the volatile case, but that would be additional development more suitable for v5.2. I understand the concern, and the need to highlight this issue by tapping the brakes on feature development, but I don't see PMEM as RAM making the situation worse when the exposure is also there via DAX in the PMEM case. Volatile-RAM is arguably a safer use case since it's possible to repair pages where the persistent case needs active application coordination" * tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: device-dax: "Hotplug" persistent memory for use like normal RAM mm/resource: Let walk_system_ram_range() search child resources mm/memory-hotplug: Allow memory resources to be children mm/resource: Move HMM pr_debug() deeper into resource code mm/resource: Return real error codes from walk failures device-dax: Add a 'modalias' attribute to DAX 'bus' devices device-dax: Add a 'target_node' attribute device-dax: Auto-bind device after successful new_id acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node device-dax: Add /sys/class/dax backwards compatibility device-dax: Add support for a dax override driver device-dax: Move resource pinning+mapping into the common driver device-dax: Introduce bus + driver model device-dax: Start defining a dax bus model device-dax: Remove multi-resource infrastructure device-dax: Kill dax_region base device-dax: Kill dax_region ida
2019-03-16Merge tag 'powerpc-5.1-2' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "One fix to prevent runtime allocation of 16GB pages when running in a VM (as opposed to bare metal), because it doesn't work. A small fix to our recently added KCOV support to exempt some more code from being instrumented. Plus a few minor build fixes, a small dead code removal and a defconfig update. Thanks to: Alexey Kardashevskiy, Aneesh Kumar K.V, Christophe Leroy, Jason Yan, Joel Stanley, Mahesh Salgaonkar, Mathieu Malaterre" * tag 'powerpc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s: Include <asm/nmi.h> header file to fix a warning powerpc/powernv: Fix compile without CONFIG_TRACEPOINTS powerpc/mm: Disable kcov for SLB routines powerpc: remove dead code in head_fsl_booke.S powerpc/configs: Sync skiroot defconfig powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration
2019-03-13powerpc/powernv: Fix compile without CONFIG_TRACEPOINTSAlexey Kardashevskiy1-0/+1
The functions returns s64 but the return statement is missing. This adds the missing return statement. Fixes: 75d9fc7fd94e ("powerpc/powernv: move OPAL call wrapper tracing and interrupt handling to C") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-12treewide: add checks for the return value of memblock_alloc*()Mike Rapoport5-0/+20
Add check for the return value of memblock_alloc*() functions and call panic() in case of error. The panic message repeats the one used by panicing memblock allocators with adjustment of parameters to include only relevant ones. The replacement was mostly automated with semantic patches like the one below with manual massaging of format strings. @@ expression ptr, size, align; @@ ptr = memblock_alloc(size, align); + if (!ptr) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align); [anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type] Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org [rppt@linux.ibm.com: fix format strings for panics after memblock_alloc] Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com [rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails] Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx [akpm@linux-foundation.org: fix xtensa printk warning] Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Reviewed-by: Guo Ren <ren_guo@c-sky.com> [c-sky] Acked-by: Paul Burton <paul.burton@mips.com> [MIPS] Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390] Reviewed-by: Juergen Gross <jgross@suse.com> [Xen] Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Max Filippov <jcmvbkbc@gmail.com> [xtensa] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds3-8/+18
Merge more updates from Andrew Morton: - some of the rest of MM - various misc things - dynamic-debug updates - checkpatch - some epoll speedups - autofs - rapidio - lib/, lib/lzo/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits) samples/mic/mpssd/mpssd.h: remove duplicate header kernel/fork.c: remove duplicated include include/linux/relay.h: fix percpu annotation in struct rchan arch/nios2/mm/fault.c: remove duplicate include unicore32: stop printing the virtual memory layout MAINTAINERS: fix GTA02 entry and mark as orphan mm: create the new vm_fault_t type arm, s390, unicore32: remove oneliner wrappers for memblock_alloc() arch: simplify several early memory allocations openrisc: simplify pte_alloc_one_kernel() sh: prefer memblock APIs returning virtual address microblaze: prefer memblock API returning virtual address powerpc: prefer memblock APIs returning virtual address lib/lzo: separate lzo-rle from lzo lib/lzo: implement run-length encoding lib/lzo: fast 8-byte copy on arm64 lib/lzo: 64-bit CTZ on arm64 lib/lzo: tidy-up ifdefs ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size ipc: annotate implicit fall through ...
2019-03-08arch: simplify several early memory allocationsMike Rapoport1-2/+1
There are several early memory allocations in arch/ code that use memblock_phys_alloc() to allocate memory, convert the returned physical address to the virtual address and then set the allocated memory to zero. Exactly the same behaviour can be achieved simply by calling memblock_alloc(): it allocates the memory in the same way as memblock_phys_alloc(), then it performs the phys_to_virt() conversion and clears the allocated memory. Replace the longer sequence with a simpler call to memblock_alloc(). Link: http://lkml.kernel.org/r/1546248566-14910-6-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <michal.simek@xilinx.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Mackerras <paulus@samba.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-08powerpc: prefer memblock APIs returning virtual addressMike Rapoport2-6/+17
Patch series "memblock: simplify several early memory allocation", v4. These patches simplify some of the early memory allocations by replacing usage of older memblock APIs with newer and shinier ones. Quite a few places in the arch/ code allocated memory using a memblock API that returns a physical address of the allocated area, then converted this physical address to a virtual one and then used memset(0) to clear the allocated range. More recent memblock APIs do all the three steps in one call and their usage simplifies the code. It's important to note that regardless of API used, the core allocation is nearly identical for any set of memblock allocators: first it tries to find a free memory with all the constraints specified by the caller and then falls back to the allocation with some or all constraints disabled. The first three patches perform the conversion of call sites that have exact requirements for the node and the possible memory range. The fourth patch is a bit one-off as it simplifies openrisc's implementation of pte_alloc_one_kernel(), and not only the memblock usage. The fifth patch takes care of simpler cases when the allocation can be satisfied with a simple call to memblock_alloc(). The sixth patch removes one-liner wrappers for memblock_alloc on arm and unicore32, as suggested by Christoph. This patch (of 6): There are a several places that allocate memory using memblock APIs that return a physical address, convert the returned address to the virtual address and frequently also memset(0) the allocated range. Update these places to use memblock allocators already returning a virtual address. Use memblock functions that clear the allocated memory instead of calling memset(0) where appropriate. The calls to memblock_alloc_base() that were not followed by memset(0) are replaced with memblock_alloc_try_nid_raw(). Since the latter does not panic() when the allocation fails, the appropriate panic() calls are added to the call sites. Link: http://lkml.kernel.org/r/1546248566-14910-2-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Greentime Hu <green.hu@gmail.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Mark Salter <msalter@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07Merge tag 'powerpc-5.1-1' of ↵Linus Torvalds39-850/+556
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Notable changes: - Enable THREAD_INFO_IN_TASK to move thread_info off the stack. - A big series from Christoph reworking our DMA code to use more of the generic infrastructure, as he said: "This series switches the powerpc port to use the generic swiotlb and noncoherent dma ops, and to use more generic code for the coherent direct mapping, as well as removing a lot of dead code." - Increase our vmalloc space to 512T with the Hash MMU on modern CPUs, allowing us to support machines with larger amounts of total RAM or distance between nodes. - Two series from Christophe, one to optimise TLB miss handlers on 6xx, and another to optimise the way STRICT_KERNEL_RWX is implemented on some 32-bit CPUs. - Support for KCOV coverage instrumentation which means we can run syzkaller and discover even more bugs in our code. And as always many clean-ups, reworks and minor fixes etc. Thanks to: Alan Modra, Alexey Kardashevskiy, Alistair Popple, Andrea Arcangeli, Andrew Donnellan, Aneesh Kumar K.V, Aravinda Prasad, Balbir Singh, Brajeswar Ghosh, Breno Leitao, Christian Lamparter, Christian Zigotzky, Christophe Leroy, Christoph Hellwig, Corentin Labbe, Daniel Axtens, David Gibson, Diana Craciun, Firoz Khan, Gustavo A. R. Silva, Igor Stoppa, Joe Lawrence, Joel Stanley, Jonathan Neuschäfer, Jordan Niethe, Laurent Dufour, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Masahiro Yamada, Mathieu Malaterre, Matteo Croce, Meelis Roos, Michael W. Bringmann, Nathan Chancellor, Nathan Fontenot, Nicholas Piggin, Nick Desaulniers, Nicolai Stange, Oliver O'Halloran, Paul Mackerras, Peter Xu, PrasannaKumar Muralidharan, Qian Cai, Rashmica Gupta, Reza Arbab, Robert P. J. Day, Russell Currey, Sabyasachi Gupta, Sam Bobroff, Sandipan Das, Sergey Senozhatsky, Souptick Joarder, Stewart Smith, Tyrel Datwyler, Vaibhav Jain, YueHaibing" * tag 'powerpc-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (200 commits) powerpc/32: Clear on-stack exception marker upon exception return powerpc: Remove export of save_stack_trace_tsk_reliable() powerpc/mm: fix "section_base" set but not used powerpc/mm: Fix "sz" set but not used warning powerpc/mm: Check secondary hash page table powerpc: remove nargs from __SYSCALL powerpc/64s: Fix unrelocated interrupt trampoline address test powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables powerpc/fsl: Fix the flush of branch predictor. powerpc/powernv: Make opal log only readable by root powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc powerpc/powernv: move OPAL call wrapper tracing and interrupt handling to C powerpc/64s: Fix data interrupts vs d-side MCE reentrancy powerpc/64s: Prepare to handle data interrupts vs d-side MCE reentrancy powerpc/64s: system reset interrupt preserve HSRRs powerpc/64s: Fix HV NMI vs HV interrupt recoverability test powerpc/mm/hash: Handle mmap_min_addr correctly in get_unmapped_area topdown search powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback selftests/powerpc: Remove duplicate header powerpc sstep: Add support for modsd, modud instructions ...
2019-03-07Merge tag 'driver-core-5.1-rc1' of ↵Linus Torvalds1-6/+4
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the big driver core patchset for 5.1-rc1 More patches than "normal" here this merge window, due to some work in the driver core by Alexander Duyck to rework the async probe functionality to work better for a number of devices, and independant work from Rafael for the device link functionality to make it work "correctly". Also in here is: - lots of BUS_ATTR() removals, the macro is about to go away - firmware test fixups - ihex fixups and simplification - component additions (also includes i915 patches) - lots of minor coding style fixups and cleanups. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (65 commits) driver core: platform: remove misleading err_alloc label platform: set of_node in platform_device_register_full() firmware: hardcode the debug message for -ENOENT driver core: Add missing description of new struct device_link field driver core: Fix PM-runtime for links added during consumer probe drivers/component: kerneldoc polish async: Add cmdline option to specify drivers to be async probed driver core: Fix possible supplier PM-usage counter imbalance PM-runtime: Fix __pm_runtime_set_status() race with runtime resume driver: platform: Support parsing GpioInt 0 in platform_get_irq() selftests: firmware: fix verify_reqs() return value Revert "selftests: firmware: remove use of non-standard diff -Z option" Revert "selftests: firmware: add CONFIG_FW_LOADER_USER_HELPER_FALLBACK to config" device: Fix comment for driver_data in struct device kernfs: Allocating memory for kernfs_iattrs with kmem_cache. sysfs: remove unused include of kernfs-internal.h driver core: Postpone DMA tear-down until after devres release driver core: Document limitation related to DL_FLAG_RPM_ACTIVE PM-runtime: Take suppliers into account in __pm_runtime_set_status() device.h: Add __cold to dev_<level> logging functions ...
2019-03-07Merge tag 'char-misc-5.1-rc1' of ↵Linus Torvalds8-13/+23
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here is the big char/misc driver patch pull request for 5.1-rc1. The largest thing by far is the new habanalabs driver for their AI accelerator chip. For now it is in the drivers/misc directory but will probably move to a new directory soon along with other drivers of this type. Other than that, just the usual set of individual driver updates and fixes. There's an "odd" merge in here from the DRM tree that they asked me to do as the MEI driver is starting to interact with the i915 driver, and it needed some coordination. All of those patches have been properly acked by the relevant subsystem maintainers. All of these have been in linux-next with no reported issues, most for quite some time" * tag 'char-misc-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (219 commits) habanalabs: adjust Kconfig to fix build errors habanalabs: use %px instead of %p in error print habanalabs: use do_div for 64-bit divisions intel_th: gth: Fix an off-by-one in output unassigning habanalabs: fix little-endian<->cpu conversion warnings habanalabs: use NULL to initialize array of pointers habanalabs: fix little-endian<->cpu conversion warnings habanalabs: soft-reset device if context-switch fails habanalabs: print pointer using %p habanalabs: fix memory leak with CBs with unaligned size habanalabs: return correct error code on MMU mapping failure habanalabs: add comments in uapi/misc/habanalabs.h habanalabs: extend QMAN0 job timeout habanalabs: set DMA0 completion to SOB 1007 habanalabs: fix validation of WREG32 to DMA completion habanalabs: fix mmu cache registers init habanalabs: disable CPU access on timeouts habanalabs: add MMU DRAM default page mapping habanalabs: Dissociate RAZWI info from event types misc/habanalabs: adjust Kconfig to fix build errors ...
2019-03-06mm: replace all open encodings for NUMA_NO_NODEAnshuman Khandual1-2/+3
Patch series "Replace all open encodings for NUMA_NO_NODE", v3. All these places for replacement were found by running the following grep patterns on the entire kernel code. Please let me know if this might have missed some instances. This might also have replaced some false positives. I will appreciate suggestions, inputs and review. 1. git grep "nid == -1" 2. git grep "node == -1" 3. git grep "nid = -1" 4. git grep "node = -1" This patch (of 2): At present there are multiple places where invalid node number is encoded as -1. Even though implicitly understood it is always better to have macros in there. Replace these open encodings for an invalid node number with the global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like 'invalid node' from various places redirecting them to a common definition. Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe] Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx] Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband] Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-02powerpc: remove nargs from __SYSCALLFiroz Khan1-1/+1
The __SYSCALL macro's arguments are system call number, system call entry name and number of arguments for the system call. Argument- nargs in __SYSCALL(nr, entry, nargs) is neither calculated nor used anywhere. So it would be better to keep the implementaion as __SYSCALL(nr, entry). This will unifies the implementation with some other architetures too. Signed-off-by: Firoz Khan <firoz.khan@linaro.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-28powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tablesAlexey Kardashevskiy2-2/+6
We store 2 multilevel tables in iommu_table - one for the hardware and one with the corresponding userspace addresses. Before allocating the tables, the iommu_table_group_ops::get_table_size() hook returns the combined size of the two and VFIO SPAPR TCE IOMMU driver adjusts the locked_vm counter correctly. When the table is actually allocated, the amount of allocated memory is stored in iommu_table::it_allocated_size and used to decrement the locked_vm counter when we release the memory used by the table; .get_table_size() and .create_table() calculate it independently but the result is expected to be the same. However the allocator does not add the userspace table size to .it_allocated_size so when we destroy the table because of VFIO PCI unplug (i.e. VFIO container is gone but the userspace keeps running), we decrement locked_vm by just a half of size of memory we are releasing. To make things worse, since we enabled on-demand allocation of indirect levels, it_allocated_size contains only the amount of memory actually allocated at the table creation time which can just be a fraction. It is not a problem with incrementing locked_vm (as get_table_size() value is used) but it is with decrementing. As the result, we leak locked_vm and may not be able to allocate more IOMMU tables after few iterations of hotplug/unplug. This sets it_allocated_size in the pnv_pci_ioda2_ops::create_table() hook to what pnv_pci_ioda2_get_table_size() returns so from now on we have a single place which calculates the maximum memory a table can occupy. The original meaning of it_allocated_size is somewhat lost now though. We do not ditch it_allocated_size whatsoever here and we do not call get_table_size() from vfio_iommu_spapr_tce.c when decrementing locked_vm as we may have multiple IOMMU groups per container and even though they all are supposed to have the same get_table_size() implementation, there is a small chance for failure or confusion. Fixes: 090bad39b237 ("powerpc/powernv: Add indirect levels to it_userspace") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-27powerpc/powernv: Make opal log only readable by rootJordan Niethe1-1/+1
Currently the opal log is globally readable. It is kernel policy to limit the visibility of physical addresses / kernel pointers to root. Given this and the fact the opal log may contain this information it would be better to limit the readability to root. Fixes: bfc36894a48b ("powerpc/powernv: Add OPAL message log interface") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Reviewed-by: Stewart Smith <stewart@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-26powerpc/powernv: move OPAL call wrapper tracing and interrupt handling to CNicholas Piggin3-308/+324
The OPAL call wrapper gets interrupt disabling wrong. It disables interrupts just by clearing MSR[EE], which has two problems: - It doesn't call into the IRQ tracing subsystem, which means tracing across OPAL calls does not always notice IRQs have been disabled. - It doesn't go through the IRQ soft-mask code, which causes a minor bug. MSR[EE] can not be restored by saving the MSR then clearing MSR[EE], because a racing interrupt while soft-masked could clear MSR[EE] between the two steps. This can cause MSR[EE] to be incorrectly enabled when the OPAL call returns. Fortunately that should only result in another masked interrupt being taken to disable MSR[EE] again, but it's a bit sloppy. The existing code also saves MSR to PACA, which is not re-entrant if there is a nested OPAL call from different MSR contexts, which can happen these days with SRESET interrupts on bare metal. To fix these issues, move the tracing and IRQ handling code to C, and call into asm just for the low level call when everything is ready to go. Save the MSR on stack rather than PACA. Performance cost is kept to a minimum with a few optimisations: - The endian switch upon return is combined with the MSR restore, which avoids an expensive context synchronizing operation for LE kernels. This makes up for the additional mtmsrd to enable interrupts with local_irq_enable(). - blr is now used to return from the opal_* functions that are called as C functions, to avoid link stack corruption. This requires a skiboot fix as well to keep the call stack balanced. A NULL call is more costly after this, (410ns->430ns on POWER9), but OPAL calls are generally not performance critical at this scale. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-23Merge tag 'powerpc-5.0-6' of ↵Linus Torvalds2-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fix from Michael Ellerman: "One fix for an oops when using SRIOV, introduced by the recent changes to support compound IOMMU groups. Thanks to Alexey Kardashevskiy" * tag 'powerpc-5.0-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/powernv/sriov: Register IOMMU groups for VFs
2019-02-23powerpc/wii: remove wii_mmu_mapin_mem2()Christophe Leroy1-28/+0
wii_mmu_mapin_mem2() is not used anymore, remove it. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-23powerpc/wii: properly disable use of BATs when requested.Christophe Leroy1-0/+4
'nobats' kernel parameter or some options like CONFIG_DEBUG_PAGEALLOC deny the use of BATS for mapping memory. This patch makes sure that the specific wii RAM mapping function takes it into account as well. Fixes: de32400dd26e ("wii: use both mem1 and mem2 as ram") Cc: stable@vger.kernel.org Reviewed-by: Jonathan Neuschafer <j.neuschaefer@gmx.net> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21powerpc/83xx: Also save/restore SPRG4-7 during suspendChristophe Leroy1-7/+27
The 83xx has 8 SPRG registers and uses at least SPRG4 for DTLB handling LRU. Fixes: 2319f1239592 ("powerpc/mm: e300c2/c3/c4 TLB errata workaround") Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exitPaul Mackerras2-25/+27
Commit 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug", 2017-07-21) added two calls to opal_slw_set_reg() inside pnv_cpu_offline(), with the aim of changing the LPCR value in the SLW image to disable wakeups from the decrementer while a CPU is offline. However, pnv_cpu_offline() gets called each time a secondary CPU thread is woken up to participate in running a KVM guest, that is, not just when a CPU is offlined. Since opal_slw_set_reg() is a very slow operation (with observed execution times around 20 milliseconds), this means that an offline secondary CPU can often be busy doing the opal_slw_set_reg() call when the primary CPU wants to grab all the secondary threads so that it can run a KVM guest. This leads to messages like "KVM: couldn't grab CPU n" being printed and guest execution failing. There is no need to reprogram the SLW image on every KVM guest entry and exit. So that we do it only when a CPU is really transitioning between online and offline, this moves the calls to pnv_program_cpu_hotplug_lpcr() into pnv_smp_cpu_kill_self(). Fixes: 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21powerpc/pseries: export timebase register sample in lparcfgTyrel Datwyler1-0/+1
The Processor Utilzation of Resource Registers (PURR) provide an estimate of resources used by a cpu thread. Section 7.6 in Book III of the ISA outlines how to calculate the percentage of shared resources for threads using the ratio of the PURR delta and Timebase Register delta for a sampled period. This calculation is currently done erroneously by the lparstat tool from the powerpc-utils package. This patch exports the current timebase value after we sample the PURRs and exposes it to userspace accounting tools via /proc/ppc64/lparcfg. Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>