summaryrefslogtreecommitdiff
path: root/arch/powerpc/include/asm/book3s
AgeCommit message (Collapse)AuthorFilesLines
2018-02-02Merge tag 'powerpc-4.16-1' of ↵Linus Torvalds10-28/+234
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights: - Enable support for memory protection keys aka "pkeys" on Power7/8/9 when using the hash table MMU. - Extend our interrupt soft masking to support masking PMU interrupts as well as "normal" interrupts, and then use that to implement local_t for a ~4x speedup vs the current atomics-based implementation. - A new driver "ocxl" for "Open Coherent Accelerator Processor Interface (OpenCAPI)" devices. - Support for new device tree properties on PowerVM to describe hotpluggable memory and devices. - Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 64-bit VDSO. - Freescale updates from Scott: fixes for CPM GPIO and an FSL PCI erratum workaround, plus a minor cleanup patch. As well as quite a lot of other changes all over the place, and small fixes and cleanups as always. Thanks to: Alan Modra, Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andreas Schwab, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual, Anton Blanchard, Arnd Bergmann, Balbir Singh, Benjamin Herrenschmidt, Bhaktipriya Shridhar, Bryant G. Ly, Cédric Le Goater, Christophe Leroy, Christophe Lombard, Cyril Bur, David Gibson, Desnes A. Nunes do Rosario, Dmitry Torokhov, Frederic Barrat, Geert Uytterhoeven, Guilherme G. Piccoli, Gustavo A. R. Silva, Gustavo Romero, Ivan Mikhaylov, Joakim Tjernlund, Joe Perches, Josh Poimboeuf, Juan J. Alvarez, Julia Cartwright, Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre, Michael Bringmann, Michael Hanselmann, Michael Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Philippe Bergheaud, Ram Pai, Russell Currey, Santosh Sivaraj, Scott Wood, Seth Forshee, Simon Guo, Stewart Smith, Sukadev Bhattiprolu, Thiago Jung Bauermann, Vaibhav Jain, Vasyl Gomonovych" * tag 'powerpc-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (199 commits) powerpc/mm/radix: Fix build error when RADIX_MMU=n macintosh/ams-input: Use true and false for boolean values macintosh: change some data types from int to bool powerpc/watchdog: Print the NIP in soft_nmi_interrupt() powerpc/watchdog: regs can't be null in soft_nmi_interrupt() powerpc/watchdog: Tweak watchdog printks powerpc/cell: Remove axonram driver rtc-opal: Fix handling of firmware error codes, prevent busy loops powerpc/mpc52xx_gpt: make use of raw_spinlock variants macintosh/adb: Properly mark continued kernel messages powerpc/pseries: Fix cpu hotplug crash with memoryless nodes powerpc/numa: Ensure nodes initialized for hotplug powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes powerpc/kernel: Block interrupts when updating TIDR powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn powerpc/mm/nohash: do not flush the entire mm when range is a single page powerpc/pseries: Add Initialization of VF Bars powerpc/pseries/pci: Associate PEs to VFs in configure SR-IOV powerpc/eeh: Add EEH notify resume sysfs powerpc/eeh: Add EEH operations to notify resume ...
2018-02-01mm/thp: remove pmd_huge_split_prepare()Aneesh Kumar K.V4-19/+0
Instead of marking the pmd ready for split, invalidate the pmd. This should take care of powerpc requirement. Only side effect is that we mark the pmd invalid early. This can result in us blocking access to the page a bit longer if we race against a thp split. [kirill.shutemov@linux.intel.com: rebased, dirty THP once] Link: http://lkml.kernel.org/r/20171213105756.69879-13-kirill.shutemov@linux.intel.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Daney <david.daney@cavium.com> Cc: David Miller <davem@davemloft.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitin Gupta <nitin.m.gupta@oracle.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01powerpc/mm: update pmdp_invalidate to return old pmd valueAneesh Kumar K.V1-2/+2
It's required to avoid losing dirty and accessed bits. Link: http://lkml.kernel.org/r/20171213105756.69879-7-kirill.shutemov@linux.intel.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-30powerpc/mm/radix: Fix build error when RADIX_MMU=nMichael Ellerman1-0/+4
The recent TLB flush rework broke the build when the Radix MMU is disabled at build time, eg: (.text+0x264): undefined reference to `.radix__tlbiel_all' We could add an empty version, but if we ever called it by accident that would indicate a bad bug, so add a stub that just WARNs if we do. Fixes: d4748276ae14 ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: check key protection for user page accessRam Pai1-1/+9
Make sure that the kernel does not access user pages without checking their key-protection. Signed-off-by: Ram Pai <linuxram@us.ibm.com> [mpe: Integrate with upstream version of pte_access_permitted()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: helper to validate key-access permissions of a pteRam Pai1-0/+2
helper function that checks if the read/write/execute is allowed on the pte. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: Program HPTE key protection bitsRam Pai1-0/+5
Map the PTE protection key bits to the HPTE key protection bits, while creating HPTE entries. Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: map vma key-protection bits to pte key bits.Ram Pai1-1/+24
Map the key protection bits of the vma to the pkey bits in the PTE. The PTE bits used for pkey are 3,4,5,6 and 57. The first four bits are the same four bits that were freed up initially in this patch series. remember? :-) Without those four bits this patch wouldn't be possible. BUT, on 4k kernel, bit 3, and 4 could not be freed up. remember? Hence we have to be satisfied with 5, 6 and 7. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: introduce execute-only pkeyRam Pai1-0/+1
This patch provides the implementation of execute-only pkey. The architecture-independent layer expects the arch-dependent layer, to support the ability to create and enable a special key which has execute-only permission. Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: track allocation status of all pkeysRam Pai1-0/+9
Total 32 keys are available on power7 and above. However pkey 0,1 are reserved. So effectively we have 30 pkeys. On 4K kernels, we do not have 5 bits in the PTE to represent all the keys; we only have 3bits. Two of those keys are reserved; pkey 0 and pkey 1. So effectively we have 6 pkeys. This patch keeps track of reserved keys, allocated keys and keys that are currently free. Also it adds skeletal functions and macros, that the architecture-independent code expects to be available. Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19powerpc/64s: Fix ps3 build error due to tlbiel_all()Michael Ellerman1-0/+4
The recent changes to TLB handling broke the PS3 build: arch/powerpc/include/asm/book3s/64/tlbflush.h:30: undefined reference to `.hash__tlbiel_all' Fix it by adding an fallback version of tlbiel_all() for non-native builds. It should never be called, due to checks in callers so it calls BUG(). We should probably clean it up further but this will suffice for now. Fixes: d4748276ae14 ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-17powerpc/64s: Improve local TLB flush for boot and MCE on POWER9Nicholas Piggin3-0/+38
There are several cases outside the normal address space management where a CPU's entire local TLB is to be flushed: 1. Booting the kernel, in case something has left stale entries in the TLB (e.g., kexec). 2. Machine check, to clean corrupted TLB entries. One other place where the TLB is flushed, is waking from deep idle states. The flush is a side-effect of calling ->cpu_restore with the intention of re-setting various SPRs. The flush itself is unnecessary because in the first case, the TLB should not acquire new corrupted TLB entries as part of sleep/wake (though they may be lost). This type of TLB flush is coded inflexibly, several times for each CPU type, and they have a number of problems with ISA v3.0B: - The current radix mode of the MMU is not taken into account, it is always done as a hash flushn For IS=2 (LPID-matching flush from host) and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if the R field does not match the current radix mode. - ISA v3.0B hash must flush the partition and process table caches as well. - ISA v3.0B radix must flush partition and process scoped translations, partition and process table caches, and also the page walk cache. So consolidate the flushing code and implement it in C and inline asm under the mm/ directory with the rest of the flush code. Add ISA v3.0B cases for radix and hash, and use the radix flush in radix environment. Provide a way for IS=2 (LPID flush) to specify the radix mode of the partition. Have KVM pass in the radix mode of the guest. Take out the flushes from early cputable/dt_cpu_ftrs detection hooks, and move it later in the boot process after, the MMU registers are set up and before relocation is first turned on. The TLB flush is no longer called when restoring from deep idle states. This was not be done as a separate step because booting secondaries uses the same cpu_restore as idle restore, which needs the TLB flush. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16powerpc/mm: Introduce _PAGE_NAChristophe Leroy1-0/+1
Today, PAGE_NONE is defined as a page not having _PAGE_USER. In some circunstances, when the CPU supports it, it might be better to be able to flag a page with NO ACCESS. In a following patch, the 8xx will switch user access being flagged in the PMD, therefore it will not be possible anymore to use _PAGE_USER as a way to flag a page with no access. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16powerpc/mm: extend _PAGE_PRIVILEGED to all CPUsChristophe Leroy1-1/+1
commit ac29c64089b74 ("powerpc/mm: Replace _PAGE_USER with _PAGE_PRIVILEGED") introduced _PAGE_PRIVILEGED for BOOK3S/64 This patch generalises _PAGE_PRIVILEGED for all CPUs, allowing to have either _PAGE_PRIVILEGED or _PAGE_USER or both. PPC_8xx has a _PAGE_SHARED flag which is set for and only for all non user pages. Lets rename it _PAGE_PRIVILEGED to remove confusion as it has nothing to do with Linux shared pages. On BookE, there's a _PAGE_BAP_SR which has to be set for kernel pages: defining _PAGE_PRIVILEGED as _PAGE_BAP_SR will make this generic Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-22powerpc/mm: Add proper pte access check helper for other platformsAneesh Kumar K.V1-0/+23
pte_access_premitted get called in get_user_pages_fast path. If we have marked the pte PROT_NONE, we should not allow a read access on the address. With the current implementation we are not checking the READ and only check for WRITE. This is needed on archs like ppc64 that implement PROT_NONE using _PAGE_USER access instead of _PAGE_PRESENT. Also add pte_user check just to make sure we are not accessing kernel mapping. Even though there is code duplication, keeping the low level pte accessors different for different platforms helps in code readability. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-22powerpc/mm/book3s/64: Add proper pte access check helperAneesh Kumar K.V1-0/+41
pte_access_premitted get called in get_user_pages_fast path. If we have marked the pte PROT_NONE, we should not allow a read access on the address. With the current implementation we are not checking the READ and only check for WRITE. This is needed on archs like ppc64 that implement PROT_NONE using RWX access instead of _PAGE_PRESENT. Also add pte_user check just to make sure we are not accessing kernel mapping. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20powerpc: Swizzle around 4K PTE bits to free up bit 5 and bit 6Ram Pai3-4/+5
We need PTE bits 3 ,4, 5, 6 and 57 to support protection-keys, because these are the bits we want to consolidate on across all configuration to support protection keys. Bit 3,4,5 and 6 are currently used on 4K-pte kernels. But bit 9 and 10 are available. Hence we use the two available bits and free up bit 5 and 6. We will still not be able to free up bit 3 and 4. In the absence of any other free bits, we will have to stay satisfied with what we have :-(. This means we will not be able to support 32 protection keys, but only 8. The bit numbers are big-endian as defined in the ISA3.0 This patch does the following change to 4K PTE. H_PAGE_F_SECOND (S) which occupied bit 4 moves to bit 7. H_PAGE_F_GIX (G,I,X) which occupied bit 5, 6 and 7 also moves to bit 8,9, 10 respectively. H_PAGE_HASHPTE (H) which occupied bit 8 moves to bit 4. Before the patch, the 4k PTE format was as follows 0 1 2 3 4 5 6 7 8 9 10....................57.....63 : : : : : : : : : : : : : v v v v v v v v v v v v v ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-, |x|x|x|B|S |G |I |X |H| | |x|x|................| |x|x|x| '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_' After the patch, the 4k PTE format is as follows 0 1 2 3 4 5 6 7 8 9 10....................57.....63 : : : : : : : : : : : : : v v v v v v v v v v v v v ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-, |x|x|x|B|H | | |S |G|I|X|x|x|................| |.|.|.| '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_' The patch has no code changes; just swizzles around bits. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20powerpc: shifted-by-one hidx valueRam Pai1-3/+11
0xf is considered invalid hidx value. It indicates absence of a backing HPTE. A PTE is initialized to 0xf either a) when it is new it is newly allocated to hold 4k-backing-HPTE or b) Any time it gets demoted to a 4k-backing-HPTE This patch shifts the representation by one-modulo-0xf; i.e hidx 0 is represented as 1, 1 as 2,... , and 0xf as 0. This convention lets us initialize the secondary-part of the PTE to all zeroes. PTEs are anyway zero'd when allocated. We do not have to zero them again; thus saving on the initialization. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20powerpc: Free up four 64K PTE bits in 64K backed HPTE pagesRam Pai3-19/+17
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 in the 64K backed HPTE pages. This along with the earlier patch will entirely free up the four bits from 64K PTE. The bit numbers are big-endian as defined in the ISA3.0 This patch does the following change to 64K PTE backed by 64K HPTE. H_PAGE_F_SECOND (S) which occupied bit 4 moves to the second part of the pte to bit 60. H_PAGE_F_GIX (G,I,X) which occupied bit 5, 6 and 7 also moves to the second part of the pte to bit 61, 62, 63, 64 respectively since bit 7 is now freed up, we move H_PAGE_BUSY (B) from bit 9 to bit 7. The second part of the PTE will hold (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63. NOTE: None of the bits in the secondary PTE were not used by 64k-HPTE backed PTE. Before the patch, the 64K HPTE backed 64k PTE format was as follows 0 1 2 3 4 5 6 7 8 9 10...........................63 : : : : : : : : : : : : v v v v v v v v v v v v ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-, |x|x|x| |S |G |I |X |x|B| |x|x|................|x|x|x|x| <- primary pte '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_' | | | | | | | | | | | | |..................| | | | | <- secondary pte '_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_' After the patch, the 64k HPTE backed 64k PTE format is as follows 0 1 2 3 4 5 6 7 8 9 10...........................63 : : : : : : : : : : : : v v v v v v v v v v v v ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-, |x|x|x| | | | |B |x| | |x|x|................|.|.|.|.| <- primary pte '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_' | | | | | | | | | | | | |..................|S|G|I|X| <- secondary pte '_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_' The above PTE changes is applicable to hugetlbpages aswell. The patch does the following code changes: a) moves the H_PAGE_F_SECOND and H_PAGE_F_GIX to 4k PTE header since it is no more needed b the 64k PTEs. b) abstracts out __real_pte() and __rpte_to_hidx() so the caller need not know the bit location of the slot. c) moves the slot bits to the secondary pte. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20powerpc: Free up four 64K PTE bits in 4K backed HPTE pagesRam Pai3-7/+5
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6, in the 4K backed HPTE pages.These bits continue to be used for 64K backed HPTE pages in this patch, but will be freed up in the next patch. The bit numbers are big-endian as defined in the ISA3.0 The patch does the following change to the 4k HTPE backed 64K PTE's format. H_PAGE_BUSY moves from bit 3 to bit 9 (B bit in the figure below) V0 which occupied bit 4 is not used anymore. V1 which occupied bit 5 is not used anymore. V2 which occupied bit 6 is not used anymore. V3 which occupied bit 7 is not used anymore. Before the patch, the 4k backed 64k PTE format was as follows 0 1 2 3 4 5 6 7 8 9 10...........................63 : : : : : : : : : : : : v v v v v v v v v v v v ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-, |x|x|x|B|V0|V1|V2|V3|x| | |x|x|................|x|x|x|x| <- primary pte '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_' |S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte '_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_' After the patch, the 4k backed 64k PTE format is as follows 0 1 2 3 4 5 6 7 8 9 10...........................63 : : : : : : : : : : : : v v v v v v v v v v v v ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-, |x|x|x| | | | | |x|B| |x|x|................|.|.|.|.| <- primary pte '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_' |S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte '_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_' the four bits S,G,I,X (one quadruplet per 4k HPTE) that cache the hash-bucket slot value, is initialized to 1,1,1,1 indicating -- an invalid slot. If a HPTE gets cached in a 1111 slot(i.e 7th slot of secondary hash bucket), it is released immediately. In other words, even though 1111 is a valid slot value in the hash bucket, we consider it invalid and release the slot and the HPTE. This gives us the opportunity to determine the validity of S,G,I,X bits based on its contents and not on any of the bits V0,V1,V2 or V3 in the primary PTE When we release a HPTE cached in the 1111 slot we also release a legitimate slot in the primary hash bucket and unmap its corresponding HPTE. This is to ensure that we do get a HPTE cached in a slot of the primary hash bucket, the next time we retry. Though treating 1111 slot as invalid, reduces the number of available slots in the hash bucket and may have an effect on the performance, the probabilty of hitting a 1111 slot is extermely low. Compared to the current scheme, the above scheme reduces the number of false hash table updates significantly and has the added advantage of releasing four valuable PTE bits for other purpose. NOTE:even though bits 3, 4, 5, 6, 7 are not used when the 64K PTE is backed by 4k HPTE, they continue to be used if the PTE gets backed by 64k HPTE. The next patch will decouple that aswell, and truely release the bits. This idea was jointly developed by Paul Mackerras, Aneesh, Michael Ellermen and myself. 4K PTE format remains unchanged currently. The patch does the following code changes a) PTE flags are split between 64k and 4k header files. b) __hash_page_4K() is reimplemented to reflect the above logic. Acked-by: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20powerpc: introduce pte_get_hash_gslot() helperRam Pai1-0/+3
Introduce pte_get_hash_gslot()() which returns the global slot number of the HPTE in the global hash table. This function will come in handy as we work towards re-arranging the PTE bits in the later patches. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20powerpc: introduce pte_set_hidx() helperRam Pai2-0/+39
Introduce pte_set_hidx().It sets the (H_PAGE_F_SECOND|H_PAGE_F_GIX) bits at the appropriate location in the PTE of 4K PTE. For 64K PTE, it sets the bits in the second part of the PTE. Though the implementation for the former just needs the slot parameter, it does take some additional parameters to keep the prototype consistent. This function will be handy as we work towards re-arranging the bits in the subsequent patches. Acked-by: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-30mm: switch to 'define pmd_write' instead of __HAVE_ARCH_PMD_WRITEDan Williams1-1/+0
In response to compile breakage introduced by a series that added the pud_write helper to x86, Stephen notes: did you consider using the other paradigm: In arch include files: #define pud_write pud_write static inline int pud_write(pud_t pud) ..... Then in include/asm-generic/pgtable.h: #ifndef pud_write tatic inline int pud_write(pud_t pud) { .... } #endif If you had, then the powerpc code would have worked ... ;-) and many of the other interfaces in include/asm-generic/pgtable.h are protected that way ... Given that some architecture already define pmd_write() as a macro, it's a net reduction to drop the definition of __HAVE_ARCH_PMD_WRITE. Link: http://lkml.kernel.org/r/151129126721.37405.13339850900081557813.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Oliver OHalloran <oliveroh@au1.ibm.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-16Merge tag 'powerpc-4.15-1' of ↵Linus Torvalds5-2/+42
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "A bit of a small release, I suspect in part due to me travelling for KS. But my backlog of patches to review is smaller than usual, so I think in part folks just didn't send as much this cycle. Non-highlights: - Five fixes for the >128T address space handling, both to fix bugs in our implementation and to bring the semantics exactly into line with x86. Highlights: - Support for a new OPAL call on bare metal machines which gives us a true NMI (ie. is not masked by MSR[EE]=0) for debugging etc. - Support for Power9 DD2 in the CXL driver. - Improvements to machine check handling so that uncorrectable errors can be reported into the generic memory_failure() machinery. - Some fixes and improvements for VPHN, which is used under PowerVM to notify the Linux partition of topology changes. - Plumbing to enable TM (transactional memory) without suspend on some Power9 processors (PPC_FEATURE2_HTM_NO_SUSPEND). - Support for emulating vector loads form cache-inhibited memory, on some Power9 revisions. - Disable the fast-endian switch "syscall" by default (behind a CONFIG), we believe it has never had any users. - A major rework of the API drivers use when initiating and waiting for long running operations performed by OPAL firmware, and changes to the powernv_flash driver to use the new API. - Several fixes for the handling of FP/VMX/VSX while processes are using transactional memory. - Optimisations of TLB range flushes when using the radix MMU on Power9. - Improvements to the VAS facility used to access coprocessors on Power9, and related improvements to the way the NX crypto driver handles requests. - Implementation of PMEM_API and UACCESS_FLUSHCACHE for 64-bit. Thanks to: Alexey Kardashevskiy, Alistair Popple, Allen Pais, Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Balbir Singh, Benjamin Herrenschmidt, Breno Leitao, Christophe Leroy, Christophe Lombard, Cyril Bur, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Guilherme G. Piccoli, Gustavo Romero, Haren Myneni, Joel Stanley, Kamalesh Babulal, Kautuk Consul, Markus Elfring, Masami Hiramatsu, Michael Bringmann, Michael Neuling, Michal Suchanek, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud, Sandipan Das, Seth Forshee, Shriya, Stephen Rothwell, Stewart Smith, Sukadev Bhattiprolu, Tyrel Datwyler, Vaibhav Jain, Vaidyanathan Srinivasan, and William A. Kennington III" * tag 'powerpc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (151 commits) powerpc/64s: Fix Power9 DD2.0 workarounds by adding DD2.1 feature powerpc/64s: Fix masking of SRR1 bits on instruction fault powerpc/64s: mm_context.addr_limit is only used on hash powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary powerpc/64s/hash: Fix fork() with 512TB process address space powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation powerpc/64s/hash: Fix 512T hint detection to use >= 128T powerpc: Fix DABR match on hash based systems powerpc/signal: Properly handle return value from uprobe_deny_signal() powerpc/fadump: use kstrtoint to handle sysfs store powerpc/lib: Implement UACCESS_FLUSHCACHE API powerpc/lib: Implement PMEM API powerpc/powernv/npu: Don't explicitly flush nmmu tlb powerpc/powernv/npu: Use flush_all_mm() instead of flush_tlb_mm() powerpc/powernv/idle: Round up latency and residency values powerpc/kprobes: refactor kprobe_lookup_name for safer string operations powerpc/kprobes: Blacklist emulate_update_regs() from kprobes powerpc/kprobes: Do not disable interrupts for optprobes and kprobes_on_ftrace powerpc/kprobes: Disable preemption before invoking probe handler for optprobes ...
2017-11-13powerpc/64s: mm_context.addr_limit is only used on hashNicholas Piggin2-2/+2
Radix keeps no meaningful state in addr_limit, so remove it from radix code and rename to slb_addr_limit to make it clear it applies to hash only. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman20-0/+20
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-28powerpc/mm: Export flush_all_mm()Frederic Barrat3-0/+40
With the optimizations introduced by commit a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes"), flush_tlb_mm() no longer flushes the page walk cache (PWC) with radix. This patch introduces flush_all_mm(), which flushes everything, TLB and PWC, for a given mm. Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-By: Alistair Popple <alistair@popple.id.au> [mpe: Add a WARN_ON_ONCE() in the empty hash routines] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-09Merge tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-0/+1
Pull KVM updates from Radim Krčmář: "First batch of KVM changes for 4.14 Common: - improve heuristic for boosting preempted spinlocks by ignoring VCPUs in user mode ARM: - fix for decoding external abort types from guests - added support for migrating the active priority of interrupts when running a GICv2 guest on a GICv3 host - minor cleanup PPC: - expose storage keys to userspace - merge kvm-ppc-fixes with a fix that missed 4.13 because of vacations - fixes s390: - merge of kvm/master to avoid conflicts with additional sthyi fixes - wire up the no-dat enhancements in KVM - multiple epoch facility (z14 feature) - Configuration z/Architecture Mode - more sthyi fixes - gdb server range checking fix - small code cleanups x86: - emulate Hyper-V TSC frequency MSRs - add nested INVPCID - emulate EPTP switching VMFUNC - support Virtual GIF - support 5 level page tables - speedup nested VM exits by packing byte operations - speedup MMIO by using hardware provided physical address - a lot of fixes and cleanups, especially nested" * tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits) KVM: arm/arm64: Support uaccess of GICC_APRn KVM: arm/arm64: Extract GICv3 max APRn index calculation KVM: arm/arm64: vITS: Drop its_ite->lpi field KVM: arm/arm64: vgic: constify seq_operations and file_operations KVM: arm/arm64: Fix guest external abort matching KVM: PPC: Book3S HV: Fix memory leak in kvm_vm_ioctl_get_htab_fd KVM: s390: vsie: cleanup mcck reinjection KVM: s390: use WARN_ON_ONCE only for checking KVM: s390: guestdbg: fix range check KVM: PPC: Book3S HV: Report storage key support to userspace KVM: PPC: Book3S HV: Fix case where HDEC is treated as 32-bit on POWER9 KVM: PPC: Book3S HV: Fix invalid use of register expression KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validation KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTER KVM: PPC: e500mc: Fix a NULL dereference KVM: PPC: e500: Fix some NULL dereferences on error KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list KVM: s390: we are always in czam mode KVM: s390: expose no-DAT to guest and migration support KVM: s390: sthyi: remove invalid guest write access ...
2017-08-31KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTERRam Pai1-0/+1
In handling a H_ENTER hypercall, the code in kvmppc_do_h_enter clobbers the high-order two bits of the storage key, which is stored in a split field in the second doubleword of the HPTE. Any storage key number above 7 hence fails to operate correctly. This makes sure we preserve all the bits of the storage key. Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-23powerpc/mm: Optimize detection of thread local mm'sBenjamin Herrenschmidt1-0/+3
Instead of comparing the whole CPU mask every time, let's keep a counter of how many bits are set in the mask. Thus testing for a local mm only requires testing if that counter is 1 and the current CPU bit is set in the mask. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-23Merge branch 'fixes' into nextMichael Ellerman2-8/+17
There's a non-trivial dependency between some commits we want to put in next and the KVM prefetch work around that went into fixes. So merge fixes into next.
2017-08-17powerpc/mm: Don't send IPI to all cpus on THP updatesAneesh Kumar K.V1-0/+1
Now that we made sure that lockless walk of linux page table is mostly limitted to current task(current->mm->pgdir) we can update the THP update sequence to only send IPI to CPUs on which this task has run. This helps in reducing the IPI overload on systems with large number of CPUs. WRT kvm even though kvm is walking page table with vpc->arch.pgdir, it is done only on secondary CPUs and in that case we have primary CPU added to task's mm cpumask. Sending an IPI to primary will force the secondary to do a vm exit and hence this mm cpumask usage is safe here. WRT CAPI, we still end up walking linux page table with capi context MM. For now the pte lookup serialization sends an IPI to all CPUs in CPI is in use. We can further improve this by adding the CAPI interrupt handling CPU to task mm cpumask. That will be done in a later patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-16powerpc/mm/hugetlb: Allow runtime allocation of 16G.Aneesh Kumar K.V1-3/+1
Now that we have GIGANTIC_PAGE enabled on powerpc, use this for 16G hugepages with hash translation mode. Depending on the total system memory we have, we may be able to allocate 16G hugepages runtime. This also remove the hugetlb setup difference between hash/radix translation mode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-16powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel ↵Aneesh Kumar K.V1-1/+1
command line With commit aa888a74977a8 ("hugetlb: support larger than MAX_ORDER") we added support for allocating gigantic hugepages via kernel command line. Switch ppc64 arch specific code to use that. W.r.t FSL support, we now limit our allocation range using BOOTMEM_ALLOC_ACCESSIBLE. We use the kernel command line to do reservation of hugetlb pages on powernv platforms. On pseries hash mmu mode the supported gigantic huge page size is 16GB and that can only be allocated with hypervisor assist. For pseries the command line option doesn't do the allocation. Instead pseries does gigantic hugepage allocation based on hypervisor hint that is specified via "ibm,expected#pages" property of the memory node. Cc: Scott Wood <oss@buserror.net> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15powerpc/hugetlb: fix page rights verification in gup_hugepte()Christophe Leroy2-0/+6
gup_hugepte() checks if pages are present and readable, and when 'write' is set, also checks if the pages are writable. Initially this was done by checking if _PAGE_PRESENT and _PAGE_READ were set. In addition, _PAGE_WRITE was verified for write accesses. The problem is that we have to handle the three following cases: 1/ The target defines __PAGE_READ and __PAGE_WRITE 2/ The target defines __PAGE_RW 3/ The target defines __PAGE_RO In case 1/, this is obvious In case 2/, __PAGE_READ is defined as 0 and __PAGE_WRITE as __PAGE_RW so it works as well. But in case 3, __PAGE_RW is defined as 0, which means __PAGE_WRITE is 0 and then the test returns true (page writable) in all cases. A first correction was attempted in commit 6b8cb66a6a7cc ("powerpc: Fix usage of _PAGE_RO in hugepage"), but that fix is wrong: instead of checking that the page is writable when write is requested, it checks that the page is NOT writable when write is NOT requested. This patch adds a new pte_read() helper to check whether a page is readable or not. This avoids handling all possible cases in gup_hugepte(). Then gup_hugepte() is modified to use pte_present(), pte_read() and pte_write() instead of the raw flags. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15powerpc/mm: declare some local functions staticChristophe Leroy1-3/+0
get_pteptr() and __mapin_ram_chunk() are only used locally, so define them static Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15powerpc/mm/nohash: Move definition of PGALLOC_GFP to fix build errorsMichael Ellerman1-2/+0
In some obscure Book3E configs (randconfig) we can end up missing a definition for PGALLOC_GFP in pgtable_64.c. Fix it by moving the definition to asm/pgalloc.h. Fixes: de3b87611dd1 ("powerpc/mm/book(e)(3s)/64: Add page table accounting") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-08powerpc/mm/hash64: Make vmalloc 56T on hashMichael Ellerman1-2/+2
On 64-bit book3s, with the hash MMU, we currently define the kernel virtual space (vmalloc, ioremap etc.), to be 16T in size. This is a leftover from pre v3.7 when our user VM was also 16T. Of that 16T we split it 50/50, with half used for PCI IO and ioremap and the other 8T for vmalloc. We never bothered to make it any bigger because 8T of vmalloc ought to be enough for anybody. But it turns out that's not true, the per cpu allocator wants large amounts of vmalloc space, not to make large allocations, but to allow a large stride between allocations, because we use pcpu_embed_first_chunk(). With a bit of juggling we can increase the entire kernel virtual space to 64T. The only real complication is the check of the address in the SLB miss handler, see the comment in the code. Although we could continue to split virtual space 50/50 as we do now, no one seems to be running out of PCI IO or ioremap space. So instead keep that as 8T, and use the remaining 56T for vmalloc. In future we should be able to increase the kernel virtual space to 512T, the code already supports that, it just needs testing on older hardware. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2017-08-08powerpc/mm/book3s64: Make KERN_IO_START a variableMichael Ellerman3-1/+6
Currently KERN_IO_START is defined as: #define KERN_IO_START (KERN_VIRT_START + (KERN_VIRT_SIZE >> 1)) Although it looks like a constant, both the components are actually variables, to allow us to have a different value between Radix and Hash with a single kernel. However that still requires both Radix and Hash to place the kernel IO region at the same location relative to the start and end of the kernel virtual region (namely 1/2 way through it), and we'd like to change that. So split KERN_IO_START out into its own variable, and initialise it for Radix and Hash. In the medium term we should be able to reconsolidate this, by doing a more involved rearrangement of the location of the regions. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03powerpc: Remove old unused icswx based coprocessor supportBenjamin Herrenschmidt1-5/+0
We have a whole pile of unused code to maintain the ACOP register, allocate coprocessor PIDs and handle ACOP faults. This mechanism was used for the HFI adapter on POWER7 which is dead and gone and whose driver never went upstream. It was used on some A2 core based stuff that also never saw the light of day. Take out all that code. There is still some POWER8 coprocessor code that uses icswx but it's kernel only and thus doesn't use any of that infrastructure. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-02powerpc/mm/radix: Avoid flushing the PWC on every flush_tlb_rangeBenjamin Herrenschmidt1-0/+1
We do that because it's used by THP pmd collapsing, so use instead a dedicated flush function. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-02powerpc/mm/radix: Improve TLB/PWC flushesBenjamin Herrenschmidt1-3/+1
At the moment we have to rather sub-optimal flushing behaviours: - flush_tlb_mm() will flush the PWC which is unnecessary (for example when doing a fork) - A large unmap will call flush_tlb_pwc() multiple times causing us to perform that fairly expensive operation repeatedly. This happens often in batches of 3 on every new process. So we change flush_tlb_mm() to only flush the TLB, and we use the existing "need_flush_all" flag in struct mmu_gather to indicate that the PWC needs flushing. Unfortunately, flush_tlb_range() still needs to do a full flush for now as it's used by the THP collapsing. We will fix that later. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-31Merge tag 'v4.13-rc1' into fixesMichael Ellerman1-1/+1
The fixes branch is based off a random pre-rc1 commit, because we had some fixes that needed to go in before rc1 was released. However we now need to fix some code that went in after that point, but before rc1, so merge rc1 to get that code into fixes so we can fix it!
2017-07-28powerpc/mm: Fix pmd/pte_devmap() on non-leaf entriesOliver O'Halloran1-1/+9
The Radix MMU translation tree as defined in ISA v3.0 contains two different types of entry, directories and leaves. Leaves are identified by _PAGE_PTE being set. The formats of the two entries are different, with the directory entries containing no spare bits for use by software. In particular the bit we use for _PAGE_DEVMAP is not reserved for software, and is part of the NLB (Next Level Base) field, essentially the address of the next level in the tree. Note that the Linux pte_t is not == _PAGE_PTE. A huge page pmd entry (or devmap!) is also a leaf and so has _PAGE_PTE set, even though we use a pmd_t for it in Linux. The fix is to ensure that the pmd/pte_devmap() confirm they are looking at a leaf entry (_PAGE_PTE) as well as checking _PAGE_DEVMAP. Fixes: ebd31197931d ("powerpc/mm: Add devmap support for ppc64") Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Tested-by: Laurent Vivier <lvivier@redhat.com> Tested-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com> Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Add a comment in the code and flesh out change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-26powerpc/mm/radix: Workaround prefetch issue with KVMBenjamin Herrenschmidt1-7/+8
There's a somewhat architectural issue with Radix MMU and KVM. When coming out of a guest with AIL (Alternate Interrupt Location, ie, MMU enabled), we start executing hypervisor code with the PID register still containing whatever the guest has been using. The problem is that the CPU can (and will) then start prefetching or speculatively load from whatever host context has that same PID (if any), thus bringing translations for that context into the TLB, which Linux doesn't know about. This can cause stale translations and subsequent crashes. Fixing this in a way that is neither racy nor a huge performance impact is difficult. We could just make the host invalidations always use broadcast forms but that would hurt single threaded programs for example. We chose to fix it instead by partitioning the PID space between guest and host. This is possible because today Linux only use 19 out of the 20 bits of PID space, so existing guests will work if we make the host use the top half of the 20 bits space. We additionally add support for a property to indicate to Linux the size of the PID register which will be useful if we eventually have processors with a larger PID space available. There is still an issue with malicious guests purposefully setting the PID register to a value in the hosts PID range. Hopefully future HW can prevent that, but in the meantime, we handle it with a pair of kludges: - On the way out of a guest, before we clear the current VCPU in the PACA, we check the PID and if it's outside of the permitted range we flush the TLB for that PID. - When context switching, if the mm is "new" on that CPU (the corresponding bit was set for the first time in the mm cpumask), we check if any sibling thread is in KVM (has a non-NULL VCPU pointer in the PACA). If that is the case, we also flush the PID for that CPU (core). This second part is needed to handle the case where a process is migrated (or starts a new pthread) on a sibling thread of the CPU coming out of KVM, as there's a window where stale translations can exist before we detect it and flush them out. A future optimization could be added by keeping track of whether the PID has ever been used and avoid doing that for completely fresh PIDs. We could similarily mark PIDs that have been the subject of a global invalidation as "fresh". But for now this will do. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop unneeded include of kvm_book3s_asm.h] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-21Merge tag 'powerpc-4.13-3' of ↵Linus Torvalds3-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "A handful of fixes, mostly for new code: - some reworking of the new STRICT_KERNEL_RWX support to make sure we also remove executable permission from __init memory before it's freed. - a fix to some recent optimisations to the hypercall entry where we were clobbering r12, this was breaking nested guests (PR KVM). - a fix for the recent patch to opal_configure_cores(). This could break booting on bare metal Power8 boxes if the kernel was built without CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG. - .. and finally a workaround for spurious PMU interrupts on Power9 DD2. Thanks to: Nicholas Piggin, Anton Blanchard, Balbir Singh" * tag 'powerpc-4.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y powerpc/mm/hash: Refactor hash__mark_rodata_ro() powerpc/mm/radix: Refactor radix__mark_rodata_ro() powerpc/64s: Fix hypercall entry clobbering r12 input powerpc/perf: Avoid spurious PMU interrupts after idle powerpc/powernv: Fix boot on Power8 bare metal due to opal_configure_cores()
2017-07-18powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=yMichael Ellerman3-0/+3
Currently even with STRICT_KERNEL_RWX we leave the __init text marked executable after init, which is bad. Add a hook to mark it NX (no-execute) before we free it, and implement it for radix and hash. Note that we use __init_end as the end address, not _einittext, because overlaps_kernel_text() uses __init_end, because there are additional executable sections other than .init.text between __init_begin and __init_end. Tested on radix and hash with: 0:mon> p $__init_begin *** 400 exception occurred Fixes: 1e0fc9d1eb2b ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-13mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko1-1/+1
semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-07Merge tag 'powerpc-4.13-1' of ↵Linus Torvalds6-8/+67
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Support for STRICT_KERNEL_RWX on 64-bit server CPUs. - Platform support for FSP2 (476fpe) board - Enable ZONE_DEVICE on 64-bit server CPUs. - Generic & powerpc spin loop primitives to optimise busy waiting - Convert VDSO update function to use new update_vsyscall() interface - Optimisations to hypercall/syscall/context-switch paths - Improvements to the CPU idle code on Power8 and Power9. As well as many other fixes and improvements. Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter, Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown, Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek, Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung Bauermann, Yang Li" * tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits) powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix powerpc/mm/hash: Implement mark_rodata_ro() for hash powerpc/vmlinux.lds: Align __init_begin to 16M powerpc/lib/code-patching: Use alternate map for patch_instruction() powerpc/xmon: Add patch_instruction() support for xmon powerpc/kprobes/optprobes: Use patch_instruction() powerpc/kprobes: Move kprobes over to patch_instruction() powerpc/mm/radix: Fix execute permissions for interrupt_vectors powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp() powerpc/64s: Blacklist rtas entry/exit from kprobes powerpc/64s: Blacklist functions invoked on a trap powerpc/64s: Un-blacklist system_call() from kprobes powerpc/64s: Move system_call() symbol to just after setting MSR_EE powerpc/64s: Blacklist system_call() and system_call_common() from kprobes powerpc/64s: Convert .L__replay_interrupt_return to a local label powerpc64/elfv1: Only dereference function descriptor for non-text symbols cxl: Export library to support IBM XSL powerpc/dts: Use #include "..." to include local DT powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8 ...
2017-07-07powerpc/mm/hugetlb: add support for 1G huge pagesAneesh Kumar K.V1-0/+10
POWER9 supports hugepages of size 2M and 1G in radix MMU mode. This patch enables the usage of 1G page size for hugetlbfs. This also update the helper such we can do 1G page allocation at runtime. We still don't enable 1G page size on DD1 version. This is to avoid doing workaround mentioned in commit 6d3a0379ebdc ("powerpc/mm: Add radix__tlb_flush_pte_p9_dd1()"). Link: http://lkml.kernel.org/r/1494995292-4443-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>