summaryrefslogtreecommitdiff
path: root/arch/s390/mm
AgeCommit message (Collapse)AuthorFilesLines
2022-01-27gup: document and work around "COW can break either way" issueLinus Torvalds1-1/+8
commit 9bbd42e79720122334226afad9ddcac1c3e6d373 upstream. Doing a "get_user_pages()" on a copy-on-write page for reading can be ambiguous: the page can be COW'ed at any time afterwards, and the direction of a COW event isn't defined. Yes, whoever writes to it will generally do the COW, but if the thread that did the get_user_pages() unmapped the page before the write (and that could happen due to memory pressure in addition to any outright action), the writer could also just take over the old page instead. End result: the get_user_pages() call might result in a page pointer that is no longer associated with the original VM, and is associated with - and controlled by - another VM having taken it over instead. So when doing a get_user_pages() on a COW mapping, the only really safe thing to do would be to break the COW when getting the page, even when only getting it for reading. At the same time, some users simply don't even care. For example, the perf code wants to look up the page not because it cares about the page, but because the code simply wants to look up the physical address of the access for informational purposes, and doesn't really care about races when a page might be unmapped and remapped elsewhere. This adds logic to force a COW event by setting FOLL_WRITE on any copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page pointer as a result. The current semantics end up being: - __get_user_pages_fast(): no change. If you don't ask for a write, you won't break COW. You'd better know what you're doing. - get_user_pages_fast(): the fast-case "look it up in the page tables without anything getting mmap_sem" now refuses to follow a read-only page, since it might need COW breaking. Which happens in the slow path - the fast path doesn't know if the memory might be COW or not. - get_user_pages() (including the slow-path fallback for gup_fast()): for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with very similar semantics to FOLL_FORCE. If it turns out that we want finer granularity (ie "only break COW when it might actually matter" - things like the zero page are special and don't need to be broken) we might need to push these semantics deeper into the lookup fault path. So if people care enough, it's possible that we might end up adding a new internal FOLL_BREAK_COW flag to go with the internal FOLL_COW flag we already have for tracking "I had a COW". Alternatively, if it turns out that different callers might want to explicitly control the forced COW break behavior, we might even want to make such a flag visible to the users of get_user_pages() instead of using the above default semantics. But for now, this is mostly commentary on the issue (this commit message being a lot bigger than the patch, and that patch in turn is almost all comments), with that minimal "enable COW breaking early" logic using the existing FOLL_WRITE behavior. [ It might be worth noting that we've always had this ambiguity, and it could arguably be seen as a user-space issue. You only get private COW mappings that could break either way in situations where user space is doing cooperative things (ie fork() before an execve() etc), but it _is_ surprising and very subtle, and fork() is supposed to give you independent address spaces. So let's treat this as a kernel issue and make the semantics of get_user_pages() easier to understand. Note that obviously a true shared mapping will still get a page that can change under us, so this does _not_ mean that get_user_pages() somehow returns any "stable" page ] [surenb: backport notes Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_pgd_range. Removed FOLL_PIN usage in should_force_cow_break since it's missing in the earlier kernels.] Reported-by: Jann Horn <jannh@google.com> Tested-by: Christoph Hellwig <hch@lst.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kirill Shutemov <kirill@shutemov.name> Acked-by: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [surenb: backport to 4.19 kernel] Cc: stable@vger.kernel.org # 4.19.x Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 4.9: - Generic get_user_pages_fast() calls __get_user_pages_fast() here, so make it pass write=1 - Various architectures have their own implementations of get_user_pages_fast(), so apply the corresponding change there - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-26s390/gmap: don't unconditionally call pte_unmap_unlock() in __gmap_zap()David Hildenbrand1-2/+3
[ Upstream commit b159f94c86b43cf7e73e654bc527255b1f4eafc4 ] ... otherwise we will try unlocking a spinlock that was never locked via a garbage pointer. At the time we reach this code path, we usually successfully looked up a PGSTE already; however, evil user space could have manipulated the VMA layout in the meantime and triggered removal of the page table. Fixes: 1e133ab296f3 ("s390/mm: split arch/s390/mm/pgtable.c") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Link: https://lore.kernel.org/r/20210909162248.14969-3-david@redhat.com Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-22s390/mm: fix huge pte soft dirty copyingJanosch Frank1-1/+1
commit 528a9539348a0234375dfaa1ca5dbbb2f8f8e8d2 upstream. If the pmd is soft dirty we must mark the pte as soft dirty (and not dirty). This fixes some cases for guest migration with huge page backings. Cc: <stable@vger.kernel.org> # 4.8 Fixes: bc29b7ac1d9f ("s390/mm: clean up pte/pmd encoding") Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-24KVM: s390: vsie: Fix possible race when shadowing region 3 tablesDavid Hildenbrand1-0/+1
[ Upstream commit 1493e0f944f3c319d11e067c185c904d01c17ae5 ] We have to properly retry again by returning -EINVAL immediately in case somebody else instantiated the table concurrently. We missed to add the goto in this function only. The code now matches the other, similar shadowing functions. We are overwriting an existing region 2 table entry. All allocated pages are added to the crst_list to be freed later, so they are not lost forever. However, when unshadowing the region 2 table, we wouldn't trigger unshadowing of the original shadowed region 3 table that we replaced. It would get unshadowed when the original region 3 table is modified. As it's not connected to the page table hierarchy anymore, it's not going to get used anymore. However, for a limited time, this page table will stick around, so it's in some sense a temporary memory leak. Identified by manual code inspection. I don't think this classifies as stable material. Fixes: 998f637cc4b9 ("s390/mm: avoid races on region/segment/page table shadowing") Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200403153050.20569-4-david@redhat.com Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-24KVM: s390: vsie: Fix region 1 ASCE sanity shadow address checksDavid Hildenbrand1-1/+5
commit a1d032a49522cb5368e5dfb945a85899b4c74f65 upstream. In case we have a region 1 the following calculation (31 + ((gmap->asce & _ASCE_TYPE_MASK) >> 2)*11) results in 64. As shifts beyond the size are undefined the compiler is free to use instructions like sllg. sllg will only use 6 bits of the shift value (here 64) resulting in no shift at all. That means that ALL addresses will be rejected. The can result in endless loops, e.g. when prefix cannot get mapped. Fixes: 4be130a08420 ("s390/mm: add shadow gmap support") Tested-by: Janosch Frank <frankja@linux.ibm.com> Reported-by: Janosch Frank <frankja@linux.ibm.com> Cc: <stable@vger.kernel.org> # v4.8+ Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200403153050.20569-2-david@redhat.com Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> [borntraeger@de.ibm.com: fix patch description, remove WARN_ON_ONCE] Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-05mm, gup: add missing refcount overflow checks on x86 and s390Vlastimil Babka1-3/+6
The mainline commit 8fde12ca79af ("mm: prevent get_user_pages() from overflowing page refcount") was backported to 4.9.y stable as commit 2ed768cfd895. The backport however missed that in 4.9, there are several arch-specific gup.c versions with fast gup implementations, so these do not prevent refcount overflow. This is partially fixed for x86 in stable-only commit d73af79742e7 ("x86, mm, gup: prevent get_page() race with munmap in paravirt guest"). This stable-only commit adds missing parts to x86 version, as well as s390 version, both taken from the SUSE SLES/openSUSE 4.12-based kernels. The remaining architectures with own gup.c are sparc, mips, sh. It's unlikely the known overflow scenario based on FUSE, which needs 140GB of RAM, is a problem for those architectures, and I don't feel confident enough to patch them. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-06s390/cmm: fix information leak in cmm_timeout_handler()Yihui ZENG1-6/+6
commit b8e51a6a9db94bc1fb18ae831b3dab106b5a4b5f upstream. The problem is that we were putting the NUL terminator too far: buf[sizeof(buf) - 1] = '\0'; If the user input isn't NUL terminated and they haven't initialized the whole buffer then it leads to an info leak. The NUL terminator should be: buf[len - 1] = '\0'; Signed-off-by: Yihui Zeng <yzeng56@asu.edu> Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> [heiko.carstens@de.ibm.com: keep semantics of how *lenp and *ppos are handled] Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-01s390/mm: Check for valid vma before zapping in gmap_discardJanosch Frank1-0/+2
commit 1843abd03250115af6cec0892683e70cf2297c25 upstream. Userspace could have munmapped the area before doing unmapping from the gmap. This would leave us with a valid vmaddr, but an invalid vma from which we would try to zap memory. Let's check before using the vma. Fixes: 1e133ab296f3 ("s390/mm: split arch/s390/mm/pgtable.c") Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Message-Id: <20180816082432.78828-1-frankja@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-04s390/extmem: fix gcc 8 stringop-overflow warningVasily Gorbik1-2/+2
[ Upstream commit 6b2ddf33baec23dace85bd647e3fc4ac070963e8 ] arch/s390/mm/extmem.c: In function '__segment_load': arch/s390/mm/extmem.c:436:2: warning: 'strncat' specified bound 7 equals source length [-Wstringop-overflow=] strncat(seg->res_name, " (DCSS)", 7); What gcc complains about here is the misuse of strncat function, which in this case does not limit a number of bytes taken from "src", so it is in the end the same as strcat(seg->res_name, " (DCSS)"); Keeping in mind that a res_name is 15 bytes, strncat in this case would overflow the buffer and write 0 into alignment byte between the fields in the struct. To avoid that increasing res_name size to 16, and reusing strlcat. Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-04s390/mm: correct allocate_pgste proc_handler callbackVasily Gorbik1-1/+1
[ Upstream commit 5bedf8aa03c28cb8dc98bdd32a41b66d8f7d3eaa ] Since proc_dointvec does not perform value range control, proc_dointvec_minmax should be used to limit value range, which is clearly intended here, as the internal representation of the value: unsigned int alloc_pgste:1; In fact it currently works, since we have mm->context.alloc_pgste = page_table_allocate_pgste || ... ... since commit 23fefe119ceb5 ("s390/kvm: avoid global config of vm.alloc_pgste=1") Before that it was mm->context.alloc_pgste = page_table_allocate_pgste; which was broken. That was introduced with commit 0b46e0a3ec0d7 ("s390/kvm: remove delayed reallocation of page tables for KVM"). Fixes: 0b46e0a3ec0d7 ("s390/kvm: remove delayed reallocation of page tables for KVM") Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05s390/kvm: fix deadlock when killed by oomClaudio Imbrenda1-0/+2
commit 306d6c49ac9ded11114cb53b0925da52f2c2ada1 upstream. When the oom killer kills a userspace process in the page fault handler while in guest context, the fault handler fails to release the mm_sem if the FAULT_FLAG_RETRY_NOWAIT option is set. This leads to a deadlock when tearing down the mm when the process terminates. This bug can only happen when pfault is enabled, so only KVM clients are affected. The problem arises in the rare cases in which handle_mm_fault does not release the mm_sem. This patch fixes the issue by manually releasing the mm_sem when needed. Fixes: 24eb3a824c4f3 ("KVM: s390: Add FAULT_FLAG_RETRY_NOWAIT for guest fault") Cc: <stable@vger.kernel.org> # 3.15+ Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05s390/mm: fix write access check in gup_huge_pmd()Gerald Schaefer1-4/+3
commit ba385c0594e723d41790ecfb12c610e6f90c7785 upstream. The check for the _SEGMENT_ENTRY_PROTECT bit in gup_huge_pmd() is the wrong way around. It must not be set for write==1, and not be checked for write==0. Fix this similar to how it was fixed for ptes long time ago in commit 25591b070336 ("[S390] fix get_user_pages_fast"). One impact of this bug would be unnecessarily using the gup slow path for write==0 on r/w mappings. A potentially more severe impact would be that gup_huge_pmd() will succeed for write==1 on r/o mappings. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-09s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangsChristian Borntraeger1-7/+32
commit fa41ba0d08de7c975c3e94d0067553f9b934221f upstream. Right now there is a potential hang situation for postcopy migrations, if the guest is enabling storage keys on the target system during the postcopy process. For storage key virtualization, we have to forbid the empty zero page as the storage key is a property of the physical page frame. As we enable storage key handling lazily we then drop all mappings for empty zero pages for lazy refaulting later on. This does not work with the postcopy migration, which relies on the empty zero page never triggering a fault again in the future. The reason is that postcopy migration will simply read a page on the target system if that page is a known zero page to fault in an empty zero page. At the same time postcopy remembers that this page was already transferred - so any future userfault on that page will NOT be retransmitted again to avoid races. If now the guest enters the storage key mode while in postcopy, we will break this assumption of postcopy. The solution is to disable the empty zero page for KVM guests early on and not during storage key enablement. With this change, the postcopy migration process is guaranteed to start after no zero pages are left. As guest pages are very likely not empty zero pages anyway the memory overhead is also pretty small. While at it this also adds proper page table locking to the zero page removal. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Janosch Frank <frankja@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-24mm: larger stack guard gap, between vmasHugh Dickins1-2/+2
commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream. Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [wt: backport to 4.11: adjust context] [wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide] Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-18KVM: s390: Fix guest migration for huge guests resulting in panicJanosch Frank1-1/+18
commit 2e4d88009f57057df7672fa69a32b5224af54d37 upstream. While we can technically not run huge page guests right now, we can setup a guest with huge pages. Trying to migrate it will trigger a VM_BUG_ON and, if the kernel is not configured to panic on a BUG, it will happily try to work on non-existing page table entries. With this patch, we always return "dirty" if we encounter a large page when migrating. This at least fixes the immediate problem until we have proper handling for both kind of pages. Fixes: 15f36eb ("KVM: s390: Add proper dirty bitmap support to S390 kvm.") Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-01s390/mm: Fix cmma unused transfer from pgste into pteChristian Borntraeger1-3/+4
commit 0d6da872d3e4a60f43c295386d7ff9a4cdcd57e9 upstream. The last pgtable rework silently disabled the CMMA unused state by setting a local pte variable (a parameter) instead of propagating it back into the caller. Fix it. Fixes: ebde765c0e85 ("s390/mm: uninline ptep_xxx functions from pgtable.h") Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28Merge branch 'for-linus' of ↵Linus Torvalds2-17/+22
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Martin Schwidefsky: "A few more s390 patches for 4.9: - a fix for an overflow in the dasd driver reported by UBSAN - fix a regression and add hotplug memory to the zone movable again - add ignore defines for the pkey system calls - fix the ouput of the merged stack tracer - replace printk with pr_cont in arch/s390 where appropriate - remove the arch specific return_address function again - ignore reserved channel paths at boot time - add a missing hugetlb_bad_size call to the arch backend" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/mm: fix zone calculation in arch_add_memory() s390/dumpstack: use pr_cont within show_stack and die s390/dumpstack: get rid of return_address again s390/disassambler: use pr_cont where appropriate s390/dumpstack: use pr_cont where appropriate s390/dumpstack: restore reliable indicator for call traces s390/mm: use hugetlb_bad_size() s390/cio: don't register chpids in reserved state s390: ignore pkey system calls s390/dasd: avoid undefined behaviour
2016-10-24s390/mm: fix zone calculation in arch_add_memory()Gerald Schaefer1-17/+21
Standby (hotplug) memory should be added to ZONE_MOVABLE on s390. After commit 199071f1 "s390/mm: make arch_add_memory() NUMA aware", arch_add_memory() used memblock_end_of_DRAM() to find out the end of ZONE_NORMAL and the beginning of ZONE_MOVABLE. However, commit 7f36e3e5 "memory-hotplug: add hot-added memory ranges to memblock before allocate node_data for a node." moved the call of memblock_add_node() before the call of arch_add_memory() in add_memory_resource(), and thus changed the return value of memblock_end_of_DRAM() when called in arch_add_memory(). As a result, arch_add_memory() will think that all memory blocks should be added to ZONE_NORMAL. Fix this by changing the logic in arch_add_memory() so that it will manually iterate over all zones of a given node to find out which zone a memory block should be added to. Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-10-19mm: replace get_user_pages_unlocked() write/force parameters with gup_flagsLorenzo Stoakes1-1/+2
This removes the 'write' and 'force' use from get_user_pages_unlocked() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-17s390/mm: use hugetlb_bad_size()Shyam Saini1-0/+1
Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported hugepage size is found. Signed-off-by: Shyam Saini <mayhs11saini@gmail.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-10-05Merge branch 'for-linus' of ↵Linus Torvalds4-12/+27
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "The new features and main improvements in this merge for v4.9 - Support for the UBSAN sanitizer - Set HAVE_EFFICIENT_UNALIGNED_ACCESS, it improves the code in some places - Improvements for the in-kernel fpu code, in particular the overhead for multiple consecutive in kernel fpu users is recuded - Add a SIMD implementation for the RAID6 gen and xor operations - Add RAID6 recovery based on the XC instruction - The PCI DMA flush logic has been improved to increase the speed of the map / unmap operations - The time synchronization code has seen some updates And bug fixes all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (48 commits) s390/con3270: fix insufficient space padding s390/con3270: fix use of uninitialised data MAINTAINERS: update DASD maintainer s390/cio: fix accidental interrupt enabling during resume s390/dasd: add missing \n to end of dev_err messages s390/config: Enable config options for Docker s390/dasd: make query host access interruptible s390/dasd: fix panic during offline processing s390/dasd: fix hanging offline processing s390/pci_dma: improve lazy flush for unmap s390/pci_dma: split dma_update_trans s390/pci_dma: improve map_sg s390/pci_dma: simplify dma address calculation s390/pci_dma: remove dma address range check iommu/s390: simplify registration of I/O address translation parameters s390: migrate exception table users off module.h and onto extable.h s390: export header for CLP ioctl s390/vmur: fix irq pointer dereference in int handler s390/dasd: add missing KOBJ_CHANGE event for unformatted devices s390: enable UBSAN ...
2016-09-20s390: migrate exception table users off module.h and onto extable.hPaul Gortmaker1-1/+1
These files were only including module.h for exception table related functions. We've now separated that content out into its own file "extable.h" so now move over to that and avoid all the extra header content in module.h that we don't really need to compile these files. The additions of uaccess.h are to deal with implict includes like: arch/s390/kernel/traps.c: In function 'do_report_trap': arch/s390/kernel/traps.c:56:4: error: implicit declaration of function 'extable_fixup' [-Werror=implicit-function-declaration] arch/s390/kernel/traps.c: In function 'illegal_op': arch/s390/kernel/traps.c:173:3: error: implicit declaration of function 'get_user' [-Werror=implicit-function-declaration] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-09-19s390/mm/pfault: Convert to hotplug state machineSebastian Andrzej Siewior1-18/+12
Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-s390@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: rt@linutronix.de Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Link: http://lkml.kernel.org/r/20160906170457.32393-18-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-08-24s390/mm: merge local / non-local IDTE helperMartin Schwidefsky1-5/+5
Merge the __p[m|u]xdp_idte and __p[m|u]dp_idte_local functions into a single __p[m|u]dp_idte function with an additional parameter. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-08-24s390/mm: merge local / non-local IPTE helperMartin Schwidefsky2-6/+6
Merge the __ptep_ipte and __ptep_ipte_local functions into a single __ptep_ipte function with an additional parameter. The __pte_ipte_range function is still extra as the while loops makes it hard to merge. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-08-24s390/mm,kvm: flush gmap address space with IDTEMartin Schwidefsky1-0/+15
The __tlb_flush_mm() helper uses a global flush if the mm struct has a gmap structure attached to it. Replace the global flush with two individual flushes by means of the IDTE instruction if only a single gmap is attached the the mm. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-08-10s390/pageattr: handle numpages parameter correctlyHeiko Carstens1-0/+2
Both set_memory_ro() and set_memory_rw() will modify the page attributes of at least one page, even if the numpages parameter is zero. The author expected that calling these functions with numpages == zero would never happen. However with the new 444d13ff10fb ("modules: add ro_after_init support") feature this happens frequently. Therefore do the right thing and make these two functions return gracefully if nothing should be done. Fixes crashes on module load like this one: Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 000003ff80008000 TEID: 000003ff80008407 Fault in home space mode while using kernel ASCE. AS:0000000000d18007 R3:00000001e6aa4007 S:00000001e6a10800 P:00000001e34ee21d Oops: 0004 ilc:3 [#1] SMP Modules linked in: x_tables CPU: 10 PID: 1 Comm: systemd Not tainted 4.7.0-11895-g3fa9045 #4 Hardware name: IBM 2964 N96 703 (LPAR) task: 00000001e9118000 task.stack: 00000001e9120000 Krnl PSW : 0704e00180000000 00000000005677f8 (rb_erase+0xf0/0x4d0) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 000003ff80008b20 000003ff80008b20 000003ff80008b70 0000000000b9d608 000003ff80008b20 0000000000000000 00000001e9123e88 000003ff80008950 00000001e485ab40 000003ff00000000 000003ff80008b00 00000001e4858480 0000000100000000 000003ff80008b68 00000000001d5998 00000001e9123c28 Krnl Code: 00000000005677e8: ec1801c3007c cgij %r1,0,8,567b6e 00000000005677ee: e32010100020 cg %r2,16(%r1) #00000000005677f4: a78401c2 brc 8,567b78 >00000000005677f8: e35010080024 stg %r5,8(%r1) 00000000005677fe: ec5801af007c cgij %r5,0,8,567b5c 0000000000567804: e30050000024 stg %r0,0(%r5) 000000000056780a: ebacf0680004 lmg %r10,%r12,104(%r15) 0000000000567810: 07fe bcr 15,%r14 Call Trace: ([<000003ff80008900>] __this_module+0x0/0xffffffffffffd700 [x_tables]) ([<0000000000264fd4>] do_init_module+0x12c/0x220) ([<00000000001da14a>] load_module+0x24e2/0x2b10) ([<00000000001da976>] SyS_finit_module+0xbe/0xd8) ([<0000000000803b26>] system_call+0xd6/0x264) Last Breaking-Event-Address: [<000000000056771a>] rb_erase+0x12/0x4d0 Kernel panic - not syncing: Fatal exception: panic_on_oops Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Reported-and-tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Fixes: e8a97e42dc98 ("s390/pageattr: allow kernel page table splitting") Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-08-02Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds4-117/+1707
Pull KVM updates from Paolo Bonzini: - ARM: GICv3 ITS emulation and various fixes. Removal of the old VGIC implementation. - s390: support for trapping software breakpoints, nested virtualization (vSIE), the STHYI opcode, initial extensions for CPU model support. - MIPS: support for MIPS64 hosts (32-bit guests only) and lots of cleanups, preliminary to this and the upcoming support for hardware virtualization extensions. - x86: support for execute-only mappings in nested EPT; reduced vmexit latency for TSC deadline timer (by about 30%) on Intel hosts; support for more than 255 vCPUs. - PPC: bugfixes. * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits) KVM: PPC: Introduce KVM_CAP_PPC_HTM MIPS: Select HAVE_KVM for MIPS64_R{2,6} MIPS: KVM: Reset CP0_PageMask during host TLB flush MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX() MIPS: KVM: Sign extend MFC0/RDHWR results MIPS: KVM: Fix 64-bit big endian dynamic translation MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase MIPS: KVM: Use 64-bit CP0_EBase when appropriate MIPS: KVM: Set CP0_Status.KX on MIPS64 MIPS: KVM: Make entry code MIPS64 friendly MIPS: KVM: Use kmap instead of CKSEG0ADDR() MIPS: KVM: Use virt_to_phys() to get commpage PFN MIPS: Fix definition of KSEGX() for 64-bit KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD kvm: x86: nVMX: maintain internal copy of current VMCS KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures KVM: arm64: vgic-its: Simplify MAPI error handling KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers KVM: arm64: vgic-its: Turn device_id validation into generic ID validation ...
2016-07-31s390/mm: clean up pte/pmd encodingGerald Schaefer1-14/+38
The hugetlbfs pte<->pmd conversion functions currently assume that the pmd bit layout is consistent with the pte layout, which is not really true. The SW read and write bits are encoded as the sequence "wr" in a pte, but in a pmd it is "rw". The hugetlbfs conversion assumes that the sequence is identical in both cases, which results in swapped read and write bits in the pmd. In practice this is not a problem, because those pmd bits are only relevant for THP pmds and not for hugetlbfs pmds. The hugetlbfs code works on (fake) ptes, and the converted pte bits are correct. There is another variation in pte/pmd encoding which affects dirty prot-none ptes/pmds. In this case, a pmd has both its HW read-only and invalid bit set, while it is only the invalid bit for a pte. This also has no effect in practice, but it should better be consistent. This patch fixes both inconsistencies by changing the SW read/write bit layout for pmds as well as the PAGE_NONE encoding for ptes. It also makes the hugetlbfs conversion functions more robust by introducing a move_set_bit() macro that uses the pte/pmd bit #defines instead of constant shifts. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-07-27Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+1
Merge updates from Andrew Morton: - a few misc bits - ocfs2 - most(?) of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits) thp: fix comments of __pmd_trans_huge_lock() cgroup: remove unnecessary 0 check from css_from_id() cgroup: fix idr leak for the first cgroup root mm: memcontrol: fix documentation for compound parameter mm: memcontrol: remove BUG_ON in uncharge_list mm: fix build warnings in <linux/compaction.h> mm, thp: convert from optimistic swapin collapsing to conservative mm, thp: fix comment inconsistency for swapin readahead functions thp: update Documentation/{vm/transhuge,filesystems/proc}.txt shmem: split huge pages beyond i_size under memory pressure thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE khugepaged: add support of collapse for tmpfs/shmem pages shmem: make shmem_inode_info::lock irq-safe khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page() thp: extract khugepaged from mm/huge_memory.c shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings shmem: add huge pages support shmem: get_unmapped_area align huge page shmem: prepare huge= mount option and sysfs knob mm, rmap: account shmem thp pages ...
2016-07-27mm: do not pass mm_struct into handle_mm_faultKirill A. Shutemov1-1/+1
We always have vma->vm_mm around. Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26Merge branch 'for-linus' of ↵Linus Torvalds10-151/+493
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "There are a couple of new things for s390 with this merge request: - a new scheduling domain "drawer" is added to reflect the unusual topology found on z13 machines. Performance tests showed up to 8 percent gain with the additional domain. - the new crc-32 checksum crypto module uses the vector-galois-field multiply and sum SIMD instruction to speed up crc-32 and crc-32c. - proper __ro_after_init support, this requires RO_AFTER_INIT_DATA in the generic vmlinux.lds linker script definitions. - kcov instrumentation support. A prerequisite for that is the inline assembly basic block cleanup, which is the reason for the net/iucv/iucv.c change. - support for 2GB pages is added to the hugetlbfs backend. Then there are two removals: - the oprofile hardware sampling support is dead code and is removed. The oprofile user space uses the perf interface nowadays. - the ETR clock synchronization is removed, this has been superseeded be the STP clock synchronization. And it always has been "interesting" code.. And the usual bug fixes and cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (82 commits) s390/pci: Delete an unnecessary check before the function call "pci_dev_put" s390/smp: clean up a condition s390/cio/chp : Remove deprecated create_singlethread_workqueue s390/chsc: improve channel path descriptor determination s390/chsc: sanitize fmt check for chp_desc determination s390/cio: make fmt1 channel path descriptor optional s390/chsc: fix ioctl CHSC_INFO_CU command s390/cio/device_ops: fix kernel doc s390/cio: allow to reset channel measurement block s390/console: Make preferred console handling more consistent s390/mm: fix gmap tlb flush issues s390/mm: add support for 2GB hugepages s390: have unique symbol for __switch_to address s390/cpuinfo: show maximum thread id s390/ptrace: clarify bits in the per_struct s390: stack address vs thread_info s390: remove pointless load within __switch_to s390: enable kcov support s390/cpumf: use basic block for ecctr inline assembly s390/hypfs: use basic block for diag inline assembly ...
2016-07-13s390/mm: fix gmap tlb flush issuesDavid Hildenbrand1-2/+2
__tlb_flush_asce() should never be used if multiple asce belong to a mm. As this function changes mm logic determining if local or global tlb flushes will be neded, we might end up flushing only the gmap asce on all CPUs and a follow up mm asce flushes will only flush on the local CPU, although that asce ran on multiple CPUs. The missing tlb flushes will provoke strange faults in user space and even low address protections in user space, crashing the kernel. Fixes: 1b948d6caec4 ("s390/mm,tlb: optimize TLB flushing for zEC12") Cc: stable@vger.kernel.org # 3.15+ Reported-by: Sascha Silbe <silbe@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-07-06s390/mm: add support for 2GB hugepagesGerald Schaefer4-40/+176
This adds support for 2GB hugetlbfs pages on s390. Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-28s390/mm: use basic block for essa inline assemblyHeiko Carstens1-4/+9
Use only simple inline assemblies which consist of a single basic block if the register asm construct is being used. Otherwise gcc would generate broken code if the compiler option --sanitize-coverage=trace-pc would be used. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-25s390: get rid of superfluous __GFP_REPEATMichal Hocko1-1/+1
__GFP_REPEAT has a rather weak semantic but since it has been introduced around 2.6.12 it has been ignored for low order allocations. page_table_alloc then uses the flag for a single page allocation. This means that this flag has never been actually useful here because it has always been used only for PAGE_ALLOC_COSTLY requests. Link: http://lkml.kernel.org/r/1464599699-30131-14-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-20KVM: s390: backup the currently enabled gmap when scheduled outDavid Hildenbrand1-0/+11
Nested virtualization will have to enable own gmaps. Current code would enable the wrong gmap whenever scheduled out and back in, therefore resulting in the wrong gmap being enabled. This patch reenables the last enabled gmap, therefore avoiding having to touch vcpu->arch.gmap when enabling a different gmap. Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: don't fault everything in read-write in gmap_pte_op_fixup()David Hildenbrand1-6/+11
Let's not fault in everything in read-write but limit it to read-only where possible. When restricting access rights, we already have the required protection level in our hands. When reading from guest 2 storage (gmap_read_table), it is obviously PROT_READ. When shadowing a pte, the required protection level is given via the guest 2 provided pte. Based on an initial patch by Martin Schwidefsky. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: allow to check if a gmap shadow is validDavid Hildenbrand1-0/+20
It will be very helpful to have a mechanism to check without any locks if a given gmap shadow is still valid and matches the given properties. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: remember the int code for the last gmap faultDavid Hildenbrand1-0/+1
For nested virtualization, we want to know if we are handling a protection exception, because these can directly be forwarded to the guest without additional checks. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: limit number of real-space gmap shadowsDavid Hildenbrand1-0/+13
We have no known user of real-space designation and only support it to be architecture compliant. Gmap shadows with real-space designation are never unshadowed automatically, as there is nothing to protect for the top level table. So let's simply limit the number of such shadows to one by removing existing ones on creation of another one. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: support real-space for gmap shadowsDavid Hildenbrand1-3/+32
We can easily support real-space designation just like EDAT1 and EDAT2. So guest2 can provide for guest3 an asce with the real-space control being set. We simply have to allocate the biggest page table possible and fake all levels. There is no protection to consider. If we exceed guest memory, vsie code will inject an addressing exception (via program intercept). In the future, we could limit the fake table level to the gmap page table. As the top level page table can never go away, such gmap shadows will never get unshadowed, we'll have to come up with another way to limit the number of kept gmap shadows. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: support EDAT2 for gmap shadowsDavid Hildenbrand1-2/+12
If the guest is enabled for EDAT2, we can easily create shadows for guest2 -> guest3 provided tables that make use of EDAT2. If guest2 references a 2GB page, this memory looks consecutive for guest2, but it does not have to be so for us. Therefore we have to create fake segment and page tables. This works just like EDAT1 support, so page tables are removed when the parent table (r3t table entry) is changed. We don't hve to care about: - ACCF-Validity Control in RTTE - Access-Control Bits in RTTE - Fetch-Protection Bit in RTTE - Common-Region Bit in RTTE Just like for EDAT1, all bits might be dropped and there is no guaranteed that they are active. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: support EDAT1 for gmap shadowsDavid Hildenbrand1-4/+25
If the guest is enabled for EDAT1, we can easily create shadows for guest2 -> guest3 provided tables that make use of EDAT1. If guest2 references a 1MB page, this memory looks consecutive for guest2, but it might not be so for us. Therefore we have to create fake page tables. We can easily add that to our existing infrastructure. The invalidation mechanism will make sure that fake page tables are removed when the parent table (sgt table entry) is changed. As EDAT1 also introduced protection on all page table levels, we have to also shadow these correctly. We don't have to care about: - ACCF-Validity Control in STE - Access-Control Bits in STE - Fetch-Protection Bit in STE - Common-Segment Bit in STE As all bits might be dropped and there is no guaranteed that they are active ("unpredictable whether the CPU uses these bits", "may be used"). Without using EDAT1 in the shadow ourselfes (STE-format control == 0), simply shadowing these bits would not be enough. They would be ignored. Please note that we are using the "fake" flag to make this look consistent with further changes (EDAT2, real-space designation support) and don't let the shadow functions handle fc=1 stes. In the future, with huge pages in the host, gmap_shadow_pgt() could simply try to map a huge host page if "fake" is set to one and indicate via return value that no lower fake tables / shadow ptes are required. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: prepare for EDAT1/EDAT2 support in gmap shadowDavid Hildenbrand1-5/+11
In preparation for EDAT1/EDAT2 support for gmap shadows, we have to store the requested edat level in the gmap shadow. The edat level used during shadow translation is a property of the gmap shadow. Depending on that level, the gmap shadow will look differently for the same guest tables. We have to store it internally in order to support it later. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: fix races on gmap_shadow creationDavid Hildenbrand1-17/+28
Before any thread is allowed to use a gmap_shadow, it has to be fully initialized. However, for invalidation to work properly, we have to register the new gmap_shadow before we protect the parent gmap table. Because locking is tricky, and we have to avoid duplicate gmaps, let's introduce an initialized field, that signalizes other threads if that gmap_shadow can already be used or if they have to retry. Let's properly return errors using ERR_PTR() instead of simply returning NULL, so a caller can properly react on the error. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: avoid races on region/segment/page table shadowingDavid Hildenbrand1-27/+70
We have to unlock sg->guest_table_lock in order to call gmap_protect_rmap(). If we sleep just before that call, another VCPU might pick up that shadowed page table (while it is not protected yet) and use it. In order to avoid these races, we have to introduce a third state - "origin set but still invalid" for an entry. This way, we can avoid another thread already using the entry before the table is fully protected. As soon as everything is set up, we can clear the invalid bit - if we had no race with the unshadowing code. Suggested-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: shadow pages with real guest requested protectionDavid Hildenbrand2-16/+12
We really want to avoid manually handling protection for nested virtualization. By shadowing pages with the protection the guest asked us for, the SIE can handle most protection-related actions for us (e.g. special handling for MVPG) and we can directly forward protection exceptions to the guest. PTEs will now always be shadowed with the correct _PAGE_PROTECT flag. Unshadowing will take care of any guest changes to the parent PTE and any host changes to the host PTE. If the host PTE doesn't have the fitting access rights or is not available, we have to fix it up. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: flush tlb of shadows in all situationsDavid Hildenbrand1-3/+4
For now, the tlb of shadow gmap is only flushed when the parent is removed, not when it is removed upfront. Therefore other shadow gmaps can reuse the tables without the tlb getting flushed. Fix this by simply flushing the tlb 1. Before the shadow tables are removed (analogouos to other unshadow functions) 2. When the gmap is freed and therefore the top level pages are freed. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20s390/mm: add shadow gmap supportMartin Schwidefsky4-31/+1200
For a nested KVM guest the outer KVM host needs to create shadow page tables for the nested guest. This patch adds the basic support to the guest address space (gmap) code. For each guest address space the inner KVM host creates, the first outer KVM host needs to create shadow page tables. The address space is identified by the ASCE loaded into the control register 1 at the time the inner SIE instruction for the second nested KVM guest is executed. The outer KVM host creates the shadow tables starting with the table identified by the ASCE on a on-demand basis. The outer KVM host will get repeated faults for all the shadow tables needed to run the second KVM guest. While a shadow page table for the second KVM guest is active the access to the origin region, segment and page tables needs to be restricted for the first KVM guest. For region and segment and page tables the first KVM guest may read the memory, but write attempt has to lead to an unshadow. This is done using the page invalid and read-only bits in the page table of the first KVM guest. If the first guest re-accesses one of the origin pages of a shadow, it gets a fault and the affected parts of the shadow page table hierarchy needs to be removed again. PGSTE tables don't have to be shadowed, as all interpretation assist can't deal with the invalid bits in the shadow pte being set differently than the original ones provided by the first KVM guest. Many bug fixes and improvements by David Hildenbrand. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>