summaryrefslogtreecommitdiff
path: root/arch/x86/mm
AgeCommit message (Collapse)AuthorFilesLines
2017-07-05x86/mm/pat: Don't report PAT on CPUs that don't support itMikulas Patocka1-16/+12
The pat_enabled() logic is broken on CPUs which do not support PAT and where the initialization code fails to call pat_init(). Due to that the enabled flag stays true and pat_enabled() returns true wrongfully. As a consequence the mappings, e.g. for Xorg, are set up with the wrong caching mode and the required MTRR setups are omitted. To cure this the following changes are required: 1) Make pat_enabled() return true only if PAT initialization was invoked and successful. 2) Invoke init_cache_modes() unconditionally in setup_arch() and remove the extra callsites in pat_disable() and the pat disabled code path in pat_init(). Also rename __pat_enabled to pat_disabled to reflect the real purpose of this variable. Fixes: 9cd25aac1f44 ("x86/mm/pat: Emulate PAT when it is disabled") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Bernhard Held <berny156@gmx.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: "Luis R. Rodriguez" <mcgrof@suse.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1707041749300.3456@file01.intranet.prod.int.rdu2.redhat.com
2017-07-04Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds11-814/+364
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "The main changes in this cycle were: - Continued work to add support for 5-level paging provided by future Intel CPUs. In particular we switch the x86 GUP code to the generic implementation. (Kirill A. Shutemov) - Continued work to add PCID CPU support to native kernels as well. In this round most of the focus is on reworking/refreshing the TLB flush infrastructure for the upcoming PCID changes. (Andy Lutomirski)" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits) x86/mm: Delete a big outdated comment about TLB flushing x86/mm: Don't reenter flush_tlb_func_common() x86/KASLR: Fix detection 32/64 bit bootloaders for 5-level paging x86/ftrace: Exclude functions in head64.c from function-tracing x86/mmap, ASLR: Do not treat unlimited-stack tasks as legacy mmap x86/mm: Remove reset_lazy_tlbstate() x86/ldt: Simplify the LDT switching logic x86/boot/64: Put __startup_64() into .head.text x86/mm: Add support for 5-level paging for KASLR x86/mm: Make kernel_physical_mapping_init() support 5-level paging x86/mm: Add sync_global_pgds() for configuration with 5-level paging x86/boot/64: Add support of additional page table level during early boot x86/boot/64: Rename init_level4_pgt and early_level4_pgt x86/boot/64: Rewrite startup_64() in C x86/boot/compressed: Enable 5-level paging during decompression stage x86/boot/efi: Define __KERNEL32_CS GDT on 64-bit configurations x86/boot/efi: Fix __KERNEL_CS definition of GDT entry on 64-bit configurations x86/boot/efi: Cleanup initialization of GDT entries x86/asm: Fix comment in return_from_SYSCALL_64() x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation ...
2017-06-30x86/mm: Delete a big outdated comment about TLB flushingAndy Lutomirski1-36/+0
The comment describes the old explicit IPI-based flush logic, which is long gone. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/55e44997e56086528140c5180f8337dc53fb7ffc.1498751203.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-30x86/mm: Don't reenter flush_tlb_func_common()Andy Lutomirski1-2/+15
It was historically possible to have two concurrent TLB flushes targetting the same CPU: one initiated locally and one initiated remotely. This can now cause an OOPS in leave_mm() at arch/x86/mm/tlb.c:47: if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) BUG(); with this call trace: flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline] flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317 Without reentrancy, this OOPS is impossible: leave_mm() is only called if we're not in TLBSTATE_OK, but then we're unexpectedly in TLBSTATE_OK in leave_mm(). This can be caused by flush_tlb_func_remote() happening between the two checks and calling leave_mm(), resulting in two consecutive leave_mm() calls on the same CPU with no intervening switch_mm() calls. We never saw this OOPS before because the old leave_mm() implementation didn't put us back in TLBSTATE_OK, so the assertion didn't fire. Nadav noticed the reentrancy issue in a different context, but neither of us realized that it caused a problem yet. Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Nadav Amit <nadav.amit@gmail.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Fixes: 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm") Link: http://lkml.kernel.org/r/855acf733268d521c9f2e191faee2dcc23a29729.1498751203.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-28x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimmDan Williams1-0/+6
Kill this globally defined wrapper and move to libnvdimm so that we can ultimately remove include/linux/pmem.h and asm/pmem.h. Cc: <x86@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-26x86/mm/hotplug: Fix BUG_ON() after hot-remove by not freeing PUDJérôme Glisse1-1/+7
Since commit: af2cf278ef4f ("x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()") we no longer free PUDs so that we do not have to synchronize all PGDs on hot-remove/vfree(). But the new 5-level page table patchset reverted that for 4-level page tables, in the following commit: f2a6a7050109: ("x86: Convert the rest of the code to support p4d_t") This patch restores the damage and disables free_pud() if we are in the 4-level page table case, thus avoiding BUG_ON() after hot-remove. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> [ Clarified the changelog and the code comments. ] Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170624180514.3821-1-jglisse@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24x86/mmap, ASLR: Do not treat unlimited-stack tasks as legacy mmapMichal Hocko1-3/+0
Since the following commit in 2008: cc503c1b43e0 ("x86: PIE executable randomization") We added a heuristics to treat applications with RLIMIT_STACK configured to unlimited as legacy. This means: a) set the mmap_base to 1/3 of address space + randomization and b) mmap from bottom to top. This makes some sense as it allows the stack to grow really large. On the other hand it reduces the address space usable for default mmaps (without address hint) quite a lot. We have received a bug report that SAP HANA workload has hit into this limitation. We could argue that the user just got what he asked for when setting up the unlimited stack but to be realistic growing stack up to 1/6 TASK_SIZE (allowed by mmap_base) is pretty much unimited in the real life. This would give mmap 20TB of additional address space which is quite nice. Especially when it is much more likely to use that address space than the reserved stack. Digging into the history the original implementation of the randomization: 8817210d4d96 ("[PATCH] x86_64: Flexmap for 32bit and randomized mappings for 64bit") didn't have this restriction. So let's try and remove this assumption - hopefully nothing breaks. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: hughd@google.com Cc: linux-mm@kvack.org Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/tip-86b110d2ae6365ce91cabd37588bc8611770421a@git.kernel.org [ So I've applied this to tip:x86/mm with a wider Cc: list - if anyone objects to this change please holler. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22x86/ldt: Simplify the LDT switching logicAndy Lutomirski1-18/+2
Originally, Linux reloaded the LDT whenever the prev mm or the next mm had an LDT. It was changed in 2002 in: 0bbed3beb4f2 ("[PATCH] Thread-Local Storage (TLS) support") (commit from the historical tree), like this: - /* load_LDT, if either the previous or next thread - * has a non-default LDT. + /* + * load the LDT, if the LDT is different: */ - if (next->context.size+prev->context.size) + if (unlikely(prev->context.ldt != next->context.ldt)) load_LDT(&next->context); The current code is unlikely to avoid any LDT reloads, since different mms won't share an LDT. When we redo lazy mode to stop flush IPIs without switching to init_mm, though, the current logic would become incorrect: it will be possible to have real_prev == next but nonetheless have a stale LDT descriptor. Simplify the code to update LDTR if either the previous or the next mm has an LDT, i.e. effectively restore the historical logic.. While we're at it, clean up the code by moving all the ifdeffery to a header where it belongs. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/2a859ac01245f9594c58f9d0a8b2ed8a7cd2507e.1498022414.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22Merge branch 'linus' into x86/mm, to pick up fixesIngo Molnar3-4/+7
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-19mm: larger stack guard gap, between vmasHugh Dickins1-1/+1
Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-13x86/mm: Add support for 5-level paging for KASLRKirill A. Shutemov1-19/+62
With 5-level paging randomization happens on P4D level instead of PUD. Maximum amount of physical memory also bumped to 52-bits for 5-level paging. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170606113133.22974-13-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-13x86/mm: Make kernel_physical_mapping_init() support 5-level pagingKirill A. Shutemov1-9/+60
Populate additional page table level if CONFIG_X86_5LEVEL is enabled. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170606113133.22974-12-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-13x86/mm: Add sync_global_pgds() for configuration with 5-level pagingKirill A. Shutemov1-0/+39
This basically restores slightly modified version of original sync_global_pgds() which we had before folded p4d was introduced. The only modification is protection against 'addr' overflow. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170606113133.22974-11-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-13x86/boot/64: Rename init_level4_pgt and early_level4_pgtKirill A. Shutemov2-7/+7
With CONFIG_X86_5LEVEL=y, level 4 is no longer top level of page tables. Let's give these variable more generic names: init_top_pgt and early_top_pgt. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170606113133.22974-9-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-13x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementationKirill A. Shutemov2-497/+1
This patch provides all required callbacks required by the generic get_user_pages_fast() code and switches x86 over - and removes the platform specific implementation. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170606113133.22974-2-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-13x86/mm: Split read_cr3() into read_cr3_pa() and __read_cr3()Andy Lutomirski2-6/+6
The kernel has several code paths that read CR3. Most of them assume that CR3 contains the PGD's physical address, whereas some of them awkwardly use PHYSICAL_PAGE_MASK to mask off low bits. Add explicit mask macros for CR3 and convert all of the CR3 readers. This will keep them from breaking when PCID is enabled. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: xen-devel <xen-devel@lists.xen.org> Link: http://lkml.kernel.org/r/883f8fb121f4616c1c1427ad87350bb2f5ffeca1.1497288170.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-13x86/mm: Disable 1GB direct mappings when disabling 2MB mappingsVlastimil Babka1-3/+3
The kmemleak and debug_pagealloc features both disable using huge pages for direct mappings so they can do cpa() on page level granularity in any context. However they only do that for 2MB pages, which means 1GB pages can still be used if the CPU supports it, unless disabled by a boot param, which is non-obvious. Disable also 1GB pages when disabling 2MB pages. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vegard Nossum <vegardno@ifi.uio.no> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/2be70c78-6130-855d-3dfa-d87bd1dd4fda@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-12x86/debug: Handle early WARN_ONs properPeter Zijlstra1-0/+3
Hans managed to trigger a WARN very early in the boot which killed his (Virtual) box. The reason is that the recent rework of WARN() to use UD0 forgot to add the fixup_bug() call to early_fixup_exception(). As a result the kernel does not handle the WARN_ON injected UD0 exception and panics. Add the missing fixup call, so early UD's injected by WARN() get handled. Fixes: 9a93848fe787 ("x86/debug: Implement __WARN() using UD0") Reported-and-tested-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Frank Mehnert <frank.mehnert@oracle.com> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Michael Thayer <michael.thayer@oracle.com> Link: http://lkml.kernel.org/r/20170612180108.w4vgu2ckucmllf3a@hirez.programming.kicks-ass.net
2017-06-05x86/mm: Be more consistent wrt PAGE_SHIFT vs PAGE_SIZE in tlb flush codeAndy Lutomirski1-3/+2
Nadav pointed out that some code used PAGE_SIZE and other code used PAGE_SHIFT. Use PAGE_SHIFT instead of multiplying or dividing by PAGE_SIZE. Requested-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-05x86/mm: Rework lazy TLB to track the actual loaded mmAndy Lutomirski2-109/+109
Lazy TLB state is currently managed in a rather baroque manner. AFAICT, there are three possible states: - Non-lazy. This means that we're running a user thread or a kernel thread that has called use_mm(). current->mm == current->active_mm == cpu_tlbstate.active_mm and cpu_tlbstate.state == TLBSTATE_OK. - Lazy with user mm. We're running a kernel thread without an mm and we're borrowing an mm_struct. We have current->mm == NULL, current->active_mm == cpu_tlbstate.active_mm, cpu_tlbstate.state != TLBSTATE_OK (i.e. TLBSTATE_LAZY or 0). The current cpu is set in mm_cpumask(current->active_mm). CR3 points to current->active_mm->pgd. The TLB is up to date. - Lazy with init_mm. This happens when we call leave_mm(). We have current->mm == NULL, current->active_mm == cpu_tlbstate.active_mm, but that mm is only relelvant insofar as the scheduler is tracking it for refcounting. cpu_tlbstate.state != TLBSTATE_OK. The current cpu is clear in mm_cpumask(current->active_mm). CR3 points to swapper_pg_dir, i.e. init_mm->pgd. This patch simplifies the situation. Other than perf, x86 stops caring about current->active_mm at all. We have cpu_tlbstate.loaded_mm pointing to the mm that CR3 references. The TLB is always up to date for that mm. leave_mm() just switches us to init_mm. There are no longer any special cases for mm_cpumask, and switch_mm() switches mms without worrying about laziness. After this patch, cpu_tlbstate.state serves only to tell the TLB flush code whether it may switch to init_mm instead of doing a normal flush. This makes fairly extensive changes to xen_exit_mmap(), which used to look a bit like black magic. Perf is unchanged. With or without this change, perf may behave a bit erratically if it tries to read user memory in kernel thread context. We should build on this patch to teach perf to never look at user memory when cpu_tlbstate.loaded_mm != current->mm. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-05x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP codeAndy Lutomirski2-17/+2
The UP asm/tlbflush.h generates somewhat nicer code than the SMP version. Aside from that, it's fallen quite a bit behind the SMP code: - flush_tlb_mm_range() didn't flush individual pages if the range was small. - The lazy TLB code was much weaker. This usually wouldn't matter, but, if a kernel thread flushed its lazy "active_mm" more than once (due to reclaim or similar), it wouldn't be unlazied and would instead pointlessly flush repeatedly. - Tracepoints were missing. Aside from that, simply having the UP code around was a maintanence burden, since it means that any change to the TLB flush code had to make sure not to break it. Simplify everything by deleting the UP code. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-05x86/mm: Use new merged flush logic in arch_tlbbatch_flush()Andy Lutomirski1-6/+2
Now there's only one copy of the local tlb flush logic for non-kernel pages on SMP kernels. The only functional change is that arch_tlbbatch_flush() will now leave_mm() on the local CPU if that CPU is in the batch and is in TLBSTATE_LAZY mode. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-05x86/mm: Refactor flush_tlb_mm_range() to merge local and remote casesAndy Lutomirski1-65/+48
The local flush path is very similar to the remote flush path. Merge them. This is intended to make no difference to behavior whatsoever. It removes some code and will make future changes to the flushing mechanics simpler. This patch does remove one small optimization: flush_tlb_mm_range() now has an unconditional smp_mb() instead of using MOV to CR3 or INVLPG as a full barrier when applicable. I think this is okay for a few reasons. First, smp_mb() is quite cheap compared to the cost of a TLB flush. Second, this rearrangement makes a bigger optimization available: with some work on the SMP function call code, we could do the local and remote flushes in parallel. Third, I'm planning a rework of the TLB flush algorithm that will require an atomic operation at the beginning of each flush, and that operation will replace the smp_mb(). Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-05x86/mm: Change the leave_mm() condition for local TLB flushesAndy Lutomirski1-1/+1
On a remote TLB flush, we leave_mm() if we're TLBSTATE_LAZY. For a local flush_tlb_mm_range(), we leave_mm() if !current->mm. These are approximately the same condition -- the scheduler sets lazy TLB mode when switching to a thread with no mm. I'm about to merge the local and remote flush code, but for ease of verifying and bisecting the patch, I want the local and remote flush behavior to match first. This patch changes the local code to match the remote code. Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-05x86/mm: Pass flush_tlb_info to flush_tlb_others() etcAndy Lutomirski1-32/+32
Rather than passing all the contents of flush_tlb_info to flush_tlb_others(), pass a pointer to the structure directly. For consistency, this also removes the unnecessary cpu parameter from uv_flush_tlb_others() to make its signature match the other *flush_tlb_others() functions. This serves two purposes: - It will dramatically simplify future patches that change struct flush_tlb_info, which I'm planning to do. - struct flush_tlb_info is an adequate description of what to do for a local flush, too, so by reusing it we can remove duplicated code between local and remove flushes in a future patch. Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org [ Fix build warning. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-05Merge tag 'v4.12-rc4' into x86/mm, to pick up fixesIngo Molnar1-1/+1
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-01Revert "x86/PAT: Fix Xorg regression on CPUs that don't support PAT"Ingo Molnar1-6/+3
This reverts commit cbed27cdf0e3f7ea3b2259e86b9e34df02be3fe4. As Andy Lutomirski observed: "I think this patch is bogus. pat_enabled() sure looks like it's supposed to return true if PAT is *enabled*, and these days PAT is 'enabled' even if there's no HW PAT support." Reported-by: Bernhard Held <berny156@gmx.de> Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: stable@vger.kernel.org # v4.2+ Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-27Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds1-3/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A series of fixes for X86: - The final fix for the end-of-stack issue in the unwinder - Handle non PAT systems gracefully - Prevent access to uninitiliazed memory - Move early delay calaibration after basic init - Fix Kconfig help text - Fix a cross compile issue - Unbreak older make versions" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/timers: Move simple_udelay_calibration past init_hypervisor_platform x86/alternatives: Prevent uninitialized stack byte read in apply_alternatives() x86/PAT: Fix Xorg regression on CPUs that don't support PAT x86/watchdog: Fix Kconfig help text file path reference to lockup watchdog documentation x86/build: Permit building with old make versions x86/unwind: Add end-of-stack check for ftrace handlers Revert "x86/entry: Fix the end of the stack for newly forked tasks" x86/boot: Use CROSS_COMPILE prefix for readelf
2017-05-27x86/mm/ftrace: Do not bug in early boot on irqs_disabled in cpu_flush_range()Steven Rostedt (VMware)1-1/+1
With function tracing starting in early bootup and having its trampoline pages being read only, a bug triggered with the following: kernel BUG at arch/x86/mm/pageattr.c:189! invalid opcode: 0000 [#1] SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc2-test+ #3 Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014 task: ffffffffb4222500 task.stack: ffffffffb4200000 RIP: 0010:change_page_attr_set_clr+0x269/0x302 RSP: 0000:ffffffffb4203c88 EFLAGS: 00010046 RAX: 0000000000000046 RBX: 0000000000000000 RCX: 00000001b6000000 RDX: ffffffffb4203d40 RSI: 0000000000000000 RDI: ffffffffb4240d60 RBP: ffffffffb4203d18 R08: 00000001b6000000 R09: 0000000000000001 R10: ffffffffb4203aa8 R11: 0000000000000003 R12: ffffffffc029b000 R13: ffffffffb4203d40 R14: 0000000000000001 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff9a639ea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff9a636b384000 CR3: 00000001ea21d000 CR4: 00000000000406b0 Call Trace: change_page_attr_clear+0x1f/0x21 set_memory_ro+0x1e/0x20 arch_ftrace_update_trampoline+0x207/0x21c ? ftrace_caller+0x64/0x64 ? 0xffffffffc029b000 ftrace_startup+0xf4/0x198 register_ftrace_function+0x26/0x3c function_trace_init+0x5e/0x73 tracer_init+0x1e/0x23 tracing_set_tracer+0x127/0x15a register_tracer+0x19b/0x1bc init_function_trace+0x90/0x92 early_trace_init+0x236/0x2b3 start_kernel+0x200/0x3f5 x86_64_start_reservations+0x29/0x2b x86_64_start_kernel+0x17c/0x18f secondary_startup_64+0x9f/0x9f ? secondary_startup_64+0x9f/0x9f Interrupts should not be enabled at this early in the boot process. It is also fine to leave interrupts enabled during this time as there's only one CPU running, and on_each_cpu() means to only run on the current CPU. If early_boot_irqs_disabled is set, it is safe to run cpu_flush_range() with interrupts disabled. Don't trigger a BUG_ON() in that case. Link: http://lkml.kernel.org/r/20170526093717.0be3b849@gandalf.local.home Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-24mm, x86/mm: Make the batched unmap TLB flush API more genericAndy Lutomirski1-0/+17
try_to_unmap_flush() used to open-code a rather x86-centric flush sequence: local_flush_tlb() + flush_tlb_others(). Rearrange the code so that the arch (only x86 for now) provides arch_tlbbatch_add_mm() and arch_tlbbatch_flush() and the core code calls those functions instead. I'll want this for x86 because, to enable address space ids, I can't support the flush_tlb_others() mode used by exising try_to_unmap_flush() implementation with good performance. I can support the new API fairly easily, though. I imagine that other architectures may be in a similar position. Architectures with strong remote flush primitives (arm64?) may have even worse performance problems with flush_tlb_others() the way that try_to_unmap_flush() uses it. Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/19f25a8581f9fb77876b7ff3b001f89835e34ea3.1495492063.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-24x86/mm: Reduce indentation in flush_tlb_func()Andy Lutomirski1-16/+18
The leave_mm() case can just exit the function early so we don't need to indent the entire remainder of the function. Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/97901ddcc9821d7bc7b296d2918d1179f08aaf22.1495492063.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-24x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()Andy Lutomirski1-27/+0
flush_tlb_page() was very similar to flush_tlb_mm_range() except that it had a couple of issues: - It was missing an smp_mb() in the case where current->active_mm != mm. (This is a longstanding bug reported by Nadav Amit) - It was missing tracepoints and vm counter updates. The only reason that I can see for keeping it at as a separate function is that it could avoid a few branches that flush_tlb_mm_range() needs to decide to flush just one page. This hardly seems worthwhile. If we decide we want to get rid of those branches again, a better way would be to introduce an __flush_tlb_mm_range() helper and make both flush_tlb_page() and flush_tlb_mm_range() use it. Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/3cc3847cf888d8907577569b8bac3f01992ef8f9.1495492063.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-24x86/PAT: Fix Xorg regression on CPUs that don't support PATMikulas Patocka1-3/+6
In the file arch/x86/mm/pat.c, there's a '__pat_enabled' variable. The variable is set to 1 by default and the function pat_init() sets __pat_enabled to 0 if the CPU doesn't support PAT. However, on AMD K6-3 CPUs, the processor initialization code never calls pat_init() and so __pat_enabled stays 1 and the function pat_enabled() returns true, even though the K6-3 CPU doesn't support PAT. The result of this bug is that a kernel warning is produced when attempting to start the Xserver and the Xserver doesn't start (fork() returns ENOMEM). Another symptom of this bug is that the framebuffer driver doesn't set the K6-3 MTRR registers: x86/PAT: Xorg:3891 map pfn expected mapping type uncached-minus for [mem 0xe4000000-0xe5ffffff], got write-combining ------------[ cut here ]------------ WARNING: CPU: 0 PID: 3891 at arch/x86/mm/pat.c:1020 untrack_pfn+0x5c/0x9f ... x86/PAT: Xorg:3891 map pfn expected mapping type uncached-minus for [mem 0xe4000000-0xe5ffffff], got write-combining To fix the bug change pat_enabled() so that it returns true only if PAT initialization was actually done. Also, I changed boot_cpu_has(X86_FEATURE_PAT) to this_cpu_has(X86_FEATURE_PAT) in pat_ap_init(), so that we check the PAT feature on the processor that is being initialized. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: stable@vger.kernel.org # v4.2+ Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1704181501450.26399@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-12Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds3-7/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes: - two boot crash fixes - unwinder fixes - kexec related kernel direct mappings enhancements/fixes - more Clang support quirks - minor cleanups - Documentation fixes" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/intel_rdt: Fix a typo in Documentation x86/build: Don't add -maccumulate-outgoing-args w/o compiler support x86/boot/32: Fix UP boot on Quark and possibly other platforms x86/mm/32: Set the '__vmalloc_start_set' flag in initmem_init() x86/kexec/64: Use gbpages for identity mappings if available x86/mm: Add support for gbpages to kernel_ident_mapping_init() x86/boot: Declare error() as noreturn x86/mm/kaslr: Use the _ASM_MUL macro for multiplication to work around Clang incompatibility x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds() x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic() x86/microcode/AMD: Remove redundant NULL check on mc
2017-05-11Merge tag 'hwparam-20170420' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull hw lockdown support from David Howells: "Annotation of module parameters that configure hardware resources including ioports, iomem addresses, irq lines and dma channels. This allows a future patch to prohibit the use of such module parameters to prevent that hardware from being abused to gain access to the running kernel image as part of locking the kernel down under UEFI secure boot conditions. Annotations are made by changing: module_param(n, t, p) module_param_named(n, v, t, p) module_param_array(n, t, m, p) to: module_param_hw(n, t, hwtype, p) module_param_hw_named(n, v, t, hwtype, p) module_param_hw_array(n, t, hwtype, m, p) where the module parameter refers to a hardware setting hwtype specifies the type of the resource being configured. This can be one of: ioport Module parameter configures an I/O port iomem Module parameter configures an I/O mem address ioport_or_iomem Module parameter could be either (runtime set) irq Module parameter configures an I/O port dma Module parameter configures a DMA channel dma_addr Module parameter configures a DMA buffer address other Module parameter configures some other value Note that the hwtype is compile checked, but not currently stored (the lockdown code probably won't require it). It is, however, there for future use. A bonus is that the hwtype can also be used for grepping. The intention is for the kernel to ignore or reject attempts to set annotated module parameters if lockdown is enabled. This applies to options passed on the boot command line, passed to insmod/modprobe or direct twiddling in /sys/module/ parameter files. The module initialisation then needs to handle the parameter not being set, by (1) giving an error, (2) probing for a value or (3) using a reasonable default. What I can't do is just reject a module out of hand because it may take a hardware setting in the module parameters. Some important modules, some ipmi stuff for instance, both probe for hardware and allow hardware to be manually specified; if the driver is aborts with any error, you don't get any ipmi hardware. Further, trying to do this entirely in the module initialisation code doesn't protect against sysfs twiddling. [!] Note that in and of itself, this series of patches should have no effect on the the size of the kernel or code execution - that is left to a patch in the next series to effect. It does mark annotated kernel parameters with a KERNEL_PARAM_FL_HWPARAM flag in an already existing field" * tag 'hwparam-20170420' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (38 commits) Annotate hardware config module parameters in sound/pci/ Annotate hardware config module parameters in sound/oss/ Annotate hardware config module parameters in sound/isa/ Annotate hardware config module parameters in sound/drivers/ Annotate hardware config module parameters in fs/pstore/ Annotate hardware config module parameters in drivers/watchdog/ Annotate hardware config module parameters in drivers/video/ Annotate hardware config module parameters in drivers/tty/ Annotate hardware config module parameters in drivers/staging/vme/ Annotate hardware config module parameters in drivers/staging/speakup/ Annotate hardware config module parameters in drivers/staging/media/ Annotate hardware config module parameters in drivers/scsi/ Annotate hardware config module parameters in drivers/pcmcia/ Annotate hardware config module parameters in drivers/pci/hotplug/ Annotate hardware config module parameters in drivers/parport/ Annotate hardware config module parameters in drivers/net/wireless/ Annotate hardware config module parameters in drivers/net/wan/ Annotate hardware config module parameters in drivers/net/irda/ Annotate hardware config module parameters in drivers/net/hamradio/ Annotate hardware config module parameters in drivers/net/ethernet/ ...
2017-05-09x86/mm/32: Set the '__vmalloc_start_set' flag in initmem_init()Laura Abbott1-0/+1
'__vmalloc_start_set' currently only gets set in initmem_init() when !CONFIG_NEED_MULTIPLE_NODES. This breaks detection of vmalloc address with virt_addr_valid() with CONFIG_NEED_MULTIPLE_NODES=y, causing a kernel crash: [mm/usercopy] 517e1fbeb6: kernel BUG at arch/x86/mm/physaddr.c:78! Set '__vmalloc_start_set' appropriately for that case as well. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Laura Abbott <labbott@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: dc16ecf7fd1f ("x86-32: use specific __vmalloc_start_set flag in __virt_addr_valid") Link: http://lkml.kernel.org/r/1494278596-30373-1-git-send-email-labbott@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-09x86: use set_memory.h headerLaura Abbott5-4/+5
set_memory_* functions have moved to set_memory.h. Switch to this explicitly. Link: http://lkml.kernel.org/r/1488920133-27229-6-git-send-email-labbott@redhat.com Signed-off-by: Laura Abbott <labbott@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08x86/mm: Add support for gbpages to kernel_ident_mapping_init()Xunlei Pang1-1/+13
Kernel identity mappings on x86-64 kernels are created in two ways: by the early x86 boot code, or by kernel_ident_mapping_init(). Native kernels (which is the dominant usecase) use the former, but the kexec and the hibernation code uses kernel_ident_mapping_init(). There's a subtle difference between these two ways of how identity mappings are created, the current kernel_ident_mapping_init() code creates identity mappings always using 2MB page(PMD level) - while the native kernel boot path also utilizes gbpages where available. This difference is suboptimal both for performance and for memory usage: kernel_ident_mapping_init() needs to allocate pages for the page tables when creating the new identity mappings. This patch adds 1GB page(PUD level) support to kernel_ident_mapping_init() to address these concerns. The primary advantage would be better TLB coverage/performance, because we'd utilize 1GB TLBs instead of 2MB ones. It is also useful for machines with large number of memory to save paging structure allocations(around 4MB/TB using 2MB page) when setting identity mappings for all the memory, after using 1GB page it will consume only 8KB/TB. ( Note that this change alone does not activate gbpages in kexec, we are doing that in a separate patch. ) Signed-off-by: Xunlei Pang <xlpang@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: akpm@linux-foundation.org Cc: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/1493862171-8799-1-git-send-email-xlpang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-05x86/mm: Fix boot crash caused by incorrect loop count calculation in ↵Baoquan He1-6/+6
sync_global_pgds() Jeff Moyer reported that on his system with two memory regions 0~64G and 1T~1T+192G, and kernel option "memmap=192G!1024G" added, enabling KASLR will make the system hang intermittently during boot. While adding 'nokaslr' won't. The back trace is: Oops: 0000 [#1] SMP RIP: memcpy_erms() [ .... ] Call Trace: pmem_rw_page() bdev_read_page() do_mpage_readpage() mpage_readpages() blkdev_readpages() __do_page_cache_readahead() force_page_cache_readahead() page_cache_sync_readahead() generic_file_read_iter() blkdev_read_iter() __vfs_read() vfs_read() SyS_read() entry_SYSCALL_64_fastpath() This crash happens because the for loop count calculation in sync_global_pgds() is not correct. When a mapping area crosses PGD entries, we should calculate the starting address of region which next PGD covers and assign it to next for loop count, but not add PGDIR_SIZE directly. The old code works right only if the mapping area is an exact multiple of PGDIR_SIZE, otherwize the end region could be skipped so that it can't be synchronized to all other processes from kernel PGD init_mm.pgd. In Jeff's system, emulated pmem area [1024G, 1216G) is smaller than PGDIR_SIZE. While 'nokaslr' works because PAGE_OFFSET is 1T aligned, it makes this area be mapped inside one PGD entry. With KASLR enabled, this area could cross two PGD entries, then the next PGD entry won't be synced to all other processes. That is why we saw empty PGD. Fix it. Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jinbum Park <jinb.park7@gmail.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1493864747-8506-1-git-send-email-bhe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-02Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds16-216/+561
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "The main x86 MM changes in this cycle were: - continued native kernel PCID support preparation patches to the TLB flushing code (Andy Lutomirski) - various fixes related to 32-bit compat syscall returning address over 4Gb in applications, launched from 64-bit binaries - motivated by C/R frameworks such as Virtuozzo. (Dmitry Safonov) - continued Intel 5-level paging enablement: in particular the conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov) - x86/mpx ABI corner case fixes/enhancements (Joerg Roedel) - ... plus misc updates, fixes and cleanups" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash x86/mm: Fix flush_tlb_page() on Xen x86/mm: Make flush_tlb_mm_range() more predictable x86/mm: Remove flush_tlb() and flush_tlb_current_task() x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly() x86/mm/64: Fix crash in remove_pagetable() Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" x86/boot/e820: Remove a redundant self assignment x86/mm: Fix dump pagetables for 4 levels of page tables x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()" x86/espfix: Add support for 5-level paging x86/kasan: Extend KASAN to support 5-level paging x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y x86/paravirt: Add 5-level support to the paravirt code x86/mm: Define virtual memory map for 5-level paging x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert x86/boot: Detect 5-level paging support x86/mm/numa: Remove numa_nodemask_from_meminfo() ...
2017-05-02Merge branch 'x86-cpu-for-linus' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu updates from Ingo Molnar: "The biggest changes are an extension of the Intel RDT code to extend it with Intel Memory Bandwidth Allocation CPU support: MBA allows bandwidth allocation between cores, while CBM (already upstream) allows CPU cache partitioning. There's also misc smaller fixes and updates" * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/intel_rdt: Return error for incorrect resource names in schemata x86/intel_rdt: Trim whitespace while parsing schemata input x86/intel_rdt: Fix padding when resource is enabled via mount x86/intel_rdt: Get rid of anon union x86/cpu: Keep model defines sorted by model number x86/intel_rdt/mba: Add schemata file support for MBA x86/intel_rdt: Make schemata file parsers resource specific x86/intel_rdt/mba: Add info directory files for Memory Bandwidth Allocation x86/intel_rdt: Make information files resource specific x86/intel_rdt/mba: Add primary support for Memory Bandwidth Allocation (MBA) x86/intel_rdt/mba: Memory bandwith allocation feature detect x86/intel_rdt: Add resource specific msr update function x86/intel_rdt: Move CBM specific data into a struct x86/intel_rdt: Cleanup namespace to support multiple resource types Documentation, x86: Intel Memory bandwidth allocation x86/intel_rdt: Organize code properly x86/intel_rdt: Init padding only if a device exists x86/intel_rdt: Add cpus_list rdtgroup file x86/intel_rdt: Cleanup kernel-doc x86/intel_rdt: Update schemata read to show data in tabular format ...
2017-05-02Merge branch 'x86-boot-for-linus' of ↵Linus Torvalds12-30/+80
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Ingo Molnar: "The biggest changes in this cycle were: - reworking of the e820 code: separate in-kernel and boot-ABI data structures and apply a whole range of cleanups to the kernel side. No change in functionality. - enable KASLR by default: it's used by all major distros and it's out of the experimental stage as well. - ... misc fixes and cleanups" * 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits) x86/KASLR: Fix kexec kernel boot crash when KASLR randomization fails x86/reboot: Turn off KVM when halting a CPU x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup x86: Enable KASLR by default boot/param: Move next_arg() function to lib/cmdline.c for later reuse x86/boot: Fix Sparse warning by including required header file x86/boot/64: Rename start_cpu() x86/xen: Update e820 table handling to the new core x86 E820 code x86/boot: Fix pr_debug() API braindamage xen, x86/headers: Add <linux/device.h> dependency to <asm/xen/page.h> x86/boot/e820: Simplify e820__update_table() x86/boot/e820: Separate the E820 ABI structures from the in-kernel structures x86/boot/e820: Fix and clean up e820_type switch() statements x86/boot/e820: Rename the remaining E820 APIs to the e820__*() prefix x86/boot/e820: Remove unnecessary #include's x86/boot/e820: Rename e820_mark_nosave_regions() to e820__register_nosave_regions() x86/boot/e820: Rename e820_reserve_resources*() to e820__reserve_resources*() x86/boot/e820: Use bool in query APIs x86/boot/e820: Document e820__reserve_setup_data() x86/boot/e820: Clean up __e820__update_table() et al ...
2017-04-26x86/mm: Fix flush_tlb_page() on XenAndy Lutomirski1-3/+1
flush_tlb_page() passes a bogus range to flush_tlb_others() and expects the latter to fix it up. native_flush_tlb_others() has the fixup but Xen's version doesn't. Move the fixup to flush_tlb_others(). AFAICS the only real effect is that, without this fix, Xen would flush everything instead of just the one page on remote vCPUs in when flush_tlb_page() was called. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: e7b52ffd45a6 ("x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range") Link: http://lkml.kernel.org/r/10ed0e4dfea64daef10b87fb85df1746999b4dba.1492844372.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26x86/mm: Make flush_tlb_mm_range() more predictableAndy Lutomirski1-5/+7
I'm about to rewrite the function almost completely, but first I want to get a functional change out of the way. Currently, if flush_tlb_mm_range() does not flush the local TLB at all, it will never do individual page flushes on remote CPUs. This seems to be an accident, and preserving it will be awkward. Let's change it first so that any regressions in the rewrite will be easier to bisect and so that the rewrite can attempt to change no visible behavior at all. The fix is simple: we can simply avoid short-circuiting the calculation of base_pages_to_flush. As a side effect, this also eliminates a potential corner case: if tlb_single_page_flush_ceiling == TLB_FLUSH_ALL, flush_tlb_mm_range() could have ended up flushing the entire address space one page at a time. Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/4b29b771d9975aad7154c314534fec235618175a.1492844372.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26x86/mm: Remove flush_tlb() and flush_tlb_current_task()Andy Lutomirski1-17/+0
I was trying to figure out what how flush_tlb_current_task() would possibly work correctly if current->mm != current->active_mm, but I realized I could spare myself the effort: it has no callers except the unused flush_tlb() macro. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/e52d64c11690f85e9f1d69d7b48cc2269cd2e94b.1492844372.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26x86/mm/64: Fix crash in remove_pagetable()Kirill A. Shutemov1-3/+3
remove_pagetable() does page walk using p*d_page_vaddr() plus cast. It's not canonical approach -- we usually use p*d_offset() for that. It works fine as long as all page table levels are present. We broke the invariant by introducing folded p4d page table level. As result, remove_pagetable() interprets PMD as PUD and it leads to crash: BUG: unable to handle kernel paging request at ffff880300000000 IP: memchr_inv+0x60/0x110 PGD 317d067 P4D 317d067 PUD 3180067 PMD 33f102067 PTE 8000000300000060 Let's fix this by using p*d_offset() instead of p*d_page_vaddr() for page walk. Reported-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Fixes: f2a6a7050109 ("x86: Convert the rest of the code to support p4d_t") Link: http://lkml.kernel.org/r/20170425092557.21852-1-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() ↵Ingo Molnar2-1/+497
implementation" This reverts commit 2947ba054a4dabbd82848728d765346886050029. Dan Williams reported dax-pmem kernel warnings with the following signature: WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200 percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic ... and bisected it to this commit, which suggests possible memory corruption caused by the x86 fast-GUP conversion. He also pointed out: " This is similar to the backtrace when we were not properly handling pud faults and was fixed with this commit: 220ced1676c4 "mm: fix get_user_pages() vs device-dax pud mappings" I've found some missing _devmap checks in the generic get_user_pages_fast() path, but this does not fix the regression [...] " So given that there are known bugs, and a pretty robust looking bisection points to this commit suggesting that are unknown bugs in the conversion as well, revert it for the time being - we'll re-try in v4.13. Reported-by: Dan Williams <dan.j.williams@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: aneesh.kumar@linux.vnet.ibm.com Cc: dann.frazier@canonical.com Cc: dave.hansen@intel.com Cc: steve.capper@linaro.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-13x86/mm: Fix dump pagetables for 4 levels of page tablesJuergen Gross1-1/+1
Commit fdd3d8ce0ea62 ("x86/dump_pagetables: Add support for 5-level paging") introduced an error for dumping with only 4 levels by setting PGD_LEVEL_MULT to a wrong value. This is leading to e.g. addresses printed as "(null)" for ranges: x86/mm: Found insecure W+X mapping at address (null)/(null) Make PGD_LEVEL_MULT a multiple of PTRS_PER_P4D instead of PTRS_PER_PUD Fixes: fdd3d8ce0ea62 ("x86/dump_pagetables: Add support for 5-level paging") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20170412143634.6846-1-jgross@suse.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-12mm: Tighten x86 /dev/mem with zeroing readsKees Cook1-11/+30
Under CONFIG_STRICT_DEVMEM, reading System RAM through /dev/mem is disallowed. However, on x86, the first 1MB was always allowed for BIOS and similar things, regardless of it actually being System RAM. It was possible for heap to end up getting allocated in low 1MB RAM, and then read by things like x86info or dd, which would trip hardened usercopy: usercopy: kernel memory exposure attempt detected from ffff880000090000 (dma-kmalloc-256) (4096 bytes) This changes the x86 exception for the low 1MB by reading back zeros for System RAM areas instead of blindly allowing them. More work is needed to extend this to mmap, but currently mmap doesn't go through usercopy, so hardened usercopy won't Oops the kernel. Reported-by: Tommi Rantala <tommi.t.rantala@nokia.com> Tested-by: Tommi Rantala <tommi.t.rantala@nokia.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2017-04-12x86/mpx: Correctly report do_mpx_bt_fault() failures to user-spaceJoerg Roedel1-9/+1
When this function fails it just sends a SIGSEGV signal to user-space using force_sig(). This signal is missing essential information about the cause, e.g. the trap_nr or an error code. Fix this by propagating the error to the only caller of mpx_handle_bd_fault(), do_bounds(), which sends the correct SIGSEGV signal to the process. Signed-off-by: Joerg Roedel <jroedel@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: fe3d197f84319 ('x86, mpx: On-demand kernel allocation of bounds tables') Link: http://lkml.kernel.org/r/1491488362-27198-1-git-send-email-joro@8bytes.org Signed-off-by: Ingo Molnar <mingo@kernel.org>