summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2013-02-01Merge remote-tracking branch 'origin/x86/mm' into x86/mm2H. Peter Anvin29-287/+199
Explicitly merging these two branches due to nontrivial conflicts and to allow further work. Resolved Conflicts: arch/x86/kernel/head32.c arch/x86/kernel/head64.c arch/x86/mm/init_64.c arch/x86/realmode/init.c Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-02-01x86-32, mm: Remove reference to alloc_remap()H. Peter Anvin1-18/+11
We have removed the remap allocator for x86-32, and x86-64 never had it (and doesn't need it). Remove residual reference to it. Reported-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/CAE9FiQVn6_QZi3fNQ-JHYiR-7jeDJ5hT0SyT_%2BzVvfOj=PzF3w@mail.gmail.com
2013-02-01x86-32, mm: Remove reference to resume_map_numa_kva()H. Peter Anvin2-8/+0
Remove reference to removed function resume_map_numa_kva(). Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com
2013-02-01x86-32, mm: Rip out x86_32 NUMA remapping codeDave Hansen4-174/+0
This code was an optimization for 32-bit NUMA systems. It has probably been the cause of a number of subtle bugs over the years, although the conditions to excite them would have been hard to trigger. Essentially, we remap part of the kernel linear mapping area, and then sometimes part of that area gets freed back in to the bootmem allocator. If those pages get used by kernel data structures (say mem_map[] or a dentry), there's no big deal. But, if anyone ever tried to use the linear mapping for these pages _and_ cared about their physical address, bad things happen. For instance, say you passed __GFP_ZERO to the page allocator and then happened to get handed one of these pages, it zero the remapped page, but it would make a pte to the _old_ page. There are probably a hundred other ways that it could screw with things. We don't need to hang on to performance optimizations for these old boxes any more. All my 32-bit NUMA systems are long dead and buried, and I probably had access to more than most people. This code is causing real things to break today: https://lkml.org/lkml/2013/1/9/376 I looked in to actually fixing this, but it requires surgery to way too much brittle code, as well as stuff like per_cpu_ptr_to_phys(). [ hpa: Cc: this for -stable, since it is a memory corruption issue. However, an alternative is to simply mark NUMA as depends BROKEN rather than EXPERIMENTAL in the X86_32 subclause... ] Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@vger.kernel.org>
2013-01-31x86/numa: Use __pa_nodebug() insteadBorislav Petkov1-1/+1
... and fix the following warning: arch/x86/mm/numa.c: In function ‘setup_node_data’: arch/x86/mm/numa.c:222:3: warning: passing argument 1 of ‘__phys_addr_nodebug’ makes integer from pointer without a cast Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1359245901-8512-1-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-30x86: Don't panic if can not alloc buffer for swiotlbYinghai Lu4-21/+35
Normal boot path on system with iommu support: swiotlb buffer will be allocated early at first and then try to initialize iommu, if iommu for intel or AMD could setup properly, swiotlb buffer will be freed. The early allocating is with bootmem, and could panic when we try to use kdump with buffer above 4G only, or with memmap to limit mem under 4G. for example: memmap=4095M$1M to remove memory under 4G. According to Eric, add _nopanic version and no_iotlb_memory to fail map single later if swiotlb is still needed. -v2: don't pass nopanic, and use -ENOMEM return value according to Eric. panic early instead of using swiotlb_full to panic...according to Eric/Konrad. -v3: make swiotlb_init to be notpanic, but will affect: arm64, ia64, powerpc, tile, unicore32, x86. -v4: cleanup swiotlb_init by removing swiotlb_init_with_default_size. Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-36-git-send-email-yinghai@kernel.org Reviewed-and-tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Andrzej Pietrasiewicz <andrzej.p@samsung.com> Cc: linux-mips@linux-mips.org Cc: xen-devel@lists.xensource.com Cc: virtualization@lists.linux-foundation.org Cc: Shuah Khan <shuahkhan@gmail.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30mm: Add alloc_bootmem_low_pages_nopanic()Yinghai Lu3-0/+21
We don't need to panic in some case, like for swiotlb preallocating. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-35-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, 64bit, mm: hibernate use generic mapping_initYinghai Lu1-44/+22
We should set mappings only for usable memory ranges under max_pfn Otherwise causes same problem that is fixed by x86, mm: Only direct map addresses that are marked as E820_RAM Make it only map range in pfn_mapped array. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-34-git-send-email-yinghai@kernel.org Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: linux-pm@vger.kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, 64bit, mm: Mark data/bss/brk to nxYinghai Lu1-3/+4
HPA said, we should not have RW and +x set at the time. for kernel layout: [ 0.000000] Kernel Layout: [ 0.000000] .text: [0x01000000-0x021434f8] [ 0.000000] .rodata: [0x02200000-0x02a13fff] [ 0.000000] .data: [0x02c00000-0x02dc763f] [ 0.000000] .init: [0x02dc9000-0x0312cfff] [ 0.000000] .bss: [0x0313b000-0x03dd6fff] [ 0.000000] .brk: [0x03dd7000-0x03dfffff] before the patch, we have ---[ High Kernel Mapping ]--- 0xffffffff80000000-0xffffffff81000000 16M pmd 0xffffffff81000000-0xffffffff82200000 18M ro PSE GLB x pmd 0xffffffff82200000-0xffffffff82c00000 10M ro PSE GLB NX pmd 0xffffffff82c00000-0xffffffff82dc9000 1828K RW GLB x pte 0xffffffff82dc9000-0xffffffff82e00000 220K RW GLB NX pte 0xffffffff82e00000-0xffffffff83000000 2M RW PSE GLB NX pmd 0xffffffff83000000-0xffffffff8313a000 1256K RW GLB NX pte 0xffffffff8313a000-0xffffffff83200000 792K RW GLB x pte 0xffffffff83200000-0xffffffff83e00000 12M RW PSE GLB x pmd 0xffffffff83e00000-0xffffffffa0000000 450M pmd after patch,, we get ---[ High Kernel Mapping ]--- 0xffffffff80000000-0xffffffff81000000 16M pmd 0xffffffff81000000-0xffffffff82200000 18M ro PSE GLB x pmd 0xffffffff82200000-0xffffffff82c00000 10M ro PSE GLB NX pmd 0xffffffff82c00000-0xffffffff82e00000 2M RW GLB NX pte 0xffffffff82e00000-0xffffffff83000000 2M RW PSE GLB NX pmd 0xffffffff83000000-0xffffffff83200000 2M RW GLB NX pte 0xffffffff83200000-0xffffffff83e00000 12M RW PSE GLB NX pmd 0xffffffff83e00000-0xffffffffa0000000 450M pmd so data, bss, brk get NX ... Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-33-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86: Merge early kernel reserve for 32bit and 64bitYinghai Lu3-18/+9
They are the same, and we could move them out from head32/64.c to setup.c. We are using memblock, and it could handle overlapping properly, so we don't need to reserve some at first to hold the location, and just need to make sure we reserve them before we are using memblock to find free mem to use. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-32-git-send-email-yinghai@kernel.org Cc: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86: Add Crash kernel low reservationYinghai Lu4-7/+75
During kdump kernel's booting stage, it need to find low ram for swiotlb buffer when system does not support intel iommu/dmar remapping. kexed-tools is appending memmap=exactmap and range from /proc/iomem with "Crash kernel", and that range is above 4G for 64bit after boot protocol 2.12. We need to add another range in /proc/iomem like "Crash kernel low", so kexec-tools could find that info and append to kdump kernel command line. Try to reserve some under 4G if the normal "Crash kernel" is above 4G. User could specify the size with crashkernel_low=XX[KMG]. -v2: fix warning that is found by Fengguang's test robot. -v3: move out get_mem_size change to another patch, to solve compiling warning that is found by Borislav Petkov <bp@alien8.de> -v4: user must specify crashkernel_low if system does not support intel or amd iommu. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-31-git-send-email-yinghai@kernel.org Cc: Eric Biederman <ebiederm@xmission.com> Cc: Rob Landley <rob@landley.net> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, kdump: Remove crashkernel range find limit for 64bitYinghai Lu1-3/+1
Now kexeced kernel/ramdisk could be above 4g, so remove 896 limit for 64bit. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-30-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30memblock: Add memblock_mem_size()Yinghai Lu3-15/+19
Use it to get mem size under the limit_pfn. to replace local version in x86 reserved_initrd. -v2: remove not needed cast that is pointed out by HPA. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-29-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, boot: Not need to check setup_header version for setup_dataYinghai Lu1-6/+0
That is for bootloaders. setup_data is in setup_header, and bootloader is copying that from bzImage. So for old bootloader should keep that as 0 already. old kexec-tools till now for elf image set setup_data to 0, so it is ok. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-28-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, boot: Update comments about entries for 64bit imageYinghai Lu2-9/+51
Now 64bit entry is fixed on 0x200, can not be changed anymore. Update the comments to reflect that. Also put info about it in boot.txt -v2: fix some grammar error Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-27-git-send-email-yinghai@kernel.org Cc: Rob Landley <rob@landley.net> Cc: Matt Fleming <matt.fleming@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, boot: Support loading bzImage, boot_params and ramdisk above 4GYinghai Lu4-1/+17
xloadflags bit 1 indicates that we can load the kernel and all data structures above 4G; it is set if kernel is relocatable and 64bit. bootloader will check if xloadflags bit 1 is set to decide if it could load ramdisk and kernel high above 4G. bootloader will fill value to ext_ramdisk_image/size for high 32bits when it load ramdisk above 4G. kernel use get_ramdisk_image/size to use ext_ramdisk_image/size to get right positon for ramdisk. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Rob Landley <rob@landley.net> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Gokul Caushik <caushik1@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joe Millenbach <jmillenbach@gmail.com> Link: http://lkml.kernel.org/r/1359058816-7615-26-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, kexec, 64bit: Only set ident mapping for ram.Yinghai Lu3-6/+15
We should set mappings only for usable memory ranges under max_pfn Otherwise causes same problem that is fixed by x86, mm: Only direct map addresses that are marked as E820_RAM This patch exposes pfn_mapped array, and only sets ident mapping for ranges in that array. This patch relies on new kernel_ident_mapping_init that could handle existing pgd/pud between different calls. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-25-git-send-email-yinghai@kernel.org Cc: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, kexec: Replace ident_mapping_init and init_level4_pageYinghai Lu1-135/+26
Now ident_mapping_init is checking if pgd/pud is present for every 2M, so several 2Ms are in same PUD, it will keep checking if pud is there with same pud. init_level4_page just does not check existing pgd/pud. We could use generic mapping_init with different settings in info to replace those two local grown version functions. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-24-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, kexec: Set ident mapping for kernel that is above max_pfnYinghai Lu1-6/+37
When first kernel is booted with memmap= or mem= to limit max_pfn. kexec can load second kernel above that max_pfn. We need to set ident mapping for whole image in this case instead of just for first 2M. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-23-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, kexec: Remove 1024G limitation for kexec buffer on 64bitYinghai Lu1-3/+3
Now 64bit kernel supports more than 1T ram and kexec tools could find buffer above 1T, remove that obsolete limitation. and use MAXMEM instead. Tested on system with more than 1024G ram. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-22-git-send-email-yinghai@kernel.org Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, boot: Move lldt/ltr out of 64bit code sectionYinghai Lu1-3/+6
commit 08da5a2ca x86_64: Early segment setup for VT sets up LDT and TR into a valid state in order to speed up boot decompression under VT. Those code are put in code64, and it is using GDT that is only loaded from code32 path. That breaks booting with 64bit bootloader that does not go through code32 path and jump to startup_64 directly, and it has different GDT. Move those lines into code32 after their GDT is loaded. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-21-git-send-email-yinghai@kernel.org Cc: Zachary Amsden <zamsden@gmail.com> Cc: Matt Fleming <matt.fleming@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, boot: Move verify_cpu.S and no_longmode downYinghai Lu1-8/+9
We need to move some code to 32bit section in following patch: x86, boot: Move lldt/ltr out of 64bit code section but that will push startup_64 down from 0x200. According to hpa, we can not change startup_64 position and that is an ABI. We could move function verify_cpu and no_longmode down, because verify_cpu is used via function call and no_longmode will not return, then we don't need to add extra code for jumping back. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-20-git-send-email-yinghai@kernel.org Cc: Matt Fleming <matt.fleming@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, boot: Pass cmd_line_ptr with unsigned long insteadYinghai Lu2-6/+6
boot/compressed/misc.c is used for bzImage in 64bit and 32bit, and cmd_line_ptr could point to buffer that is above 4g, cmd_line_ptr should be 64bit otherwise high 32bit will be capped out. So need to change data type to unsigned long, that will be 64bit get correct address of command line buffer. And it is still ok with 32bit bzImage, because unsigned long on 32bit kernel is still 32bit. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-19-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, boot: Move checking of cmd_line_ptr out of common pathYinghai Lu2-6/+16
cmdline.c::__cmdline_find_option... are shared between 16-bit setup code and 32/64 bit decompressor code. for 32/64 only path via kexec, we should not check if ptr is less 1M. as those cmdline could be put above 1M, or even 4G. Move out accessible checking out of __cmdline_find_option() So decompressor in misc.c can parse cmdline correctly. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-18-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, boot: Add get_cmd_line_ptr()Yinghai Lu2-4/+19
Add an accessor function for the command line address. Later we will add support for holding a 64-bit address via ext_cmd_line_ptr. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-17-git-send-email-yinghai@kernel.org Cc: Gokul Caushik <caushik1@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joe Millenbach <jmillenbach@gmail.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86: Add get_ramdisk_image/size()Yinghai Lu1-8/+21
There are several places to find ramdisk information early for reserving and relocating. Use accessor functions to make code more readable and consistent. Later will add ext_ramdisk_image/size in those functions to support loading ramdisk above 4g. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-16-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86: Merge early_reserve_initrd for 32bit and 64bitYinghai Lu3-26/+18
They are the same, could move them out from head32/64.c to setup.c. We are using memblock, and it could handle overlapping properly, so we don't need to reserve some at first to hold the location, and just need to make sure we reserve them before we are using memblock to find free mem to use. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-15-git-send-email-yinghai@kernel.org Reviewed-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, 64bit: Don't set max_pfn_mapped wrong value early on native pathYinghai Lu3-4/+11
We are not having max_pfn_mapped set correctly until init_memory_mapping. So don't print its initial value for 64bit Also need to use KERNEL_IMAGE_SIZE directly for highmap cleanup. -v2: update comments about max_pfn_mapped according to Stefano Stabellini. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-14-git-send-email-yinghai@kernel.org Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, 64bit: #PF handler set page to cover only 2M per #PFYinghai Lu1-17/+25
We only map a single 2 MiB page per #PF, even though we should be able to do this a full gigabyte at a time with no additional memory cost. This is a workaround for a broken AMD reference BIOS (and its derivatives in shipping system) which maps a large chunk of memory as WB in the MTRR system but will #MC if the processor wanders off and tries to prefetch that memory, which can happen any time the memory is mapped in the TLB. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-13-git-send-email-yinghai@kernel.org Cc: Alexander Duyck <alexander.h.duyck@intel.com> [ hpa: rewrote the patch description ] Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, 64bit: Use a #PF handler to materialize early mappings on demandH. Peter Anvin7-91/+219
Linear mode (CR0.PG = 0) is mutually exclusive with 64-bit mode; all 64-bit code has to use page tables. This makes it awkward before we have first set up properly all-covering page tables to access objects that are outside the static kernel range. So far we have dealt with that simply by mapping a fixed amount of low memory, but that fails in at least two upcoming use cases: 1. We will support load and run kernel, struct boot_params, ramdisk, command line, etc. above the 4 GiB mark. 2. need to access ramdisk early to get microcode to update that as early possible. We could use early_iomap to access them too, but it will make code to messy and hard to be unified with 32 bit. Hence, set up a #PF table and use a fixed number of buffers to set up page tables on demand. If the buffers fill up then we simply flush them and start over. These buffers are all in __initdata, so it does not increase RAM usage at runtime. Thus, with the help of the #PF handler, we can set the final kernel mapping from blank, and switch to init_level4_pgt later. During the switchover in head_64.S, before #PF handler is available, we use three pages to handle kernel crossing 1G, 512G boundaries with sharing page by playing games with page aliasing: the same page is mapped twice in the higher-level tables with appropriate wraparound. The kernel region itself will be properly mapped; other mappings may be spurious. early_make_pgtable is using kernel high mapping address to access pages to set page table. -v4: Add phys_base offset to make kexec happy, and add init_mapping_kernel() - Yinghai -v5: fix compiling with xen, and add back ident level3 and level2 for xen also move back init_level4_pgt from BSS to DATA again. because we have to clear it anyway. - Yinghai -v6: switch to init_level4_pgt in init_mem_mapping. - Yinghai -v7: remove not needed clear_page for init_level4_page it is with fill 512,8,0 already in head_64.S - Yinghai -v8: we need to keep that handler alive until init_mem_mapping and don't let early_trap_init to trash that early #PF handler. So split early_trap_pf_init out and move it down. - Yinghai -v9: switchover only cover kernel space instead of 1G so could avoid touch possible mem holes. - Yinghai -v11: change far jmp back to far return to initial_code, that is needed to fix failure that is reported by Konrad on AMD systems. - Yinghai Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-12-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, realmode: Separate real_mode reserve and setupYinghai Lu3-14/+25
After we switch to use #PF handler help to set page table, init_level4_pgt will only have entries set after init_mem_mapping(). We need to move copying init_level4_pgt to trampoline_pgd after that. So split reserve and setup, and move the setup after init_mem_mapping() Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-11-git-send-email-yinghai@kernel.org Cc: Jarkko Sakkinen <jarkko.sakkinen@intel.com> Acked-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, 64bit, realmode: Use init_level4_pgt to set trampoline_pgd directlyYinghai Lu1-2/+2
with #PF handler way to set early page table, level3_ident will go away with 64bit native path. So just use entries in init_level4_pgt to set them in trampoline_pgd. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-10-git-send-email-yinghai@kernel.org Cc: Jarkko Sakkinen <jarkko.sakkinen@intel.com> Acked-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, 64bit: Copy struct boot_params earlyYinghai Lu1-1/+5
We want to support struct boot_params (formerly known as the zero-page, or real-mode data) above the 4 GiB mark. We will have #PF handler to set page table for not accessible ram early, but want to limit it before x86_64_start_reservations to limit the code change to native path only. Also we will need the ramdisk info in struct boot_params to access the microcode blob in ramdisk in x86_64_start_kernel, so copy struct boot_params early makes it accessing ramdisk info simple. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-9-git-send-email-yinghai@kernel.org Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, 64bit, mm: Add generic kernel/ident mapping helperYinghai Lu2-0/+83
It is simple version for kernel_physical_mapping_init. it will work to build one page table that will be used later. Use mapping_info to control 1. alloc_pg_page method 2. if PMD is EXEC, 3. if pgd is with kernel low mapping or ident mapping. Will use to replace some local versions in kexec, hibernation and etc. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-8-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, realmode: Set real_mode permissions earlyYinghai Lu1-5/+6
Trampoline code is executed by APs with kernel low mapping on 64bit. We need to set trampoline code to EXEC early before we boot APs. Found the problem after switching to #PF handler set page table, and we do not set initial kernel low mapping with EXEC anymore in arch/x86/kernel/head_64.S. Change to use early_initcall instead that will make sure trampoline will have EXEC set. -v2: Merge two comments according to Borislav Petkov <bp@alien8.de> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-7-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, 64bit, mm: Make pgd next calculation consistent with pud/pmdYinghai Lu1-4/+2
Just like the way we calculate next for pud and pmd, aka round down and add size. Also, do not do boundary-checking with 'next', and just pass 'end' down to phys_pud_init() instead. Because the loop in phys_pud_init() stops at PTRS_PER_PUD and thus can handle a possibly bigger 'end' properly. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-6-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86: Factor out e820_add_kernel_range()Yinghai Lu1-14/+22
Separate out the reservation of the kernel static memory areas into a separate function. Also add support for case when memmap=xxM$yyM is used without exactmap. Need to remove reserved range at first before we add E820_RAM range, otherwise added E820_RAM range will be ignored. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-5-git-send-email-yinghai@kernel.org Cc: Jacob Shin <jacob.shin@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30x86, mm: Fix page table early allocation offset checkingYinghai Lu1-1/+12
During debugging loading kernel above 4G, found that one page is not used in pre-allocated BRK area for early page allocation. pgt_buf_top is address that can not be used, so should check if that new end is above that top, otherwise last page will not be used. Fix that checking and also add print out for allocation from pre-allocated BRK area to catch possible bugs later. But after we get back that page for pgt, it tiggers one bug in pgt allocation with xen: We need to avoid to use page as pgt to map range that is overlapping with that pgt page. Add checking about overlapping, when it happens, use memblock allocation instead. That fixes crash on Xen PV guest with 2G that Stefan found. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-2-git-send-email-yinghai@kernel.org Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Tested-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30Merge remote-tracking branch 'origin/x86/boot' into x86/mm2H. Peter Anvin11256-298590/+522950
Coming patches to x86/mm2 require the changes and advanced baseline in x86/boot. Resolved Conflicts: arch/x86/kernel/setup.c mm/nobootmem.c Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29x86, boot: Sanitize boot_params if not zeroed on creationH. Peter Anvin5-0/+46
Use the new sentinel field to detect bootloaders which fail to follow protocol and don't initialize fields in struct boot_params that they do not explicitly initialize to zero. Based on an original patch and research by Yinghai Lu. Changed by hpa to be invoked both in the decompression path and in the kernel proper; the latter for the case where a bootloader takes over decompression. Originally-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-26-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-28x86, boot: Define the 2.12 bzImage boot protocolH. Peter Anvin5-29/+106
Define the 2.12 bzImage boot protocol: add xloadflags and additional fields to allow the command line, initramfs and struct boot_params to live above the 4 GiB mark. The xloadflags now communicates if this is a 64-bit kernel with the legacy 64-bit entry point and which of the EFI handover entry points are supported. Avoid adding new read flags to loadflags because of claimed bootloaders testing the whole byte for == 1 to determine bzImageness at least until the issue can be researched further. This is based on patches by Yinghai Lu and David Woodhouse. Originally-by: Yinghai Lu <yinghai@kernel.org> Originally-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Acked-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/1359058816-7615-26-git-send-email-yinghai@kernel.org Cc: Rob Landley <rob@landley.net> Cc: Gokul Caushik <caushik1@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joe Millenbach <jmillenbach@gmail.com>
2013-01-27x86/boot: Fix minor fd leakage in tools/relocs.cCong Ding1-2/+4
The opened file should be closed. Signed-off-by: Cong Ding <dinggnu@gmail.com> Cc: Kusanagi Kouichi <slash@ac.auone-net.jp> Cc: Jarkko Sakkinen <jarkko.sakkinen@intel.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1358183628-27784-1-git-send-email-dinggnu@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-26x86, kvm: Fix kvm's use of __pa() on percpu areasDave Hansen2-6/+7
In short, it is illegal to call __pa() on an address holding a percpu variable. This replaces those __pa() calls with slow_virt_to_phys(). All of the cases in this patch are in boot time (or CPU hotplug time at worst) code, so the slow pagetable walking in slow_virt_to_phys() is not expected to have a performance impact. The times when this actually matters are pretty obscure (certain 32-bit NUMA systems), but it _does_ happen. It is important to keep KVM guests working on these systems because the real hardware is getting harder and harder to find. This bug manifested first by me seeing a plain hang at boot after this message: CPU 0 irqstacks, hard=f3018000 soft=f301a000 or, sometimes, it would actually make it out to the console: [ 0.000000] BUG: unable to handle kernel paging request at ffffffff I eventually traced it down to the KVM async pagefault code. This can be worked around by disabling that code either at compile-time, or on the kernel command-line. The kvm async pagefault code was injecting page faults in to the guest which the guest misinterpreted because its "reason" was not being properly sent from the host. The guest passes a physical address of an per-cpu async page fault structure via an MSR to the host. Since __pa() is broken on percpu data, the physical address it sent was bascially bogus and the host went scribbling on random data. The guest never saw the real reason for the page fault (it was injected by the host), assumed that the kernel had taken a _real_ page fault, and panic()'d. The behavior varied, though, depending on what got corrupted by the bad write. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20130122212435.4905663F@kernel.stglabs.ibm.com Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-26x86, mm: Create slow_virt_to_phys()Dave Hansen2-0/+32
This is necessary because __pa() does not work on some kinds of memory, like vmalloc() or the alloc_remap() areas on 32-bit NUMA systems. We have some functions to do conversions _like_ this in the vmalloc() code (like vmalloc_to_page()), but they do not work on sizes other than 4k pages. We would potentially need to be able to handle all the page sizes that we use for the kernel linear mapping (4k, 2M, 1G). In practice, on 32-bit NUMA systems, the percpu areas get stuck in the alloc_remap() area. Any __pa() call on them will break and basically return garbage. This patch introduces a new function slow_virt_to_phys(), which walks the kernel page tables on x86 and should do precisely the same logical thing as __pa(), but actually work on a wider range of memory. It should work on the normal linear mapping, vmalloc(), kmap(), etc... Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20130122212433.4D1FCA62@kernel.stglabs.ibm.com Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-26x86, mm: Use new pagetable helpers in try_preserve_large_page()Dave Hansen1-7/+4
try_preserve_large_page() can be slightly simplified by using the new page_level_*() helpers. This also moves the 'level' over to the new pg_level enum type. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20130122212432.14F3D993@kernel.stglabs.ibm.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-26x86, mm: Pagetable level size/shift/mask helpersDave Hansen2-1/+15
I plan to use lookup_address() to walk the kernel pagetables in a later patch. It returns a "pte" and the level in the pagetables where the "pte" was found. The level is just an enum and needs to be converted to a useful value in order to do address calculations with it. These helpers will be used in at least two places. This also gives the anonymous enum a real name so that no one gets confused about what they should be passing in to these helpers. "PTE_SHIFT" was chosen for naming consistency with the other pagetable levels (PGD/PUD/PMD_SHIFT). Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20130122212431.405D3A8C@kernel.stglabs.ibm.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-26x86, mm: Make DEBUG_VIRTUAL work earlier in bootDave Hansen3-4/+11
The KVM code has some repeated bugs in it around use of __pa() on per-cpu data. Those data are not in an area on which using __pa() is valid. However, they are also called early enough in boot that __vmalloc_start_set is not set, and thus the CONFIG_DEBUG_VIRTUAL debugging does not catch them. This adds a check to also verify __pa() calls against max_low_pfn, which we can use earler in boot than is_vmalloc_addr(). However, if we are super-early in boot, max_low_pfn=0 and this will trip on every call, so also make sure that max_low_pfn is set before we try to use it. With this patch applied, CONFIG_DEBUG_VIRTUAL will actually catch the bug I was chasing (and fix later in this series). I'd love to find a generic way so that any __pa() call on percpu areas could do a BUG_ON(), but there don't appear to be any nice and easy ways to check if an address is a percpu one. Anybody have ideas on a way to do this? Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20130122212430.F46F8159@kernel.stglabs.ibm.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-26Merge tag 'v3.8-rc5' into x86/mmH. Peter Anvin11302-299474/+525253
The __pa() fixup series that follows touches KVM code that is not present in the existing branch based on v3.7-rc5, so merge in the current upstream from Linus. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-26Merge branch 'x86/mm' of ↵H. Peter Anvin3-3/+8
ssh://ra.kernel.org/pub/scm/linux/kernel/git/tip/tip into x86/mm Add missing patch from the __pa_symbol conversion series by Alexander Duyck. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-25Linux 3.8-rc5v3.8-rc5Linus Torvalds1-1/+1