summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2015-08-25x86/ldt: Correct LDT access in single stepping logicJuergen Gross1-2/+2
commit 136d9d83c07c5e30ac49fc83b27e8c4842f108fc upstream. Commit 37868fe113ff ("x86/ldt: Make modify_ldt synchronous") introduced a new struct ldt_struct anchored at mm->context.ldt. convert_ip_to_linear() was changed to reflect this, but indexing into the ldt has to be changed as the pointer is no longer void *. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@suse.de Link: http://lkml.kernel.org/r/1438848278-12906-1-git-send-email-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-08-25x86/ldt: Make modify_ldt synchronousAndy Lutomirski9-150/+208
commit 37868fe113ff2ba814b3b4eb12df214df555f8dc upstream. modify_ldt() has questionable locking and does not synchronize threads. Improve it: redesign the locking and synchronize all threads' LDTs using an IPI on all modifications. This will dramatically slow down modify_ldt in multithreaded programs, but there shouldn't be any multithreaded programs that care about modify_ldt's performance in the first place. This fixes some fallout from the CVE-2015-5157 fixes. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/4c6978476782160600471bd865b318db34c7b628.1438291540.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-08-25arch: Introduce smp_load_acquire(), smp_store_release()Peter Zijlstra1-1/+42
commit 47933ad41a86a4a9b50bed7c9b9bd2ba242aac63 upstream. A number of situations currently require the heavyweight smp_mb(), even though there is no need to order prior stores against later loads. Many architectures have much cheaper ways to handle these situations, but the Linux kernel currently has no portable way to make use of them. This commit therefore supplies smp_load_acquire() and smp_store_release() to remedy this situation. The new smp_load_acquire() primitive orders the specified load against any subsequent reads or writes, while the new smp_store_release() primitive orders the specifed store against any prior reads or writes. These primitives allow array-based circular FIFOs to be implemented without an smp_mb(), and also allow a theoretical hole in rcu_assign_pointer() to be closed at no additional expense on most architectures. In addition, the RCU experience transitioning from explicit smp_read_barrier_depends() and smp_wmb() to rcu_dereference() and rcu_assign_pointer(), respectively resulted in substantial improvements in readability. It therefore seems likely that replacing other explicit barriers with smp_load_acquire() and smp_store_release() will provide similar benefits. It appears that roughly half of the explicit barriers in core kernel code might be so replaced. [Changelog by PaulMck] Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Michael Ellerman <michael@ellerman.id.au> Cc: Michael Neuling <mikey@neuling.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Victor Kaplansky <VICTORK@il.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20131213150640.908486364@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-08-25x86/nmi/64: Switch stacks on userspace NMI entryAndy Lutomirski1-4/+61
commit 9b6e6a8334d56354853f9c255d1395c2ba570e0a upstream. Returning to userspace is tricky: IRET can fail, and ESPFIX can rearrange the stack prior to IRET. The NMI nesting fixup relies on a precise stack layout and atomic IRET. Rather than trying to teach the NMI nesting fixup to handle ESPFIX and failed IRET, punt: run NMIs that came from user mode on the normal kernel stack. This will make some nested NMIs visible to C code, but the C code is okay with that. As a side effect, this should speed up perf: it eliminates an RDMSR when NMIs come from user mode. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-08-25x86/nmi/64: Remove asm code that saves CR2Andy Lutomirski1-18/+0
commit 0e181bb58143cb4a2e8f01c281b0816cd0e4798e upstream. Now that do_nmi saves CR2, we don't need to save it in asm. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-08-25x86/nmi: Enable nested do_nmi() handling for 64-bit kernelsAndy Lutomirski1-71/+52
commit 9d05041679904b12c12421cbcf9cb5f4860a8d7b upstream. 32-bit kernels handle nested NMIs in C. Enable the exact same handling on 64-bit kernels as well. This isn't currently necessary, but it will become necessary once the asm code starts allowing limited nesting. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-08-19x86/xen: Probe target addresses in set_aliased_prot() before the hypercallAndy Lutomirski1-0/+40
commit aa1acff356bbedfd03b544051f5b371746735d89 upstream. The update_va_mapping hypercall can fail if the VA isn't present in the guest's page tables. Under certain loads, this can result in an OOPS when the target address is in unpopulated vmap space. While we're at it, add comments to help explain what's going on. This isn't a great long-term fix. This code should probably be changed to use something like set_memory_ro. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <dvrabel@cantab.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: security@kernel.org <security@kernel.org> Cc: xen-devel <xen-devel@lists.xen.org> Link: http://lkml.kernel.org/r/0b0e55b995cda11e7829f140b833ef932fcabe3a.1438291540.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-08-05efi: fix 32bit kernel boot failed problem using efiFupan Li1-1/+1
3.12 commit 065487a10a22a960bc4e41facb011d10692ef470 ("x86/efi: Correct EFI boot stub use of code32_start"), upstream commit 7e8213c1f3acc064aef37813a39f13cbfe7c3ce7 imported a bug, which will cause 32bit kernel boot to fail using EFI method. It should use the label's address instead of the value stored in the label to calculate the address of code32_start. Signed-off-by: Fupan Li <fupan.li@windriver.com> Reviewed-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-07-30kvm: x86: fix kvm_apic_has_events to check for NULL pointerPaolo Bonzini1-1/+1
commit ce40cd3fc7fa40a6119e5fe6c0f2bc0eb4541009 upstream. Malicious (or egregiously buggy) userspace can trigger it, but it should never happen in normal operation. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-07-30x86/PCI: Use host bridge _CRS info on systems with >32 bit addressingBjorn Helgaas1-2/+4
commit 3d9fecf6bfb8b12bc2f9a4c7109895a2a2bb9436 upstream. We enable _CRS on all systems from 2008 and later. On older systems, we ignore _CRS and assume the whole physical address space (excluding RAM and other devices) is available for PCI devices, but on systems that support physical address spaces larger than 4GB, it's doubtful that the area above 4GB is really available for PCI. After d56dbf5bab8c ("PCI: Allocate 64-bit BARs above 4G when possible"), we try to use that space above 4GB *first*, so we're more likely to put a device there. On Juan's Toshiba Satellite Pro U200, BIOS left the graphics, sound, 1394, and card reader devices unassigned (but only after Windows had been booted). Only the sound device had a 64-bit BAR, so it was the only device placed above 4GB, and hence the only device that didn't work. Keep _CRS enabled even on pre-2008 systems if they support physical address space larger than 4GB. Fixes: d56dbf5bab8c ("PCI: Allocate 64-bit BARs above 4G when possible") Reported-and-tested-by: Juan Dayer <jdayer@outlook.com> Reported-and-tested-by: Alan Horsfield <alan@hazelgarth.co.uk> Link: https://bugzilla.kernel.org/show_bug.cgi?id=99221 Link: https://bugzilla.opensuse.org/show_bug.cgi?id=907092 Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Jiri Slaby <jslaby@suse.com>
2015-07-30KVM: x86: make vapics_in_nmi_mode atomicRadim Krčmář3-4/+4
commit 42720138b06301cc8a7ee8a495a6d021c4b6a9bc upstream. Writes were a bit racy, but hard to turn into a bug at the same time. (Particularly because modern Linux doesn't use this feature anymore.) Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> [Actually the next patch makes it much, much easier to trigger the race so I'm including this one for stable@ as well. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-07-30x86/PCI: Use host bridge _CRS info on Foxconn K8M890-8237ABjorn Helgaas1-0/+11
commit 1dace0116d0b05c967d94644fc4dfe96be2ecd3d upstream. The Foxconn K8M890-8237A has two PCI host bridges, and we can't assign resources correctly without the information from _CRS that tells us which address ranges are claimed by which bridge. In the bugs mentioned below, we incorrectly assign a sound card address (this example is from 1033299): bus: 00 index 2 [mem 0x80000000-0xfcffffffff] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-7f]) pci_root PNP0A08:00: host bridge window [mem 0x80000000-0xbfefffff] (ignored) pci_root PNP0A08:00: host bridge window [mem 0xc0000000-0xdfffffff] (ignored) pci_root PNP0A08:00: host bridge window [mem 0xf0000000-0xfebfffff] (ignored) ACPI: PCI Root Bridge [PCI1] (domain 0000 [bus 80-ff]) pci_root PNP0A08:01: host bridge window [mem 0xbff00000-0xbfffffff] (ignored) pci 0000:80:01.0: [1106:3288] type 0 class 0x000403 pci 0000:80:01.0: reg 10: [mem 0xbfffc000-0xbfffffff 64bit] pci 0000:80:01.0: address space collision: [mem 0xbfffc000-0xbfffffff 64bit] conflicts with PCI Bus #00 [mem 0x80000000-0xfcffffffff] pci 0000:80:01.0: BAR 0: assigned [mem 0xfd00000000-0xfd00003fff 64bit] BUG: unable to handle kernel paging request at ffffc90000378000 IP: [<ffffffffa0345f63>] azx_create+0x37c/0x822 [snd_hda_intel] We assigned 0xfd_0000_0000, but that is not in any of the host bridge windows, and the sound card doesn't work. Turn on pci=use_crs automatically for this system. Link: https://bugs.launchpad.net/ubuntu/+source/alsa-driver/+bug/931368 Link: https://bugs.launchpad.net/ubuntu/+source/alsa-driver/+bug/1033299 Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-07-30KVM: nSVM: Check for NRIPS support before updating control fieldBandan Das1-2/+6
commit f104765b4f81fd74d69e0eb161e89096deade2db upstream. If hardware doesn't support DecodeAssist - a feature that provides more information about the intercept in the VMCB, KVM decodes the instruction and then updates the next_rip vmcb control field. However, NRIP support itself depends on cpuid Fn8000_000A_EDX[NRIPS]. Since skip_emulated_instruction() doesn't verify nrip support before accepting control.next_rip as valid, avoid writing this field if support isn't present. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-07-30kprobes/x86: Return correct length in __copy_instruction()Eugene Shatokhin1-2/+5
commit c80e5c0c23ce2282476fdc64c4b5e3d3a40723fd upstream. On x86-64, __copy_instruction() always returns 0 (error) if the instruction uses %rip-relative addressing. This is because kernel_insn_init() is called the second time for 'insn' instance in such cases and sets all its fields to 0. Because of this, trying to place a kprobe on such instruction will fail, register_kprobe() will return -EINVAL. This patch fixes the problem. Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20150317100918.28349.94654.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-06-23x86/asm/irq: Stop relying on magic JMP behavior for early_idt_handlersAndy Lutomirski4-26/+42
commit 425be5679fd292a3c36cb1fe423086708a99f11a upstream. The early_idt_handlers asm code generates an array of entry points spaced nine bytes apart. It's not really clear from that code or from the places that reference it what's going on, and the code only works in the first place because GAS never generates two-byte JMP instructions when jumping to global labels. Clean up the code to generate the correct array stride (member size) explicitly. This should be considerably more robust against screw-ups, as GAS will warn if a .fill directive has a negative count. Using '. =' to advance would have been even more robust (it would generate an actual error if it tried to move backwards), but it would pad with nulls, confusing anyone who tries to disassemble the code. The new scheme should be much clearer to future readers. While we're at it, improve the comments and rename the array and common code. Binutils may start relaxing jumps to non-weak labels. If so, this change will fix our build, and we may need to backport this change. Before, on x86_64: 0000000000000000 <early_idt_handlers>: 0: 6a 00 pushq $0x0 2: 6a 00 pushq $0x0 4: e9 00 00 00 00 jmpq 9 <early_idt_handlers+0x9> 5: R_X86_64_PC32 early_idt_handler-0x4 ... 48: 66 90 xchg %ax,%ax 4a: 6a 08 pushq $0x8 4c: e9 00 00 00 00 jmpq 51 <early_idt_handlers+0x51> 4d: R_X86_64_PC32 early_idt_handler-0x4 ... 117: 6a 00 pushq $0x0 119: 6a 1f pushq $0x1f 11b: e9 00 00 00 00 jmpq 120 <early_idt_handler> 11c: R_X86_64_PC32 early_idt_handler-0x4 After: 0000000000000000 <early_idt_handler_array>: 0: 6a 00 pushq $0x0 2: 6a 00 pushq $0x0 4: e9 14 01 00 00 jmpq 11d <early_idt_handler_common> ... 48: 6a 08 pushq $0x8 4a: e9 d1 00 00 00 jmpq 120 <early_idt_handler_common> 4f: cc int3 50: cc int3 ... 117: 6a 00 pushq $0x0 119: 6a 1f pushq $0x1f 11b: eb 03 jmp 120 <early_idt_handler_common> 11d: cc int3 11e: cc int3 11f: cc int3 Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Binutils <binutils@sourceware.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H.J. Lu <hjl.tools@gmail.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/ac027962af343b0c599cbfcf50b945ad2ef3d7a8.1432336324.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-06-10x86: bpf_jit: fix compilation of large bpf programsAlexei Starovoitov1-1/+6
[ Upstream commit 3f7352bf21f8fd7ba3e2fcef9488756f188e12be ] x86 has variable length encoding. x86 JIT compiler is trying to pick the shortest encoding for given bpf instruction. While doing so the jump targets are changing, so JIT is doing multiple passes over the program. Typical program needs 3 passes. Some very short programs converge with 2 passes. Large programs may need 4 or 5. But specially crafted bpf programs may hit the pass limit and if the program converges on the last iteration the JIT compiler will be producing an image full of 'int 3' insns. Fix this corner case by doing final iteration over bpf program. Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Tested-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-06-03KVM: MMU: fix CR4.SMEP=1, CR0.WP=0 with shadow pagesPaolo Bonzini1-1/+1
commit 898761158be7682082955e3efa4ad24725305fc7 upstream. smep_andnot_wp is initialized in kvm_init_shadow_mmu and shadow pages should not be reused for different values of it. Thus, it has to be added to the mask in kvm_mmu_pte_write. Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-06-02perf/x86/amd/ibs: Update IBS MSRs and feature definitionsAravind Gopalakrishnan3-0/+19
commit 904cb3677f3adcd3d837be0a0d0b14251ba8d6f7 upstream. New Fam15h models carry extra feature bits and extend the MSR register space for IBS ops. Adding them here. While at it, add functionality to read IbsBrTarget and OpData4 depending on their availability if user wants a PERF_SAMPLE_RAW. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Acked-by: Borislav Petkov <bp@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Len Brown <len.brown@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <paulus@samba.org> Cc: <acme@kernel.org> Link: http://lkml.kernel.org/r/1415651066-13523-1-git-send-email-Aravind.Gopalakrishnan@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-06-02amd64_edac: Add support for newer F16h modelsAravind Gopalakrishnan1-0/+2
commit 85a8885bd0e00569108aa7b5e26b89c752e3cd51 upstream. Extend ECC decoding support for F16h M30h. Tested on F16h M30h with ECC turned on using mce_amd_inj module and the patch works fine. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Link: http://lkml.kernel.org/r/1392913726-16961-1-git-send-email-Aravind.Gopalakrishnan@amd.com Tested-by: Arindam Nath <Arindam.Nath@amd.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-05-26config: Enable NEED_DMA_MAP_STATE by default when SWIOTLB is selectedKonrad Rzeszutek Wilk1-1/+1
commit a6dfa128ce5c414ab46b1d690f7a1b8decb8526d upstream. A huge amount of NIC drivers use the DMA API, however if compiled under 32-bit an very important part of the DMA API can be ommitted leading to the drivers not working at all (especially if used with 'swiotlb=force iommu=soft'). As Prashant Sreedharan explains it: "the driver [tg3] uses DEFINE_DMA_UNMAP_ADDR(), dma_unmap_addr_set() to keep a copy of the dma "mapping" and dma_unmap_addr() to get the "mapping" value. On most of the platforms this is a no-op, but ... with "iommu=soft and swiotlb=force" this house keeping is required, ... otherwise we pass 0 while calling pci_unmap_/pci_dma_sync_ instead of the DMA address." As such enable this even when using 32-bit kernels. Reported-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Prashant Sreedharan <prashant@broadcom.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Chan <mchan@broadcom.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boris.ostrovsky@oracle.com Cc: cascardo@linux.vnet.ibm.com Cc: david.vrabel@citrix.com Cc: sanjeevb@broadcom.com Cc: siva.kallam@broadcom.com Cc: vyasevich@gmail.com Cc: xen-devel@lists.xensource.com Link: http://lkml.kernel.org/r/20150417190448.GA9462@l.oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-05-04nosave: consolidate __nosave_{begin,end} in <asm/sections.h>Geert Uytterhoeven2-6/+2
commit 7f8998c7aef3ac9c5f3f2943e083dfa6302e90d0 upstream. The different architectures used their own (and different) declarations: extern __visible const void __nosave_begin, __nosave_end; extern const void __nosave_begin, __nosave_end; extern long __nosave_begin, __nosave_end; Consolidate them using the first variant in <asm/sections.h>. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz> [js -- port to 3.12: arm does not have hibernation yet]
2015-04-30sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power ↵Len Brown1-0/+49
savings and to improve performance commit b253149b843f89cd300cbdbea27ce1f847506f99 upstream. In Linux-3.9 we removed the mwait_idle() loop: 69fb3676df33 ("x86 idle: remove mwait_idle() and "idle=mwait" cmdline param") The reasoning was that modern machines should be sufficiently happy during the boot process using the default_idle() HALT loop, until cpuidle loads and either acpi_idle or intel_idle invoke the newer MWAIT-with-hints idle loop. But two machines reported problems: 1. Certain Core2-era machines support MWAIT-C1 and HALT only. MWAIT-C1 is preferred for optimal power and performance. But if they support just C1, cpuidle never loads and so they use the boot-time default idle loop forever. 2. Some laptops will boot-hang if HALT is used, but will boot successfully if MWAIT is used. This appears to be a hidden assumption in BIOS SMI, that is presumably valid on the proprietary OS where the BIOS was validated. https://bugzilla.kernel.org/show_bug.cgi?id=60770 So here we effectively revert the patch above, restoring the mwait_idle() loop. However, we don't bother restoring the idle=mwait cmdline parameter, since it appears to add no value. Maintainer notes: For 3.9, simply revert 69fb3676df for 3.10, patch -F3 applies, fuzz needed due to __cpuinit use in context For 3.11, 3.12, 3.13, this patch applies cleanly Tested-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Len Brown <len.brown@intel.com> Acked-by: Mike Galbraith <bitbucket@online.de> Cc: <stable@vger.kernel.org> # 3.9+ Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ian Malone <ibmalone@gmail.com> Cc: Josh Boyer <jwboyer@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/345254a551eb5a6a866e048d7ab570fd2193aca4.1389763084.git.len.brown@intel.com [ Ported to recent kernels. ] [ Mike: 3.10 backport ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-27KVM: x86: SYSENTER emulation is brokenNadav Amit1-19/+8
commit f3747379accba8e95d70cec0eae0582c8c182050 upstream. SYSENTER emulation is broken in several ways: 1. It misses the case of 16-bit code segments completely (CVE-2015-0239). 2. MSR_IA32_SYSENTER_CS is checked in 64-bit mode incorrectly (bits 0 and 1 can still be set without causing #GP). 3. MSR_IA32_SYSENTER_EIP and MSR_IA32_SYSENTER_ESP are not masked in legacy-mode. 4. There is some unneeded code. Fix it. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [zhangzhiqiang: backport to 3.10: - adjust context - in 3.10 context "ctxt->eflags &= ~(EFLG_VM | EFLG_IF | EFLG_RF)" is replaced by "ctxt->eflags &= ~(EFLG_VM | EFLG_IF)" in upstream, which was changed by another commit. - After the above adjustments, becomes same to the original patch: https://github.com/torvalds/linux/commit/f3747379accba8e95d70cec0eae0582c8c182050 ] Signed-off-by: Zhiqiang Zhang <zhangzhiqiang.zhang@huawei.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-22x86/reboot: Add ASRock Q1900DC-ITX mainboard reboot quirkStefan Lippers-Hollmann1-0/+10
commit 80313b3078fcd2ca51970880d90757f05879a193 upstream. The ASRock Q1900DC-ITX mainboard (Baytrail-D) hangs randomly in both BIOS and UEFI mode while rebooting unless reboot=pci is used. Add a quirk to reboot via the pci method. The problem is very intermittent and hard to debug, it might succeed rebooting just fine 40 times in a row - but fails half a dozen times the next day. It seems to be slightly less common in BIOS CSM mode than native UEFI (with the CSM disabled), but it does happen in either mode. Since I've started testing this patch in late january, rebooting has been 100% reliable. Most of the time it already hangs during POST, but occasionally it might even make it through the bootloader and the kernel might even start booting, but then hangs before the mode switch. The same symptoms occur with grub-efi, gummiboot and grub-pc, just as well as (at least) kernel 3.16-3.19 and 4.0-rc6 (I haven't tried older kernels than 3.16). Upgrading to the most current mainboard firmware of the ASRock Q1900DC-ITX, version 1.20, does not improve the situation. ( Searching the web seems to suggest that other Bay Trail-D mainboards might be affected as well. ) -- Signed-off-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/20150330224427.0fb58e42@mir Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-21x86/reboot: Add reboot quirk for Certec BPC600Christian Gmeiner1-0/+10
commit aadca6fa4068ad1f92c492bc8507b7ed350825a2 upstream. Certec BPC600 needs reboot=pci to actually reboot. Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Li Aubrey <aubrey.li@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Jones <davej@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1399446114-2147-1-git-send-email-christian.gmeiner@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-21x86/reboot: Sort reboot DMI quirks by vendorJiri Slaby1-129/+141
commit e56e57f6613d5ed5c3127419341d1aa989a11971 upstream. Grouping them by vendor should make it easier to spot duplicates. Signed-off-by: Dave Jones <davej@fedoraproject.org> Link: http://lkml.kernel.org/r/20131001203655.GA10719@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-21x86/reboot: Remove the duplicate C6100 entry in the reboot quirks listMasoud Sharbiani1-8/+0
commit b5eafc6f07c95e9f3dd047e72737449cb03c9956 upstream. Two entries for the same system type were added, with two different vendor names: 'Dell' and 'Dell, Inc.'. Since a prefix match is being used by the DMI parsing code, we can eliminate the latter as redundant. Reported-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Masoud Sharbiani <msharbiani@twitter.com> Cc: holt@sgi.com Link: http://lkml.kernel.org/r/1380216643-4683-1-git-send-email-masoud.sharbiani@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-09x86/irq: Check for valid irq descriptor in check_irq_vectors_for_cpu_disable()Joerg Roedel1-0/+3
commit d97eb8966c91f2c9d05f0a22eb89ed5b76d966d1 upstream. When an interrupt is migrated away from a cpu it will stay in its vector_irq array until smp_irq_move_cleanup_interrupt succeeded. The cfg->move_in_progress flag is cleared already when the IPI was sent. When the interrupt is destroyed after migration its 'struct irq_desc' is freed and the vector_irq arrays are cleaned up. But since cfg->move_in_progress is already 0 the references at cpus before the last migration will not be cleared. So this would leave a reference to an already destroyed irq alive. When the cpu is taken down at this point, the check_irq_vectors_for_cpu_disable() function finds a valid irq number in the vector_irq array, but gets NULL for its descriptor and dereferences it, causing a kernel panic. This has been observed on real systems at shutdown. Add a check to check_irq_vectors_for_cpu_disable() for a valid 'struct irq_desc' to prevent this issue. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Yinghai Lu <yinghai@kernel.org> Cc: alnovak@suse.com Cc: joro@8bytes.org Link: http://lkml.kernel.org/r/20150204132754.GA10078@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-09x86/microcode/intel: Guard against stack overflow in the loaderQuentin Casasnovas1-1/+1
commit f84598bd7c851f8b0bf8cd0d7c3be0d73c432ff4 upstream. mc_saved_tmp is a static array allocated on the stack, we need to make sure mc_saved_count stays within its bounds, otherwise we're overflowing the stack in _save_mc(). A specially crafted microcode header could lead to a kernel crash or potentially kernel execution. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Link: http://lkml.kernel.org/r/1422964824-22056-1-git-send-email-quentin.casasnovas@oracle.com Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-09x86/vdso: Fix the build on GCC5Jiri Slaby1-0/+1
commit e893286918d2cde3a94850d8f7101cd1039e0c62 upstream. On gcc5 the kernel does not link: ld: .eh_frame_hdr table[4] FDE at 0000000000000648 overlaps table[5] FDE at 0000000000000670. Because prior GCC versions always emitted NOPs on ALIGN directives, but gcc5 started omitting them. .LSTARTFDEDLSI1 says: /* HACK: The dwarf2 unwind routines will subtract 1 from the return address to get an address in the middle of the presumed call instruction. Since we didn't get here via a call, we need to include the nop before the real start to make up for it. */ .long .LSTART_sigreturn-1-. /* PC-relative start address */ But commit 69d0627a7f6e ("x86 vDSO: reorder vdso32 code") from 2.6.25 replaced .org __kernel_vsyscall+32,0x90 by ALIGN right before __kernel_sigreturn. Of course, ALIGN need not generate any NOP in there. Esp. gcc5 collapses vclock_gettime.o and int80.o together with no generated NOPs as "ALIGN". So fix this by adding to that point at least a single NOP and make the function ALIGN possibly with more NOPs then. Kudos for reporting and diagnosing should go to Richard. Reported-by: Richard Biener <rguenther@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1425543211-12542-1-git-send-email-jslaby@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-09x86/fpu: Drop_fpu() should not assume that tsk equals currentOleg Nesterov1-1/+1
commit f4c3686386393c120710dd34df2a74183ab805fd upstream. drop_fpu() does clear_used_math() and usually this is correct because tsk == current. However switch_fpu_finish()->restore_fpu_checking() is called before __switch_to() updates the "current_task" variable. If it fails, we will wrongly clear the PF_USED_MATH flag of the previous task. So use clear_stopped_child_used_math() instead. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pekka Riikonen <priikone@iki.fi> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Suresh Siddha <sbsiddha@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150309171041.GB11388@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-09x86/fpu: Avoid math_state_restore() without used_math() in ↵Oleg Nesterov1-3/+4
__restore_xstate_sig() commit a7c80ebcac3068b1c3cb27d538d29558c30010c8 upstream. math_state_restore() assumes it is called with irqs disabled, but this is not true if the caller is __restore_xstate_sig(). This means that if ia32_fxstate == T and __copy_from_user() fails, __restore_xstate_sig() returns with irqs disabled too. This triggers: BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:41 dump_stack ___might_sleep ? _raw_spin_unlock_irqrestore __might_sleep down_read ? _raw_spin_unlock_irqrestore print_vma_addr signal_fault sys32_rt_sigreturn Change __restore_xstate_sig() to call set_used_math() unconditionally. This avoids enabling and disabling interrupts in math_state_restore(). If copy_from_user() fails, we can simply do fpu_finit() by hand. [ Note: this is only the first step. math_state_restore() should not check used_math(), it should set this flag. While init_fpu() should simply die. ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pekka Riikonen <priikone@iki.fi> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Rik van Riel <riel@redhat.com> Cc: Suresh Siddha <sbsiddha@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150307153844.GB25954@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-04-09crypto: aesni - fix memory usage in GCM decryptionStephan Mueller1-2/+2
commit ccfe8c3f7e52ae83155cb038753f4c75b774ca8a upstream. The kernel crypto API logic requires the caller to provide the length of (ciphertext || authentication tag) as cryptlen for the AEAD decryption operation. Thus, the cipher implementation must calculate the size of the plaintext output itself and cannot simply use cryptlen. The RFC4106 GCM decryption operation tries to overwrite cryptlen memory in req->dst. As the destination buffer for decryption only needs to hold the plaintext memory but cryptlen references the input buffer holding (ciphertext || authentication tag), the assumption of the destination buffer length in RFC4106 GCM operation leads to a too large size. This patch simply uses the already calculated plaintext size. In addition, this patch fixes the offset calculation of the AAD buffer pointer: as mentioned before, cryptlen already includes the size of the tag. Thus, the tag does not need to be added. With the addition, the AAD will be written beyond the already allocated buffer. Note, this fixes a kernel crash that can be triggered from user space via AF_ALG(aead) -- simply use the libkcapi test application from [1] and update it to use rfc4106-gcm-aes. Using [1], the changes were tested using CAVS vectors to demonstrate that the crypto operation still delivers the right results. [1] http://www.chronox.de/libkcapi.html CC: Tadeusz Struk <tadeusz.struk@intel.com> Signed-off-by: Stephan Mueller <smueller@chronox.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-03-16mm/hugetlb: reduce arch dependent code around follow_huge_*Naoya Horiguchi1-12/+0
commit 61f77eda9bbf0d2e922197ed2dcf88638a639ce5 upstream. Currently we have many duplicates in definitions around follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this patch tries to remove the m. The basic idea is to put the default implementation for these functions in mm/hugetlb.c as weak symbols (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETL B), and to implement arch-specific code only when the arch needs it. For follow_huge_addr(), only powerpc and ia64 have their own implementation, and in all other architectures this function just returns ERR_PTR(-EINVAL). So this patch sets returning ERR_PTR(-EINVAL) as default. As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to always return 0 in your architecture (like in ia64 or sparc,) it's never called (the callsite is optimized away) no matter how implemented it is. So in such architectures, we don't need arch-specific implementation. In some architecture (like mips, s390 and tile,) their current arch-specific follow_huge_(pmd|pud)() are effectively identical with the common code, so this patch lets these architecture use the common code. One exception is metag, where pmd_huge() could return non-zero but it expects follow_huge_pmd() to always return NULL. This means that we need arch-specific implementation which returns NULL. This behavior looks strange to me (because non-zero pmd_huge() implies that the architecture supports PMD-based hugepage, so follow_huge_pmd() can/should return some relevant value,) but that's beyond this cleanup patch, so let's keep it. Justification of non-trivial changes: - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE is true when follow_huge_pmd() can be called (note that pmd_huge() has the same check and always returns 0 for !MACHINE_HAS_HPAGE.) - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common code. This patch forces these archs use PMD_MASK, but it's OK because they are identical in both archs. In s390, both of HPAGE_SHIFT and PMD_SHIFT are 20. In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and PMD_SHIFT is define as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but PTE_ORDER is always 0, so these are identical. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-03-12vm: add VM_FAULT_SIGSEGV handling supportLinus Torvalds1-0/+2
commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream. The core VM already knows about VM_FAULT_SIGBUS, but cannot return a "you should SIGSEGV" error, because the SIGSEGV case was generally handled by the caller - usually the architecture fault handler. That results in lots of duplication - all the architecture fault handlers end up doing very similar "look up vma, check permissions, do retries etc" - but it generally works. However, there are cases where the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV. In particular, when accessing the stack guard page, libsigsegv expects a SIGSEGV. And it usually got one, because the stack growth is handled by that duplicated architecture fault handler. However, when the generic VM layer started propagating the error return from the stack expansion in commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page"), that now exposed the existing VM_FAULT_SIGBUS result to user space. And user space really expected SIGSEGV, not SIGBUS. To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those duplicate architecture fault handlers about it. They all already have the code to handle SIGSEGV, so it's about just tying that new return value to the existing code, but it's all a bit annoying. This is the mindless minimal patch to do this. A more extensive patch would be to try to gather up the mostly shared fault handling logic into one generic helper routine, and long-term we really should do that cleanup. Just from this patch, you can generally see that most architectures just copied (directly or indirectly) the old x86 way of doing things, but in the meantime that original x86 model has been improved to hold the VM semaphore for shorter times etc and to handle VM_FAULT_RETRY and other "newer" things, so it would be a good idea to bring all those improvements to the generic case and teach other architectures about them too. Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-03-12x86: mm: move mmap_sem unlock from mm_fault_error() to callerLinus Torvalds1-7/+1
commit 7fb08eca45270d0ae86e1ad9d39c40b7a55d0190 upstream. This replaces four copies in various stages of mm_fault_error() handling with just a single one. It will also allow for more natural placement of the unlocking after some further cleanup. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-03-12KVM: emulate: fix CMPXCHG8B on 32-bit hostsPaolo Bonzini1-1/+2
commit 4ff6f8e61eb7f96d3ca535c6d240f863ccd6fb7d upstream. This has been broken for a long time: it broke first in 2.6.35, then was almost fixed in 2.6.36 but this one-liner slipped through the cracks. The bug shows up as an infinite loop in Windows 7 (and newer) boot on 32-bit hosts without EPT. Windows uses CMPXCHG8B to write to page tables, which causes a page fault if running without EPT; the emulator is then called from kvm_mmu_page_fault. The loop then happens if the higher 4 bytes are not 0; the common case for this is that the NX bit (bit 63) is 1. Fixes: 6550e1f165f384f3a46b60a1be9aba4bc3c2adad Fixes: 16518d5ada690643453eb0aef3cc7841d3623c2d Reported-by: Erik Rull <erik.rull@rdsoftware.de> Tested-by: Erik Rull <erik.rull@rdsoftware.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-03-12x86/asm/entry/64: Remove a bogus 'ret_from_fork' optimizationAndy Lutomirski1-5/+8
commit 956421fbb74c3a6261903f3836c0740187cf038b upstream. 'ret_from_fork' checks TIF_IA32 to determine whether 'pt_regs' and the related state make sense for 'ret_from_sys_call'. This is entirely the wrong check. TS_COMPAT would make a little more sense, but there's really no point in keeping this optimization at all. This fixes a return to the wrong user CS if we came from int 0x80 in a 64-bit task. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/4710be56d76ef994ddf59087aad98c000fbab9a4.1424989793.git.luto@amacapital.net [ Backported from tip:x86/asm. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-03-05x86, mm/ASLR: Fix stack randomization on 64-bit systemsHector Marco-Gisbert1-3/+3
commit 4e7c22d447bb6d7e37bfe39ff658486ae78e8d77 upstream. The issue is that the stack for processes is not properly randomized on 64 bit architectures due to an integer overflow. The affected function is randomize_stack_top() in file "fs/binfmt_elf.c": static unsigned long randomize_stack_top(unsigned long stack_top) { unsigned int random_variable = 0; if ((current->flags & PF_RANDOMIZE) && !(current->personality & ADDR_NO_RANDOMIZE)) { random_variable = get_random_int() & STACK_RND_MASK; random_variable <<= PAGE_SHIFT; } return PAGE_ALIGN(stack_top) + random_variable; return PAGE_ALIGN(stack_top) - random_variable; } Note that, it declares the "random_variable" variable as "unsigned int". Since the result of the shifting operation between STACK_RND_MASK (which is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64): random_variable <<= PAGE_SHIFT; then the two leftmost bits are dropped when storing the result in the "random_variable". This variable shall be at least 34 bits long to hold the (22+12) result. These two dropped bits have an impact on the entropy of process stack. Concretely, the total stack entropy is reduced by four: from 2^28 to 2^30 (One fourth of expected entropy). This patch restores back the entropy by correcting the types involved in the operations in the functions randomize_stack_top() and stack_maxrandom_size(). The successful fix can be tested with: $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done 7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0 [stack] 7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0 [stack] 7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0 [stack] 7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0 [stack] ... Once corrected, the leading bytes should be between 7ffc and 7fff, rather than always being 7fff. Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es> Signed-off-by: Ismael Ripoll <iripoll@upv.es> [ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ] Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Fixes: CVE-2015-1593 Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-03-05KVM: x86: update masterclock values on TSC writesMarcelo Tosatti1-9/+10
commit 7f187922ddf6b67f2999a76dcb71663097b75497 upstream. When the guest writes to the TSC, the masterclock TSC copy must be updated as well along with the TSC_OFFSET update, otherwise a negative tsc_timestamp is calculated at kvm_guest_time_update. Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to "if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)" becomes redundant, so remove the do_request boolean and collapse everything into a single condition. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-03-02mm/hugetlb: pmd_huge() returns true for non-present hugepageNaoya Horiguchi2-2/+8
commit cbef8478bee55775ac312a574aad48af7bb9cf9f upstream. Migrating hugepages and hwpoisoned hugepages are considered as non-present hugepages, and they are referenced via migration entries and hwpoison entries in their page table slots. This behavior causes race condition because pmd_huge() doesn't tell non-huge pages from migrating/hwpoisoned hugepages. follow_page_mask() is one example where the kernel would call follow_page_pte() for such hugepage while this function is supposed to handle only normal pages. To avoid this, this patch makes pmd_huge() return true when pmd_none() is true *and* pmd_present() is false. We don't have to worry about mixing up non-present pmd entry with normal pmd (pointing to leaf level pte entry) because pmd_present() is true in normal pmd. The same race condition could happen in (x86-specific) gup_pmd_range(), where this patch simply adds pmd_present() check instead of pmd_huge(). This is because gup_pmd_range() is fast path. If we have non-present hugepage in this function, we will go into gup_huge_pmd(), then return 0 at flag mask check, and finally fall back to the slow path. Fixes: 290408d4a2 ("hugetlb: hugepage migration core") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-02-16x86: UV BAU: Avoid NULL pointer reference in ptc_seq_showJames Custer1-0/+4
commit fa2a79ce6aef5de35a4d50487da35deb6b634944 upstream. In init_per_cpu(), when get_cpu_topology() fails, init_per_cpu_tunables() is not called afterwards. This means that bau_control->statp is NULL. If a user then reads /proc/sgi_uv/ptc_statistics ptc_seq_show() references a NULL pointer. Therefore, since uv_bau_init calls set_bau_off when init_per_cpu() fails, we add code that detects when the bau is off in ptc_seq_show() to avoid referencing a NULL pointer. Signed-off-by: James Custer <jcuster@sgi.com> Cc: Russ Anderson <rja@sgi.com> Link: http://lkml.kernel.org/r/1414952199-185319-2-git-send-email-jcuster@sgi.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-02-16x86/early quirk: use gen6 stolen detection for VLVJesse Barnes1-2/+2
commit 7bd40c16ccb2cb6877dd00b0e66249c171e6fa43 upstream. We've always been able to use either method on VLV, but it appears more recent BIOSes only support the gen6 method, so switch over to that. References: https://bugs.freedesktop.org/show_bug.cgi?id=71370 Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-02-16ACPI idle: permit sparse C-state sub-state numbersLen Brown1-1/+3
commit 2194324d8bbbad1b179c08b6095649b06abd62d5 upstream. Linux uses CPUID.MWAIT.EDX to validate the C-states reported by ACPI, silently discarding states which are not supported by the HW. This test is too restrictive, as some HW now uses sparse sub-state numbering, so the sub-state number may be higher than the number of sub-states... Also, rather than silently ignoring an invalid state, we should complain about a firmware bug. In practice... Bay Trail systems originally supported C6-no-shrink as MWAIT sub-state 0x58, and in CPUID.MWAIT.EDX 0x03000000 indicated that there were 3 MWAIT-C6 sub-states. So acpi_idle would discard that C-state because 8 >= 3. Upon discovering this issue, the ucode was updated so that C6-no-shrink was also exported as 0x51, and the BIOS was updated to match. However, systems shipped with 0x58, will never get a BIOS update, and this patch allows Linux to see C6-no-shrink on early Bay Trail. Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-02-16KVM: x86: Sysexit emulation does not mask RIP/RSPNadav Amit1-0/+2
commit bf0b682c9b6a6d6d54adf439bfe953feef7be2e8 upstream. If the operand size is not 64-bit, then the sysexit instruction should assign ECX to RSP and EDX to RIP. The current code assigns the full 64-bits. Fix it by masking. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-02-16KVM: vmx: Inject #GP on invalid PAT CRNadav Amit3-2/+7
commit 4566654bb9be9e8864df417bb72ceee5136b6a6a upstream. Guest which sets the PAT CR to invalid value should get a #GP. Currently, if vmx supports loading PAT CR during entry, then the value is not checked. This patch makes the required check in that case. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Bruce Rogers <brogers@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-02-16KVM: x86: emulating descriptor load misses long-mode caseNadav Amit1-0/+9
commit 040c8dc8a5afa7364bb8bb5b1b76c30007d6be14 upstream. In 64-bit mode a #GP should be delivered to the guest "if the code segment descriptor pointed to by the selector in the 64-bit gate doesn't have the L-bit set and the D-bit clear." - Intel SDM "Interrupt 13—General Protection Exception (#GP)". This patch fixes the behavior of CS loading emulation code. Although the comment says that segment loading is not supported in long mode, this function is executed in long mode, so the fix is necassary. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-02-16KVM: x86: Distinguish between stack operation and near branchesNadav Amit1-9/+14
commit 58b7075d059f7d37ca86c76fb1446fa3447b9f4f upstream. In 64-bit, stack operations default to 64-bits, but can be overriden (to 16-bit) using opsize override prefix. In contrast, near-branches are always 64-bit. This patch distinguish between the different behaviors. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Bruce Rogers <brogers@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-02-16KVM: x86: Getting rid of grp45 in emulatorNadav Amit1-27/+18
commit f7784046ab7cfc1645f4110b6ed14fbdffc2abee upstream. Breaking grp45 to the relevant functions to speed up the emulation and simplify the code. In addition, it is necassary the next patch will distinguish between far and near branches according to the flags. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2015-02-16KVM: x86: Handle errors when RIP is set during far jumpsNadav Amit1-36/+77
commit d1442d85cc30ea75f7d399474ca738e0bc96f715 upstream. Far jmp/call/ret may fault while loading a new RIP. Currently KVM does not handle this case, and may result in failed vm-entry once the assignment is done. The tricky part of doing so is that loading the new CS affects the VMCS/VMCB state, so if we fail during loading the new RIP, we are left in unconsistent state. Therefore, this patch saves on 64-bit the old CS descriptor and restores it if loading RIP failed. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Bruce Rogers <brogers@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>