summaryrefslogtreecommitdiff
path: root/arch/x86/include
AgeCommit message (Collapse)AuthorFilesLines
2016-04-27x86/iopl/64: Properly context-switch IOPL on Xen PVAndy Lutomirski1-0/+2
commit b7a584598aea7ca73140cb87b40319944dd3393f upstream. On Xen PV, regs->flags doesn't reliably reflect IOPL and the exit-to-userspace code doesn't change IOPL. We need to context switch it manually. I'm doing this without going through paravirt because this is specific to Xen PV. After the dust settles, we can merge this with the 32-bit code, tidy up the iopl syscall implementation, and remove the set_iopl pvop entirely. Fixes XSA-171. Reviewewd-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/693c3bd7aeb4d3c27c92c622b7d0f554a458173c.1458162709.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> [ kamal: backport to 3.19-stable: no X86_FEATURE_XENPV so just call xen_pv_domain() directly ] Acked-by: Andy Lutomirski <luto <at> kernel.org> Signed-off-by: Kamal Mostafa <kamal <at> canonical.com> Signed-off-by: Zefan Li <lizefan@huawei.com>
2016-04-27KVM: x86: trap AMD MSRs for the TSeg base and maskPaolo Bonzini1-0/+1
commit 3afb1121800128aae9f5722e50097fcf1a9d4d88 upstream. These have roughly the same purpose as the SMRR, which we do not need to implement in KVM. However, Linux accesses MSR_K8_TSEG_ADDR at boot, which causes problems when running a Xen dom0 under KVM. Just return 0, meaning that processor protection of SMRAM is not in effect. Reported-by: M A Young <m.a.young@durham.ac.uk> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Zefan Li <lizefan@huawei.com>
2016-03-21KVM: x86: work around infinite loop in microcode when #AC is deliveredEric Northup1-0/+1
commit 54a20552e1eae07aa240fa370a0293e006b5faed upstream. It was found that a guest can DoS a host by triggering an infinite stream of "alignment check" (#AC) exceptions. This causes the microcode to enter an infinite loop where the core never receives another interrupt. The host kernel panics pretty quickly due to the effects (CVE-2015-5307). Signed-off-by: Eric Northup <digitaleric@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [lizf: Backported to 3.4: - adjust filename - adjust context - add definition of AC_VECTOR] Signed-off-by: Zefan Li <lizefan@huawei.com>
2016-03-21x86/ldt: Make modify_ldt synchronousAndy Lutomirski3-19/+48
commit 37868fe113ff2ba814b3b4eb12df214df555f8dc upstream. modify_ldt() has questionable locking and does not synchronize threads. Improve it: redesign the locking and synchronize all threads' LDTs using an IPI on all modifications. This will dramatically slow down modify_ldt in multithreaded programs, but there shouldn't be any multithreaded programs that care about modify_ldt's performance in the first place. This fixes some fallout from the CVE-2015-5157 fixes. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: security@kernel.org <security@kernel.org> Cc: xen-devel <xen-devel@lists.xen.org> Link: http://lkml.kernel.org/r/4c6978476782160600471bd865b318db34c7b628.1438291540.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: - Adjust context - Drop comment changes in switch_mm() - Drop changes to get_segment_base() in arch/x86/kernel/cpu/perf_event.c - Open-code lockless_dereference(), smp_store_release(), on_each_cpu_mask()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-10-22KVM: x86: make vapics_in_nmi_mode atomicRadim Krčmář1-1/+1
commit 42720138b06301cc8a7ee8a495a6d021c4b6a9bc upstream. Writes were a bit racy, but hard to turn into a bug at the same time. (Particularly because modern Linux doesn't use this feature anymore.) Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> [Actually the next patch makes it much, much easier to trigger the race so I'm including this one for stable@ as well. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-09-18x86, kvm: fix kvm's usage of kernel_fpu_begin/end()Suresh Siddha1-2/+26
commit b1a74bf8212367be2b1d6685c11a84e056eaaaf1 upstream. Preemption is disabled between kernel_fpu_begin/end() and as such it is not a good idea to use these routines in kvm_load/put_guest_fpu() which can be very far apart. kvm_load/put_guest_fpu() routines are already called with preemption disabled and KVM already uses the preempt notifier to save the guest fpu state using kvm_put_guest_fpu(). So introduce __kernel_fpu_begin/end() routines which don't touch preemption and use them instead of kernel_fpu_begin/end() for KVM's use model of saving/restoring guest FPU state. Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit state in the case of VMX. For eagerFPU case, host cr0.TS is always clear. So no need to worry about it. For the traditional lazyFPU restore case, change the cr0.TS bit for the host state during vm-exit to be always clear and cr0.TS bit is set in the __vmx_load_host_state() when the FPU (guest FPU or the host task's FPU) state is not active. This ensures that the host/guest FPU state is properly saved, restored during context-switch and with interrupts (using irq_fpu_usable()) not stomping on the active FPU state. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1348164109.26695.338.camel@sbsiddha-desk.sc.intel.com Cc: Avi Kivity <avi@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Zefan Li <lizefan@huawei.com> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-09-18x86/iommu: Fix header comments regarding standard and _FINISH macrosAravind Gopalakrishnan1-5/+6
commit b44915927ca88084a7292e4ddd4cf91036f365e1 upstream. The comment line regarding IOMMU_INIT and IOMMU_INIT_FINISH macros is incorrect: "The standard vs the _FINISH differs in that the _FINISH variant will continue detecting other IOMMUs in the call list..." It should be "..the *standard* variant will continue detecting..." Fix that. Also, make it readable while at it. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: konrad.wilk@oracle.com Fixes: 6e9636693373 ("x86, iommu: Update header comments with appropriate naming") Link: http://lkml.kernel.org/r/1428508017-5316-1-git-send-email-Aravind.Gopalakrishnan@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-04-14x86, cpu, amd: Add workaround for family 16h, erratum 793Borislav Petkov1-0/+1
commit 3b56496865f9f7d9bcb2f93b44c63f274f08e3b6 upstream. This adds the workaround for erratum 793 as a precaution in case not every BIOS implements it. This addresses CVE-2013-6885. Erratum text: [Revision Guide for AMD Family 16h Models 00h-0Fh Processors, document 51810 Rev. 3.04 November 2013] 793 Specific Combination of Writes to Write Combined Memory Types and Locked Instructions May Cause Core Hang Description Under a highly specific and detailed set of internal timing conditions, a locked instruction may trigger a timing sequence whereby the write to a write combined memory type is not flushed, causing the locked instruction to stall indefinitely. Potential Effect on System Processor core hang. Suggested Workaround BIOS should set MSR C001_1020[15] = 1b. Fix Planned No fix planned [ hpa: updated description, fixed typo in MSR name ] Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20140114230711.GS29865@pd.tnic Tested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> [bwh: Backported to 3.2: - Adjust filename - Venkatesh Srinivas pointed out we should use {rd,wr}msrl_safe() to avoid crashing on KVM. This was fixed upstream by commit 8f86a7373a1c ("x86, AMD: Convert to the new bit access MSR accessors") but that's too much trouble to backport. Here we must use {rd,wr}msrl_amd_safe().] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Moritz Muehlenhoff <jmm@debian.org> Cc: Venkatesh Srinivas <venkateshs@google.com> Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-04-14x86/tls: Don't validate lm in set_thread_area() after allAndy Lutomirski1-0/+7
commit 3fb2f4237bb452eb4e98f6a5dbd5a445b4fed9d0 upstream. It turns out that there's a lurking ABI issue. GCC, when compiling this in a 32-bit program: struct user_desc desc = { .entry_number = idx, .base_addr = base, .limit = 0xfffff, .seg_32bit = 1, .contents = 0, /* Data, grow-up */ .read_exec_only = 0, .limit_in_pages = 1, .seg_not_present = 0, .useable = 0, }; will leave .lm uninitialized. This means that anything in the kernel that reads user_desc.lm for 32-bit tasks is unreliable. Revert the .lm check in set_thread_area(). The value never did anything in the first place. Fixes: 0e58af4e1d21 ("x86/tls: Disallow unusual TLS segments") Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/d7875b60e28c512f6a6fc0baf5714d58e7eaadbb.1418856405.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org> [lizf: Backported to 3.4: adjust filename] Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-02-02x86, tls: Interpret an all-zero struct user_desc as "no segment"Andy Lutomirski1-0/+13
commit 3669ef9fa7d35f573ec9c0e0341b29251c2734a7 upstream. The Witcher 2 did something like this to allocate a TLS segment index: struct user_desc u_info; bzero(&u_info, sizeof(u_info)); u_info.entry_number = (uint32_t)-1; syscall(SYS_set_thread_area, &u_info); Strictly speaking, this code was never correct. It should have set read_exec_only and seg_not_present to 1 to indicate that it wanted to find a free slot without putting anything there, or it should have put something sensible in the TLS slot if it wanted to allocate a TLS entry for real. The actual effect of this code was to allocate a bogus segment that could be used to exploit espfix. The set_thread_area hardening patches changed the behavior, causing set_thread_area to return -EINVAL and crashing the game. This changes set_thread_area to interpret this as a request to find a free slot and to leave it empty, which isn't *quite* what the game expects but should be close enough to keep it working. In particular, using the code above to allocate two segments will allocate the same segment both times. According to FrostbittenKing on Github, this fixes The Witcher 2. If this somehow still causes problems, we could instead allocate a limit==0 32-bit data segment, but that seems rather ugly to me. Fixes: 41bdc78544b8 x86/tls: Validate TLS entries to protect espfix Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/0cb251abe1ff0958b8e468a9a9a905b80ae3a746.1421954363.git.luto@amacapital.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-02-02x86, tls, ldt: Stop checking lm in LDT_emptyAndy Lutomirski1-7/+2
commit e30ab185c490e9a9381385529e0fd32f0a399495 upstream. 32-bit programs don't have an lm bit in their ABI, so they can't reliably cause LDT_empty to return true without resorting to memset. They shouldn't need to do this. This should fix a longstanding, if minor, issue in all 64-bit kernels as well as a potential regression in the TLS hardening code. Fixes: 41bdc78544b8 x86/tls: Validate TLS entries to protect espfix Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/72a059de55e86ad5e2935c80aa91880ddf19d07c.1421954363.git.luto@amacapital.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-02-02x86_64, traps: Stop using IST for #SSAndy Lutomirski2-7/+5
commit 6f442be2fb22be02cafa606f1769fa1e6f894441 upstream. On a 32-bit kernel, this has no effect, since there are no IST stacks. On a 64-bit kernel, #SS can only happen in user code, on a failed iret to user space, a canonical violation on access via RSP or RBP, or a genuine stack segment violation in 32-bit kernel code. The first two cases don't need IST, and the latter two cases are unlikely fatal bugs, and promoting them to double faults would be fine. This fixes a bug in which the espfix64 code mishandles a stack segment violation. This saves 4k of memory per CPU and a tiny bit of code. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: - No need to define trace_stack_segment - Use the errorentry macro to generate #SS asm code - Adjust context - Checked that this matches Luis's backport for Ubuntu] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-02-02kvm: vmx: handle invvpid vm exit gracefullyPetr Matousek1-0/+1
commit a642fc305053cc1c6e47e4f4df327895747ab485 upstream. On systems with invvpid instruction support (corresponding bit in IA32_VMX_EPT_VPID_CAP MSR is set) guest invocation of invvpid causes vm exit, which is currently not handled and results in propagation of unknown exit to userspace. Fix this by installing an invvpid vm exit handler. This is CVE-2014-3646. Signed-off-by: Petr Matousek <pmatouse@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [lizf: Backported to 3.4: - adjust filename - drop the change to VMX_EXIT_REASON strings] Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-02-02nEPT: Nested INVEPTNadav Har'El1-0/+1
commit bfd0a56b90005f8c8a004baf407ad90045c2b11e upstream. If we let L1 use EPT, we should probably also support the INVEPT instruction. In our current nested EPT implementation, when L1 changes its EPT table for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in the course of this modification already calls INVEPT. But if last level of shadow page is unsync not all L1's changes to EPT12 are intercepted, which means roots need to be synced when L1 calls INVEPT. Global INVEPT should not be different since roots are synced by kvm_mmu_load() each time EPTP02 changes. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 3.2: - Adjust context, filename - Simplify handle_invept() as recommended by Paolo - nEPT is not supported so we always raise #UD] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-02-02KVM: x86: Prevent host from panicking on shared MSR writes.Andy Honig1-1/+1
commit 8b3c3104c3f4f706e99365c3e0d2aa61b95f969f upstream. The previous patch blocked invalid writes directly when the MSR is written. As a precaution, prevent future similar mistakes by gracefulling handle GPs caused by writes to shared MSRs. Signed-off-by: Andrew Honig <ahonig@google.com> [Remove parts obsoleted by Nadav's patch. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [lizf: Backported to 3.4: - adjust context - s/wrmsrl_safe/checking_wrmsrl/] Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-02-02KVM: x86: Check non-canonical addresses upon WRMSRNadav Amit1-0/+14
commit 854e8bb1aa06c578c2c9145fa6bfe3680ef63b23 upstream. Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is written to certain MSRs. The behavior is "almost" identical for AMD and Intel (ignoring MSRs that are not implemented in either architecture since they would anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if non-canonical address is written on Intel but not on AMD (which ignores the top 32-bits). Accordingly, this patch injects a #GP on the MSRs which behave identically on Intel and AMD. To eliminate the differences between the architecutres, the value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to canonical value before writing instead of injecting a #GP. Some references from Intel and AMD manuals: According to Intel SDM description of WRMSR instruction #GP is expected on WRMSR "If the source register contains a non-canonical address and ECX specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE, IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP." According to AMD manual instruction manual: LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the LSTAR and CSTAR registers. If an RIP written by WRMSR is not in canonical form, a general-protection exception (#GP) occurs." IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the base field must be in canonical form or a #GP fault will occur." IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must be in canonical form." This patch fixes CVE-2014-3610. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [lizf: Backported to 3.4: - adjust context - s/msr->index/msr_index and s/msr->data/data] Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-02-02x86: Reject x32 executables if x32 ABI not supportedBen Hutchings1-2/+3
commit 0e6d3112a4e95d55cf6dca88f298d5f4b8f29bd1 upstream. It is currently possible to execve() an x32 executable on an x86_64 kernel that has only ia32 compat enabled. However all its syscalls will fail, even _exit(). This usually causes it to segfault. Change the ELF compat architecture check so that x32 executables are rejected if we don't support the x32 ABI. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Link: http://lkml.kernel.org/r/1410120305.6822.9.camel@decadent.org.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-02-02kvm: x86: fix stale mmio cache bugDavid Matlack1-0/+1
commit 56f17dd3fbc44adcdbc3340fe3988ddb833a47a7 upstream. The following events can lead to an incorrect KVM_EXIT_MMIO bubbling up to userspace: (1) Guest accesses gpa X without a memory slot. The gfn is cached in struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets the SPTE write-execute-noread so that future accesses cause EPT_MISCONFIGs. (2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION covering the page just accessed. (3) Guest attempts to read or write to gpa X again. On Intel, this generates an EPT_MISCONFIG. The memory slot generation number that was incremented in (2) would normally take care of this but we fast path mmio faults through quickly_check_mmio_pf(), which only checks the per-vcpu mmio cache. Since we hit the cache, KVM passes a KVM_EXIT_MMIO up to userspace. This patch fixes the issue by using the memslot generation number to validate the mmio cache. Signed-off-by: David Matlack <dmatlack@google.com> [xiaoguangrong: adjust the code to make it simpler for stable-tree fix.] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Tested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Zefan Li <lizefan@huawei.com>
2014-08-07x86_64/entry/xen: Do not invoke espfix64 on XenAndy Lutomirski1-1/+1
commit 7209a75d2009dbf7745e2fd354abf25c3deb3ca3 upstream. This moves the espfix64 logic into native_iret. To make this work, it gets rid of the native patch for INTERRUPT_RETURN: INTERRUPT_RETURN on native kernels is now 'jmp native_iret'. This changes the 16-bit SS behavior on Xen from OOPSing to leaking some bits of the Xen hypervisor's RSP (I think). [ hpa: this is a nonzero cost on native, but probably not enough to measure. Xen needs to fix this in their own code, probably doing something equivalent to espfix64. ] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/7b8f1d8ef6597cb16ae004a43c56980a7de3cf94.1406129132.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07x86, espfix: Fix broken header guardH. Peter Anvin1-1/+1
commit 20b68535cd27183ebd3651ff313afb2b97dac941 upstream. Header guard is #ifndef, not #ifdef... Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07x86, espfix: Move espfix definitions into a separate header fileH. Peter Anvin2-3/+18
commit e1fe9ed8d2a4937510d0d60e20705035c2609aea upstream. Sparse warns that the percpu variables aren't declared before they are defined. Rather than hacking around it, move espfix definitions into a proper header file. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stackH. Peter Anvin2-0/+5
commit 3891a04aafd668686239349ea58f3314ea2af86b upstream. The IRET instruction, when returning to a 16-bit segment, only restores the bottom 16 bits of the user space stack pointer. This causes some 16-bit software to break, but it also leaks kernel state to user space. We have a software workaround for that ("espfix") for the 32-bit kernel, but it relies on a nonzero stack segment base which is not available in 64-bit mode. In checkin: b3b42ac2cbae x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels we "solved" this by forbidding 16-bit segments on 64-bit kernels, with the logic that 16-bit support is crippled on 64-bit kernels anyway (no V86 support), but it turns out that people are doing stuff like running old Win16 binaries under Wine and expect it to work. This works around this by creating percpu "ministacks", each of which is mapped 2^16 times 64K apart. When we detect that the return SS is on the LDT, we copy the IRET frame to the ministack and use the relevant alias to return to userspace. The ministacks are mapped readonly, so if IRET faults we promote #GP to #DF which is an IST vector and thus has its own stack; we then do the fixup in the #DF handler. (Making #GP an IST exception would make the msr_safe functions unsafe in NMI/MC context, and quite possibly have other effects.) Special thanks to: - Andy Lutomirski, for the suggestion of using very small stack slots and copy (as opposed to map) the IRET frame there, and for the suggestion to mark them readonly and let the fault promote to #DF. - Konrad Wilk for paravirt fixup and testing. - Borislav Petkov for testing help and useful comments. Reported-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Andrew Lutomriski <amluto@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dirk Hohndel <dirk@hohndel.org> Cc: Arjan van de Ven <arjan.van.de.ven@intel.com> Cc: comex <comexk@gmail.com> Cc: Alexander van Heukelum <heukelum@fastmail.fm> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09KVM: x86: preserve the high 32-bits of the PAT registerPaolo Bonzini1-1/+1
commit 7cb060a91c0efc5ff94f83c6df3ed705e143cdb9 upstream. KVM does not really do much with the PAT, so this went unnoticed for a long time. It is exposed however if you try to do rdmsr on the PAT register. Reported-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09KVM: x86: Increase the number of fixed MTRR regs to 10Nadav Amit1-1/+1
commit 682367c494869008eb89ef733f196e99415ae862 upstream. Recent Intel CPUs have 10 variable range MTRRs. Since operating systems sometime make assumptions on CPUs while they ignore capability MSRs, it is better for KVM to be consistent with recent CPUs. Reporting more MTRRs than actually supported has no functional implications. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-07ptrace,x86: force IRET path after a ptrace_stop()Tejun Heo1-0/+16
commit b9cd18de4db3c9ffa7e17b0dc0ca99ed5aa4d43a upstream. The 'sysret' fastpath does not correctly restore even all regular registers, much less any segment registers or reflags values. That is very much part of why it's faster than 'iret'. Normally that isn't a problem, because the normal ptrace() interface catches the process using the signal handler infrastructure, which always returns with an iret. However, some paths can get caught using ptrace_event() instead of the signal path, and for those we need to make sure that we aren't going to return to user space using 'sysret'. Otherwise the modifications that may have been done to the register set by the tracer wouldn't necessarily take effect. Fix it by forcing IRET path by setting TIF_NOTIFY_RESUME from arch_ptrace_stop_needed() which is invoked from ptrace_stop(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-08x86, mm, hugetlb: Add missing TLB page invalidation for hugetlb_cow()Anthony Iliopoulos1-0/+1
commit 9844f5462392b53824e8b86726e7c33b5ecbb676 upstream. The invalidation is required in order to maintain proper semantics under CoW conditions. In scenarios where a process clones several threads, a thread operating on a core whose DTLB entry for a particular hugepage has not been invalidated, will be reading from the hugepage that belongs to the forked child process, even after hugetlb_cow(). The thread will not see the updated page as long as the stale DTLB entry remains cached, the thread attempts to write into the page, the child process exits, or the thread gets migrated to a different processor. Signed-off-by: Anthony Iliopoulos <anthony.iliopoulos@huawei.com> Link: http://lkml.kernel.org/r/20140514092948.GA17391@server-36.huawei.corp Suggested-by: Shay Goikhman <shay.goikhman@huawei.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14compiler/gcc4: Add quirk for 'asm goto' miscompilation bugIngo Molnar2-2/+2
commit 3f0116c3238a96bc18ad4b4acefe4e7be32fa861 upstream. Fengguang Wu, Oleg Nesterov and Peter Zijlstra tracked down a kernel crash to a GCC bug: GCC miscompiles certain 'asm goto' constructs, as outlined here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670 Implement a workaround suggested by Jakub Jelinek. Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Suggested-by: Jakub Jelinek <jakub@redhat.com> Reviewed-by: Richard Henderson <rth@twiddle.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [hq: Backported to 3.4: Adjust context] Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-03x86: fix boot on uniprocessor systemsArtem Fetishev1-1/+2
commit 825600c0f20e595daaa7a6dd8970f84fa2a2ee57 upstream. On x86 uniprocessor systems topology_physical_package_id() returns -1 which causes rapl_cpu_prepare() to leave rapl_pmu variable uninitialized which leads to GPF in rapl_pmu_init(). See arch/x86/kernel/cpu/perf_event_intel_rapl.c. It turns out that physical_package_id and core_id can actually be retreived for uniprocessor systems too. Enabling them also fixes rapl_pmu code. Signed-off-by: Artem Fetishev <artem_fetishev@epam.com> Cc: Stephane Eranian <eranian@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-16x86, fpu, amd: Clear exceptions in AMD FXSAVE workaroundLinus Torvalds1-6/+7
commit 26bef1318adc1b3a530ecc807ef99346db2aa8b0 upstream. Before we do an EMMS in the AMD FXSAVE information leak workaround we need to clear any pending exceptions, otherwise we trap with a floating-point exception inside this code. Reported-by: halfdog <me@halfdog.net> Tested-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/CA%2B55aFxQnY_PCG_n4=0w-VG=YLXL-yr7oMxyy0WU2gCBAf3ydg@mail.gmail.com Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-26KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions ↵Andy Honig1-2/+2
(CVE-2013-1797) commit 0b79459b482e85cb7426aa7da683a9f2c97aeae1 upstream. There is a potential use after free issue with the handling of MSR_KVM_SYSTEM_TIME. If the guest specifies a GPA in a movable or removable memory such as frame buffers then KVM might continue to write to that address even after it's removed via KVM_SET_USER_MEMORY_REGION. KVM pins the page in memory so it's unlikely to cause an issue, but if the user space component re-purposes the memory previously used for the guest, then the guest will be able to corrupt that memory. Tested: Tested against kvmclock unit test Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-17x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metalBoris Ostrovsky2-1/+6
commit 511ba86e1d386f671084b5d0e6f110bb30b8eeb2 upstream. Invoking arch_flush_lazy_mmu_mode() results in calls to preempt_enable()/disable() which may have performance impact. Since lazy MMU is not used on bare metal we can patch away arch_flush_lazy_mmu_mode() so that it is never called in such environment. [ hpa: the previous patch "Fix vmalloc_fault oops during lazy MMU updates" may cause a minor performance regression on bare metal. This patch resolves that performance regression. It is somewhat unclear to me if this is a good -stable candidate. ] Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: http://lkml.kernel.org/r/1364045796-10720-2-git-send-email-konrad.wilk@oracle.com Tested-by: Josh Boyer <jwboyer@redhat.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-12x86: remove the x32 syscall bitmask from syscall_get_nr()Paul Moore1-2/+2
commit 8b4b9f27e57584f3d90e0bb84cf800ad81cfe3a1 upstream. Commit fca460f95e928bae373daa8295877b6905bc62b8 simplified the x32 implementation by creating a syscall bitmask, equal to 0x40000000, that could be applied to x32 syscalls such that the masked syscall number would be the same as a x86_64 syscall. While that patch was a nice way to simplify the code, it went a bit too far by adding the mask to syscall_get_nr(); returning the masked syscall numbers can cause confusion with callers that expect syscall numbers matching the x32 ABI, e.g. unmasked syscall numbers. This patch fixes this by simply removing the mask from syscall_get_nr() while preserving the other changes from the original commit. While there are several syscall_get_nr() callers in the kernel, most simply check that the syscall number is greater than zero, in this case this patch will have no effect. Of those remaining callers, they appear to be few, seccomp and ftrace, and from my testing of seccomp without this patch the original commit definitely breaks things; the seccomp filter does not correctly filter the syscalls due to the difference in syscall numbers in the BPF filter and the value from syscall_get_nr(). Applying this patch restores the seccomp BPF filter functionality on x32. I've tested this patch with the seccomp BPF filters as well as ftrace and everything looks reasonable to me; needless to say general usage seemed fine as well. Signed-off-by: Paul Moore <pmoore@redhat.com> Link: http://lkml.kernel.org/r/20130215172143.12549.10292.stgit@localhost Cc: Will Drewry <wad@chromium.org> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05signal: Define __ARCH_HAS_SA_RESTORER so we know whether to clear sa_restorerBen Hutchings1-0/+2
Vaguely based on upstream commit 574c4866e33d 'consolidate kernel-side struct sigaction declarations'. flush_signal_handlers() needs to know whether sigaction::sa_restorer is defined, not whether SA_RESTORER is defined. Define the __ARCH_HAS_SA_RESTORER macro to indicate this. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28x86-32, mm: Remove reference to resume_map_numa_kva()H. Peter Anvin1-6/+0
commit bb112aec5ee41427e9b9726e3d57b896709598ed upstream. Remove reference to removed function resume_map_numa_kva(). Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-17x86/mm: Check if PUD is large when validating a kernel addressMel Gorman1-0/+5
commit 0ee364eb316348ddf3e0dfcd986f5f13f528f821 upstream. A user reported the following oops when a backup process reads /proc/kcore: BUG: unable to handle kernel paging request at ffffbb00ff33b000 IP: [<ffffffff8103157e>] kern_addr_valid+0xbe/0x110 [...] Call Trace: [<ffffffff811b8aaa>] read_kcore+0x17a/0x370 [<ffffffff811ad847>] proc_reg_read+0x77/0xc0 [<ffffffff81151687>] vfs_read+0xc7/0x130 [<ffffffff811517f3>] sys_read+0x53/0xa0 [<ffffffff81449692>] system_call_fastpath+0x16/0x1b Investigation determined that the bug triggered when reading system RAM at the 4G mark. On this system, that was the first address using 1G pages for the virt->phys direct mapping so the PUD is pointing to a physical address, not a PMD page. The problem is that the page table walker in kern_addr_valid() is not checking pud_large() and treats the physical address as if it was a PMD. If it happens to look like pmd_none then it'll silently fail, probably returning zeros instead of real data. If the data happens to look like a present PMD though, it will be walked resulting in the oops above. This patch adds the necessary pud_large() check. Unfortunately the problem was not readily reproducible and now they are running the backup program without accessing /proc/kcore so the patch has not been validated but I think it makes sense. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.coM> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20130211145236.GX21389@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-14efi: Make 'efi_enabled' a function to query EFI facilitiesMatt Fleming1-0/+1
commit 83e68189745ad931c2afd45d8ee3303929233e7f upstream. Originally 'efi_enabled' indicated whether a kernel was booted from EFI firmware. Over time its semantics have changed, and it now indicates whether or not we are booted on an EFI machine with bit-native firmware, e.g. 64-bit kernel with 64-bit firmware. The immediate motivation for this patch is the bug report at, https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557 which details how running a platform driver on an EFI machine that is designed to run under BIOS can cause the machine to become bricked. Also, the following report, https://bugzilla.kernel.org/show_bug.cgi?id=47121 details how running said driver can also cause Machine Check Exceptions. Drivers need a new means of detecting whether they're running on an EFI machine, as sadly the expression, if (!efi_enabled) hasn't been a sufficient condition for quite some time. Users actually want to query 'efi_enabled' for different reasons - what they really want access to is the list of available EFI facilities. For instance, the x86 reboot code needs to know whether it can invoke the ResetSystem() function provided by the EFI runtime services, while the ACPI OSL code wants to know whether the EFI config tables were mapped successfully. There are also checks in some of the platform driver code to simply see if they're running on an EFI machine (which would make it a bad idea to do BIOS-y things). This patch is a prereq for the samsung-laptop fix patch. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Cc: David Airlie <airlied@linux.ie> Cc: Corentin Chary <corentincj@iksaif.net> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Olof Johansson <olof@lixom.net> Cc: Peter Jones <pjones@redhat.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Steve Langasek <steve.langasek@canonical.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Konrad Rzeszutek Wilk <konrad@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-10x86, fpu: Avoid FPU lazy restore after suspendVincent Palatin1-6/+9
commit 644c154186386bb1fa6446bc5e037b9ed098db46 upstream. When a cpu enters S3 state, the FPU state is lost. After resuming for S3, if we try to lazy restore the FPU for a process running on the same CPU, this will result in a corrupted FPU context. Ensure that "fpu_owner_task" is properly invalided when (re-)initializing a CPU, so nobody will try to lazy restore a state which doesn't exist in the hardware. Tested with a 64-bit kernel on a 4-core Ivybridge CPU with eagerfpu=off, by doing thousands of suspend/resume cycles with 4 processes doing FPU operations running. Without the patch, a process is killed after a few hundreds cycles by a SIGFPE. Signed-off-by: Vincent Palatin <vpalatin@chromium.org> Cc: Duncan Laurie <dlaurie@chromium.org> Cc: Olof Johansson <olofj@chromium.org> Link: http://lkml.kernel.org/r/1354306532-1014-1-git-send-email-vpalatin@chromium.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-03x86-32: Fix invalid stack address while in softirqRobert Richter1-11/+4
commit 1022623842cb72ee4d0dbf02f6937f38c92c3f41 upstream. In 32 bit the stack address provided by kernel_stack_pointer() may point to an invalid range causing NULL pointer access or page faults while in NMI (see trace below). This happens if called in softirq context and if the stack is empty. The address at &regs->sp is then out of range. Fixing this by checking if regs and &regs->sp are in the same stack context. Otherwise return the previous stack pointer stored in struct thread_info. If that address is invalid too, return address of regs. BUG: unable to handle kernel NULL pointer dereference at 0000000a IP: [<c1004237>] print_context_stack+0x6e/0x8d *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: Pid: 4434, comm: perl Not tainted 3.6.0-rc3-oprofile-i386-standard-g4411a05 #4 Hewlett-Packard HP xw9400 Workstation/0A1Ch EIP: 0060:[<c1004237>] EFLAGS: 00010093 CPU: 0 EIP is at print_context_stack+0x6e/0x8d EAX: ffffe000 EBX: 0000000a ECX: f4435f94 EDX: 0000000a ESI: f4435f94 EDI: f4435f94 EBP: f5409ec0 ESP: f5409ea0 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 8005003b CR2: 0000000a CR3: 34ac9000 CR4: 000007d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Process perl (pid: 4434, ti=f5408000 task=f5637850 task.ti=f4434000) Stack: 000003e8 ffffe000 00001ffc f4e39b00 00000000 0000000a f4435f94 c155198c f5409ef0 c1003723 c155198c f5409f04 00000000 f5409edc 00000000 00000000 f5409ee8 f4435f94 f5409fc4 00000001 f5409f1c c12dce1c 00000000 c155198c Call Trace: [<c1003723>] dump_trace+0x7b/0xa1 [<c12dce1c>] x86_backtrace+0x40/0x88 [<c12db712>] ? oprofile_add_sample+0x56/0x84 [<c12db731>] oprofile_add_sample+0x75/0x84 [<c12ddb5b>] op_amd_check_ctrs+0x46/0x260 [<c12dd40d>] profile_exceptions_notify+0x23/0x4c [<c1395034>] nmi_handle+0x31/0x4a [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45 [<c13950ed>] do_nmi+0xa0/0x2ff [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45 [<c13949e5>] nmi_stack_correct+0x28/0x2d [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45 [<c1003603>] ? do_softirq+0x4b/0x7f <IRQ> [<c102a06f>] irq_exit+0x35/0x5b [<c1018f56>] smp_apic_timer_interrupt+0x6c/0x7a [<c1394746>] apic_timer_interrupt+0x2a/0x30 Code: 89 fe eb 08 31 c9 8b 45 0c ff 55 ec 83 c3 04 83 7d 10 00 74 0c 3b 5d 10 73 26 3b 5d e4 73 0c eb 1f 3b 5d f0 76 1a 3b 5d e8 73 15 <8b> 13 89 d0 89 55 e0 e8 ad 42 03 00 85 c0 8b 55 e0 75 a6 eb cc EIP: [<c1004237>] print_context_stack+0x6e/0x8d SS:ESP 0068:f5409ea0 CR2: 000000000000000a ---[ end trace 62afee3481b00012 ]--- Kernel panic - not syncing: Fatal exception in interrupt V2: * add comments to kernel_stack_pointer() * always return a valid stack address by falling back to the address of regs Reported-by: Yang Wei <wei.yang@windriver.com> Signed-off-by: Robert Richter <robert.richter@amd.com> Link: http://lkml.kernel.org/r/20120912135059.GZ8285@erda.amd.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Jun Zhang <jun.zhang@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31x86: efi: Turn off efi_enabled after setup on mixed fw/kernelOlof Johansson1-0/+1
commit 5189c2a7c7769ee9d037d76c1a7b8550ccf3481c upstream. When 32-bit EFI is used with 64-bit kernel (or vice versa), turn off efi_enabled once setup is done. Beyond setup, it is normally used to determine if runtime services are available and we will have none. This will resolve issues stemming from efivars modprobe panicking on a 32/64-bit setup, as well as some reboot issues on similar setups. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=45991 Reported-by: Marko Kohtala <marko.kohtala@gmail.com> Reported-by: Maxim Kammerer <mk@dee.su> Signed-off-by: Olof Johansson <olof@lixom.net> Acked-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13mm: thp: fix pmd_present for split_huge_page and PROT_NONE with THPAndrea Arcangeli1-3/+8
commit 027ef6c87853b0a9df53175063028edb4950d476 upstream. In many places !pmd_present has been converted to pmd_none. For pmds that's equivalent and pmd_none is quicker so using pmd_none is better. However (unless we delete pmd_present) we should provide an accurate pmd_present too. This will avoid the risk of code thinking the pmd is non present because it's under __split_huge_page_map, see the pmd_mknotpresent there and the comment above it. If the page has been mprotected as PROT_NONE, it would also lead to a pmd_present false negative in the same way as the race with split_huge_page. Because the PSE bit stays on at all times (both during split_huge_page and when the _PAGE_PROTNONE bit get set), we could only check for the PSE bit, but checking the PROTNONE bit too is still good to remember pmd_present must always keep PROT_NONE into account. This explains a not reproducible BUG_ON that was seldom reported on the lists. The same issue is in pmd_large, it would go wrong with both PROT_NONE and if it races with split_huge_page. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-02xen/m2p: do not reuse kmap_op->dev_bus_addrStefano Stabellini1-1/+2
commit 2fc136eecd0c647a6b13fcd00d0c41a1a28f35a5 upstream. If the caller passes a valid kmap_op to m2p_add_override, we use kmap_op->dev_bus_addr to store the original mfn, but dev_bus_addr is part of the interface with Xen and if we are batching the hypercalls it might not have been written by the hypervisor yet. That means that later on Xen will write to it and we'll think that the original mfn is actually what Xen has written to it. Rather than "stealing" struct members from kmap_op, keep using page->index to store the original mfn and add another parameter to m2p_remove_override to get the corresponding kmap_op instead. It is now responsibility of the caller to keep track of which kmap_op corresponds to a particular page in the m2p_override (gntdev, the only user of this interface that passes a valid kmap_op, is already doing that). Reported-and-Tested-By: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16thp: avoid atomic64_read in pmd_read_atomic for 32bit PAEAndrea Arcangeli1-13/+17
commit e4eed03fd06578571c01d4f1478c874bb432c815 upstream. In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding the mmap_sem for reading, cmpxchg8b cannot be used to read pmd contents under Xen. So instead of dealing only with "consistent" pmdvals in pmd_none_or_trans_huge_or_clear_bad() (which would be conceptually simpler) we let pmd_none_or_trans_huge_or_clear_bad() deal with pmdvals where the low 32bit and high 32bit could be inconsistent (to avoid having to use cmpxchg8b). The only guarantee we get from pmd_read_atomic is that if the low part of the pmd was found null, the high part will be null too (so the pmd will be considered unstable). And if the low part of the pmd is found "stable" later, then it means the whole pmd was read atomically (because after a pmd is stable, neither MADV_DONTNEED nor page faults can alter it anymore, and we read the high part after the low part). In the 32bit PAE x86 case, it is enough to read the low part of the pmdval atomically to declare the pmd as "stable" and that's true for THP and no THP, furthermore in the THP case we also have a barrier() that will prevent any inconsistent pmdvals to be cached by a later re-read of the *pmd. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Jonathan Nieder <jrnieder@gmail.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Tested-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race conditionAndrea Arcangeli1-0/+50
commit 26c191788f18129af0eb32a358cdaea0c7479626 upstream. When holding the mmap_sem for reading, pmd_offset_map_lock should only run on a pmd_t that has been read atomically from the pmdp pointer, otherwise we may read only half of it leading to this crash. PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic" #0 [f06a9dd8] crash_kexec at c049b5ec #1 [f06a9e2c] oops_end at c083d1c2 #2 [f06a9e40] no_context at c0433ded #3 [f06a9e64] bad_area_nosemaphore at c043401a #4 [f06a9e6c] __do_page_fault at c0434493 #5 [f06a9eec] do_page_fault at c083eb45 #6 [f06a9f04] error_code (via page_fault) at c083c5d5 EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000 DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0 CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246 #7 [f06a9f38] _spin_lock at c083bc14 #8 [f06a9f44] sys_mincore at c0507b7d #9 [f06a9fb0] system_call at c083becd start len EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00 SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033 CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286 This should be a longstanding bug affecting x86 32bit PAE without THP. Only archs with 64bit large pmd_t and 32bit unsigned long should be affected. With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad() would partly hide the bug when the pmd transition from none to stable, by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is enabled a new set of problem arises by the fact could then transition freely in any of the none, pmd_trans_huge or pmd_trans_stable states. So making the barrier in pmd_none_or_trans_huge_or_clear_bad() unconditional isn't good idea and it would be a flakey solution. This should be fully fixed by introducing a pmd_read_atomic that reads the pmd in order with THP disabled, or by reading the pmd atomically with cmpxchg8b with THP enabled. Luckily this new race condition only triggers in the places that must already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix is localized there but this bug is not related to THP. NOTE: this can trigger on x86 32bit systems with PAE enabled with more than 4G of ram, otherwise the high part of the pmd will never risk to be truncated because it would be zero at all times, in turn so hiding the SMP race. This bug was discovered and fully debugged by Ulrich, quote: ---- [..] pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and eax. 496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) 497 { 498 /* depend on compiler for an atomic pmd read */ 499 pmd_t pmdval = *pmd; // edi = pmd pointer 0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi ... // edx = PTE page table high address 0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx ... // eax = PTE page table low address 0xc0507a8e <sys_mincore+574>: mov (%edi),%eax [..] Please note that the PMD is not read atomically. These are two "mov" instructions where the high order bits of the PMD entry are fetched first. Hence, the above machine code is prone to the following race. - The PMD entry {high|low} is 0x0000000000000000. The "mov" at 0xc0507a84 loads 0x00000000 into edx. - A page fault (on another CPU) sneaks in between the two "mov" instructions and instantiates the PMD. - The PMD entry {high|low} is now 0x00000003fda38067. The "mov" at 0xc0507a8e loads 0xfda38067 into eax. ---- Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16x86, cpufeature: Rename X86_FEATURE_DTS to X86_FEATURE_DTHERMH. Peter Anvin1-1/+1
commit 4ad33411308596f2f918603509729922a1ec4411 upstream. It makes sense to label "Digital Thermal Sensor" as "DTS", but unfortunately the string "dts" was already used for "Debug Store", and /proc/cpuinfo is a user space ABI. Therefore, rename this to "dtherm". This conflict went into mainline via the hwmon tree without any x86 maintainer ack, and without any kind of hint in the subject. a4659053 x86/hwmon: fix initialization of coretemp Reported-by: Jean Delvare <khali@linux-fr.org> Link: http://lkml.kernel.org/r/4FE34BCB.5050305@linux.intel.com Cc: Jan Beulich <JBeulich@suse.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-17x86/uv: Fix UV2 BAU legacy modeCliff Wickman1-1/+0
commit d5d2d2eea84b0d8450b082edbc3dbde41fb8bfd8 upstream. The SGI Altix UV2 BAU (Broadcast Assist Unit) as used for tlb-shootdown (selective broadcast mode) always uses UV2 broadcast descriptor format. There is no need to clear the 'legacy' (UV1) mode, because the hardware always uses UV2 mode for selective broadcast. But the BIOS uses general broadcast and legacy mode, and the hardware pays attention to the legacy mode bit for general broadcast. So the kernel must not clear that mode bit. Signed-off-by: Cliff Wickman <cpw@sgi.com> Link: http://lkml.kernel.org/r/E1SccoO-0002Lh-Cb@eag09.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-14x86, kvm: KVM paravirt kernels don't check for CPUID being unavailableAlan Cox1-0/+3
We set cpuid_level to -1 if there is no CPUID instruction (only possible on i386). Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/20120514174059.30236.1064.stgit@bluebook Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=12122 Cc: Avi Kivity <avi@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-05-04vfs: make word-at-a-time accesses handle a non-existing pageLinus Torvalds1-0/+33
It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that can have holes in the kernel address space: it seems to happen easily with Xen, and it looks like the AMD gart64 code will also punch holes dynamically. Actually hitting that case is still very unlikely, so just do the access, and take an exception and fix it up for the very unlikely case of it being a page-crosser with no next page. And hey, this abstraction might even help other architectures that have other issues with unaligned word accesses than the possible missing next page. IOW, this could do the byte order magic too. Peter Anvin fixed a thinko in the shifting for the exception case. Reported-and-tested-by: Jana Saout <jana@saout.de> Cc: Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-24x32, siginfo: Provide proper overrides for x32 siginfo_tH. Peter Anvin1-1/+7
Provide the proper override macros for x32 siginfo_t. The combination of a special type here and an overall alignment constraint actually ends up with all the types being properly aligned, but the hack is needed to keep the substructures inside siginfo_t from adding padding. Note: use __attribute__((aligned())) since __aligned() is not exported to user space. [ v2: fix stray semicolon ] Reported-by: H.J. Lu <hjl.rools@gmail.com> Cc: Bruce J. Beare <bruce.j.beare@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Link: http://lkml.kernel.org/r/CAMe9rOqF6Kh6-NK7oP0Fpzkd4SBAWU%2BG53hwBbSD4iA2UzyxuA@mail.gmail.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-04-24x32: Check __ILP32__ instead of __LP64__ for x32H.J. Lu3-7/+7
Check __LP64__ isn't a reliable way to tell if we are compiling for x32 since __LP64__ isnn't specified by x86-64 psABI. Not all x86-64 compilers define __LP64__, which was added to GCC 3.3. The updated x32 psABI: https://sites.google.com/site/x32abi/documents definse _ILP32 and __ILP32__ for x32. GCC trunk and 4.7 branch have been updated to define _ILP32 and __ILP32__ for x32. This patch replaces __LP64__ check with __ILP32__. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-04-16x86/platform: Remove incorrect error message in x86_default_fixup_cpu_id()Andreas Herrmann1-1/+0
It's only called from amd.c:srat_detect_node(). The introduced condition for calling the fixup code is true for all AMD multi-node processors, e.g. Magny-Cours and Interlagos. There we have 2 NUMA nodes on one socket. Thus there are cores having different numa-node-id but with equal phys_proc_id. There is no point to print error messages in such a situation. The confusing/misleading error message was introduced with commit 64be4c1c2428e148de6081af235e2418e6a66dda ("x86: Add x86_init platform override to fix up NUMA core numbering"). Remove the default fixup function (especially the error message) and replace it by a NULL pointer check, move the Numascale-specific condition for calling the fixup into the fixup-function itself and slightly adapt the comment. Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Acked-by: Borislav Petkov <borislav.petkov@amd.com> Cc: <stable@kernel.org> Cc: <sp@numascale.com> Cc: <bp@amd64.org> Cc: <daniel@numascale-asia.com> Link: http://lkml.kernel.org/r/20120402160648.GR27684@alberich.amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>