summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/entry_64.S
AgeCommit message (Collapse)AuthorFilesLines
2015-03-10x86/asm/entry/64: Save user RSP in pt_regs->sp on SYSCALL64 fastpathDenys Vlasenko1-11/+7
Prepare for the removal of 'usersp', by simplifying PER_CPU(old_rsp) usage: - use it only as temp storage - store the userspace stack pointer immediately in pt_regs->sp on syscall entry, instead of using it later, on syscall exit. - change C code to use pt_regs->sp only, instead of PER_CPU(old_rsp) and task->thread.usersp. FIXUP/RESTORE_TOP_OF_STACK are simplified as well. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1425926364-9526-4-git-send-email-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-10x86/asm/entry/64: Remove stub_ioplDenys Vlasenko1-13/+0
stub_iopl is no longer needed: pt_regs->flags needs no fixing up after previous change. Remove it. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1425984307-2143-1-git-send-email-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-10x86/asm/entry/64: Save R11 into pt_regs->flags on SYSCALL64 fastpathDenys Vlasenko1-13/+11
Before this patch, R11 was saved in pt_regs->r11. Which looks natural, but requires messy shuffling to/from iret frame whenever ptrace or e.g. sys_iopl() wants to modify flags - because that's how this register is used by SYSCALL/SYSRET. This patch saves R11 in pt_regs->flags, and uses that value for the SYSRET64 instruction. Shuffling is eliminated. FIXUP/RESTORE_TOP_OF_STACK are simplified. stub_iopl is no longer needed: pt_regs->flags needs no fixing up. Testing shows that syscall fast path is ~54.3 ns before and after the patch (on 2.7 GHz Sandy Bridge CPU). Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1425926364-9526-2-git-send-email-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-06x86/asm/entry: Rename 'INIT_TSS_IST' to 'CPU_TSS_IST'Andy Lutomirski1-3/+3
This has nothing to do with the init thread or the initial anything. It's just the CPU's TSS. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a0bd5e26b32a2e1f08ff99017d0997118fbb2485.1425611534.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-06x86/asm/entry: Rename 'init_tss' to 'cpu_tss'Andy Lutomirski1-1/+1
It has nothing to do with init -- there's only one TSS per cpu. Other names considered include: - current_tss: Confusing because we never switch the tss. - singleton_tss: Too long. This patch was generated with 's/init_tss/cpu_tss/g'. Followup patches will fix INIT_TSS and INIT_TSS_IST by hand. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/da29fb2a793e4f649d93ce2d1ed320ebe8516262.1425611534.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry/64: Remove a bogus 'ret_from_fork' optimizationAndy Lutomirski1-5/+8
'ret_from_fork' checks TIF_IA32 to determine whether 'pt_regs' and the related state make sense for 'ret_from_sys_call'. This is entirely the wrong check. TS_COMPAT would make a little more sense, but there's really no point in keeping this optimization at all. This fixes a return to the wrong user CS if we came from int 0x80 in a 64-bit task. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/4710be56d76ef994ddf59087aad98c000fbab9a4.1424989793.git.luto@amacapital.net [ Backported from tip:x86/asm. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry/64: Remove a bogus 'ret_from_fork' optimizationAndy Lutomirski1-5/+8
'ret_from_fork' checks TIF_IA32 to determine whether 'pt_regs' and the related state make sense for 'ret_from_sys_call'. This is entirely the wrong check. TS_COMPAT would make a little more sense, but there's really no point in keeping this optimization at all. This fixes a return to the wrong user CS if we came from int 0x80 in a 64-bit task. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/4710be56d76ef994ddf59087aad98c000fbab9a4.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry/64: Simplify optimistic SYSRETDenys Vlasenko1-3/+3
Avoid redundant load of %r11 (it is already loaded a few instructions before). Also simplify %rsp restoration, instead of two steps: add $0x80, %rsp mov 0x18(%rsp), %rsp we can do a simplified single step to restore user-space RSP: mov 0x98(%rsp), %rsp and get the same result. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> [ Clarified the changelog. ] Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1aef69b346a6db0d99cdfb0f5ba83e8c985e27d7.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry/64: Use more readable constantsDenys Vlasenko1-12/+16
Constants such as SS+8 or SS+8-RIP are mysterious. In most cases, SS+8 is just meant to be SIZEOF_PTREGS, SS+8-RIP is RIP's offset in the iret frame. This patch changes some of these constants to be less mysterious. No code changes (verified with objdump). Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1d20491384773bd606e23a382fac23ddb49b5178.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry/64: Clean up and document various entry code detailsDenys Vlasenko1-32/+36
This patch does a lot of cleanup in comments and formatting, but it does not change any code: - Rename 'save_paranoid' to 'paranoid_entry': this makes naming similar to its "non-paranoid" sibling, 'error_entry', and to its counterpart, 'paranoid_exit'. - Use the same CFI annotation atop 'paranoid_entry' and 'error_entry'. - Fix irregular indentation of assembler operands. - Add/fix comments on top of 'paranoid_entry' and 'error_entry'. - Remove stale comment about "oldrax". - Make comments about "no swapgs" flag in ebx more prominent. - Deindent wrongly indented top-level comment atop 'paranoid_exit'. - Indent wrongly deindented comment inside 'error_entry'. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/4640f9fcd5ea46eb299b1cd6d3f5da3167d2f78d.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry/64: Move 'save_paranoid' and 'ret_from_fork' closer to their usersDenys Vlasenko1-53/+53
For some odd reason, these two functions are at the very top of the file. "save_paranoid"'s caller is approximately in the middle of it, move it there. Move 'ret_from_fork' to be right after fork/exec helpers. This is a pure block move, nothing is changed in the function bodies. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/6446bbfe4094532623a5b83779b7015fec167a9d.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry: Add comments about various syscall instructionsDenys Vlasenko1-13/+19
SYSCALL/SYSRET and SYSENTER/SYSEXIT have weird semantics. Moreover, they differ in 32- and 64-bit mode. What is saved? What is not? Is rsp set? Are interrupts disabled? People tend to not remember these details well enough. This patch adds comments which explain in detail what registers are modified by each of these instructions. The comments are placed immediately before corresponding entry and exit points. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/a94b98b63527797c871a81402ff5060b18fa880a.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry/64: Remove 'int_check_syscall_exit_work'Andy Lutomirski1-1/+0
Nothing references it anymore. Reported-by: Denys Vlasenko <vda.linux@googlemail.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 96b6352c1271 ("x86_64, entry: Remove the syscall exit audit and schedule optimizations") Link: http://lkml.kernel.org/r/dd2a4d26ecc7a5db61b476727175cd99ae2b32a4.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry: Do mass removal of 'ARGOFFSET'Denys Vlasenko1-43/+43
ARGOFFSET is zero now, removing it changes no code. A few macros lost "offset" parameter, since it is always zero now too. No code changes - verified with objdump. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/8689f937622d9d2db0ab8be82331fa15e4ed4713.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry/64: Shrink code in 'paranoid_exit'Denys Vlasenko1-6/+4
RESTORE_EXTRA_REGS + RESTORE_C_REGS looks small, but it's a lot of instructions (fourteen). Let's reuse them. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> [ Cleaned up the labels. ] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1421272101-16847-2-git-send-email-dvlasenk@redhat.com Link: http://lkml.kernel.org/r/59d71848cee3ec9eb48c0252e602efd6bd560e3c.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry/64: Fix commentsDenys Vlasenko1-6/+12
- Misleading and slightly incorrect comments in "struct pt_regs" are fixed (four instances). - Fix incorrect comment atop EMPTY_FRAME macro. - Explain in more detail what we do with stack layout during hw interrupt. - Correct comments about "partial stack frame" which are no longer true. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1423778052-21038-3-git-send-email-dvlasenk@redhat.com Link: http://lkml.kernel.org/r/e1f4429c491fe6ceeddb879dea2786e0f8920f9c.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry/64: Always allocate a complete "struct pt_regs" on the kernel ↵Denys Vlasenko1-121/+75
stack The 64-bit entry code was using six stack slots less by not saving/restoring registers which are callee-preserved according to the C ABI, and was not allocating space for them. Only when syscalls needed a complete "struct pt_regs" was the complete area allocated and filled in. As an additional twist, on interrupt entry a "slightly less truncated pt_regs" trick is used, to make nested interrupt stacks easier to unwind. This proved to be a source of significant obfuscation and subtle bugs. For example, 'stub_fork' had to pop the return address, extend the struct, save registers, and push return address back. Ugly. 'ia32_ptregs_common' pops return address and "returns" via jmp insn, throwing a wrench into CPU return stack cache. This patch changes the code to always allocate a complete "struct pt_regs" on the kernel stack. The saving of registers is still done lazily. "Partial pt_regs" trick on interrupt stack is retained. Macros which manipulate "struct pt_regs" on stack are reworked: - ALLOC_PT_GPREGS_ON_STACK allocates the structure. - SAVE_C_REGS saves to it those registers which are clobbered by C code. - SAVE_EXTRA_REGS saves to it all other registers. - Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. 'ia32_ptregs_common', 'stub_fork' and friends lost their ugly dance with the return pointer. LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets instead of magic numbers. 'error_entry' and 'save_paranoid' now use SAVE_C_REGS + SAVE_EXTRA_REGS instead of having it open-coded yet again. Patch was run-tested: 64-bit executables, 32-bit executables, strace works. Timing tests did not show measurable difference in 32-bit and 64-bit syscalls. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1423778052-21038-2-git-send-email-dvlasenk@redhat.com Link: http://lkml.kernel.org/r/b89763d354aa23e670b9bdf3a40ae320320a7c2e.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/asm/entry/64: Fix incorrect symbolic constant usage: R11->ARGOFFSETDenys Vlasenko1-3/+3
Since the last fix of this nature, a few more instances have crept in. Fix them up. No object code changes (constants have the same value). Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1423778052-21038-1-git-send-email-dvlasenk@redhat.com Link: http://lkml.kernel.org/r/f5e1c4084319a42e5f14d41e2d638949ce66bc08.1424989793.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-23x86/xen: allow privcmd hypercalls to be preemptedDavid Vrabel1-0/+3
Hypercalls submitted by user space tools via the privcmd driver can take a long time (potentially many 10s of seconds) if the hypercall has many sub-operations. A fully preemptible kernel may deschedule such as task in any upcall called from a hypercall continuation. However, in a kernel with voluntary or no preemption, hypercall continuations in Xen allow event handlers to be run but the task issuing the hypercall will not be descheduled until the hypercall is complete and the ioctl returns to user space. These long running tasks may also trigger the kernel's soft lockup detection. Add xen_preemptible_hcall_begin() and xen_preemptible_hcall_end() to bracket hypercalls that may be preempted. Use these in the privcmd driver. When returning from an upcall, call xen_maybe_preempt_hcall() which adds a schedule point if if the current task was within a preemptible hypercall. Since _cond_resched() can move the task to a different CPU, clear and set xen_in_preemptible_hcall around the call. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2015-02-01x86_64, entry: Remove the syscall exit audit and schedule optimizationsAndy Lutomirski1-47/+5
We used to optimize rescheduling and audit on syscall exit. Now that the full slow path is reasonably fast, remove these optimizations. Syscall exit auditing is now handled exclusively by syscall_trace_leave. This adds something like 10ns to the previously optimized paths on my computer, presumably due mostly to SAVE_REST / RESTORE_REST. I think that we should eventually replace both the syscall and non-paranoid interrupt exit slow paths with a pair of C functions along the lines of the syscall entry hooks. Link: http://lkml.kernel.org/r/22f2aa4a0361707a5cfb1de9d45260b39965dead.1421453410.git.luto@amacapital.net Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-02-01x86_64, entry: Use sysret to return to userspace when possibleAndy Lutomirski1-0/+54
The x86_64 entry code currently jumps through complex and inconsistent hoops to try to minimize the impact of syscall exit work. For a true fast-path syscall, almost nothing needs to be done, so returning is just a check for exit work and sysret. For a full slow-path return from a syscall, the C exit hook is invoked if needed and we join the iret path. Using iret to return to userspace is very slow, so the entry code has accumulated various special cases to try to do certain forms of exit work without invoking iret. This is error-prone, since it duplicates assembly code paths, and it's dangerous, since sysret can malfunction in interesting ways if used carelessly. It's also inefficient, since a lot of useful cases aren't optimized and therefore force an iret out of a combination of paranoia and the fact that no one has bothered to write even more asm code to avoid it. I would argue that this approach is backwards. Rather than trying to avoid the iret path, we should instead try to make the iret path fast. Under a specific set of conditions, iret is unnecessary. In particular, if RIP==RCX, RFLAGS==R11, RIP is canonical, RF is not set, and both SS and CS are as expected, then movq 32(%rsp),%rsp;sysret does the same thing as iret. This set of conditions is nearly always satisfied on return from syscalls, and it can even occasionally be satisfied on return from an irq. Even with the careful checks for sysret applicability, this cuts nearly 80ns off of the overhead from syscalls with unoptimized exit work. This includes tracing and context tracking, and any return that invokes KVM's user return notifier. For example, the cost of getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to ~280ns on my computer. This may allow the removal and even eventual conversion to C of a respectable amount of exit asm. This may require further tweaking to give the full benefit on Xen. It may be worthwhile to adjust signal delivery and exec to try hit the sysret path. This does not optimize returns to 32-bit userspace. Making the same optimization for CS == __USER32_CS is conceptually straightforward, but it will require some tedious code to handle the differences between sysretl and sysexitl. Link: http://lkml.kernel.org/r/71428f63e681e1b4aa1a781e3ef7c27f027d1103.1421453410.git.luto@amacapital.net Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-01-28Merge tag 'pr-20150114-x86-entry' of ↵Ingo Molnar1-121/+87
git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux into x86/asm Pull x86/entry enhancements from Andy Lutomirski: " This is my accumulated x86 entry work, part 1, for 3.20. The meat of this is an IST rework. When an IST exception interrupts user space, we will handle it on the per-thread kernel stack instead of on the IST stack. This sounds messy, but it actually simplifies the IST entry/exit code, because it eliminates some ugly games we used to play in order to handle rescheduling, signal delivery, etc on the way out of an IST exception. The IST rework introduces proper context tracking to IST exception handlers. I haven't seen any bug reports, but the old code could have incorrectly treated an IST exception handler as an RCU extended quiescent state. The memory failure change (included in this pull request with Borislav and Tony's permission) eliminates a bunch of code that is no longer needed now that user memory failure handlers are called in process context. Finally, this includes a few on Denys' uncontroversial and Obviously Correct (tm) cleanups. The IST and memory failure changes have been in -next for a while. LKML references: IST rework: http://lkml.kernel.org/r/cover.1416604491.git.luto@amacapital.net Memory failure change: http://lkml.kernel.org/r/54ab2ffa301102cd6e@agluck-desk.sc.intel.com Denys' cleanups: http://lkml.kernel.org/r/1420927210-19738-1-git-send-email-dvlasenk@redhat.com " This tree semantically depends on and is based on the following RCU commit: 734d16801349 ("rcu: Make rcu_nmi_enter() handle nesting") ... and for that reason won't be pushed upstream before the RCU bits hit Linus's tree. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-17x86_64 entry: Fix RCX for ptraced syscallsAndy Lutomirski1-1/+2
The int_ret_from_sys_call and syscall tracing code disagrees with the sysret path as to the value of RCX. The Intel SDM, the AMD APM, and my laptop all agree that sysret returns with RCX == RIP. The syscall tracing code does not respect this property. For example, this program: int main() { extern const char syscall_rip[]; unsigned long rcx = 1; unsigned long orig_rcx = rcx; asm ("mov $-1, %%eax\n\t" "syscall\n\t" "syscall_rip:" : "+c" (rcx) : : "r11"); printf("syscall: RCX = %lX RIP = %lX orig RCX = %lx\n", rcx, (unsigned long)syscall_rip, orig_rcx); return 0; } prints: syscall: RCX = 400556 RIP = 400556 orig RCX = 1 Running it under strace gives this instead: syscall: RCX = FFFFFFFFFFFFFFFF RIP = 400556 orig RCX = 1 This changes FIXUP_TOP_OF_STACK to match sysret, causing the test to show RCX == RIP even under strace. It looks like this is a partial revert of: 88e4bc32686e ("[PATCH] x86-64 architecture specific sync for 2.5.8") from the historic git tree. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/c9a418c3dc3993cb88bb7773800225fd318a4c67.1421453410.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-14x86: entry_64.S: fold SAVE_ARGS_IRQ macro into its sole userDenys Vlasenko1-46/+42
No code changes. This is a preparatory patch for change in "struct pt_regs" handling. CC: Linus Torvalds <torvalds@linux-foundation.org> CC: Oleg Nesterov <oleg@redhat.com> CC: "H. Peter Anvin" <hpa@zytor.com> CC: Andy Lutomirski <luto@amacapital.net> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: X86 ML <x86@kernel.org> CC: Alexei Starovoitov <ast@plumgrid.com> CC: Will Drewry <wad@chromium.org> CC: Kees Cook <keescook@chromium.org> CC: linux-kernel@vger.kernel.org Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-01-14x86: entry_64.S: delete unused codeDenys Vlasenko1-34/+0
A define, two macros and an unreferenced bit of assembly are gone. Acked-by: Borislav Petkov <bp@suse.de> CC: Linus Torvalds <torvalds@linux-foundation.org> CC: Oleg Nesterov <oleg@redhat.com> CC: "H. Peter Anvin" <hpa@zytor.com> CC: Andy Lutomirski <luto@amacapital.net> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: X86 ML <x86@kernel.org> CC: Alexei Starovoitov <ast@plumgrid.com> CC: Will Drewry <wad@chromium.org> CC: Kees Cook <keescook@chromium.org> CC: linux-kernel@vger.kernel.org Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-01-02x86, entry: Switch stacks on a paranoid entry from userspaceAndy Lutomirski1-41/+45
This causes all non-NMI, non-double-fault kernel entries from userspace to run on the normal kernel stack. Double-fault is exempt to minimize confusion if we double-fault directly from userspace due to a bad kernel stack. This is, suprisingly, simpler and shorter than the current code. It removes the IMO rather frightening paranoid_userspace path, and it make sync_regs much simpler. There is no risk of stack overflow due to this change -- the kernel stack that we switch to is empty. This will also enable us to create non-atomic sections within machine checks from userspace, which will simplify memory failure handling. It will also allow the upcoming fsgsbase code to be simplified, because it doesn't need to worry about usergs when scheduling in paranoid_exit, as that code no longer exists. Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Acked-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2014-12-20Merge branch 'x86-apic-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic updates from Thomas Gleixner: "After stopping the full x86/apic branch, I took some time to go through the first block of patches again, which are mostly cleanups and preparatory work for the irqdomain conversion and ioapic hotplug support. Unfortunaly one of the real problematic commits was right at the beginning, so I rebased this portion of the pending patches without the offenders. It would be great to get this into 3.19. That makes reworking the problematic parts simpler. The usual tip testing did not unearth any issues and it is fully bisectible now. I'm pretty confident that this wont affect the calmness of the xmas season. Changes: - Split the convoluted io_apic.c code into domain specific parts (vector, ioapic, msi, htirq) - Introduce proper helper functions to retrieve irq specific data instead of open coded dereferencing of pointers - Preparatory work for ioapic hotplug and irqdomain conversion - Removal of the non functional pci-ioapic driver - Removal of unused irq entry stubs - Make native_smp_prepare_cpus() preemtible to avoid GFP_ATOMIC allocations for everything which is called from there. - Small cleanups and fixes" * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) iommu/amd: Use helpers to access irq_cfg data structure associated with IRQ iommu/vt-d: Use helpers to access irq_cfg data structure associated with IRQ x86: irq_remapping: Use helpers to access irq_cfg data structure associated with IRQ x86, irq: Use helpers to access irq_cfg data structure associated with IRQ x86, irq: Make MSI and HT_IRQ indepenent of X86_IO_APIC x86, irq: Move IRQ initialization routines from io_apic.c into vector.c x86, irq: Move IOAPIC related declarations from hw_irq.h into io_apic.h x86, irq: Move HT IRQ related code from io_apic.c into htirq.c x86, irq: Move PCI MSI related code from io_apic.c into msi.c x86, irq: Replace printk(KERN_LVL) with pr_lvl() utilities x86, irq: Make UP version of irq_complete_move() an inline stub x86, irq: Move local APIC related code from io_apic.c into vector.c x86, irq: Introduce helpers to access struct irq_cfg x86, irq: Protect __clear_irq_vector() with vector_lock x86, irq: Rename local APIC related functions in io_apic.c as apic_xxx() x86, irq: Refine hw_irq.h to prepare for irqdomain support x86, irq: Convert irq_2_pin list to generic list x86, irq: Kill useless parameter 'irq_attr' of IO_APIC_get_PCI_irq_vector() x86, irq, acpi: Get rid of special handling of GSI for ACPI SCI x86, irq: Introduce helper to check whether an IOAPIC has been registered ...
2014-12-16x86: Avoid building unused IRQ entry stubsJan Beulich1-2/+2
When X86_LOCAL_APIC (i.e. unconditionally on x86-64), first_system_vector will never end up being higher than LOCAL_TIMER_VECTOR (0xef), and hence building stubs for vectors 0xef...0xff is pointlessly reducing code density. Deal with this at build time already. Taking into consideration that X86_64 implies X86_LOCAL_APIC, also simplify (and hence make easier to read and more consistent with the change done here) some #if-s in arch/x86/kernel/irqinit.c. While we could further improve the packing of the IRQ entry stubs (the four ones now left in the last set could be fit into the four padding bytes each of the final four sets have) this doesn't seem to provide any real benefit: Both irq_entries_start and common_interrupt getting cache line aligned, eliminating the 30th set would just produce 32 bytes of padding between the 29th and common_interrupt. [ tglx: Folded lguest fix from Dan Carpenter ] Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: lguest@lists.ozlabs.org Cc: Rusty Russell <rusty@rustcorp.com.au> Link: http://lkml.kernel.org/r/54574D5F0200007800044389@mail.emea.novell.com Link: http://lkml.kernel.org/r/20141115185718.GB6530@mwanda Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-12-13x86: hook up execveat system callDavid Drysdale1-0/+28
Hook up x86-64, i386 and x32 ABIs. Signed-off-by: David Drysdale <drysdale@google.com> Cc: Meredydd Luff <meredydd@senatehouse.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rich Felker <dalias@aerifal.cx> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-24x86_64, traps: Rework bad_iretAndy Lutomirski1-26/+19
It's possible for iretq to userspace to fail. This can happen because of a bad CS, SS, or RIP. Historically, we've handled it by fixing up an exception from iretq to land at bad_iret, which pretends that the failed iret frame was really the hardware part of #GP(0) from userspace. To make this work, there's an extra fixup to fudge the gs base into a usable state. This is suboptimal because it loses the original exception. It's also buggy because there's no guarantee that we were on the kernel stack to begin with. For example, if the failing iret happened on return from an NMI, then we'll end up executing general_protection on the NMI stack. This is bad for several reasons, the most immediate of which is that general_protection, as a non-paranoid idtentry, will try to deliver signals and/or schedule from the wrong stack. This patch throws out bad_iret entirely. As a replacement, it augments the existing swapgs fudge into a full-blown iret fixup, mostly written in C. It's should be clearer and more correct. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-24x86_64, traps: Stop using IST for #SSAndy Lutomirski1-1/+1
On a 32-bit kernel, this has no effect, since there are no IST stacks. On a 64-bit kernel, #SS can only happen in user code, on a failed iret to user space, a canonical violation on access via RSP or RBP, or a genuine stack segment violation in 32-bit kernel code. The first two cases don't need IST, and the latter two cases are unlikely fatal bugs, and promoting them to double faults would be fine. This fixes a bug in which the espfix64 code mishandles a stack segment violation. This saves 4k of memory per CPU and a tiny bit of code. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-24x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in CAndy Lutomirski1-32/+2
There's nothing special enough about the espfix64 double fault fixup to justify writing it in assembly. Move it to C. This also fixes a bug: if the double fault came from an IST stack, the old asm code would return to a partially uninitialized stack frame. Fixes: 3891a04aafd668686239349ea58f3314ea2af86b Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-09x86_64, entry: Use split-phase syscall_trace_enter for 64-bit syscallsAndy Lutomirski1-23/+15
On KVM on my box, this reduces the overhead from an always-accept seccomp filter from ~130ns to ~17ns. Most of that comes from avoiding IRET on every syscall when seccomp is enabled. In extremely approximate hacked-up benchmarking, just bypassing IRET saves about 80ns, so there's another 43ns of savings here from simplifying the seccomp path. The diffstat is also rather nice :) Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/a3dbd267ee990110478d349f78cccfdac5497a84.1409954077.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-09-09x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscallsAndy Lutomirski1-9/+4
For slowpath syscalls, we initialize regs->ax to -ENOSYS and stick the syscall number into regs->orig_ax prior to any possible tracing and syscall execution. This is user-visible ABI used by ptrace syscall emulation and seccomp. For fastpath syscalls, there's no good reason not to do the same thing. It's even slightly simpler than what we're currently doing. It probably has no measureable performance impact. It should have no user-visible effect. The purpose of this patch is to prepare for two-phase syscall tracing, in which the first phase might modify the saved RAX without leaving the fast path. This change is just subtle enough that I'm keeping it separate. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/01218b493f12ae2f98034b78c9ae085e38e94350.1409954077.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-08-05Merge branches 'x86-build-for-linus', 'x86-cleanups-for-linus' and ↵Linus Torvalds1-26/+26
'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 build/cleanup/debug updates from Ingo Molnar: "Robustify the build process with a quirk to avoid GCC reordering related bugs. Two code cleanups. Simplify entry_64.S CFI annotations, by Jan Beulich" * 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, build: Change code16gcc.h from a C header to an assembly header * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Simplify __HAVE_ARCH_CMPXCHG tests x86/tsc: Get rid of custom DIV_ROUND() macro * 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/debug: Drop several unnecessary CFI annotations
2014-07-29x86_64/entry/xen: Do not invoke espfix64 on XenAndy Lutomirski1-18/+10
This moves the espfix64 logic into native_iret. To make this work, it gets rid of the native patch for INTERRUPT_RETURN: INTERRUPT_RETURN on native kernels is now 'jmp native_iret'. This changes the 16-bit SS behavior on Xen from OOPSing to leaking some bits of the Xen hypervisor's RSP (I think). [ hpa: this is a nonzero cost on native, but probably not enough to measure. Xen needs to fix this in their own code, probably doing something equivalent to espfix64. ] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/7b8f1d8ef6597cb16ae004a43c56980a7de3cf94.1406129132.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@vger.kernel.org>
2014-07-16x86/debug: Drop several unnecessary CFI annotationsJan Beulich1-26/+26
With the conversion of the register saving code from macros to functions, and with those functions not clobbering most of the registers they spill, there's no need to annotate most of the spill operations; the only exceptions being %rbx (always modified) and %rcx (modified on the error_kernelspace: path). Also remove a bogus commented out annotation - there's no register %orig_rax after all. Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/53AAE69A020000780001D3C7@mail.emea.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-13Merge branch 'perf-core-for-linus' of ↵Linus Torvalds1-21/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more perf updates from Ingo Molnar: "A second round of perf updates: - wide reaching kprobes sanitization and robustization, with the hope of fixing all 'probe this function crashes the kernel' bugs, by Masami Hiramatsu. - uprobes updates from Oleg Nesterov: tmpfs support, corner case fixes and robustization work. - perf tooling updates and fixes from Jiri Olsa, Namhyung Ki, Arnaldo et al: * Add support to accumulate hist periods (Namhyung Kim) * various fixes, refactorings and enhancements" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits) perf: Differentiate exec() and non-exec() comm events perf: Fix perf_event_comm() vs. exec() assumption uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates perf/documentation: Add description for conditional branch filter perf/x86: Add conditional branch filtering support perf/tool: Add conditional branch filter 'cond' to perf record perf: Add new conditional branch filter 'PERF_SAMPLE_BRANCH_COND' uprobes: Teach copy_insn() to support tmpfs uprobes: Shift ->readpage check from __copy_insn() to uprobe_register() perf/x86: Use common PMU interrupt disabled code perf/ARM: Use common PMU interrupt disabled code perf: Disable sampled events if no PMU interrupt perf: Fix use after free in perf_remove_from_context() perf tools: Fix 'make help' message error perf record: Fix poll return value propagation perf tools: Move elide bool into perf_hpp_fmt struct perf tools: Remove elide setup for SORT_MODE__MEMORY mode perf tools: Fix "==" into "=" in ui_browser__warning assignment perf tools: Allow overriding sysfs and proc finding with env var perf tools: Consider header files outside perf directory in tags target ...
2014-06-10Merge tag 'trace-3.16' of ↵Linus Torvalds1-204/+0
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "Lots of tweaks, small fixes, optimizations, and some helper functions to help out the rest of the kernel to ease their use of trace events. The big change for this release is the allowing of other tracers, such as the latency tracers, to be used in the trace instances and allow for function or function graph tracing to be in the top level simultaneously" * tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits) tracing: Fix memory leak on instance deletion tracing: Fix leak of ring buffer data when new instances creation fails tracing/kprobes: Avoid self tests if tracing is disabled on boot up tracing: Return error if ftrace_trace_arrays list is empty tracing: Only calculate stats of tracepoint benchmarks for 2^32 times tracing: Convert stddev into u64 in tracepoint benchmark tracing: Introduce saved_cmdlines_size file tracing: Add __get_dynamic_array_len() macro for trace events tracing: Remove unused variable in trace_benchmark tracing: Eliminate double free on failure of allocation on boot up ftrace/x86: Call text_ip_addr() instead of the duplicated code tracing: Print max callstack on stacktrace bug tracing: Move locking of trace_cmdline_lock into start/stop seq calls tracing: Try again for saved cmdline if failed due to locking tracing: Have saved_cmdlines use the seq_read infrastructure tracing: Add tracepoint benchmark tracepoint tracing: Print nasty banner when trace_printk() is in use tracing: Add funcgraph_tail option to print function name after closing braces tracing: Eliminate duplicate TRACE_GRAPH_PRINT_xx defines tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks ...
2014-06-05Merge branch 'x86/espfix' of ↵Linus Torvalds1-4/+77
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull x86-64 espfix changes from Peter Anvin: "This is the espfix64 code, which fixes the IRET information leak as well as the associated functionality problem. With this code applied, 16-bit stack segments finally work as intended even on a 64-bit kernel. Consequently, this patchset also removes the runtime option that we added as an interim measure. To help the people working on Linux kernels for very small systems, this patchset also makes these compile-time configurable features" * 'x86/espfix' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "x86-64, modify_ldt: Make support for 16-bit segments a runtime option" x86, espfix: Make it possible to disable 16-bit support x86, espfix: Make espfix64 a Kconfig option, fix UML x86, espfix: Fix broken header guard x86, espfix: Move espfix definitions into a separate header file x86-32, espfix: Remove filter for espfix32 due to race x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack
2014-05-22x86_64, entry: Merge paranoidzeroentry_ist into idtentryAndy Lutomirski1-25/+22
One more specialized entry function is now gone. Again, this seems to only change line numbers in entry_64.o. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/f54854f07ff3be8162b166124dbead23feeefe10.1400709717.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-22x86_64, entry: Merge most 64-bit asm entry macrosAndy Lutomirski1-88/+64
I haven't touched the device interrupt code, which is different enough that it's probably not worth merging, and I haven't done anything about paranoidzeroentry_ist yet. This appears to produce an entry_64.o file that differs only in the debug info line numbers. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/e7a6acfb130471700370e77af9e4b4b6ed46f5ef.1400709717.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-22x86_64, entry: Add missing 'DEFAULT_FRAME 0' entry annotationsAndy Lutomirski1-0/+2
The paranoidzeroentry macros were missing them. I'm not at all convinced that these annotations are correct and/or necessary, but this makes the macros more consistent with each other. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/10ad65f534f8bc62e77f74fe15f68e8d4a59d8b3.1400709717.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-14ftrace/x86: Move the mcount/fentry code out of entry_64.SSteven Rostedt1-204/+0
As the mcount code gets more complex, it really does not belong in the entry.S file. By moving it into its own file "mcount.S" keeps things a bit cleaner. Link: http://lkml.kernel.org/p/20140508152152.2130e8cf@gandalf.local.home Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-04x86, espfix: Make it possible to disable 16-bit supportH. Peter Anvin1-0/+8
Embedded systems, which may be very memory-size-sensitive, are extremely unlikely to ever encounter any 16-bit software, so make it a CONFIG_EXPERT option to turn off support for any 16-bit software whatsoever. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
2014-05-01x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stackH. Peter Anvin1-4/+69
The IRET instruction, when returning to a 16-bit segment, only restores the bottom 16 bits of the user space stack pointer. This causes some 16-bit software to break, but it also leaks kernel state to user space. We have a software workaround for that ("espfix") for the 32-bit kernel, but it relies on a nonzero stack segment base which is not available in 64-bit mode. In checkin: b3b42ac2cbae x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels we "solved" this by forbidding 16-bit segments on 64-bit kernels, with the logic that 16-bit support is crippled on 64-bit kernels anyway (no V86 support), but it turns out that people are doing stuff like running old Win16 binaries under Wine and expect it to work. This works around this by creating percpu "ministacks", each of which is mapped 2^16 times 64K apart. When we detect that the return SS is on the LDT, we copy the IRET frame to the ministack and use the relevant alias to return to userspace. The ministacks are mapped readonly, so if IRET faults we promote #GP to #DF which is an IST vector and thus has its own stack; we then do the fixup in the #DF handler. (Making #GP an IST exception would make the msr_safe functions unsafe in NMI/MC context, and quite possibly have other effects.) Special thanks to: - Andy Lutomirski, for the suggestion of using very small stack slots and copy (as opposed to map) the IRET frame there, and for the suggestion to mark them readonly and let the fault promote to #DF. - Konrad Wilk for paravirt fixup and testing. - Borislav Petkov for testing help and useful comments. Reported-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Andrew Lutomriski <amluto@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dirk Hohndel <dirk@hohndel.org> Cc: Arjan van de Ven <arjan.van.de.ven@intel.com> Cc: comex <comexk@gmail.com> Cc: Alexander van Heukelum <heukelum@fastmail.fm> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: <stable@vger.kernel.org> # consider after upstream merge
2014-04-24kprobes: Prohibit probing on .entry.text codeMasami Hiramatsu1-20/+0
.entry.text is a code area which is used for interrupt/syscall entries, which includes many sensitive code. Thus, it is better to prohibit probing on all of such code instead of a part of that. Since some symbols are already registered on kprobe blacklist, this also removes them from the blacklist. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: David S. Miller <davem@davemloft.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jonathan Lebon <jlebon@redhat.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Link: http://lkml.kernel.org/r/20140417081658.26341.57354.stgit@ltc230.yrl.intra.hitachi.co.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-10ftrace/x86: Load ftrace_ops in parameter not the variable holding itSteven Rostedt1-1/+1
Function tracing callbacks expect to have the ftrace_ops that registered it passed to them, not the address of the variable that holds the ftrace_ops that registered it. Use a mov instead of a lea to store the ftrace_ops into the parameter of the function tracing callback. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20131113152004.459787f9@gandalf.local.home Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@vger.kernel.org> # v3.8+
2013-11-14Merge branch 'x86-trace-for-linus' of ↵Linus Torvalds1-1/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/trace changes from Ingo Molnar: "This adds page fault tracepoints which have zero runtime cost in the disabled case via IDT trickery (no NOPs in the page fault hotpath)" * 'x86-trace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, trace: Change user|kernel_page_fault to page_fault_user|kernel x86, trace: Add page fault tracepoints x86, trace: Delete __trace_alloc_intr_gate() x86, trace: Register exception handler to trace IDT x86, trace: Remove __alloc_intr_gate()
2013-11-12Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "The main changes in this cycle are: - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik van Riel, Peter Zijlstra et al. Yay! - optimize preemption counter handling: merge the NEED_RESCHED flag into the preempt_count variable, by Peter Zijlstra. - wait.h fixes and code reorganization from Peter Zijlstra - cfs_bandwidth fixes from Ben Segall - SMP load-balancer cleanups from Peter Zijstra - idle balancer improvements from Jason Low - other fixes and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits) ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED stop_machine: Fix race between stop_two_cpus() and stop_cpus() sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus sched: Fix asymmetric scheduling for POWER7 sched: Move completion code from core.c to completion.c sched: Move wait code from core.c to wait.c sched: Move wait.c into kernel/sched/ sched/wait: Fix __wait_event_interruptible_lock_irq_timeout() sched: Avoid throttle_cfs_rq() racing with period_timer stopping sched: Guarantee new group-entities always have weight sched: Fix hrtimer_cancel()/rq->lock deadlock sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining sched: Fix race on toggling cfs_bandwidth_used sched: Remove extra put_online_cpus() inside sched_setaffinity() sched/rt: Fix task_tick_rt() comment sched/wait: Fix build breakage sched/wait: Introduce prepare_to_wait_event() sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too sched: Remove get_online_cpus() usage sched: Fix race in migrate_swap_stop() ...