summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2015-04-02watchdog: Add watchdog enable/disable all functionsStephane Eranian1-0/+28
This patch adds two new functions to enable/disable the watchdog across all CPUs. This will be used by the HT PMU bug workaround code to disable/enable the NMI watchdog across quirk enablement. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Cc: maria.n.dimakopoulou@gmail.com Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1416251225-17721-12-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02Merge branch 'perf/urgent' into perf/core, before applying dependent patchesIngo Molnar4-32/+70
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add ITRACE_START record to indicate that tracing has startedAlexander Shishkin1-0/+41
For counters that generate AUX data that is bound to the context of a running task, such as instruction tracing, the decoder needs to know exactly which task is running when the event is first scheduled in, before the first sched_switch. The decoder's need to know this stems from the fact that instruction flow trace decoding will almost always require program's object code in order to reconstruct said flow and for that we need at least its pid/tid in the perf stream. To single out such instruction tracing pmus, this patch introduces ITRACE PMU capability. The reason this is not part of RECORD_AUX record is that not all pmus capable of generating AUX data need this, and the opposite is *probably* also true. While sched_switch covers for most cases, there are two problems with it: the consumer will need to process events out of order (that is, having found RECORD_AUX, it will have to skip forward to the nearest sched_switch to figure out which task it was, then go back to the actual trace to decode it) and it completely misses the case when the tracing is enabled and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-15-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add wakeup watermark control to the AUX areaAlexander Shishkin3-5/+24
When AUX area gets a certain amount of new data, we want to wake up userspace to collect it. This adds a new control to specify how much data will cause a wakeup. This is then passed down to pmu drivers via output handle's "wakeup" field, so that the driver can find the nearest point where it can generate an interrupt. We repurpose __reserved_2 in the event attribute for this, even though it was never checked to be zero before, aux_watermark will only matter for new AUX-aware code, so the old code should still be fine. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Support overwrite mode for the AUX areaAlexander Shishkin2-15/+34
This adds support for overwrite mode in the AUX area, which means "keep collecting data till you're stopped", turning AUX area into a circular buffer, where new data overwrites old data. It does not depend on data buffer's overwrite mode, so that it doesn't lose sideband data that is instrumental for processing AUX data. Overwrite mode is enabled at mapping AUX area read only. Even though aux_tail in the buffer's user page might be user writable, it will be ignored in this mode. A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf data stream every time an event writes new data to the AUX area. The pmu driver might not be able to infer the exact beginning of the new data in each snapshot, some drivers will only provide the tail, which is aux_offset + aux_size in the AUX record. Consumer has to be able to tell the new data from the old one, for example, by means of time stamps if such are provided in the trace. Consumer is also responsible for disabling any events that might write to the AUX area (thus potentially racing with the consumer) before collecting the data. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-9-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add API for PMUs to write to the AUX areaAlexander Shishkin3-3/+145
For pmus that wish to write data to ring buffer's AUX area, provide perf_aux_output_{begin,end}() calls to initiate/commit data writes, similarly to perf_output_{begin,end}. These also use the same output handle structure. Also, similarly to software counterparts, these will direct inherited events' output to parents' ring buffers. After the perf_aux_output_begin() returns successfully, handle->size is set to the maximum amount of data that can be written wrt aux_tail pointer, so that no data that the user hasn't seen will be overwritten, therefore this should always be called before hardware writing is enabled. On success, this will return the pointer to pmu driver's private structure allocated for this aux area by pmu::setup_aux. Same pointer can also be retrieved using perf_get_aux() while hardware writing is enabled. PMU driver should pass the actual amount of data written as a parameter to perf_aux_output_end(). All hardware writes should be completed and visible before this one is called. Additionally, perf_aux_output_skip() will adjust output handle and aux_head in case some part of the buffer has to be skipped over to maintain hardware's alignment constraints. Nested writers are forbidden and guards are in place to catch such attempts. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-8-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add AUX recordAlexander Shishkin2-0/+37
When there's new data in the AUX space, output a record indicating its offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, to mean the described data was truncated to fit in the ring buffer. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-7-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add a pmu capability for "exclusive" eventsAlexander Shishkin1-2/+117
Usually, pmus that do, for example, instruction tracing, would only ever be able to have one event per task per cpu (or per perf_event_context). For such pmus it makes sense to disallow creating conflicting events early on, so as to provide consistent behavior for the user. This patch adds a pmu capability that indicates such constraint on event creation. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1422613866-113186-1-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add a capability for AUX_NO_SG pmus to do software double bufferingAlexander Shishkin1-1/+14
For pmus that don't support scatter-gather for AUX data in hardware, it might still make sense to implement software double buffering to avoid losing data while the user is reading data out. For this purpose, add a pmu capability that guarantees multiple high-order chunks for AUX buffer, so that the pmu driver can do switchover tricks. To make use of this feature, add PERF_PMU_CAP_AUX_SW_DOUBLEBUF to your pmu's capability mask. This will make the ring buffer AUX allocation code ensure that the biggest high order allocation for the aux buffer pages is no bigger than half of the total requested buffer size, thus making sure that the buffer has at least two high order allocations. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-5-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Support high-order allocations for AUX spaceAlexander Shishkin1-6/+50
Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability) don't support scatter-gather and will prefer larger contiguous areas for their output regions. This patch adds a new pmu capability to request higher order allocations. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-4-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add AUX area to ring buffer for raw data streamsPeter Zijlstra3-29/+232
This patch introduces "AUX space" in the perf mmap buffer, intended for exporting high bandwidth data streams to userspace, such as instruction flow traces. AUX space is a ring buffer, defined by aux_{offset,size} fields in the user_page structure, and read/write pointers aux_{head,tail}, which abide by the same rules as data_* counterparts of the main perf buffer. In order to allocate/mmap AUX, userspace needs to set up aux_offset to such an offset that will be greater than data_offset+data_size and aux_size to be the desired buffer size. Both need to be page aligned. Then, same aux_offset and aux_size should be passed to mmap() call and if everything adds up, you should have an AUX buffer as a result. Pages that are mapped into this buffer also come out of user's mlock rlimit plus perf_event_mlock_kb allowance. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add data_{offset,size} to user_pageAlexander Shishkin1-0/+2
Currently, the actual perf ring buffer is one page into the mmap area, following the user page and the userspace follows this convention. This patch adds data_{offset,size} fields to user_page that can be used by userspace instead for locating perf data in the mmap area. This is also helpful when mapping existing or shared buffers if their size is not known in advance. Right now, it is made to follow the existing convention that data_offset == PAGE_SIZE and data_offset + data_size == mmap_size. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-2-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02bpf: Fix the build on BPF_SYSCALL=y && !CONFIG_TRACING kernels, make it more ↵Ingo Molnar2-1/+9
configurable So bpf_tracing.o depends on CONFIG_BPF_SYSCALL - but that's not its only dependency, it also depends on the tracing infrastructure and on kprobes, without which it will fail to build with: In file included from kernel/trace/bpf_trace.c:14:0: kernel/trace/trace.h: In function ‘trace_test_and_set_recursion’: kernel/trace/trace.h:491:28: error: ‘struct task_struct’ has no member named ‘trace_recursion’ unsigned int val = current->trace_recursion; [...] It took quite some time to trigger this build failure, because right now BPF_SYSCALL is very obscure, depends on CONFIG_EXPERT. So also make BPF_SYSCALL more configurable, not just under CONFIG_EXPERT. If BPF_SYSCALL, tracing and kprobes are enabled then enable the bpf_tracing gateway as well. We might want to make this an interactive option later on, although I'd not complicate it unnecessarily: enabling BPF_SYSCALL is enough of an indicator that the user wants BPF support. Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02tracing: Allow BPF programs to call bpf_trace_printk()Alexei Starovoitov1-0/+78
Debugging of BPF programs needs some form of printk from the program, so let programs call limited trace_printk() with %d %u %x %p modifiers only. Similar to kernel modules, during program load verifier checks whether program is calling bpf_trace_printk() and if so, kernel allocates trace_printk buffers and emits big 'this is debug only' banner. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427312966-8434-6-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02tracing: Allow BPF programs to call bpf_ktime_get_ns()Alexei Starovoitov1-0/+14
bpf_ktime_get_ns() is used by programs to compute time delta between events or as a timestamp Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427312966-8434-5-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02tracing, perf: Implement BPF programs attached to kprobesAlexei Starovoitov5-1/+204
BPF programs, attached to kprobes, provide a safe way to execute user-defined BPF byte-code programs without being able to crash or hang the kernel in any way. The BPF engine makes sure that such programs have a finite execution time and that they cannot break out of their sandbox. The user interface is to attach to a kprobe via the perf syscall: struct perf_event_attr attr = { .type = PERF_TYPE_TRACEPOINT, .config = event_id, ... }; event_fd = perf_event_open(&attr,...); ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); 'prog_fd' is a file descriptor associated with BPF program previously loaded. 'event_id' is an ID of the kprobe created. Closing 'event_fd': close(event_fd); ... automatically detaches BPF program from it. BPF programs can call in-kernel helper functions to: - lookup/update/delete elements in maps - probe_read - wraper of probe_kernel_read() used to access any kernel data structures BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is architecture dependent) and return 0 to ignore the event and 1 to store kprobe event into the ring buffer. Note, kprobes are a fundamentally _not_ a stable kernel ABI, so BPF programs attached to kprobes must be recompiled for every kernel version and user must supply correct LINUX_VERSION_CODE in attr.kern_version during bpf_prog_load() call. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02tracing: Add kprobe flagAlexei Starovoitov1-1/+1
add TRACE_EVENT_FL_KPROBE flag to differentiate kprobe type of tracepoints, since bpf programs can only be attached to kprobe type of PERF_TYPE_TRACEPOINT perf events. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427312966-8434-3-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-28Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds1-2/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Two clocksource driver fixes, and an idle loop RCU warning fix" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/drivers/sun5i: Fix cpufreq interaction with sched_clock() clocksource/drivers: Fix various !CONFIG_HAS_IOMEM build errors timers/tick/broadcast-hrtimer: Fix suspicious RCU usage in idle loop
2015-03-28Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "A single sched/rt corner case fix for RLIMIT_RTIME correctness" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix RLIMIT_RTTIME when PI-boosting to RT
2015-03-28Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds1-0/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Ingo Molnar: "A perf kernel side fix for a fuzzer triggered lockup" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix irq_work 'tail' recursion
2015-03-28Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds2-30/+59
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "A module unload lockdep race fix" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lockdep: Fix the module unload key range freeing logic
2015-03-27perf: Add per event clockid supportPeter Zijlstra1-3/+74
While thinking on the whole clock discussion it occurred to me we have two distinct uses of time: 1) the tracking of event/ctx/cgroup enabled/running/stopped times which includes the self-monitoring support in struct perf_event_mmap_page. 2) the actual timestamps visible in the data records. And we've been conflating them. The first is all about tracking time deltas, nobody should really care in what time base that happens, its all relative information, as long as its internally consistent it works. The second however is what people are worried about when having to merge their data with external sources. And here we have the discussion on MONOTONIC vs MONOTONIC_RAW etc.. Where MONOTONIC is good for correlating between machines (static offset), MONOTNIC_RAW is required for correlating against a fixed rate hardware clock. This means configurability; now 1) makes that hard because it needs to be internally consistent across groups of unrelated events; which is why we had to have a global perf_clock(). However, for 2) it doesn't really matter, perf itself doesn't care what it writes into the buffer. The below patch makes the distinction between these two cases by adding perf_event_clock() which is used for the second case. It further makes this configurable on a per-event basis, but adds a few sanity checks such that we cannot combine events with different clocks in confusing ways. And since we then have per-event configurability we might as well retain the 'legacy' behaviour as a default. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27Merge branch 'perf/core' into perf/timer, before applying new changesIngo Molnar3-141/+160
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27Merge branch 'timers/core' into perf/timer, to apply dependent patchIngo Molnar6-287/+589
An upcoming patch will depend on tai_ns() and NMI-safe ktime_get_raw_fast(), so merge timers/core here in a separate topic branch until it's all cooked and timers/core is merged upstream. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27perf: Fix racy group accessPeter Zijlstra1-0/+11
While looking at some fuzzer output I noticed that we do not hold any locks on leader->ctx and therefore the sibling_list iteration is unsafe. Acquire the relevant ctx->mutex before calling into the pmu specific code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Link: http://lkml.kernel.org/r/20150225151639.GL5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27Merge branch 'perf/x86' into perf/core, because it's readyIngo Molnar3-57/+34
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27Merge branch 'perf/urgent' into perf/core, to pick up fixes and to refresh ↵Ingo Molnar13-57/+182
the tree Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27time: Introduce tk_fast_rawPeter Zijlstra1-0/+13
Add the NMI safe CLOCK_MONOTONIC_RAW accessor.. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150319093400.562746929@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27time: Parametrize all tk_fast_mono usersPeter Zijlstra1-10/+15
In preparation for more tk_fast instances, remove all hard-coded tk_fast_mono references. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150319093400.484279927@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27time: Add timerkeeper::tkr_rawPeter Zijlstra1-22/+19
Introduce tkr_raw and make use of it. base_raw -> tkr_raw.base clock->{mult,shift} -> tkr_raw.{mult.shift} Kill timekeeping_get_ns_raw() in favour of timekeeping_get_ns(&tkr_raw), this removes all mono_raw special casing. Duplicate the updates to tkr_mono.cycle_last into tkr_raw.cycle_last, both need the same value. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150319093400.422589590@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27time: Rename timekeeper::tkr to timekeeper::tkr_monoPeter Zijlstra1-75/+75
In preparation of adding another tkr field, rename this one to tkr_mono. Also rename tk_read_base::base_mono to tk_read_base::base, since the structure is not specific to CLOCK_MONOTONIC and the mono name got added to the tk_read_base instance. Lots of trivial churn. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150319093400.344679419@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27timers, sched/clock: Clean up the code a bitIngo Molnar1-51/+56
Trivial cleanups, to improve the readability of the generic sched_clock() code: - Improve and standardize comments - Standardize the coding style - Use vertical spacing where appropriate - etc. No code changed: md5: 19a053b31e0c54feaeff1492012b019a sched_clock.o.before.asm 19a053b31e0c54feaeff1492012b019a sched_clock.o.after.asm Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27timers, sched/clock: Avoid deadlock during read from NMIDaniel Thompson1-35/+68
Currently it is possible for an NMI (or FIQ on ARM) to come in and read sched_clock() whilst update_sched_clock() has locked the seqcount for writing. This results in the NMI handler locking up when it calls raw_read_seqcount_begin(). This patch fixes the NMI safety issues by providing banked clock data. This is a similar approach to the one used in Thomas Gleixner's 4396e058c52e("timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC"). Suggested-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1427397806-20889-6-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27timers, sched/clock: Remove redundant notrace from update functionDaniel Thompson1-1/+1
Currently update_sched_clock() is marked as notrace but this function is not called by ftrace. This is trivially fixed by removing the mark up. Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1427397806-20889-5-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27timers, sched/clock: Remove suspend from clock_read_data()Daniel Thompson1-15/+25
Currently cd.read_data.suspended is read by the hotpath function sched_clock(). This variable need not be accessed on the hotpath. In fact, once it is removed, we can remove the conditional branches from sched_clock() and install a dummy read_sched_clock function to suspend the clock. The new master copy of the function pointer (actual_read_sched_clock) is introduced and is used for all reads of the clock hardware except those within sched_clock itself. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1427397806-20889-4-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27timers, sched/clock: Optimize cache line usageDaniel Thompson1-35/+77
Currently sched_clock(), a very hot code path, is not optimized to minimise its cache profile. In particular: 1. cd is not ____cacheline_aligned, 2. struct clock_data does not distinguish between hotpath and coldpath data, reducing locality of reference in the hotpath, 3. Some hotpath data is missing from struct clock_data and is marked __read_mostly (which more or less guarantees it will not share a cache line with cd). This patch corrects these problems by extracting all hotpath data into a separate structure and using ____cacheline_aligned to ensure the hotpath uses a single (64 byte) cache line. Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1427397806-20889-3-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27timers, sched/clock: Match scope of read and write seqcountsDaniel Thompson1-15/+11
Currently the scope of the raw_write_seqcount_begin/end() in sched_clock_register() far exceeds the scope of the read section in sched_clock(). This gives the impression of safety during cursory review but achieves little. Note that this is likely to be a latent issue at present because sched_clock_register() is typically called before we enable interrupts, however the issue does risk bugs being needlessly introduced as the code evolves. This patch fixes the problem by increasing the scope of the read locking performed by sched_clock() to cover all data modified by sched_clock_register. We also improve clarity by moving writes to struct clock_data that do not impact sched_clock() outside of the critical section. Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> [ Reworked it slightly to apply to tip/timers/core] Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1427397806-20889-2-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26mm: numa: slow PTE scan rate if migration failures occurMel Gorman1-2/+6
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226 Across the board the 4.0-rc1 numbers are much slower, and the degradation is far worse when using the large memory footprint configs. Perf points straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config: - 56.07% 56.07% [kernel] [k] default_send_IPI_mask_sequence_phys - default_send_IPI_mask_sequence_phys - 99.99% physflat_send_IPI_mask - 99.37% native_send_call_func_ipi smp_call_function_many - native_flush_tlb_others - 99.85% flush_tlb_page ptep_clear_flush try_to_unmap_one rmap_walk try_to_unmap migrate_pages migrate_misplaced_page - handle_mm_fault - 99.73% __do_page_fault trace_do_page_fault do_async_page_fault + async_page_fault 0.63% native_send_call_func_single_ipi generic_exec_single smp_call_function_single This is showing excessive migration activity even though excessive migrations are meant to get throttled. Normally, the scan rate is tuned on a per-task basis depending on the locality of faults. However, if migrations fail for any reason then the PTE scanner may scan faster if the faults continue to be remote. This means there is higher system CPU overhead and fault trapping at exactly the time we know that migrations cannot happen. This patch tracks when migration failures occur and slows the PTE scanner. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Dave Chinner <david@fromorbit.com> Tested-by: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-23perf: Remove type specific target pointersPeter Zijlstra3-19/+13
The only reason CQM had to use a hard-coded pmu type was so it could use cqm_target in hw_perf_event. Do away with the {tp,bp,cqm}_target pointers and provide a non type specific one. This allows us to do away with that silly pmu type as well. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Vince Weaver <vince@deater.net> Cc: acme@kernel.org Cc: acme@redhat.com Cc: hpa@zytor.com Cc: jolsa@redhat.com Cc: kanaka.d.juvva@intel.com Cc: matt.fleming@intel.com Cc: tglx@linutronix.de Cc: torvalds@linux-foundation.org Cc: vikas.shivappa@linux.intel.com Link: http://lkml.kernel.org/r/20150305211019.GU21418@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23timers/tick/broadcast-hrtimer: Fix suspicious RCU usage in idle loopPreeti U Murthy1-2/+9
The hrtimer mode of broadcast queues hrtimers in the idle entry path so as to wakeup cpus in deep idle states. The associated call graph is : cpuidle_idle_call() |____ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, ....)) |_____tick_broadcast_set_event() |____clockevents_program_event() |____bc_set_next() The hrtimer_{start/cancel} functions call into tracing which uses RCU. But it is not legal to call into RCU in cpuidle because it is one of the quiescent states. Hence protect this region with RCU_NONIDLE which informs RCU that the cpu is momentarily non-idle. As an aside it is helpful to point out that the clock event device that is programmed here is not a per-cpu clock device; it is a pseudo clock device, used by the broadcast framework alone. The per-cpu clock device programming never goes through bc_set_next(). Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: linuxppc-dev@ozlabs.org Cc: mpe@ellerman.id.au Cc: tglx@linutronix.de Link: http://lkml.kernel.org/r/20150318104705.17763.56668.stgit@preeti.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23lockdep: Fix the module unload key range freeing logicPeter Zijlstra2-30/+59
Module unload calls lockdep_free_key_range(), which removes entries from the data structures. Most of the lockdep code OTOH assumes the data structures are append only; in specific see the comments in add_lock_to_list() and look_up_lock_class(). Clearly this has only worked by accident; make it work proper. The actual scenario to make it go boom would involve the memory freed by the module unlock being re-allocated and re-used for a lock inside of a rcu-sched grace period. This is a very unlikely scenario, still better plug the hole. Use RCU list iteration in all places and ammend the comments. Change lockdep_free_key_range() to issue a sync_sched() between removal from the lists and returning -- which results in the memory being freed. Further ensure the callers are placed correctly and comment the requirements. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Tsyvarev <tsyvarev@ispras.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23sched: Fix RLIMIT_RTTIME when PI-boosting to RTBrian Silverman1-0/+2
When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Signed-off-by: Brian Silverman <brian@peloton-tech.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: austin@peloton-tech.com Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23perf: Fix irq_work 'tail' recursionPeter Zijlstra1-0/+10
Vince reported a watchdog lockup like: [<ffffffff8115e114>] perf_tp_event+0xc4/0x210 [<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160 [<ffffffff810b7f10>] lock_release+0x130/0x260 [<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40 [<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80 [<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0 [<ffffffff811f71ce>] send_sigio+0xae/0x100 [<ffffffff811f72b7>] kill_fasync+0x97/0xf0 [<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0 [<ffffffff8115d103>] perf_pending_event+0x33/0x60 [<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80 [<ffffffff8114e448>] irq_work_run+0x18/0x40 [<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0 [<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80 Which is caused by an irq_work generating new irq_work and therefore not allowing forward progress. This happens because processing the perf irq_work triggers another perf event (tracepoint stuff) which in turn generates an irq_work ad infinitum. Avoid this by raising the recursion counter in the irq_work -- which effectively disables all software events (including tracepoints) from actually triggering again. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-18Merge branch 'for-linus' of ↵Linus Torvalds1-4/+26
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching Pull livepatching fix from Jiri Kosina: - fix for potential race with module loading, from Petr Mladek. The race is very unlikely to be seen in real world and has been found by code inspection, but should be fixed for 4.0 anyway. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: Fix subtle race with coming and going modules
2015-03-17Merge branches 'perf-urgent-for-linus' and 'timers-urgent-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf and timer fixes from Ingo Molnar: "Two small perf fixes: - kernel side context leak fix - tooling crash fix And two clocksource driver fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix context leak in put_event() perf annotate: Fix fallback to unparsed disassembler line * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clockevents: sun5i: Fix setup_irq init sequence clocksource: efm32: Fix a NULL pointer dereference
2015-03-17livepatch: Fix subtle race with coming and going modulesPetr Mladek1-4/+26
There is a notifier that handles live patches for coming and going modules. It takes klp_mutex lock to avoid races with coming and going patches but it does not keep the lock all the time. Therefore the following races are possible: 1. The notifier is called sometime in STATE_MODULE_COMING. The module is visible by find_module() in this state all the time. It means that new patch can be registered and enabled even before the notifier is called. It might create wrong order of stacked patches, see below for an example. 2. New patch could still see the module in the GOING state even after the notifier has been called. It will try to initialize the related object structures but the module could disappear at any time. There will stay mess in the structures. It might even cause an invalid memory access. This patch solves the problem by adding a boolean variable into struct module. The value is true after the coming and before the going handler is called. New patches need to be applied when the value is true and they need to ignore the module when the value is false. Note that we need to know state of all modules on the system. The races are related to new patches. Therefore we do not know what modules will get patched. Also note that we could not simply ignore going modules. The code from the module could be called even in the GOING state until mod->exit() finishes. If we start supporting patches with semantic changes between function calls, we need to apply new patches to any still usable code. See below for an example. Finally note that the patch solves only the situation when a new patch is registered. There are no such problems when the patch is being removed. It does not matter who disable the patch first, whether the normal disable_patch() or the module notifier. There is nothing to do once the patch is disabled. Alternative solutions: ====================== + reject new patches when a patched module is coming or going; this is ugly + wait with adding new patch until the module leaves the COMING and GOING states; this might be dangerous and complicated; we would need to release kgr_lock in the middle of the patch registration to avoid a deadlock with the coming and going handlers; also we might need a waitqueue for each module which seems to be even bigger overhead than the boolean + stop modules from entering COMING and GOING states; wait until modules leave these states when they are already there; looks complicated; we would need to ignore the module that asked to stop the others to avoid a deadlock; also it is unclear what to do when two modules asked to stop others and both are in COMING state (situation when two new patches are applied) + always register/enable new patches and fix up the potential mess (registered patches order) in klp_module_init(); this is nasty and prone to regressions in the future development + add another MODULE_STATE where the kallsyms are visible but the module is not used yet; this looks too complex; the module states are checked on "many" locations Example of patch stacking breakage: =================================== The notifier could _not_ _simply_ ignore already initialized module objects. For example, let's have three patches (P1, P2, P3) for functions a() and b() where a() is from vmcore and b() is from a module M. Something like: a() b() P1 a1() b1() P2 a2() b2() P3 a3() b3(3) If you load the module M after all patches are registered and enabled. The ftrace ops for function a() and b() has listed the functions in this order: ops_a->func_stack -> list(a3,a2,a1) ops_b->func_stack -> list(b3,b2,b1) , so the pointer to b3() is the first and will be used. Then you might have the following scenario. Let's start with state when patches P1 and P2 are registered and enabled but the module M is not loaded. Then ftrace ops for b() does not exist. Then we get into the following race: CPU0 CPU1 load_module(M) complete_formation() mod->state = MODULE_STATE_COMING; mutex_unlock(&module_mutex); klp_register_patch(P3); klp_enable_patch(P3); # STATE 1 klp_module_notify(M) klp_module_notify_coming(P1); klp_module_notify_coming(P2); klp_module_notify_coming(P3); # STATE 2 The ftrace ops for a() and b() then looks: STATE1: ops_a->func_stack -> list(a3,a2,a1); ops_b->func_stack -> list(b3); STATE2: ops_a->func_stack -> list(a3,a2,a1); ops_b->func_stack -> list(b2,b1,b3); therefore, b2() is used for the module but a3() is used for vmcore because they were the last added. Example of the race with going modules: ======================================= CPU0 CPU1 delete_module() #SYSCALL try_stop_module() mod->state = MODULE_STATE_GOING; mutex_unlock(&module_mutex); klp_register_patch() klp_enable_patch() #save place to switch universe b() # from module that is going a() # from core (patched) mod->exit(); Note that the function b() can be called until we call mod->exit(). If we do not apply patch against b() because it is in MODULE_STATE_GOING, it will call patched a() with modified semantic and things might get wrong. [jpoimboe@redhat.com: use one boolean instead of two] Signed-off-by: Petr Mladek <pmladek@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-03-13perf: Fix context leak in put_event()Leon Yu1-1/+1
Commit: a83fe28e2e45 ("perf: Fix put_event() ctx lock") changed the locking logic in put_event() by replacing mutex_lock_nested() with perf_event_ctx_lock_nested(), but didn't fix the subsequent mutex_unlock() with a correct counterpart, perf_event_ctx_unlock(). Contexts are thus leaked as a result of incremented refcount in perf_event_ctx_lock_nested(). Signed-off-by: Leon Yu <chianglungyu@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Fixes: a83fe28e2e45 ("perf: Fix put_event() ctx lock") Link: http://lkml.kernel.org/r/1424954613-5034-1-git-send-email-chianglungyu@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-13clocksource: Rename __clocksource_updatefreq_*() to ↵John Stultz1-5/+6
__clocksource_update_freq_*() Ingo requested this function be renamed to improve readability, so I've renamed __clocksource_updatefreq_scale() as well as the __clocksource_updatefreq_hz/khz() functions to avoid squishedtogethernames. This touches some of the sh clocksources, which I've not tested. The arch/arm/plat-omap change is just a comment change for consistency. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1426133800-29329-13-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-13clocksource: Add some debug info about clocksources being registeredJohn Stultz1-0/+3
Print the mask, max_cycles, and max_idle_ns values for clocksources being registered. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1426133800-29329-12-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-13clocksource: Mostly kill clocksource_register()John Stultz2-49/+36
A long running project has been to clean up remaining uses of clocksource_register(), replacing it with the simpler clocksource_register_khz/hz() functions. However, there are a few cases where we need to self-define our mult/shift values, so switch the function to a more obviously internal __clocksource_register() name, and consolidate much of the internal logic so we don't have duplication. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: David S. Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1426133800-29329-10-git-send-email-john.stultz@linaro.org [ Minor cleanups. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>