kernel/linux.git/tools/perf/Documentation, branch v6.3.6

perf intel-pt: Synthesize cycle events

2023-02-17T14:02:44+00:00

There is no good reason why we cannot synthesize "cycle" events from Intel PT just as we can synthesize "instruction" events, in particular when CYC packets are available. This enables using PT to get much more accurate cycle profiles than regular sampling (record -e cycles) when the work lasts for very short periods (<10 ms).

Thus, add support for this, based on the existing IPC calculation framework. The new option to --itrace is "y" (for cYcles), as "c" was taken for calls. Cycle and instruction events can be synthesized together, and are by default.

The only real caveat is that CYC packets are only emitted whenever some other packet is, which in practice is when a branch instruction is encountered (and not even all branches). Thus, even with no subsampling (e.g. --itrace=y0ns), it is impossible to get more accuracy than a single basic block, and all cycles spent executing that block will get attributed to the branch instruction that ends the packet. Consequently, one cannot know whether the cycles came from e.g. a specific load, a mispredicted branch, or something else. When subsampling (which is the default), the cycle events will get smeared out even more, but will still be generally useful to attribute cycle counts to functions.

Reviewed-by: Adrian Hunter
Signed-off-by: Steinar H. Gunderson
Cc: Alexander Shishkin
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: Peter Zijlstra
Link: https://lore.kernel.org/r/20220322082452.1429091-1-sesse@google.com
Signed-off-by: Arnaldo Carvalho de Melo
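As a rough illustration of the attribution caveat described in the commit above, here is a toy model. It is not perf's decoder: the function name, the trace data and the symbol names are all invented for illustration. It charges every cycle counted for a block to the branch that closes it, which is why a slow load inside the block cannot be told apart from the branch itself.

```python
from collections import Counter


def attribute_cycles(blocks):
    """Sum cycles per branch location.

    `blocks` is an iterable of (branch_location, cycles) pairs, one per
    basic block. This is a hypothetical input format; a real decoder
    derives such data from CYC packets and the branch that follows them.
    """
    totals = Counter()
    for branch_location, cycles in blocks:
        # All cycles of the block land on the closing branch, regardless
        # of which instruction inside the block actually stalled.
        totals[branch_location] += cycles
    return totals


# Invented trace: the second block is slow, but only its closing branch
# gets the blame.
trace = [("parse+0x40", 12), ("parse+0x9c", 250), ("parse+0x40", 10)]
totals = attribute_cycles(trace)
```

Summing such per-branch totals by function is what still makes the data useful for function-level profiles, even when subsampling smears individual events.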

perf c2c: Add report option to show false sharing in adjacent cachelines

2023-02-16T12:33:45+00:00

Many platforms have an adjacent cacheline prefetch feature. When it is enabled, data in RAM is effectively handled at a granularity of 2 cachelines (2N and 2N+1): if one is fetched into the cache, the other one is likely to be fetched too. This in effect doubles the cacheline size, so false sharing can happen between adjacent cachelines.

0Day has captured performance changes related to this [1], and some commercial software explicitly aligns its hot global variables to 128 bytes (2 cache lines) to avoid this kind of extended false sharing.

So add an option "--double-cl" for 'perf c2c report' to show false sharing at double cache line granularity, which acts just as if the cacheline size were doubled. There is no change to c2c record. The hardware events of a shared cacheline are still per cacheline, and this option just changes the granularity at which events are grouped and displayed.

In the 'perf c2c report' output below (will-it-scale's 'pagefault2' case on an old kernel):

  ----------------------------------------------------------------------
  26 31 2 0 0 0 0xffff888103ec6000
  ----------------------------------------------------------------------
  35.48% 50.00% 0.00% 0.00% 0.00% 0x10 0 1 0xffffffff8133148b 1153 66  971 3748 74 [k] get_mem_cgroup_from_mm
   6.45%  0.00% 0.00% 0.00% 0.00% 0x10 0 1 0xffffffff813396e4  570  0 1531  879 75 [k] mem_cgroup_charge
  25.81% 50.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff81331472  949 70  593 3359 74 [k] get_mem_cgroup_from_mm
  19.35%  0.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff81339686 1352  0 1073 1022 74 [k] mem_cgroup_charge
   9.68%  0.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff813396d6 1401  0  863  768 74 [k] mem_cgroup_charge
   3.23%  0.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff81333106  618  0  804   11  9 [k] uncharge_batch

The offsets 0x10 and 0x54 used to be displayed in 2 groups, and now they are listed together to give users a hint of extended false sharing.

[1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/

Committer notes:

Link: https://lore.kernel.org/r/Y+wvVNWqXb70l4uy@feng-clx

Removed -a, leaving just --double-cl, as this probably is not used so frequently and perhaps will even be auto-detected if we manage to record the MSR where this is configured.

Reviewed-by: Andi Kleen
Reviewed-by: Leo Yan
Signed-off-by: Feng Tang
Tested-by: Leo Yan
Acked-by: Joe Mario
Cc: Alexander Shishkin
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Kan Liang
Cc: Mark Rutland
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Tim Chen
Cc: Xing Zhengjun
Link: https://lore.kernel.org/r/20230214075823.246414-1-feng.tang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo
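The regrouping can be sketched with a little address arithmetic. This is only an illustration, not the code in perf: the 64-byte line size is an assumption (it matches the 128-byte alignment mentioned above), and the helper name is made up. The two addresses are the base address from the report plus the offsets 0x10 and 0x54.

```python
CACHELINE_SIZE = 64  # assumption: 64-byte cachelines


def cacheline_base(addr, double_cl=False):
    """Return the start address of the (possibly doubled) line holding addr."""
    size = CACHELINE_SIZE * 2 if double_cl else CACHELINE_SIZE
    # Clear the low bits that select a byte within the line.
    return addr & ~(size - 1)


base = 0xffff888103ec6000
addr_a = base + 0x10
addr_b = base + 0x54

# Per-cacheline grouping: the two accesses land in different lines.
separate = cacheline_base(addr_a) != cacheline_base(addr_b)

# Double-cacheline grouping: both fall into the same 128-byte unit.
together = cacheline_base(addr_a, True) == cacheline_base(addr_b, True)
```

With single-line grouping 0x54 belongs to the line starting at offset 0x40, so it is reported separately from 0x10; with doubled granularity both share the line starting at 0xffff888103ec6000.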

perf lock contention: Add -o/--lock-owner option

2023-02-08T13:33:32+00:00

When there are many lock contentions in the system, people sometimes want to know who caused the contention, in other words, who owns the locks.

The -o/--lock-owner option tries to follow the lock owners for the contended mutexes and rwsems from BPF, and then attributes the contention time to the owner instead of the waiter. It's a best-effort approach to get the owner info at the time of the contention, and it does not guarantee precise tracking of owners if they change over time.

Currently it only handles mutex and rwsem, which have an owner field in their struct that basically points to the task_struct owning the lock at the moment.

Technically its type is atomic_long_t, and some of its low-order bits are used for other purposes, so they need to be cleared when casting it to a pointer to task_struct.

Also, atomic_long_t is a typedef of the atomic 32 or 64 bit type depending on the arch, which is a wrapper struct for the counter value. I'm not aware of proper ways to access those kernel atomic types from BPF so I just read the internal counter value directly. Please let me know if there's a better way.

When the -o/--lock-owner option is used, it goes to the task aggregation mode like the -t/--threads option does. However, it cannot get the owner for other lock types like spinlock, and sometimes not even for mutex.

  $ sudo ./perf lock con -abo -- ./perf bench sched pipe
  # Running 'sched/pipe' benchmark:
  # Executed 1000000 pipe operations between two processes

       Total time: 4.766 [sec]

         4.766540 usecs/op
           209795 ops/sec

   contended   total wait     max wait     avg wait          pid   owner

         403    565.32 us     26.81 us      1.40 us           -1   Unknown
           4     27.99 us      8.57 us      7.00 us      1583145   sched-pipe
           1      8.25 us      8.25 us      8.25 us      1583144   sched-pipe
           1      2.03 us      2.03 us      2.03 us         5068   chrome

As you can see, the owner is unknown in most cases. But if we filter only for the mutex locks, it is more likely to get the owners.

  $ sudo ./perf lock con -abo -Y mutex -- ./perf bench sched pipe
  # Running 'sched/pipe' benchmark:
  # Executed 1000000 pipe operations between two processes

       Total time: 4.910 [sec]

         4.910435 usecs/op
           203647 ops/sec

   contended   total wait     max wait     avg wait          pid   owner

           2     15.50 us      8.29 us      7.75 us      1582852   sched-pipe
           7      7.20 us      2.47 us      1.03 us           -1   Unknown
           1      6.74 us      6.74 us      6.74 us      1582851   sched-pipe

Signed-off-by: Namhyung Kim
Cc: Adrian Hunter
Cc: Boqun Feng
Cc: Davidlohr Bueso
Cc: Hao Luo
Cc: Ian Rogers
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Peter Zijlstra
Cc: Song Liu
Cc: Waiman Long
Cc: Will Deacon
Cc: bpf@vger.kernel.org
Link: https://lore.kernel.org/r/20230207002403.63590-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo
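The step of clearing the low-order bits of the owner word can be sketched as follows. This is a plain-Python illustration rather than the BPF code from the commit, and the mask value is an assumption: the commit only says that some low bits carry other meanings, so the exact number of flag bits depends on the kernel's lock implementation. The helper name and the example pointer value are made up.

```python
# Assumption: the three lowest bits of the owner word are flag bits.
OWNER_FLAG_MASK = 0x7


def owner_task_pointer(owner_word):
    """Strip the flag bits so the rest can be treated as a task pointer."""
    return owner_word & ~OWNER_FLAG_MASK


# Invented example: an 8-byte-aligned pointer with one flag bit set.
task_pointer = 0xffff888012345680
owner_word = task_pointer | 0x1

recovered = owner_task_pointer(owner_word)
```

The trick only works because the pointed-to structure is aligned, so the low bits of a real pointer are always zero and can be borrowed for flags.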

perf script: Fix missing Retire Latency fields option documentation

2023-02-06T17:57:50+00:00

The 'perf script' documentation is missing the fields option for Retire Latency. Add it.

Signed-off-by: Kan Liang
Cc: Andi Kleen
Cc: Ian Rogers
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Stephane Eranian
Link: https://lore.kernel.org/r/20230206162100.3329395-2-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo

perf report: Support Retire Latency

2023-02-03T20:24:02+00:00

The Retire Latency field is added in the var3_w of the PERF_SAMPLE_WEIGHT_STRUCT. The Retire Latency reports the pipeline stall of this instruction compared to the previous instruction, in cycles. It is quite useful to display this information with perf mem report.

The p_stage_cyc for Power also comes from var3_w. Put p_stage_cyc and retire_lat in a union to share the code. Implement X86-specific code to display the X86-specific header.

Add a new sort key retire_lat for the Retire Latency.

Reviewed-by: Andi Kleen
Signed-off-by: Kan Liang
Cc: Ian Rogers
Cc: Peter Zijlstra
Cc: Stephane Eranian
Link: http://lore.kernel.org/lkml/20230104201349.1451191-8-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo

perf lock contention: Add -S/--callstack-filter option

2023-02-02T19:32:19+00:00

The -S/--callstack-filter option limits the displayed entries to those having the given string in the callstack (not only in the caller shown in the output).

The following example shows lock contention results if the callstack has the 'net' substring somewhere. Note that the caller '__dev_queue_xmit' does not match it, but the callstack contains 'inet6_csk_xmit'. This applies even if you don't use the -v option to show the full callstack.

  $ sudo ./perf lock con -abv -S net sleep 1
  ...
   contended   total wait     max wait     avg wait         type   caller

           5     70.20 us     16.13 us     14.04 us     spinlock   __dev_queue_xmit+0xb6d
                          0xffffffffa5dd1c60  _raw_spin_lock+0x30
                          0xffffffffa5b8f6ed  __dev_queue_xmit+0xb6d
                          0xffffffffa5cd8267  ip6_finish_output2+0x2c7
                          0xffffffffa5cdac14  ip6_finish_output+0x1d4
                          0xffffffffa5cdb477  ip6_xmit+0x457
                          0xffffffffa5d1fd17  inet6_csk_xmit+0xd7
                          0xffffffffa5c5f4aa  __tcp_transmit_skb+0x54a
                          0xffffffffa5c6467d  tcp_keepalive_timer+0x2fd

Signed-off-by: Namhyung Kim
Cc: Adrian Hunter
Cc: Ian Rogers
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Peter Zijlstra
Cc: Song Liu
Cc: bpf@vger.kernel.org
Link: https://lore.kernel.org/r/20230126000936.3017683-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo
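The matching rule described above amounts to a substring test over every frame of the callstack. The sketch below only illustrates that rule under that reading of the commit message; it is not how perf implements the option, and the function name is made up. The symbols are the ones from the example output.

```python
def callstack_matches(callstack, needle):
    """True if any frame in the callstack contains the given substring."""
    return any(needle in symbol for symbol in callstack)


callstack = [
    "_raw_spin_lock",
    "__dev_queue_xmit",
    "ip6_finish_output2",
    "ip6_finish_output",
    "ip6_xmit",
    "inet6_csk_xmit",
    "__tcp_transmit_skb",
    "tcp_keepalive_timer",
]

# The reported caller alone would not match the filter...
caller_matches = "net" in "__dev_queue_xmit"

# ...but the entry is kept because a deeper frame (inet6_csk_xmit) does.
entry_matches = callstack_matches(callstack, "net")
```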

perf script: Add 'cgroup' field for output

2023-02-02T19:32:19+00:00

There's no field for the cgroup, so let's add one. To use it, users need to specify the --all-cgroups option for perf record to capture the cgroup info.

  $ perf record --all-cgroups -- true

  $ perf script -F comm,pid,cgroup
    true 337112 /user.slice/user-657345.slice/user@657345.service/...
    true 337112 /user.slice/user-657345.slice/user@657345.service/...
    true 337112 /user.slice/user-657345.slice/user@657345.service/...
    true 337112 /user.slice/user-657345.slice/user@657345.service/...

If the data was recorded without --all-cgroups, perf script complains:

  $ perf script -F comm,pid,cgroup
  Samples for 'cycles:u' event do not have CGROUP attribute set. Cannot print 'cgroup' field.
  Hint: run 'perf record --all-cgroups ...'

Signed-off-by: Namhyung Kim
Cc: Adrian Hunter
Cc: Ian Rogers
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Peter Zijlstra
Cc: Stephane Eranian
Link: https://lore.kernel.org/r/20230126213610.3381147-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

perf tools docs: Use canonical ftrace path

2023-02-02T19:32:19+00:00

The canonical location for the tracefs filesystem is at /sys/kernel/tracing.

But, from Documentation/trace/ftrace.rst:

  Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at:

  /sys/kernel/debug/tracing

A few spots in the perf docs still refer to this older debugfs path, so let's update them to avoid confusion.

Signed-off-by: Ross Zwisler
Cc: Alexander Shishkin
Cc: Jiri Olsa
Cc: Mark Rutland
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Steven Rostedt (VMware)
Cc: linux-trace-kernel@vger.kernel.org
Link: http://lore.kernel.org/lkml/20230130181915.1113313-5-zwisler@google.com
Signed-off-by: Arnaldo Carvalho de Melo
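For scripts that have to run on both old and new setups, one way to honor the two locations is to prefer the canonical path and fall back to the debugfs one. This is a minimal sketch under that assumption; the function name and the injectable `exists` parameter are illustrative and not part of perf or the kernel.

```python
import os

CANONICAL_TRACEFS = "/sys/kernel/tracing"
LEGACY_TRACEFS = "/sys/kernel/debug/tracing"


def tracing_dir(exists=os.path.isdir):
    """Return the first tracing directory that exists, or None.

    `exists` is a hypothetical hook so the lookup can be exercised
    without touching the real filesystem.
    """
    for path in (CANONICAL_TRACEFS, LEGACY_TRACEFS):
        if exists(path):
            return path
    return None


# On a real system this inspects the filesystem; it may return None when
# neither location is mounted or visible.
current = tracing_dir()
```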

perf intel-pt: Do not try to queue auxtrace data on pipe

2023-02-02T00:30:05+00:00

When it processes AUXTRACE_INFO, it calls auxtrace_queue_data() to collect AUXTRACE data first. That won't work with a pipe since it needs lseek() to read the scattered aux data.

  $ perf record -o- -e intel_pt// true | perf report -i- --itrace=i100
  # To display the perf.data header info, please use --header/--header-only options.
  #
  0x4118 [0xa0]: failed to process type: 70
  Error:
  failed to process sample

In pipe mode, it can handle the aux data as it arrives. But there's no guarantee that the aux data arrives in time. So the following warning will be shown at the beginning:

  WARNING: Intel PT with pipe mode is not recommended.
           The output cannot relied upon. In particular,
           time stamps and the order of events may be incorrect.

Fixes: dbd134322e74f19d ("perf intel-pt: Add support for decoding AUX area samples")
Reviewed-by: Adrian Hunter
Reviewed-by: James Clark
Signed-off-by: Namhyung Kim
Cc: Adrian Hunter
Cc: Ian Rogers
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Leo Yan
Cc: Peter Zijlstra
Cc: Stephane Eranian
Link: https://lore.kernel.org/r/20230131023350.1903992-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo
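The underlying limitation is easy to reproduce outside perf: lseek() fails with ESPIPE on a pipe, while it works on a regular file. The snippet below is a standalone demonstration of that operating-system behavior, assuming a POSIX system; the helper name is made up and nothing here is taken from the perf sources.

```python
import errno
import os
import tempfile


def is_seekable(fd):
    """Return True if lseek() works on fd, False if it fails with ESPIPE."""
    try:
        os.lseek(fd, 0, os.SEEK_CUR)
    except OSError as err:
        if err.errno == errno.ESPIPE:
            return False
        raise
    return True


# A pipe cannot be repositioned, so scattered data cannot be revisited.
read_end, write_end = os.pipe()
try:
    pipe_seekable = is_seekable(read_end)
finally:
    os.close(read_end)
    os.close(write_end)

# A regular file can, which is what reading a perf.data file relies on.
with tempfile.TemporaryFile() as regular_file:
    file_seekable = is_seekable(regular_file.fileno())
```

A reader that is fed through a pipe therefore has to consume the data strictly in arrival order, which is the behavior the commit switches to for pipe mode.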

perf mem/c2c: Document that SPE is used for mem and c2c on ARM

2023-01-27T18:00:34+00:00

Setup is non-trivial so also link to the full SPE docs.

Signed-off-by: James Clark
Cc: Alexander Shishkin
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Leo Yan
Cc: Mark Rutland
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: linux-perf-users@vger.kernel.or
Link: https://lore.kernel.org/r/20230124145929.557891-1-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo