summaryrefslogtreecommitdiff
path: root/tools/perf/util
AgeCommit message (Collapse)AuthorFilesLines
2025-08-07perf bpf-filter: Enable events manuallyIlya Leoshkevich1-1/+4
On s390, and, in general, on all platforms where the respective event supports auxiliary data gathering, the command: # ./perf record -u 0 -aB --synth=no -- ./perf test -w thloop [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.011 MB perf.data ] # ./perf report --stats | grep SAMPLE # does not generate samples in the perf.data file. On x86 the command: # sudo perf record -e intel_pt// -u 0 ls is broken too. Looking at the sequence of calls in 'perf record' reveals this behavior: 1. The event 'cycles' is created and enabled: record__open() +-> evlist__apply_filters() +-> perf_bpf_filter__prepare() +-> bpf_program.attach_perf_event() +-> bpf_program.attach_perf_event_opts() +-> __GI___ioctl(..., PERF_EVENT_IOC_ENABLE, ...) The event 'cycles' is enabled and active now. However the event's ring-buffer to store the samples generated by hardware is not allocated yet. 2. The event's fd is mmap()ed to create the ring buffer: record__open() +-> record__mmap() +-> record__mmap_evlist() +-> evlist__mmap_ex() +-> perf_evlist__mmap_ops() +-> mmap_per_cpu() +-> mmap_per_evsel() +-> mmap__mmap() +-> perf_mmap__mmap() +-> mmap() This allocates the ring buffer for the event 'cycles'. With mmap() the kernel creates the ring buffer: perf_mmap(): kernel function to create the event's ring | buffer to save the sampled data. | +-> ring_buffer_attach(): Allocates memory for ring buffer. | The PMU has auxiliary data setup function. The | has_aux(event) condition is true and the PMU's | stop() is called to stop sampling. It is not | restarted: | | if (has_aux(event)) | perf_event_stop(event, 0); | +-> cpumsf_pmu_stop(): Hardware sampling is stopped. No samples are generated and saved anymore. 3. After the event 'cycles' has been mapped, the event is enabled a second time in: __cmd_record() +-> evlist__enable() +-> __evlist__enable() +-> evsel__enable_cpu() +-> perf_evsel__enable_cpu() +-> perf_evsel__run_ioctl() +-> perf_evsel__ioctl() +-> __GI___ioctl(., PERF_EVENT_IOC_ENABLE, .) The second ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); is just a NOP in this case. The first invocation in (1.) sets the event::state to PERF_EVENT_STATE_ACTIVE. The kernel functions perf_ioctl() +-> _perf_ioctl() +-> _perf_event_enable() +-> __perf_event_enable() return immediately because event::state is already set to PERF_EVENT_STATE_ACTIVE. This happens on s390, because the event 'cycles' offers the possibility to save auxilary data. The PMU callbacks setup_aux() and free_aux() are defined. Without both callback functions, cpumsf_pmu_stop() is not invoked and sampling continues. To remedy this, remove the first invocation of ioctl(..., PERF_EVENT_IOC_ENABLE, ...). in step (1.) Create the event in step (1.) and enable it in step (3.) after the ring buffer has been mapped. Output after: # ./perf record -aB --synth=no -u 0 -- ./perf test -w thloop 2 [ perf record: Woken up 3 times to write data ] [ perf record: Captured and wrote 0.876 MB perf.data ] # ./perf report --stats | grep SAMPLE SAMPLE events: 16200 (99.5%) SAMPLE events: 16200 # The software event succeeded both before and after the patch: # ./perf record -e cpu-clock -aB --synth=no -u 0 -- \ ./perf test -w thloop 2 [ perf record: Woken up 7 times to write data ] [ perf record: Captured and wrote 2.870 MB perf.data ] # ./perf report --stats | grep SAMPLE SAMPLE events: 53506 (99.8%) SAMPLE events: 53506 # Fixes: b4c658d4d63d61 ("perf target: Remove uid from target") Suggested-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Co-developed-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20250806162417.19666-3-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-31perf record: Cache build-ID of hit DSOs onlyNamhyung Kim1-1/+1
It post-processes samples to find which DSO has samples. Based on that info, it can save used DSOs in the build-ID cache directory. But for some reason, it saves all DSOs without checking the hit mark. Skipping unused DSOs can give some speedup especially with --buildid-mmap being default. On my idle machine, `time perf record -a sleep 1` goes down from 3 sec to 1.5 sec with this change. Fixes: e29386c8f7d71fa5 ("perf record: Add --buildid-mmap option to enable PERF_RECORD_MMAP2's build id") Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250731070330.57116-1-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-30perf python: Stop using deprecated PyUnicode_AsString()Arnaldo Carvalho de Melo1-1/+11
As noticed while building for Fedora 43: GEN /tmp/build/perf/python/perf.cpython-314-x86_64-linux-gnu.so /git/perf-6.16.0-rc3/tools/perf/util/python.c: In function ‘get_tracepoint_field’: /git/perf-6.16.0-rc3/tools/perf/util/python.c:340:9: error: ‘_PyUnicode_AsString’ is deprecated [-Werror=deprecated-declarations] 340 | const char *str = _PyUnicode_AsString(PyObject_Str(attr_name)); | ^~~~~ In file included from /usr/include/python3.14/unicodeobject.h:1022, from /usr/include/python3.14/Python.h:89, from /git/perf-6.16.0-rc3/tools/perf/util/python.c:2: /usr/include/python3.14/cpython/unicodeobject.h:648:1: note: declared here 648 | _PyUnicode_AsString(PyObject *unicode) | ^~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors error: command '/usr/bin/gcc' failed with exit code 1 Use PyUnicode_AsUTF8() instead and also check if PyObject_Str() fails before doing so. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/aIofXNK8QLtLIaI3@x1 Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-27perf list: Skip ABI PMUs when printing pmu valuesIan Rogers4-1/+10
Avoid printing tracepoint, legacy and software events when listing for the pmu option. Add the PMU type to the print_event callbacks to ease detection. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250725185202.68671-8-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-27perf list: Remove tracepoint printing codeIan Rogers2-95/+0
Now that the tp_pmu can iterate and describe events remove the custom tracepoint printing logic, this avoids perf list showing the tracepoint events twice. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250725185202.68671-7-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-27perf tp_pmu: Add event APIsIan Rogers3-0/+129
Add event APIs for the tracepoint PMU allowing things like perf list to function using it. For perf list add the tracepoint format in the long description (shown with -v). $ sudo perf list -v tracepoint List of pre-defined events (to be used in -e or -M): alarmtimer:alarmtimer_cancel [Tracepoint event] [name: alarmtimer_cancel ID: 416 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:void * alarm; offset:8; size:8; signed:0; field:unsigned char alarm_type; offset:16; size:1; signed:0; field:s64 expires; offset:24; size:8; signed:1; field:s64 now; offset:32; size:8; signed:1; print fmt: "alarmtimer:%p type:%s expires:%llu now:%llu",REC->alarm,__print_flags((1 << REC->alarm_type)," | ",{ 1 << 0, "REALTIME" },{ 1 << 1,"BOOTTIME" },{ 1 << 3,"REALTIME Freezer" },{ 1 << 4,"BOOTTIME Freezer" }),REC->expires,REC->now . Unit: tracepoint] alarmtimer:alarmtimer_fired [Tracepoint event] [name: alarmtimer_fired ID: 418 ... Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250725185202.68671-6-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-27perf tp_pmu: Factor existing tracepoint logic to new fileIan Rogers5-106/+170
Start the creation of a tracepoint PMU abstraction. Tracepoint events don't follow the regular sysfs perf conventions. Eventually the new PMU abstraction will bridge the gap so tracepoint events look more like regular perf ones. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250725185202.68671-5-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-27perf parse-events: Remove non-json software eventsIan Rogers5-97/+24
Remove the hard coded encodings from parse-events. This has the consequence that software events are matched using the sysfs/json priority, will be case insensitive and will be wildcarded across PMUs. As there were software and hardware types in the parsing code, the removal means software vs hardware logic can be removed and hardware assumed. Now the perf json provides detailed descriptions of software events, remove the previous listing support that didn't contain event descriptions. When globbing is required for the "sw" option in perf list, use string PMU globbing as was done previously for the tool PMU. The output of `perf list sw` command changed like this. Before: List of pre-defined events (to be used in -e or -M): alignment-faults [Software event] bpf-output [Software event] cgroup-switches [Software event] context-switches OR cs [Software event] cpu-clock [Software event] cpu-migrations OR migrations [Software event] dummy [Software event] emulation-faults [Software event] major-faults [Software event] minor-faults [Software event] page-faults OR faults [Software event] task-clock [Software event] After: List of pre-defined events (to be used in -e or -M): software: alignment-faults [Number of kernel handled memory alignment faults. Unit: software] bpf-output [An event used by BPF programs to write to the perf ring buffer. Unit: software] cgroup-switches [Number of context switches to a task in a different cgroup. Unit: software] context-switches [Number of context switches [This event is an alias of cs]. Unit: software] cpu-clock [Per-CPU high-resolution timer based event. Unit: software] cpu-migrations [Number of times a process has migrated to a new CPU [This event is an alias of migrations]. Unit: software] cs [Number of context switches [This event is an alias of context-switches]. Unit: software] dummy [A placeholder event that doesn't count anything. Unit: software] ... Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250725185202.68671-4-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf sort: Use perf_env to set arch sort keys and headerIan Rogers3-23/+46
Previously arch_support_sort_key and arch_perf_header_entry used a weak symbol to compile as appropriate for x86 and powerpc. A limitation to this is that the handling of a data file could vary in cross-platform development. Change to using the perf_env of the current session to determine the architecture kind and set the sort key and header entries as appropriate. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-23-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf sample: Remove arch notion of sample parsingIan Rogers10-24/+33
By definition arch sample parsing and synthesis will inhibit certain kinds of cross-platform record then analysis (report, script, etc.). Remove arch_perf_parse_sample_weight and arch_perf_synthesize_sample_weight replacing with a common implementation. Combine perf_sample p_stage_cyc and retire_lat as weight3 to capture the differing uses regardless of compiled for architecture. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-21-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf env: Remove global perf_envIan Rogers5-7/+4
The global perf_env was used for the host, but if a perf_env wasn't easy to come by it was used in a lot of places where potentially recorded and host data could be confused. Remove the global variable as now the majority of accesses retrieve the perf_env for the host from the session. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-20-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf auxtrace: Pass perf_env from session through to mmap readIan Rogers2-8/+11
auxtrace_mmap__read and auxtrace_mmap__read_snapshot end up calling `evsel__env(NULL)` which returns the global perf_env variable for the host. Their only call is in perf record. Rather than use the global variable pass through the perf_env for `perf record`. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-18-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf machine: Explicitly pass in host perf_envIan Rogers4-13/+23
When creating a machine for the host explicitly pass in a scoped perf_env. This removes a use of the global perf_env. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-17-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf session: Add host_env argument to perf_session__newIan Rogers2-4/+6
When creating a perf_session the host perf_env may or may not want to be used. For example, `perf top` uses a host perf_env while `perf inject` does not. Add a host_env argument to perf_session__new so that sessions requiring a host perf_env can pass it in. Currently if none is specified the global perf_env variable is used, but this will change in later patches. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-14-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf test: Avoid use perf_envIan Rogers1-0/+1
The perf_env global variable holds the host perf_env data but its use is hit and miss. Switch to using local perf_env variables and ensure scoped perf_env__init and perf_env__exit. This loses command line setting of the perf_env, but this doesn't matter for tests. So the perf_env is fully initialized, clear it with memset in perf_env__init. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-13-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf header: Clean up use of perf_envIan Rogers1-76/+98
Always use the perf_env from the feat_fd's perf_header. Cache the value on entry to a function in `env` and use `env->` consistently in the code. Ensure the header is initialized for use in perf_session__do_write_header. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-12-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf evlist: Change env variable to sessionIan Rogers10-13/+23
The session holds a perf_env pointer env. In UI code container_of is used to turn the env to a session, but this assumes the session header's env is in use. Rather than a dubious container_of, hold the session in the evlist and derive the env from the session with evsel__env, perf_session__env, etc. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-11-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf session: Add accessor for session->header.envIan Rogers7-30/+37
The perf_env from the header in the session is frequently accessed, add an accessor function rather than access directly. Cache the value to avoid repeated calls. No behavioral change. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-10-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf record: Make --buildid-mmap the defaultIan Rogers2-9/+9
Support for build IDs in mmap2 perf events has been present since Linux v5.12: https://lore.kernel.org/lkml/20210219194619.1780437-1-acme@kernel.org/ Build ID mmap events don't avoid the need to inject build IDs for DSO touched by samples as the build ID cache is populated by perf record. They can avoid some cases of symbol mis-resolution caused by the file system changing from when a sample occurred and when the DSO is sought. Unlike the --buildid-mmap option, this chnage doesn't disable the build ID cache but it does disable the processing of samples looking for DSOs to inject build IDs for. To disable the build ID cache the -B (--no-buildid) option should be used. Making this option the default was raised on the list in: https://lore.kernel.org/linux-perf-users/CAP-5=fXP7jN_QrGUcd55_QH5J-Y-FCaJ6=NaHVtyx0oyNh8_-Q@mail.gmail.com/ Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-9-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf jitdump: Directly mark the jitdump DSOIan Rogers1-4/+17
The DSO being generated was being accessed through a thread's maps, this is unnecessary as the dso can just be directly found. This avoids problems with passing a NULL evsel which may be inspected to determine properties of a callchain when using the buildid DSO marking code. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-8-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf dso: Move build_id to dso_idIan Rogers9-134/+163
The dso_id previously contained the major, minor, inode and inode generation information from a mmap2 event - the inode generation would be zero when reading from /proc/pid/maps. The build_id was in the dso. With build ID mmap2 events these fields wouldn't be initialized which would largely mean the special empty case where any dso would match for equality. This isn't desirable as if a dso is replaced we want the comparison to yield a difference. To support detecting the difference between DSOs based on build_id, move the build_id out of the DSO and into the dso_id. The dso_id is also stored in the DSO so nothing is lost. Capture in the dso_id what parts have been initialized and rename dso_id__inject to dso_id__improve_id so that it is clear the dso_id is being improved upon with additional information. With the build_id in the dso_id, use memcmp to compare for equality. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-7-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf build-id: Ensure struct build_id is empty before useIan Rogers7-10/+13
If a build ID is read then not all code paths may ensure it is empty before use. Initialize the build_id to be zero-ed unless there is clear initialization such as a call to build_id__init. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-6-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf build-id: Mark DSO in sample callchainsIan Rogers1-1/+16
Previously only the sample IP's map DSO would be marked hit for the purposes of populating the build ID cache. Walk the call chain to mark all IPs and DSOs. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-5-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25perf build-id: Change sprintf functions to snprintfIan Rogers13-40/+32
Pass in a size argument rather than implying all build id strings must be SBUILD_ID_SIZE. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-4-irogers@google.com [ fixed some build errors ] Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf build-id: Truncate to avoid overflowing the build_id dataIan Rogers1-1/+4
Warning when the build_id data would be overflowed would lead to memory corruption, switch to truncation. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-3-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf build-id: Reduce size of "size" variableIan Rogers2-3/+7
Later clean up of the dso_id to include a build_id will suffer from alignment and size issues. The size can only hold up to a value of BUILD_ID_SIZE (20) and the mmap2 event uses a byte for the value. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250724163302.596743-2-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf metricgroups: Add NO_THRESHOLD_AND_NMI constraintIan Rogers1-4/+12
Thresholds can increase the number of counters a metric needs. The NMI watchdog can take away a counter (hopefully the buddy watchdog will become the default and this will no longer be true). Add a new constraint for the case that a metric and its thresholds would fit in counters but only if the NMI watchdog isn't enabled. Either the threshold or the NMI watchdog should be disabled to make the metric fit. Wire this up into the metric__group_events logic. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250719030517.1990983-16-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf parse-events: Fix missing slots for Intel topdown metric eventsIan Rogers2-0/+11
Topdown metric events require grouping with a slots event. In perf metrics this is currently achieved by metrics adding an unnecessary "0 * tma_info_thread_slots". New TMA metrics trigger optimizations of the metric expression that removes the event and breaks the metric due to the missing but required event. Add a pass immediately before sorting and fixing parsed events, that insert a slots event if one is missing. Update test expectations to match this. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250719030517.1990983-15-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf parse-events: Support user CPUs mixed with threads/processesIan Rogers2-10/+6
Counting events system-wide with a specified CPU prior to this change worked: ``` $ perf stat -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' -a sleep 1 Performance counter stats for 'system wide': 59,393,419,099 msr/tsc/ 33,927,965,927 msr/tsc,cpu=cpu_core/ 25,465,608,044 msr/tsc,cpu=cpu_atom/ ``` However, when counting with process the counts became system wide: ``` $ perf stat -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' perf test -F 10 10.1: Basic parsing test : Ok 10.2: Parsing without PMU name : Ok 10.3: Parsing with PMU name : Ok Performance counter stats for 'perf test -F 10': 59,233,549 msr/tsc/ 59,227,556 msr/tsc,cpu=cpu_core/ 59,224,053 msr/tsc,cpu=cpu_atom/ ``` Make the handling of CPU maps with event parsing clearer. When an event is parsed creating an evsel the cpus should be either the PMU's cpumask or user specified CPUs. Update perf_evlist__propagate_maps so that it doesn't clobber the user specified CPUs. Try to make the behavior clearer, firstly fix up missing cpumasks. Next, perform sanity checks and adjustments from the global evlist CPU requests and for the PMU including simplifying to the "any CPU"(-1) value. Finally remove the event if the cpumask is empty. So that events are opened with a CPU and a thread change stat's create_perf_stat_counter to give both. With the change things are fixed: ``` $ perf stat --no-scale -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' perf test -F 10 10.1: Basic parsing test : Ok 10.2: Parsing without PMU name : Ok 10.3: Parsing with PMU name : Ok Performance counter stats for 'perf test -F 10': 63,704,975 msr/tsc/ 47,060,704 msr/tsc,cpu=cpu_core/ (4.62%) 16,640,591 msr/tsc,cpu=cpu_atom/ (2.18%) ``` However, note the "--no-scale" option is used. This is necessary as the running time for the event on the counter isn't the same as the enabled time because the thread doesn't necessarily run on the CPUs specified for the counter. All counter values are scaled with: scaled_value = value * time_enabled / time_running and so without --no-scale the scaled_value becomes very large. This problem already exists on hybrid systems for the same reason. Here are 2 runs of the same code with an instructions event that counts the same on both types of core, there is no real multiplexing happening on the event: ``` $ perf stat -e instructions perf test -F 10 ... Performance counter stats for 'perf test -F 10': 87,896,447 cpu_atom/instructions/ (14.37%) 98,171,964 cpu_core/instructions/ (85.63%) ... $ perf stat --no-scale -e instructions perf test -F 10 ... Performance counter stats for 'perf test -F 10': 13,069,890 cpu_atom/instructions/ (19.32%) 83,460,274 cpu_core/instructions/ (80.68%) ... ``` The scaling has inflated per-PMU instruction counts and the overall count by 2x. To fix this the kernel needs changing when a task+CPU event (or just task event on hybrid) is scheduled out. A fix could be that the state isn't inactive but off for such events, so that time_enabled counts don't accumulate on them. Reviewed-by: Thomas Falcon <thomas.falcon@intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250719030517.1990983-13-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf evsel: Add evsel__open_per_cpu_and_threadIan Rogers2-4/+22
Add evsel__open_per_cpu_and_thread that combines the operation of evsel__open_per_cpu and evsel__open_per_thread so that an event without the "any" cpumask can be opened with its cpumask and with threads it specifies. Change the implementation of evsel__open_per_cpu and evsel__open_per_thread to use evsel__open_per_cpu_and_thread to make the implementation of those functions clearer. Reviewed-by: Thomas Falcon <thomas.falcon@intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250719030517.1990983-12-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf parse-events: Minor __add_event refactoringIan Rogers1-21/+48
Rename cpu_list to user_cpus. If a PMU isn't given, find it early from the perf_event_attr. Make the pmu_cpus more explicitly a copy from the PMU (except when user_cpus are given). Derive the cpus from pmu_cpus and user_cpus as appropriate. Handle strdup errors on name and metric_id. Reviewed-by: Thomas Falcon <thomas.falcon@intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250719030517.1990983-11-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf pmus: Factor perf_pmus__find_by_attr out of evsel__find_pmuIan Rogers2-12/+19
Allow a PMU to be found by a perf_event_attr, useful when creating evsels. Reviewed-by: Thomas Falcon <thomas.falcon@intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250719030517.1990983-10-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf evsel: Use libperf perf_evsel__exitIan Rogers1-3/+1
Avoid the duplicated code and better enable perf_evsel to change. Reviewed-by: Thomas Falcon <thomas.falcon@intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250719030517.1990983-9-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24libperf evsel: Rename own_cpus to pmu_cpusIan Rogers5-14/+14
own_cpus is generally the cpumask from the PMU. Rename to pmu_cpus to try to make this clearer. Variable rename with no other changes. Reviewed-by: Thomas Falcon <thomas.falcon@intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250719030517.1990983-7-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf tool_pmu: Allow num_cpus(_online) to be specific to a cpumaskIan Rogers3-9/+51
For hybrid metrics it is useful to know the number of p-core or e-core CPUs. If a cpumask is specified for the num_cpus or num_cpus_online tool events, compute the value relative to the given mask rather than for the full system. ``` $ sudo /tmp/perf/perf stat -e 'tool/num_cpus/,tool/num_cpus,cpu=cpu_core/, tool/num_cpus,cpu=cpu_atom/,tool/num_cpus_online/,tool/num_cpus_online, cpu=cpu_core/,tool/num_cpus_online,cpu=cpu_atom/' true Performance counter stats for 'true': 28 tool/num_cpus/ 16 tool/num_cpus,cpu=cpu_core/ 12 tool/num_cpus,cpu=cpu_atom/ 28 tool/num_cpus_online/ 16 tool/num_cpus_online,cpu=cpu_core/ 12 tool/num_cpus_online,cpu=cpu_atom/ 0.000767205 seconds time elapsed 0.000938000 seconds user 0.000000000 seconds sys ``` Reviewed-by: Thomas Falcon <thomas.falcon@intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250719030517.1990983-6-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf parse-events: Allow the cpu term to be a PMU or CPU rangeIan Rogers1-9/+38
On hybrid systems, events like msr/tsc/ will aggregate counts across all CPUs. Often metrics only want a value like msr/tsc/ for the cores on which the metric is being computed. Listing each CPU with terms cpu=0,cpu=1.. is laborious and would need to be encoded for all variations of a CPU model. Allow the cpumask from a PMU to be an argument to the cpu term. For example in the following the cpumask of the cstate_pkg PMU selects the CPUs to count msr/tsc/ counter upon: ``` $ cat /sys/bus/event_source/devices/cstate_pkg/cpumask 0 $ perf stat -A -e 'msr/tsc,cpu=cstate_pkg/' -a sleep 0.1 Performance counter stats for 'system wide': CPU0 252,621,253 msr/tsc,cpu=cstate_pkg/ 0.101184092 seconds time elapsed ``` As the cpu term is now also allowed to be a string, allow it to encode a range of CPUs (a list can't be supported as ',' is already a special token). The "event qualifiers" section of the `perf list` man page is updated to detail the additional behavior. The man page formatting is tidied up in this section, as it was incorrectly appearing within the "parameterized events" section. Reviewed-by: Thomas Falcon <thomas.falcon@intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250719030517.1990983-5-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf parse-events: Warn if a cpu term is unsupported by a CPUIan Rogers4-18/+35
Factor requested CPU warning out of evlist and into evsel. At the end of adding an event, perform the warning check. To avoid repeatedly testing if the cpu_list is empty, add a local variable. ``` $ perf stat -e cpu_atom/cycles,cpu=1/ -a true WARNING: A requested CPU in '1' is not supported by PMU 'cpu_atom' (CPUs 16-27) for event 'cpu_atom/cycles/' Performance counter stats for 'system wide': <not supported> cpu_atom/cycles/ 0.000781511 seconds time elapsed ``` Reviewed-by: Thomas Falcon <thomas.falcon@intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250719030517.1990983-2-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24perf pfm: Don't force loading of all PMUsIan Rogers1-4/+0
Force loading all PMUs adds significant cost because DRM and other PMUs are loaded, it should also not be required if the pmus__ functions are used. Tested by run perf test, in particular the pfm related tests. Also `perf list` is identical before and after. Before: $ time ./perf test pfm 54: Test libpfm4 support : 54.1: test of individual --pfm-events : Ok 54.2: test groups of --pfm-events : Ok 103: perf all libpfm4 events test : Ok real 0m8.933s user 0m1.824s sys 0m7.122s After: $ time ./perf test pfm 54: Test libpfm4 support : 54.1: test of individual --pfm-events : Ok 54.2: test groups of --pfm-events : Ok 103: perf all libpfm4 events test : Ok real 0m5.259s user 0m1.793s sys 0m3.570s Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250722013449.146233-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-23perf stat: Remove duplicated include in stat-shadow.cYang Li1-1/+0
The header files rblist.h is included twice in stat-shadow.c, so one inclusion of each can be removed. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=22933 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20250723070418.2195172-1-yang.lee@linux.alibaba.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-23perf pmu: Switch FILENAME_MAX to NAME_MAXIan Rogers1-2/+2
FILENAME_MAX is the same as PATH_MAX (4kb) in glibc rather than NAME_MAX's 255. Switch to using NAME_MAX and ensure the '\0' is accounted for in the path's buffer size. Fixes: 754baf426e09 ("perf pmu: Change aliases from list to hashmap") Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250717150855.1032526-2-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-23perf: ftrace: add graph tracer options args/retval/retval-hex/retaddrChangbin Du1-0/+4
This change adds support for new funcgraph tracer options funcgraph-args, funcgraph-retval, funcgraph-retval-hex and funcgraph-retaddr. The new added options are: - args : Show function arguments. - retval : Show function return value. - retval-hex : Show function return value in hexadecimal format. - retaddr : Show function return address. # ./perf ftrace -G vfs_write --graph-opts retval,retaddr # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 5) | mutex_unlock() { /* <-rb_simple_write+0xda/0x150 */ 5) 0.188 us | local_clock(); /* <-lock_release+0x2ad/0x440 ret=0x3bf2a3cf90e */ 5) | rt_mutex_slowunlock() { /* <-rb_simple_write+0xda/0x150 */ 5) | _raw_spin_lock_irqsave() { /* <-rt_mutex_slowunlock+0x4f/0x200 */ 5) 0.123 us | preempt_count_add(); /* <-_raw_spin_lock_irqsave+0x23/0x90 ret=0x0 */ 5) 0.128 us | local_clock(); /* <-__lock_acquire.isra.0+0x17a/0x740 ret=0x3bf2a3cfc8b */ 5) 0.086 us | do_raw_spin_trylock(); /* <-_raw_spin_lock_irqsave+0x4a/0x90 ret=0x1 */ 5) 0.845 us | } /* _raw_spin_lock_irqsave ret=0x292 */ 5) | _raw_spin_unlock_irqrestore() { /* <-rt_mutex_slowunlock+0x191/0x200 */ 5) 0.097 us | local_clock(); /* <-lock_release+0x2ad/0x440 ret=0x3bf2a3cff1f */ 5) 0.086 us | do_raw_spin_unlock(); /* <-_raw_spin_unlock_irqrestore+0x23/0x60 ret=0x1 */ 5) 0.104 us | preempt_count_sub(); /* <-_raw_spin_unlock_irqrestore+0x35/0x60 ret=0x0 */ 5) 0.726 us | } /* _raw_spin_unlock_irqrestore ret=0x80000000 */ 5) 1.881 us | } /* rt_mutex_slowunlock ret=0x0 */ 5) 2.931 us | } /* mutex_unlock ret=0x0 */ Signed-off-by: Changbin Du <changbin.du@huawei.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250613114048.132336-1-changbin.du@huawei.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-15perf ftrace latency: Add -e option to measure time between two eventsNamhyung Kim3-73/+151
In addition to the function latency, it can measure events latencies. Some kernel tracepoints are paired and it's menningful to measure how long it takes between the two events. The latency is tracked for the same thread. Currently it only uses BPF to do the work but it can be lifted later. Instead of having separate a BPF program for each tracepoint, it only uses generic 'event_begin' and 'event_end' programs to attach to any (raw) tracepoints. $ sudo perf ftrace latency -a -b --hide-empty \ -e i915_request_wait_begin,i915_request_wait_end -- sleep 1 # DURATION | COUNT | GRAPH | 256 - 512 us | 4 | ###### | 2 - 4 ms | 2 | ### | 4 - 8 ms | 12 | ################### | 8 - 16 ms | 10 | ################ | # statistics (in usec) total time: 194915 avg time: 6961 max time: 12855 min time: 373 count: 28 Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250714052143.342851-1-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-11perf python: Set index error for invalid thread/cpu map itemsIan Rogers1-2/+6
Returning NULL for out of bound CPU or thread map items causes internal errors. Fix by correctly setting the error to be an index error. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250710235126.1086011-14-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-11perf python: Improve leader copying from evlistIan Rogers1-0/+57
The struct pyrf_evlist embeds the evlist requiring the copying from things like parsed events. The copying logic handles the leader being the event itself, but if the leader group event is a different in the list it will cause an evsel to point to the evsel in the list that was copied from which is bad. Fix this by adding another pass over the evlist rewriting leaders, simplified by the introductin of two evlist helpers. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250710235126.1086011-13-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-11perf python: Correct pyrf_evsel__read for tool PMUsIan Rogers1-3/+44
Tool PMUs assume that stat's process_counter_values is being used to read the counters. Specifically they hold onto old values in evsel->prev_raw_counts and give the cumulative count based off of this value. Update pyrf_evsel__read to allocate counts and prev_raw_counts, use evsel__read_counter rather than perf_evsel__read so tool PMUs are read from not just perf_event_open events, make the returned pyrf_counts_values contain the delta value rather than the cumulative value. Fixes: 739621f65702 ("perf python: Add evsel read method") Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250710235126.1086011-12-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-11perf python: Fix thread check in pyrf_evsel__readIan Rogers1-1/+1
The CPU index is incorrectly checked rather than the thread index. Fixes: 739621f65702 ("perf python: Add evsel read method") Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250710235126.1086011-11-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-11perf python: In str(evsel) use the evsel__pmu_name helperIan Rogers1-4/+1
The evsel__pmu_name helper will internally use evsel__find_pmu that handles legacy events, extended types, etc. in determining a PMU and will provide a better value than just trying to access the PMU's name directly as the PMU may not have been computed. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250710235126.1086011-10-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-11perf jevents: If the long_desc and desc are identical then drop the long_descIan Rogers1-2/+1
If the short and long descriptions are the same then save space and don't store both of them. When storing the desc in the perf_pmu_alias, don't duplicate the desc into the long_desc. By avoiding storing the duplicate the size of the events string in the binary on x86 is reduced by 29,840 bytes. Fix tests that expect a duplicated description. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250710235126.1086011-9-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-11perf expr: Accumulate rather than replace in the context countsIan Rogers1-1/+5
Metrics will fill in the context to have mappings from an event to a count. When counts are added they replace existing mappings which generally shouldn't exist with aggregation. Switch to accumulating to better support cases where perf stat's aggregation isn't used and we may see a counter more than once. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250710235126.1086011-8-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-11perf stat: Move metric list from config to evlistIan Rogers10-59/+48
The rblist of metric_event that then have a list of associated metric_expr is moved out of the stat_config and into the evlist. This is done as part of refactoring things for python, having the state split in two places complicates that implementation. The evlist is doing the harder work of enabling and disabling events, the metrics are needed to compute a value and it doesn't seem unreasonable to hang them from the evlist. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250710235126.1086011-7-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>