kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
2025-02-28	perf lock: Report owner stack in usermode	Chun-Tse Shao	2	-7/+71
	This patch parses `owner_lock_stat` into a RB tree, enabling ordered reporting of owner lock statistics with stack traces. It also updates the documentation for the `-o` option in contention mode, decouples `-o` from `-t`, and issues a warning to inform users about the new behavior of `-ov`. Example output: $ sudo ~/linux/tools/perf/perf lock con -abvo -Y mutex-spin -E3 perf bench sched pipe ... contended total wait max wait avg wait type caller 171 1.55 ms 20.26 us 9.06 us mutex pipe_read+0x57 0xffffffffac6318e7 pipe_read+0x57 0xffffffffac623862 vfs_read+0x332 0xffffffffac62434b ksys_read+0xbb 0xfffffffface604b2 do_syscall_64+0x82 0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76 36 193.71 us 15.27 us 5.38 us mutex pipe_write+0x50 0xffffffffac631ee0 pipe_write+0x50 0xffffffffac6241db vfs_write+0x3bb 0xffffffffac6244ab ksys_write+0xbb 0xfffffffface604b2 do_syscall_64+0x82 0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76 4 51.22 us 16.47 us 12.80 us mutex do_epoll_wait+0x24d 0xffffffffac691f0d do_epoll_wait+0x24d 0xffffffffac69249b do_epoll_pwait.part.0+0xb 0xffffffffac693ba5 __x64_sys_epoll_pwait+0x95 0xfffffffface604b2 do_syscall_64+0x82 0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76 === owner stack trace === 3 31.24 us 15.27 us 10.41 us mutex pipe_read+0x348 0xffffffffac631bd8 pipe_read+0x348 0xffffffffac623862 vfs_read+0x332 0xffffffffac62434b ksys_read+0xbb 0xfffffffface604b2 do_syscall_64+0x82 0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76 ... Signed-off-by: Chun-Tse Shao <ctshao@google.com> Tested-by: Athira Rajeev <atrajeev@linux.ibm.com> Link: https://lore.kernel.org/r/20250227003359.732948-5-ctshao@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-28	perf lock: Retrieve owner callstack in bpf program	Chun-Tse Shao	1	-9/+203
	This implements per-callstack aggregation of lock owners in addition to per-thread. The owner callstack is captured using `bpf_get_task_stack()` at `contention_begin()` and it also adds a custom stackid function for the owner stacks to be compared easily. The owner info is kept in a hash map using lock addr as a key to handle multiple waiters for the same lock. At `contention_end()`, it updates the owner lock stat based on the info that was saved at `contention_begin()`. If there are more waiters, it'd update the owner pid to itself as `contention_end()` means it gets the lock now. But it also needs to check the return value of the lock function in case task was killed by a signal or something. Signed-off-by: Chun-Tse Shao <ctshao@google.com> Tested-by: Athira Rajeev <atrajeev@linux.ibm.com> Link: https://lore.kernel.org/r/20250227003359.732948-3-ctshao@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-28	perf lock: Add bpf maps for owner stack tracing	Chun-Tse Shao	3	-2/+52
	Add a struct and few bpf maps in order to tracing owner stack. `struct owner_tracing_data`: Contains owner's pid, stack id, timestamp for when the owner acquires lock, and the count of lock waiters. `stack_buf`: Percpu buffer for retrieving owner stacktrace. `owner_stacks`: For tracing owner stacktrace to customized owner stack id. `owner_data`: For tracing lock_address to `struct owner_tracing_data` in bpf program. `owner_stat`: For reporting owner stacktrace in usermode. Signed-off-by: Chun-Tse Shao <ctshao@google.com> Tested-by: Athira Rajeev <atrajeev@linux.ibm.com> Link: https://lore.kernel.org/r/20250227003359.732948-2-ctshao@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-27	perf cpumap: Reduce cpu size from int to int16_t	Ian Rogers	2	-22/+48
	Fewer than 32k logical CPUs are currently supported by perf. A cpumap is indexed by an integer (see perf_cpu_map__cpu) yielding a perf_cpu that wraps a 4-byte int for the logical CPU - the wrapping is done deliberately to avoid confusing a logical CPU with an index into a cpumap. Using a 4-byte int within the perf_cpu is larger than required so this patch reduces it to the 2-byte int16_t. For a cpumap containing 16 entries this will reduce the array size from 64 to 32 bytes. For very large servers with lots of logical CPUs the size savings will be greater. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250210191231.156294-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-27	perf pmu: Don't double count common sysfs and json events	James Clark	2	-3/+9
	After pmu_add_cpu_aliases() is called, perf_pmu__num_events() returns an incorrect value that double counts common events and doesn't match the actual count of events in the alias list. This is because after 'cpu_aliases_added == true', the number of events returned is 'sysfs_aliases + cpu_json_aliases'. But when adding 'case EVENT_SRC_SYSFS' events, 'sysfs_aliases' and 'cpu_json_aliases' are both incremented together, failing to account that these ones overlap and only add a single item to the list. Fix it by adding another counter for overlapping events which doesn't influence 'cpu_json_aliases'. There doesn't seem to be a current issue because it's used in perf list before pmu_add_cpu_aliases() so the correct value is returned. Other uses in tests may also miss it for other reasons like only looking at uncore events. However it's marked as a fixes commit in case any new fix with new uses of perf_pmu__num_events() is backported. Fixes: d9c5f5f94c2d ("perf pmu: Count sys and cpuid JSON events separately") Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250226104111.564443-3-james.clark@linaro.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-27	perf pmu: Dynamically allocate tool PMU	James Clark	3	-14/+13
	perf_pmus__destroy() treats all PMUs as allocated and free's them so we can't have any static PMUs that are added to the PMU lists. Fix it by allocating the tool PMU in the same way as the others. Current users of the tool PMU already use find_pmu() and not perf_pmus__tool_pmu(), so rename the function to add 'new' to avoid it being misused in the future. perf_pmus__fake_pmu() can remain as static as it's not added to the PMU lists. Fixes the following error: $ perf bench internals pmu-scan # Running 'internals/pmu-scan' benchmark: Computing performance of sysfs PMU event scan for 100 times munmap_chunk(): invalid pointer Aborted (core dumped) Fixes: 240505b2d0ad ("perf tool_pmu: Factor tool events into their own PMU") Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250226104111.564443-2-james.clark@linaro.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-27	perf probe: Pick the correct dwarf die while adding probe points	Athira Rajeev	2	-3/+19
	Perf probe on vfs_fstatat fails as below on a powerpc system $ ./perf probe -nf --max-probes=512 -a 'vfs_fstatat $params' Segmentation fault (core dumped) This is observed while running perftool-testsuite_probe testcase. While running with verbose, its observed that segfault happens at: synthesize_probe_trace_arg () synthesize_probe_trace_command () probe_file.add_event () apply_perf_probe_events () __cmd_probe () cmd_probe () run_builtin () handle_internal_command () main () Code in synthesize_probe_trace_arg() access a null value and results in segfault. Data structure which is null: struct probe_trace_arg arg->value We are hitting a case where arg->value is null in probe point: "vfs_fstatat $params". This is happening since 'commit e896474fe485 ("getname_maybe_null() - the third variant of pathname copy-in")' Before the commit, probe point for vfs_fstatat was getting added only for one location: Writing event: p:probe/vfs_fstatat _text+6345404 dfd=%gpr3:s32 filename=%gpr4:x64 stat=%gpr5:x64 flags=%gpr6:s32 With this change, vfs_fstatat code is inlined for other locations in the code: Probe point found: __do_sys_lstat64+48 Probe point found: __do_sys_stat64+48 Probe point found: __do_sys_newlstat+48 Probe point found: __do_sys_newstat+48 Probe point found: vfs_fstatat+0 When trying to find matching dwarf information entry (DIE) from the debuginfo, the code incorrectly picks DIE which is not referring to vfs_fstatat. Snippet from dwarf entry in vmlinux debuginfo file. The main abstract die is: <1><4214883>: Abbrev Number: 147 (DW_TAG_subprogram) <4214885> DW_AT_external : 1 <4214885> DW_AT_name : (indirect string, offset: 0x17b9f3): vfs_fstatat With formal parameters: <2><4214896>: Abbrev Number: 51 (DW_TAG_formal_parameter) <4214897> DW_AT_name : dfd <2><42148a3>: Abbrev Number: 23 (DW_TAG_formal_parameter) <42148a4> DW_AT_name : (indirect string, offset: 0x8fda9): filename <2><42148b0>: Abbrev Number: 23 (DW_TAG_formal_parameter) <42148b1> DW_AT_name : (indirect string, offset: 0x16bd9c): stat <2><42148bd>: Abbrev Number: 23 (DW_TAG_formal_parameter) <42148be> DW_AT_name : (indirect string, offset: 0x39832b): flags While collecting variables/parameters for a probe point, the function copy_variables_cb() also looks at dwarf debug entries based on the instruction address. Snippet if (dwarf_haspc(die_mem, vf->pf->addr)) return DIE_FIND_CB_CONTINUE; else return DIE_FIND_CB_SIBLING; But incase of inlined function instance for vfs_fstatat, there are two entries which has the instruction address entry point as same. Instance 1: which is for vfs_fstatat and DW_AT_abstract_origin points to 0x4214883 (reference above for main abstract die) <3><42131fa>: Abbrev Number: 59 (DW_TAG_inlined_subroutine) <42131fb> DW_AT_abstract_origin: <0x4214883> <42131ff> DW_AT_entry_pc : 0xc00000000062b1e0 Instance 2: which is not for vfs_fstatat but for getname <5><4213270>: Abbrev Number: 39 (DW_TAG_inlined_subroutine) <4213271> DW_AT_abstract_origin: <0x4215b6b> <4213275> DW_AT_entry_pc : 0xc00000000062b1e0 But the copy_variables_cb() continues to add parameters from second instance also based on the dwarf_haspc() check. This results in formal parameters for getname also appended to params. But while filling in the args->value for these parameters, since these args are not part of dwarf with offset "42131fa". Hence value will be null. This incorrect args results in segfault when value field is accessed. Save the dwarf dieoffset of the actual DW_TAG_subprogram as part of "struct probe_finder". In copy_variables_cb(), include check to make sure the DW_AT_abstract_origin points to the correct entry if the dwarf_haspc() matches the instruction address. Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250225123042.37263-1-atrajeev@linux.ibm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-27	perf ftrace latency: allow to hide empty buckets	Gabriele Monaco	1	-0/+1
	Especially while using several buckets, it isn't uncommon to have some of them empty and reading the histogram may be a bit more complex: # perf ftrace latency -a -T mutex_lock --bucket-range 5 --max-latency 200 # DURATION \| COUNT \| GRAPH \| 0 - 5 us \| 14816 \| ###################################### \| 5 - 10 us \| 1228 \| ### \| 10 - 15 us \| 438 \| # \| 15 - 20 us \| 106 \| \| 20 - 25 us \| 21 \| \| 25 - 30 us \| 11 \| \| 30 - 35 us \| 1 \| \| 35 - 40 us \| 2 \| \| 40 - 45 us \| 4 \| \| 45 - 50 us \| 0 \| \| 50 - 55 us \| 1 \| \| 55 - 60 us \| 0 \| \| 60 - 65 us \| 1 \| \| 65 - 70 us \| 1 \| \| 70 - 75 us \| 1 \| \| 75 - 80 us \| 2 \| \| 80 - 85 us \| 0 \| \| 85 - 90 us \| 1 \| \| 90 - 95 us \| 0 \| \| 95 - 100 us \| 1 \| \| 100 - 105 us \| 0 \| \| 105 - 110 us \| 0 \| \| 110 - 115 us \| 0 \| \| 115 - 120 us \| 0 \| \| 120 - 125 us \| 1 \| \| 125 - 130 us \| 0 \| \| 130 - 135 us \| 0 \| \| 135 - 140 us \| 1 \| \| 140 - 145 us \| 0 \| \| 145 - 150 us \| 0 \| \| 150 - 155 us \| 0 \| \| 155 - 160 us \| 0 \| \| 160 - 165 us \| 0 \| \| 165 - 170 us \| 0 \| \| 170 - 175 us \| 0 \| \| 175 - 180 us \| 0 \| \| 180 - 185 us \| 0 \| \| 185 - 190 us \| 0 \| \| 190 - 195 us \| 0 \| \| 195 - 200 us \| 0 \| \| 200 - ... us \| 2 \| \| Allow the optional flag --hide-empty to remove buckets with no element and produce a more compact graph. This feature could be misleading since there is no clear indication for missing buckets, for this reason it's disabled by default. # perf ftrace latency -a -T mutex_lock --bucket-range 5 --max-latency --hide-empty 200 # DURATION \| COUNT \| GRAPH \| 0 - 5 us \| 14816 \| ###################################### \| 5 - 10 us \| 1228 \| ### \| 10 - 15 us \| 438 \| # \| 15 - 20 us \| 106 \| \| 20 - 25 us \| 21 \| \| 25 - 30 us \| 11 \| \| 30 - 35 us \| 1 \| \| 35 - 40 us \| 2 \| \| 40 - 45 us \| 4 \| \| 50 - 55 us \| 1 \| \| 60 - 65 us \| 1 \| \| 65 - 70 us \| 1 \| \| 70 - 75 us \| 1 \| \| 75 - 80 us \| 2 \| \| 85 - 90 us \| 1 \| \| 95 - 100 us \| 1 \| \| 120 - 125 us \| 1 \| \| 135 - 140 us \| 1 \| \| 200 - ... us \| 2 \| \| Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20250207080446.77630-2-gmonaco@redhat.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-27	perf ftrace latency: variable histogram buckets	Gabriele Monaco	3	-4/+10
	The max-latency value can make the histogram smaller, but not larger, we have a maximum of 22 buckets and specifying a max-latency that would require more buckets has no effect. Dynamically allocate the buckets and compute the bucket number from the max latency as (max-min) / range + 2 If the maximum is not specified, we still set the bucket number to 22 and compute the maximum accordingly. Fail if the maximum is smaller than min+range, this way we make sure we always have 3 buckets: those below min, those above max and one in the middle. Since max-latency is not available in log2 mode, always use 22 buckets. Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20250207080446.77630-1-gmonaco@redhat.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-27	perf annotate-data: Handle direct use of stack pointer without fbreg	Namhyung Kim	1	-5/+10
	Sometimes compiler generates code to use the stack pointer register without frame pointer. As we know RSP is the stack register on x86, let's treat it as same as fbreg. But the offset would be opposite direction so update the debug message accordingly. Reported-by: Blake Jones <blakejones@google.com> Link: https://lore.kernel.org/r/20250126210242.1181225-1-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-25	perf report: Fix sample number stats for branch entry mode	Thomas Falcon	1	-2/+6
	Currently, stats->nr_samples is incremented per entry in the branch stack instead of per sample taken. As a result, statistics of samples taken during perf record in --branch-filter or --branch-any mode does not seem correct. Instead call hists__inc_nr_samples() for each sample taken instead of for each entry in the branch stack. Before: $ ./perf record -e cycles:u -b -c 10000000000 ./tchain_edit [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.005 MB perf.data (2 samples) ] $ perf report -D \| tail -n 16 Aggregated stats: TOTAL events: 16 COMM events: 2 (12.5%) EXIT events: 1 ( 6.2%) SAMPLE events: 2 (12.5%) MMAP2 events: 2 (12.5%) KSYMBOL events: 1 ( 6.2%) FINISHED_ROUND events: 1 ( 6.2%) ID_INDEX events: 1 ( 6.2%) THREAD_MAP events: 1 ( 6.2%) CPU_MAP events: 1 ( 6.2%) EVENT_UPDATE events: 2 (12.5%) TIME_CONV events: 1 ( 6.2%) FINISHED_INIT events: 1 ( 6.2%) cpu_core/cycles/u stats: SAMPLE events: 64 After: $ ./perf report -D \| tail -n 16 Aggregated stats: TOTAL events: 16 COMM events: 2 (12.5%) EXIT events: 1 ( 6.2%) SAMPLE events: 2 (12.5%) MMAP2 events: 2 (12.5%) KSYMBOL events: 1 ( 6.2%) FINISHED_ROUND events: 1 ( 6.2%) ID_INDEX events: 1 ( 6.2%) THREAD_MAP events: 1 ( 6.2%) CPU_MAP events: 1 ( 6.2%) EVENT_UPDATE events: 2 (12.5%) TIME_CONV events: 1 ( 6.2%) FINISHED_INIT events: 1 ( 6.2%) cpu_core/cycles/u stats: SAMPLE events: 2 Signed-off-by: Thomas Falcon <thomas.falcon@intel.com> Link: https://lore.kernel.org/r/20250220045942.114965-1-thomas.falcon@intel.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-25	perf machine: Reuse module path buffer	Ian Rogers	1	-11/+23
	Rather than copying the path and appending the directory entry in a fresh path buffer, append to the path at the end of where it is for the recursion level. This saves a PATH_MAX buffer per recursion level and some unnecessary copying. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-9-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-25	perf hwmon_pmu: Switch event discovery to io_dir__readdir	Ian Rogers	1	-25/+17
	Avoid DIR allocations when scanning sysfs by using io_dir for the readdir implementation, that allocates about 1kb on the stack. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-8-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-25	perf parse-events: Switch tracepoints to io_dir__readdir	Ian Rogers	1	-13/+19
	Avoid DIR allocations when scanning sysfs by using io_dir for the readdir implementation, that allocates about 1kb on the stack. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-7-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-25	perf events: Remove scandir in thread synthesis	Ian Rogers	1	-10/+12
	This avoids scanddir reading the directory into memory that's allocated and instead allocates on the stack. Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20250222061015.303622-6-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-25	perf header: Switch mem topology to io_dir__readdir	Ian Rogers	1	-15/+16
	Switch memory_node__read and build_mem_topology from opendir/readdir to io_dir__readdir, with smaller stack allocations. Reduces peak memory consumption of perf record by 10kb. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-5-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-25	perf pmu: Switch to io_dir__readdir	Ian Rogers	2	-46/+30
	Avoid DIR allocations when scanning sysfs by using io_dir for the readdir implementation, that allocates about 1kb on the stack. Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-4-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-25	perf maps: Switch modules tree walk to io_dir__readdir	Ian Rogers	1	-17/+8
	Compared to glibc's opendir/readdir this lowers the max RSS of perf record by 1.8MB on a Debian machine. Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-3-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-21	perf parse-events: Tidy name token matching	Ian Rogers	1	-19/+32
	Prior to commit 70c90e4a6b2f ("perf parse-events: Avoid scanning PMUs before parsing") names (generally event names) excluded hyphen (minus) symbols as the formation of legacy names with hyphens was handled in the yacc code. That commit allowed hyphens supposedly making name_minus unnecessary. However, changing name_minus to name has issues in the term config tokens as then name ends up having priority over numbers and name allows matching numbers since commit 5ceb57990bf4 ("perf parse: Allow tracepoint names to start with digits "). It is also permissable for a name to match with a colon (':') in it when its in a config term list. To address this rename name_minus to term_name, make the pattern match name's except for the colon, add number matching into the config term region with a higher priority than name matching. This addresses an inconsistency and allows greater matching for names inside of term lists, for example, they may start with a number. Rename name_tag to quoted_name and update comments and helper functions to avoid str detecting quoted strings which was already done by the lexer. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250109175401.161340-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-20	perf tools: Improve startup time by reducing unnecessary stat() calls	Krzysztof Łopatowski	1	-6/+12
	When testing perf trace on NixOS, I noticed significant startup delays: - `ls`: ~2ms - `strace ls`: ~10ms - `perf trace ls`: ~550ms Profiling showed that 51% of the time is spent reading files, 26% in loading BPF programs, and 11% in `newfstatat`. This patch optimizes module path exploration by avoiding `stat()` calls unless necessary. For filesystems that do not implement `d_type` (DT_UNKNOWN), it falls back to the old behavior. See `readdir(3)` for details. This reduces `perf trace ls` time to ~500ms. A more thorough startup optimization based on command parameters would be ideal, but that is a larger effort. Signed-off-by: Krzysztof Łopatowski <krzysztof.m.lopatowski@gmail.com> Acked-by: Howard Chu <howardchu95@gmail.com> Link: https://lore.kernel.org/r/20250206113314.335376-2-krzysztof.m.lopatowski@gmail.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-20	perf tools: Fix up some comments and code to properly use the event_source bus	Greg Kroah-Hartman	2	-3/+3
	In sysfs, the perf events are all located in /sys/bus/event_source/devices/ but some places ended up hard-coding the location to be at the root of /sys/devices/ which could be very risky as you do not exactly know what type of device you are accessing in sysfs at that location. So fix this all up by properly pointing everything at the bus device list instead of the root of the sysfs devices/ tree. Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/2025021955-implant-excavator-179d@gregkh Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19	perf hist: Shrink struct hist_entry size	Dmitry Vyukov	2	-5/+5
	Reorder the struct fields by size to reduce paddings and reduce struct simd_flags size from 8 to 1 byte. This reduces struct hist_entry size by 8 bytes (592->584), and leaves a single more usable 6 byte padding hole. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/7c1cb1c8f9901e945162701ba7269d0f9c70be89.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19	perf report: Add --latency flag	Dmitry Vyukov	4	-8/+32
	Add record/report --latency flag that allows to capture and show latency-centric profiles rather than the default CPU-consumption-centric profiles. For latency profiles record captures context switch events, and report shows Latency as the first column. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/e9640464bcbc47dde2cb557003f421052ebc9eec.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19	perf report: Add latency output field	Dmitry Vyukov	6	-14/+65
	Latency output field is similar to overhead, but represents overhead for latency rather than CPU consumption. It's re-scaled from overhead by dividing weight by the current parallelism level at the time of the sample. It effectively models profiling with 1 sample taken per unit of wall-clock time rather than unit of CPU time. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/b6269518758c2166e6ffdc2f0e24cfdecc8ef9c1.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19	perf report: Add parallelism filter	Dmitry Vyukov	6	-1/+87
	Add parallelism filter that can be used to look at specific parallelism levels only. The format is the same as cpu lists. For example: Only single-threaded samples: --parallelism=1 Low parallelism only: --parallelism=1-4 High parallelism only: --parallelism=64-128 Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/e61348985ff0a6a14b07c39e880edbd60a8f8635.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19	perf report: Switch filtered from u8 to u16	Dmitry Vyukov	3	-3/+5
	We already have all u8 bits taken, adding one more filter leads to unpleasant failure mode, where code compiles w/o warnings, but the last filters silently don't work. Add a typedef and switch to u16. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/32b4ce1731126c88a2d9e191dc87e39ae4651cb7.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-18	perf report: Add parallelism sort key	Dmitry Vyukov	6	-0/+42
	Show parallelism level in profiles if requested by user. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/7f7bb87cbaa51bf1fb008a0d68b687423ce4bad4.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-18	perf report: Add machine parallelism	Dmitry Vyukov	5	-0/+19
	Add calculation of the current parallelism level (number of threads actively running on CPUs). The parallelism level can be shown in reports on its own, and to calculate latency overheads. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/0f8c1b8eb12619029e31b3d5c0346f4616a5aeda.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-13	perf sample: Make user_regs and intr_regs optional	Ian Rogers	18	-147/+351
	The struct dump_regs contains 512 bytes of cache_regs, meaning the two values in perf_sample contribute 1088 bytes of its total 1384 bytes size. Initializing this much memory has a cost reported by Tavian Barnes <tavianator@tavianator.com> as about 2.5% when running `perf script --itrace=i0`: https://lore.kernel.org/lkml/d841b97b3ad2ca8bcab07e4293375fb7c32dfce7.1736618095.git.tavianator@tavianator.com/ Adrian Hunter <adrian.hunter@intel.com> replied that the zero initialization was necessary and couldn't simply be removed. This patch aims to strike a middle ground of still zeroing the perf_sample, but removing 79% of its size by make user_regs and intr_regs optional pointers to zalloc-ed memory. To support the allocation accessors are created for user_regs and intr_regs. To support correct cleanup perf_sample__init and perf_sample__exit functions are created and added throughout the code base. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250113194345.1537821-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-13	perf tools: Use symfs when opening debuginfo by path	Namhyung Kim	1	-1/+5
	I found that it failed to load a binary using --symfs option. Say I have a binary in /home/user/prog/xxx and a perf data file with it. If I move them to a different machine and use --symfs, it tries to find the binary in some locations under symfs using dso__read_binary_type_filename(), but not the last one. ${symfs}/usr/lib/debug/home/user/prog/xxx.debug ${symfs}/usr/lib/debug/home/user/prog/xxx ${symfs}/home/user/prog/.debug/xxx /home/user/prog/xxx It should check ${symfs}/home/usr/prog/xxx. Let's fix it. Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250212221445.437481-1-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-13	perf tools: Get rid of now-unused rb_resort.h	Namhyung Kim	1	-146/+0
	It was only used in perf trace and it switched to use hashmap instead. Let's delete the code. Reviewed-by: Howard Chu <howardchu95@gmail.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20250205205443.1986408-4-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-10	perf tools: Add skip check in tool_pmu__event_to_str()	Kan Liang	2	-1/+4
	Some topdown related metrics may fail on hybrid machines. $ perf stat -M tma_frontend_bound Cannot resolve IDs for tma_frontend_bound: cpu_atom@TOPDOWN_FE_BOUND.ALL@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@) In the find_tool_events(), the tool_pmu__event_to_str() is used to compare the tool_events. It only checks the event name, no PMU or arch. So the tool_events[TOOL_PMU__EVENT_SLOTS] is set to true, because the p-core Topdown metrics has "slots" event. The tool_events is shared. So when parsing the e-core metrics, the "slots" is automatically added. The "slots" event as a tool event should only be available on arm64. It has a different meaning on X86. The tool_pmu__skip_event() intends handle the case. Apply it for tool_pmu__event_to_str() as well. There is a lack of sanity check in the expr__get_id(). Add the check. Closes: https://lore.kernel.org/lkml/608077bc-4139-4a97-8dc4-7997177d95c4@linux.intel.com/ Fixes: 069057239a67 ("perf tool_pmu: Move expr literals to tool_pmu") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: thomas.falcon@intel.com Link: https://lore.kernel.org/r/20250207152844.302167-1-kan.liang@linux.intel.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-10	perf tools: Deadcode removal	Dr. David Alan Gilbert	4	-36/+0
	The last use of machine__fprintf_vmlinux_path() was removed in 2011 by commit ab81f3fd350c ("perf top: Reuse the 'report' hist_entry/hists classes") mmap_cpu_mask__duplicate() was added in 2021 by commit 6bd006c6eb7f ("perf mmap: Introduce mmap_cpu_mask__duplicate()") but hasn't been used since. Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Tested-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250204220545.456435-1-linux@treblig.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-06	Merge tag 'v6.14-rc1' into perf-tools-next	Namhyung Kim	5	-87/+102
	To get the various fixes in the current master. Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-05	perf stat: Changes to event name uniquification	Ian Rogers	2	-33/+79
	The existing logic would disable uniquification on an evlist or enable it per evsel, this is unfortunate as uniquification is most needed when events have the same name and so the whole evlist must be considered. Change the initial disable uniquify on an evlist processing to also set a needs_uniquify flag, for cases like the matching event names. This must be done as an initial pass as uniquification of an event name will change the behavior of the check. Keep the per counter uniquification but now only uniquify event names when the needs_uniquify flag is set. Before this change a hwmon like temp1 wouldn't be uniquified and afterwards it will (ie the PMU is added to the temp1 event's name). Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250201074320.746259-6-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-05	perf stat: Don't merge counters purely on name	Ian Rogers	1	-2/+11
	Counter merging was added in commit 942c5593393d ("perf stat: Add perf_stat_merge_counters()"), however, it merges events with the same name on different PMUs regardless of whether the different PMUs are actually of the same type (ie they differ only in the suffix on the PMU). For hwmon events there may be a temp1 event on every PMU, but the PMU names are all unique and don't differ just by a suffix. The merging is over eager and will merge all the hwmon counters together meaning an aggregated and very large temp1 value is shown. The same would be true for say cache events and memory controller events where the PMUs differ but the event names are the same. Fix the problem by correctly saying two PMUs alias when they differ only by suffix. Note, there is an overlap with evsel's merged_stat with aggregation and the evsel's metric_leader where aggregation happens for metrics. Fixes: 942c5593393d ("perf stat: Add perf_stat_merge_counters()") Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250201074320.746259-5-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-05	perf pmu: Rename name matching for no suffix or wildcard variants	Ian Rogers	3	-80/+183
	Wildcard PMU naming will match a name like pmu_1 to a PMU name like pmu_10 but not to a PMU name like pmu_2 as the suffix forms part of the match. No suffix matching will match pmu_10 to either pmu_1 or pmu_2. Add or rename matching functions on PMU to make it clearer what kind of matching is being performed. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250201074320.746259-4-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-05	perf pmus: Restructure pmu_read_sysfs to scan fewer PMUs	Ian Rogers	2	-51/+97
	Rather than scanning core or all PMUs, allow pmu_read_sysfs to read some combination of core, other, hwmon and tool PMUs. The PMUs that should be read and are already read are held as bitmaps. It is known that a "hwmon_" prefix is necessary for a hwmon PMU's name, similarly with "tool", so only scan those PMUs in situations the PMU name or the PMU's type number make sense to. The number of openat system calls reduces from 276 to 98 for a hwmon event. The number of openats for regular perf events isn't changed. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250201074320.746259-3-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-05	perf evsel: Reduce scanning core PMUs in is_hybrid	Ian Rogers	1	-2/+2
	evsel__is_hybrid returns true if there are multiple core PMUs and the evsel is for a core PMU. Determining the number of core PMUs can require loading/scanning PMUs. There's no point doing the scanning if evsel for the is_hybrid test isn't core so reorder the tests to reduce PMU scanning. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250201074320.746259-2-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-05	perf test: Fix Hwmon PMU test endianess issue	Thomas Richter	2	-14/+16
	perf test 11 hwmon fails on s390 with this error # ./perf test -Fv 11 --- start --- ---- end ---- 11.1: Basic parsing test : Ok --- start --- Testing 'temp_test_hwmon_event1' Using CPUID IBM,3931,704,A01,3.7,002f temp_test_hwmon_event1 -> hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/ FAILED tests/hwmon_pmu.c:189 Unexpected config for 'temp_test_hwmon_event1', 292470092988416 != 655361 ---- end ---- 11.2: Parsing without PMU name : FAILED! --- start --- Testing 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/' FAILED tests/hwmon_pmu.c:189 Unexpected config for 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/', 292470092988416 != 655361 ---- end ---- 11.3: Parsing with PMU name : FAILED! # The root cause is in member test_event::config which is initialized to 0xA0001 or 655361. During event parsing a long list event parsing functions are called and end up with this gdb call stack: #0 hwmon_pmu__config_term (hwm=0x168dfd0, attr=0x3ffffff5ee8, term=0x168db60, err=0x3ffffff81c8) at util/hwmon_pmu.c:623 #1 hwmon_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, err=0x3ffffff81c8) at util/hwmon_pmu.c:662 #2 0x00000000012f870c in perf_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, zero=false, apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1519 #3 0x00000000012f88a4 in perf_pmu__config (pmu=0x168dfd0, attr=0x3ffffff5ee8, head_terms=0x3ffffff5ea8, apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1545 #4 0x00000000012680c4 in parse_events_add_pmu (parse_state=0x3ffffff7fb8, list=0x168dc00, pmu=0x168dfd0, const_parsed_terms=0x3ffffff6090, auto_merge_stats=true, alternate_hw_config=10) at util/parse-events.c:1508 #5 0x00000000012684c6 in parse_events_multi_pmu_add (parse_state=0x3ffffff7fb8, event_name=0x168ec10 "temp_test_hwmon_event1", hw_config=10, const_parsed_terms=0x0, listp=0x3ffffff6230, loc_=0x3ffffff70e0) at util/parse-events.c:1592 #6 0x00000000012f0e4e in parse_events_parse (_parse_state=0x3ffffff7fb8, scanner=0x16878c0) at util/parse-events.y:293 #7 0x00000000012695a0 in parse_events__scanner (str=0x3ffffff81d8 "temp_test_hwmon_event1", input=0x0, parse_state=0x3ffffff7fb8) at util/parse-events.c:1867 #8 0x000000000126a1e8 in __parse_events (evlist=0x168b580, str=0x3ffffff81d8 "temp_test_hwmon_event1", pmu_filter=0x0, err=0x3ffffff81c8, fake_pmu=false, warn_if_reordered=true, fake_tp=false) at util/parse-events.c:2136 #9 0x00000000011e36aa in parse_events (evlist=0x168b580, str=0x3ffffff81d8 "temp_test_hwmon_event1", err=0x3ffffff81c8) at /root/linux/tools/perf/util/parse-events.h:41 #10 0x00000000011e3e64 in do_test (i=0, with_pmu=false, with_alias=false) at tests/hwmon_pmu.c:164 #11 0x00000000011e422c in test__hwmon_pmu (with_pmu=false) at tests/hwmon_pmu.c:219 #12 0x00000000011e431c in test__hwmon_pmu_without_pmu (test=0x1610368 <suite.hwmon_pmu>, subtest=1) at tests/hwmon_pmu.c:23 where the attr::config is set to value 292470092988416 or 0x10a0000000000 in line 625 of file ./util/hwmon_pmu.c: attr->config = key.type_and_num; However member key::type_and_num is defined as union and bit field: union hwmon_pmu_event_key { long type_and_num; struct { int num :16; enum hwmon_type type :8; }; }; s390 is big endian and Intel is little endian architecture. The events for the hwmon dummy pmu have num = 1 or num = 2 and type is set to HWMON_TYPE_TEMP (which is 10). On s390 this assignes member key::type_and_num the value of 0x10a0000000000 (which is 292470092988416) as shown in above trace output. Fix this and export the structure/union hwmon_pmu_event_key so the test shares the same implementation as the event parsing functions for union and bit fields. This should avoid endianess issues on all platforms. Output after: # ./perf test -F 11 11.1: Basic parsing test : Ok 11.2: Parsing without PMU name : Ok 11.3: Parsing with PMU name : Ok # Fixes: 531ee0fd4836 ("perf test: Add hwmon "PMU" test") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250131112400.568975-1-tmricht@linux.ibm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-01	perf: Always feature test reallocarray	James Clark	1	-0/+2
	This is also used in util/comm.c now, so instead of selectively doing the feature test, always do it. If it's ever used anywhere else it's less likely to cause another build failure. This doesn't remove the need to manually include libc_compat.h, and missing that will still cause an error for glibc < 2.26. There isn't a way to fix that without poisoning reallocarray like libbpf did, but that has other downsides like making memory debugging tools less useful. So for Perf keep it like this and we'll have to fix up any missed includes. Fixes the following build error: util/comm.c:152:31: error: implicit declaration of function 'reallocarray' [-Wimplicit-function-declaration] 152 \| tmp = reallocarray(comm_strs->strs, \| ^~~~~~~~~~~~ Fixes: 13ca628716c6 ("perf comm: Add reference count checking to 'struct comm_str'") Reported-by: Ali Utku Selen <ali.utku.selen@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250129154405.777533-1-james.clark@linaro.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-30	perf stat: Fix find_stat for mixed legacy/non-legacy events	Ian Rogers	2	-4/+19
	Legacy events typically don't have a PMU when added leading to mismatched legacy/non-legacy cases in find_stat. Use evsel__find_pmu to make sure the evsel PMU is looked up. Update the evsel__find_pmu code to look for the PMU using the extended config type or, for legacy hardware/hw_cache events on non-hybrid systems, just use the core PMU. Before: ``` $ perf stat -e cycles,cpu/instructions/ -a sleep 1 Performance counter stats for 'system wide': 215,309,764 cycles 44,326,491 cpu/instructions/ 1.002555314 seconds time elapsed ``` After: ``` $ perf stat -e cycles,cpu/instructions/ -a sleep 1 Performance counter stats for 'system wide': 990,676,332 cycles 1,235,762,487 cpu/instructions/ # 1.25 insn per cycle 1.002667198 seconds time elapsed ``` Fixes: 3612ca8e2935 ("perf stat: Fix the hard-coded metrics calculation on the hybrid") Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: James Clark <james.clark@linaro.org> Tested-by: Leo Yan <leo.yan@arm.com> Tested-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20250109222109.567031-3-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-30	perf evsel: Add pmu_name helper	Ian Rogers	2	-0/+11
	Add helper to get the name of the evsel's PMU. This handles the case where there's no sysfs PMU via parse_events event_type helper. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: James Clark <james.clark@linaro.org> Tested-by: Leo Yan <leo.yan@arm.com> Tested-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20250109222109.567031-2-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-28	perf cpumap: Fix die and cluster IDs	James Clark	1	-2/+2
	Now that filename__read_int() returns -errno instead of -1 these statements need to be updated otherwise error values will be used as die IDs. This appears as a -2 die ID when the platform doesn't export one: $ perf stat --per-core -a -- true S36-D-2-C0 1 9.45 msec cpu-clock And the session topology test fails: $ perf test -vvv topology CPU 0, core 0, socket 36 CPU 1, core 1, socket 36 CPU 2, core 2, socket 36 CPU 3, core 3, socket 36 FAILED tests/topology.c:137 Cpu map - Die ID doesn't match ---- end(-1) ---- 38: Session topology : FAILED! Fixes: 05be17eed774 ("tool api fs: Correctly encode errno for read/write open failures") Reported-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: James Clark <james.clark@linaro.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241218115552.912517-1-james.clark@linaro.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-28	perf annotate: Use an array for the disassembler preference	Ian Rogers	3	-78/+96
	Prior to this change a string was used which could cause issues with an unrecognized disassembler in symbol__disassembler. Change to initializing an array of perf_disassembler enum values. If a value already exists then adding it a second time is ignored to avoid array out of bounds problems present in the previous code, it also allows a statically sized array and removes memory allocation needs. Errors in the disassembler string are reported when the config is parsed during perf annotate or perf top start up. If the array is uninitialized after processing the config file the default llvm, capstone then objdump values are added but without a need to parse a string. Fixes: a6e8a58de629 ("perf disasm: Allow configuring what disassemblers to use") Closes: https://lore.kernel.org/lkml/CAP-5=fUdfCyxmEiTpzS2uumUp3-SyQOseX2xZo81-dQtWXj6vA@mail.gmail.com/ Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250124043856.1177264-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-24	perf trace: Fix BPF loading failure (-E2BIG)	Howard Chu	1	-7/+4
	As reported by Namhyung Kim and acknowledged by Qiao Zhao (link: https://lore.kernel.org/linux-perf-users/20241206001436.1947528-1-namhyung@kernel.org/), on certain machines, perf trace failed to load the BPF program into the kernel. The verifier runs perf trace's BPF program for up to 1 million instructions, returning an E2BIG error, whereas the perf trace BPF program should be much less complex than that. This patch aims to fix the issue described above. The E2BIG problem from clang-15 to clang-16 is cause by this line: } else if (size < 0 && size >= -6) { /* buffer / Specifically this check: size < 0. seems like clang generates a cool optimization to this sign check that breaks things. Making 'size' s64, and use } else if ((int)size < 0 && size >= -6) { / buffer / Solves the problem. This is some Hogwarts magic. And the unbounded access of clang-12 and clang-14 (clang-13 works this time) is fixed by making variable 'aug_size' s64. As for this: -if (aug_size > TRACE_AUG_MAX_BUF) - aug_size = TRACE_AUG_MAX_BUF; +aug_size = args->args[index] > TRACE_AUG_MAX_BUF ? TRACE_AUG_MAX_BUF : args->args[index]; This makes the BPF skel generated by clang-18 work. Yes, new clangs introduce problems too. Sorry, I only know that it works, but I don't know how it works. I'm not an expert in the BPF verifier. I really hope this is not a kernel version issue, as that would make the test case (kernel_nr) (clang_nr), a true horror story. I will test it on more kernel versions in the future. Fixes: 395d38419f18: ("perf trace augmented_raw_syscalls: Add more check s to pass the verifier") Reported-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241213023047.541218-1-howardchu95@gmail.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-18	perf annotate: Prefer passing evsel to evsel->core.idx	Ian Rogers	2	-26/+26
	An evsel idx may not be stable due to sorting, evlist removal, etc. Try to reduce it being part of APIs by explicitly passing the evsel in annotate code. Internally the code just reads evsel->core.idx so behavior is unchanged. Signed-off-by: Ian Rogers <irogers@google.com> Cc: Chen Ni <nichen@iscas.ac.cn> Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Link: https://lore.kernel.org/r/20250117181848.690474-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-17	perf hist: Fix bogus profiles when filters are enabled	Dmitry Vyukov	1	-1/+10
	When a filtered column is not present in the sort order, profiles become arbitrary broken. Filtered and non-filtered entries are collapsed together, and the filtered-by field ends up with a random value (either from a filtered or non-filtered entry). If we end up with filtered entry/value, then the whole collapsed entry will be filtered out and will be missing in the profile. If we end up with non-filtered entry/value, then the overhead value will be wrongly larger (include some subset of filtered out samples). This leads to very confusing profiles. The problem is hard to notice, and if noticed hard to understand. If the filter is for a single value, then it can be fixed by adding the corresponding field to the sort order (provided user understood the problem). But if the filter is for multiple values, it's impossible to fix b/c there is no concept of binary sorting based on filter predicate (we want to group all non-filtered values in one bucket, and all filtered values in another). Examples of affected commands: perf report --tid=123 perf report --sort overhead,symbol --comm=foo,bar Fix this by considering filtered status as the highest priority sort/collapse predicate. As a side effect this effectively adds a new feature of showing profile where several lines are combined based on arbitrary filtering predicate. For example, showing symbols from binaries foo and bar combined together, but not from other binaries; or showing combined overhead of several particular threads. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/r/359dc444ce94d20e59d3a9e360c36fbeac833a04.1736927981.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-17	perf hist: Deduplicate cmp/sort/collapse code	Dmitry Vyukov	2	-68/+49
	Application of cmp/sort/collapse fmt callbacks is duplicated 6 times. Factor it into a common helper function. NFC. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/r/84c4b55614e24a344f86ae0db62e8fa8f251f874.1736927981.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-14	perf config: Add a function to set one variable in .perfconfig	Arnaldo Carvalho de Melo	1	-0/+1
	To allow for setting a variable from some other tool, like with the "wallclock" patchset needs to allow the user to opt-in to having that key in the sort order for 'perf report'. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/lkml/Z4akewi7UPXpagce@x1 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>