Coresight - HW Assisted Tracing on ARM
======================================
Author: Mathieu Poirier <mathieu.poirier@linaro.org>
Date: September 11th, 2014
Introduction
------------
Coresight is an umbrella of technologies allowing for the debugging of ARM
based SoC. It includes solutions for JTAG and HW assisted tracing. This
document is concerned with the latter.
HW assisted tracing is becoming increasingly useful when dealing with systems
that have many SoCs and other components like GPU and DMA engines. ARM has
developed a HW assisted tracing solution by means of different components, each
being added to a design at synthesis time to cater to specific tracing needs.
Components are generally categorised as sources, links and sinks and are
(usually) discovered using the AMBA bus.
"Sources" generate a compressed stream representing the processor instruction
path based on tracing scenarios as configured by users. From there the stream
flows through the coresight system (via the ATB bus) using links that connect
the emanating source to one or more sinks. Sinks serve as endpoints to the coresight
implementation, either storing the compressed stream in a memory buffer or
creating an interface to the outside world where data can be transferred to a
host without fear of filling up the onboard coresight memory buffer.
A typical coresight system would look like this:
  *****************************************************************
 **************************** AMBA AXI  ****************************===||
  *****************************************************************    ||
        ^                    ^                            |            ||
        |                    |                            *            **
     0000000    :::::     0000000    :::::    :::::    @@@@@@@    ||||||||||||
     0 CPU 0<-->: C :     0 CPU 0<-->: C :    : C :    @ STM @    || System ||
  |->0000000    : T :  |->0000000    : T :    : T :<--->@@@@@     || Memory ||
  |  #######<-->: I :  |  #######<-->: I :    : I :      @@@<-|   ||||||||||||
  |  # ETM #    :::::  |  # PTM #    :::::    :::::       @   |
  |   #####      ^ ^   |   #####      ^ !      ^ !        .   |   |||||||||
  | |->###       | !   | |->###       | !      | !        .   |   || DAP ||
  | |   #        | !   | |   #        | !      | !        .   |   |||||||||
  | |   .        | !   | |   .        | !      | !        .   |      |  |
  | |   .        | !   | |   .        | !      | !        .   |      |  *
  | |   .        | !   | |   .        | !      | !        .   |      | SWD/
  | |   .        | !   | |   .        | !      | !        .   |      | JTAG
  *****************************************************************<-|
 *************************** AMBA Debug APB ************************
  *****************************************************************
   |    .          !         .          !        !        .    |
   |    .          *         .          *        *        .    |
  *****************************************************************
 ******************** Cross Trigger Matrix (CTM) *******************
  *****************************************************************
   |    .     ^              .                            .    |
   |    *     !              *                            *    |
  *****************************************************************
 ****************** AMBA Advanced Trace Bus (ATB) ******************
  *****************************************************************
   |           !                        ===============        |
   |           *                         ===== F =====<---------|
   |   :::::::::                         ==== U ====
   |-->:: CTI ::<!!                       === N ===
   |   :::::::::  !                        == N ==
   |    ^         *                        == E ==
   |    !  &&&&&&&&&       IIIIIII         == L ==
   |------>&& ETB &&<......II     I        =======
   |    !  &&&&&&&&&       II     I           .
   |    !                    I     I          .
   |    !                    I REP I<..........
   |    !                    I     I
   |    !!>&&&&&&&&&       II     I           *Source: ARM ltd.
   |------>& TPIU  &<......II    I            DAP = Debug Access Port
           &&&&&&&&&       IIIIIII            ETM = Embedded Trace Macrocell
             ;                                PTM = Program Trace Macrocell
             ;                                CTI = Cross Trigger Interface
             *                                ETB = Embedded Trace Buffer
        To trace port                         TPIU= Trace Port Interface Unit
                                              SWD = Serial Wire Debug
While on-target configuration of the components is done via the APB bus,
all trace data are carried out-of-band on the ATB bus. The CTM provides
a way to aggregate and distribute signals between CoreSight components.
The coresight framework provides a central point to represent, configure and
manage coresight devices on a platform. This first implementation centers on
the basic tracing functionality, enabling components such as ETM/PTM, funnel,
replicator, TMC, TPIU and ETB. Future work will enable more
intricate IP blocks such as STM and CTI.
Acronyms and Classification
---------------------------
Acronyms:
PTM: Program Trace Macrocell
ETM: Embedded Trace Macrocell
STM: System Trace Macrocell
ETB: Embedded Trace Buffer
ITM: Instrumentation Trace Macrocell
TPIU: Trace Port Interface Unit
TMC-ETR: Trace Memory Controller, configured as Embedded Trace Router
TMC-ETF: Trace Memory Controller, configured as Embedded Trace FIFO
CTI: Cross Trigger Interface
Classification:
Source:
ETMv3.x, ETMv4, PTMv1.0, PTMv1.1, STM, STM500, ITM
Link:
Funnel, replicator (intelligent or not), TMC-ETR
Sinks:
ETBv1.0, ETBv1.1, TPIU, TMC-ETF
Misc:
CTI
Device Tree Bindings
----------------------
See Documentation/devicetree/bindings/arm/coresight.txt for details.
As of this writing drivers for ITM, STMs and CTIs are not provided but are
expected to be added as the solution matures.
Framework and implementation
----------------------------
The coresight framework provides a central point to represent, configure and
manage coresight devices on a platform. Any coresight compliant device can
register with the framework as long as it uses the right APIs:
struct coresight_device *coresight_register(struct coresight_desc *desc);
void coresight_unregister(struct coresight_device *csdev);
The registering function takes a "struct coresight_desc *desc" and registers
the device with the core framework. The unregister function takes a reference
to a "struct coresight_device *csdev", obtained at registration time.
If everything goes well during the registration process the new devices will
show up under /sys/bus/coresight/devices, as shown here for a TC2 platform:
root:~# ls /sys/bus/coresight/devices/
replicator 20030000.tpiu 2201c000.ptm 2203c000.etm 2203e000.etm
20010000.etb 20040000.funnel 2201d000.ptm 2203d000.etm
root:~#
The registration function takes a "struct coresight_desc", which looks like this:
struct coresight_desc {
enum coresight_dev_type type;
struct coresight_dev_subtype subtype;
const struct coresight_ops *ops;
struct coresight_platform_data *pdata;
struct device *dev;
const struct attribute_group **groups;
};
The "coresight_dev_type" identifies what the device is, i.e., source, link or
sink, while the "coresight_dev_subtype" characterises that type further.
The "struct coresight_ops" is mandatory and tells the framework how to
perform base operations related to the components, each component having
a different set of requirements. For that purpose "struct coresight_ops_sink",
"struct coresight_ops_link" and "struct coresight_ops_source" have been
provided.
The next field, "struct coresight_platform_data *pdata" is acquired by calling
"of_get_coresight_platform_data()", as part of the driver's _probe routine and
"struct device *dev" gets the device reference embedded in the "amba_device":
static int etm_probe(struct amba_device *adev, const struct amba_id *id)
{
...
...
drvdata->dev = &adev->dev;
...
}
Each class of device (source, link, or sink) has generic operations
that can be performed on it (see "struct coresight_ops"). The
"**groups" field is a list of sysfs entries pertaining to operations
specific to that component only. "Implementation defined" customisations are
expected to be accessed and controlled using those entries.
How to use the tracer modules
-----------------------------
Before trace collection can start, a coresight sink needs to be identified.
There is no limit on the number of sinks (or sources) that can be enabled at
any given moment. As a generic operation, all devices pertaining to the sink
class have an "enable_sink" entry in sysfs:
root:/sys/bus/coresight/devices# ls
replicator 20030000.tpiu 2201c000.ptm 2203c000.etm 2203e000.etm
20010000.etb 20040000.funnel 2201d000.ptm 2203d000.etm
root:/sys/bus/coresight/devices# ls 20010000.etb
enable_sink status trigger_cntr
root:/sys/bus/coresight/devices# echo 1 > 20010000.etb/enable_sink
root:/sys/bus/coresight/devices# cat 20010000.etb/enable_sink
1
root:/sys/bus/coresight/devices#
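Since several sinks may be enabled at once, it can help to check which ones
are currently active before starting a capture. The following is a minimal
sketch, not part of the framework: the function name "cs_enabled_sinks" and
the CS_DEVICES variable are hypothetical, and the only assumption about sysfs
is the "enable_sink" entry shown above.

```shell
#!/bin/sh
# CS_DEVICES is a hypothetical override; it defaults to the sysfs
# directory used throughout this document.
CS_DEVICES="${CS_DEVICES:-/sys/bus/coresight/devices}"

# Print the name of every device whose enable_sink entry reads 1.
cs_enabled_sinks() {
    for f in "$CS_DEVICES"/*/enable_sink; do
        # Skip the unexpanded glob when no sink-class device is present.
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" = "1" ]; then
            basename "$(dirname "$f")"
        fi
    done
}
```

On the TC2 listing above, calling cs_enabled_sinks after the echo would be
expected to print 20010000.etb.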
At boot time the current etm3x driver will configure the first address
comparator with "_stext" and "_etext", essentially tracing any instruction
that falls within that range. As such "enabling" a source will immediately
trigger a trace capture:
root:/sys/bus/coresight/devices# echo 1 > 2201c000.ptm/enable_source
root:/sys/bus/coresight/devices# cat 2201c000.ptm/enable_source
1
root:/sys/bus/coresight/devices# cat 20010000.etb/status
Depth: 0x2000
Status: 0x1
RAM read ptr: 0x0
RAM wrt ptr: 0x19d3 <----- The write pointer is moving
Trigger cnt: 0x0
Control: 0x1
Flush status: 0x0
Flush ctrl: 0x2001
root:/sys/bus/coresight/devices#
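Whether the write pointer is moving can also be checked from a script by
sampling the status entry twice. This is a sketch only: "etb_wrt_ptr" is a
hypothetical helper, and it assumes the status text keeps the field layout
of the sample output above.

```shell
#!/bin/sh
# Print the "RAM wrt ptr" value from ETB status text read on stdin.
# Assumes the layout shown in the sample output (label, then hex value).
etb_wrt_ptr() {
    awk '/^RAM wrt ptr:/ { print $4 }'
}

# Intended use on a target (not executed here):
#   s=/sys/bus/coresight/devices/20010000.etb/status
#   before=$(etb_wrt_ptr < "$s"); sleep 1; after=$(etb_wrt_ptr < "$s")
#   [ "$before" != "$after" ] && echo "write pointer is moving"
```

Two equal samples do not prove that nothing was traced, since the pointer
may have wrapped around between reads.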
Trace collection is stopped the same way:
root:/sys/bus/coresight/devices# echo 0 > 2201c000.ptm/enable_source
root:/sys/bus/coresight/devices#
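The enable/disable sequence above can be wrapped around a workload. The
sketch below is illustrative only: "cs_capture" and CS_DEVICES are
hypothetical names, and it relies solely on the "enable_sink" and
"enable_source" entries shown in this section. The workload presumably has
to run on the CPU the chosen source is attached to for anything of interest
to be captured.

```shell
#!/bin/sh
# CS_DEVICES is a hypothetical override for the sysfs directory.
CS_DEVICES="${CS_DEVICES:-/sys/bus/coresight/devices}"

# cs_capture SINK SOURCE COMMAND [ARGS...]
# Enable the sink, enable the source, run the command, then stop tracing.
cs_capture() {
    sink=$1
    source_dev=$2
    shift 2
    echo 1 > "$CS_DEVICES/$sink/enable_sink" || return 1
    echo 1 > "$CS_DEVICES/$source_dev/enable_source" || return 1
    "$@"
    status=$?
    # Stop the source so the sink content can be read back afterwards.
    echo 0 > "$CS_DEVICES/$source_dev/enable_source"
    return "$status"
}
```

For example, "cs_capture 20010000.etb 2201c000.ptm taskset -c 0 ./workload"
would trace while ./workload runs; the program name and CPU number are
placeholders.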
The content of the ETB buffer can be harvested directly from /dev:
root:/sys/bus/coresight/devices# dd if=/dev/20010000.etb \
of=~/cstrace.bin
64+0 records in
64+0 records out
32768 bytes (33 kB) copied, 0.00125258 s, 26.2 MB/s
root:/sys/bus/coresight/devices#
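The size reported by dd is consistent with the "Depth" value in the status
output, assuming the depth counts 32-bit words (an assumption, not stated
above): 0x2000 words of 4 bytes each is 32768 bytes, which is also 64 records
of dd's default 512-byte block. A small sketch of that arithmetic, with a
hypothetical helper name:

```shell
#!/bin/sh
# Convert an ETB depth into bytes.
# Assumption: the depth is expressed in 32-bit (4-byte) words.
etb_depth_bytes() {
    echo $(( $1 * 4 ))
}

etb_depth_bytes 0x2000    # size implied by the status output
echo $(( 64 * 512 ))      # 64 dd records of 512 bytes each
```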
The file cstrace.bin can be decompressed using "ptm2human", DS-5 or Trace32.
Following is a DS-5 output of an experimental loop that increments a variable up
to a certain value. The example is simple and yet provides a glimpse of the
wealth of possibilities that coresight provides.
Info Tracing enabled
Instruction 106378866 0x8026B53C E52DE004 false PUSH {lr}
Instruction 0 0x8026B540 E24DD00C false SUB sp,sp,#0xc
Instruction 0 0x8026B544 E3A03000 false MOV r3,#0
Instruction 0 0x8026B548 E58D3004 false STR r3,[sp,#4]
Instruction 0 0x8026B54C E59D3004 false LDR r3,[sp,#4]
Instruction 0 0x8026B550 E3530004 false CMP r3,#4
Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1
Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4]
Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c
Timestamp Timestamp: 17106715833
Instruction 319 0x8026B54C E59D3004 false LDR r3,[sp,#4]
Instruction 0 0x8026B550 E3530004 false CMP r3,#4
Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1
Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4]
Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c
Instruction 9 0x8026B54C E59D3004 false LDR r3,[sp,#4]
Instruction 0 0x8026B550 E3530004 false CMP r3,#4
Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1
Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4]
Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c
Instruction 7 0x8026B54C E59D3004 false LDR r3,[sp,#4]
Instruction 0 0x8026B550 E3530004 false CMP r3,#4
Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1
Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4]
Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c
Instruction 7 0x8026B54C E59D3004 false LDR r3,[sp,#4]
Instruction 0 0x8026B550 E3530004 false CMP r3,#4
Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1
Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4]
Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c
Instruction 10 0x8026B54C E59D3004 false LDR r3,[sp,#4]
Instruction 0 0x8026B550 E3530004 false CMP r3,#4
Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1
Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4]
Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c
Instruction 6 0x8026B560 EE1D3F30 false MRC p15,#0x0,r3,c13,c0,#1
Instruction 0 0x8026B564 E1A0100D false MOV r1,sp
Instruction 0 0x8026B568 E3C12D7F false BIC r2,r1,#0x1fc0
Instruction 0 0x8026B56C E3C2203F false BIC r2,r2,#0x3f
Instruction 0 0x8026B570 E59D1004 false LDR r1,[sp,#4]
Instruction 0 0x8026B574 E59F0010 false LDR r0,[pc,#16] ; [0x8026B58C] = 0x80550368
Instruction 0 0x8026B578 E592200C false LDR r2,[r2,#0xc]
Instruction 0 0x8026B57C E59221D0 false LDR r2,[r2,#0x1d0]
Instruction 0 0x8026B580 EB07A4CF true BL {pc}+0x1e9344 ; 0x804548c4
Info Tracing enabled
Instruction 13570831 0x8026B584 E28DD00C false ADD sp,sp,#0xc
Instruction 0 0x8026B588 E8BD8000 true LDM sp!,{pc}
Timestamp Timestamp: 17107041535
How to use the STM module
-------------------------
Using the System Trace Macrocell module is the same as the tracers - the only
difference is that clients are driving the trace capture rather
than the program flow through the code.
As with any other CoreSight component, specifics about the STM tracer can be
found in sysfs with more information on each entry being found in [1]:
root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm
enable_source hwevent_select port_enable subsystem uevent
hwevent_enable mgmt port_select traceid
root@genericarmv8:~#
Like any other source a sink needs to be identified and the STM enabled before
being used:
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source
From there user space applications can request and use channels using the devfs
interface provided for that purpose by the generic STM API:
root@genericarmv8:~# ls -l /dev/20100000.stm
crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm
root@genericarmv8:~#
Details on how to use the generic STM API can be found here [2].
[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
[2]. Documentation/trace/stm.txt
Using perf tools
----------------
perf can be used to record and analyze traces of programs.
Execution can be recorded using 'perf record' with the cs_etm event,
specifying the name of the sink to record to, e.g:
perf record -e cs_etm/@20070000.etr/u --per-thread
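The event string follows the pattern cs_etm/@<sink>/<modifier>. When the sink
name comes from a script, the string can be composed as sketched below;
"cs_etm_event" is a hypothetical helper, and the default modifier "u"
(user space only) is simply the one used in the example above.

```shell
#!/bin/sh
# cs_etm_event SINK [MODIFIER]
# Print a perf event string selecting SINK; MODIFIER defaults to "u".
cs_etm_event() {
    printf 'cs_etm/@%s/%s\n' "$1" "${2:-u}"
}

# Intended use on a target (not executed here):
#   perf record -e "$(cs_etm_event 20070000.etr)" --per-thread ./sort
```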
The 'perf report' and 'perf script' commands can be used to analyze execution,
synthesizing instruction and branch events from the instruction trace.
'perf inject' can be used to replace the trace data with the synthesized events.
The --itrace option controls the type and frequency of synthesized events
(see perf documentation).
Note that only 64-bit programs are currently supported - further work is
required to support instruction decode of 32-bit Arm programs.
Generating coverage files for Feedback Directed Optimization: AutoFDO
---------------------------------------------------------------------
'perf inject' accepts the --itrace option in which case tracing data is
removed and replaced with the synthesized events. e.g.
perf inject --itrace --strip -i perf.data -o perf.data.new
Below is an example of using ARM ETM for autoFDO. It requires autofdo
(https://github.com/google/autofdo) and gcc version 5. The bubble
sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial).
$ gcc-5 -O3 sort.c -o sort
$ taskset -c 2 ./sort
Bubble sorting array of 30000 elements
5910 ms
$ perf record -e cs_etm/@20070000.etr/u --per-thread taskset -c 2 ./sort
Bubble sorting array of 30000 elements
12543 ms
[ perf record: Woken up 35 times to write data ]
[ perf record: Captured and wrote 69.640 MB perf.data ]
$ perf inject -i perf.data -o inj.data --itrace=il64 --strip
$ create_gcov --binary=./sort --profile=inj.data --gcov=sort.gcov -gcov_version=1
$ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
$ taskset -c 2 ./sort_autofdo
Bubble sorting array of 30000 elements
5806 ms
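The timings above can be put in perspective with a little arithmetic: the
AutoFDO build is under 2% faster than the plain -O3 build, while recording
the trace roughly doubles the run time. The helper names below are
hypothetical and the numbers are the ones from this single run.

```shell
#!/bin/sh
# pct_faster BASE NEW: percentage reduction in run time from BASE to NEW.
pct_faster() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (a - b) * 100 / a }'
}

# slowdown BASE TRACED: run time while tracing as a multiple of BASE.
slowdown() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b / a }'
}

pct_faster 5910 5806    # prints 1.76
slowdown 5910 12543     # prints 2.12
```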