Diffstat (limited to 'Documentation/trace')
-rw-r--r-- Documentation/trace/boottime-trace.rst | 4
-rw-r--r-- Documentation/trace/coresight/coresight-perf.rst | 31
-rw-r--r-- Documentation/trace/coresight/coresight.rst | 41
-rw-r--r-- Documentation/trace/coresight/panic.rst | 362
-rw-r--r-- Documentation/trace/debugging.rst | 2
-rw-r--r-- Documentation/trace/eprobetrace.rst | 269
-rw-r--r-- Documentation/trace/events.rst | 24
-rw-r--r-- Documentation/trace/fprobe.rst | 42
-rw-r--r-- Documentation/trace/ftrace-design.rst | 12
-rw-r--r-- Documentation/trace/ftrace.rst | 30
-rw-r--r-- Documentation/trace/histogram.rst | 4
-rw-r--r-- Documentation/trace/index.rst | 98
-rw-r--r-- Documentation/trace/postprocess/decode_msr.py | 2
-rw-r--r-- Documentation/trace/rv/da_monitor_synthesis.rst | 147
-rw-r--r-- Documentation/trace/rv/index.rst | 5
-rw-r--r-- Documentation/trace/rv/linear_temporal_logic.rst | 134
-rw-r--r-- Documentation/trace/rv/monitor_rtapp.rst | 133
-rw-r--r-- Documentation/trace/rv/monitor_sched.rst | 402
-rw-r--r-- Documentation/trace/rv/monitor_synthesis.rst | 271
-rw-r--r-- Documentation/trace/rv/runtime-verification.rst | 4
-rw-r--r-- Documentation/trace/tracepoints.rst | 17
21 files changed, 1804 insertions, 230 deletions
diff --git a/Documentation/trace/boottime-trace.rst b/Documentation/trace/boottime-trace.rst
index d594597201fd..3efac10adb36 100644
--- a/Documentation/trace/boottime-trace.rst
+++ b/Documentation/trace/boottime-trace.rst
@@ -198,8 +198,8 @@ Most of the subsystems and architecture dependent drivers will be initialized
after that (arch_initcall or subsys_initcall). Thus, you can trace those with
boot-time tracing.
If you want to trace events before core_initcall, you can use the options
-starting with ``kernel``. Some of them will be enabled eariler than the initcall
-processing (for example,. ``kernel.ftrace=function`` and ``kernel.trace_event``
+starting with ``kernel``. Some of them will be enabled earlier than the initcall
+processing (for example, ``kernel.ftrace=function`` and ``kernel.trace_event``
will start before the initcall.)
diff --git a/Documentation/trace/coresight/coresight-perf.rst b/Documentation/trace/coresight/coresight-perf.rst
index d087aae7d492..30be89320621 100644
--- a/Documentation/trace/coresight/coresight-perf.rst
+++ b/Documentation/trace/coresight/coresight-perf.rst
@@ -78,6 +78,37 @@ enabled like::
Please refer to the kernel configuration help for more information.
+Fine-grained tracing with AUX pause and resume
+----------------------------------------------
+
+Arm CoreSight may generate a large amount of hardware trace data, which
+will lead to overhead in recording and distract users when reviewing
+profiling results. To mitigate the issue of excessive trace data, Perf
+provides AUX pause and resume functionality for fine-grained tracing.
+
+The AUX pause and resume can be triggered by associated events. These
+events can be ftrace tracepoints (including static and dynamic
+tracepoints) or PMU events (e.g. CPU PMU cycle event). To create a perf
+session with AUX pause / resume, three configuration terms are
+introduced:
+
+- "aux-action=start-paused": it is specified for the cs_etm PMU event to
+ launch in a paused state.
+- "aux-action=pause": an associated event is specified with this term
+ to pause AUX trace.
+- "aux-action=resume": an associated event is specified with this term
+ to resume AUX trace.
+
+Example for triggering AUX pause and resume with ftrace tracepoints::
+
+ perf record -e cs_etm/aux-action=start-paused/k,syscalls:sys_enter_openat/aux-action=resume/,syscalls:sys_exit_openat/aux-action=pause/ ls
+
+Example for triggering AUX pause and resume with a PMU event::
+
+ perf record -a -e cs_etm/aux-action=start-paused/k \
+ -e cycles/aux-action=pause,period=10000000/ \
+ -e cycles/aux-action=resume,period=1050000/ -- sleep 1
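
Note: the tracepoint example above follows a fixed pattern, so the event list can be assembled from the two trigger events. The following is a minimal sketch, assuming a hypothetical helper named `aux_event_list` (it is not part of perf). It only builds and prints the command line and does not run perf.

```shell
# Hypothetical helper (not part of perf): build the -e argument for a
# cs_etm session that starts paused, resumes on $1 and pauses on $2.
aux_event_list() {
    printf 'cs_etm/aux-action=start-paused/k,%s/aux-action=resume/,%s/aux-action=pause/' "$1" "$2"
}

events=$(aux_event_list syscalls:sys_enter_openat syscalls:sys_exit_openat)

# Print the command instead of running it, since perf and CoreSight
# hardware may not be available where this is tried.
echo "perf record -e $events ls"
```

Swapping in other tracepoints only changes the two arguments; the cs_etm term stays the same.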
+
Perf test - Verify kernel and userspace perf CoreSight work
-----------------------------------------------------------
diff --git a/Documentation/trace/coresight/coresight.rst b/Documentation/trace/coresight/coresight.rst
index d4f93d6a2d63..806699871b80 100644
--- a/Documentation/trace/coresight/coresight.rst
+++ b/Documentation/trace/coresight/coresight.rst
@@ -462,44 +462,35 @@ queried by the perf command line tool:
cs_etm// [Kernel PMU event]
- linaro@linaro-nano:~$
-
Regardless of the number of tracers available in a system (usually equal to the
amount of processor cores), the "cs_etm" PMU will be listed only once.
A Coresight PMU works the same way as any other PMU, i.e the name of the PMU is
-listed along with configuration options within forward slashes '/'. Since a
-Coresight system will typically have more than one sink, the name of the sink to
-work with needs to be specified as an event option.
-On newer kernels the available sinks are listed in sysFS under
+provided along with configuration options within forward slashes '/' (see
+`Config option formats`_).
+
+Advanced Perf framework usage
+-----------------------------
+
+Sink selection
+~~~~~~~~~~~~~~
+
+An appropriate sink will be selected automatically for use with Perf, but since
+there will typically be more than one sink, the name of the sink to use may be
+specified as a special config option prefixed with '@'.
+
+The available sinks are listed in sysFS under
($SYSFS)/bus/event_source/devices/cs_etm/sinks/::
root@localhost:/sys/bus/event_source/devices/cs_etm/sinks# ls
tmc_etf0 tmc_etr0 tpiu0
-On older kernels, this may need to be found from the list of coresight devices,
-available under ($SYSFS)/bus/coresight/devices/::
-
- root:~# ls /sys/bus/coresight/devices/
- etm0 etm1 etm2 etm3 etm4 etm5 funnel0
- funnel1 funnel2 replicator0 stm0 tmc_etf0 tmc_etr0 tpiu0
root@linaro-nano:~# perf record -e cs_etm/@tmc_etr0/u --per-thread program
-As mentioned above in section "Device Naming scheme", the names of the devices could
-look different from what is used in the example above. One must use the device names
-as it appears under the sysFS.
-
-The syntax within the forward slashes '/' is important. The '@' character
-tells the parser that a sink is about to be specified and that this is the sink
-to use for the trace session.
-
More information on the above and other example on how to use Coresight with
the perf tools can be found in the "HOWTO.md" file of the openCSD gitHub
repository [#third]_.
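
Note: a short sketch of the sink selection described above. It lists the sinks advertised in sysFS, if the directory exists, and builds the event string with or without an explicit '@' sink. The helper name `cs_etm_event` is hypothetical, and the perf commands are only printed.

```shell
# Hypothetical helper: build a cs_etm event string, optionally naming a sink.
cs_etm_event() {
    sink=$1
    if [ -n "$sink" ]; then
        printf 'cs_etm/@%s/u' "$sink"
    else
        # No sink given: let Perf select one automatically.
        printf 'cs_etm//u'
    fi
}

# Show the available sinks when running on a CoreSight system.
sinks_dir=/sys/bus/event_source/devices/cs_etm/sinks
if [ -d "$sinks_dir" ]; then
    ls "$sinks_dir"
fi

echo "perf record -e $(cs_etm_event tmc_etr0) --per-thread program"
echo "perf record -e $(cs_etm_event '') --per-thread program"
```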
-Advanced perf framework usage
------------------------------
-
AutoFDO analysis using the perf tools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -508,7 +499,7 @@ perf can be used to record and analyze trace of programs.
Execution can be recorded using 'perf record' with the cs_etm event,
specifying the name of the sink to record to, e.g::
- perf record -e cs_etm/@tmc_etr0/u --per-thread
+ perf record -e cs_etm//u --per-thread
The 'perf report' and 'perf script' commands can be used to analyze execution,
synthesizing instruction and branch events from the instruction trace.
@@ -572,7 +563,7 @@ sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tuto
Bubble sorting array of 30000 elements
5910 ms
- $ perf record -e cs_etm/@tmc_etr0/u --per-thread taskset -c 2 ./sort
+ $ perf record -e cs_etm//u --per-thread taskset -c 2 ./sort
Bubble sorting array of 30000 elements
12543 ms
[ perf record: Woken up 35 times to write data ]
diff --git a/Documentation/trace/coresight/panic.rst b/Documentation/trace/coresight/panic.rst
new file mode 100644
index 000000000000..6e4bde953cae
--- /dev/null
+++ b/Documentation/trace/coresight/panic.rst
@@ -0,0 +1,362 @@
+===================================================
+Using Coresight for Kernel panic and Watchdog reset
+===================================================
+
+Introduction
+------------
+This documentation is about using Linux coresight trace support to
+debug kernel panic and watchdog reset scenarios.
+
+Coresight trace during Kernel panic
+-----------------------------------
+From the coresight driver point of view, addressing the kernel panic
+situation has four main requirements.
+
+a. Support for allocation of trace buffer pages from reserved memory area.
+ Platform can advertise this using a new device tree property added to
+ relevant coresight nodes.
+
+b. Support for stopping coresight blocks at the time of panic
+
+c. Saving required metadata in the specified format
+
+d. Support for reading trace data captured at the time of panic
+
+Allocation of trace buffer pages from reserved RAM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+A new optional device tree property "memory-region" is added to the
+Coresight TMC device nodes, that would give the base address and size of trace
+buffer.
+
+Static allocation of trace buffers would ensure that both IOMMU enabled
+and disabled cases are handled. Also, platforms that support persistent
+RAM will allow users to read trace data in the subsequent boot without
+booting the crashdump kernel.
+
+Note:
+For ETR sink devices, this reserved region will be used for both trace
+capture and trace data retrieval.
+For ETF sink devices, internal SRAM would be used for trace capture,
+and they would be synced to reserved region for retrieval.
+
+
+Disabling coresight blocks at the time of panic
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In order to avoid the situation of losing relevant trace data after a
+kernel panic, it would be desirable to stop the coresight blocks at the
+time of panic.
+
+This can be achieved by configuring the comparator, CTI and sink
+devices as below::
+
+ Trigger on panic
+ Comparator --->External out --->CTI -->External In---->ETR/ETF stop
+
+Saving metadata at the time of kernel panic
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Coresight metadata involves all additional data that are required for a
+successful trace decode in addition to the trace data. This involves
+ETR/ETF/ETB register snapshot etc.
+
+A new optional device property "memory-region" is added to
+the ETR/ETF/ETB device nodes for this.
+
+Reading trace data captured at the time of panic
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Trace data captured at the time of panic can be read from the rebooted kernel
+or from the crashdump kernel using a special device file /dev/crash_tmc_xxx.
+This device file is created only when valid crash data is available.
+
+General flow of trace capture and decode in case of kernel panic
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+1. Enable source and sink on all the cores using the sysfs interface.
+ ETR sinks should have trace buffers allocated from reserved memory,
+ by selecting "resrv" buffer mode from sysfs.
+
+2. Run relevant tests.
+
+3. On a kernel panic, all coresight blocks are disabled, necessary
+ metadata is synced by kernel panic handler.
+
+ System would eventually reboot or boot a crashdump kernel.
+
+4. For platforms that support a crashdump kernel, raw trace data can be
+ dumped using the coresight sysfs interface from the crashdump kernel
+ itself. Persistent RAM is not a requirement in this case.
+
+5. For platforms that support persistent RAM, trace data can be dumped
+ using the coresight sysfs interface in the subsequent Linux boot.
+ Crashdump kernel is not a requirement in this case. Persistent RAM
+ ensures that trace data is intact across reboot.
+
+Coresight trace during Watchdog reset
+-------------------------------------
+The main differences between addressing the watchdog reset and kernel panic
+cases are as follows:
+
+a. Saving coresight metadata needs to be taken care of by the
+ SCP (system control processor) firmware in the specified format,
+ instead of the kernel.
+
+b. Reserved memory region given by firmware for trace buffer and metadata
+ has to be in persistent RAM.
+ Note: This is a requirement for watchdog reset case but optional
+ in kernel panic case.
+
+Watchdog reset can be supported only on platforms that meet the above
+two requirements.
+
+Sample commands for testing a Kernel panic case with ETR sink
+-------------------------------------------------------------
+
+1. Boot Linux kernel with "crash_kexec_post_notifiers" added to the kernel
+ bootargs. This is mandatory if the user would like to read the trace data
+ from the crashdump kernel.
+
+2. Enable the preloaded ETM configuration::
+
+ #echo 1 > /sys/kernel/config/cs-syscfg/configurations/panicstop/enable
+
+3. Configure CTI using sysfs interface::
+
+ #./cti_setup.sh
+
+ #cat cti_setup.sh
+
+
+ cd /sys/bus/coresight/devices/
+
+ ap_cti_config () {
+ #ETM trig out[0] trigger to Channel 0
+ echo 0 4 > channels/trigin_attach
+ }
+
+ etf_cti_config () {
+ #ETF Flush in trigger from Channel 0
+ echo 0 1 > channels/trigout_attach
+ echo 1 > channels/trig_filter_enable
+ }
+
+ etr_cti_config () {
+ #ETR Flush in from Channel 0
+ echo 0 1 > channels/trigout_attach
+ echo 1 > channels/trig_filter_enable
+ }
+
+ ctidevs=`find . -name "cti*"`
+
+ for i in $ctidevs
+ do
+ cd $i
+
+ connection=`find . -name "ete*"`
+ if [ ! -z "$connection" ]
+ then
+ echo "AP CTI config for $i"
+ ap_cti_config
+ fi
+
+ connection=`find . -name "tmc_etf*"`
+ if [ ! -z "$connection" ]
+ then
+ echo "ETF CTI config for $i"
+ etf_cti_config
+ fi
+
+ connection=`find . -name "tmc_etr*"`
+ if [ ! -z "$connection" ]
+ then
+ echo "ETR CTI config for $i"
+ etr_cti_config
+ fi
+
+ cd ..
+ done
+
+Note: CTI connections are SOC specific and hence the above script is
+added just for reference.
+
+4. Choose reserved buffer mode for ETR buffer::
+
+ #echo "resrv" > /sys/bus/coresight/devices/tmc_etr0/buf_mode_preferred
+
+5. Enable stop on flush trigger configuration::
+
+ #echo 1 > /sys/bus/coresight/devices/tmc_etr0/stop_on_flush
+
+6. Start Coresight tracing on cores 1 and 2 using sysfs interface
+
+7. Run some application on core 1::
+
+ #taskset -c 1 dd if=/dev/urandom of=/dev/null &
+
+8. Invoke kernel panic on core 2::
+
+ #echo 1 > /proc/sys/kernel/panic
+ #taskset -c 2 echo c > /proc/sysrq-trigger
+
+9. From rebooted kernel or crashdump kernel, read crashdata::
+
+ #dd if=/dev/crash_tmc_etr0 of=/trace/cstrace.bin
+
+10. Run opencsd decoder tools/scripts to generate the instruction trace.
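
Note: step 6 above gives no commands. The following is a minimal sketch, assuming the sink is `tmc_etr0` and the per-core sources are named `ete1` and `ete2`. Device names are platform specific, so check /sys/bus/coresight/devices/ first. It uses the `enable_sink` and `enable_source` sysfs attributes and defaults to a dry run that only prints what it would write.

```shell
CS=/sys/bus/coresight/devices
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to really write to sysfs

cs_write() {
    # $1 = value, $2 = sysfs attribute path
    if [ "$DRY_RUN" = 0 ] && [ -w "$2" ]; then
        echo "$1" > "$2"
    else
        echo "would write $1 to $2"
    fi
}

# Enable the sink first, then the trace source of each core.
# tmc_etr0, ete1 and ete2 are assumed names for illustration.
cs_write 1 "$CS/tmc_etr0/enable_sink"
for cpu in 1 2; do
    cs_write 1 "$CS/ete$cpu/enable_source"
done
```

Enabling the sink before the sources means the trace path is complete when the first source starts.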
+
+Sample instruction trace dump
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Core1 dump::
+
+ A etm4_enable_hw: ffff800008ae1dd4
+ CONTEXT EL2 etm4_enable_hw: ffff800008ae1dd4
+ I etm4_enable_hw: ffff800008ae1dd4:
+ d503201f nop
+ I etm4_enable_hw: ffff800008ae1dd8:
+ d503201f nop
+ I etm4_enable_hw: ffff800008ae1ddc:
+ d503201f nop
+ I etm4_enable_hw: ffff800008ae1de0:
+ d503201f nop
+ I etm4_enable_hw: ffff800008ae1de4:
+ d503201f nop
+ I etm4_enable_hw: ffff800008ae1de8:
+ d503233f paciasp
+ I etm4_enable_hw: ffff800008ae1dec:
+ a9be7bfd stp x29, x30, [sp, #-32]!
+ I etm4_enable_hw: ffff800008ae1df0:
+ 910003fd mov x29, sp
+ I etm4_enable_hw: ffff800008ae1df4:
+ a90153f3 stp x19, x20, [sp, #16]
+ I etm4_enable_hw: ffff800008ae1df8:
+ 2a0003f4 mov w20, w0
+ I etm4_enable_hw: ffff800008ae1dfc:
+ 900085b3 adrp x19, ffff800009b95000 <reserved_mem+0xc48>
+ I etm4_enable_hw: ffff800008ae1e00:
+ 910f4273 add x19, x19, #0x3d0
+ I etm4_enable_hw: ffff800008ae1e04:
+ f8747a60 ldr x0, [x19, x20, lsl #3]
+ E etm4_enable_hw: ffff800008ae1e08:
+ b4000140 cbz x0, ffff800008ae1e30 <etm4_starting_cpu+0x50>
+ I 149.039572921 etm4_enable_hw: ffff800008ae1e30:
+ a94153f3 ldp x19, x20, [sp, #16]
+ I 149.039572921 etm4_enable_hw: ffff800008ae1e34:
+ 52800000 mov w0, #0x0 // #0
+ I 149.039572921 etm4_enable_hw: ffff800008ae1e38:
+ a8c27bfd ldp x29, x30, [sp], #32
+
+ ..snip
+
+ 149.052324811 chacha_block_generic: ffff800008642d80:
+ 9100a3e0 add x0,
+ I 149.052324811 chacha_block_generic: ffff800008642d84:
+ b86178a2 ldr w2, [x5, x1, lsl #2]
+ I 149.052324811 chacha_block_generic: ffff800008642d88:
+ 8b010803 add x3, x0, x1, lsl #2
+ I 149.052324811 chacha_block_generic: ffff800008642d8c:
+ b85fc063 ldur w3, [x3, #-4]
+ I 149.052324811 chacha_block_generic: ffff800008642d90:
+ 0b030042 add w2, w2, w3
+ I 149.052324811 chacha_block_generic: ffff800008642d94:
+ b8217882 str w2, [x4, x1, lsl #2]
+ I 149.052324811 chacha_block_generic: ffff800008642d98:
+ 91000421 add x1, x1, #0x1
+ I 149.052324811 chacha_block_generic: ffff800008642d9c:
+ f100443f cmp x1, #0x11
+
+
+Core 2 dump::
+
+ A etm4_enable_hw: ffff800008ae1dd4
+ CONTEXT EL2 etm4_enable_hw: ffff800008ae1dd4
+ I etm4_enable_hw: ffff800008ae1dd4:
+ d503201f nop
+ I etm4_enable_hw: ffff800008ae1dd8:
+ d503201f nop
+ I etm4_enable_hw: ffff800008ae1ddc:
+ d503201f nop
+ I etm4_enable_hw: ffff800008ae1de0:
+ d503201f nop
+ I etm4_enable_hw: ffff800008ae1de4:
+ d503201f nop
+ I etm4_enable_hw: ffff800008ae1de8:
+ d503233f paciasp
+ I etm4_enable_hw: ffff800008ae1dec:
+ a9be7bfd stp x29, x30, [sp, #-32]!
+ I etm4_enable_hw: ffff800008ae1df0:
+ 910003fd mov x29, sp
+ I etm4_enable_hw: ffff800008ae1df4:
+ a90153f3 stp x19, x20, [sp, #16]
+ I etm4_enable_hw: ffff800008ae1df8:
+ 2a0003f4 mov w20, w0
+ I etm4_enable_hw: ffff800008ae1dfc:
+ 900085b3 adrp x19, ffff800009b95000 <reserved_mem+0xc48>
+ I etm4_enable_hw: ffff800008ae1e00:
+ 910f4273 add x19, x19, #0x3d0
+ I etm4_enable_hw: ffff800008ae1e04:
+ f8747a60 ldr x0, [x19, x20, lsl #3]
+ E etm4_enable_hw: ffff800008ae1e08:
+ b4000140 cbz x0, ffff800008ae1e30 <etm4_starting_cpu+0x50>
+ I 149.046243445 etm4_enable_hw: ffff800008ae1e30:
+ a94153f3 ldp x19, x20, [sp, #16]
+ I 149.046243445 etm4_enable_hw: ffff800008ae1e34:
+ 52800000 mov w0, #0x0 // #0
+ I 149.046243445 etm4_enable_hw: ffff800008ae1e38:
+ a8c27bfd ldp x29, x30, [sp], #32
+ I 149.046243445 etm4_enable_hw: ffff800008ae1e3c:
+ d50323bf autiasp
+ E 149.046243445 etm4_enable_hw: ffff800008ae1e40:
+ d65f03c0 ret
+ A ete_sysreg_write: ffff800008adfa18
+
+ ..snip
+
+ I 149.05422547 panic: ffff800008096300:
+ a90363f7 stp x23, x24, [sp, #48]
+ I 149.05422547 panic: ffff800008096304:
+ 6b00003f cmp w1, w0
+ I 149.05422547 panic: ffff800008096308:
+ 3a411804 ccmn w0, #0x1, #0x4, ne // ne = any
+ N 149.05422547 panic: ffff80000809630c:
+ 540001e0 b.eq ffff800008096348 <panic+0xe0> // b.none
+ I 149.05422547 panic: ffff800008096310:
+ f90023f9 str x25, [sp, #64]
+ E 149.05422547 panic: ffff800008096314:
+ 97fe44ef bl ffff8000080276d0 <panic_smp_self_stop>
+ A panic: ffff80000809634c
+ I 149.05422547 panic: ffff80000809634c:
+ 910102d5 add x21, x22, #0x40
+ I 149.05422547 panic: ffff800008096350:
+ 52800020 mov w0, #0x1 // #1
+ E 149.05422547 panic: ffff800008096354:
+ 94166b8b bl ffff800008631180 <bust_spinlocks>
+ N 149.054225518 bust_spinlocks: ffff800008631180:
+ 340000c0 cbz w0, ffff800008631198 <bust_spinlocks+0x18>
+ I 149.054225518 bust_spinlocks: ffff800008631184:
+ f000a321 adrp x1, ffff800009a98000 <pbufs.0+0xbb8>
+ I 149.054225518 bust_spinlocks: ffff800008631188:
+ b9405c20 ldr w0, [x1, #92]
+ I 149.054225518 bust_spinlocks: ffff80000863118c:
+ 11000400 add w0, w0, #0x1
+ I 149.054225518 bust_spinlocks: ffff800008631190:
+ b9005c20 str w0, [x1, #92]
+ E 149.054225518 bust_spinlocks: ffff800008631194:
+ d65f03c0 ret
+ A panic: ffff800008096358
+
+Perf based testing
+------------------
+
+Starting perf session
+~~~~~~~~~~~~~~~~~~~~~
+ETF::
+
+ perf record -e cs_etm/panicstop,@tmc_etf1/ -C 1
+ perf record -e cs_etm/panicstop,@tmc_etf2/ -C 2
+
+ETR::
+
+ perf record -e cs_etm/panicstop,@tmc_etr0/ -C 1,2
+
+Reading trace data after panic
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The same sysfs-based method explained above can be used to retrieve and
+decode the trace data after the reboot that follows a kernel panic.
diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
index 54fb16239d70..d54bc500af80 100644
--- a/Documentation/trace/debugging.rst
+++ b/Documentation/trace/debugging.rst
@@ -136,6 +136,8 @@ kernel, so only the same kernel is guaranteed to work if the mapping is
preserved. Switching to a different kernel version may find a different
layout and mark the buffer as invalid.
+NB: Both the mapped address and size must be page aligned for the architecture.
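
Note: the alignment requirement can be checked before booting with a given mapping. The following is a minimal sketch, assuming the page size reported by `getconf PAGESIZE` on the running system matches that of the target kernel. The address and size are made-up example values.

```shell
# Succeeds if $1 (decimal or 0x-prefixed hex) is a multiple of the page size.
is_page_aligned() {
    page=$(getconf PAGESIZE)
    [ $(( $1 % page )) -eq 0 ]
}

addr=0x285400000             # example value only
size=$((12 * 1024 * 1024))   # example value only (12 MiB)

for v in "$addr" "$size"; do
    if is_page_aligned "$v"; then
        echo "$v is page aligned"
    else
        echo "$v is NOT page aligned"
    fi
done
```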
+
Using trace_printk() in the boot instance
-----------------------------------------
By default, the content of trace_printk() goes into the top level tracing
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
new file mode 100644
index 000000000000..89b5157cfab8
--- /dev/null
+++ b/Documentation/trace/eprobetrace.rst
@@ -0,0 +1,269 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Eprobe - Event-based Probe Tracing
+==================================
+
+:Author: Steven Rostedt <rostedt@goodmis.org>
+
+- Written for v6.17
+
+Overview
+========
+
+Eprobes are dynamic events that are placed on existing events to either
+dereference a field that is a pointer, or simply to limit what fields are
+recorded in the trace event.
+
+Eprobes depend on kprobe events so to enable this feature, build your kernel
+with CONFIG_EPROBE_EVENTS=y.
+
+Eprobes are created via the /sys/kernel/tracing/dynamic_events file.
+
+Synopsis of eprobe_events
+-------------------------
+::
+
+ e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS] : Set a probe
+ -:[EGRP/][EEVENT] : Clear a probe
+
+ EGRP : Group name of the new event. If omitted, use "eprobes" for it.
+ EEVENT : Event name. If omitted, the event name is generated and will
+ be the same event name as the event it attached to.
+ GRP : Group name of the event to attach to.
+ EVENT : Event name of the event to attach to.
+
+ FETCHARGS : Arguments. Each probe can have up to 128 args.
+ $FIELD : Fetch the value of the event field called FIELD.
+ @ADDR : Fetch memory at ADDR (ADDR should be in kernel)
+ @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
+ $comm : Fetch current task comm.
+ +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+ \IMM : Store an immediate value to the argument.
+ NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
+ FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
+ (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
+ (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
+ "string", "ustring", "symbol", "symstr" and "bitfield" are
+ supported.
+
+Types
+-----
+The FETCHARGS above is very similar to the kprobe events as described in
+Documentation/trace/kprobetrace.rst.
+
+The difference between eprobes and kprobes FETCHARGS is that eprobes have a
+$FIELD command that returns the content of the event field of the event
+they are attached to. Eprobes do not have access to the registers, stacks and
+function arguments that kprobes have.
+
+If a field argument is a pointer, it may be dereferenced just like a memory
+address using the FETCHARGS syntax.
+
+
+Attaching to dynamic events
+---------------------------
+
+Eprobes may attach to dynamic events as well as to normal events. It may
+attach to a kprobe event, a synthetic event or a fprobe event. This is useful
+if the type of a field needs to be changed. See Example 2 below.
+
+Usage examples
+==============
+
+Example 1
+---------
+
+The basic usage of eprobes is to limit the data that is being recorded into
+the tracing buffer. For example, a common event to trace is the sched_switch
+trace event. That has a format of::
+
+ field:unsigned short common_type; offset:0; size:2; signed:0;
+ field:unsigned char common_flags; offset:2; size:1; signed:0;
+ field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
+ field:int common_pid; offset:4; size:4; signed:1;
+
+ field:char prev_comm[16]; offset:8; size:16; signed:0;
+ field:pid_t prev_pid; offset:24; size:4; signed:1;
+ field:int prev_prio; offset:28; size:4; signed:1;
+ field:long prev_state; offset:32; size:8; signed:1;
+ field:char next_comm[16]; offset:40; size:16; signed:0;
+ field:pid_t next_pid; offset:56; size:4; signed:1;
+ field:int next_prio; offset:60; size:4; signed:1;
+
+The first four fields are common to all events and cannot be limited. But the
+rest of the event has 56 bytes of information. It records the names of the
+previous and next tasks being scheduled out and in, as well as their pids and
+priorities. It also records the state of the previous task. If only the pids
+of the tasks are of interest, why waste the ring buffer with all the other
+fields?
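
Note: to see how much of an event is payload rather than common header, the `size:` values of the non-common fields can be summed from the format file. The following is a minimal sketch. The helper name `payload_bytes` is hypothetical, and the tracefs file is read only if it is accessible.

```shell
# Sum the size: values of all field lines that do not mention "common_".
payload_bytes() {
    awk '/field:/ && !/common_/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^size:/) {
                     sub(/^size:/, "", $i)
                     sub(/;$/, "", $i)
                     sum += $i
                 }
         }
         END { print sum + 0 }'
}

fmt=/sys/kernel/tracing/events/sched/sched_switch/format
if [ -r "$fmt" ]; then
    payload_bytes < "$fmt"
fi
```

The same function works on any event format file, or on a format listing pasted on standard input.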
+
+An eprobe can limit what gets recorded. Note, it does not help in performance,
+as all the fields are recorded in a temporary buffer to process the eprobe.
+::
+
+ # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events
+ # echo 1 > /sys/kernel/tracing/events/sched/switch/enable
+ # cat /sys/kernel/tracing/trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 2721/2721 #P:8
+ #
+ # _-----=> irqs-off/BH-disabled
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / _-=> migrate-disable
+ # |||| / delay
+ # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
+ # | | | ||||| | |
+ sshd-session-1082 [004] d..4. 5041.239906: switch: (sched.sched_switch) prev=1082 next=0
+ bash-1085 [001] d..4. 5041.240198: switch: (sched.sched_switch) prev=1085 next=141
+ kworker/u34:5-141 [001] d..4. 5041.240259: switch: (sched.sched_switch) prev=141 next=1085
+ <idle>-0 [004] d..4. 5041.240354: switch: (sched.sched_switch) prev=0 next=1082
+ bash-1085 [001] d..4. 5041.240385: switch: (sched.sched_switch) prev=1085 next=141
+ kworker/u34:5-141 [001] d..4. 5041.240410: switch: (sched.sched_switch) prev=141 next=1085
+ bash-1085 [001] d..4. 5041.240478: switch: (sched.sched_switch) prev=1085 next=0
+ sshd-session-1082 [004] d..4. 5041.240526: switch: (sched.sched_switch) prev=1082 next=0
+ <idle>-0 [001] d..4. 5041.247524: switch: (sched.sched_switch) prev=0 next=90
+ <idle>-0 [002] d..4. 5041.247545: switch: (sched.sched_switch) prev=0 next=16
+ kworker/1:1-90 [001] d..4. 5041.247580: switch: (sched.sched_switch) prev=90 next=0
+ rcu_sched-16 [002] d..4. 5041.247591: switch: (sched.sched_switch) prev=16 next=0
+ <idle>-0 [002] d..4. 5041.257536: switch: (sched.sched_switch) prev=0 next=16
+ rcu_sched-16 [002] d..4. 5041.257573: switch: (sched.sched_switch) prev=16 next=0
+
+Note, without adding the "u32" after the prev_pid and next_pid, the values
+would be shown in hexadecimal by default.
+
+Example 2
+---------
+
+If a specific system call is to be recorded but the syscalls events are not
+enabled, the raw_syscalls can still be used (system call events are not
+normal events, but are created from the raw_syscalls events within the
+kernel). In order to trace the openat system call, one can create
+an event probe on top of the raw_syscalls event:
+::
+
+ # cd /sys/kernel/tracing
+ # cat events/raw_syscalls/sys_enter/format
+ name: sys_enter
+ ID: 395
+ format:
+ field:unsigned short common_type; offset:0; size:2; signed:0;
+ field:unsigned char common_flags; offset:2; size:1; signed:0;
+ field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
+ field:int common_pid; offset:4; size:4; signed:1;
+
+ field:long id; offset:8; size:8; signed:1;
+ field:unsigned long args[6]; offset:16; size:48; signed:0;
+
+ print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
+
+From the source code, the sys_openat() has:
+::
+
+ int sys_openat(int dirfd, const char *path, int flags, mode_t mode)
+ {
+ return my_syscall4(__NR_openat, dirfd, path, flags, mode);
+ }
+
+The path is the second parameter, and that is what is wanted.
+::
+
+ # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events
+
+This is being run on x86_64 where the word size is 8 bytes and the openat
+system call __NR_openat is set at 257.
+::
+
+ # echo 'nr == 257' > events/eprobes/openat/filter
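
Note: the `+8($args)` offset used above is the argument index multiplied by the word size. The following is a minimal sketch that derives the fetch argument for other system call parameters, assuming 8-byte words as on x86_64. The helper name `syscall_arg_fetch` is hypothetical, and the probe definition is only printed.

```shell
# $1 = zero-based argument index, $2 = eprobe type, $3 = word size (default 8)
syscall_arg_fetch() {
    printf '+%d($args):%s' "$(( $1 * ${3:-8} ))" "$2"
}

# Second parameter of openat (index 1), read as a user-space string.
echo "e:openat raw_syscalls.sys_enter nr=\$id filename=$(syscall_arg_fetch 1 ustring)"
```

On a 32-bit system the third argument would be 4, which changes the offset accordingly.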
+
+Now enable the event and look at the trace.
+::
+
+ # echo 1 > events/eprobes/openat/enable
+ # cat trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 4/4 #P:8
+ #
+ # _-----=> irqs-off/BH-disabled
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / _-=> migrate-disable
+ # |||| / delay
+ # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
+ # | | | ||||| | |
+ cat-1298 [003] ...2. 2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+ cat-1298 [003] ...2. 2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+ cat-1298 [003] ...2. 2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+ cat-1298 [003] ...2. 2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+
+The filename shows "(fault)". This is likely because the filename has not been
+pulled into memory yet and currently trace events cannot fault in memory that
+is not present. When an eprobe tries to read memory that has not been faulted
+in yet, it will show the "(fault)" text.
+
+To get around this, note that the kernel will likely pull in this filename and
+make it present by the time the system call returns. A synthetic event can
+pass the address of the filename from the entry of the system call to its
+exit, where it can be used to show the filename.
+
+Remove the old eprobe::
+
+ # echo 0 > events/eprobes/openat/enable
+ # echo '-:openat' >> dynamic_events
+
+This time make an eprobe where the address of the filename is saved::
+
+ # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events
+
+Create a synthetic event that passes the address of the filename to the
+end of the event::
+
+ # echo 's:filename u64 file' >> dynamic_events
+ # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger
+ # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger
+
+Now that the address of the filename has been passed to the end of the
+system call, create another eprobe to attach to the exit event to show the
+string::
+
+ # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+ # echo 1 > events/eprobes/openat/enable
+ # cat trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 4/4 #P:8
+ #
+ # _-----=> irqs-off/BH-disabled
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / _-=> migrate-disable
+ # |||| / delay
+ # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
+ # | | | ||||| | |
+ cat-1331 [001] ...5. 2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache"
+ cat-1331 [001] ...5. 2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6"
+ cat-1331 [001] ...5. 2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive"
+ cat-1331 [001] ...5. 2944.831362: openat: (synthetic.filename) filename="trace"
+
+Example 3
+---------
+
+If syscall trace events are available, the above would not need the first
+eprobe, but it would still need the last one::
+
+ # echo 's:filename u64 file' >> dynamic_events
+ # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger
+ # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger
+ # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+ # echo 1 > events/eprobes/openat/enable
+
+And this would produce the same result as Example 2.
diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index 759907c20e75..2d88a2acacc0 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -55,6 +55,30 @@ command::
# echo 'irq:*' > /sys/kernel/tracing/set_event
+The set_event file may also be used to enable events associated with only
+a specific module::
+
+ # echo ':mod:<module>' > /sys/kernel/tracing/set_event
+
+This will enable all events in the module ``<module>``. If the module is not
+yet loaded, the string will be saved, and the events will be enabled when a
+module that matches ``<module>`` is loaded.
+
+The text before ``:mod:`` will be parsed to specify specific events that the
+module creates::
+
+ # echo '<match>:mod:<module>' > /sys/kernel/tracing/set_event
+
+The above will enable any system or event that ``<match>`` matches. If
+``<match>`` is ``"*"`` then it will match all events.
+
+To enable only a specific event within a system::
+
+ # echo '<system>:<event>:mod:<module>' > /sys/kernel/tracing/set_event
+
+If ``<event>`` is ``"*"`` then it will match all events within the system
+for a given module.
+
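The three ``:mod:`` forms above differ only in the text placed before ``:mod:``. A minimal sketch of how a script might assemble such strings follows; ``mod_event_filter`` and the names ``mymod``, ``mysys`` and ``myevent`` are hypothetical and not part of any kernel tooling. The sketch only builds the string; writing it to ``/sys/kernel/tracing/set_event`` is left to the caller and normally requires root.

```python
def mod_event_filter(module, system=None, event=None):
    """Build a set_event string of the form '[<system>[:<event>]]:mod:<module>'.

    Hypothetical helper for illustration only. 'system' may also be a
    generic <match> pattern such as '*'.
    """
    if not module:
        raise ValueError("a module name is required")
    if event is not None and system is None:
        raise ValueError("an event needs a system (or match) in front of it")
    # Everything before ':mod:' selects which of the module's events to enable.
    parts = [p for p in (system, event) if p is not None]
    return ":".join(parts) + ":mod:" + module


print(mod_event_filter("mymod"))                      # all events of the module
print(mod_event_filter("mymod", "*"))                 # explicit match-all
print(mod_event_filter("mymod", "mysys", "myevent"))  # one event in one system
```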
2.2 Via the 'enable' toggle
---------------------------
diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
index 196f52386aaa..71cd40472d36 100644
--- a/Documentation/trace/fprobe.rst
+++ b/Documentation/trace/fprobe.rst
@@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe
Introduction
============
-Fprobe is a function entry/exit probe mechanism based on ftrace.
-Instead of using ftrace full feature, if you only want to attach callbacks
-on function entry and exit, similar to the kprobes and kretprobes, you can
+Fprobe is a function entry/exit probe based on the function-graph tracing
+feature in ftrace.
+If you do not need to trace all functions and only want to attach callbacks to
+the entry and exit of specific functions, similar to kprobes and kretprobes, you can
use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
instrumentation for multiple functions with single handler. This document
describes how to use fprobe.
@@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as follows:
.. code-block:: c
- int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data);
+ int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data);
- void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data);
+ void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data);
-Note that the @entry_ip is saved at function entry and passed to exit handler.
-If the entry callback function returns !0, the corresponding exit callback will be cancelled.
+Note that the @entry_ip is saved at function entry and passed to exit
+handler.
+If the entry callback function returns !0, the corresponding exit callback
+will be cancelled.
@fp
This is the address of `fprobe` data structure related to this handler.
@@ -112,12 +115,10 @@ If the entry callback function returns !0, the corresponding exit callback will
This is the return address that the traced function will return to,
somewhere in the caller. This can be used at both entry and exit.
-@regs
- This is the `pt_regs` data structure at the entry and exit. Note that
- the instruction pointer of @regs may be different from the @entry_ip
- in the entry_handler. If you need traced instruction pointer, you need
- to use @entry_ip. On the other hand, in the exit_handler, the instruction
- pointer of @regs is set to the current return address.
+@fregs
+ This is the `ftrace_regs` data structure at the entry and exit. It
+ includes the function parameters or the return values, so users can
+ access those values via the appropriate `ftrace_regs_*` APIs.
@entry_data
This is a local storage to share the data between entry and exit handlers.
@@ -125,6 +126,17 @@ If the entry callback function returns !0, the corresponding exit callback will
and `entry_data_size` field when registering the fprobe, the storage is
allocated and passed to both `entry_handler` and `exit_handler`.
+Entry data size and exit handlers on the same function
+======================================================
+
+Since the entry data is passed via the per-task stack, which has a limited size,
+the entry data size per probe is limited to `15 * sizeof(long)`. Note that this
+limit becomes smaller when several fprobes probe the same function. The entry
+data size is aligned to `sizeof(long)`, and each fprobe that has an exit handler
+uses an additional `sizeof(long)` of space on the stack, so you should keep the
+number of fprobes on the same function as small as possible.
+
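The per-probe arithmetic described above can be checked with a short sketch. It assumes that the ``long`` of the Python host matches the kernel's ``long`` (true for a native userspace on the same machine), and it only models the single-probe limit and the alignment; how much the limit shrinks when several fprobes share a function is not modelled. The helper names are hypothetical.

```python
import ctypes

# Assumption: the userspace 'long' has the same size as the kernel's 'long'.
SIZEOF_LONG = ctypes.sizeof(ctypes.c_long)

# Per-probe upper bound stated in the text: 15 * sizeof(long).
MAX_ENTRY_DATA_SIZE = 15 * SIZEOF_LONG


def aligned_entry_data_size(size):
    """Round 'size' up to the next multiple of sizeof(long)."""
    return (size + SIZEOF_LONG - 1) // SIZEOF_LONG * SIZEOF_LONG


def fits_entry_data_limit(size):
    """Return True if the aligned size stays within the per-probe limit."""
    return aligned_entry_data_size(size) <= MAX_ENTRY_DATA_SIZE


for requested in (1, SIZEOF_LONG + 1, MAX_ENTRY_DATA_SIZE, MAX_ENTRY_DATA_SIZE + 1):
    print(requested, aligned_entry_data_size(requested), fits_entry_data_limit(requested))
```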
Share the callbacks with kprobes
================================
@@ -165,8 +177,8 @@ This counter counts up when;
- fprobe fails to take ftrace_recursion lock. This usually means that a function
which is traced by other ftrace users is called from the entry_handler.
- - fprobe fails to setup the function exit because of the shortage of rethook
- (the shadow stack for hooking the function return.)
+ - fprobe fails to setup the function exit because of failing to allocate the
+ data buffer from the per-task shadow stack.
The `fprobe::nmissed` field counts up in both cases. Therefore, the former
skips both of entry and exit callback and the latter skips the exit
diff --git a/Documentation/trace/ftrace-design.rst b/Documentation/trace/ftrace-design.rst
index dc82d64b3a44..8f4fab3f9324 100644
--- a/Documentation/trace/ftrace-design.rst
+++ b/Documentation/trace/ftrace-design.rst
@@ -238,19 +238,15 @@ You need very few things to get the syscalls tracing in an arch.
- Tag this arch as HAVE_SYSCALL_TRACEPOINTS.
-HAVE_FTRACE_MCOUNT_RECORD
--------------------------
+HAVE_DYNAMIC_FTRACE
+-------------------
See scripts/recordmcount.pl for more info. Just fill in the arch-specific
details for how to locate the addresses of mcount call sites via objdump.
This option doesn't make much sense without also implementing dynamic ftrace.
-
-HAVE_DYNAMIC_FTRACE
--------------------
-
-You will first need HAVE_FTRACE_MCOUNT_RECORD and HAVE_FUNCTION_TRACER, so
-scroll your reader back up if you got over eager.
+You will first need HAVE_FUNCTION_TRACER, so scroll your reader back up if you
+got over eager.
Once those are out of the way, you will need to implement:
- asm/ftrace.h:
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 4073ca48af4a..af66a05e18cc 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -810,6 +810,12 @@ Here is the list of current tracers that may be configured.
to draw a graph of function calls similar to C code
source.
+ Note that the function graph tracer calculates the timings of when a
+ function starts and returns internally, and does so separately for each
+ instance. If two instances run the function graph tracer and trace
+ the same functions, the measured durations may differ slightly, because
+ each instance reads the timestamp separately and not at the same time.
+
"blk"
The block tracer. The tracer used by the blktrace user
@@ -1031,14 +1037,15 @@ explains which is which.
CPU#: The CPU which the process was running on.
irqs-off: 'd' interrupts are disabled. '.' otherwise.
- .. caution:: If the architecture does not support a way to
- read the irq flags variable, an 'X' will always
- be printed here.
need-resched:
+ - 'B' all of TIF_NEED_RESCHED, PREEMPT_NEED_RESCHED and TIF_RESCHED_LAZY are set,
- 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
- 'n' only TIF_NEED_RESCHED is set,
- 'p' only PREEMPT_NEED_RESCHED is set,
+ - 'L' both PREEMPT_NEED_RESCHED and TIF_RESCHED_LAZY is set,
+ - 'b' both TIF_NEED_RESCHED and TIF_RESCHED_LAZY is set,
+ - 'l' only TIF_RESCHED_LAZY is set,
- '.' otherwise.
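The need-resched list above covers all eight combinations of the three flags, so it can be written as a lookup table. The following is only an illustrative sketch of that table; ``need_resched_char`` is a hypothetical name and this is not how the kernel itself computes the character.

```python
# Keys are (TIF_NEED_RESCHED, PREEMPT_NEED_RESCHED, TIF_RESCHED_LAZY).
NEED_RESCHED_CHARS = {
    (True, True, True): "B",
    (True, True, False): "N",
    (True, False, False): "n",
    (False, True, False): "p",
    (False, True, True): "L",
    (True, False, True): "b",
    (False, False, True): "l",
    (False, False, False): ".",
}


def need_resched_char(tif_need_resched, preempt_need_resched, tif_resched_lazy):
    """Return the need-resched character for one flag combination."""
    key = (bool(tif_need_resched), bool(preempt_need_resched), bool(tif_resched_lazy))
    return NEED_RESCHED_CHARS[key]


print(need_resched_char(True, False, True))   # TIF_NEED_RESCHED and TIF_RESCHED_LAZY
```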
hardirq/softirq:
@@ -1198,6 +1205,19 @@ Here are the available options:
default instance. The only way the top level instance has this flag
cleared, is by it being set in another instance.
+ copy_trace_marker
+ If there are applications that hard code writing into the top level
+ trace_marker file (/sys/kernel/tracing/trace_marker or trace_marker_raw),
+ and the tooling would like it to go into an instance, this option can
+ be used. Create an instance and set this option, and then all writes
+ into the top level trace_marker file will also be redirected into this
+ instance.
+
+ Note, by default this option is set for the top level instance. If it
+ is disabled, then writes to the trace_marker or trace_marker_raw files
+ will not be written into the top level file. If no instance has this
+ option set, then a write will error with the errno of ENODEV.
+
annotate
It is sometimes confusing when the CPU buffers are full
and one CPU buffer had a lot of events recently, thus
@@ -3070,7 +3090,7 @@ Notice that we lost the sys_nanosleep.
# cat set_ftrace_filter
hrtimer_run_queues
hrtimer_run_pending
- hrtimer_init
+ hrtimer_setup
hrtimer_cancel
hrtimer_try_to_cancel
hrtimer_forward
@@ -3108,7 +3128,7 @@ Again, now we want to append.
# cat set_ftrace_filter
hrtimer_run_queues
hrtimer_run_pending
- hrtimer_init
+ hrtimer_setup
hrtimer_cancel
hrtimer_try_to_cancel
hrtimer_forward
diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst
index 3c9b263de9c2..2b98c1720a54 100644
--- a/Documentation/trace/histogram.rst
+++ b/Documentation/trace/histogram.rst
@@ -81,7 +81,7 @@ Documentation written by Tom Zanussi
.usecs display a common_timestamp in microseconds
.percent display a number of percentage value
.graph display a bar-graph of a value
- .stacktrace display as a stacktrace (must by a long[] type)
+ .stacktrace display as a stacktrace (must be a long[] type)
============= =================================================
Note that in general the semantics of a given field aren't
@@ -249,7 +249,7 @@ Extended error information
table, it should keep a running total of the number of bytes
requested by that call_site.
- We'll let it run for awhile and then dump the contents of the 'hist'
+ We'll let it run for a while and then dump the contents of the 'hist'
file in the kmalloc event's subdirectory (for readability, a number
of entries have been omitted)::
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 0b300901fd75..b4a429dc4f7a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -1,38 +1,104 @@
-==========================
-Linux Tracing Technologies
-==========================
+================================
+Linux Tracing Technologies Guide
+================================
+
+Tracing in the Linux kernel is a powerful mechanism that allows
+developers and system administrators to analyze and debug system
+behavior. This guide provides documentation on various tracing
+frameworks and tools available in the Linux kernel.
+
+Introduction to Tracing
+-----------------------
+
+This section provides an overview of Linux tracing mechanisms
+and debugging approaches.
.. toctree::
- :maxdepth: 2
+ :maxdepth: 1
- ftrace-design
+ debugging
+ tracepoints
tracepoint-analysis
+ ring-buffer-map
+
+Core Tracing Frameworks
+-----------------------
+
+The following are the primary tracing frameworks integrated into
+the Linux kernel.
+
+.. toctree::
+ :maxdepth: 1
+
ftrace
+ ftrace-design
ftrace-uses
- fprobe
kprobes
kprobetrace
- uprobetracer
fprobetrace
- tracepoints
+ eprobetrace
+ fprobe
+ ring-buffer-design
+
+Event Tracing and Analysis
+--------------------------
+
+A detailed explanation of event tracing mechanisms and their
+applications.
+
+.. toctree::
+ :maxdepth: 1
+
events
events-kmem
events-power
events-nmi
events-msr
- mmiotrace
+ boottime-trace
histogram
histogram-design
- boottime-trace
- hwlat_detector
- osnoise-tracer
- timerlat-tracer
+
+Hardware and Performance Tracing
+--------------------------------
+
+This section covers tracing features that monitor hardware
+interactions and system performance.
+
+.. toctree::
+ :maxdepth: 1
+
intel_th
- ring-buffer-design
- ring-buffer-map
stm
sys-t
coresight/index
- user_events
rv/index
hisi-ptt
+ mmiotrace
+ hwlat_detector
+ osnoise-tracer
+ timerlat-tracer
+
+User-Space Tracing
+------------------
+
+These tools allow tracing user-space applications and
+interactions.
+
+.. toctree::
+ :maxdepth: 1
+
+ user_events
+ uprobetracer
+
+Additional Resources
+--------------------
+
+For more details, refer to the respective documentation of each
+tracing tool and framework.
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/trace/postprocess/decode_msr.py b/Documentation/trace/postprocess/decode_msr.py
index aa9cc7abd5c2..f5609b16f589 100644
--- a/Documentation/trace/postprocess/decode_msr.py
+++ b/Documentation/trace/postprocess/decode_msr.py
@@ -32,6 +32,6 @@ for j in sys.stdin:
break
if r:
j = j.replace(" " + m.group(2), " " + r + "(" + m.group(2) + ")")
- print j,
+ print(j)
diff --git a/Documentation/trace/rv/da_monitor_synthesis.rst b/Documentation/trace/rv/da_monitor_synthesis.rst
deleted file mode 100644
index 0a92729c8a9b..000000000000
--- a/Documentation/trace/rv/da_monitor_synthesis.rst
+++ /dev/null
@@ -1,147 +0,0 @@
-Deterministic Automata Monitor Synthesis
-========================================
-
-The starting point for the application of runtime verification (RV) techniques
-is the *specification* or *modeling* of the desired (or undesired) behavior
-of the system under scrutiny.
-
-The formal representation needs to be then *synthesized* into a *monitor*
-that can then be used in the analysis of the trace of the system. The
-*monitor* connects to the system via an *instrumentation* that converts
-the events from the *system* to the events of the *specification*.
-
-
-In Linux terms, the runtime verification monitors are encapsulated inside
-the *RV monitor* abstraction. The RV monitor includes a set of instances
-of the monitor (per-cpu monitor, per-task monitor, and so on), the helper
-functions that glue the monitor to the system reference model, and the
-trace output as a reaction to event parsing and exceptions, as depicted
-below::
-
- Linux +----- RV Monitor ----------------------------------+ Formal
- Realm | | Realm
- +-------------------+ +----------------+ +-----------------+
- | Linux kernel | | Monitor | | Reference |
- | Tracing | -> | Instance(s) | <- | Model |
- | (instrumentation) | | (verification) | | (specification) |
- +-------------------+ +----------------+ +-----------------+
- | | |
- | V |
- | +----------+ |
- | | Reaction | |
- | +--+--+--+-+ |
- | | | | |
- | | | +-> trace output ? |
- +------------------------|--|----------------------+
- | +----> panic ?
- +-------> <user-specified>
-
-DA monitor synthesis
---------------------
-
-The synthesis of automata-based models into the Linux *RV monitor* abstraction
-is automated by the dot2k tool and the rv/da_monitor.h header file that
-contains a set of macros that automatically generate the monitor's code.
-
-dot2k
------
-
-The dot2k utility leverages dot2c by converting an automaton model in
-the DOT format into the C representation [1] and creating the skeleton of
-a kernel monitor in C.
-
-For example, it is possible to transform the wip.dot model present in
-[1] into a per-cpu monitor with the following command::
-
- $ dot2k -d wip.dot -t per_cpu
-
-This will create a directory named wip/ with the following files:
-
-- wip.h: the wip model in C
-- wip.c: the RV monitor
-
-The wip.c file contains the monitor declaration and the starting point for
-the system instrumentation.
-
-Monitor macros
---------------
-
-The rv/da_monitor.h enables automatic code generation for the *Monitor
-Instance(s)* using C macros.
-
-The benefits of the usage of macro for monitor synthesis are 3-fold as it:
-
-- Reduces the code duplication;
-- Facilitates the bug fix/improvement;
-- Avoids the case of developers changing the core of the monitor code
- to manipulate the model in a (let's say) non-standard way.
-
-This initial implementation presents three different types of monitor instances:
-
-- ``#define DECLARE_DA_MON_GLOBAL(name, type)``
-- ``#define DECLARE_DA_MON_PER_CPU(name, type)``
-- ``#define DECLARE_DA_MON_PER_TASK(name, type)``
-
-The first declares the functions for a global deterministic automata monitor,
-the second for monitors with per-cpu instances, and the third with per-task
-instances.
-
-In all cases, the 'name' argument is a string that identifies the monitor, and
-the 'type' argument is the data type used by dot2k on the representation of
-the model in C.
-
-For example, the wip model with two states and three events can be
-stored in an 'unsigned char' type. Considering that the preemption control
-is a per-cpu behavior, the monitor declaration in the 'wip.c' file is::
-
- DECLARE_DA_MON_PER_CPU(wip, unsigned char);
-
-The monitor is executed by sending events to be processed via the functions
-presented below::
-
- da_handle_event_$(MONITOR_NAME)($(event from event enum));
- da_handle_start_event_$(MONITOR_NAME)($(event from event enum));
- da_handle_start_run_event_$(MONITOR_NAME)($(event from event enum));
-
-The function ``da_handle_event_$(MONITOR_NAME)()`` is the regular case where
-the event will be processed if the monitor is processing events.
-
-When a monitor is enabled, it is placed in the initial state of the automata.
-However, the monitor does not know if the system is in the *initial state*.
-
-The ``da_handle_start_event_$(MONITOR_NAME)()`` function is used to notify the
-monitor that the system is returning to the initial state, so the monitor can
-start monitoring the next event.
-
-The ``da_handle_start_run_event_$(MONITOR_NAME)()`` function is used to notify
-the monitor that the system is known to be in the initial state, so the
-monitor can start monitoring and monitor the current event.
-
-Using the wip model as example, the events "preempt_disable" and
-"sched_waking" should be sent to monitor, respectively, via [2]::
-
- da_handle_event_wip(preempt_disable_wip);
- da_handle_event_wip(sched_waking_wip);
-
-While the event "preempt_enabled" will use::
-
- da_handle_start_event_wip(preempt_enable_wip);
-
-To notify the monitor that the system will be returning to the initial state,
-so the system and the monitor should be in sync.
-
-Final remarks
--------------
-
-With the monitor synthesis in place using the rv/da_monitor.h and
-dot2k, the developer's work should be limited to the instrumentation
-of the system, increasing the confidence in the overall approach.
-
-[1] For details about deterministic automata format and the translation
-from one representation to another, see::
-
- Documentation/trace/rv/deterministic_automata.rst
-
-[2] dot2k appends the monitor's name suffix to the events enums to
-avoid conflicting variables when exporting the global vmlinux.h
-use by BPF programs.
diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
index 15fa966102c0..a2812ac5cfeb 100644
--- a/Documentation/trace/rv/index.rst
+++ b/Documentation/trace/rv/index.rst
@@ -8,7 +8,10 @@ Runtime Verification
runtime-verification.rst
deterministic_automata.rst
- da_monitor_synthesis.rst
+ linear_temporal_logic.rst
+ monitor_synthesis.rst
da_monitor_instrumentation.rst
monitor_wip.rst
monitor_wwnr.rst
+ monitor_sched.rst
+ monitor_rtapp.rst
diff --git a/Documentation/trace/rv/linear_temporal_logic.rst b/Documentation/trace/rv/linear_temporal_logic.rst
new file mode 100644
index 000000000000..9eee09d9cacf
--- /dev/null
+++ b/Documentation/trace/rv/linear_temporal_logic.rst
@@ -0,0 +1,134 @@
+Linear temporal logic
+=====================
+
+Introduction
+------------
+
+Runtime verification monitoring is a verification technique which checks that the
+kernel follows a specification. It does so by using tracepoints to monitor the
+kernel's execution trace, and verifying that the execution trace satisfies the
+specification.
+
+Initially, the specification could only be written in the form of a deterministic
+automaton (DA). However, while attempting to implement DA monitors for some
+complex specifications, the deterministic automaton was found to be inappropriate
+as the specification language: the resulting automaton is complicated, hard to
+understand, and error-prone.
+
+Thus, RV monitors based on linear temporal logic (LTL) are introduced. This type
+of monitor uses LTL as specification instead of DA. For some cases, writing the
+specification as LTL is more concise and intuitive.
+
+Many materials explain LTL in detail. One book is::
+
+ Christel Baier and Joost-Pieter Katoen: Principles of Model Checking, The MIT
+ Press, 2008.
+
+Grammar
+-------
+
+Unlike some existing syntaxes, the kernel's implementation of LTL is more verbose.
+This is motivated by the consideration that the people who read the LTL
+specifications may not be well-versed in LTL.
+
+Grammar:
+ ltl ::= opd | ( ltl ) | ltl binop ltl | unop ltl
+
+Operands (opd):
+ true, false, user-defined names consisting of upper-case characters, digits,
+ and underscore.
+
+Unary Operators (unop):
+ always
+ eventually
+ next
+ not
+
+Binary Operators (binop):
+ until
+ and
+ or
+ imply
+ equivalent
+
+This grammar is ambiguous: operator precedence is not defined. Parentheses must
+be used.
+
+Example linear temporal logic
+-----------------------------
+.. code-block::
+
+ RAIN imply (GO_OUTSIDE imply HAVE_UMBRELLA)
+
+means: if it is raining, going outside means having an umbrella.
+
+.. code-block::
+
+ RAIN imply (WET until not RAIN)
+
+means: if it is raining, it is going to be wet until the rain stops.
+
+.. code-block::
+
+ RAIN imply eventually not RAIN
+
+means: if it is raining, rain will eventually stop.
+
+The above examples refer to the current time instant only. For kernel
+verification, the `always` operator is usually desirable, to specify that
+something is true now and at all future times. For example::
+
+ always (RAIN imply eventually not RAIN)
+
+means: *all* rain eventually stops.
+
+In the above examples, `RAIN`, `GO_OUTSIDE`, `HAVE_UMBRELLA` and `WET` are the
+"atomic propositions".
+
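To make the meaning of the operators above concrete, here is a small sketch that evaluates a formula over a *finite* recorded trace. This is a simplification for illustration and an assumption of this sketch: LTL is defined over infinite traces and the kernel monitors do not work this way. Formulas are written as nested tuples, and each trace step is the set of atomic propositions that are true at that step; ``holds`` and the tuple encoding are hypothetical. The trace must not be empty.

```python
def holds(formula, trace, i=0):
    """Evaluate 'formula' at step i of a finite, non-empty trace.

    'trace' is a list of sets of true atomic propositions. Operators that
    look into the future are bounded by the end of the recorded trace.
    """
    if isinstance(formula, bool):
        return formula
    if isinstance(formula, str):          # atomic proposition
        return formula in trace[i]
    op, *args = formula
    if op == "not":
        return not holds(args[0], trace, i)
    if op == "and":
        return holds(args[0], trace, i) and holds(args[1], trace, i)
    if op == "or":
        return holds(args[0], trace, i) or holds(args[1], trace, i)
    if op == "imply":
        return (not holds(args[0], trace, i)) or holds(args[1], trace, i)
    if op == "equivalent":
        return holds(args[0], trace, i) == holds(args[1], trace, i)
    if op == "next":
        return i + 1 < len(trace) and holds(args[0], trace, i + 1)
    if op == "always":
        return all(holds(args[0], trace, j) for j in range(i, len(trace)))
    if op == "eventually":
        return any(holds(args[0], trace, j) for j in range(i, len(trace)))
    if op == "until":
        for j in range(i, len(trace)):
            if holds(args[1], trace, j):
                return True               # right operand reached
            if not holds(args[0], trace, j):
                return False              # left operand broke first
        return False                      # right operand never happened
    raise ValueError("unknown operator: %r" % (op,))


# always (RAIN imply eventually not RAIN)
ALL_RAIN_STOPS = ("always", ("imply", "RAIN", ("eventually", ("not", "RAIN"))))

print(holds(ALL_RAIN_STOPS, [{"RAIN"}, {"RAIN"}, set()]))
print(holds(ALL_RAIN_STOPS, [set(), {"RAIN"}]))
```

With the first trace the rain stops in the last step, so the rule holds; in the second trace the recording ends while it is still raining, so under this finite-trace reading the rule is violated.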
+Monitor synthesis
+-----------------
+
+To synthesize an LTL into a kernel monitor, the `rvgen` tool can be used:
+`tools/verification/rvgen`. The specification needs to be provided as a file,
+and it must have a "RULE = LTL" assignment. For example::
+
+ RULE = always (ACQUIRE imply ((not KILLED and not CRASHED) until RELEASE))
+
+which says: if `ACQUIRE`, then `RELEASE` must happen before `KILLED` or
+`CRASHED`.
+
+The LTL can be broken down using sub-expressions. The above is equivalent to:
+
+ .. code-block::
+
+ RULE = always (ACQUIRE imply (ALIVE until RELEASE))
+ ALIVE = not KILLED and not CRASHED
+
+From this specification, `rvgen` generates the C implementation of a Buchi
+automaton - a non-deterministic state machine which checks the satisfiability of
+the LTL. See Documentation/trace/rv/monitor_synthesis.rst for details on using
+`rvgen`.
+
+References
+----------
+
+One book covering model checking and linear temporal logic is::
+
+ Christel Baier and Joost-Pieter Katoen: Principles of Model Checking, The MIT
+ Press, 2008.
+
+For an example of using linear temporal logic in software testing, see::
+
+ Ruijie Meng, Zhen Dong, Jialin Li, Ivan Beschastnikh, and Abhik Roychoudhury.
+ 2022. Linear-time temporal logic guided greybox fuzzing. In Proceedings of the
+ 44th International Conference on Software Engineering (ICSE '22). Association
+ for Computing Machinery, New York, NY, USA, 1343–1355.
+ https://doi.org/10.1145/3510003.3510082
+
+The kernel's LTL monitor implementation is based on::
+
+ Gerth, R., Peled, D., Vardi, M.Y., Wolper, P. (1996). Simple On-the-fly
+ Automatic Verification of Linear Temporal Logic. In: Dembiński, P., Średniawa,
+ M. (eds) Protocol Specification, Testing and Verification XV. PSTV 1995. IFIP
+ Advances in Information and Communication Technology. Springer, Boston, MA.
+ https://doi.org/10.1007/978-0-387-34892-6_1
diff --git a/Documentation/trace/rv/monitor_rtapp.rst b/Documentation/trace/rv/monitor_rtapp.rst
new file mode 100644
index 000000000000..c8104eda924a
--- /dev/null
+++ b/Documentation/trace/rv/monitor_rtapp.rst
@@ -0,0 +1,133 @@
+Real-time application monitors
+==============================
+
+- Name: rtapp
+- Type: container for multiple monitors
+- Author: Nam Cao <namcao@linutronix.de>
+
+Description
+-----------
+
+Real-time applications may have design flaws such that they experience
+unexpected latency and fail to meet their time requirements. Often, these flaws
+follow a few patterns:
+
+ - Page faults: A real-time thread may access memory that does not have a
+ mapped physical backing or must first be copied (such as for copy-on-write).
+ Thus a page fault is raised and the kernel must first perform the expensive
+ action. This causes significant delays to the real-time thread.
+ - Priority inversion: A real-time thread blocks waiting for a lower-priority
+ thread. This causes the real-time thread to effectively take on the
+ scheduling priority of the lower-priority thread. For example, the real-time
+ thread needs to access a shared resource that is protected by a
+ non-pi-mutex, but the mutex is currently owned by a non-real-time thread.
+
+The `rtapp` monitor detects these patterns. It helps developers identify the
+reasons for unexpected latency in real-time applications. It is a container of
+multiple sub-monitors described in the following sections.
+
+Monitor pagefault
++++++++++++++++++
+
+The `pagefault` monitor reports real-time tasks raising page faults. Its
+specification is::
+
+ RULE = always (RT imply not PAGEFAULT)
+
+To fix warnings reported by this monitor, `mlockall()` or `mlock()` can be used
+to ensure physical backing for memory.
+
+This monitor may have false negatives because the pages used by the real-time
+threads may just happen to be directly available during testing. To minimize
+this, the system can be put under memory pressure (e.g. invoking the OOM killer
+using a program that does `ptr = malloc(SIZE_OF_RAM); memset(ptr, 0,
+SIZE_OF_RAM);`) so that the kernel executes aggressive strategies to recycle as
+much physical memory as possible.
+
+Monitor sleep
++++++++++++++
+
+The `sleep` monitor reports real-time threads sleeping in a manner that may
+cause undesirable latency. Real-time applications should only put a real-time
+thread to sleep for one of the following reasons:
+
+ - Cyclic work: real-time thread sleeps waiting for the next cycle. For this
+ case, only the `clock_nanosleep` syscall should be used with `TIMER_ABSTIME`
+ (to avoid time drift) and `CLOCK_MONOTONIC` (to avoid the clock being
+ changed). No other method is safe for real-time. For example, threads
+ waiting for timerfd can be woken by softirq which provides no real-time
+ guarantee.
+ - Real-time thread waiting for something to happen (e.g. another thread
+ releasing shared resources, or a completion signal from another thread). In
+ this case, only futexes (FUTEX_LOCK_PI, FUTEX_LOCK_PI2 or one of
+ FUTEX_WAIT_*) should be used. Applications usually do not use futexes
+ directly, but use PI mutexes and PI condition variables which are built on
+ top of futexes. Be aware that the C library might not implement condition
+ variables as safe for real-time. As an alternative, the librtpi library
+ exists to provide a condition variable implementation that is correct for
+ real-time applications in Linux.
+
+Besides the reason for sleeping, the eventual waker should also be
+real-time-safe. Namely, one of:
+
+ - An equal-or-higher-priority thread
+ - Hard interrupt handler
+ - Non-maskable interrupt handler
+
+This monitor's warning usually means one of the following:
+
+ - Real-time thread is blocked by a non-real-time thread (e.g. due to
+ contention on a mutex without priority inheritance). This is priority
+ inversion.
+ - Time-critical work waits for something which is not safe for real-time (e.g.
+ timerfd).
+ - The work executed by the real-time thread does not need to run at real-time
+ priority at all. This is not a problem for the real-time thread itself, but
+ it is potentially taking the CPU away from other important real-time work.
+
+Application developers may purposely choose to have their real-time application
+sleep in a way that is not safe for real-time. It is debatable whether that is a
+problem. Application developers must analyze the warnings to make a proper
+assessment.
+
+The monitor's specification is::
+
+ RULE = always ((RT and SLEEP) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
+
+ RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
+ and ((not WAKE) until RT_FRIENDLY_WAKE)
+
+ RT_VALID_SLEEP_REASON = FUTEX_WAIT
+ or RT_FRIENDLY_NANOSLEEP
+
+ RT_FRIENDLY_NANOSLEEP = CLOCK_NANOSLEEP
+ and NANOSLEEP_TIMER_ABSTIME
+ and NANOSLEEP_CLOCK_MONOTONIC
+
+ RT_FRIENDLY_WAKE = WOKEN_BY_EQUAL_OR_HIGHER_PRIO
+ or WOKEN_BY_HARDIRQ
+ or WOKEN_BY_NMI
+ or KTHREAD_SHOULD_STOP
+
+ ALLOWLIST = BLOCK_ON_RT_MUTEX
+ or FUTEX_LOCK_PI
+ or TASK_IS_RCU
+ or TASK_IS_MIGRATION
+
+Besides the scenarios described above, this specification also handles some
+special cases:
+
+ - `KERNEL_THREAD`: kernel tasks do not have any pattern that can be recognized
+ as valid real-time sleeping reasons. Therefore the sleeping reason is not
+ checked for kernel tasks.
+ - `KTHREAD_SHOULD_STOP`: a non-real-time thread may stop a real-time kernel
+ thread by waking it and waiting for it to exit (`kthread_stop()`). This
+ wakeup is safe for real-time.
+ - `ALLOWLIST`: to handle known false positives with the kernel.
+ - `BLOCK_ON_RT_MUTEX` is included in the allowlist due to its implementation.
+ In the release path of rt_mutex, a boosted task is de-boosted before waking
+ the rt_mutex's waiter. Consequently, the monitor may see a real-time-unsafe
+ wakeup (e.g. non-real-time task waking real-time task). This is actually
+ real-time-safe because preemption is disabled for the duration.
+ - `FUTEX_LOCK_PI` is included in the allowlist for the same reason as
+ `BLOCK_ON_RT_MUTEX`.
diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst
new file mode 100644
index 000000000000..3f8381ad9ec7
--- /dev/null
+++ b/Documentation/trace/rv/monitor_sched.rst
@@ -0,0 +1,402 @@
+Scheduler monitors
+==================
+
+- Name: sched
+- Type: container for multiple monitors
+- Author: Gabriele Monaco <gmonaco@redhat.com>, Daniel Bristot de Oliveira <bristot@kernel.org>
+
+Description
+-----------
+
+Monitors describing complex systems, such as the scheduler, can easily grow to
+the point where they are just hard to understand because of the many possible
+state transitions.
+Often it is possible to break such descriptions into smaller monitors,
+sharing some or all events. Enabling those smaller monitors concurrently is,
+in fact, testing the system as if we had one single larger monitor.
+Splitting a model into multiple specifications not only makes it easier to
+understand, but also gives more clues when we see errors.
+
+The sched monitor is a set of specifications to describe the scheduler behaviour.
+It includes several per-cpu and per-task monitors that work independently to verify
+different specifications the scheduler should follow.
+
+To make this system as straightforward as possible, sched specifications are *nested*
+monitors, whereas sched itself is a *container*.
+From the interface perspective, sched includes other monitors as sub-directories.
+Enabling/disabling sched or setting reactors on it propagates the change to all
+monitors; however, single monitors can be used independently as well.
+
+It is important that future modules are built after their container (sched, in
+this case), otherwise the linker would not respect the order and the nesting
+wouldn't work as expected.
+To do so, simply add them after sched in the Makefile.
+
+Specifications
+--------------
+
+The specifications included in sched are currently a work in progress, adapting the ones
+defined by Daniel Bristot in [1].
+
+Currently the following are included:
+
+Monitor sco
+~~~~~~~~~~~
+
+The scheduling context operations (sco) monitor ensures changes in a task state
+happen only in thread context::
+
+
+ |
+ |
+ v
+ sched_set_state +------------------+
+ +------------------ | |
+ | | thread_context |
+ +-----------------> | | <+
+ +------------------+ |
+ | |
+ | schedule_entry | schedule_exit
+ v |
+ |
+ scheduling_context -+
+
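As a rough illustration of how a diagram like the one above maps to code, the following user-space sketch encodes the sco automaton as a transition table. The names ``sco_step``, ``sco_transition`` and ``SCO_INVALID`` are hypothetical and are not what rvgen generates; only the states and events come from the diagram.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space encoding of the sco automaton (not rvgen output). */
enum sco_state { THREAD_CONTEXT, SCHEDULING_CONTEXT, SCO_STATE_MAX };
enum sco_event { SCHED_SET_STATE, SCHEDULE_ENTRY, SCHEDULE_EXIT, SCO_EVENT_MAX };

#define SCO_INVALID (-1)

/* sco_transition[state][event] is the next state, or SCO_INVALID. */
static const int sco_transition[SCO_STATE_MAX][SCO_EVENT_MAX] = {
	[THREAD_CONTEXT]     = { THREAD_CONTEXT, SCHEDULING_CONTEXT, SCO_INVALID },
	[SCHEDULING_CONTEXT] = { SCO_INVALID,    SCO_INVALID,        THREAD_CONTEXT },
};

/*
 * Feed one event to the automaton. Returns true and updates *state when the
 * transition is allowed; returns false (a violation) and leaves *state
 * untouched otherwise.
 */
bool sco_step(int *state, enum sco_event event)
{
	int next = sco_transition[*state][event];

	if (next == SCO_INVALID)
		return false;
	*state = next;
	return true;
}

void sco_demo(void)
{
	int state = THREAD_CONTEXT;

	assert(sco_step(&state, SCHED_SET_STATE));   /* allowed in thread context */
	assert(sco_step(&state, SCHEDULE_ENTRY));
	assert(!sco_step(&state, SCHED_SET_STATE));  /* violation while scheduling */
	assert(sco_step(&state, SCHEDULE_EXIT));
	assert(state == THREAD_CONTEXT);
}
```

The other monitors in this file follow the same pattern with different states and events; only the table changes.
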
+Monitor snroc
+~~~~~~~~~~~~~
+
+The set non runnable on its own context (snroc) monitor ensures changes in a
+task state happen only in the respective task's context. This is a per-task
+monitor::
+
+ |
+ |
+ v
+ +------------------+
+ | other_context | <+
+ +------------------+ |
+ | |
+ | sched_switch_in | sched_switch_out
+ v |
+ sched_set_state |
+ +------------------ |
+ | own_context |
+ +-----------------> -+
+
+Monitor scpd
+~~~~~~~~~~~~
+
+The schedule called with preemption disabled (scpd) monitor ensures schedule is
+called with preemption disabled::
+
+ |
+ |
+ v
+ +------------------+
+ | cant_sched | <+
+ +------------------+ |
+ | |
+ | preempt_disable | preempt_enable
+ v |
+ schedule_entry |
+ schedule_exit |
+ +----------------- can_sched |
+ | |
+ +----------------> -+
+
+Monitor snep
+~~~~~~~~~~~~
+
+The schedule does not enable preempt (snep) monitor ensures a schedule call
+does not enable preemption::
+
+ |
+ |
+ v
+ preempt_disable +------------------------+
+ preempt_enable | |
+ +------------------ | non_scheduling_context |
+ | | |
+ +-----------------> | | <+
+ +------------------------+ |
+ | |
+ | schedule_entry | schedule_exit
+ v |
+ |
+ scheduling_contex -+
+
+Monitor sts
+~~~~~~~~~~~
+
+The schedule implies task switch (sts) monitor ensures a task switch happens
+only in scheduling context and up to once, as well as scheduling occurs with
+interrupts enabled but no task switch can happen before interrupts are
+disabled. When the next task picked for execution is the same as the previously
+running one, no real task switch occurs but interrupts are disabled nonetheless::
+
+ irq_entry |
+ +----+ |
+ v | v
+ +------------+ irq_enable #===================# irq_disable
+ | | ------------> H H irq_entry
+ | cant_sched | <------------ H H irq_enable
+ | | irq_disable H can_sched H --------------+
+ +------------+ H H |
+ H H |
+ +---------------> H H <-------------+
+ | #===================#
+ | |
+ schedule_exit | schedule_entry
+ | v
+ | +-------------------+ irq_enable
+ | | scheduling | <---------------+
+ | +-------------------+ |
+ | | |
+ | | irq_disable +--------+ irq_entry
+ | v | | --------+
+ | +-------------------+ irq_entry | in_irq | |
+ | | | -----------> | | <-------+
+ | | disable_to_switch | +--------+
+ | | | --+
+ | +-------------------+ |
+ | | |
+ | | sched_switch |
+ | v |
+ | +-------------------+ |
+ | | switching | | irq_enable
+ | +-------------------+ |
+ | | |
+ | | irq_enable |
+ | v |
+ | +-------------------+ |
+ +-- | enable_to_exit | <-+
+ +-------------------+
+ ^ | irq_disable
+ | | irq_entry
+ +---------------+ irq_enable
+
+Monitor nrp
+~~~~~~~~~~~
+
+The need resched preempts (nrp) monitor ensures preemption requires
+``need_resched``. Only kernel preemption is considered, since preemption
+while returning to userspace, for this monitor, is indistinguishable from
+``sched_switch_yield`` (described in the sssw monitor).
+A kernel preemption occurs whenever ``__schedule`` is called with the preemption
+flag set to true (e.g. from preempt_enable or exiting from interrupts). This
+type of preemption occurs after the need for ``rescheduling`` has been set.
+This is not valid for the *lazy* variant of the flag, which causes only
+userspace preemption.
+A ``schedule_entry_preempt`` may or may not involve a task switch; in the latter
+case, a task goes through the scheduler from a preemption context but it is
+picked as the next task to run. Since the scheduler runs, this clears the need
+to reschedule. The ``any_thread_running`` state does not imply the monitored
+task is not running as this monitor does not track the outcome of scheduling.
+
+In theory, a preemption can only occur after the ``need_resched`` flag is set. In
+practice, however, it is possible to see a preemption where the flag is not
+set. This can happen in one specific condition::
+
+ need_resched
+ preempt_schedule()
+ preempt_schedule_irq()
+ __schedule()
+ !need_resched
+ __schedule()
+
+In the situation above, standard preemption starts (e.g. from preempt_enable
+when the flag is set), an interrupt occurs before scheduling and, on its exit
+path, it schedules, which clears the ``need_resched`` flag.
+When the preempted task runs again, the standard preemption started earlier
+resumes, although the flag is no longer set. The monitor considers this a
+``nested_preemption``, which allows another preemption without re-setting the
+flag. This condition relaxes the monitor constraints and may produce false
+negatives (i.e. no real ``nested_preemptions``) but makes the monitor more
+robust and able to validate other scenarios.
+For simplicity, the monitor starts in ``preempt_irq``, although no interrupt
+occurred, as the situation above is hard to pinpoint::
+
+ schedule_entry
+ irq_entry #===========================================#
+ +-------------------------- H H
+ | H H
+ +-------------------------> H any_thread_running H
+ H H
+ +-------------------------> H H
+ | #===========================================#
+ | schedule_entry | ^
+ | schedule_entry_preempt | sched_need_resched | schedule_entry
+ | | schedule_entry_preempt
+ | v |
+ | +----------------------+ |
+ | +--- | | |
+ | sched_need_resched | | rescheduling | -+
+ | +--> | |
+ | +----------------------+
+ | | irq_entry
+ | v
+ | +----------------------+
+ | | | ---+
+ | ---> | | | sched_need_resched
+ | | preempt_irq | | irq_entry
+ | | | <--+
+ | | | <--+
+ | +----------------------+ |
+ | | schedule_entry | sched_need_resched
+ | | schedule_entry_preempt |
+ | v |
+ | +-----------------------+ |
+ +-------------------------- | nested_preempt | --+
+ +-----------------------+
+ ^ irq_entry |
+ +-------------------+
+
+Due to how the ``need_resched`` flag on the preemption count works on arm64,
+this monitor is unstable on that architecture, as it often records preemption
+when the flag is not set, even in the presence of the workaround above.
+For the time being, the monitor is disabled by default on arm64.
+
+Monitor sssw
+~~~~~~~~~~~~
+
+The set state sleep and wakeup (sssw) monitor ensures ``set_state`` to
+sleepable leads to sleeping and sleeping tasks require wakeup. It includes the
+following types of switch:
+
+* ``switch_suspend``:
+ a task puts itself to sleep; this can happen only after explicitly setting
+ the task to ``sleepable``. After a task is suspended, it needs to be woken up
+ (``waking`` state) before being switched in again.
+ Setting the task's state to ``sleepable`` can be reverted before switching if it
+ is woken up or set to ``runnable``.
+* ``switch_blocking``:
+ a special case of a ``switch_suspend`` where the task is waiting on a
+ sleeping RT lock (``PREEMPT_RT`` only). It is common to see wakeup and set
+ state events racing with each other, which leads the model to perceive this
+ type of switch when the task is not set to sleepable. This is a limitation of
+ the model on SMP systems, and workarounds may slow down the system.
+* ``switch_preempt``:
+ a task switch as a result of kernel preemption (``schedule_entry_preempt`` in
+ the nrp model).
+* ``switch_yield``:
+ a task explicitly calls the scheduler or is preempted while returning to
+ userspace. It can happen after a ``yield`` system call, from the idle task or
+ if the ``need_resched`` flag is set. By definition, a task cannot yield while
+ ``sleepable`` as that would be a suspension. A special case of a yield occurs
+ when a task in ``TASK_INTERRUPTIBLE`` calls the scheduler while a signal is
+ pending. The task doesn't go through the usual blocking/waking and is set
+ back to runnable; the resulting switch (if any) looks like a yield to the
+ ``signal_wakeup`` state and is followed by the signal delivery. From this
+ state, the monitor expects a signal even if it sees a wakeup event, although
+ not necessary, to rule out false negatives.
+
+This monitor doesn't include a running state; ``sleepable`` and ``runnable``
+refer only to the task's desired state, which could be scheduled out
+(e.g. due to preemption). However, it does include the event
+``sched_switch_in`` to represent when a task is allowed to become running. This
+can be triggered also by preemption, but cannot occur after the task got to
+``sleeping`` before a ``wakeup`` occurs::
+
+ +--------------------------------------------------------------------------+
+ | |
+ | |
+ | switch_suspend | |
+ | switch_blocking | |
+ v v |
+ +----------+ #==========================# set_state_runnable |
+ | | H H wakeup |
+ | | H H switch_in |
+ | | H H switch_yield |
+ | sleeping | H H switch_preempt |
+ | | H H signal_deliver |
+ | | switch_ H H ------+ |
+ | | _blocking H runnable H | |
+ | | <----------- H H <-----+ |
+ +----------+ H H |
+ | wakeup H H |
+ +---------------------> H H |
+ H H |
+ +---------> H H |
+ | #==========================# |
+ | | ^ |
+ | | | set_state_runnable |
+ | | | wakeup |
+ | set_state_sleepable | +------------------------+
+ | v | |
+ | +--------------------------+ set_state_sleepable
+ | | | switch_in
+ | | | switch_preempt
+ signal_deliver | sleepable | signal_deliver
+ | | | ------+
+ | | | |
+ | | | <-----+
+ | +--------------------------+
+ | | ^
+ | switch_yield | set_state_sleepable
+ | v |
+ | +---------------+ |
+ +---------- | signal_wakeup | -+
+ +---------------+
+ ^ | switch_in
+ | | switch_preempt
+ | | switch_yield
+ +-----------+ wakeup
+
+Monitor opid
+~~~~~~~~~~~~
+
+The operations with preemption and irq disabled (opid) monitor ensures
+operations like ``wakeup`` and ``need_resched`` occur with interrupts and
+preemption disabled or during interrupt context; in the latter case, preemption
+may not be disabled explicitly.
+``need_resched`` can be set by some RCU internal functions, in which case it
+doesn't match a task wakeup and might occur with only interrupts disabled::
+
+ | sched_need_resched
+ | sched_waking
+ | irq_entry
+ | +--------------------+
+ v v |
+ +------------------------------------------------------+
+ +----------- | disabled | <+
+ | +------------------------------------------------------+ |
+ | | ^ |
+ | | preempt_disable sched_need_resched |
+ | preempt_enable | +--------------------+ |
+ | v | v | |
+ | +------------------------------------------------------+ |
+ | | irq_disabled | |
+ | +------------------------------------------------------+ |
+ | | | ^ |
+ | irq_entry irq_entry | | |
+ | sched_need_resched v | irq_disable |
+ | sched_waking +--------------+ | | |
+ | +----- | | irq_enable | |
+ | | | in_irq | | | |
+ | +----> | | | | |
+ | +--------------+ | | irq_disable
+ | | | | |
+ | irq_enable | irq_enable | | |
+ | v v | |
+ | #======================================================# |
+ | H enabled H |
+ | #======================================================# |
+ | | ^ ^ preempt_enable | |
+ | preempt_disable preempt_enable +--------------------+ |
+ | v | |
+ | +------------------+ | |
+ +----------> | preempt_disabled | -+ |
+ +------------------+ |
+ | |
+ +-------------------------------------------------------+
+
+This monitor is designed to work on ``PREEMPT_RT`` kernels. The special case of
+events occurring in interrupt context is a shortcut to identify valid scenarios
+where the preemption tracepoints might not be visible: during interrupts,
+preemption is always disabled. On non-``PREEMPT_RT`` kernels, the interrupts
+might invoke a softirq to set ``need_resched`` and wake up a task. This is
+another special case that is currently not supported by the monitor.
+
+References
+----------
+
+[1] - https://bristot.me/linux-task-model
diff --git a/Documentation/trace/rv/monitor_synthesis.rst b/Documentation/trace/rv/monitor_synthesis.rst
new file mode 100644
index 000000000000..ac808a7554f5
--- /dev/null
+++ b/Documentation/trace/rv/monitor_synthesis.rst
@@ -0,0 +1,271 @@
+Runtime Verification Monitor Synthesis
+======================================
+
+The starting point for the application of runtime verification (RV) techniques
+is the *specification* or *modeling* of the desired (or undesired) behavior
+of the system under scrutiny.
+
+The formal representation then needs to be *synthesized* into a *monitor*
+that can be used in the analysis of the trace of the system. The
+*monitor* connects to the system via an *instrumentation* that converts
+the events from the *system* to the events of the *specification*.
+
+
+In Linux terms, the runtime verification monitors are encapsulated inside
+the *RV monitor* abstraction. The RV monitor includes a set of instances
+of the monitor (per-cpu monitor, per-task monitor, and so on), the helper
+functions that glue the monitor to the system reference model, and the
+trace output as a reaction to event parsing and exceptions, as depicted
+below::
+
+ Linux +----- RV Monitor ----------------------------------+ Formal
+ Realm | | Realm
+ +-------------------+ +----------------+ +-----------------+
+ | Linux kernel | | Monitor | | Reference |
+ | Tracing | -> | Instance(s) | <- | Model |
+ | (instrumentation) | | (verification) | | (specification) |
+ +-------------------+ +----------------+ +-----------------+
+ | | |
+ | V |
+ | +----------+ |
+ | | Reaction | |
+ | +--+--+--+-+ |
+ | | | | |
+ | | | +-> trace output ? |
+ +------------------------|--|----------------------+
+ | +----> panic ?
+ +-------> <user-specified>
+
+RV monitor synthesis
+--------------------
+
+The synthesis of a specification into the Linux *RV monitor* abstraction is
+automated by the rvgen tool and the header files containing common code for
+creating monitors. The header files are:
+
+ * rv/da_monitor.h for deterministic automaton monitor.
+ * rv/ltl_monitor.h for linear temporal logic monitor.
+
+rvgen
+-----
+
+The rvgen utility converts a specification into its C representation and creates
+the skeleton of a kernel monitor in C.
+
+For example, it is possible to transform the wip.dot model present in
+[1] into a per-cpu monitor with the following command::
+
+ $ rvgen monitor -c da -s wip.dot -t per_cpu
+
+This will create a directory named wip/ with the following files:
+
+- wip.h: the wip model in C
+- wip.c: the RV monitor
+
+The wip.c file contains the monitor declaration and the starting point for
+the system instrumentation.
+
+Similarly, a linear temporal logic monitor can be generated with the following
+command::
+
+ $ rvgen monitor -c ltl -s pagefault.ltl -t per_task
+
+This generates the pagefault/ directory with:
+
+- pagefault.h: The Buchi automaton (the non-deterministic state machine to
+ verify the specification)
+- pagefault.c: The skeleton for the RV monitor
+
+Monitor header files
+--------------------
+
+The header files:
+
+- `rv/da_monitor.h` for deterministic automaton monitor
+- `rv/ltl_monitor.h` for linear temporal logic monitor
+
+include common macros and static functions for implementing *Monitor
+Instance(s)*.
+
+The benefits of having all common functionalities in a single header file are
+threefold:
+
+ - Reduce code duplication;
+ - Facilitate bug fixes and improvements;
+ - Avoid the case of developers changing the core of the monitor code to
+ manipulate the model in a (let's say) non-standard way.
+
+rv/da_monitor.h
++++++++++++++++
+
+This initial implementation presents three different types of monitor instances:
+
+- ``#define DECLARE_DA_MON_GLOBAL(name, type)``
+- ``#define DECLARE_DA_MON_PER_CPU(name, type)``
+- ``#define DECLARE_DA_MON_PER_TASK(name, type)``
+
+The first declares the functions for a global deterministic automaton monitor,
+the second for monitors with per-cpu instances, and the third with per-task
+instances.
+
+In all cases, the 'name' argument is a string that identifies the monitor, and
+the 'type' argument is the data type used by rvgen on the representation of
+the model in C.
+
+For example, the wip model with two states and three events can be
+stored in an 'unsigned char' type. Considering that the preemption control
+is a per-cpu behavior, the monitor declaration in the 'wip.c' file is::
+
+ DECLARE_DA_MON_PER_CPU(wip, unsigned char);
+
+The monitor is executed by sending events to be processed via the functions
+presented below::
+
+ da_handle_event_$(MONITOR_NAME)($(event from event enum));
+ da_handle_start_event_$(MONITOR_NAME)($(event from event enum));
+ da_handle_start_run_event_$(MONITOR_NAME)($(event from event enum));
+
+The function ``da_handle_event_$(MONITOR_NAME)()`` is the regular case where
+the event will be processed if the monitor is processing events.
+
+When a monitor is enabled, it is placed in the initial state of the automaton.
+However, the monitor does not know if the system is in the *initial state*.
+
+The ``da_handle_start_event_$(MONITOR_NAME)()`` function is used to notify the
+monitor that the system is returning to the initial state, so the monitor can
+start monitoring the next event.
+
+The ``da_handle_start_run_event_$(MONITOR_NAME)()`` function is used to notify
+the monitor that the system is known to be in the initial state, so the
+monitor can start monitoring and monitor the current event.
+
+Using the wip model as an example, the events "preempt_disable" and
+"sched_waking" should be sent to the monitor, respectively, via [2]::
+
+ da_handle_event_wip(preempt_disable_wip);
+ da_handle_event_wip(sched_waking_wip);
+
+The event "preempt_enable", in turn, uses::
+
+ da_handle_start_event_wip(preempt_enable_wip);
+
+This notifies the monitor that the system is returning to the initial state,
+so that the system and the monitor are in sync.
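
The difference between the regular and the start handlers can be sketched in user space as follows. This is only an illustration under assumptions: the ``wip_*`` helper names and the ``struct wip_monitor`` layout are hypothetical, the state names ``preemptive`` and ``non_preemptive`` are assumed from the wip model referenced in [1], and stopping monitoring after a violation is an assumption of this sketch rather than documented behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the wip automaton and its handlers. */
enum wip_state { PREEMPTIVE, NON_PREEMPTIVE, WIP_STATE_MAX };
enum wip_event { PREEMPT_DISABLE, PREEMPT_ENABLE, SCHED_WAKING, WIP_EVENT_MAX };

#define WIP_INVALID (-1)

/* wip_transition[state][event] is the next state, or WIP_INVALID. */
static const int wip_transition[WIP_STATE_MAX][WIP_EVENT_MAX] = {
	[PREEMPTIVE]     = { NON_PREEMPTIVE, WIP_INVALID, WIP_INVALID },
	[NON_PREEMPTIVE] = { WIP_INVALID,    PREEMPTIVE,  NON_PREEMPTIVE },
};

struct wip_monitor {
	bool monitoring;          /* false until monitor and system are in sync */
	int state;
	unsigned int violations;
};

/* Apply one event; on a violation, count it and wait for the next sync. */
static void wip_process(struct wip_monitor *mon, enum wip_event event)
{
	int next = wip_transition[mon->state][event];

	if (next == WIP_INVALID) {
		mon->violations++;
		mon->monitoring = false;   /* assumption: resync after an error */
		return;
	}
	mon->state = next;
}

/* Regular case: the event is processed only if monitoring already started. */
bool wip_handle_event(struct wip_monitor *mon, enum wip_event event)
{
	if (!mon->monitoring)
		return false;
	wip_process(mon, event);
	return true;
}

/* The system is returning to the initial state: start with the next event. */
bool wip_handle_start_event(struct wip_monitor *mon, enum wip_event event)
{
	if (!mon->monitoring) {
		mon->state = PREEMPTIVE;
		mon->monitoring = true;
		return false;              /* this event only synchronizes */
	}
	wip_process(mon, event);
	return true;
}

void wip_demo(void)
{
	struct wip_monitor mon = { 0 };

	/* Ignored: the monitor does not know the system state yet. */
	assert(!wip_handle_event(&mon, SCHED_WAKING));
	/* preempt_enable brings the system to the initial state. */
	assert(!wip_handle_start_event(&mon, PREEMPT_ENABLE));
	assert(wip_handle_event(&mon, PREEMPT_DISABLE));
	assert(wip_handle_event(&mon, SCHED_WAKING));
	assert(mon.violations == 0);
}
```

Both handlers return whether the event was actually checked against the model, which makes the synchronization step visible in the demo.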
+
+rv/ltl_monitor.h
+++++++++++++++++
+This file must be combined with the $(MODEL_NAME).h file (generated by `rvgen`)
+to be complete. For example, for the `pagefault` monitor, the `pagefault.c`
+source file must include::
+
+ #include "pagefault.h"
+ #include <rv/ltl_monitor.h>
+
+(the skeleton monitor file generated by `rvgen` already does this).
+
+`$(MODEL_NAME).h` (`pagefault.h` in the above example) includes the
+implementation of the Buchi automaton - a non-deterministic state machine that
+verifies the LTL specification, while `rv/ltl_monitor.h` includes the common
+helper functions to interact with the Buchi automaton and to implement an RV
+monitor. An important definition in `$(MODEL_NAME).h` is::
+
+ enum ltl_atom {
+ LTL_$(FIRST_ATOMIC_PROPOSITION),
+ LTL_$(SECOND_ATOMIC_PROPOSITION),
+ ...
+ LTL_NUM_ATOM
+ };
+
+which is the list of atomic propositions present in the LTL specification
+(prefixed with "LTL\_" to avoid name collision). This `enum` is passed to the
+functions interacting with the Buchi automaton.
+
+While generating code, `rvgen` cannot understand the meaning of the atomic
+propositions. Thus, that task is left for manual work. The recommended practice
+is to add tracepoints to places where the atomic propositions change and, in the
+tracepoints' handlers, execute the Buchi automaton using::
+
+ void ltl_atom_update(struct task_struct *task, enum ltl_atom atom, bool value)
+
+which tells the Buchi automaton that the atomic proposition `atom` is now
+`value`. The Buchi automaton checks whether the LTL specification is still
+satisfied, and invokes the monitor's error tracepoint and the reactor if
+a violation is detected.
+
+Tracepoints and `ltl_atom_update()` should be used whenever possible. However,
+this is sometimes not the most convenient approach. For some atomic propositions
+which are changed in multiple places in the kernel, it is cumbersome to trace all
+those places. Furthermore, it may not be important that the atomic propositions
+are updated at precise times. For example, consider the following linear temporal
+logic::
+
+ RULE = always (RT imply not PAGEFAULT)
+
+This LTL states that a real-time task does not raise page faults. For this
+specification, it is not important when `RT` changes, as long as it has the
+correct value when `PAGEFAULT` is true. Motivated by this case, another
+function is introduced::
+
+ void ltl_atom_fetch(struct task_struct *task, struct ltl_monitor *mon)
+
+This function is called whenever the Buchi automaton is triggered. Therefore, it
+can be manually implemented to "fetch" `RT`::
+
+ void ltl_atom_fetch(struct task_struct *task, struct ltl_monitor *mon)
+ {
+ ltl_atom_set(mon, LTL_RT, rt_task(task));
+ }
+
+Effectively, whenever `PAGEFAULT` is updated with a call to `ltl_atom_update()`,
+`RT` is also fetched. Thus, the LTL specification can be verified without
+tracing `RT` everywhere.
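
To make the meaning of the rule concrete, here is a small user-space sketch that evaluates ``always (RT imply not PAGEFAULT)`` over a finite sequence of observations. It does not use the Buchi automaton or any kernel API; ``struct valuation`` and both functions are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One observation of the two atomic propositions used by the rule. */
struct valuation {
	bool rt;
	bool pagefault;
};

/* "RT imply not PAGEFAULT" for a single observation. */
bool rule_holds_now(struct valuation v)
{
	return !v.rt || !v.pagefault;
}

/* "always (...)" over a finite trace: every observation must satisfy it. */
bool trace_satisfies_rule(const struct valuation *trace, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!rule_holds_now(trace[i]))
			return false;
	return true;
}

void pagefault_rule_demo(void)
{
	const struct valuation trace[] = {
		{ .rt = false, .pagefault = true  },  /* non-RT task may fault */
		{ .rt = true,  .pagefault = false },  /* RT task, no fault */
		{ .rt = true,  .pagefault = true  },  /* RT task faults: violation */
	};

	assert(trace_satisfies_rule(trace, 2));   /* the prefix is fine */
	assert(!trace_satisfies_rule(trace, 3));  /* the last entry breaks it */
}
```

Note that the value of ``rt`` only matters in observations where ``pagefault`` is true, which is why fetching `RT` lazily is sufficient for this specification.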
+
+Atomic propositions which act like events usually need to be set (or
+cleared) and then immediately cleared (or set). A convenient function is
+provided::
+
+ void ltl_atom_pulse(struct task_struct *task, enum ltl_atom atom, bool value)
+
+which is equivalent to::
+
+ ltl_atom_update(task, atom, value);
+ ltl_atom_update(task, atom, !value);
+
+To initialize the atomic propositions, the following function must be
+implemented::
+
+ void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bool task_creation)
+
+This function is called for all running tasks when the monitor is enabled. It is
+also called for new tasks created after enabling the monitor. It should
+initialize as many atomic propositions as possible, for example::
+
+ void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bool task_creation)
+ {
+     ltl_atom_set(mon, LTL_RT, rt_task(task));
+     if (task_creation)
+         ltl_atom_set(mon, LTL_PAGEFAULT, false);
+ }
+
+Atomic propositions not initialized by `ltl_atoms_init()` will stay in the
+unknown state until relevant tracepoints are hit, which can take some time. As
+monitoring for a task cannot be done until all atomic propositions are known for
+the task, the monitor may need some time to start validating tasks which have
+been running before the monitor is enabled. Therefore, it is recommended to
+start the tasks of interest after enabling the monitor.
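
The effect of unknown atomic propositions can be illustrated with a small user-space sketch. It is not the kernel implementation: ``struct atom_state``, ``atom_set``, ``can_validate`` and the other names are hypothetical, and the bitmask layout is just one possible way of tracking which atoms are still unknown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical atoms for the pagefault example. */
enum atom { ATOM_RT, ATOM_PAGEFAULT, NUM_ATOM };

/* Per-task state: one value bit per atom, plus a "still unknown" mask. */
struct atom_state {
	unsigned int values;
	unsigned int unknown;
};

void atoms_reset(struct atom_state *s)
{
	s->values = 0;
	s->unknown = (1u << NUM_ATOM) - 1;   /* every atom starts unknown */
}

void atom_set(struct atom_state *s, enum atom a, bool value)
{
	if (value)
		s->values |= 1u << a;
	else
		s->values &= ~(1u << a);
	s->unknown &= ~(1u << a);            /* the atom is now known */
}

/* Validation can only start once no atom is unknown. */
bool can_validate(const struct atom_state *s)
{
	return s->unknown == 0;
}

/* Mirrors the example above: PAGEFAULT is known only for new tasks. */
void atoms_init(struct atom_state *s, bool is_rt, bool task_creation)
{
	atoms_reset(s);
	atom_set(s, ATOM_RT, is_rt);
	if (task_creation)
		atom_set(s, ATOM_PAGEFAULT, false);
}

void atoms_demo(void)
{
	struct atom_state old_task, new_task;

	/* Task that was already running when the monitor was enabled. */
	atoms_init(&old_task, true, false);
	assert(!can_validate(&old_task));            /* PAGEFAULT unknown */
	atom_set(&old_task, ATOM_PAGEFAULT, false);  /* tracepoint finally hit */
	assert(can_validate(&old_task));

	/* Task created after the monitor was enabled. */
	atoms_init(&new_task, true, true);
	assert(can_validate(&new_task));             /* usable right away */
}
```

In this sketch the task started after enabling the monitor is validated immediately, while the pre-existing one has to wait for a tracepoint, which matches the recommendation above.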
+
+Final remarks
+-------------
+
+With the monitor synthesis in place using the header files and
+rvgen, the developer's work should be limited to the instrumentation
+of the system, increasing the confidence in the overall approach.
+
+[1] For details about deterministic automata format and the translation
+from one representation to another, see::
+
+ Documentation/trace/rv/deterministic_automata.rst
+
+[2] rvgen appends the monitor's name suffix to the event enums to
+avoid conflicting variables when exporting the global vmlinux.h
+used by BPF programs.
diff --git a/Documentation/trace/rv/runtime-verification.rst b/Documentation/trace/rv/runtime-verification.rst
index dae78dfa7cdc..c700dde9259c 100644
--- a/Documentation/trace/rv/runtime-verification.rst
+++ b/Documentation/trace/rv/runtime-verification.rst
@@ -8,14 +8,14 @@ checking* and *theorem proving*) with a more practical approach for complex
systems.
Instead of relying on a fine-grained model of a system (e.g., a
-re-implementation a instruction level), RV works by analyzing the trace of the
+re-implementation at instruction level), RV works by analyzing the trace of the
system's actual execution, comparing it against a formal specification of
the system behavior.
The main advantage is that RV can give precise information on the runtime
behavior of the monitored system, without the pitfalls of developing models
that require a re-implementation of the entire system in a modeling language.
-Moreover, given an efficient monitoring method, it is possible execute an
+Moreover, given an efficient monitoring method, it is possible to execute an
*online* verification of a system, enabling the *reaction* for unexpected
events, avoiding, for example, the propagation of a failure on safety-critical
systems.
diff --git a/Documentation/trace/tracepoints.rst b/Documentation/trace/tracepoints.rst
index decabcc77b56..b35c40e3abbe 100644
--- a/Documentation/trace/tracepoints.rst
+++ b/Documentation/trace/tracepoints.rst
@@ -71,7 +71,7 @@ In subsys/file.c (where the tracing statement must be added)::
void somefct(void)
{
...
- trace_subsys_eventname(arg, task);
+ trace_subsys_eventname_tp(arg, task);
...
}
@@ -129,12 +129,12 @@ within an if statement with the following::
for (i = 0; i < count; i++)
tot += calculate_nuggets();
- trace_foo_bar(tot);
+ trace_foo_bar_tp(tot);
}
-All trace_<tracepoint>() calls have a matching trace_<tracepoint>_enabled()
+All trace_<tracepoint>_tp() calls have a matching trace_<tracepoint>_enabled()
function defined that returns true if the tracepoint is enabled and
-false otherwise. The trace_<tracepoint>() should always be within the
+false otherwise. The trace_<tracepoint>_tp() should always be within the
block of the if (trace_<tracepoint>_enabled()) to prevent races between
the tracepoint being enabled and the check being seen.
@@ -143,7 +143,10 @@ the static_key of the tracepoint to allow the if statement to be implemented
with jump labels and avoid conditional branches.
.. note:: The convenience macro TRACE_EVENT provides an alternative way to
- define tracepoints. Check http://lwn.net/Articles/379903,
+ define tracepoints. Note, DECLARE_TRACE(foo) creates a function
+ "trace_foo_tp()" whereas TRACE_EVENT(foo) creates a function
+ "trace_foo()", and also exposes the tracepoint as a trace event in
+ /sys/kernel/tracing/events directory. Check http://lwn.net/Articles/379903,
http://lwn.net/Articles/381064 and http://lwn.net/Articles/383362
for a series of articles with more details.
@@ -159,7 +162,9 @@ In a C file::
void do_trace_foo_bar_wrapper(args)
{
- trace_foo_bar(args);
+ trace_foo_bar_tp(args); // for tracepoints created via DECLARE_TRACE
+ // or
+ trace_foo_bar(args); // for tracepoints created via TRACE_EVENT
}
In the header file::