summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/arm/sunxi/clocks.txt56
-rw-r--r--Documentation/cgroups/cgroups.txt3
-rw-r--r--Documentation/cgroups/devices.txt70
-rw-r--r--Documentation/clk.txt15
-rw-r--r--Documentation/devicetree/bindings/clock/axi-clkgen.txt22
-rw-r--r--Documentation/devicetree/bindings/clock/fixed-factor-clock.txt24
-rw-r--r--Documentation/devicetree/bindings/clock/silabs,si5351.txt114
-rw-r--r--Documentation/devicetree/bindings/clock/sunxi.txt151
-rw-r--r--Documentation/devicetree/bindings/mfd/mc13xxx.txt36
-rw-r--r--Documentation/devicetree/bindings/regulator/max8952.txt52
-rw-r--r--Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt22
-rw-r--r--Documentation/devicetree/bindings/spi/fsl-spi.txt3
-rw-r--r--Documentation/devicetree/bindings/spi/nvidia,tegra114-spi.txt26
-rw-r--r--Documentation/devicetree/bindings/spi/spi-samsung.txt8
-rw-r--r--Documentation/devicetree/bindings/vendor-prefixes.txt2
-rw-r--r--Documentation/kernel-parameters.txt24
-rw-r--r--Documentation/this_cpu_ops.txt205
-rw-r--r--Documentation/trace/ftrace.txt2097
18 files changed, 2297 insertions, 633 deletions
diff --git a/Documentation/arm/sunxi/clocks.txt b/Documentation/arm/sunxi/clocks.txt
new file mode 100644
index 000000000000..e09a88aa3136
--- /dev/null
+++ b/Documentation/arm/sunxi/clocks.txt
@@ -0,0 +1,56 @@
+Frequently asked questions about the sunxi clock system
+=======================================================
+
+This document contains useful bits of information that people tend to ask
+about the sunxi clock system, as well as accompanying ASCII art when adequate.
+
+Q: Why is the main 24MHz oscillator gatable? Wouldn't that break the
+ system?
+
+A: The 24MHz oscillator allows gating to save power. Indeed, if gated
+ carelessly the system would stop functioning, but with the right
+ steps, one can gate it and keep the system running. Consider this
+ simplified suspend example:
+
+ While the system is operational, you would see something like
+
+ 24MHz 32kHz
+ |
+ PLL1
+ \
+ \_ CPU Mux
+ |
+ [CPU]
+
+ When you are about to suspend, you switch the CPU Mux to the 32kHz
+ oscillator:
+
+ 24Mhz 32kHz
+ | |
+ PLL1 |
+ /
+ CPU Mux _/
+ |
+ [CPU]
+
+ Finally you can gate the main oscillator
+
+ 32kHz
+ |
+ |
+ /
+ CPU Mux _/
+ |
+ [CPU]
+
+Q: Were can I learn more about the sunxi clocks?
+
+A: The linux-sunxi wiki contains a page documenting the clock registers,
+ you can find it at
+
+ http://linux-sunxi.org/A10/CCM
+
+ The authoritative source for information at this time is the ccmu driver
+ released by Allwinner, you can find it at
+
+ https://github.com/linux-sunxi/linux-sunxi/tree/sunxi-3.0/arch/arm/mach-sun4i/clock/ccmu
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index bcf1a00b06a1..638bf17ff869 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -442,7 +442,7 @@ You can attach the current shell task by echoing 0:
You can use the cgroup.procs file instead of the tasks file to move all
threads in a threadgroup at once. Echoing the PID of any task in a
threadgroup to cgroup.procs causes all tasks in that threadgroup to be
-be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
in the writing task's threadgroup.
Note: Since every task is always a member of exactly one cgroup in each
@@ -580,6 +580,7 @@ propagation along the hierarchy. See the comment on
cgroup_for_each_descendant_pre() for details.
void css_offline(struct cgroup *cgrp);
+(cgroup_mutex held by caller)
This is the counterpart of css_online() and called iff css_online()
has succeeded on @cgrp. This signifies the beginning of the end of
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 16624a7f8222..3c1095ca02ea 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -13,9 +13,7 @@ either an integer or * for all. Access is a composition of r
The root device cgroup starts with rwm to 'all'. A child device
cgroup gets a copy of the parent. Administrators can then remove
devices from the whitelist or add new entries. A child cgroup can
-never receive a device access which is denied by its parent. However
-when a device access is removed from a parent it will not also be
-removed from the child(ren).
+never receive a device access which is denied by its parent.
2. User Interface
@@ -50,3 +48,69 @@ task to a new cgroup. (Again we'll probably want to change that).
A cgroup may not be granted more permissions than the cgroup's
parent has.
+
+4. Hierarchy
+
+device cgroups maintain hierarchy by making sure a cgroup never has more
+access permissions than its parent. Every time an entry is written to
+a cgroup's devices.deny file, all its children will have that entry removed
+from their whitelist and all the locally set whitelist entries will be
+re-evaluated. In case one of the locally set whitelist entries would provide
+more access than the cgroup's parent, it'll be removed from the whitelist.
+
+Example:
+ A
+ / \
+ B
+
+ group behavior exceptions
+ A allow "b 8:* rwm", "c 116:1 rw"
+ B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"
+
+If a device is denied in group A:
+ # echo "c 116:* r" > A/devices.deny
+it'll propagate down and after revalidating B's entries, the whitelist entry
+"c 116:2 rwm" will be removed:
+
+ group whitelist entries denied devices
+ A all "b 8:* rwm", "c 116:* rw"
+ B "c 1:3 rwm", "b 3:* rwm" all the rest
+
+In case parent's exceptions change and local exceptions are not allowed
+anymore, they'll be deleted.
+
+Notice that new whitelist entries will not be propagated:
+ A
+ / \
+ B
+
+ group whitelist entries denied devices
+ A "c 1:3 rwm", "c 1:5 r" all the rest
+ B "c 1:3 rwm", "c 1:5 r" all the rest
+
+when adding "c *:3 rwm":
+ # echo "c *:3 rwm" >A/devices.allow
+
+the result:
+ group whitelist entries denied devices
+ A "c *:3 rwm", "c 1:5 r" all the rest
+ B "c 1:3 rwm", "c 1:5 r" all the rest
+
+but now it'll be possible to add new entries to B:
+ # echo "c 2:3 rwm" >B/devices.allow
+ # echo "c 50:3 r" >B/devices.allow
+or even
+ # echo "c *:3 rwm" >B/devices.allow
+
+Allowing or denying all by writing 'a' to devices.allow or devices.deny will
+not be possible once the device cgroups has children.
+
+4.1 Hierarchy (internal implementation)
+
+device cgroups is implemented internally using a behavior (ALLOW, DENY) and a
+list of exceptions. The internal state is controlled using the same user
+interface to preserve compatibility with the previous whitelist-only
+implementation. Removal or addition of exceptions that will reduce the access
+to devices will be propagated down the hierarchy.
+For every propagated exception, the effective rules will be re-evaluated based
+on current parent's access rules.
diff --git a/Documentation/clk.txt b/Documentation/clk.txt
index 1943fae014fd..b9911c27f496 100644
--- a/Documentation/clk.txt
+++ b/Documentation/clk.txt
@@ -174,9 +174,9 @@ int clk_foo_enable(struct clk_hw *hw)
};
Below is a matrix detailing which clk_ops are mandatory based upon the
-hardware capbilities of that clock. A cell marked as "y" means
+hardware capabilities of that clock. A cell marked as "y" means
mandatory, a cell marked as "n" implies that either including that
-callback is invalid or otherwise uneccesary. Empty cells are either
+callback is invalid or otherwise unnecessary. Empty cells are either
optional or must be evaluated on a case-by-case basis.
clock hardware characteristics
@@ -231,3 +231,14 @@ To better enforce this policy, always follow this simple rule: any
statically initialized clock data MUST be defined in a separate file
from the logic that implements its ops. Basically separate the logic
from the data and all is well.
+
+ Part 6 - Disabling clock gating of unused clocks
+
+Sometimes during development it can be useful to be able to bypass the
+default disabling of unused clocks. For example, if drivers aren't enabling
+clocks properly but rely on them being on from the bootloader, bypassing
+the disabling means that the driver will remain functional while the issues
+are sorted out.
+
+To bypass this disabling, include "clk_ignore_unused" in the bootargs to the
+kernel.
diff --git a/Documentation/devicetree/bindings/clock/axi-clkgen.txt b/Documentation/devicetree/bindings/clock/axi-clkgen.txt
new file mode 100644
index 000000000000..028b493e97ff
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/axi-clkgen.txt
@@ -0,0 +1,22 @@
+Binding for the axi-clkgen clock generator
+
+This binding uses the common clock binding[1].
+
+[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
+
+Required properties:
+- compatible : shall be "adi,axi-clkgen".
+- #clock-cells : from common clock binding; Should always be set to 0.
+- reg : Address and length of the axi-clkgen register set.
+- clocks : Phandle and clock specifier for the parent clock.
+
+Optional properties:
+- clock-output-names : From common clock binding.
+
+Example:
+ clock@0xff000000 {
+ compatible = "adi,axi-clkgen";
+ #clock-cells = <0>;
+ reg = <0xff000000 0x1000>;
+ clocks = <&osc 1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/fixed-factor-clock.txt b/Documentation/devicetree/bindings/clock/fixed-factor-clock.txt
new file mode 100644
index 000000000000..5757f9abfc26
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/fixed-factor-clock.txt
@@ -0,0 +1,24 @@
+Binding for simple fixed factor rate clock sources.
+
+This binding uses the common clock binding[1].
+
+[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
+
+Required properties:
+- compatible : shall be "fixed-factor-clock".
+- #clock-cells : from common clock binding; shall be set to 0.
+- clock-div: fixed divider.
+- clock-mult: fixed multiplier.
+- clocks: parent clock.
+
+Optional properties:
+- clock-output-names : From common clock binding.
+
+Example:
+ clock {
+ compatible = "fixed-factor-clock";
+ clocks = <&parentclk>;
+ #clock-cells = <0>;
+ div = <2>;
+ mult = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/silabs,si5351.txt b/Documentation/devicetree/bindings/clock/silabs,si5351.txt
new file mode 100644
index 000000000000..cc374651662c
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/silabs,si5351.txt
@@ -0,0 +1,114 @@
+Binding for Silicon Labs Si5351a/b/c programmable i2c clock generator.
+
+Reference
+[1] Si5351A/B/C Data Sheet
+ http://www.silabs.com/Support%20Documents/TechnicalDocs/Si5351.pdf
+
+The Si5351a/b/c are programmable i2c clock generators with upto 8 output
+clocks. Si5351a also has a reduced pin-count package (MSOP10) where only
+3 output clocks are accessible. The internal structure of the clock
+generators can be found in [1].
+
+==I2C device node==
+
+Required properties:
+- compatible: shall be one of "silabs,si5351{a,a-msop,b,c}".
+- reg: i2c device address, shall be 0x60 or 0x61.
+- #clock-cells: from common clock binding; shall be set to 1.
+- clocks: from common clock binding; list of parent clock
+ handles, shall be xtal reference clock or xtal and clkin for
+ si5351c only.
+- #address-cells: shall be set to 1.
+- #size-cells: shall be set to 0.
+
+Optional properties:
+- silabs,pll-source: pair of (number, source) for each pll. Allows
+ to overwrite clock source of pll A (number=0) or B (number=1).
+
+==Child nodes==
+
+Each of the clock outputs can be overwritten individually by
+using a child node to the I2C device node. If a child node for a clock
+output is not set, the eeprom configuration is not overwritten.
+
+Required child node properties:
+- reg: number of clock output.
+
+Optional child node properties:
+- silabs,clock-source: source clock of the output divider stage N, shall be
+ 0 = multisynth N
+ 1 = multisynth 0 for output clocks 0-3, else multisynth4
+ 2 = xtal
+ 3 = clkin (si5351c only)
+- silabs,drive-strength: output drive strength in mA, shall be one of {2,4,6,8}.
+- silabs,multisynth-source: source pll A(0) or B(1) of corresponding multisynth
+ divider.
+- silabs,pll-master: boolean, multisynth can change pll frequency.
+
+==Example==
+
+/* 25MHz reference crystal */
+ref25: ref25M {
+ compatible = "fixed-clock";
+ #clock-cells = <0>;
+ clock-frequency = <25000000>;
+};
+
+i2c-master-node {
+
+ /* Si5351a msop10 i2c clock generator */
+ si5351a: clock-generator@60 {
+ compatible = "silabs,si5351a-msop";
+ reg = <0x60>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+ #clock-cells = <1>;
+
+ /* connect xtal input to 25MHz reference */
+ clocks = <&ref25>;
+
+ /* connect xtal input as source of pll0 and pll1 */
+ silabs,pll-source = <0 0>, <1 0>;
+
+ /*
+ * overwrite clkout0 configuration with:
+ * - 8mA output drive strength
+ * - pll0 as clock source of multisynth0
+ * - multisynth0 as clock source of output divider
+ * - multisynth0 can change pll0
+ * - set initial clock frequency of 74.25MHz
+ */
+ clkout0 {
+ reg = <0>;
+ silabs,drive-strength = <8>;
+ silabs,multisynth-source = <0>;
+ silabs,clock-source = <0>;
+ silabs,pll-master;
+ clock-frequency = <74250000>;
+ };
+
+ /*
+ * overwrite clkout1 configuration with:
+ * - 4mA output drive strength
+ * - pll1 as clock source of multisynth1
+ * - multisynth1 as clock source of output divider
+ * - multisynth1 can change pll1
+ */
+ clkout1 {
+ reg = <1>;
+ silabs,drive-strength = <4>;
+ silabs,multisynth-source = <1>;
+ silabs,clock-source = <0>;
+ pll-master;
+ };
+
+ /*
+ * overwrite clkout2 configuration with:
+ * - xtal as clock source of output divider
+ */
+ clkout2 {
+ reg = <2>;
+ silabs,clock-source = <2>;
+ };
+ };
+};
diff --git a/Documentation/devicetree/bindings/clock/sunxi.txt b/Documentation/devicetree/bindings/clock/sunxi.txt
new file mode 100644
index 000000000000..729f52426fe1
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/sunxi.txt
@@ -0,0 +1,151 @@
+Device Tree Clock bindings for arch-sunxi
+
+This binding uses the common clock binding[1].
+
+[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
+
+Required properties:
+- compatible : shall be one of the following:
+ "allwinner,sun4i-osc-clk" - for a gatable oscillator
+ "allwinner,sun4i-pll1-clk" - for the main PLL clock
+ "allwinner,sun4i-cpu-clk" - for the CPU multiplexer clock
+ "allwinner,sun4i-axi-clk" - for the AXI clock
+ "allwinner,sun4i-axi-gates-clk" - for the AXI gates
+ "allwinner,sun4i-ahb-clk" - for the AHB clock
+ "allwinner,sun4i-ahb-gates-clk" - for the AHB gates
+ "allwinner,sun4i-apb0-clk" - for the APB0 clock
+ "allwinner,sun4i-apb0-gates-clk" - for the APB0 gates
+ "allwinner,sun4i-apb1-clk" - for the APB1 clock
+ "allwinner,sun4i-apb1-mux-clk" - for the APB1 clock muxing
+ "allwinner,sun4i-apb1-gates-clk" - for the APB1 gates
+
+Required properties for all clocks:
+- reg : shall be the control register address for the clock.
+- clocks : shall be the input parent clock(s) phandle for the clock
+- #clock-cells : from common clock binding; shall be set to 0 except for
+ "allwinner,sun4i-*-gates-clk" where it shall be set to 1
+
+Additionally, "allwinner,sun4i-*-gates-clk" clocks require:
+- clock-output-names : the corresponding gate names that the clock controls
+
+For example:
+
+osc24M: osc24M@01c20050 {
+ #clock-cells = <0>;
+ compatible = "allwinner,sun4i-osc-clk";
+ reg = <0x01c20050 0x4>;
+ clocks = <&osc24M_fixed>;
+};
+
+pll1: pll1@01c20000 {
+ #clock-cells = <0>;
+ compatible = "allwinner,sun4i-pll1-clk";
+ reg = <0x01c20000 0x4>;
+ clocks = <&osc24M>;
+};
+
+cpu: cpu@01c20054 {
+ #clock-cells = <0>;
+ compatible = "allwinner,sun4i-cpu-clk";
+ reg = <0x01c20054 0x4>;
+ clocks = <&osc32k>, <&osc24M>, <&pll1>;
+};
+
+
+
+Gate clock outputs
+
+The "allwinner,sun4i-*-gates-clk" clocks provide several gatable outputs;
+their corresponding offsets as present on sun4i are listed below. Note that
+some of these gates are not present on sun5i.
+
+ * AXI gates ("allwinner,sun4i-axi-gates-clk")
+
+ DRAM 0
+
+ * AHB gates ("allwinner,sun4i-ahb-gates-clk")
+
+ USB0 0
+ EHCI0 1
+ OHCI0 2*
+ EHCI1 3
+ OHCI1 4*
+ SS 5
+ DMA 6
+ BIST 7
+ MMC0 8
+ MMC1 9
+ MMC2 10
+ MMC3 11
+ MS 12**
+ NAND 13
+ SDRAM 14
+
+ ACE 16
+ EMAC 17
+ TS 18
+
+ SPI0 20
+ SPI1 21
+ SPI2 22
+ SPI3 23
+ PATA 24
+ SATA 25**
+ GPS 26*
+
+ VE 32
+ TVD 33
+ TVE0 34
+ TVE1 35
+ LCD0 36
+ LCD1 37
+
+ CSI0 40
+ CSI1 41
+
+ HDMI 43
+ DE_BE0 44
+ DE_BE1 45
+ DE_FE0 46
+ DE_FE1 47
+
+ MP 50
+
+ MALI400 52
+
+ * APB0 gates ("allwinner,sun4i-apb0-gates-clk")
+
+ CODEC 0
+ SPDIF 1*
+ AC97 2
+ IIS 3
+
+ PIO 5
+ IR0 6
+ IR1 7
+
+ KEYPAD 10
+
+ * APB1 gates ("allwinner,sun4i-apb1-gates-clk")
+
+ I2C0 0
+ I2C1 1
+ I2C2 2
+
+ CAN 4
+ SCR 5
+ PS20 6
+ PS21 7
+
+ UART0 16
+ UART1 17
+ UART2 18
+ UART3 19
+ UART4 20
+ UART5 21
+ UART6 22
+ UART7 23
+
+Notation:
+ [*]: The datasheet didn't mention these, but they are present on AW code
+ [**]: The datasheet had this marked as "NC" but they are used on AW code
diff --git a/Documentation/devicetree/bindings/mfd/mc13xxx.txt b/Documentation/devicetree/bindings/mfd/mc13xxx.txt
index baf07987ae68..abd9e3cb2db7 100644
--- a/Documentation/devicetree/bindings/mfd/mc13xxx.txt
+++ b/Documentation/devicetree/bindings/mfd/mc13xxx.txt
@@ -10,10 +10,40 @@ Optional properties:
- fsl,mc13xxx-uses-touch : Indicate the touchscreen controller is being used
Sub-nodes:
-- regulators : Contain the regulator nodes. The MC13892 regulators are
- bound using their names as listed below with their registers and bits
- for enabling.
+- regulators : Contain the regulator nodes. The regulators are bound using
+ their names as listed below with their registers and bits for enabling.
+MC13783 regulators:
+ sw1a : regulator SW1A (register 24, bit 0)
+ sw1b : regulator SW1B (register 25, bit 0)
+ sw2a : regulator SW2A (register 26, bit 0)
+ sw2b : regulator SW2B (register 27, bit 0)
+ sw3 : regulator SW3 (register 29, bit 20)
+ vaudio : regulator VAUDIO (register 32, bit 0)
+ viohi : regulator VIOHI (register 32, bit 3)
+ violo : regulator VIOLO (register 32, bit 6)
+ vdig : regulator VDIG (register 32, bit 9)
+ vgen : regulator VGEN (register 32, bit 12)
+ vrfdig : regulator VRFDIG (register 32, bit 15)
+ vrfref : regulator VRFREF (register 32, bit 18)
+ vrfcp : regulator VRFCP (register 32, bit 21)
+ vsim : regulator VSIM (register 33, bit 0)
+ vesim : regulator VESIM (register 33, bit 3)
+ vcam : regulator VCAM (register 33, bit 6)
+ vrfbg : regulator VRFBG (register 33, bit 9)
+ vvib : regulator VVIB (register 33, bit 11)
+ vrf1 : regulator VRF1 (register 33, bit 12)
+ vrf2 : regulator VRF2 (register 33, bit 15)
+ vmmc1 : regulator VMMC1 (register 33, bit 18)
+ vmmc2 : regulator VMMC2 (register 33, bit 21)
+ gpo1 : regulator GPO1 (register 34, bit 6)
+ gpo2 : regulator GPO2 (register 34, bit 8)
+ gpo3 : regulator GPO3 (register 34, bit 10)
+ gpo4 : regulator GPO4 (register 34, bit 12)
+ pwgt1spi : regulator PWGT1SPI (register 34, bit 15)
+ pwgt2spi : regulator PWGT2SPI (register 34, bit 16)
+
+MC13892 regulators:
vcoincell : regulator VCOINCELL (register 13, bit 23)
sw1 : regulator SW1 (register 24, bit 0)
sw2 : regulator SW2 (register 25, bit 0)
diff --git a/Documentation/devicetree/bindings/regulator/max8952.txt b/Documentation/devicetree/bindings/regulator/max8952.txt
new file mode 100644
index 000000000000..866fcdd0f4eb
--- /dev/null
+++ b/Documentation/devicetree/bindings/regulator/max8952.txt
@@ -0,0 +1,52 @@
+Maxim MAX8952 voltage regulator
+
+Required properties:
+- compatible: must be equal to "maxim,max8952"
+- reg: I2C slave address, usually 0x60
+- max8952,dvs-mode-microvolt: array of 4 integer values defining DVS voltages
+ in microvolts. All values must be from range <770000, 1400000>
+- any required generic properties defined in regulator.txt
+
+Optional properties:
+- max8952,vid-gpios: array of two GPIO pins used for DVS voltage selection
+- max8952,en-gpio: GPIO used to control enable status of regulator
+- max8952,default-mode: index of default DVS voltage, from <0, 3> range
+- max8952,sync-freq: sync frequency, must be one of following values:
+ - 0: 26 MHz
+ - 1: 13 MHz
+ - 2: 19.2 MHz
+ Defaults to 26 MHz if not specified.
+- max8952,ramp-speed: voltage ramp speed, must be one of following values:
+ - 0: 32mV/us
+ - 1: 16mV/us
+ - 2: 8mV/us
+ - 3: 4mV/us
+ - 4: 2mV/us
+ - 5: 1mV/us
+ - 6: 0.5mV/us
+ - 7: 0.25mV/us
+ Defaults to 32mV/us if not specified.
+- any available generic properties defined in regulator.txt
+
+Example:
+
+ vdd_arm_reg: pmic@60 {
+ compatible = "maxim,max8952";
+ reg = <0x60>;
+
+ /* max8952-specific properties */
+ max8952,vid-gpios = <&gpx0 3 0>, <&gpx0 4 0>;
+ max8952,en-gpio = <&gpx0 1 0>;
+ max8952,default-mode = <0>;
+ max8952,dvs-mode-microvolt = <1250000>, <1200000>,
+ <1050000>, <950000>;
+ max8952,sync-freq = <0>;
+ max8952,ramp-speed = <0>;
+
+ /* generic regulator properties */
+ regulator-name = "vdd_arm";
+ regulator-min-microvolt = <770000>;
+ regulator-max-microvolt = <1400000>;
+ regulator-always-on;
+ regulator-boot-on;
+ };
diff --git a/Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt b/Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt
new file mode 100644
index 000000000000..8bf89c643640
--- /dev/null
+++ b/Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt
@@ -0,0 +1,22 @@
+Broadcom BCM2835 SPI0 controller
+
+The BCM2835 contains two forms of SPI master controller, one known simply as
+SPI0, and the other known as the "Universal SPI Master"; part of the
+auxilliary block. This binding applies to the SPI0 controller.
+
+Required properties:
+- compatible: Should be "brcm,bcm2835-spi".
+- reg: Should contain register location and length.
+- interrupts: Should contain interrupt.
+- clocks: The clock feeding the SPI controller.
+
+Example:
+
+spi@20204000 {
+ compatible = "brcm,bcm2835-spi";
+ reg = <0x7e204000 0x1000>;
+ interrupts = <2 22>;
+ clocks = <&clk_spi>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+};
diff --git a/Documentation/devicetree/bindings/spi/fsl-spi.txt b/Documentation/devicetree/bindings/spi/fsl-spi.txt
index 777abd7399d5..b032dd76e9d2 100644
--- a/Documentation/devicetree/bindings/spi/fsl-spi.txt
+++ b/Documentation/devicetree/bindings/spi/fsl-spi.txt
@@ -4,7 +4,7 @@ Required properties:
- cell-index : QE SPI subblock index.
0: QE subblock SPI1
1: QE subblock SPI2
-- compatible : should be "fsl,spi".
+- compatible : should be "fsl,spi" or "aeroflexgaisler,spictrl".
- mode : the SPI operation mode, it can be "cpu" or "cpu-qe".
- reg : Offset and length of the register set for the device
- interrupts : <a b> where a is the interrupt number and b is a
@@ -14,6 +14,7 @@ Required properties:
controller you have.
- interrupt-parent : the phandle for the interrupt controller that
services interrupts for this device.
+- clock-frequency : input clock frequency to non FSL_SOC cores
Optional properties:
- gpios : specifies the gpio pins to be used for chipselects.
diff --git a/Documentation/devicetree/bindings/spi/nvidia,tegra114-spi.txt b/Documentation/devicetree/bindings/spi/nvidia,tegra114-spi.txt
new file mode 100644
index 000000000000..91ff771c7e77
--- /dev/null
+++ b/Documentation/devicetree/bindings/spi/nvidia,tegra114-spi.txt
@@ -0,0 +1,26 @@
+NVIDIA Tegra114 SPI controller.
+
+Required properties:
+- compatible : should be "nvidia,tegra114-spi".
+- reg: Should contain SPI registers location and length.
+- interrupts: Should contain SPI interrupts.
+- nvidia,dma-request-selector : The Tegra DMA controller's phandle and
+ request selector for this SPI controller.
+- This is also require clock named "spi" as per binding document
+ Documentation/devicetree/bindings/clock/clock-bindings.txt
+
+Recommended properties:
+- spi-max-frequency: Definition as per
+ Documentation/devicetree/bindings/spi/spi-bus.txt
+Example:
+
+spi@7000d600 {
+ compatible = "nvidia,tegra114-spi";
+ reg = <0x7000d600 0x200>;
+ interrupts = <0 82 0x04>;
+ nvidia,dma-request-selector = <&apbdma 16>;
+ spi-max-frequency = <25000000>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+ status = "disabled";
+};
diff --git a/Documentation/devicetree/bindings/spi/spi-samsung.txt b/Documentation/devicetree/bindings/spi/spi-samsung.txt
index a15ffeddfba4..86aa061f069f 100644
--- a/Documentation/devicetree/bindings/spi/spi-samsung.txt
+++ b/Documentation/devicetree/bindings/spi/spi-samsung.txt
@@ -31,9 +31,6 @@ Required Board Specific Properties:
- #address-cells: should be 1.
- #size-cells: should be 0.
-- gpios: The gpio specifier for clock, mosi and miso interface lines (in the
- order specified). The format of the gpio specifier depends on the gpio
- controller.
Optional Board Specific Properties:
@@ -86,9 +83,8 @@ Example:
spi_0: spi@12d20000 {
#address-cells = <1>;
#size-cells = <0>;
- gpios = <&gpa2 4 2 3 0>,
- <&gpa2 6 2 3 0>,
- <&gpa2 7 2 3 0>;
+ pinctrl-names = "default";
+ pinctrl-0 = <&spi0_bus>;
w25q80bw@0 {
#address-cells = <1>;
diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt
index 19e1ef73ab0d..4d1919bf2332 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.txt
+++ b/Documentation/devicetree/bindings/vendor-prefixes.txt
@@ -5,6 +5,7 @@ using them to avoid name-space collisions.
ad Avionic Design GmbH
adi Analog Devices, Inc.
+aeroflexgaisler Aeroflex Gaisler AB
ak Asahi Kasei Corp.
amcc Applied Micro Circuits Corporation (APM, formally AMCC)
apm Applied Micro Circuits Corporation (APM)
@@ -48,6 +49,7 @@ samsung Samsung Semiconductor
sbs Smart Battery System
schindler Schindler
sil Silicon Image
+silabs Silicon Laboratories
simtek
sirf SiRF Technology, Inc.
snps Synopsys, Inc.
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 8ccbf27aead4..12bbce346d58 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -44,6 +44,7 @@ parameter is applicable:
AVR32 AVR32 architecture is enabled.
AX25 Appropriate AX.25 support is enabled.
BLACKFIN Blackfin architecture is enabled.
+ CLK Common clock infrastructure is enabled.
DRM Direct Rendering Management support is enabled.
DYNAMIC_DEBUG Build in debug messages and enable them at runtime
EDD BIOS Enhanced Disk Drive Services (EDD) is enabled
@@ -320,6 +321,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
on: enable for both 32- and 64-bit processes
off: disable for both 32- and 64-bit processes
+ alloc_snapshot [FTRACE]
+ Allocate the ftrace snapshot buffer on boot up when the
+ main buffer is allocated. This is handy if debugging
+ and you need to use tracing_snapshot() on boot up, and
+ do not want to use tracing_snapshot_alloc() as it needs
+ to be done where GFP_KERNEL allocations are allowed.
+
amd_iommu= [HW,X86-64]
Pass parameters to the AMD IOMMU driver in the system.
Possible values are:
@@ -465,6 +473,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
cio_ignore= [S390]
See Documentation/s390/CommonIO for details.
+ clk_ignore_unused
+ [CLK]
+ Keep all clocks already enabled by bootloader on,
+ even if no driver has claimed them. This is useful
+ for debug and development, but should not be
+ needed on a platform with proper driver support.
+ For more information, see Documentation/clk.txt.
clock= [BUGS=X86-32, HW] gettimeofday clocksource override.
[Deprecated]
@@ -3245,6 +3260,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
or other driver-specific files in the
Documentation/watchdog/ directory.
+ workqueue.disable_numa
+ By default, all work items queued to unbound
+ workqueues are affine to the NUMA nodes they're
+ issued on, which results in better behavior in
+ general. If NUMA affinity needs to be disabled for
+ whatever reason, this option can be used. Note
+ that this also can be controlled per-workqueue for
+ workqueues visible under /sys/bus/workqueue/.
+
x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
diff --git a/Documentation/this_cpu_ops.txt b/Documentation/this_cpu_ops.txt
new file mode 100644
index 000000000000..1a4ce7e3e05f
--- /dev/null
+++ b/Documentation/this_cpu_ops.txt
@@ -0,0 +1,205 @@
+this_cpu operations
+-------------------
+
+this_cpu operations are a way of optimizing access to per cpu
+variables associated with the *currently* executing processor through
+the use of segment registers (or a dedicated register where the cpu
+permanently stored the beginning of the per cpu area for a specific
+processor).
+
+The this_cpu operations add a per cpu variable offset to the processor
+specific percpu base and encode that operation in the instruction
+operating on the per cpu variable.
+
+This means there are no atomicity issues between the calculation of
+the offset and the operation on the data. Therefore it is not
+necessary to disable preempt or interrupts to ensure that the
+processor is not changed between the calculation of the address and
+the operation on the data.
+
+Read-modify-write operations are of particular interest. Frequently
+processors have special lower latency instructions that can operate
+without the typical synchronization overhead but still provide some
+sort of relaxed atomicity guarantee. The x86 for example can execute
+RMV (Read Modify Write) instructions like inc/dec/cmpxchg without the
+lock prefix and the associated latency penalty.
+
+Access to the variable without the lock prefix is not synchronized but
+synchronization is not necessary since we are dealing with per cpu
+data specific to the currently executing processor. Only the current
+processor should be accessing that variable and therefore there are no
+concurrency issues with other processors in the system.
+
+On x86 the fs: or the gs: segment registers contain the base of the
+per cpu area. It is then possible to simply use the segment override
+to relocate a per cpu relative address to the proper per cpu area for
+the processor. So the relocation to the per cpu base is encoded in the
+instruction via a segment register prefix.
+
+For example:
+
+ DEFINE_PER_CPU(int, x);
+ int z;
+
+ z = this_cpu_read(x);
+
+results in a single instruction
+
+ mov ax, gs:[x]
+
+instead of a sequence of calculation of the address and then a fetch
+from that address which occurs with the percpu operations. Before
+this_cpu_ops such sequence also required preempt disable/enable to
+prevent the kernel from moving the thread to a different processor
+while the calculation is performed.
+
+The main use of the this_cpu operations has been to optimize counter
+operations.
+
+ this_cpu_inc(x)
+
+results in the following single instruction (no lock prefix!)
+
+ inc gs:[x]
+
+instead of the following operations required if there is no segment
+register.
+
+ int *y;
+ int cpu;
+
+ cpu = get_cpu();
+ y = per_cpu_ptr(&x, cpu);
+ (*y)++;
+ put_cpu();
+
+Note that these operations can only be used on percpu data that is
+reserved for a specific processor. Without disabling preemption in the
+surrounding code this_cpu_inc() will only guarantee that one of the
+percpu counters is correctly incremented. However, there is no
+guarantee that the OS will not move the process directly before or
+after the this_cpu instruction is executed. In general this means that
+the value of the individual counters for each processor are
+meaningless. The sum of all the per cpu counters is the only value
+that is of interest.
+
+Per cpu variables are used for performance reasons. Bouncing cache
+lines can be avoided if multiple processors concurrently go through
+the same code paths. Since each processor has its own per cpu
+variables no concurrent cacheline updates take place. The price that
+has to be paid for this optimization is the need to add up the per cpu
+counters when the value of the counter is needed.
+
+
+Special operations:
+-------------------
+
+ y = this_cpu_ptr(&x)
+
+Takes the offset of a per cpu variable (&x !) and returns the address
+of the per cpu variable that belongs to the currently executing
+processor. this_cpu_ptr avoids multiple steps that the common
+get_cpu/put_cpu sequence requires. No processor number is
+available. Instead the offset of the local per cpu area is simply
+added to the percpu offset.
+
+
+
+Per cpu variables and offsets
+-----------------------------
+
+Per cpu variables have *offsets* to the beginning of the percpu
+area. They do not have addresses although they look like that in the
+code. Offsets cannot be directly dereferenced. The offset must be
+added to a base pointer of a percpu area of a processor in order to
+form a valid address.
+
+Therefore the use of x or &x outside of the context of per cpu
+operations is invalid and will generally be treated like a NULL
+pointer dereference.
+
+In the context of per cpu operations
+
+ x is a per cpu variable. Most this_cpu operations take a cpu
+ variable.
+
+ &x is the *offset* a per cpu variable. this_cpu_ptr() takes
+ the offset of a per cpu variable which makes this look a bit
+ strange.
+
+
+
+Operations on a field of a per cpu structure
+--------------------------------------------
+
+Let's say we have a percpu structure
+
+ struct s {
+ int n,m;
+ };
+
+ DEFINE_PER_CPU(struct s, p);
+
+
+Operations on these fields are straightforward
+
+ this_cpu_inc(p.m)
+
+ z = this_cpu_cmpxchg(p.m, 0, 1);
+
+
+If we have an offset to struct s:
+
+ struct s __percpu *ps = &p;
+
+ z = this_cpu_dec(ps->m);
+
+ z = this_cpu_inc_return(ps->n);
+
+
+The calculation of the pointer may require the use of this_cpu_ptr()
+if we do not make use of this_cpu ops later to manipulate fields:
+
+ struct s *pp;
+
+ pp = this_cpu_ptr(&p);
+
+ pp->m--;
+
+ z = pp->n++;
+
+
+Variants of this_cpu ops
+-------------------------
+
+this_cpu ops are interrupt safe. Some architecture do not support
+these per cpu local operations. In that case the operation must be
+replaced by code that disables interrupts, then does the operations
+that are guaranteed to be atomic and then reenable interrupts. Doing
+so is expensive. If there are other reasons why the scheduler cannot
+change the processor we are executing on then there is no reason to
+disable interrupts. For that purpose the __this_cpu operations are
+provided. For example.
+
+ __this_cpu_inc(x);
+
+Will increment x and will not fallback to code that disables
+interrupts on platforms that cannot accomplish atomicity through
+address relocation and a Read-Modify-Write operation in the same
+instruction.
+
+
+
+&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n)
+--------------------------------------------
+
+The first operation takes the offset and forms an address and then
+adds the offset of the n field.
+
+The second one first adds the two offsets and then does the
+relocation. IMHO the second form looks cleaner and has an easier time
+with (). The second form also is consistent with the way
+this_cpu_read() and friends are used.
+
+
+Christoph Lameter, April 3rd, 2013
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index a372304aef10..bfe8c29b1f1d 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -8,6 +8,7 @@ Copyright 2008 Red Hat Inc.
Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
John Kacur, and David Teigland.
Written for: 2.6.28-rc2
+Updated for: 3.10
Introduction
------------
@@ -17,13 +18,16 @@ designers of systems to find what is going on inside the kernel.
It can be used for debugging or analyzing latencies and
performance issues that take place outside of user-space.
-Although ftrace is the function tracer, it also includes an
-infrastructure that allows for other types of tracing. Some of
-the tracers that are currently in ftrace include a tracer to
-trace context switches, the time it takes for a high priority
-task to run after it was woken up, the time interrupts are
-disabled, and more (ftrace allows for tracer plugins, which
-means that the list of tracers can always grow).
+Although ftrace is typically considered the function tracer, it
+is really a frame work of several assorted tracing utilities.
+There's latency tracing to examine what occurs between interrupts
+disabled and enabled, as well as for preemption and from a time
+a task is woken to the task is actually scheduled in.
+
+One of the most common uses of ftrace is the event tracing.
+Through out the kernel is hundreds of static event points that
+can be enabled via the debugfs file system to see what is
+going on in certain parts of the kernel.
Implementation Details
@@ -61,7 +65,7 @@ the extended "/sys/kernel/debug/tracing" path name.
That's it! (assuming that you have ftrace configured into your kernel)
-After mounting the debugfs, you can see a directory called
+After mounting debugfs, you can see a directory called
"tracing". This directory contains the control and output files
of ftrace. Here is a list of some of the key files:
@@ -84,7 +88,9 @@ of ftrace. Here is a list of some of the key files:
This sets or displays whether writing to the trace
ring buffer is enabled. Echo 0 into this file to disable
- the tracer or 1 to enable it.
+ the tracer or 1 to enable it. Note, this only disables
+ writing to the ring buffer, the tracing overhead may
+ still be occurring.
trace:
@@ -109,7 +115,15 @@ of ftrace. Here is a list of some of the key files:
This file lets the user control the amount of data
that is displayed in one of the above output
- files.
+ files. Options also exist to modify how a tracer
+ or events work (stack traces, timestamps, etc).
+
+ options:
+
+ This is a directory that has a file for every available
+ trace option (also in trace_options). Options may also be set
+ or cleared by writing a "1" or "0" respectively into the
+ corresponding file with the option name.
tracing_max_latency:
@@ -121,10 +135,17 @@ of ftrace. Here is a list of some of the key files:
latency is greater than the value in this
file. (in microseconds)
+ tracing_thresh:
+
+ Some latency tracers will record a trace whenever the
+ latency is greater than the number in this file.
+ Only active when the file contains a number greater than 0.
+ (in microseconds)
+
buffer_size_kb:
This sets or displays the number of kilobytes each CPU
- buffer can hold. The tracer buffers are the same size
+ buffer holds. By default, the trace buffers are the same size
for each CPU. The displayed number is the size of the
CPU buffer and not total size of all buffers. The
trace buffers are allocated in pages (blocks of memory
@@ -133,16 +154,30 @@ of ftrace. Here is a list of some of the key files:
than requested, the rest of the page will be used,
making the actual allocation bigger than requested.
( Note, the size may not be a multiple of the page size
- due to buffer management overhead. )
+ due to buffer management meta-data. )
- This can only be updated when the current_tracer
- is set to "nop".
+ buffer_total_size_kb:
+
+ This displays the total combined size of all the trace buffers.
+
+ free_buffer:
+
+ If a process is performing the tracing, and the ring buffer
+ should be shrunk "freed" when the process is finished, even
+ if it were to be killed by a signal, this file can be used
+ for that purpose. On close of this file, the ring buffer will
+ be resized to its minimum size. Having a process that is tracing
+ also open this file, when the process exits its file descriptor
+ for this file will be closed, and in doing so, the ring buffer
+ will be "freed".
+
+ It may also stop tracing if disable_on_free option is set.
tracing_cpumask:
This is a mask that lets the user only trace
- on specified CPUS. The format is a hex string
- representing the CPUS.
+ on specified CPUs. The format is a hex string
+ representing the CPUs.
set_ftrace_filter:
@@ -183,6 +218,261 @@ of ftrace. Here is a list of some of the key files:
"set_ftrace_notrace". (See the section "dynamic ftrace"
below for more details.)
+ enabled_functions:
+
+ This file is more for debugging ftrace, but can also be useful
+ in seeing if any function has a callback attached to it.
+ Not only does the trace infrastructure use ftrace function
+ trace utility, but other subsystems might too. This file
+ displays all functions that have a callback attached to them
+ as well as the number of callbacks that have been attached.
+ Note, a callback may also call multiple functions which will
+ not be listed in this count.
+
+ If the callback registered to be traced by a function with
+ the "save regs" attribute (thus even more overhead), a 'R'
+ will be displayed on the same line as the function that
+ is returning registers.
+
+ function_profile_enabled:
+
+ When set it will enable all functions with either the function
+ tracer, or if enabled, the function graph tracer. It will
+ keep a histogram of the number of functions that were called
+ and if run with the function graph tracer, it will also keep
+ track of the time spent in those functions. The histogram
+ content can be displayed in the files:
+
+ trace_stats/function<cpu> ( function0, function1, etc).
+
+ trace_stats:
+
+ A directory that holds different tracing stats.
+
+ kprobe_events:
+
+ Enable dynamic trace points. See kprobetrace.txt.
+
+ kprobe_profile:
+
+ Dynamic trace points stats. See kprobetrace.txt.
+
+ max_graph_depth:
+
+ Used with the function graph tracer. This is the max depth
+ it will trace into a function. Setting this to a value of
+ one will show only the first kernel function that is called
+ from user space.
+
+ printk_formats:
+
+ This is for tools that read the raw format files. If an event in
+ the ring buffer references a string (currently only trace_printk()
+ does this), only a pointer to the string is recorded into the buffer
+ and not the string itself. This prevents tools from knowing what
+ that string was. This file displays the string and address for
+ the string allowing tools to map the pointers to what the
+ strings were.
+
+ saved_cmdlines:
+
+ Only the pid of the task is recorded in a trace event unless
+ the event specifically saves the task comm as well. Ftrace
+ makes a cache of pid mappings to comms to try to display
+ comms for events. If a pid for a comm is not listed, then
+ "<...>" is displayed in the output.
+
+ snapshot:
+
+ This displays the "snapshot" buffer and also lets the user
+ take a snapshot of the current running trace.
+ See the "Snapshot" section below for more details.
+
+ stack_max_size:
+
+ When the stack tracer is activated, this will display the
+ maximum stack size it has encountered.
+ See the "Stack Trace" section below.
+
+ stack_trace:
+
+ This displays the stack back trace of the largest stack
+ that was encountered when the stack tracer is activated.
+ See the "Stack Trace" section below.
+
+ stack_trace_filter:
+
+ This is similar to "set_ftrace_filter" but it limits what
+ functions the stack tracer will check.
+
+ trace_clock:
+
+ Whenever an event is recorded into the ring buffer, a
+ "timestamp" is added. This stamp comes from a specified
+ clock. By default, ftrace uses the "local" clock. This
+ clock is very fast and strictly per cpu, but on some
+ systems it may not be monotonic with respect to other
+ CPUs. In other words, the local clocks may not be in sync
+ with local clocks on other CPUs.
+
+ Usual clocks for tracing:
+
+ # cat trace_clock
+ [local] global counter x86-tsc
+
+ local: Default clock, but may not be in sync across CPUs
+
+ global: This clock is in sync with all CPUs but may
+ be a bit slower than the local clock.
+
+ counter: This is not a clock at all, but literally an atomic
+ counter. It counts up one by one, but is in sync
+ with all CPUs. This is useful when you need to
+ know exactly the order events occurred with respect to
+ each other on different CPUs.
+
+ uptime: This uses the jiffies counter and the time stamp
+ is relative to the time since boot up.
+
+ perf: This makes ftrace use the same clock that perf uses.
+ Eventually perf will be able to read ftrace buffers
+ and this will help out in interleaving the data.
+
+ x86-tsc: Architectures may define their own clocks. For
+ example, x86 uses its own TSC cycle clock here.
+
+ To set a clock, simply echo the clock name into this file.
+
+ echo global > trace_clock
+
+ trace_marker:
+
+ This is a very useful file for synchronizing user space
+ with events happening in the kernel. Writing strings into
+ this file will be written into the ftrace buffer.
+
+ It is useful in applications to open this file at the start
+ of the application and just reference the file descriptor
+ for the file.
+
+ void trace_write(const char *fmt, ...)
+ {
+ va_list ap;
+ char buf[256];
+ int n;
+
+ if (trace_fd < 0)
+ return;
+
+ va_start(ap, fmt);
+ n = vsnprintf(buf, 256, fmt, ap);
+ va_end(ap);
+
+ write(trace_fd, buf, n);
+ }
+
+ start:
+
+ trace_fd = open("trace_marker", WR_ONLY);
+
+ uprobe_events:
+
+ Add dynamic tracepoints in programs.
+ See uprobetracer.txt
+
+ uprobe_profile:
+
+ Uprobe statistics. See uprobetrace.txt
+
+ instances:
+
+ This is a way to make multiple trace buffers where different
+ events can be recorded in different buffers.
+ See "Instances" section below.
+
+ events:
+
+ This is the trace event directory. It holds event tracepoints
+ (also known as static tracepoints) that have been compiled
+ into the kernel. It shows what event tracepoints exist
+ and how they are grouped by system. There are "enable"
+ files at various levels that can enable the tracepoints
+ when a "1" is written to them.
+
+ See events.txt for more information.
+
+ per_cpu:
+
+ This is a directory that contains the trace per_cpu information.
+
+ per_cpu/cpu0/buffer_size_kb:
+
+ The ftrace buffer is defined per_cpu. That is, there's a separate
+ buffer for each CPU to allow writes to be done atomically,
+ and free from cache bouncing. These buffers may have different
+ size buffers. This file is similar to the buffer_size_kb
+ file, but it only displays or sets the buffer size for the
+ specific CPU. (here cpu0).
+
+ per_cpu/cpu0/trace:
+
+ This is similar to the "trace" file, but it will only display
+ the data specific for the CPU. If written to, it only clears
+ the specific CPU buffer.
+
+ per_cpu/cpu0/trace_pipe
+
+ This is similar to the "trace_pipe" file, and is a consuming
+ read, but it will only display (and consume) the data specific
+ for the CPU.
+
+ per_cpu/cpu0/trace_pipe_raw
+
+ For tools that can parse the ftrace ring buffer binary format,
+ the trace_pipe_raw file can be used to extract the data
+ from the ring buffer directly. With the use of the splice()
+ system call, the buffer data can be quickly transferred to
+ a file or to the network where a server is collecting the
+ data.
+
+ Like trace_pipe, this is a consuming reader, where multiple
+ reads will always produce different data.
+
+ per_cpu/cpu0/snapshot:
+
+ This is similar to the main "snapshot" file, but will only
+ snapshot the current CPU (if supported). It only displays
+ the content of the snapshot for a given CPU, and if
+ written to, only clears this CPU buffer.
+
+ per_cpu/cpu0/snapshot_raw:
+
+ Similar to the trace_pipe_raw, but will read the binary format
+ from the snapshot buffer for the given CPU.
+
+ per_cpu/cpu0/stats:
+
+ This displays certain stats about the ring buffer:
+
+ entries: The number of events that are still in the buffer.
+
+ overrun: The number of lost events due to overwriting when
+ the buffer was full.
+
+ commit overrun: Should always be zero.
+ This gets set if so many events happened within a nested
+ event (ring buffer is re-entrant), that it fills the
+ buffer and starts dropping events.
+
+ bytes: Bytes actually read (not overwritten).
+
+ oldest event ts: The oldest timestamp in the buffer
+
+ now ts: The current timestamp
+
+ dropped events: Events lost due to overwrite option being off.
+
+ read events: The number of events read.
The Tracers
-----------
@@ -234,11 +524,6 @@ Here is the list of current tracers that may be configured.
RT tasks (as the current "wakeup" does). This is useful
for those interested in wake up timings of RT tasks.
- "hw-branch-tracer"
-
- Uses the BTS CPU feature on x86 CPUs to traces all
- branches executed.
-
"nop"
This is the "trace nothing" tracer. To remove all
@@ -261,70 +546,100 @@ Here is an example of the output format of the file "trace"
--------
# tracer: function
#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
- bash-4251 [01] 10152.583854: path_put <-path_walk
- bash-4251 [01] 10152.583855: dput <-path_put
- bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput
+# entries-in-buffer/entries-written: 140080/250280 #P:4
+#
+# _-----=> irqs-off
+# / _----=> need-resched
+# | / _---=> hardirq/softirq
+# || / _--=> preempt-depth
+# ||| / delay
+# TASK-PID CPU# |||| TIMESTAMP FUNCTION
+# | | | |||| | |
+ bash-1977 [000] .... 17284.993652: sys_close <-system_call_fastpath
+ bash-1977 [000] .... 17284.993653: __close_fd <-sys_close
+ bash-1977 [000] .... 17284.993653: _raw_spin_lock <-__close_fd
+ sshd-1974 [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
+ bash-1977 [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
+ bash-1977 [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
+ bash-1977 [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
+ bash-1977 [000] .... 17284.993657: filp_close <-__close_fd
+ bash-1977 [000] .... 17284.993657: dnotify_flush <-filp_close
+ sshd-1974 [003] .... 17284.993658: sys_select <-system_call_fastpath
--------
A header is printed with the tracer name that is represented by
-the trace. In this case the tracer is "function". Then a header
-showing the format. Task name "bash", the task PID "4251", the
-CPU that it was running on "01", the timestamp in <secs>.<usecs>
-format, the function name that was traced "path_put" and the
-parent function that called this function "path_walk". The
-timestamp is the time at which the function was entered.
+the trace. In this case the tracer is "function". Then it shows the
+number of events in the buffer as well as the total number of entries
+that were written. The difference is the number of entries that were
+lost due to the buffer filling up (250280 - 140080 = 110200 events
+lost).
+
+The header explains the content of the events. Task name "bash", the task
+PID "1977", the CPU that it was running on "000", the latency format
+(explained below), the timestamp in <secs>.<usecs> format, the
+function name that was traced "sys_close" and the parent function that
+called this function "system_call_fastpath". The timestamp is the time
+at which the function was entered.
Latency trace format
--------------------
-When the latency-format option is enabled, the trace file gives
-somewhat more information to see why a latency happened.
-Here is a typical trace.
+When the latency-format option is enabled or when one of the latency
+tracers is set, the trace file gives somewhat more information to see
+why a latency happened. Here is a typical trace.
# tracer: irqsoff
#
-irqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: apic_timer_interrupt
- => ended at: do_softirq
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt)
- <idle>-0 0d.s. 97us : __do_softirq (do_softirq)
- <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq)
+# irqsoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+# -----------------
+# | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
+# -----------------
+# => started at: __lock_task_sighand
+# => ended at: _raw_spin_unlock_irqrestore
+#
+#
+# _------=> CPU#
+# / _-----=> irqs-off
+# | / _----=> need-resched
+# || / _---=> hardirq/softirq
+# ||| / _--=> preempt-depth
+# |||| / delay
+# cmd pid ||||| time | caller
+# \ / ||||| \ | /
+ ps-6143 2d... 0us!: trace_hardirqs_off <-__lock_task_sighand
+ ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
+ ps-6143 2d..1 263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
+ ps-6143 2d..1 306us : <stack trace>
+ => trace_hardirqs_on_caller
+ => trace_hardirqs_on
+ => _raw_spin_unlock_irqrestore
+ => do_task_stat
+ => proc_tgid_stat
+ => proc_single_show
+ => seq_read
+ => vfs_read
+ => sys_read
+ => system_call_fastpath
This shows that the current tracer is "irqsoff" tracing the time
-for which interrupts were disabled. It gives the trace version
-and the version of the kernel upon which this was executed on
-(2.6.26-rc8). Then it displays the max latency in microsecs (97
-us). The number of trace entries displayed and the total number
-recorded (both are three: #3/3). The type of preemption that was
-used (PREEMPT). VP, KP, SP, and HP are always zero and are
-reserved for later use. #P is the number of online CPUS (#P:2).
+for which interrupts were disabled. It gives the trace version (which
+never changes) and the version of the kernel upon which this was executed on
+(3.10). Then it displays the max latency in microseconds (259 us). The number
+of trace entries displayed and the total number (both are four: #4/4).
+VP, KP, SP, and HP are always zero and are reserved for later use.
+#P is the number of online CPUs (#P:4).
The task is the process that was running when the latency
-occurred. (swapper pid: 0).
+occurred. (ps pid: 6143).
The start and stop (the functions in which the interrupts were
disabled and enabled respectively) that caused the latencies:
- apic_timer_interrupt is where the interrupts were disabled.
- do_softirq is where they were enabled again.
+ __lock_task_sighand is where the interrupts were disabled.
+ _raw_spin_unlock_irqrestore is where they were enabled again.
The next lines after the header are the trace itself. The header
explains which is which.
@@ -367,16 +682,43 @@ The above is mostly meaningful for kernel developers.
The rest is the same as the 'trace' file.
+ Note, the latency tracers will usually end with a back trace
+ to easily find where the latency occurred.
trace_options
-------------
-The trace_options file is used to control what gets printed in
-the trace output. To see what is available, simply cat the file:
+The trace_options file (or the options directory) is used to control
+what gets printed in the trace output, or manipulate the tracers.
+To see what is available, simply cat the file:
cat trace_options
- print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
- noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj
+print-parent
+nosym-offset
+nosym-addr
+noverbose
+noraw
+nohex
+nobin
+noblock
+nostacktrace
+trace_printk
+noftrace_preempt
+nobranch
+annotate
+nouserstacktrace
+nosym-userobj
+noprintk-msg-only
+context-info
+latency-format
+sleep-time
+graph-time
+record-cmd
+overwrite
+nodisable_on_free
+irq-info
+markers
+function-trace
To disable one of the options, echo in the option prepended with
"no".
@@ -428,13 +770,34 @@ Here are the available options:
bin - This will print out the formats in raw binary.
- block - TBD (needs update)
+ block - When set, reading trace_pipe will not block when polled.
stacktrace - This is one of the options that changes the trace
itself. When a trace is recorded, so is the stack
of functions. This allows for back traces of
trace sites.
+ trace_printk - Can disable trace_printk() from writing into the buffer.
+
+ branch - Enable branch tracing with the tracer.
+
+ annotate - It is sometimes confusing when the CPU buffers are full
+ and one CPU buffer had a lot of events recently, thus
+ a shorter time frame, were another CPU may have only had
+ a few events, which lets it have older events. When
+ the trace is reported, it shows the oldest events first,
+ and it may look like only one CPU ran (the one with the
+ oldest events). When the annotate option is set, it will
+ display when a new CPU buffer started:
+
+ <idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
+ <idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
+ <idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
+##### CPU 2 buffer started ####
+ <idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
+ <idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
+ <idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
+
userstacktrace - This option changes the trace. It records a
stacktrace of the current userspace thread.
@@ -451,9 +814,13 @@ Here are the available options:
a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
- sched-tree - trace all tasks that are on the runqueue, at
- every scheduling event. Will add overhead if
- there's a lot of tasks running at once.
+
+ printk-msg-only - When set, trace_printk()s will only show the format
+ and not their parameters (if trace_bprintk() or
+ trace_bputs() was used to save the trace_printk()).
+
+ context-info - Show only the event data. Hides the comm, PID,
+ timestamp, CPU, and other useful data.
latency-format - This option changes the trace. When
it is enabled, the trace displays
@@ -461,31 +828,61 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
latencies, as described in "Latency
trace format".
+ sleep-time - When running function graph tracer, to include
+ the time a task schedules out in its function.
+ When enabled, it will account time the task has been
+ scheduled out as part of the function call.
+
+ graph-time - When running function graph tracer, to include the
+ time to call nested functions. When this is not set,
+ the time reported for the function will only include
+ the time the function itself executed for, not the time
+ for functions that it called.
+
+ record-cmd - When any event or tracer is enabled, a hook is enabled
+ in the sched_switch trace point to fill comm cache
+ with mapped pids and comms. But this may cause some
+ overhead, and if you only care about pids, and not the
+ name of the task, disabling this option can lower the
+ impact of tracing.
+
overwrite - This controls what happens when the trace buffer is
full. If "1" (default), the oldest events are
discarded and overwritten. If "0", then the newest
events are discarded.
+ (see per_cpu/cpu0/stats for overrun and dropped)
-ftrace_enabled
---------------
+ disable_on_free - When the free_buffer is closed, tracing will
+ stop (tracing_on set to 0).
-The following tracers (listed below) give different output
-depending on whether or not the sysctl ftrace_enabled is set. To
-set ftrace_enabled, one can either use the sysctl function or
-set it via the proc file system interface.
+ irq-info - Shows the interrupt, preempt count, need resched data.
+ When disabled, the trace looks like:
- sysctl kernel.ftrace_enabled=1
+# tracer: function
+#
+# entries-in-buffer/entries-written: 144405/9452052 #P:4
+#
+# TASK-PID CPU# TIMESTAMP FUNCTION
+# | | | | |
+ <idle>-0 [002] 23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
+ <idle>-0 [002] 23636.756054: activate_task <-ttwu_do_activate.constprop.89
+ <idle>-0 [002] 23636.756055: enqueue_task <-activate_task
- or
- echo 1 > /proc/sys/kernel/ftrace_enabled
+ markers - When set, the trace_marker is writable (only by root).
+ When disabled, the trace_marker will error with EINVAL
+ on write.
+
+
+ function-trace - The latency tracers will enable function tracing
+ if this option is enabled (default it is). When
+ it is disabled, the latency tracers do not trace
+ functions. This keeps the overhead of the tracer down
+ when performing latency tests.
-To disable ftrace_enabled simply replace the '1' with '0' in the
-above commands.
+ Note: Some tracers have their own options. They only appear
+ when the tracer is active.
-When ftrace_enabled is set the tracers will also record the
-functions that are within the trace. The descriptions of the
-tracers will also show an example with ftrace enabled.
irqsoff
@@ -506,95 +903,133 @@ new trace is saved.
To reset the maximum, echo 0 into tracing_max_latency. Here is
an example:
+ # echo 0 > options/function-trace
# echo irqsoff > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
# echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
# ls -ltr
[...]
# echo 0 > tracing_on
# cat trace
# tracer: irqsoff
#
-irqsoff latency trace v1.1.5 on 2.6.26
---------------------------------------------------------------------
- latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: sys_setpgid
- => ended at: sys_setpgid
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- bash-3730 1d... 0us : _write_lock_irq (sys_setpgid)
- bash-3730 1d..1 1us+: _write_unlock_irq (sys_setpgid)
- bash-3730 1d..2 14us : trace_hardirqs_on (sys_setpgid)
-
-
-Here we see that that we had a latency of 12 microsecs (which is
-very good). The _write_lock_irq in sys_setpgid disabled
-interrupts. The difference between the 12 and the displayed
-timestamp 14us occurred because the clock was incremented
+# irqsoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+# -----------------
+# | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
+# -----------------
+# => started at: run_timer_softirq
+# => ended at: run_timer_softirq
+#
+#
+# _------=> CPU#
+# / _-----=> irqs-off
+# | / _----=> need-resched
+# || / _---=> hardirq/softirq
+# ||| / _--=> preempt-depth
+# |||| / delay
+# cmd pid ||||| time | caller
+# \ / ||||| \ | /
+ <idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq
+ <idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq
+ <idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq
+ <idle>-0 0dNs3 25us : <stack trace>
+ => _raw_spin_unlock_irq
+ => run_timer_softirq
+ => __do_softirq
+ => call_softirq
+ => do_softirq
+ => irq_exit
+ => smp_apic_timer_interrupt
+ => apic_timer_interrupt
+ => rcu_idle_exit
+ => cpu_idle
+ => rest_init
+ => start_kernel
+ => x86_64_start_reservations
+ => x86_64_start_kernel
+
+Here we see that that we had a latency of 16 microseconds (which is
+very good). The _raw_spin_lock_irq in run_timer_softirq disabled
+interrupts. The difference between the 16 and the displayed
+timestamp 25us occurred because the clock was incremented
between the time of recording the max latency and the time of
recording the function that had that latency.
-Note the above example had ftrace_enabled not set. If we set the
-ftrace_enabled, we get a much larger output:
+Note the above example had function-trace not set. If we set
+function-trace, we get a much larger output:
+
+ with echo 1 > options/function-trace
# tracer: irqsoff
#
-irqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: __alloc_pages_internal
- => ended at: __alloc_pages_internal
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal)
- ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist)
- ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk)
- ls-4339 0d..1 4us : add_preempt_count (_spin_lock)
- ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk)
- ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue)
- ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest)
- ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk)
- ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue)
- ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest)
- ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk)
- ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue)
+# irqsoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+# -----------------
+# | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
+# -----------------
+# => started at: ata_scsi_queuecmd
+# => ended at: ata_scsi_queuecmd
+#
+#
+# _------=> CPU#
+# / _-----=> irqs-off
+# | / _----=> need-resched
+# || / _---=> hardirq/softirq
+# ||| / _--=> preempt-depth
+# |||| / delay
+# cmd pid ||||| time | caller
+# \ / ||||| \ | /
+ bash-2042 3d... 0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
+ bash-2042 3d... 0us : add_preempt_count <-_raw_spin_lock_irqsave
+ bash-2042 3d..1 1us : ata_scsi_find_dev <-ata_scsi_queuecmd
+ bash-2042 3d..1 1us : __ata_scsi_find_dev <-ata_scsi_find_dev
+ bash-2042 3d..1 2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
+ bash-2042 3d..1 2us : ata_qc_new_init <-__ata_scsi_queuecmd
+ bash-2042 3d..1 3us : ata_sg_init <-__ata_scsi_queuecmd
+ bash-2042 3d..1 4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
+ bash-2042 3d..1 4us : ata_build_rw_tf <-ata_scsi_rw_xlat
[...]
- ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue)
- ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest)
- ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk)
- ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue)
- ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest)
- ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk)
- ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock)
- ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal)
- ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal)
-
-
-
-Here we traced a 50 microsecond latency. But we also see all the
+ bash-2042 3d..1 67us : delay_tsc <-__delay
+ bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
+ bash-2042 3d..2 67us : sub_preempt_count <-delay_tsc
+ bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
+ bash-2042 3d..2 68us : sub_preempt_count <-delay_tsc
+ bash-2042 3d..1 68us+: ata_bmdma_start <-ata_bmdma_qc_issue
+ bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
+ bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
+ bash-2042 3d..1 72us+: trace_hardirqs_on <-ata_scsi_queuecmd
+ bash-2042 3d..1 120us : <stack trace>
+ => _raw_spin_unlock_irqrestore
+ => ata_scsi_queuecmd
+ => scsi_dispatch_cmd
+ => scsi_request_fn
+ => __blk_run_queue_uncond
+ => __blk_run_queue
+ => blk_queue_bio
+ => generic_make_request
+ => submit_bio
+ => submit_bh
+ => __ext3_get_inode_loc
+ => ext3_iget
+ => ext3_lookup
+ => lookup_real
+ => __lookup_hash
+ => walk_component
+ => lookup_last
+ => path_lookupat
+ => filename_lookup
+ => user_path_at_empty
+ => user_path_at
+ => vfs_fstatat
+ => vfs_stat
+ => sys_newstat
+ => system_call_fastpath
+
+
+Here we traced a 71 microsecond latency. But we also see all the
functions that were called during that time. Note that by
enabling function tracing, we incur an added overhead. This
overhead may extend the latency times. But nevertheless, this
@@ -614,120 +1049,122 @@ Like the irqsoff tracer, it records the maximum latency for
which preemption was disabled. The control of preemptoff tracer
is much like the irqsoff tracer.
+ # echo 0 > options/function-trace
# echo preemptoff > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
# echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
# ls -ltr
[...]
# echo 0 > tracing_on
# cat trace
# tracer: preemptoff
#
-preemptoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: do_IRQ
- => ended at: __do_softirq
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- sshd-4261 0d.h. 0us+: irq_enter (do_IRQ)
- sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq)
- sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq)
+# preemptoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+# -----------------
+# | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
+# -----------------
+# => started at: do_IRQ
+# => ended at: do_IRQ
+#
+#
+# _------=> CPU#
+# / _-----=> irqs-off
+# | / _----=> need-resched
+# || / _---=> hardirq/softirq
+# ||| / _--=> preempt-depth
+# |||| / delay
+# cmd pid ||||| time | caller
+# \ / ||||| \ | /
+ sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ
+ sshd-1991 1d..1 46us : irq_exit <-do_IRQ
+ sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ
+ sshd-1991 1d..1 52us : <stack trace>
+ => sub_preempt_count
+ => irq_exit
+ => do_IRQ
+ => ret_from_intr
This has some more changes. Preemption was disabled when an
-interrupt came in (notice the 'h'), and was enabled while doing
-a softirq. (notice the 's'). But we also see that interrupts
-have been disabled when entering the preempt off section and
-leaving it (the 'd'). We do not know if interrupts were enabled
-in the mean time.
+interrupt came in (notice the 'h'), and was enabled on exit.
+But we also see that interrupts have been disabled when entering
+the preempt off section and leaving it (the 'd'). We do not know if
+interrupts were enabled in the mean time or shortly after this
+was over.
# tracer: preemptoff
#
-preemptoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: remove_wait_queue
- => ended at: __do_softirq
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue)
- sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue)
- sshd-4261 0d..1 2us : do_IRQ (common_interrupt)
- sshd-4261 0d..1 2us : irq_enter (do_IRQ)
- sshd-4261 0d..1 2us : idle_cpu (irq_enter)
- sshd-4261 0d..1 3us : add_preempt_count (irq_enter)
- sshd-4261 0d.h1 3us : idle_cpu (irq_enter)
- sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ)
+# preemptoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+# -----------------
+# | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
+# -----------------
+# => started at: wake_up_new_task
+# => ended at: task_rq_unlock
+#
+#
+# _------=> CPU#
+# / _-----=> irqs-off
+# | / _----=> need-resched
+# || / _---=> hardirq/softirq
+# ||| / _--=> preempt-depth
+# |||| / delay
+# cmd pid ||||| time | caller
+# \ / ||||| \ | /
+ bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task
+ bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq
+ bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair
+ bash-1994 1d..1 1us : source_load <-select_task_rq_fair
+ bash-1994 1d..1 1us : source_load <-select_task_rq_fair
[...]
- sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock)
- sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq)
- sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq)
- sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq)
- sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock)
- sshd-4261 0d.h1 14us : irq_exit (do_IRQ)
- sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit)
- sshd-4261 0d..2 15us : do_softirq (irq_exit)
- sshd-4261 0d... 15us : __do_softirq (do_softirq)
- sshd-4261 0d... 16us : __local_bh_disable (__do_softirq)
- sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable)
- sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable)
+ bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt
+ bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter
+ bash-1994 1d..1 13us : add_preempt_count <-irq_enter
+ bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt
+ bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt
+ bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt
+ bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock
+ bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt
[...]
- sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable)
- sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable)
- sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip)
- sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip)
- sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable)
+ bash-1994 1d.h1 35us : lapic_next_event <-clockevents_program_event
+ bash-1994 1d.h1 35us : irq_exit <-smp_apic_timer_interrupt
+ bash-1994 1d.h1 36us : sub_preempt_count <-irq_exit
+ bash-1994 1d..2 36us : do_softirq <-irq_exit
+ bash-1994 1d..2 36us : __do_softirq <-call_softirq
+ bash-1994 1d..2 36us : __local_bh_disable <-__do_softirq
+ bash-1994 1d.s2 37us : add_preempt_count <-_raw_spin_lock_irq
+ bash-1994 1d.s3 38us : _raw_spin_unlock <-run_timer_softirq
+ bash-1994 1d.s3 39us : sub_preempt_count <-_raw_spin_unlock
+ bash-1994 1d.s2 39us : call_timer_fn <-run_timer_softirq
[...]
- sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq)
- sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq)
+ bash-1994 1dNs2 81us : cpu_needs_another_gp <-rcu_process_callbacks
+ bash-1994 1dNs2 82us : __local_bh_enable <-__do_softirq
+ bash-1994 1dNs2 82us : sub_preempt_count <-__local_bh_enable
+ bash-1994 1dN.2 82us : idle_cpu <-irq_exit
+ bash-1994 1dN.2 83us : rcu_irq_exit <-irq_exit
+ bash-1994 1dN.2 83us : sub_preempt_count <-irq_exit
+ bash-1994 1.N.1 84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
+ bash-1994 1.N.1 84us+: trace_preempt_on <-task_rq_unlock
+ bash-1994 1.N.1 104us : <stack trace>
+ => sub_preempt_count
+ => _raw_spin_unlock_irqrestore
+ => task_rq_unlock
+ => wake_up_new_task
+ => do_fork
+ => sys_clone
+ => stub_clone
The above is an example of the preemptoff trace with
-ftrace_enabled set. Here we see that interrupts were disabled
+function-trace set. Here we see that interrupts were not disabled
the entire time. The irq_enter code lets us know that we entered
an interrupt 'h'. Before that, the functions being traced still
show that it is not in an interrupt, but we can see from the
functions themselves that this is not the case.
-Notice that __do_softirq when called does not have a
-preempt_count. It may seem that we missed a preempt enabling.
-What really happened is that the preempt count is held on the
-thread's stack and we switched to the softirq stack (4K stacks
-in effect). The code does not copy the preempt count, but
-because interrupts are disabled, we do not need to worry about
-it. Having a tracer like this is good for letting people know
-what really happens inside the kernel.
-
-
preemptirqsoff
--------------
@@ -762,38 +1199,57 @@ tracer.
Again, using this trace is much like the irqsoff and preemptoff
tracers.
+ # echo 0 > options/function-trace
# echo preemptirqsoff > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
# echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
# ls -ltr
[...]
# echo 0 > tracing_on
# cat trace
# tracer: preemptirqsoff
#
-preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: apic_timer_interrupt
- => ended at: __do_softirq
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt)
- ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq)
- ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq)
-
+# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+# -----------------
+# | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
+# -----------------
+# => started at: ata_scsi_queuecmd
+# => ended at: ata_scsi_queuecmd
+#
+#
+# _------=> CPU#
+# / _-----=> irqs-off
+# | / _----=> need-resched
+# || / _---=> hardirq/softirq
+# ||| / _--=> preempt-depth
+# |||| / delay
+# cmd pid ||||| time | caller
+# \ / ||||| \ | /
+ ls-2230 3d... 0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
+ ls-2230 3...1 100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
+ ls-2230 3...1 101us+: trace_preempt_on <-ata_scsi_queuecmd
+ ls-2230 3...1 111us : <stack trace>
+ => sub_preempt_count
+ => _raw_spin_unlock_irqrestore
+ => ata_scsi_queuecmd
+ => scsi_dispatch_cmd
+ => scsi_request_fn
+ => __blk_run_queue_uncond
+ => __blk_run_queue
+ => blk_queue_bio
+ => generic_make_request
+ => submit_bio
+ => submit_bh
+ => ext3_bread
+ => ext3_dir_bread
+ => htree_dirblock_to_tree
+ => ext3_htree_fill_tree
+ => ext3_readdir
+ => vfs_readdir
+ => sys_getdents
+ => system_call_fastpath
The trace_hardirqs_off_thunk is called from assembly on x86 when
@@ -802,105 +1258,158 @@ function tracing, we do not know if interrupts were enabled
within the preemption points. We do see that it started with
preemption enabled.
-Here is a trace with ftrace_enabled set:
-
+Here is a trace with function-trace set:
# tracer: preemptirqsoff
#
-preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: write_chan
- => ended at: __do_softirq
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- ls-4473 0.N.. 0us : preempt_schedule (write_chan)
- ls-4473 0dN.1 1us : _spin_lock (schedule)
- ls-4473 0dN.1 2us : add_preempt_count (_spin_lock)
- ls-4473 0d..2 2us : put_prev_task_fair (schedule)
-[...]
- ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts)
- ls-4473 0d..2 13us : __switch_to (schedule)
- sshd-4261 0d..2 14us : finish_task_switch (schedule)
- sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch)
- sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave)
- sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set)
- sshd-4261 0d..2 16us : do_IRQ (common_interrupt)
- sshd-4261 0d..2 17us : irq_enter (do_IRQ)
- sshd-4261 0d..2 17us : idle_cpu (irq_enter)
- sshd-4261 0d..2 18us : add_preempt_count (irq_enter)
- sshd-4261 0d.h2 18us : idle_cpu (irq_enter)
- sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ)
- sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq)
- sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock)
- sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq)
- sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock)
-[...]
- sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq)
- sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock)
- sshd-4261 0d.h2 29us : irq_exit (do_IRQ)
- sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit)
- sshd-4261 0d..3 30us : do_softirq (irq_exit)
- sshd-4261 0d... 30us : __do_softirq (do_softirq)
- sshd-4261 0d... 31us : __local_bh_disable (__do_softirq)
- sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable)
+# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+# -----------------
+# | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
+# -----------------
+# => started at: schedule
+# => ended at: mutex_unlock
+#
+#
+# _------=> CPU#
+# / _-----=> irqs-off
+# | / _----=> need-resched
+# || / _---=> hardirq/softirq
+# ||| / _--=> preempt-depth
+# |||| / delay
+# cmd pid ||||| time | caller
+# \ / ||||| \ | /
+kworker/-59 3...1 0us : __schedule <-schedule
+kworker/-59 3d..1 0us : rcu_preempt_qs <-rcu_note_context_switch
+kworker/-59 3d..1 1us : add_preempt_count <-_raw_spin_lock_irq
+kworker/-59 3d..2 1us : deactivate_task <-__schedule
+kworker/-59 3d..2 1us : dequeue_task <-deactivate_task
+kworker/-59 3d..2 2us : update_rq_clock <-dequeue_task
+kworker/-59 3d..2 2us : dequeue_task_fair <-dequeue_task
+kworker/-59 3d..2 2us : update_curr <-dequeue_task_fair
+kworker/-59 3d..2 2us : update_min_vruntime <-update_curr
+kworker/-59 3d..2 3us : cpuacct_charge <-update_curr
+kworker/-59 3d..2 3us : __rcu_read_lock <-cpuacct_charge
+kworker/-59 3d..2 3us : __rcu_read_unlock <-cpuacct_charge
+kworker/-59 3d..2 3us : update_cfs_rq_blocked_load <-dequeue_task_fair
+kworker/-59 3d..2 4us : clear_buddies <-dequeue_task_fair
+kworker/-59 3d..2 4us : account_entity_dequeue <-dequeue_task_fair
+kworker/-59 3d..2 4us : update_min_vruntime <-dequeue_task_fair
+kworker/-59 3d..2 4us : update_cfs_shares <-dequeue_task_fair
+kworker/-59 3d..2 5us : hrtick_update <-dequeue_task_fair
+kworker/-59 3d..2 5us : wq_worker_sleeping <-__schedule
+kworker/-59 3d..2 5us : kthread_data <-wq_worker_sleeping
+kworker/-59 3d..2 5us : put_prev_task_fair <-__schedule
+kworker/-59 3d..2 6us : pick_next_task_fair <-pick_next_task
+kworker/-59 3d..2 6us : clear_buddies <-pick_next_task_fair
+kworker/-59 3d..2 6us : set_next_entity <-pick_next_task_fair
+kworker/-59 3d..2 6us : update_stats_wait_end <-set_next_entity
+ ls-2269 3d..2 7us : finish_task_switch <-__schedule
+ ls-2269 3d..2 7us : _raw_spin_unlock_irq <-finish_task_switch
+ ls-2269 3d..2 8us : do_IRQ <-ret_from_intr
+ ls-2269 3d..2 8us : irq_enter <-do_IRQ
+ ls-2269 3d..2 8us : rcu_irq_enter <-irq_enter
+ ls-2269 3d..2 9us : add_preempt_count <-irq_enter
+ ls-2269 3d.h2 9us : exit_idle <-do_IRQ
[...]
- sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip)
- sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip)
- sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt)
- sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt)
- sshd-4261 0d.s3 45us : idle_cpu (irq_enter)
- sshd-4261 0d.s3 46us : add_preempt_count (irq_enter)
- sshd-4261 0d.H3 46us : idle_cpu (irq_enter)
- sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt)
- sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt)
+ ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock
+ ls-2269 3d.h2 20us : irq_exit <-do_IRQ
+ ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit
+ ls-2269 3d..3 21us : do_softirq <-irq_exit
+ ls-2269 3d..3 21us : __do_softirq <-call_softirq
+ ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq
+ ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip
+ ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip
+ ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr
+ ls-2269 3d.s5 31us : irq_enter <-do_IRQ
+ ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
[...]
- sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt)
- sshd-4261 0d.H3 82us : ktime_get (tick_program_event)
- sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get)
- sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts)
- sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts)
- sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event)
- sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event)
- sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt)
- sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit)
- sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit)
- sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable)
+ ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
+ ls-2269 3d.s5 32us : add_preempt_count <-irq_enter
+ ls-2269 3d.H5 32us : exit_idle <-do_IRQ
+ ls-2269 3d.H5 32us : handle_irq <-do_IRQ
+ ls-2269 3d.H5 32us : irq_to_desc <-handle_irq
+ ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq
[...]
- sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action)
- sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq)
- sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq)
- sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq)
- sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable)
- sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq)
- sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq)
-
-
-This is a very interesting trace. It started with the preemption
-of the ls task. We see that the task had the "need_resched" bit
-set via the 'N' in the trace. Interrupts were disabled before
-the spin_lock at the beginning of the trace. We see that a
-schedule took place to run sshd. When the interrupts were
-enabled, we took an interrupt. On return from the interrupt
-handler, the softirq ran. We took another interrupt while
-running the softirq as we see from the capital 'H'.
+ ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
+ ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
+ ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq
+ ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable
+ ls-2269 3d..3 159us : idle_cpu <-irq_exit
+ ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit
+ ls-2269 3d..3 160us : sub_preempt_count <-irq_exit
+ ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock
+ ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock
+ ls-2269 3d... 186us : <stack trace>
+ => __mutex_unlock_slowpath
+ => mutex_unlock
+ => process_output
+ => n_tty_write
+ => tty_write
+ => vfs_write
+ => sys_write
+ => system_call_fastpath
+
+This is an interesting trace. It started with kworker running and
+scheduling out and ls taking over. But as soon as ls released the
+rq lock and enabled interrupts (but not preemption) an interrupt
+triggered. When the interrupt finished, it started running softirqs.
+But while the softirq was running, another interrupt triggered.
+When an interrupt is running inside a softirq, the annotation is 'H'.
wakeup
------
+One common case that people are interested in tracing is the
+time it takes for a task that is woken to actually wake up.
+Now for non Real-Time tasks, this can be arbitrary. But tracing
+it none the less can be interesting.
+
+Without function tracing:
+
+ # echo 0 > options/function-trace
+ # echo wakeup > current_tracer
+ # echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
+ # chrt -f 5 sleep 1
+ # echo 0 > tracing_on
+ # cat trace
+# tracer: wakeup
+#
+# wakeup latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+# -----------------
+# | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
+# -----------------
+#
+# _------=> CPU#
+# / _-----=> irqs-off
+# | / _----=> need-resched
+# || / _---=> hardirq/softirq
+# ||| / _--=> preempt-depth
+# |||| / delay
+# cmd pid ||||| time | caller
+# \ / ||||| \ | /
+ <idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H
+ <idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
+ <idle>-0 3d..3 15us : __schedule <-schedule
+ <idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H
+
+The tracer only traces the highest priority task in the system
+to avoid tracing the normal circumstances. Here we see that
+the kworker with a nice priority of -20 (not very nice), took
+just 15 microseconds from the time it woke up, to the time it
+ran.
+
+Non Real-Time tasks are not that interesting. A more interesting
+trace is to concentrate only on Real-Time tasks.
+
+wakeup_rt
+---------
+
In a Real-Time environment it is very important to know the
wakeup time it takes for the highest priority task that is woken
up to the time that it executes. This is also known as "schedule
@@ -914,124 +1423,229 @@ Real-Time environments are interested in the worst case latency.
That is the longest latency it takes for something to happen,
and not the average. We can have a very fast scheduler that may
only have a large latency once in a while, but that would not
-work well with Real-Time tasks. The wakeup tracer was designed
+work well with Real-Time tasks. The wakeup_rt tracer was designed
to record the worst case wakeups of RT tasks. Non-RT tasks are
not recorded because the tracer only records one worst case and
tracing non-RT tasks that are unpredictable will overwrite the
-worst case latency of RT tasks.
+worst case latency of RT tasks (just run the normal wakeup
+tracer for a while to see that effect).
Since this tracer only deals with RT tasks, we will run this
slightly differently than we did with the previous tracers.
Instead of performing an 'ls', we will run 'sleep 1' under
'chrt' which changes the priority of the task.
- # echo wakeup > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
+ # echo 0 > options/function-trace
+ # echo wakeup_rt > current_tracer
# echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
# chrt -f 5 sleep 1
# echo 0 > tracing_on
# cat trace
# tracer: wakeup
#
-wakeup latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5)
- -----------------
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process)
- <idle>-0 1d..4 4us : schedule (cpu_idle)
-
-
-Running this on an idle system, we see that it only took 4
-microseconds to perform the task switch. Note, since the trace
-marker in the schedule is before the actual "switch", we stop
-the tracing when the recorded task is about to schedule in. This
-may change if we add a new marker at the end of the scheduler.
-
-Notice that the recorded task is 'sleep' with the PID of 4901
+# tracer: wakeup_rt
+#
+# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+# -----------------
+# | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
+# -----------------
+#
+# _------=> CPU#
+# / _-----=> irqs-off
+# | / _----=> need-resched
+# || / _---=> hardirq/softirq
+# ||| / _--=> preempt-depth
+# |||| / delay
+# cmd pid ||||| time | caller
+# \ / ||||| \ | /
+ <idle>-0 3d.h4 0us : 0:120:R + [003] 2389: 94:R sleep
+ <idle>-0 3d.h4 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
+ <idle>-0 3d..3 5us : __schedule <-schedule
+ <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
+
+
+Running this on an idle system, we see that it only took 5 microseconds
+to perform the task switch. Note, since the trace point in the schedule
+is before the actual "switch", we stop the tracing when the recorded task
+is about to schedule in. This may change if we add a new marker at the
+end of the scheduler.
+
+Notice that the recorded task is 'sleep' with the PID of 2389
and it has an rt_prio of 5. This priority is user-space priority
and not the internal kernel priority. The policy is 1 for
SCHED_FIFO and 2 for SCHED_RR.
-Doing the same with chrt -r 5 and ftrace_enabled set.
+Note, that the trace data shows the internal priority (99 - rtprio).
-# tracer: wakeup
+ <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
+
+The 0:120:R means idle was running with a nice priority of 0 (120 - 20)
+and in the running state 'R'. The sleep task was scheduled in with
+2389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94)
+and it too is in the running state.
+
+Doing the same with chrt -r 5 and function-trace set.
+
+ echo 1 > options/function-trace
+
+# tracer: wakeup_rt
#
-wakeup latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5)
- -----------------
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
-ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process)
-ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb)
-ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up)
-ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup)
-ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr)
-ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup)
-ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up)
-ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up)
-[...]
-ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt)
-ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit)
-ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit)
-ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq)
-[...]
-ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks)
-ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq)
-ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable)
-ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd)
-ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd)
-ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched)
-ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched)
-ksoftirq-7 1.N.2 33us : schedule (__cond_resched)
-ksoftirq-7 1.N.2 33us : add_preempt_count (schedule)
-ksoftirq-7 1.N.3 34us : hrtick_clear (schedule)
-ksoftirq-7 1dN.3 35us : _spin_lock (schedule)
-ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock)
-ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule)
-ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair)
-[...]
-ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline)
-ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock)
-ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline)
-ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock)
-ksoftirq-7 1d..4 50us : schedule (__cond_resched)
-
-The interrupt went off while running ksoftirqd. This task runs
-at SCHED_OTHER. Why did not we see the 'N' set early? This may
-be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K
-stacks configured, the interrupt and softirq run with their own
-stack. Some information is held on the top of the task's stack
-(need_resched and preempt_count are both stored there). The
-setting of the NEED_RESCHED bit is done directly to the task's
-stack, but the reading of the NEED_RESCHED is done by looking at
-the current stack, which in this case is the stack for the hard
-interrupt. This hides the fact that NEED_RESCHED has been set.
-We do not see the 'N' until we switch back to the task's
-assigned stack.
+# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+# -----------------
+# | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
+# -----------------
+#
+# _------=> CPU#
+# / _-----=> irqs-off
+# | / _----=> need-resched
+# || / _---=> hardirq/softirq
+# ||| / _--=> preempt-depth
+# |||| / delay
+# cmd pid ||||| time | caller
+# \ / ||||| \ | /
+ <idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep
+ <idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
+ <idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup
+ <idle>-0 3d.h3 3us : resched_task <-check_preempt_curr
+ <idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup
+ <idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up
+ <idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock
+ <idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up
+ <idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
+ <idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
+ <idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer
+ <idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock
+ <idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt
+ <idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock
+ <idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt
+ <idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event
+ <idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event
+ <idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event
+ <idle>-0 3dNh1 8us : irq_exit <-smp_apic_timer_interrupt
+ <idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit
+ <idle>-0 3dN.2 9us : idle_cpu <-irq_exit
+ <idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit
+ <idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
+ <idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit
+ <idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle
+ <idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
+ <idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle
+ <idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
+ <idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit
+ <idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
+ <idle>-0 3dN.1 13us : update_cpu_load_nohz <-tick_nohz_idle_exit
+ <idle>-0 3dN.1 13us : _raw_spin_lock <-update_cpu_load_nohz
+ <idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock
+ <idle>-0 3dN.2 13us : __update_cpu_load <-update_cpu_load_nohz
+ <idle>-0 3dN.2 14us : sched_avg_update <-__update_cpu_load
+ <idle>-0 3dN.2 14us : _raw_spin_unlock <-update_cpu_load_nohz
+ <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock
+ <idle>-0 3dN.1 15us : calc_load_exit_idle <-tick_nohz_idle_exit
+ <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
+ <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit
+ <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel
+ <idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
+ <idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
+ <idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave
+ <idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16
+ <idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer
+ <idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram
+ <idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event
+ <idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event
+ <idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event
+ <idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
+ <idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
+ <idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit
+ <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
+ <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
+ <idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
+ <idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
+ <idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
+ <idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
+ <idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave
+ <idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns
+ <idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns
+ <idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns
+ <idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event
+ <idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event
+ <idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event
+ <idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
+ <idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
+ <idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit
+ <idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks
+ <idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle
+ <idle>-0 3.N.. 25us : schedule <-cpu_idle
+ <idle>-0 3.N.. 25us : __schedule <-preempt_schedule
+ <idle>-0 3.N.. 26us : add_preempt_count <-__schedule
+ <idle>-0 3.N.1 26us : rcu_note_context_switch <-__schedule
+ <idle>-0 3.N.1 26us : rcu_sched_qs <-rcu_note_context_switch
+ <idle>-0 3dN.1 27us : rcu_preempt_qs <-rcu_note_context_switch
+ <idle>-0 3.N.1 27us : _raw_spin_lock_irq <-__schedule
+ <idle>-0 3dN.1 27us : add_preempt_count <-_raw_spin_lock_irq
+ <idle>-0 3dN.2 28us : put_prev_task_idle <-__schedule
+ <idle>-0 3dN.2 28us : pick_next_task_stop <-pick_next_task
+ <idle>-0 3dN.2 28us : pick_next_task_rt <-pick_next_task
+ <idle>-0 3dN.2 29us : dequeue_pushable_task <-pick_next_task_rt
+ <idle>-0 3d..3 29us : __schedule <-preempt_schedule
+ <idle>-0 3d..3 30us : 0:120:R ==> [003] 2448: 94:R sleep
+
+This isn't that big of a trace, even with function tracing enabled,
+so I included the entire trace.
+
+The interrupt went off while when the system was idle. Somewhere
+before task_woken_rt() was called, the NEED_RESCHED flag was set,
+this is indicated by the first occurrence of the 'N' flag.
+
+Latency tracing and events
+--------------------------
+As function tracing can induce a much larger latency, but without
+seeing what happens within the latency it is hard to know what
+caused it. There is a middle ground, and that is with enabling
+events.
+
+ # echo 0 > options/function-trace
+ # echo wakeup_rt > current_tracer
+ # echo 1 > events/enable
+ # echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
+ # chrt -f 5 sleep 1
+ # echo 0 > tracing_on
+ # cat trace
+# tracer: wakeup_rt
+#
+# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+# -----------------
+# | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
+# -----------------
+#
+# _------=> CPU#
+# / _-----=> irqs-off
+# | / _----=> need-resched
+# || / _---=> hardirq/softirq
+# ||| / _--=> preempt-depth
+# |||| / delay
+# cmd pid ||||| time | caller
+# \ / ||||| \ | /
+ <idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep
+ <idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
+ <idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
+ <idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
+ <idle>-0 2.N.2 2us : power_end: cpu_id=2
+ <idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2
+ <idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
+ <idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
+ <idle>-0 2.N.2 5us : rcu_utilization: Start context switch
+ <idle>-0 2.N.2 5us : rcu_utilization: End context switch
+ <idle>-0 2d..3 6us : __schedule <-schedule
+ <idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep
+
function
--------
@@ -1039,6 +1653,7 @@ function
This tracer is the function tracer. Enabling the function tracer
can be done from the debug file system. Make sure the
ftrace_enabled is set; otherwise this tracer is a nop.
+See the "ftrace_enabled" section below.
# sysctl kernel.ftrace_enabled=1
# echo function > current_tracer
@@ -1048,23 +1663,23 @@ ftrace_enabled is set; otherwise this tracer is a nop.
# cat trace
# tracer: function
#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
- bash-4003 [00] 123.638713: finish_task_switch <-schedule
- bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch
- bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq
- bash-4003 [00] 123.638715: hrtick_set <-schedule
- bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set
- bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave
- bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set
- bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore
- bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set
- bash-4003 [00] 123.638718: sub_preempt_count <-schedule
- bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule
- bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run
- bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion
- bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common
- bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq
+# entries-in-buffer/entries-written: 24799/24799 #P:4
+#
+# _-----=> irqs-off
+# / _----=> need-resched
+# | / _---=> hardirq/softirq
+# || / _--=> preempt-depth
+# ||| / delay
+# TASK-PID CPU# |||| TIMESTAMP FUNCTION
+# | | | |||| | |
+ bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write
+ bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock
+ bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify
+ bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify
+ bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify
+ bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock
+ bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock
+ bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify
[...]
@@ -1214,79 +1829,19 @@ int main (int argc, char **argv)
return 0;
}
+Or this simple script!
-hw-branch-tracer (x86 only)
----------------------------
-
-This tracer uses the x86 last branch tracing hardware feature to
-collect a branch trace on all cpus with relatively low overhead.
-
-The tracer uses a fixed-size circular buffer per cpu and only
-traces ring 0 branches. The trace file dumps that buffer in the
-following format:
-
-# tracer: hw-branch-tracer
-#
-# CPU# TO <- FROM
- 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6
- 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a
- 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf
- 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf
- 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a
- 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf
-
-
-The tracer may be used to dump the trace for the oops'ing cpu on
-a kernel oops into the system log. To enable this,
-ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one
-can either use the sysctl function or set it via the proc system
-interface.
-
- sysctl kernel.ftrace_dump_on_oops=n
-
-or
-
- echo n > /proc/sys/kernel/ftrace_dump_on_oops
-
-If n = 1, ftrace will dump buffers of all CPUs, if n = 2 ftrace will
-only dump the buffer of the CPU that triggered the oops.
-
-Here's an example of such a dump after a null pointer
-dereference in a kernel module:
-
-[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
-[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops]
-[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0
-[57848.106019] Oops: 0002 [#1] SMP
-[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus
-[57848.106019] Dumping ftrace buffer:
-[57848.106019] ---------------------------------
-[...]
-[57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24
-[57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165
-[57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165
-[57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165
-[57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165
-[57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops]
-[57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30
-[57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b
-[57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31
-[57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1
-[57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30
-[...]
-[57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2
-[57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881
-[57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881
-[57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96
-[...]
-[57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3
-[57848.106019] ---------------------------------
-[57848.106019] CPU 0
-[57848.106019] Modules linked in: oops
-[57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23
-[57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops]
-[57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246
-[...]
+------
+#!/bin/bash
+
+debugfs=`sed -ne 's/^debugfs \(.*\) debugfs.*/\1/p' /proc/mounts`
+echo nop > $debugfs/tracing/current_tracer
+echo 0 > $debugfs/tracing/tracing_on
+echo $$ > $debugfs/tracing/set_ftrace_pid
+echo function > $debugfs/tracing/current_tracer
+echo 1 > $debugfs/tracing/tracing_on
+exec "$@"
+------
function graph tracer
@@ -1473,16 +2028,18 @@ starts of pointing to a simple return. (Enabling FTRACE will
include the -pg switch in the compiling of the kernel.)
At compile time every C file object is run through the
-recordmcount.pl script (located in the scripts directory). This
-script will process the C object using objdump to find all the
-locations in the .text section that call mcount. (Note, only the
-.text section is processed, since processing other sections like
-.init.text may cause races due to those sections being freed).
+recordmcount program (located in the scripts directory). This
+program will parse the ELF headers in the C object to find all
+the locations in the .text section that call mcount. (Note, only
+white listed .text sections are processed, since processing other
+sections like .init.text may cause races due to those sections
+being freed unexpectedly).
A new section called "__mcount_loc" is created that holds
references to all the mcount call sites in the .text section.
-This section is compiled back into the original object. The
-final linker will add all these references into a single table.
+The recordmcount program re-links this section back into the
+original object. The final linking stage of the kernel will add all these
+references into a single table.
On boot up, before SMP is initialized, the dynamic ftrace code
scans this table and updates all the locations into nops. It
@@ -1493,13 +2050,25 @@ unloaded, it also removes its functions from the ftrace function
list. This is automatic in the module unload code, and the
module author does not need to worry about it.
-When tracing is enabled, kstop_machine is called to prevent
-races with the CPUS executing code being modified (which can
-cause the CPU to do undesirable things), and the nops are
+When tracing is enabled, the process of modifying the function
+tracepoints is dependent on architecture. The old method is to use
+kstop_machine to prevent races with the CPUs executing code being
+modified (which can cause the CPU to do undesirable things, especially
+if the modified code crosses cache (or page) boundaries), and the nops are
patched back to calls. But this time, they do not call mcount
(which is just a function stub). They now call into the ftrace
infrastructure.
+The new method of modifying the function tracepoints is to place
+a breakpoint at the location to be modified, sync all CPUs, modify
+the rest of the instruction not covered by the breakpoint. Sync
+all CPUs again, and then remove the breakpoint with the finished
+version to the ftrace call site.
+
+Some archs do not even need to monkey around with the synchronization,
+and can just slap the new code on top of the old without any
+problems with other CPUs executing it at the same time.
+
One special side-effect to the recording of the functions being
traced is that we can now selectively choose which functions we
wish to trace and which ones we want the mcount calls to remain
@@ -1530,20 +2099,28 @@ mutex_lock
If I am only interested in sys_nanosleep and hrtimer_interrupt:
- # echo sys_nanosleep hrtimer_interrupt \
- > set_ftrace_filter
+ # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
# echo function > current_tracer
# echo 1 > tracing_on
# usleep 1
# echo 0 > tracing_on
# cat trace
-# tracer: ftrace
+# tracer: function
+#
+# entries-in-buffer/entries-written: 5/5 #P:4
#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
- usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt
- usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call
- <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
+# _-----=> irqs-off
+# / _----=> need-resched
+# | / _---=> hardirq/softirq
+# || / _--=> preempt-depth
+# ||| / delay
+# TASK-PID CPU# |||| TIMESTAMP FUNCTION
+# | | | |||| | |
+ usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath
+ <idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
+ usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
+ <idle>-0 [003] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
+ <idle>-0 [002] d.h1 4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
To see which functions are being traced, you can cat the file:
@@ -1571,20 +2148,25 @@ Note: It is better to use quotes to enclose the wild cards,
Produces:
-# tracer: ftrace
+# tracer: function
#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
- bash-4003 [00] 1480.611794: hrtimer_init <-copy_process
- bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set
- bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear
- bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel
- <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt
- <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt
- <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt
- <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt
- <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt
-
+# entries-in-buffer/entries-written: 897/897 #P:4
+#
+# _-----=> irqs-off
+# / _----=> need-resched
+# | / _---=> hardirq/softirq
+# || / _--=> preempt-depth
+# ||| / delay
+# TASK-PID CPU# |||| TIMESTAMP FUNCTION
+# | | | |||| | |
+ <idle>-0 [003] dN.1 4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
+ <idle>-0 [003] dN.1 4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
+ <idle>-0 [003] dN.2 4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
+ <idle>-0 [003] dN.1 4228.547805: hrtimer_forward <-tick_nohz_idle_exit
+ <idle>-0 [003] dN.1 4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
+ <idle>-0 [003] d..1 4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
+ <idle>-0 [003] d..1 4228.547859: hrtimer_start <-__tick_nohz_idle_enter
+ <idle>-0 [003] d..2 4228.547860: hrtimer_force_reprogram <-__rem
Notice that we lost the sys_nanosleep.
@@ -1651,19 +2233,29 @@ traced.
Produces:
-# tracer: ftrace
+# tracer: function
+#
+# entries-in-buffer/entries-written: 39608/39608 #P:4
#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
- bash-4043 [01] 115.281644: finish_task_switch <-schedule
- bash-4043 [01] 115.281645: hrtick_set <-schedule
- bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set
- bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run
- bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion
- bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run
- bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop
- bash-4043 [01] 115.281648: wake_up_process <-kthread_stop
- bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process
+# _-----=> irqs-off
+# / _----=> need-resched
+# | / _---=> hardirq/softirq
+# || / _--=> preempt-depth
+# ||| / delay
+# TASK-PID CPU# |||| TIMESTAMP FUNCTION
+# | | | |||| | |
+ bash-1994 [000] .... 4342.324896: file_ra_state_init <-do_dentry_open
+ bash-1994 [000] .... 4342.324897: open_check_o_direct <-do_last
+ bash-1994 [000] .... 4342.324897: ima_file_check <-do_last
+ bash-1994 [000] .... 4342.324898: process_measurement <-ima_file_check
+ bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement
+ bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action
+ bash-1994 [000] .... 4342.324899: do_truncate <-do_last
+ bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate
+ bash-1994 [000] .... 4342.324899: notify_change <-do_truncate
+ bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change
+ bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time
+ bash-1994 [000] .... 4342.324900: timespec_trunc <-current_fs_time
We can see that there's no more lock or preempt tracing.
@@ -1729,6 +2321,28 @@ this special filter via:
echo > set_graph_function
+ftrace_enabled
+--------------
+
+Note, the proc sysctl ftrace_enable is a big on/off switch for the
+function tracer. By default it is enabled (when function tracing is
+enabled in the kernel). If it is disabled, all function tracing is
+disabled. This includes not only the function tracers for ftrace, but
+also for any other uses (perf, kprobes, stack tracing, profiling, etc).
+
+Please disable this with care.
+
+This can be disable (and enabled) with:
+
+ sysctl kernel.ftrace_enabled=0
+ sysctl kernel.ftrace_enabled=1
+
+ or
+
+ echo 0 > /proc/sys/kernel/ftrace_enabled
+ echo 1 > /proc/sys/kernel/ftrace_enabled
+
+
Filter commands
---------------
@@ -1763,12 +2377,58 @@ The following commands are supported:
echo '__schedule_bug:traceoff:5' > set_ftrace_filter
+ To always disable tracing when __schedule_bug is hit:
+
+ echo '__schedule_bug:traceoff' > set_ftrace_filter
+
These commands are cumulative whether or not they are appended
to set_ftrace_filter. To remove a command, prepend it by '!'
and drop the parameter:
+ echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
+
+ The above removes the traceoff command for __schedule_bug
+ that have a counter. To remove commands without counters:
+
echo '!__schedule_bug:traceoff' > set_ftrace_filter
+- snapshot
+ Will cause a snapshot to be triggered when the function is hit.
+
+ echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
+
+ To only snapshot once:
+
+ echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
+
+ To remove the above commands:
+
+ echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
+ echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
+
+- enable_event/disable_event
+ These commands can enable or disable a trace event. Note, because
+ function tracing callbacks are very sensitive, when these commands
+ are registered, the trace point is activated, but disabled in
+ a "soft" mode. That is, the tracepoint will be called, but
+ just will not be traced. The event tracepoint stays in this mode
+ as long as there's a command that triggers it.
+
+ echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
+ set_ftrace_filter
+
+ The format is:
+
+ <function>:enable_event:<system>:<event>[:count]
+ <function>:disable_event:<system>:<event>[:count]
+
+ To remove the events commands:
+
+
+ echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
+ set_ftrace_filter
+ echo '!schedule:disable_event:sched:sched_switch' > \
+ set_ftrace_filter
trace_pipe
----------
@@ -1787,28 +2447,31 @@ different. The trace is live.
# cat trace
# tracer: function
#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
+# entries-in-buffer/entries-written: 0/0 #P:4
+#
+# _-----=> irqs-off
+# / _----=> need-resched
+# | / _---=> hardirq/softirq
+# || / _--=> preempt-depth
+# ||| / delay
+# TASK-PID CPU# |||| TIMESTAMP FUNCTION
+# | | | |||| | |
#
# cat /tmp/trace.out
- bash-4043 [00] 41.267106: finish_task_switch <-schedule
- bash-4043 [00] 41.267106: hrtick_set <-schedule
- bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set
- bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run
- bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion
- bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run
- bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop
- bash-4043 [00] 41.267110: wake_up_process <-kthread_stop
- bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process
- bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up
+ bash-1994 [000] .... 5281.568961: mutex_unlock <-rb_simple_write
+ bash-1994 [000] .... 5281.568963: __mutex_unlock_slowpath <-mutex_unlock
+ bash-1994 [000] .... 5281.568963: __fsnotify_parent <-fsnotify_modify
+ bash-1994 [000] .... 5281.568964: fsnotify <-fsnotify_modify
+ bash-1994 [000] .... 5281.568964: __srcu_read_lock <-fsnotify
+ bash-1994 [000] .... 5281.568964: add_preempt_count <-__srcu_read_lock
+ bash-1994 [000] ...1 5281.568965: sub_preempt_count <-__srcu_read_lock
+ bash-1994 [000] .... 5281.568965: __srcu_read_unlock <-fsnotify
+ bash-1994 [000] .... 5281.568967: sys_dup2 <-system_call_fastpath
Note, reading the trace_pipe file will block until more input is
-added. By changing the tracer, trace_pipe will issue an EOF. We
-needed to set the function tracer _before_ we "cat" the
-trace_pipe file.
-
+added.
trace entries
-------------
@@ -1817,31 +2480,50 @@ Having too much or not enough data can be troublesome in
diagnosing an issue in the kernel. The file buffer_size_kb is
used to modify the size of the internal trace buffers. The
number listed is the number of entries that can be recorded per
-CPU. To know the full size, multiply the number of possible CPUS
+CPU. To know the full size, multiply the number of possible CPUs
with the number of entries.
# cat buffer_size_kb
1408 (units kilobytes)
-Note, to modify this, you must have tracing completely disabled.
-To do that, echo "nop" into the current_tracer. If the
-current_tracer is not set to "nop", an EINVAL error will be
-returned.
+Or simply read buffer_total_size_kb
+
+ # cat buffer_total_size_kb
+5632
+
+To modify the buffer, simple echo in a number (in 1024 byte segments).
- # echo nop > current_tracer
# echo 10000 > buffer_size_kb
# cat buffer_size_kb
10000 (units kilobytes)
-The number of pages which will be allocated is limited to a
-percentage of available memory. Allocating too much will produce
-an error.
+It will try to allocate as much as possible. If you allocate too
+much, it can cause Out-Of-Memory to trigger.
# echo 1000000000000 > buffer_size_kb
-bash: echo: write error: Cannot allocate memory
# cat buffer_size_kb
85
+The per_cpu buffers can be changed individually as well:
+
+ # echo 10000 > per_cpu/cpu0/buffer_size_kb
+ # echo 100 > per_cpu/cpu1/buffer_size_kb
+
+When the per_cpu buffers are not the same, the buffer_size_kb
+at the top level will just show an X
+
+ # cat buffer_size_kb
+X
+
+This is where the buffer_total_size_kb is useful:
+
+ # cat buffer_total_size_kb
+12916
+
+Writing to the top level buffer_size_kb will reset all the buffers
+to be the same again.
+
Snapshot
--------
CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
@@ -1925,7 +2607,188 @@ bash: echo: write error: Device or resource busy
# cat snapshot
cat: snapshot: Device or resource busy
+
+Instances
+---------
+In the debugfs tracing directory is a directory called "instances".
+This directory can have new directories created inside of it using
+mkdir, and removing directories with rmdir. The directory created
+with mkdir in this directory will already contain files and other
+directories after it is created.
+
+ # mkdir instances/foo
+ # ls instances/foo
+buffer_size_kb buffer_total_size_kb events free_buffer per_cpu
+set_event snapshot trace trace_clock trace_marker trace_options
+trace_pipe tracing_on
+
+As you can see, the new directory looks similar to the tracing directory
+itself. In fact, it is very similar, except that the buffer and
+events are agnostic from the main director, or from any other
+instances that are created.
+
+The files in the new directory work just like the files with the
+same name in the tracing directory except the buffer that is used
+is a separate and new buffer. The files affect that buffer but do not
+affect the main buffer with the exception of trace_options. Currently,
+the trace_options affect all instances and the top level buffer
+the same, but this may change in future releases. That is, options
+may become specific to the instance they reside in.
+
+Notice that none of the function tracer files are there, nor is
+current_tracer and available_tracers. This is because the buffers
+can currently only have events enabled for them.
+
+ # mkdir instances/foo
+ # mkdir instances/bar
+ # mkdir instances/zoot
+ # echo 100000 > buffer_size_kb
+ # echo 1000 > instances/foo/buffer_size_kb
+ # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
+ # echo function > current_trace
+ # echo 1 > instances/foo/events/sched/sched_wakeup/enable
+ # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
+ # echo 1 > instances/foo/events/sched/sched_switch/enable
+ # echo 1 > instances/bar/events/irq/enable
+ # echo 1 > instances/zoot/events/syscalls/enable
+ # cat trace_pipe
+CPU:2 [LOST 11745 EVENTS]
+ bash-2044 [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
+ bash-2044 [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
+ bash-2044 [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
+ bash-2044 [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
+ bash-2044 [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
+ bash-2044 [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
+ bash-2044 [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
+ bash-2044 [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
+ bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
+ bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
+ bash-2044 [002] .... 10594.481035: arch_dup_task_struct <-copy_process
+[...]
+
+ # cat instances/foo/trace_pipe
+ bash-1998 [000] d..4 136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
+ bash-1998 [000] dN.4 136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
+ <idle>-0 [003] d.h3 136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
+ <idle>-0 [003] d..3 136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
+ rcu_preempt-9 [003] d..3 136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
+ bash-1998 [000] d..4 136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
+ bash-1998 [000] dN.4 136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
+ bash-1998 [000] d..3 136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
+ kworker/0:1-59 [000] d..4 136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
+ kworker/0:1-59 [000] d..3 136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
+[...]
+
+ # cat instances/bar/trace_pipe
+ migration/1-14 [001] d.h3 138.732674: softirq_raise: vec=3 [action=NET_RX]
+ <idle>-0 [001] dNh3 138.732725: softirq_raise: vec=3 [action=NET_RX]
+ bash-1998 [000] d.h1 138.733101: softirq_raise: vec=1 [action=TIMER]
+ bash-1998 [000] d.h1 138.733102: softirq_raise: vec=9 [action=RCU]
+ bash-1998 [000] ..s2 138.733105: softirq_entry: vec=1 [action=TIMER]
+ bash-1998 [000] ..s2 138.733106: softirq_exit: vec=1 [action=TIMER]
+ bash-1998 [000] ..s2 138.733106: softirq_entry: vec=9 [action=RCU]
+ bash-1998 [000] ..s2 138.733109: softirq_exit: vec=9 [action=RCU]
+ sshd-1995 [001] d.h1 138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
+ sshd-1995 [001] d.h1 138.733280: irq_handler_exit: irq=21 ret=unhandled
+ sshd-1995 [001] d.h1 138.733281: irq_handler_entry: irq=21 name=eth0
+ sshd-1995 [001] d.h1 138.733283: irq_handler_exit: irq=21 ret=handled
+[...]
+
+ # cat instances/zoot/trace
+# tracer: nop
+#
+# entries-in-buffer/entries-written: 18996/18996 #P:4
+#
+# _-----=> irqs-off
+# / _----=> need-resched
+# | / _---=> hardirq/softirq
+# || / _--=> preempt-depth
+# ||| / delay
+# TASK-PID CPU# |||| TIMESTAMP FUNCTION
+# | | | |||| | |
+ bash-1998 [000] d... 140.733501: sys_write -> 0x2
+ bash-1998 [000] d... 140.733504: sys_dup2(oldfd: a, newfd: 1)
+ bash-1998 [000] d... 140.733506: sys_dup2 -> 0x1
+ bash-1998 [000] d... 140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
+ bash-1998 [000] d... 140.733509: sys_fcntl -> 0x1
+ bash-1998 [000] d... 140.733510: sys_close(fd: a)
+ bash-1998 [000] d... 140.733510: sys_close -> 0x0
+ bash-1998 [000] d... 140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
+ bash-1998 [000] d... 140.733515: sys_rt_sigprocmask -> 0x0
+ bash-1998 [000] d... 140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
+ bash-1998 [000] d... 140.733516: sys_rt_sigaction -> 0x0
+
+You can see that the trace of the top most trace buffer shows only
+the function tracing. The foo instance displays wakeups and task
+switches.
+
+To remove the instances, simply delete their directories:
+
+ # rmdir instances/foo
+ # rmdir instances/bar
+ # rmdir instances/zoot
+
+Note, if a process has a trace file open in one of the instance
+directories, the rmdir will fail with EBUSY.
+
+
+Stack trace
-----------
+Since the kernel has a fixed sized stack, it is important not to
+waste it in functions. A kernel developer must be conscience of
+what they allocate on the stack. If they add too much, the system
+can be in danger of a stack overflow, and corruption will occur,
+usually leading to a system panic.
+
+There are some tools that check this, usually with interrupts
+periodically checking usage. But if you can perform a check
+at every function call that will become very useful. As ftrace provides
+a function tracer, it makes it convenient to check the stack size
+at every function call. This is enabled via the stack tracer.
+
+CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
+To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
+
+ # echo 1 > /proc/sys/kernel/stack_tracer_enabled
+
+You can also enable it from the kernel command line to trace
+the stack size of the kernel during boot up, by adding "stacktrace"
+to the kernel command line parameter.
+
+After running it for a few minutes, the output looks like:
+
+ # cat stack_max_size
+2928
+
+ # cat stack_trace
+ Depth Size Location (18 entries)
+ ----- ---- --------
+ 0) 2928 224 update_sd_lb_stats+0xbc/0x4ac
+ 1) 2704 160 find_busiest_group+0x31/0x1f1
+ 2) 2544 256 load_balance+0xd9/0x662
+ 3) 2288 80 idle_balance+0xbb/0x130
+ 4) 2208 128 __schedule+0x26e/0x5b9
+ 5) 2080 16 schedule+0x64/0x66
+ 6) 2064 128 schedule_timeout+0x34/0xe0
+ 7) 1936 112 wait_for_common+0x97/0xf1
+ 8) 1824 16 wait_for_completion+0x1d/0x1f
+ 9) 1808 128 flush_work+0xfe/0x119
+ 10) 1680 16 tty_flush_to_ldisc+0x1e/0x20
+ 11) 1664 48 input_available_p+0x1d/0x5c
+ 12) 1616 48 n_tty_poll+0x6d/0x134
+ 13) 1568 64 tty_poll+0x64/0x7f
+ 14) 1504 880 do_select+0x31e/0x511
+ 15) 624 400 core_sys_select+0x177/0x216
+ 16) 224 96 sys_select+0x91/0xb9
+ 17) 128 128 system_call_fastpath+0x16/0x1b
+
+Note, if -mfentry is being used by gcc, functions get traced before
+they set up the stack frame. This means that leaf level functions
+are not tested by the stack tracer when -mfentry is used.
+
+Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
+
+---------
More details can be found in the source code, in the
kernel/trace/*.c files.