summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/testing/ima_policy2
-rw-r--r--Documentation/ABI/testing/sysfs-class-rtc16
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt98
-rw-r--r--Documentation/arm/Samsung-S3C24XX/S3C2412.txt2
-rw-r--r--Documentation/arm64/memory.txt9
-rw-r--r--Documentation/cgroup-v1/memory.txt2
-rw-r--r--Documentation/device-mapper/verity.txt11
-rw-r--r--Documentation/devicetree/bindings/arm/hisilicon/hisilicon-low-pin-count.txt33
-rw-r--r--Documentation/devicetree/bindings/dma/mtk-hsdma.txt33
-rw-r--r--Documentation/devicetree/bindings/dma/qcom_bam_dma.txt4
-rw-r--r--Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt2
-rw-r--r--Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt1
-rw-r--r--Documentation/devicetree/bindings/dma/snps,dw-axi-dmac.txt41
-rw-r--r--Documentation/devicetree/bindings/dma/stm32-dma.txt6
-rw-r--r--Documentation/devicetree/bindings/eeprom/at24.txt4
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-rcar.txt2
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-sh_mobile.txt1
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-synquacer.txt29
-rw-r--r--Documentation/devicetree/bindings/mailbox/hisilicon,hi3660-mailbox.txt51
-rw-r--r--Documentation/devicetree/bindings/mips/mscc.txt43
-rw-r--r--Documentation/devicetree/bindings/mtd/fsl-quadspi.txt24
-rw-r--r--Documentation/devicetree/bindings/mtd/marvell-nand.txt5
-rw-r--r--Documentation/devicetree/bindings/mtd/mtd-physmap.txt7
-rw-r--r--Documentation/devicetree/bindings/mtd/sunxi-nand.txt4
-rw-r--r--Documentation/devicetree/bindings/net/fsl-tsec-phy.txt6
-rw-r--r--Documentation/devicetree/bindings/pci/hisilicon-histb-pcie.txt1
-rw-r--r--Documentation/devicetree/bindings/pci/mediatek-pcie.txt11
-rw-r--r--Documentation/devicetree/bindings/pci/qcom,pcie.txt4
-rw-r--r--Documentation/devicetree/bindings/pci/rcar-pci.txt6
-rw-r--r--Documentation/devicetree/bindings/pmem/pmem-region.txt65
-rw-r--r--Documentation/devicetree/bindings/rtc/isil,isl12026.txt28
-rw-r--r--Documentation/devicetree/bindings/vendor-prefixes.txt1
-rw-r--r--Documentation/driver-api/gpio/drivers-on-gpio.rst4
-rw-r--r--Documentation/driver-api/mtdnand.rst8
-rw-r--r--Documentation/filesystems/caching/netfs-api.txt157
-rw-r--r--Documentation/filesystems/ceph.txt16
-rw-r--r--Documentation/filesystems/orangefs.txt137
-rw-r--r--Documentation/hwmon/adm127520
-rw-r--r--Documentation/hwmon/lm926
-rw-r--r--Documentation/hwmon/nct677556
-rw-r--r--Documentation/hwmon/sht216
-rw-r--r--Documentation/hwmon/sht3x2
-rw-r--r--Documentation/media/kapi/v4l2-dev.rst2
-rw-r--r--Documentation/media/uapi/mediactl/media-ioc-enum-entities.rst2
-rw-r--r--Documentation/media/uapi/mediactl/media-ioc-g-topology.rst8
-rw-r--r--Documentation/media/uapi/mediactl/media-types.rst4
-rw-r--r--Documentation/media/uapi/v4l/extended-controls.rst2
-rw-r--r--Documentation/media/uapi/v4l/pixfmt-v4l2-mplane.rst36
-rw-r--r--Documentation/media/uapi/v4l/pixfmt-v4l2.rst36
-rw-r--r--Documentation/process/4.Coding.rst8
-rw-r--r--Documentation/process/clang-format.rst184
-rw-r--r--Documentation/process/coding-style.rst8
-rw-r--r--Documentation/s390/vfio-ccw.txt79
-rw-r--r--Documentation/security/LSM-sctp.rst175
-rw-r--r--Documentation/security/SELinux-sctp.rst158
-rw-r--r--Documentation/sysctl/kernel.txt54
-rw-r--r--Documentation/sysctl/vm.txt5
-rw-r--r--Documentation/trace/events.rst1548
-rw-r--r--Documentation/trace/ftrace.rst24
-rw-r--r--Documentation/trace/histogram.txt1995
-rw-r--r--Documentation/trace/postprocess/trace-vmscan-postprocess.pl4
-rw-r--r--Documentation/virtual/kvm/00-INDEX10
-rw-r--r--Documentation/virtual/kvm/api.txt135
-rw-r--r--Documentation/virtual/kvm/cpuid.txt15
-rw-r--r--Documentation/vm/hmm.txt396
-rw-r--r--Documentation/vm/page_migration14
66 files changed, 3770 insertions, 2096 deletions
diff --git a/Documentation/ABI/testing/ima_policy b/Documentation/ABI/testing/ima_policy
index 2028f2d093b2..b8465e00ba5f 100644
--- a/Documentation/ABI/testing/ima_policy
+++ b/Documentation/ABI/testing/ima_policy
@@ -26,7 +26,7 @@ Description:
[obj_user=] [obj_role=] [obj_type=]]
option: [[appraise_type=]] [permit_directio]
- base: func:= [BPRM_CHECK][MMAP_CHECK][FILE_CHECK][MODULE_CHECK]
+ base: func:= [BPRM_CHECK][MMAP_CHECK][CREDS_CHECK][FILE_CHECK][MODULE_CHECK]
[FIRMWARE_CHECK]
[KEXEC_KERNEL_CHECK] [KEXEC_INITRAMFS_CHECK]
mask:= [[^]MAY_READ] [[^]MAY_WRITE] [[^]MAY_APPEND]
diff --git a/Documentation/ABI/testing/sysfs-class-rtc b/Documentation/ABI/testing/sysfs-class-rtc
index cf60412882f0..95984289a4ee 100644
--- a/Documentation/ABI/testing/sysfs-class-rtc
+++ b/Documentation/ABI/testing/sysfs-class-rtc
@@ -43,6 +43,14 @@ Contact: linux-rtc@vger.kernel.org
Description:
(RO) The name of the RTC corresponding to this sysfs directory
+What: /sys/class/rtc/rtcX/range
+Date: January 2018
+KernelVersion: 4.16
+Contact: linux-rtc@vger.kernel.org
+Description:
+ Valid time range for the RTC, as seconds from epoch, formatted
+ as [min, max]
+
What: /sys/class/rtc/rtcX/since_epoch
Date: March 2006
KernelVersion: 2.6.17
@@ -57,14 +65,6 @@ Contact: linux-rtc@vger.kernel.org
Description:
(RO) RTC-provided time in 24-hour notation (hh:mm:ss)
-What: /sys/class/rtc/rtcX/*/nvmem
-Date: February 2016
-KernelVersion: 4.6
-Contact: linux-rtc@vger.kernel.org
-Description:
- (RW) The non volatile storage exported as a raw file, as
- described in Documentation/nvmem/nvmem.txt
-
What: /sys/class/rtc/rtcX/offset
Date: February 2016
KernelVersion: 4.6
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 34dac7cef4cf..11fc28ecdb6d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -389,15 +389,15 @@
Use software keyboard repeat
audit= [KNL] Enable the audit sub-system
- Format: { "0" | "1" } (0 = disabled, 1 = enabled)
- 0 - kernel audit is disabled and can not be enabled
- until the next reboot
+ Format: { "0" | "1" | "off" | "on" }
+ 0 | off - kernel audit is disabled and can not be
+ enabled until the next reboot
unset - kernel audit is initialized but disabled and
will be fully enabled by the userspace auditd.
- 1 - kernel audit is initialized and partially enabled,
- storing at most audit_backlog_limit messages in
- RAM until it is fully enabled by the userspace
- auditd.
+ 1 | on - kernel audit is initialized and partially
+ enabled, storing at most audit_backlog_limit
+ messages in RAM until it is fully enabled by the
+ userspace auditd.
Default: unset
audit_backlog_limit= [KNL] Set the audit queue size limit.
@@ -1521,7 +1521,8 @@
ima_policy= [IMA]
The builtin policies to load during IMA setup.
- Format: "tcb | appraise_tcb | secure_boot"
+ Format: "tcb | appraise_tcb | secure_boot |
+ fail_securely"
The "tcb" policy measures all programs exec'd, files
mmap'd for exec, and all files opened with the read
@@ -1536,6 +1537,11 @@
of files (eg. kexec kernel image, kernel modules,
firmware, policy, etc) based on file signatures.
+ The "fail_securely" policy forces file signature
+ verification failure also on privileged mounted
+ filesystems with the SB_I_UNVERIFIABLE_SIGNATURE
+ flag.
+
ima_tcb [IMA] Deprecated. Use ima_policy= instead.
Load a policy which meets the needs of the Trusted
Computing Base. This means IMA will measure all
@@ -1840,30 +1846,29 @@
keepinitrd [HW,ARM]
kernelcore= [KNL,X86,IA-64,PPC]
- Format: nn[KMGTPE] | "mirror"
- This parameter
- specifies the amount of memory usable by the kernel
- for non-movable allocations. The requested amount is
- spread evenly throughout all nodes in the system. The
- remaining memory in each node is used for Movable
- pages. In the event, a node is too small to have both
- kernelcore and Movable pages, kernelcore pages will
- take priority and other nodes will have a larger number
- of Movable pages. The Movable zone is used for the
- allocation of pages that may be reclaimed or moved
- by the page migration subsystem. This means that
- HugeTLB pages may not be allocated from this zone.
- Note that allocations like PTEs-from-HighMem still
- use the HighMem zone if it exists, and the Normal
+ Format: nn[KMGTPE] | nn% | "mirror"
+ This parameter specifies the amount of memory usable by
+ the kernel for non-movable allocations. The requested
+ amount is spread evenly throughout all nodes in the
+ system as ZONE_NORMAL. The remaining memory is used for
+ movable memory in its own zone, ZONE_MOVABLE. In the
+ event, a node is too small to have both ZONE_NORMAL and
+ ZONE_MOVABLE, kernelcore memory will take priority and
+ other nodes will have a larger ZONE_MOVABLE.
+
+ ZONE_MOVABLE is used for the allocation of pages that
+ may be reclaimed or moved by the page migration
+ subsystem. Note that allocations like PTEs-from-HighMem
+ still use the HighMem zone if it exists, and the Normal
zone if it does not.
- Instead of specifying the amount of memory (nn[KMGTPE]),
- you can specify "mirror" option. In case "mirror"
+ It is possible to specify the exact amount of memory in
+ the form of "nn[KMGTPE]", a percentage of total system
+ memory in the form of "nn%", or "mirror". If "mirror"
option is specified, mirrored (reliable) memory is used
for non-movable allocations and remaining memory is used
- for Movable pages. nn[KMGTPE] and "mirror" are exclusive,
- so you can NOT specify nn[KMGTPE] and "mirror" at the same
- time.
+ for Movable pages. "nn[KMGTPE]", "nn%", and "mirror"
+ are exclusive, so you cannot specify multiple forms.
kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
@@ -1902,6 +1907,9 @@
kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
Default is 0 (don't ignore, but inject #GP)
+ kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
+ Default is false (don't support).
+
kvm.mmu_audit= [KVM] This is a R/W parameter which allows audit
KVM MMU at runtime.
Default is 0 (off)
@@ -2377,13 +2385,14 @@
mousedev.yres= [MOUSE] Vertical screen resolution, used for devices
reporting absolute coordinates, such as tablets
- movablecore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
- is similar to kernelcore except it specifies the
- amount of memory used for migratable allocations.
- If both kernelcore and movablecore is specified,
- then kernelcore will be at *least* the specified
- value but may be more. If movablecore on its own
- is specified, the administrator must be careful
+ movablecore= [KNL,X86,IA-64,PPC]
+ Format: nn[KMGTPE] | nn%
+ This parameter is the complement to kernelcore=, it
+ specifies the amount of memory used for migratable
+ allocations. If both kernelcore and movablecore is
+ specified, then kernelcore will be at *least* the
+ specified value but may be more. If movablecore on its
+ own is specified, the administrator must be careful
that the amount of memory usable for all allocations
is not too small.
@@ -3154,18 +3163,13 @@
force Enable ASPM even on devices that claim not to support it.
WARNING: Forcing ASPM on may cause system lockups.
- pcie_hp= [PCIE] PCI Express Hotplug driver options:
- nomsi Do not use MSI for PCI Express Native Hotplug (this
- makes all PCIe ports use INTx for hotplug services).
-
- pcie_ports= [PCIE] PCIe ports handling:
- auto Ask the BIOS whether or not to use native PCIe services
- associated with PCIe ports (PME, hot-plug, AER). Use
- them only if that is allowed by the BIOS.
- native Use native PCIe services associated with PCIe ports
- unconditionally.
- compat Treat PCIe ports as PCI-to-PCI bridges, disable the PCIe
- ports driver.
+ pcie_ports= [PCIE] PCIe port services handling:
+ native Use native PCIe services (PME, AER, DPC, PCIe hotplug)
+ even if the platform doesn't give the OS permission to
+ use them. This may cause conflicts if the platform
+ also tries to use these services.
+ compat Disable native PCIe services (PME, AER, DPC, PCIe
+ hotplug).
pcie_port_pm= [PCIE] PCIe port power management handling:
off Disable power management of all PCIe ports
diff --git a/Documentation/arm/Samsung-S3C24XX/S3C2412.txt b/Documentation/arm/Samsung-S3C24XX/S3C2412.txt
index f057876b920b..dc1fd362d3c1 100644
--- a/Documentation/arm/Samsung-S3C24XX/S3C2412.txt
+++ b/Documentation/arm/Samsung-S3C24XX/S3C2412.txt
@@ -46,7 +46,7 @@ NAND
----
The NAND hardware is similar to the S3C2440, and is supported by the
- s3c2410 driver in the drivers/mtd/nand directory.
+ s3c2410 driver in the drivers/mtd/nand/raw directory.
USB Host
diff --git a/Documentation/arm64/memory.txt b/Documentation/arm64/memory.txt
index 671bc0639262..c5dab30d3389 100644
--- a/Documentation/arm64/memory.txt
+++ b/Documentation/arm64/memory.txt
@@ -86,9 +86,12 @@ Translation table lookup with 64KB pages:
+-------------------------------------------------> [63] TTBR0/1
-When using KVM without the Virtualization Host Extensions, the hypervisor
-maps kernel pages in EL2 at a fixed offset from the kernel VA. See the
-kern_hyp_va macro for more details.
+When using KVM without the Virtualization Host Extensions, the
+hypervisor maps kernel pages in EL2 at a fixed (and potentially
+random) offset from the linear mapping. See the kern_hyp_va macro and
+kvm_update_va_mask function for more details. MMIO devices such as
+GICv2 gets mapped next to the HYP idmap page, as do vectors when
+ARM64_HARDEN_EL2_VECTORS is selected for particular CPUs.
When using KVM with the Virtualization Host Extensions, no additional
mappings are created, since the host kernel runs directly in EL2.
diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index a4af2e124e24..3682e99234c2 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -262,7 +262,7 @@ When oom event notifier is registered, event will be delivered.
2.6 Locking
lock_page_cgroup()/unlock_page_cgroup() should not be called under
- mapping->tree_lock.
+ the i_pages lock.
Other lock order is following:
PG_locked.
diff --git a/Documentation/device-mapper/verity.txt b/Documentation/device-mapper/verity.txt
index 89fd8f9a259f..b3d2e4a42255 100644
--- a/Documentation/device-mapper/verity.txt
+++ b/Documentation/device-mapper/verity.txt
@@ -109,6 +109,17 @@ fec_start <offset>
This is the offset, in <data_block_size> blocks, from the start of the
FEC device to the beginning of the encoding data.
+check_at_most_once
+ Verify data blocks only the first time they are read from the data device,
+ rather than every time. This reduces the overhead of dm-verity so that it
+ can be used on systems that are memory and/or CPU constrained. However, it
+ provides a reduced level of security because only offline tampering of the
+ data device's content will be detected, not online tampering.
+
+ Hash blocks are still verified each time they are read from the hash device,
+ since verification of hash blocks is less performance critical than data
+ blocks, and a hash block will not be verified any more after all the data
+ blocks it covers have been verified anyway.
Theory of operation
===================
diff --git a/Documentation/devicetree/bindings/arm/hisilicon/hisilicon-low-pin-count.txt b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon-low-pin-count.txt
new file mode 100644
index 000000000000..10bd35f9207f
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon-low-pin-count.txt
@@ -0,0 +1,33 @@
+Hisilicon Hip06 Low Pin Count device
+ Hisilicon Hip06 SoCs implement a Low Pin Count (LPC) controller, which
+ provides I/O access to some legacy ISA devices.
+ Hip06 is based on arm64 architecture where there is no I/O space. So, the
+ I/O ports here are not CPU addresses, and there is no 'ranges' property in
+ LPC device node.
+
+Required properties:
+- compatible: value should be as follows:
+ (a) "hisilicon,hip06-lpc"
+ (b) "hisilicon,hip07-lpc"
+- #address-cells: must be 2 which stick to the ISA/EISA binding doc.
+- #size-cells: must be 1 which stick to the ISA/EISA binding doc.
+- reg: base memory range where the LPC register set is mapped.
+
+Note:
+ The node name before '@' must be "isa" to represent the binding stick to the
+ ISA/EISA binding specification.
+
+Example:
+
+isa@a01b0000 {
+ compatible = "hisilicon,hip06-lpc";
+ #address-cells = <2>;
+ #size-cells = <1>;
+ reg = <0x0 0xa01b0000 0x0 0x1000>;
+
+ ipmi0: bt@e4 {
+ compatible = "ipmi-bt";
+ device_type = "ipmi";
+ reg = <0x01 0xe4 0x04>;
+ };
+};
diff --git a/Documentation/devicetree/bindings/dma/mtk-hsdma.txt b/Documentation/devicetree/bindings/dma/mtk-hsdma.txt
new file mode 100644
index 000000000000..4bb317359dc6
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/mtk-hsdma.txt
@@ -0,0 +1,33 @@
+MediaTek High-Speed DMA Controller
+==================================
+
+This device follows the generic DMA bindings defined in dma/dma.txt.
+
+Required properties:
+
+- compatible: Must be one of
+ "mediatek,mt7622-hsdma": for MT7622 SoC
+ "mediatek,mt7623-hsdma": for MT7623 SoC
+- reg: Should contain the register's base address and length.
+- interrupts: Should contain a reference to the interrupt used by this
+ device.
+- clocks: Should be the clock specifiers corresponding to the entry in
+ clock-names property.
+- clock-names: Should contain "hsdma" entries.
+- power-domains: Phandle to the power domain that the device is part of
+- #dma-cells: The length of the DMA specifier, must be <1>. This one cell
+ in dmas property of a client device represents the channel
+ number.
+Example:
+
+ hsdma: dma-controller@1b007000 {
+ compatible = "mediatek,mt7623-hsdma";
+ reg = <0 0x1b007000 0 0x1000>;
+ interrupts = <GIC_SPI 98 IRQ_TYPE_LEVEL_LOW>;
+ clocks = <&ethsys CLK_ETHSYS_HSDMA>;
+ clock-names = "hsdma";
+ power-domains = <&scpsys MT2701_POWER_DOMAIN_ETH>;
+ #dma-cells = <1>;
+ };
+
+DMA clients must use the format described in dma/dma.txt file.
diff --git a/Documentation/devicetree/bindings/dma/qcom_bam_dma.txt b/Documentation/devicetree/bindings/dma/qcom_bam_dma.txt
index 9cbf5d9df8fd..cf5b9e44432c 100644
--- a/Documentation/devicetree/bindings/dma/qcom_bam_dma.txt
+++ b/Documentation/devicetree/bindings/dma/qcom_bam_dma.txt
@@ -15,6 +15,10 @@ Required properties:
the secure world.
- qcom,controlled-remotely : optional, indicates that the bam is controlled by
remote proccessor i.e. execution environment.
+- num-channels : optional, indicates supported number of DMA channels in a
+ remotely controlled bam.
+- qcom,num-ees : optional, indicates supported number of Execution Environments
+ in a remotely controlled bam.
Example:
diff --git a/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt b/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt
index 891db41e9420..aadfb236d53a 100644
--- a/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt
+++ b/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt
@@ -18,6 +18,7 @@ Required Properties:
Examples with soctypes are:
- "renesas,dmac-r8a7743" (RZ/G1M)
- "renesas,dmac-r8a7745" (RZ/G1E)
+ - "renesas,dmac-r8a77470" (RZ/G1C)
- "renesas,dmac-r8a7790" (R-Car H2)
- "renesas,dmac-r8a7791" (R-Car M2-W)
- "renesas,dmac-r8a7792" (R-Car V2H)
@@ -26,6 +27,7 @@ Required Properties:
- "renesas,dmac-r8a7795" (R-Car H3)
- "renesas,dmac-r8a7796" (R-Car M3-W)
- "renesas,dmac-r8a77970" (R-Car V3M)
+ - "renesas,dmac-r8a77980" (R-Car V3H)
- reg: base address and length of the registers block for the DMAC
diff --git a/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt b/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt
index f3d1f151ba80..9dc935e24e55 100644
--- a/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt
+++ b/Documentation/devicetree/bindings/dma/renesas,usb-dmac.txt
@@ -11,6 +11,7 @@ Required Properties:
- "renesas,r8a7794-usb-dmac" (R-Car E2)
- "renesas,r8a7795-usb-dmac" (R-Car H3)
- "renesas,r8a7796-usb-dmac" (R-Car M3-W)
+ - "renesas,r8a77965-usb-dmac" (R-Car M3-N)
- reg: base address and length of the registers block for the DMAC
- interrupts: interrupt specifiers for the DMAC, one for each entry in
interrupt-names.
diff --git a/Documentation/devicetree/bindings/dma/snps,dw-axi-dmac.txt b/Documentation/devicetree/bindings/dma/snps,dw-axi-dmac.txt
new file mode 100644
index 000000000000..f237b7928283
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/snps,dw-axi-dmac.txt
@@ -0,0 +1,41 @@
+Synopsys DesignWare AXI DMA Controller
+
+Required properties:
+- compatible: "snps,axi-dma-1.01a"
+- reg: Address range of the DMAC registers. This should include
+ all of the per-channel registers.
+- interrupt: Should contain the DMAC interrupt number.
+- interrupt-parent: Should be the phandle for the interrupt controller
+ that services interrupts for this device.
+- dma-channels: Number of channels supported by hardware.
+- snps,dma-masters: Number of AXI masters supported by the hardware.
+- snps,data-width: Maximum AXI data width supported by hardware.
+ (0 - 8bits, 1 - 16bits, 2 - 32bits, ..., 6 - 512bits)
+- snps,priority: Priority of channel. Array size is equal to the number of
+ dma-channels. Priority value must be programmed within [0:dma-channels-1]
+ range. (0 - minimum priority)
+- snps,block-size: Maximum block size supported by the controller channel.
+ Array size is equal to the number of dma-channels.
+
+Optional properties:
+- snps,axi-max-burst-len: Restrict master AXI burst length by value specified
+ in this property. If this property is missing the maximum AXI burst length
+ supported by DMAC is used. [1:256]
+
+Example:
+
+dmac: dma-controller@80000 {
+ compatible = "snps,axi-dma-1.01a";
+ reg = <0x80000 0x400>;
+ clocks = <&core_clk>, <&cfgr_clk>;
+ clock-names = "core-clk", "cfgr-clk";
+ interrupt-parent = <&intc>;
+ interrupts = <27>;
+
+ dma-channels = <4>;
+ snps,dma-masters = <2>;
+ snps,data-width = <3>;
+ snps,block-size = <4096 4096 4096 4096>;
+ snps,priority = <0 1 2 3>;
+ snps,axi-max-burst-len = <16>;
+};
diff --git a/Documentation/devicetree/bindings/dma/stm32-dma.txt b/Documentation/devicetree/bindings/dma/stm32-dma.txt
index 0b55718bf889..c5f519097204 100644
--- a/Documentation/devicetree/bindings/dma/stm32-dma.txt
+++ b/Documentation/devicetree/bindings/dma/stm32-dma.txt
@@ -62,14 +62,14 @@ channel: a phandle to the DMA controller plus the following four integer cells:
0x1: medium
0x2: high
0x3: very high
-4. A 32bit mask specifying the DMA FIFO threshold configuration which are device
- dependent:
- -bit 0-1: Fifo threshold
+4. A 32bit bitfield value specifying DMA features which are device dependent:
+ -bit 0-1: DMA FIFO threshold selection
0x0: 1/4 full FIFO
0x1: 1/2 full FIFO
0x2: 3/4 full FIFO
0x3: full FIFO
+
Example:
usart1: serial@40011000 {
diff --git a/Documentation/devicetree/bindings/eeprom/at24.txt b/Documentation/devicetree/bindings/eeprom/at24.txt
index abfae1beca2b..61d833abafbf 100644
--- a/Documentation/devicetree/bindings/eeprom/at24.txt
+++ b/Documentation/devicetree/bindings/eeprom/at24.txt
@@ -41,12 +41,16 @@ Required properties:
"nxp",
"ramtron",
"renesas",
+ "rohm",
"st",
Some vendors use different model names for chips which are just
variants of the above. Known such exceptions are listed below:
+ "nxp,se97b" - the fallback is "atmel,24c02",
"renesas,r1ex24002" - the fallback is "atmel,24c02"
+ "renesas,r1ex24128" - the fallback is "atmel,24c128"
+ "rohm,br24t01" - the fallback is "atmel,24c01"
- reg: The I2C address of the EEPROM.
diff --git a/Documentation/devicetree/bindings/i2c/i2c-rcar.txt b/Documentation/devicetree/bindings/i2c/i2c-rcar.txt
index a777477e4547..4a7811ecd954 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-rcar.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-rcar.txt
@@ -13,7 +13,9 @@ Required properties:
"renesas,i2c-r8a7794" if the device is a part of a R8A7794 SoC.
"renesas,i2c-r8a7795" if the device is a part of a R8A7795 SoC.
"renesas,i2c-r8a7796" if the device is a part of a R8A7796 SoC.
+ "renesas,i2c-r8a77965" if the device is a part of a R8A77965 SoC.
"renesas,i2c-r8a77970" if the device is a part of a R8A77970 SoC.
+ "renesas,i2c-r8a77995" if the device is a part of a R8A77995 SoC.
"renesas,rcar-gen1-i2c" for a generic R-Car Gen1 compatible device.
"renesas,rcar-gen2-i2c" for a generic R-Car Gen2 or RZ/G1 compatible
device.
diff --git a/Documentation/devicetree/bindings/i2c/i2c-sh_mobile.txt b/Documentation/devicetree/bindings/i2c/i2c-sh_mobile.txt
index 224390999e81..fc7e17802746 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-sh_mobile.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-sh_mobile.txt
@@ -13,6 +13,7 @@ Required properties:
- "renesas,iic-r8a7794" (R-Car E2)
- "renesas,iic-r8a7795" (R-Car H3)
- "renesas,iic-r8a7796" (R-Car M3-W)
+ - "renesas,iic-r8a77965" (R-Car M3-N)
- "renesas,iic-sh73a0" (SH-Mobile AG5)
- "renesas,rcar-gen2-iic" (generic R-Car Gen2 or RZ/G1
compatible device)
diff --git a/Documentation/devicetree/bindings/i2c/i2c-synquacer.txt b/Documentation/devicetree/bindings/i2c/i2c-synquacer.txt
new file mode 100644
index 000000000000..72f4a2f0fedc
--- /dev/null
+++ b/Documentation/devicetree/bindings/i2c/i2c-synquacer.txt
@@ -0,0 +1,29 @@
+Socionext SynQuacer I2C
+
+Required properties:
+- compatible : Must be "socionext,synquacer-i2c"
+- reg : Offset and length of the register set for the device
+- interrupts : A single interrupt specifier
+- #address-cells : Must be <1>;
+- #size-cells : Must be <0>;
+- clock-names : Must contain "pclk".
+- clocks : Must contain an entry for each name in clock-names.
+ (See the common clock bindings.)
+
+Optional properties:
+- clock-frequency : Desired I2C bus clock frequency in Hz. As only Normal and
+ Fast modes are supported, possible values are 100000 and
+ 400000.
+
+Example :
+
+ i2c@51210000 {
+ compatible = "socionext,synquacer-i2c";
+ reg = <0x51210000 0x1000>;
+ interrupts = <GIC_SPI 165 IRQ_TYPE_LEVEL_HIGH>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+ clock-names = "pclk";
+ clocks = <&clk_i2c>;
+ clock-frequency = <400000>;
+ };
diff --git a/Documentation/devicetree/bindings/mailbox/hisilicon,hi3660-mailbox.txt b/Documentation/devicetree/bindings/mailbox/hisilicon,hi3660-mailbox.txt
new file mode 100644
index 000000000000..3e5b4537407d
--- /dev/null
+++ b/Documentation/devicetree/bindings/mailbox/hisilicon,hi3660-mailbox.txt
@@ -0,0 +1,51 @@
+Hisilicon Hi3660 Mailbox Controller
+
+Hisilicon Hi3660 mailbox controller supports up to 32 channels. Messages
+are passed between processors, including application & communication
+processors, MCU, HIFI, etc. Each channel is unidirectional and accessed
+by using MMIO registers; it supports maximum to 8 words message.
+
+Controller
+----------
+
+Required properties:
+- compatible: : Shall be "hisilicon,hi3660-mbox"
+- reg: : Offset and length of the device's register set
+- #mbox-cells: : Must be 3
+ <&phandle channel dst_irq ack_irq>
+ phandle : Label name of controller
+ channel : Channel number
+ dst_irq : Remote interrupt vector
+ ack_irq : Local interrupt vector
+
+- interrupts: : Contains the two IRQ lines for mailbox.
+
+Example:
+
+mailbox: mailbox@e896b000 {
+ compatible = "hisilicon,hi3660-mbox";
+ reg = <0x0 0xe896b000 0x0 0x1000>;
+ interrupts = <0x0 0xc0 0x4>,
+ <0x0 0xc1 0x4>;
+ #mbox-cells = <3>;
+};
+
+Client
+------
+
+Required properties:
+- compatible : See the client docs
+- mboxes : Standard property to specify a Mailbox (See ./mailbox.txt)
+ Cells must match 'mbox-cells' (See Controller docs above)
+
+Optional properties
+- mbox-names : Name given to channels seen in the 'mboxes' property.
+
+Example:
+
+stub_clock: stub_clock@e896b500 {
+ compatible = "hisilicon,hi3660-stub-clk";
+ reg = <0x0 0xe896b500 0x0 0x0100>;
+ #clock-cells = <1>;
+ mboxes = <&mailbox 13 3 0>;
+};
diff --git a/Documentation/devicetree/bindings/mips/mscc.txt b/Documentation/devicetree/bindings/mips/mscc.txt
new file mode 100644
index 000000000000..ae15ec333542
--- /dev/null
+++ b/Documentation/devicetree/bindings/mips/mscc.txt
@@ -0,0 +1,43 @@
+* Microsemi MIPS CPUs
+
+Boards with a SoC of the Microsemi MIPS family shall have the following
+properties:
+
+Required properties:
+- compatible: "mscc,ocelot"
+
+
+* Other peripherals:
+
+o CPU chip regs:
+
+The SoC has a few registers (DEVCPU_GCB:CHIP_REGS) handling miscellaneous
+functionalities: chip ID, general purpose register for software use, reset
+controller, hardware status and configuration, efuses.
+
+Required properties:
+- compatible: Should be "mscc,ocelot-chip-regs", "simple-mfd", "syscon"
+- reg : Should contain registers location and length
+
+Example:
+ syscon@71070000 {
+ compatible = "mscc,ocelot-chip-regs", "simple-mfd", "syscon";
+ reg = <0x71070000 0x1c>;
+ };
+
+
+o CPU system control:
+
+The SoC has a few registers (ICPU_CFG:CPU_SYSTEM_CTRL) handling configuration of
+the CPU: 8 general purpose registers, reset control, CPU en/disabling, CPU
+endianness, CPU bus control, CPU status.
+
+Required properties:
+- compatible: Should be "mscc,ocelot-cpu-syscon", "syscon"
+- reg : Should contain registers location and length
+
+Example:
+ syscon@70000000 {
+ compatible = "mscc,ocelot-cpu-syscon", "syscon";
+ reg = <0x70000000 0x2c>;
+ };
diff --git a/Documentation/devicetree/bindings/mtd/fsl-quadspi.txt b/Documentation/devicetree/bindings/mtd/fsl-quadspi.txt
index 63d4d626fbd5..483e9cfac1b1 100644
--- a/Documentation/devicetree/bindings/mtd/fsl-quadspi.txt
+++ b/Documentation/devicetree/bindings/mtd/fsl-quadspi.txt
@@ -39,3 +39,27 @@ qspi0: quadspi@40044000 {
....
};
};
+
+Example showing the usage of two SPI NOR devices:
+
+&qspi2 {
+ pinctrl-names = "default";
+ pinctrl-0 = <&pinctrl_qspi2>;
+ status = "okay";
+
+ flash0: n25q256a@0 {
+ #address-cells = <1>;
+ #size-cells = <1>;
+ compatible = "micron,n25q256a", "jedec,spi-nor";
+ spi-max-frequency = <29000000>;
+ reg = <0>;
+ };
+
+ flash1: n25q256a@1 {
+ #address-cells = <1>;
+ #size-cells = <1>;
+ compatible = "micron,n25q256a", "jedec,spi-nor";
+ spi-max-frequency = <29000000>;
+ reg = <1>;
+ };
+};
diff --git a/Documentation/devicetree/bindings/mtd/marvell-nand.txt b/Documentation/devicetree/bindings/mtd/marvell-nand.txt
index c08fb477b3c6..e0c790706b9b 100644
--- a/Documentation/devicetree/bindings/mtd/marvell-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/marvell-nand.txt
@@ -14,7 +14,10 @@ Required properties:
- #address-cells: shall be set to 1. Encode the NAND CS.
- #size-cells: shall be set to 0.
- interrupts: shall define the NAND controller interrupt.
-- clocks: shall reference the NAND controller clock.
+- clocks: shall reference the NAND controller clocks, the second one is
+ is only needed for the Armada 7K/8K SoCs
+- clock-names: mandatory if there is a second clock, in this case there
+ should be one clock named "core" and another one named "reg"
- marvell,system-controller: Set to retrieve the syscon node that handles
NAND controller related registers (only required with the
"marvell,armada-8k-nand[-controller]" compatibles).
diff --git a/Documentation/devicetree/bindings/mtd/mtd-physmap.txt b/Documentation/devicetree/bindings/mtd/mtd-physmap.txt
index 4a0a48bf4ecb..232fa12e90ef 100644
--- a/Documentation/devicetree/bindings/mtd/mtd-physmap.txt
+++ b/Documentation/devicetree/bindings/mtd/mtd-physmap.txt
@@ -41,6 +41,13 @@ additional (optional) property is defined:
- erase-size : The chip's physical erase block size in bytes.
+ The device tree may optionally contain endianness property.
+ little-endian or big-endian : It Represents the endianness that should be used
+ by the controller to properly read/write data
+ from/to the flash. If this property is missing,
+ the endianness is chosen by the system
+ (potentially based on extra configuration options).
+
The device tree may optionally contain sub-nodes describing partitions of the
address space. See partition.txt for more detail.
diff --git a/Documentation/devicetree/bindings/mtd/sunxi-nand.txt b/Documentation/devicetree/bindings/mtd/sunxi-nand.txt
index 5e13a5cdff03..0734f03bf3d3 100644
--- a/Documentation/devicetree/bindings/mtd/sunxi-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/sunxi-nand.txt
@@ -24,8 +24,8 @@ Optional properties:
- allwinner,rb : shall contain the native Ready/Busy ids.
or
- rb-gpios : shall contain the gpios used as R/B pins.
-- nand-ecc-mode : one of the supported ECC modes ("hw", "hw_syndrome", "soft",
- "soft_bch" or "none")
+- nand-ecc-mode : one of the supported ECC modes ("hw", "soft", "soft_bch" or
+ "none")
see Documentation/devicetree/bindings/mtd/nand.txt for generic bindings.
diff --git a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt
index 594982c6b9f9..79bf352e659c 100644
--- a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt
+++ b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt
@@ -6,7 +6,11 @@ the definition of the PHY node in booting-without-of.txt for an example
of how to define a PHY.
Required properties:
- - reg : Offset and length of the register set for the device
+ - reg : Offset and length of the register set for the device, and optionally
+ the offset and length of the TBIPA register (TBI PHY address
+ register). If TBIPA register is not specified, the driver will
+ attempt to infer it from the register set specified (your mileage may
+ vary).
- compatible : Should define the compatible device type for the
mdio. Currently supported strings/devices are:
- "fsl,gianfar-tbi"
diff --git a/Documentation/devicetree/bindings/pci/hisilicon-histb-pcie.txt b/Documentation/devicetree/bindings/pci/hisilicon-histb-pcie.txt
index c84bc027930b..760b4d740616 100644
--- a/Documentation/devicetree/bindings/pci/hisilicon-histb-pcie.txt
+++ b/Documentation/devicetree/bindings/pci/hisilicon-histb-pcie.txt
@@ -34,6 +34,7 @@ Required properties
Optional properties:
- reset-gpios: The gpio to generate PCIe PERST# assert and deassert signal.
+- vpcie-supply: The regulator in charge of PCIe port power.
- phys: List of phandle and phy mode specifier, should be 0.
- phy-names: Must be "phy".
diff --git a/Documentation/devicetree/bindings/pci/mediatek-pcie.txt b/Documentation/devicetree/bindings/pci/mediatek-pcie.txt
index 3a6ce55dd310..20227a875ac8 100644
--- a/Documentation/devicetree/bindings/pci/mediatek-pcie.txt
+++ b/Documentation/devicetree/bindings/pci/mediatek-pcie.txt
@@ -78,7 +78,7 @@ Examples for MT7623:
#reset-cells = <1>;
};
- pcie: pcie-controller@1a140000 {
+ pcie: pcie@1a140000 {
compatible = "mediatek,mt7623-pcie";
device_type = "pci";
reg = <0 0x1a140000 0 0x1000>, /* PCIe shared registers */
@@ -111,7 +111,6 @@ Examples for MT7623:
0x83000000 0 0x60000000 0 0x60000000 0 0x10000000>; /* memory space */
pcie@0,0 {
- device_type = "pci";
reg = <0x0000 0 0 0 0>;
#address-cells = <3>;
#size-cells = <2>;
@@ -123,7 +122,6 @@ Examples for MT7623:
};
pcie@1,0 {
- device_type = "pci";
reg = <0x0800 0 0 0 0>;
#address-cells = <3>;
#size-cells = <2>;
@@ -135,7 +133,6 @@ Examples for MT7623:
};
pcie@2,0 {
- device_type = "pci";
reg = <0x1000 0 0 0 0>;
#address-cells = <3>;
#size-cells = <2>;
@@ -148,6 +145,7 @@ Examples for MT7623:
};
Examples for MT2712:
+
pcie: pcie@11700000 {
compatible = "mediatek,mt2712-pcie";
device_type = "pci";
@@ -169,7 +167,6 @@ Examples for MT2712:
ranges = <0x82000000 0 0x20000000 0x0 0x20000000 0 0x10000000>;
pcie0: pcie@0,0 {
- device_type = "pci";
reg = <0x0000 0 0 0 0>;
#address-cells = <3>;
#size-cells = <2>;
@@ -189,7 +186,6 @@ Examples for MT2712:
};
pcie1: pcie@1,0 {
- device_type = "pci";
reg = <0x0800 0 0 0 0>;
#address-cells = <3>;
#size-cells = <2>;
@@ -210,6 +206,7 @@ Examples for MT2712:
};
Examples for MT7622:
+
pcie: pcie@1a140000 {
compatible = "mediatek,mt7622-pcie";
device_type = "pci";
@@ -243,7 +240,6 @@ Examples for MT7622:
ranges = <0x82000000 0 0x20000000 0x0 0x20000000 0 0x10000000>;
pcie0: pcie@0,0 {
- device_type = "pci";
reg = <0x0000 0 0 0 0>;
#address-cells = <3>;
#size-cells = <2>;
@@ -263,7 +259,6 @@ Examples for MT7622:
};
pcie1: pcie@1,0 {
- device_type = "pci";
reg = <0x0800 0 0 0 0>;
#address-cells = <3>;
#size-cells = <2>;
diff --git a/Documentation/devicetree/bindings/pci/qcom,pcie.txt b/Documentation/devicetree/bindings/pci/qcom,pcie.txt
index 3c9d321b3d3b..1fd703bd73e0 100644
--- a/Documentation/devicetree/bindings/pci/qcom,pcie.txt
+++ b/Documentation/devicetree/bindings/pci/qcom,pcie.txt
@@ -189,6 +189,10 @@
Value type: <phandle>
Definition: A phandle to the analog power supply for IC which generates
reference clock
+- vddpe-3v3-supply:
+ Usage: optional
+ Value type: <phandle>
+ Definition: A phandle to the PCIe endpoint power supply
- phys:
Usage: required for apq8084
diff --git a/Documentation/devicetree/bindings/pci/rcar-pci.txt b/Documentation/devicetree/bindings/pci/rcar-pci.txt
index 76ba3a61d1a3..1fb614e615da 100644
--- a/Documentation/devicetree/bindings/pci/rcar-pci.txt
+++ b/Documentation/devicetree/bindings/pci/rcar-pci.txt
@@ -1,13 +1,15 @@
* Renesas R-Car PCIe interface
Required properties:
-compatible: "renesas,pcie-r8a7779" for the R8A7779 SoC;
+compatible: "renesas,pcie-r8a7743" for the R8A7743 SoC;
+ "renesas,pcie-r8a7779" for the R8A7779 SoC;
"renesas,pcie-r8a7790" for the R8A7790 SoC;
"renesas,pcie-r8a7791" for the R8A7791 SoC;
"renesas,pcie-r8a7793" for the R8A7793 SoC;
"renesas,pcie-r8a7795" for the R8A7795 SoC;
"renesas,pcie-r8a7796" for the R8A7796 SoC;
- "renesas,pcie-rcar-gen2" for a generic R-Car Gen2 compatible device.
+ "renesas,pcie-rcar-gen2" for a generic R-Car Gen2 or
+ RZ/G1 compatible device.
"renesas,pcie-rcar-gen3" for a generic R-Car Gen3 compatible device.
When compatible with the generic version, nodes must list the
diff --git a/Documentation/devicetree/bindings/pmem/pmem-region.txt b/Documentation/devicetree/bindings/pmem/pmem-region.txt
new file mode 100644
index 000000000000..5cfa4f016a00
--- /dev/null
+++ b/Documentation/devicetree/bindings/pmem/pmem-region.txt
@@ -0,0 +1,65 @@
+Device-tree bindings for persistent memory regions
+-----------------------------------------------------
+
+Persistent memory refers to a class of memory devices that are:
+
+ a) Usable as main system memory (i.e. cacheable), and
+ b) Retain their contents across power failure.
+
+Given b) it is best to think of persistent memory as a kind of memory mapped
+storage device. To ensure data integrity the operating system needs to manage
+persistent regions separately to the normal memory pool. To aid with that this
+binding provides a standardised interface for discovering where persistent
+memory regions exist inside the physical address space.
+
+Bindings for the region nodes:
+-----------------------------
+
+Required properties:
+ - compatible = "pmem-region"
+
+ - reg = <base, size>;
+ The reg property should specificy an address range that is
+ translatable to a system physical address range. This address
+ range should be mappable as normal system memory would be
+ (i.e cacheable).
+
+ If the reg property contains multiple address ranges
+ each address range will be treated as though it was specified
+ in a separate device node. Having multiple address ranges in a
+ node implies no special relationship between the two ranges.
+
+Optional properties:
+ - Any relevant NUMA assocativity properties for the target platform.
+
+ - volatile; This property indicates that this region is actually
+ backed by non-persistent memory. This lets the OS know that it
+ may skip the cache flushes required to ensure data is made
+ persistent after a write.
+
+ If this property is absent then the OS must assume that the region
+ is backed by non-volatile memory.
+
+Examples:
+--------------------
+
+ /*
+ * This node specifies one 4KB region spanning from
+ * 0x5000 to 0x5fff that is backed by non-volatile memory.
+ */
+ pmem@5000 {
+ compatible = "pmem-region";
+ reg = <0x00005000 0x00001000>;
+ };
+
+ /*
+ * This node specifies two 4KB regions that are backed by
+ * volatile (normal) memory.
+ */
+ pmem@6000 {
+ compatible = "pmem-region";
+ reg = < 0x00006000 0x00001000
+ 0x00008000 0x00001000 >;
+ volatile;
+ };
+
diff --git a/Documentation/devicetree/bindings/rtc/isil,isl12026.txt b/Documentation/devicetree/bindings/rtc/isil,isl12026.txt
new file mode 100644
index 000000000000..2e0be45193bb
--- /dev/null
+++ b/Documentation/devicetree/bindings/rtc/isil,isl12026.txt
@@ -0,0 +1,28 @@
+ISL12026 I2C RTC/EEPROM
+
+ISL12026 is an I2C RTC/EEPROM combination device. The RTC and control
+registers respond at bus address 0x6f, and the EEPROM array responds
+at bus address 0x57. The canonical "reg" value will be for the RTC portion.
+
+Required properties supported by the device:
+
+ - "compatible": must be "isil,isl12026"
+ - "reg": I2C bus address of the device (always 0x6f)
+
+Optional properties:
+
+ - "isil,pwr-bsw": If present PWR.BSW bit must be set to the specified
+ value for proper operation.
+
+ - "isil,pwr-sbib": If present PWR.SBIB bit must be set to the specified
+ value for proper operation.
+
+
+Example:
+
+ rtc@6f {
+ compatible = "isil,isl12026";
+ reg = <0x6f>;
+ isil,pwr-bsw = <0>;
+ isil,pwr-sbib = <1>;
+ }
diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt
index 12e8b3e576b0..b5f978a4cac6 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.txt
+++ b/Documentation/devicetree/bindings/vendor-prefixes.txt
@@ -225,6 +225,7 @@ motorola Motorola, Inc.
moxa Moxa Inc.
mpl MPL AG
mqmaker mqmaker Inc.
+mscc Microsemi Corporation
msi Micro-Star International Co. Ltd.
mti Imagination Technologies Ltd. (formerly MIPS Technologies Inc.)
multi-inno Multi-Inno Technology Co.,Ltd
diff --git a/Documentation/driver-api/gpio/drivers-on-gpio.rst b/Documentation/driver-api/gpio/drivers-on-gpio.rst
index 019483868977..7da0c1dd1f7a 100644
--- a/Documentation/driver-api/gpio/drivers-on-gpio.rst
+++ b/Documentation/driver-api/gpio/drivers-on-gpio.rst
@@ -75,8 +75,8 @@ hardware descriptions such as device tree or ACPI:
it from 1-to-0-to-1. If that hardware does not receive its "ping"
periodically, it will reset the system.
-- gpio-nand: drivers/mtd/nand/gpio.c is used to connect a NAND flash chip to
- a set of simple GPIO lines: RDY, NCE, ALE, CLE, NWP. It interacts with the
+- gpio-nand: drivers/mtd/nand/raw/gpio.c is used to connect a NAND flash chip
+ to a set of simple GPIO lines: RDY, NCE, ALE, CLE, NWP. It interacts with the
NAND flash MTD subsystem and provides chip access and partition parsing like
any other NAND driving hardware.
diff --git a/Documentation/driver-api/mtdnand.rst b/Documentation/driver-api/mtdnand.rst
index 2a5191b6d445..dcd63599f700 100644
--- a/Documentation/driver-api/mtdnand.rst
+++ b/Documentation/driver-api/mtdnand.rst
@@ -967,10 +967,10 @@ API functions which are exported. Each function has a short description
which is marked with an [XXX] identifier. See the chapter "Documentation
hints" for an explanation.
-.. kernel-doc:: drivers/mtd/nand/nand_base.c
+.. kernel-doc:: drivers/mtd/nand/raw/nand_base.c
:export:
-.. kernel-doc:: drivers/mtd/nand/nand_ecc.c
+.. kernel-doc:: drivers/mtd/nand/raw/nand_ecc.c
:export:
Internal Functions Provided
@@ -982,10 +982,10 @@ marked with an [XXX] identifier. See the chapter "Documentation hints"
for an explanation. The functions marked with [DEFAULT] might be
relevant for a board driver developer.
-.. kernel-doc:: drivers/mtd/nand/nand_base.c
+.. kernel-doc:: drivers/mtd/nand/raw/nand_base.c
:internal:
-.. kernel-doc:: drivers/mtd/nand/nand_bbt.c
+.. kernel-doc:: drivers/mtd/nand/raw/nand_bbt.c
:internal:
Credits
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
index 0eb31de3a2c1..2a6f7399c1f3 100644
--- a/Documentation/filesystems/caching/netfs-api.txt
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -129,20 +129,10 @@ To define an object, a structure of the following type should be filled out:
const void *parent_netfs_data,
const void *cookie_netfs_data);
- uint16_t (*get_key)(const void *cookie_netfs_data,
- void *buffer,
- uint16_t bufmax);
-
- void (*get_attr)(const void *cookie_netfs_data,
- uint64_t *size);
-
- uint16_t (*get_aux)(const void *cookie_netfs_data,
- void *buffer,
- uint16_t bufmax);
-
enum fscache_checkaux (*check_aux)(void *cookie_netfs_data,
const void *data,
- uint16_t datalen);
+ uint16_t datalen,
+ loff_t object_size);
void (*get_context)(void *cookie_netfs_data, void *context);
@@ -187,36 +177,7 @@ This has the following fields:
cache in the parent's list will be chosen, or failing that, the first
cache in the master list.
- (4) A function to retrieve an object's key from the netfs [mandatory].
-
- This function will be called with the netfs data that was passed to the
- cookie acquisition function and the maximum length of key data that it may
- provide. It should write the required key data into the given buffer and
- return the quantity it wrote.
-
- (5) A function to retrieve attribute data from the netfs [optional].
-
- This function will be called with the netfs data that was passed to the
- cookie acquisition function. It should return the size of the file if
- this is a data file. The size may be used to govern how much cache must
- be reserved for this file in the cache.
-
- If the function is absent, a file size of 0 is assumed.
-
- (6) A function to retrieve auxiliary data from the netfs [optional].
-
- This function will be called with the netfs data that was passed to the
- cookie acquisition function and the maximum length of auxiliary data that
- it may provide. It should write the auxiliary data into the given buffer
- and return the quantity it wrote.
-
- If this function is absent, the auxiliary data length will be set to 0.
-
- The length of the auxiliary data buffer may be dependent on the key
- length. A netfs mustn't rely on being able to provide more than 400 bytes
- for both.
-
- (7) A function to check the auxiliary data [optional].
+ (4) A function to check the auxiliary data [optional].
This function will be called to check that a match found in the cache for
this object is valid. For instance with AFS it could check the auxiliary
@@ -226,6 +187,9 @@ This has the following fields:
If this function is absent, it will be assumed that matching objects in a
cache are always valid.
+ The function is also passed the cache's idea of the object size and may
+ use this to manage coherency also.
+
If present, the function should return one of the following values:
(*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is
@@ -235,7 +199,7 @@ This has the following fields:
This function can also be used to extract data from the auxiliary data in
the cache and copy it into the netfs's structures.
- (8) A pair of functions to manage contexts for the completion callback
+ (5) A pair of functions to manage contexts for the completion callback
[optional].
The cache read/write functions are passed a context which is then passed
@@ -249,7 +213,7 @@ This has the following fields:
required for indices as indices may not contain data. These functions may
be called in interrupt context and so may not sleep.
- (9) A function to mark a page as retaining cache metadata [optional].
+ (6) A function to mark a page as retaining cache metadata [optional].
This is called by the cache to indicate that it is retaining in-memory
information for this page and that the netfs should uncache the page when
@@ -261,7 +225,7 @@ This has the following fields:
This function is not required for indices as they're not permitted data.
-(10) A function to unmark all the pages retaining cache metadata [mandatory].
+ (7) A function to unmark all the pages retaining cache metadata [mandatory].
This is called by FS-Cache to indicate that a backing store is being
unbound from a cookie and that all the marks on the pages should be
@@ -333,12 +297,32 @@ the path to the file:
struct fscache_cookie *
fscache_acquire_cookie(struct fscache_cookie *parent,
const struct fscache_object_def *def,
+ const void *index_key,
+ size_t index_key_len,
+ const void *aux_data,
+ size_t aux_data_len,
void *netfs_data,
+ loff_t object_size,
bool enable);
This function creates an index entry in the index represented by parent,
filling in the index entry by calling the operations pointed to by def.
+A unique key that represents the object within the parent must be pointed to by
+index_key and is of length index_key_len.
+
+An optional blob of auxiliary data that is to be stored within the cache can be
+pointed to with aux_data and should be of length aux_data_len. This would
+typically be used for storing coherency data.
+
+The netfs may pass an arbitrary value in netfs_data and this will be presented
+to it in the event of any calling back. This may also be used in tracing or
+logging of messages.
+
+The cache tracks the size of the data attached to an object and this set to be
+object_size. For indices, this should be 0. This value will be passed to the
+->check_aux() callback.
+
Note that this function never returns an error - all errors are handled
internally. It may, however, return NULL to indicate no cookie. It is quite
acceptable to pass this token back to this function as the parent to another
@@ -355,30 +339,24 @@ must be enabled to do anything with it. A disabled cookie can be enabled by
calling fscache_enable_cookie() (see below).
For example, with AFS, a cell would be added to the primary index. This index
-entry would have a dependent inode containing a volume location index for the
-volume mappings within this cell:
+entry would have a dependent inode containing volume mappings within this cell:
cell->cache =
fscache_acquire_cookie(afs_cache_netfs.primary_index,
&afs_cell_cache_index_def,
- cell, true);
-
-Then when a volume location was accessed, it would be entered into the cell's
-index and an inode would be allocated that acts as a volume type and hash chain
-combination:
+ cell->name, strlen(cell->name),
+ NULL, 0,
+ cell, 0, true);
- vlocation->cache =
- fscache_acquire_cookie(cell->cache,
- &afs_vlocation_cache_index_def,
- vlocation, true);
-
-And then a particular flavour of volume (R/O for example) could be added to
-that index, creating another index for vnodes (AFS inode equivalents):
+And then a particular volume could be added to that index by ID, creating
+another index for vnodes (AFS inode equivalents):
volume->cache =
- fscache_acquire_cookie(vlocation->cache,
+ fscache_acquire_cookie(volume->cell->cache,
&afs_volume_cache_index_def,
- volume, true);
+ &volume->vid, sizeof(volume->vid),
+ NULL, 0,
+ volume, 0, true);
======================
@@ -392,7 +370,9 @@ the object definition should be something other than index type.
vnode->cache =
fscache_acquire_cookie(volume->cache,
&afs_vnode_cache_object_def,
- vnode, true);
+ &key, sizeof(key),
+ &aux, sizeof(aux),
+ vnode, vnode->status.size, true);
=================================
@@ -408,7 +388,9 @@ it would be some other type of object such as a data file.
xattr->cache =
fscache_acquire_cookie(vnode->cache,
&afs_xattr_cache_object_def,
- xattr, true);
+ &xattr->name, strlen(xattr->name),
+ NULL, 0,
+ xattr, strlen(xattr->val), true);
Miscellaneous objects might be used to store extended attributes or directory
entries for example.
@@ -425,8 +407,7 @@ cache to adjust its metadata for data tracking appropriately:
int fscache_attr_changed(struct fscache_cookie *cookie);
The cache will return -ENOBUFS if there is no backing cache or if there is no
-space to allocate any extra metadata required in the cache. The attributes
-will be accessed with the get_attr() cookie definition operation.
+space to allocate any extra metadata required in the cache.
Note that attempts to read or write data pages in the cache over this size may
be rebuffed with -ENOBUFS.
@@ -551,12 +532,13 @@ written back to the cache:
int fscache_write_page(struct fscache_cookie *cookie,
struct page *page,
+ loff_t object_size,
gfp_t gfp);
The cookie argument must specify a data file cookie, the page specified should
contain the data to be written (and is also used to specify the page number),
-and the gfp argument is used to control how any memory allocations made are
-satisfied.
+object_size is the revised size of the object and the gfp argument is used to
+control how any memory allocations made are satisfied.
The page must have first been read or allocated successfully and must not have
been uncached before writing is performed.
@@ -717,21 +699,23 @@ INDEX AND DATA FILE CONSISTENCY
To find out whether auxiliary data for an object is up to data within the
cache, the following function can be called:
- int fscache_check_consistency(struct fscache_cookie *cookie)
+ int fscache_check_consistency(struct fscache_cookie *cookie,
+ const void *aux_data);
This will call back to the netfs to check whether the auxiliary data associated
-with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it
-may also return -ENOMEM and -ERESTARTSYS.
+with a cookie is correct; if aux_data is non-NULL, it will update the auxiliary
+data buffer first. It returns 0 if it is and -ESTALE if it isn't; it may also
+return -ENOMEM and -ERESTARTSYS.
To request an update of the index data for an index or other object, the
following function should be called:
- void fscache_update_cookie(struct fscache_cookie *cookie);
+ void fscache_update_cookie(struct fscache_cookie *cookie,
+ const void *aux_data);
-This function will refer back to the netfs_data pointer stored in the cookie by
-the acquisition function to obtain the data to write into each revised index
-entry. The update method in the parent index definition will be called to
-transfer the data.
+This function will update the cookie's auxiliary data buffer from aux_data if
+that is non-NULL and then schedule this to be stored on disk. The update
+method in the parent index definition will be called to transfer the data.
Note that partial updates may happen automatically at other times, such as when
data blocks are added to a data file object.
@@ -748,10 +732,11 @@ still possible to uncache pages and relinquish the cookie.
The initial enablement state is set by fscache_acquire_cookie(), but the cookie
can be enabled or disabled later. To disable a cookie, call:
-
+
void fscache_disable_cookie(struct fscache_cookie *cookie,
+ const void *aux_data,
bool invalidate);
-
+
If the cookie is not already disabled, this locks the cookie against other
enable and disable ops, marks the cookie as being disabled, discards or
invalidates any backing objects and waits for cessation of activity on any
@@ -760,13 +745,15 @@ associated object before unlocking the cookie.
All possible failures are handled internally. The caller should consider
calling fscache_uncache_all_inode_pages() afterwards to make sure all page
markings are cleared up.
-
+
Cookies can be enabled or reenabled with:
-
+
void fscache_enable_cookie(struct fscache_cookie *cookie,
+ const void *aux_data,
+ loff_t object_size,
bool (*can_enable)(void *data),
void *data)
-
+
If the cookie is not already enabled, this locks the cookie against other
enable and disable ops, invokes can_enable() and, if the cookie is not an index
cookie, will begin the procedure of acquiring backing objects.
@@ -777,6 +764,12 @@ ruling as to whether or not enablement should actually be permitted to begin.
All possible failures are handled internally. The cookie will only be marked
as enabled if provisional backing objects are allocated.
+The object's data size is updated from object_size and is passed to the
+->check_aux() function.
+
+In both cases, the cookie's auxiliary data buffer is updated from aux_data if
+that is non-NULL inside the enablement lock before proceeding.
+
===============================
MISCELLANEOUS COOKIE OPERATIONS
@@ -823,6 +816,7 @@ COOKIE UNREGISTRATION
To get rid of a cookie, this function should be called.
void fscache_relinquish_cookie(struct fscache_cookie *cookie,
+ const void *aux_data,
bool retire);
If retire is non-zero, then the object will be marked for recycling, and all
@@ -833,6 +827,9 @@ If retire is zero, then the object may be available again when next the
acquisition function is called. Retirement here will overrule the pinning on a
cookie.
+The cookie's auxiliary data will be updated from aux_data if that is non-NULL
+so that the cache can lazily update it on disk.
+
One very important note - relinquish must NOT be called for a cookie unless all
the cookies for "child" indices, objects and pages have been relinquished
first.
diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt
index 0b302a11718a..d7f011ddc150 100644
--- a/Documentation/filesystems/ceph.txt
+++ b/Documentation/filesystems/ceph.txt
@@ -62,6 +62,18 @@ subdirectories, and a summation of all nested file sizes. This makes
the identification of large disk space consumers relatively quick, as
no 'du' or similar recursive scan of the file system is required.
+Finally, Ceph also allows quotas to be set on any directory in the system.
+The quota can restrict the number of bytes or the number of files stored
+beneath that point in the directory hierarchy. Quotas can be set using
+extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg:
+
+ setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
+ getfattr -n ceph.quota.max_bytes /some/dir
+
+A limitation of the current quotas implementation is that it relies on the
+cooperation of the client mounting the file system to stop writers when a
+limit is reached. A modified or adversarial client cannot be prevented
+from writing as much data as it needs.
Mount Syntax
============
@@ -137,6 +149,10 @@ Mount Options
noasyncreaddir
Do not use the dcache as above for readdir.
+ noquotadf
+ Report overall filesystem usage in statfs instead of using the root
+ directory quota.
+
More Information
================
diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt
index e2818b60a5c2..f4ba94950e3f 100644
--- a/Documentation/filesystems/orangefs.txt
+++ b/Documentation/filesystems/orangefs.txt
@@ -21,10 +21,16 @@ Orangefs features include:
* Stateless
-MAILING LIST
-============
+MAILING LIST ARCHIVES
+=====================
-http://beowulf-underground.org/mailman/listinfo/pvfs2-users
+http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/
+
+
+MAILING LIST SUBMISSIONS
+========================
+
+devel@lists.orangefs.org
DOCUMENTATION
@@ -42,12 +48,59 @@ Orangefs versions prior to 2.9.3 would not be compatible with the
upstream version of the kernel client.
-BUILDING THE USERSPACE FILESYSTEM ON A SINGLE SERVER
-====================================================
+RUNNING ORANGEFS ON A SINGLE SERVER
+===================================
+
+OrangeFS is usually run in large installations with multiple servers and
+clients, but a complete filesystem can be run on a single machine for
+development and testing.
+
+On Fedora, install orangefs and orangefs-server.
+
+dnf -y install orangefs orangefs-server
+
+There is an example server configuration file in
+/etc/orangefs/orangefs.conf. Change localhost to your hostname if
+necessary.
+
+To generate a filesystem to run xfstests against, see below.
+
+There is an example client configuration file in /etc/pvfs2tab. It is a
+single line. Uncomment it and change the hostname if necessary. This
+controls clients which use libpvfs2. This does not control the
+pvfs2-client-core.
+
+Create the filesystem.
+
+pvfs2-server -f /etc/orangefs/orangefs.conf
+
+Start the server.
+
+systemctl start orangefs-server
+
+Test the server.
+
+pvfs2-ping -m /pvfsmnt
+
+Start the client. The module must be compiled in or loaded before this
+point.
+
+systemctl start orangefs-client
+
+Mount the filesystem.
+
+mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
+
-You can omit --prefix if you don't care that things are sprinkled around in
-/usr/local. As of version 2.9.6, Orangefs uses Berkeley DB by default, we
-will probably be changing the default to lmdb soon.
+BUILDING ORANGEFS ON A SINGLE SERVER
+====================================
+
+Where OrangeFS cannot be installed from distribution packages, it may be
+built from source.
+
+You can omit --prefix if you don't care that things are sprinkled around
+in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by
+default, we will probably be changing the default to LMDB soon.
./configure --prefix=/opt/ofs --with-db-backend=lmdb
@@ -55,35 +108,69 @@ make
make install
-Create an orangefs config file:
+Create an orangefs config file.
+
/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
- for "Enter hostnames", use the hostname, don't let it default to
- localhost.
+Create an /etc/pvfs2tab file.
+
+echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
+ /etc/pvfs2tab
+
+Create the mount point you specified in the tab file if needed.
-create a pvfs2tab file in /etc:
-cat /etc/pvfs2tab
-tcp://myhostname:3334/orangefs /mymountpoint pvfs2 defaults,noauto 0 0
+mkdir /pvfsmnt
-create the mount point you specified in the tab file if needed:
-mkdir /mymountpoint
+Bootstrap the server.
-bootstrap the server:
-/opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf -f
+/opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf
+
+Start the server.
-start the server:
/opt/osf/sbin/pvfs2-server /etc/pvfs2.conf
-Now the server is running. At this point you might like to
-prove things are working with:
+Now the server should be running. Pvfs2-ls is a simple
+test to verify that the server is running.
+
+/opt/ofs/bin/pvfs2-ls /pvfsmnt
-/opt/osf/bin/pvfs2-ls /mymountpoint
+If stuff seems to be working, load the kernel module and
+turn on the client core.
-If stuff seems to be working, turn on the client core:
-/opt/osf/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core
+/opt/ofs/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core
Mount your filesystem.
-mount -t pvfs2 tcp://myhostname:3334/orangefs /mymountpoint
+
+mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
+
+
+RUNNING XFSTESTS
+================
+
+It is useful to use a scratch filesystem with xfstests. This can be
+done with only one server.
+
+Make a second copy of the FileSystem section in the server configuration
+file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch.
+Change the ID to something other than the ID of the first FileSystem
+section (2 is usually a good choice).
+
+Then there are two FileSystem sections: orangefs and scratch.
+
+This change should be made before creating the filesystem.
+
+pvfs2-server -f /etc/orangefs/orangefs.conf
+
+To run xfstests, create /etc/xfsqa.config.
+
+TEST_DIR=/orangefs
+TEST_DEV=tcp://localhost:3334/orangefs
+SCRATCH_MNT=/scratch
+SCRATCH_DEV=tcp://localhost:3334/scratch
+
+Then xfstests can be run
+
+./check -pvfs2
OPTIONS
diff --git a/Documentation/hwmon/adm1275 b/Documentation/hwmon/adm1275
index 791bc0bd91e6..39033538eb03 100644
--- a/Documentation/hwmon/adm1275
+++ b/Documentation/hwmon/adm1275
@@ -6,6 +6,10 @@ Supported chips:
Prefix: 'adm1075'
Addresses scanned: -
Datasheet: www.analog.com/static/imported-files/data_sheets/ADM1075.pdf
+ * Analog Devices ADM1272
+ Prefix: 'adm1272'
+ Addresses scanned: -
+ Datasheet: www.analog.com/static/imported-files/data_sheets/ADM1272.pdf
* Analog Devices ADM1275
Prefix: 'adm1275'
Addresses scanned: -
@@ -29,11 +33,11 @@ Author: Guenter Roeck <linux@roeck-us.net>
Description
-----------
-This driver supports hardware monitoring for Analog Devices ADM1075, ADM1275,
-ADM1276, ADM1278, ADM1293, and ADM1294 Hot-Swap Controller and Digital
-Power Monitors.
+This driver supports hardware monitoring for Analog Devices ADM1075, ADM1272,
+ADM1275, ADM1276, ADM1278, ADM1293, and ADM1294 Hot-Swap Controller and
+Digital Power Monitors.
-ADM1075, ADM1275, ADM1276, ADM1278, ADM1293, and ADM1294 are hot-swap
+ADM1075, ADM1272, ADM1275, ADM1276, ADM1278, ADM1293, and ADM1294 are hot-swap
controllers that allow a circuit board to be removed from or inserted into
a live backplane. They also feature current and voltage readback via an
integrated 12 bit analog-to-digital converter (ADC), accessed using a
@@ -100,11 +104,10 @@ power1_input_lowest Lowest observed input power. ADM1293 and ADM1294 only.
power1_input_highest Highest observed input power.
power1_reset_history Write any value to reset history.
- Power attributes are supported on ADM1075, ADM1276,
- ADM1293, and ADM1294.
+ Power attributes are supported on ADM1075, ADM1272,
+ ADM1276, ADM1293, and ADM1294.
temp1_input Chip temperature.
- Temperature attributes are only available on ADM1278.
temp1_max Maximum chip temperature.
temp1_max_alarm Temperature alarm.
temp1_crit Critical chip temperature.
@@ -112,4 +115,5 @@ temp1_crit_alarm Critical temperature high alarm.
temp1_highest Highest observed temperature.
temp1_reset_history Write any value to reset history.
- Temperature attributes are supported on ADM1278.
+ Temperature attributes are supported on ADM1272 and
+ ADM1278.
diff --git a/Documentation/hwmon/lm92 b/Documentation/hwmon/lm92
index 22f68ad032cf..cfa99a353b8c 100644
--- a/Documentation/hwmon/lm92
+++ b/Documentation/hwmon/lm92
@@ -11,10 +11,8 @@ Supported chips:
Addresses scanned: none, force parameter needed
Datasheet: http://www.national.com/pf/LM/LM76.html
* Maxim MAX6633/MAX6634/MAX6635
- Prefix: 'lm92'
- Addresses scanned: I2C 0x48 - 0x4b
- MAX6633 with address in 0x40 - 0x47, 0x4c - 0x4f needs force parameter
- and MAX6634 with address in 0x4c - 0x4f needs force parameter
+ Prefix: 'max6635'
+ Addresses scanned: none, force parameter needed
Datasheet: http://www.maxim-ic.com/quick_view2.cfm/qv_pk/3074
Authors:
diff --git a/Documentation/hwmon/nct6775 b/Documentation/hwmon/nct6775
index 76add4c9cd68..bd59834d310f 100644
--- a/Documentation/hwmon/nct6775
+++ b/Documentation/hwmon/nct6775
@@ -36,6 +36,14 @@ Supported chips:
Prefix: 'nct6793'
Addresses scanned: ISA address retrieved from Super I/O registers
Datasheet: Available from Nuvoton upon request
+ * Nuvoton NCT6795D
+ Prefix: 'nct6795'
+ Addresses scanned: ISA address retrieved from Super I/O registers
+ Datasheet: Available from Nuvoton upon request
+ * Nuvoton NCT6796D
+ Prefix: 'nct6796'
+ Addresses scanned: ISA address retrieved from Super I/O registers
+ Datasheet: Available from Nuvoton upon request
Authors:
Guenter Roeck <linux@roeck-us.net>
@@ -88,10 +96,10 @@ The mode works for fan1-fan5.
sysfs attributes
----------------
-pwm[1-5] - this file stores PWM duty cycle or DC value (fan speed) in range:
+pwm[1-7] - this file stores PWM duty cycle or DC value (fan speed) in range:
0 (lowest speed) to 255 (full)
-pwm[1-5]_enable - this file controls mode of fan/temperature control:
+pwm[1-7]_enable - this file controls mode of fan/temperature control:
* 0 Fan control disabled (fans set to maximum speed)
* 1 Manual mode, write to pwm[0-5] any value 0-255
* 2 "Thermal Cruise" mode
@@ -99,16 +107,16 @@ pwm[1-5]_enable - this file controls mode of fan/temperature control:
* 4 "Smart Fan III" mode (NCT6775F only)
* 5 "Smart Fan IV" mode
-pwm[1-5]_mode - controls if output is PWM or DC level
+pwm[1-7]_mode - controls if output is PWM or DC level
* 0 DC output
* 1 PWM output
Common fan control attributes
-----------------------------
-pwm[1-5]_temp_sel Temperature source. Value is temperature sensor index.
+pwm[1-7]_temp_sel Temperature source. Value is temperature sensor index.
For example, select '1' for temp1_input.
-pwm[1-5]_weight_temp_sel
+pwm[1-7]_weight_temp_sel
Secondary temperature source. Value is temperature
sensor index. For example, select '1' for temp1_input.
Set to 0 to disable secondary temperature control.
@@ -116,16 +124,16 @@ pwm[1-5]_weight_temp_sel
If secondary temperature functionality is enabled, it is controlled with the
following attributes.
-pwm[1-5]_weight_duty_step
+pwm[1-7]_weight_duty_step
Duty step size.
-pwm[1-5]_weight_temp_step
+pwm[1-7]_weight_temp_step
Temperature step size. With each step over
temp_step_base, the value of weight_duty_step is added
to the current pwm value.
-pwm[1-5]_weight_temp_step_base
+pwm[1-7]_weight_temp_step_base
Temperature at which secondary temperature control kicks
in.
-pwm[1-5]_weight_temp_step_tol
+pwm[1-7]_weight_temp_step_tol
Temperature step tolerance.
Thermal Cruise mode (2)
@@ -133,9 +141,9 @@ Thermal Cruise mode (2)
If the temperature is in the range defined by:
-pwm[1-5]_target_temp Target temperature, unit millidegree Celsius
+pwm[1-7]_target_temp Target temperature, unit millidegree Celsius
(range 0 - 127000)
-pwm[1-5]_temp_tolerance
+pwm[1-7]_temp_tolerance
Target temperature tolerance, unit millidegree Celsius
there are no changes to fan speed. Once the temperature leaves the interval, fan
@@ -143,14 +151,14 @@ speed increases (if temperature is higher that desired) or decreases (if
temperature is lower than desired), using the following limits and time
intervals.
-pwm[1-5]_start fan pwm start value (range 1 - 255), to start fan
+pwm[1-7]_start fan pwm start value (range 1 - 255), to start fan
when the temperature is above defined range.
-pwm[1-5]_floor lowest fan pwm (range 0 - 255) if temperature is below
+pwm[1-7]_floor lowest fan pwm (range 0 - 255) if temperature is below
the defined range. If set to 0, the fan is expected to
stop if the temperature is below the defined range.
-pwm[1-5]_step_up_time milliseconds before fan speed is increased
-pwm[1-5]_step_down_time milliseconds before fan speed is decreased
-pwm[1-5]_stop_time how many milliseconds must elapse to switch
+pwm[1-7]_step_up_time milliseconds before fan speed is increased
+pwm[1-7]_step_down_time milliseconds before fan speed is decreased
+pwm[1-7]_stop_time how many milliseconds must elapse to switch
corresponding fan off (when the temperature was below
defined range).
@@ -159,8 +167,8 @@ Speed Cruise mode (3)
This modes tries to keep the fan speed constant.
-fan[1-5]_target Target fan speed
-fan[1-5]_tolerance
+fan[1-7]_target Target fan speed
+fan[1-7]_tolerance
Target speed tolerance
@@ -177,19 +185,19 @@ points should be set to higher temperatures and higher pwm values to achieve
higher fan speeds with increasing temperature. The last data point reflects
critical temperature mode, in which the fans should run at full speed.
-pwm[1-5]_auto_point[1-7]_pwm
+pwm[1-7]_auto_point[1-7]_pwm
pwm value to be set if temperature reaches matching
temperature range.
-pwm[1-5]_auto_point[1-7]_temp
+pwm[1-7]_auto_point[1-7]_temp
Temperature over which the matching pwm is enabled.
-pwm[1-5]_temp_tolerance
+pwm[1-7]_temp_tolerance
Temperature tolerance, unit millidegree Celsius
-pwm[1-5]_crit_temp_tolerance
+pwm[1-7]_crit_temp_tolerance
Temperature tolerance for critical temperature,
unit millidegree Celsius
-pwm[1-5]_step_up_time milliseconds before fan speed is increased
-pwm[1-5]_step_down_time milliseconds before fan speed is decreased
+pwm[1-7]_step_up_time milliseconds before fan speed is increased
+pwm[1-7]_step_down_time milliseconds before fan speed is decreased
Usage Notes
-----------
diff --git a/Documentation/hwmon/sht21 b/Documentation/hwmon/sht21
index 47f4765db256..8b3cdda541c1 100644
--- a/Documentation/hwmon/sht21
+++ b/Documentation/hwmon/sht21
@@ -6,13 +6,13 @@ Supported chips:
Prefix: 'sht21'
Addresses scanned: none
Datasheet: Publicly available at the Sensirion website
- http://www.sensirion.com/en/pdf/product_information/Datasheet-humidity-sensor-SHT21.pdf
+ http://www.sensirion.com/file/datasheet_sht21
* Sensirion SHT25
- Prefix: 'sht21'
+ Prefix: 'sht25'
Addresses scanned: none
Datasheet: Publicly available at the Sensirion website
- http://www.sensirion.com/en/pdf/product_information/Datasheet-humidity-sensor-SHT25.pdf
+ http://www.sensirion.com/file/datasheet_sht25
Author:
Urs Fleisch <urs.fleisch@sensirion.com>
diff --git a/Documentation/hwmon/sht3x b/Documentation/hwmon/sht3x
index b0d88184f48e..d9daa6ab1e8e 100644
--- a/Documentation/hwmon/sht3x
+++ b/Documentation/hwmon/sht3x
@@ -5,7 +5,7 @@ Supported chips:
* Sensirion SHT3x-DIS
Prefix: 'sht3x'
Addresses scanned: none
- Datasheet: http://www.sensirion.com/fileadmin/user_upload/customers/sensirion/Dokumente/Humidity/Sensirion_Humidity_Datasheet_SHT3x_DIS.pdf
+ Datasheet: https://www.sensirion.com/file/datasheet_sht3x_digital
Author:
David Frey <david.frey@sensirion.com>
diff --git a/Documentation/media/kapi/v4l2-dev.rst b/Documentation/media/kapi/v4l2-dev.rst
index 7bb0505b60f1..eb03ccc41c41 100644
--- a/Documentation/media/kapi/v4l2-dev.rst
+++ b/Documentation/media/kapi/v4l2-dev.rst
@@ -31,7 +31,7 @@ of the video device exits.
The default :c:func:`video_device_release` callback currently
just calls ``kfree`` to free the allocated memory.
-There is also a ::c:func:`video_device_release_empty` function that does
+There is also a :c:func:`video_device_release_empty` function that does
nothing (is empty) and should be used if the struct is embedded and there
is nothing to do when it is released.
diff --git a/Documentation/media/uapi/mediactl/media-ioc-enum-entities.rst b/Documentation/media/uapi/mediactl/media-ioc-enum-entities.rst
index 45e76e5bc1ea..582fda488810 100644
--- a/Documentation/media/uapi/mediactl/media-ioc-enum-entities.rst
+++ b/Documentation/media/uapi/mediactl/media-ioc-enum-entities.rst
@@ -89,7 +89,7 @@ id's until they get an error.
-
-
- - Entity type, see :ref:`media-entity-type` for details.
+ - Entity type, see :ref:`media-entity-functions` for details.
- .. row 4
diff --git a/Documentation/media/uapi/mediactl/media-ioc-g-topology.rst b/Documentation/media/uapi/mediactl/media-ioc-g-topology.rst
index c8f9ea37db2d..c4055ddf070a 100644
--- a/Documentation/media/uapi/mediactl/media-ioc-g-topology.rst
+++ b/Documentation/media/uapi/mediactl/media-ioc-g-topology.rst
@@ -205,13 +205,13 @@ desired arrays with the media graph elements.
- ``function``
- - Entity main function, see :ref:`media-entity-type` for details.
+ - Entity main function, see :ref:`media-entity-functions` for details.
- .. row 4
- __u32
- - ``reserved``\ [12]
+ - ``reserved``\ [6]
- Reserved for future extensions. Drivers and applications must set
this array to zero.
@@ -334,7 +334,7 @@ desired arrays with the media graph elements.
- __u32
- - ``reserved``\ [9]
+ - ``reserved``\ [5]
- Reserved for future extensions. Drivers and applications must set
this array to zero.
@@ -390,7 +390,7 @@ desired arrays with the media graph elements.
- __u32
- - ``reserved``\ [5]
+ - ``reserved``\ [6]
- Reserved for future extensions. Drivers and applications must set
this array to zero.
diff --git a/Documentation/media/uapi/mediactl/media-types.rst b/Documentation/media/uapi/mediactl/media-types.rst
index f92f10b7ffbd..2dda14bd89b7 100644
--- a/Documentation/media/uapi/mediactl/media-types.rst
+++ b/Documentation/media/uapi/mediactl/media-types.rst
@@ -7,11 +7,11 @@ Types and flags used to represent the media graph elements
.. tabularcolumns:: |p{8.2cm}|p{10.3cm}|
-.. _media-entity-type:
+.. _media-entity-functions:
.. cssclass:: longtable
-.. flat-table:: Media entity types
+.. flat-table:: Media entity functions
:header-rows: 0
:stub-columns: 0
diff --git a/Documentation/media/uapi/v4l/extended-controls.rst b/Documentation/media/uapi/v4l/extended-controls.rst
index d5f3eb6e674a..03931f9b1285 100644
--- a/Documentation/media/uapi/v4l/extended-controls.rst
+++ b/Documentation/media/uapi/v4l/extended-controls.rst
@@ -3565,7 +3565,7 @@ enum v4l2_dv_it_content_type -
HDMI carries 5V on one of the pins). This is often used to power an
eeprom which contains EDID information, such that the source can
read the EDID even if the sink is in standby/power off. Each bit
- corresponds to an input pad on the transmitter. If an input pad
+ corresponds to an input pad on the receiver. If an input pad
cannot detect whether power is present, then the bit for that pad
will be 0. This read-only control is applicable to DVI-D, HDMI and
DisplayPort connectors.
diff --git a/Documentation/media/uapi/v4l/pixfmt-v4l2-mplane.rst b/Documentation/media/uapi/v4l/pixfmt-v4l2-mplane.rst
index 337e8188caf1..ef52f637d8e9 100644
--- a/Documentation/media/uapi/v4l/pixfmt-v4l2-mplane.rst
+++ b/Documentation/media/uapi/v4l/pixfmt-v4l2-mplane.rst
@@ -55,12 +55,14 @@ describing all planes of that format.
- ``pixelformat``
- The pixel format. Both single- and multi-planar four character
codes can be used.
- * - enum :c:type:`v4l2_field`
+ * - __u32
- ``field``
- - See struct :c:type:`v4l2_pix_format`.
- * - enum :c:type:`v4l2_colorspace`
+ - Field order, from enum :c:type:`v4l2_field`.
+ See struct :c:type:`v4l2_pix_format`.
+ * - __u32
- ``colorspace``
- - See struct :c:type:`v4l2_pix_format`.
+ - Colorspace encoding, from enum :c:type:`v4l2_colorspace`.
+ See struct :c:type:`v4l2_pix_format`.
* - struct :c:type:`v4l2_plane_pix_format`
- ``plane_fmt[VIDEO_MAX_PLANES]``
- An array of structures describing format of each plane this pixel
@@ -73,24 +75,34 @@ describing all planes of that format.
* - __u8
- ``flags``
- Flags set by the application or driver, see :ref:`format-flags`.
- * - enum :c:type:`v4l2_ycbcr_encoding`
+ * - union {
+ - (anonymous)
+ -
+ * - __u8
- ``ycbcr_enc``
- - This information supplements the ``colorspace`` and must be set by
+ - Y'CbCr encoding, from enum :c:type:`v4l2_ycbcr_encoding`.
+ This information supplements the ``colorspace`` and must be set by
the driver for capture streams and by the application for output
streams, see :ref:`colorspaces`.
- * - enum :c:type:`v4l2_hsv_encoding`
+ * - __u8
- ``hsv_enc``
- - This information supplements the ``colorspace`` and must be set by
+ - HSV encoding, from enum :c:type:`v4l2_hsv_encoding`.
+ This information supplements the ``colorspace`` and must be set by
the driver for capture streams and by the application for output
streams, see :ref:`colorspaces`.
- * - enum :c:type:`v4l2_quantization`
+ * - }
+ -
+ -
+ * - __u8
- ``quantization``
- - This information supplements the ``colorspace`` and must be set by
+ - Quantization range, from enum :c:type:`v4l2_quantization`.
+ This information supplements the ``colorspace`` and must be set by
the driver for capture streams and by the application for output
streams, see :ref:`colorspaces`.
- * - enum :c:type:`v4l2_xfer_func`
+ * - __u8
- ``xfer_func``
- - This information supplements the ``colorspace`` and must be set by
+ - Transfer function, from enum :c:type:`v4l2_xfer_func`.
+ This information supplements the ``colorspace`` and must be set by
the driver for capture streams and by the application for output
streams, see :ref:`colorspaces`.
* - __u8
diff --git a/Documentation/media/uapi/v4l/pixfmt-v4l2.rst b/Documentation/media/uapi/v4l/pixfmt-v4l2.rst
index 6622938c1b41..826f2305da01 100644
--- a/Documentation/media/uapi/v4l/pixfmt-v4l2.rst
+++ b/Documentation/media/uapi/v4l/pixfmt-v4l2.rst
@@ -40,9 +40,10 @@ Single-planar format structure
RGB formats in :ref:`rgb-formats`, YUV formats in
:ref:`yuv-formats`, and reserved codes in
:ref:`reserved-formats`
- * - enum :c:type:`v4l2_field`
+ * - __u32
- ``field``
- - Video images are typically interlaced. Applications can request to
+ - Field order, from enum :c:type:`v4l2_field`.
+ Video images are typically interlaced. Applications can request to
capture or output only the top or bottom field, or both fields
interlaced or sequentially stored in one buffer or alternating in
separate buffers. Drivers return the actual field order selected.
@@ -82,9 +83,10 @@ Single-planar format structure
driver. Usually this is ``bytesperline`` times ``height``. When
the image consists of variable length compressed data this is the
maximum number of bytes required to hold an image.
- * - enum :c:type:`v4l2_colorspace`
+ * - __u32
- ``colorspace``
- - This information supplements the ``pixelformat`` and must be set
+ - Image colorspace, from enum :c:type:`v4l2_colorspace`.
+ This information supplements the ``pixelformat`` and must be set
by the driver for capture streams and by the application for
output streams, see :ref:`colorspaces`.
* - __u32
@@ -116,23 +118,33 @@ Single-planar format structure
* - __u32
- ``flags``
- Flags set by the application or driver, see :ref:`format-flags`.
- * - enum :c:type:`v4l2_ycbcr_encoding`
+ * - union {
+ - (anonymous)
+ -
+ * - __u32
- ``ycbcr_enc``
- - This information supplements the ``colorspace`` and must be set by
+ - Y'CbCr encoding, from enum :c:type:`v4l2_ycbcr_encoding`.
+ This information supplements the ``colorspace`` and must be set by
the driver for capture streams and by the application for output
streams, see :ref:`colorspaces`.
- * - enum :c:type:`v4l2_hsv_encoding`
+ * - __u32
- ``hsv_enc``
- - This information supplements the ``colorspace`` and must be set by
+ - HSV encoding, from enum :c:type:`v4l2_hsv_encoding`.
+ This information supplements the ``colorspace`` and must be set by
the driver for capture streams and by the application for output
streams, see :ref:`colorspaces`.
- * - enum :c:type:`v4l2_quantization`
+ * - }
+ -
+ -
+ * - __u32
- ``quantization``
- - This information supplements the ``colorspace`` and must be set by
+ - Quantization range, from enum :c:type:`v4l2_quantization`.
+ This information supplements the ``colorspace`` and must be set by
the driver for capture streams and by the application for output
streams, see :ref:`colorspaces`.
- * - enum :c:type:`v4l2_xfer_func`
+ * - __u32
- ``xfer_func``
- - This information supplements the ``colorspace`` and must be set by
+ - Transfer function, from enum :c:type:`v4l2_xfer_func`.
+ This information supplements the ``colorspace`` and must be set by
the driver for capture streams and by the application for output
streams, see :ref:`colorspaces`.
diff --git a/Documentation/process/4.Coding.rst b/Documentation/process/4.Coding.rst
index 26b106071364..eb4b185d168c 100644
--- a/Documentation/process/4.Coding.rst
+++ b/Documentation/process/4.Coding.rst
@@ -58,6 +58,14 @@ can never be transgressed. If there is a good reason to go against the
style (a line which becomes far less readable if split to fit within the
80-column limit, for example), just do it.
+Note that you can also use the ``clang-format`` tool to help you with
+these rules, to quickly re-format parts of your code automatically,
+and to review full files in order to spot coding style mistakes,
+typos and possible improvements. It is also handy for sorting ``#includes``,
+for aligning variables/macros, for reflowing text and other similar tasks.
+See the file :ref:`Documentation/process/clang-format.rst <clangformat>`
+for more details.
+
Abstraction layers
******************
diff --git a/Documentation/process/clang-format.rst b/Documentation/process/clang-format.rst
new file mode 100644
index 000000000000..6710c0707721
--- /dev/null
+++ b/Documentation/process/clang-format.rst
@@ -0,0 +1,184 @@
+.. _clangformat:
+
+clang-format
+============
+
+``clang-format`` is a tool to format C/C++/... code according to
+a set of rules and heuristics. Like most tools, it is not perfect
+nor covers every single case, but it is good enough to be helpful.
+
+``clang-format`` can be used for several purposes:
+
+ - Quickly reformat a block of code to the kernel style. Specially useful
+ when moving code around and aligning/sorting. See clangformatreformat_.
+
+ - Spot style mistakes, typos and possible improvements in files
+ you maintain, patches you review, diffs, etc. See clangformatreview_.
+
+ - Help you follow the coding style rules, specially useful for those
+ new to kernel development or working at the same time in several
+ projects with different coding styles.
+
+Its configuration file is ``.clang-format`` in the root of the kernel tree.
+The rules contained there try to approximate the most common kernel
+coding style. They also try to follow :ref:`Documentation/process/coding-style.rst <codingstyle>`
+as much as possible. Since not all the kernel follows the same style,
+it is possible that you may want to tweak the defaults for a particular
+subsystem or folder. To do so, you can override the defaults by writing
+another ``.clang-format`` file in a subfolder.
+
+The tool itself has already been included in the repositories of popular
+Linux distributions for a long time. Search for ``clang-format`` in
+your repositories. Otherwise, you can either download pre-built
+LLVM/clang binaries or build the source code from:
+
+ http://releases.llvm.org/download.html
+
+See more information about the tool at:
+
+ https://clang.llvm.org/docs/ClangFormat.html
+
+ https://clang.llvm.org/docs/ClangFormatStyleOptions.html
+
+
+.. _clangformatreview:
+
+Review files and patches for coding style
+-----------------------------------------
+
+By running the tool in its inline mode, you can review full subsystems,
+folders or individual files for code style mistakes, typos or improvements.
+
+To do so, you can run something like::
+
+ # Make sure your working directory is clean!
+ clang-format -i kernel/*.[ch]
+
+And then take a look at the git diff.
+
+Counting the lines of such a diff is also useful for improving/tweaking
+the style options in the configuration file; as well as testing new
+``clang-format`` features/versions.
+
+``clang-format`` also supports reading unified diffs, so you can review
+patches and git diffs easily. See the documentation at:
+
+ https://clang.llvm.org/docs/ClangFormat.html#script-for-patch-reformatting
+
+To avoid ``clang-format`` formatting some portion of a file, you can do::
+
+ int formatted_code;
+ // clang-format off
+ void unformatted_code ;
+ // clang-format on
+ void formatted_code_again;
+
+While it might be tempting to use this to keep a file always in sync with
+``clang-format``, specially if you are writing new files or if you are
+a maintainer, please note that people might be running different
+``clang-format`` versions or not have it available at all. Therefore,
+you should probably refrain yourself from using this in kernel sources;
+at least until we see if ``clang-format`` becomes commonplace.
+
+
+.. _clangformatreformat:
+
+Reformatting blocks of code
+---------------------------
+
+By using an integration with your text editor, you can reformat arbitrary
+blocks (selections) of code with a single keystroke. This is specially
+useful when moving code around, for complex code that is deeply intended,
+for multi-line macros (and aligning their backslashes), etc.
+
+Remember that you can always tweak the changes afterwards in those cases
+where the tool did not do an optimal job. But as a first approximation,
+it can be very useful.
+
+There are integrations for many popular text editors. For some of them,
+like vim, emacs, BBEdit and Visual Studio you can find support built-in.
+For instructions, read the appropiate section at:
+
+ https://clang.llvm.org/docs/ClangFormat.html
+
+For Atom, Eclipse, Sublime Text, Visual Studio Code, XCode and other
+editors and IDEs you should be able to find ready-to-use plugins.
+
+For this use case, consider using a secondary ``.clang-format``
+so that you can tweak a few options. See clangformatextra_.
+
+
+.. _clangformatmissing:
+
+Missing support
+---------------
+
+``clang-format`` is missing support for some things that are common
+in kernel code. They are easy to remember, so if you use the tool
+regularly, you will quickly learn to avoid/ignore those.
+
+In particular, some very common ones you will notice are:
+
+ - Aligned blocks of one-line ``#defines``, e.g.::
+
+ #define TRACING_MAP_BITS_DEFAULT 11
+ #define TRACING_MAP_BITS_MAX 17
+ #define TRACING_MAP_BITS_MIN 7
+
+ vs.::
+
+ #define TRACING_MAP_BITS_DEFAULT 11
+ #define TRACING_MAP_BITS_MAX 17
+ #define TRACING_MAP_BITS_MIN 7
+
+ - Aligned designated initializers, e.g.::
+
+ static const struct file_operations uprobe_events_ops = {
+ .owner = THIS_MODULE,
+ .open = probes_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+ .write = probes_write,
+ };
+
+ vs.::
+
+ static const struct file_operations uprobe_events_ops = {
+ .owner = THIS_MODULE,
+ .open = probes_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+ .write = probes_write,
+ };
+
+
+.. _clangformatextra:
+
+Extra features/options
+----------------------
+
+Some features/style options are not enabled by default in the configuration
+file in order to minimize the differences between the output and the current
+code. In other words, to make the difference as small as possible,
+which makes reviewing full-file style, as well diffs and patches as easy
+as possible.
+
+In other cases (e.g. particular subsystems/folders/files), the kernel style
+might be different and enabling some of these options may approximate
+better the style there.
+
+For instance:
+
+ - Aligning assignments (``AlignConsecutiveAssignments``).
+
+ - Aligning declarations (``AlignConsecutiveDeclarations``).
+
+ - Reflowing text in comments (``ReflowComments``).
+
+ - Sorting ``#includes`` (``SortIncludes``).
+
+They are typically useful for block re-formatting, rather than full-file.
+You might want to create another ``.clang-format`` file and use that one
+from your editor/IDE instead.
diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
index d98deb62c400..4e7c0a1c427a 100644
--- a/Documentation/process/coding-style.rst
+++ b/Documentation/process/coding-style.rst
@@ -631,6 +631,14 @@ options ``-kr -i8`` (stands for ``K&R, 8 character indents``), or use
re-formatting you may want to take a look at the man page. But
remember: ``indent`` is not a fix for bad programming.
+Note that you can also use the ``clang-format`` tool to help you with
+these rules, to quickly re-format parts of your code automatically,
+and to review full files in order to spot coding style mistakes,
+typos and possible improvements. It is also handy for sorting ``#includes``,
+for aligning variables/macros, for reflowing text and other similar tasks.
+See the file :ref:`Documentation/process/clang-format.rst <clangformat>`
+for more details.
+
10) Kconfig configuration files
-------------------------------
diff --git a/Documentation/s390/vfio-ccw.txt b/Documentation/s390/vfio-ccw.txt
index 90b3dfead81b..2be11ad864ff 100644
--- a/Documentation/s390/vfio-ccw.txt
+++ b/Documentation/s390/vfio-ccw.txt
@@ -28,7 +28,7 @@ every detail. More information/reference could be found here:
https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
s390 Principles of Operation manual (IBM Form. No. SA22-7832)
-- The existing Qemu code which implements a simple emulated channel
+- The existing QEMU code which implements a simple emulated channel
subsystem could also be a good reference. It makes it easier to follow
the flow.
qemu/hw/s390x/css.c
@@ -39,22 +39,22 @@ For vfio mediated device framework:
Motivation of vfio-ccw
----------------------
-Currently, a guest virtualized via qemu/kvm on s390 only sees
+Typically, a guest virtualized via QEMU/KVM on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport. This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.
However this is not enough. On s390 for the majority of devices, which
use the standard Channel I/O based mechanism, we also need to provide
-the functionality of passing through them to a Qemu virtual machine.
+the functionality of passing through them to a QEMU virtual machine.
This includes devices that don't have a virtio counterpart (e.g. tape
drives) or that have specific characteristics which guests want to
exploit.
For passing a device to a guest, we want to use the same interface as
-everybody else, namely vfio. Thus, we would like to introduce vfio
-support for channel devices. And we would like to name this new vfio
-device "vfio-ccw".
+everybody else, namely vfio. We implement this vfio support for channel
+devices via the vfio mediated device framework and the subchannel device
+driver "vfio_ccw".
Access patterns of CCW devices
------------------------------
@@ -99,7 +99,7 @@ As mentioned above, we realize vfio-ccw with a mdev implementation.
Channel I/O does not have IOMMU hardware support, so the physical
vfio-ccw device does not have an IOMMU level translation or isolation.
-Sub-channel I/O instructions are all privileged instructions, When
+Subchannel I/O instructions are all privileged instructions. When
handling the I/O instruction interception, vfio-ccw has the software
policing and translation how the channel program is programmed before
it gets sent to hardware.
@@ -121,7 +121,7 @@ devices:
- The vfio_mdev driver for the mediated vfio ccw device.
This is provided by the mdev framework. It is a vfio device driver for
the mdev that created by vfio_ccw.
- It realize a group of vfio device driver callbacks, adds itself to a
+ It realizes a group of vfio device driver callbacks, adds itself to a
vfio group, and registers itself to the mdev framework as a mdev
driver.
It uses a vfio iommu backend that uses the existing map and unmap
@@ -178,7 +178,7 @@ vfio-ccw I/O region
An I/O region is used to accept channel program request from user
space and store I/O interrupt result for user space to retrieve. The
-defination of the region is:
+definition of the region is:
struct ccw_io_region {
#define ORB_AREA_SIZE 12
@@ -198,30 +198,23 @@ irb_area stores the I/O result.
ret_code stores a return code for each access of the region.
-vfio-ccw patches overview
--------------------------
+vfio-ccw operation details
+--------------------------
-For now, our patches are rebased on the latest mdev implementation.
-vfio-ccw follows what vfio-pci did on the s390 paltform and uses
-vfio-iommu-type1 as the vfio iommu backend. It's a good start to launch
-the code review for vfio-ccw. Note that the implementation is far from
-complete yet; but we'd like to get feedback for the general
-architecture.
+vfio-ccw follows what vfio-pci did on the s390 platform and uses
+vfio-iommu-type1 as the vfio iommu backend.
* CCW translation APIs
-- Description:
- These introduce a group of APIs (start with 'cp_') to do CCW
- translation. The CCWs passed in by a user space program are
- organized with their guest physical memory addresses. These APIs
- will copy the CCWs into the kernel space, and assemble a runnable
- kernel channel program by updating the guest physical addresses with
- their corresponding host physical addresses.
-- Patches:
- vfio: ccw: introduce channel program interfaces
+ A group of APIs (start with 'cp_') to do CCW translation. The CCWs
+ passed in by a user space program are organized with their guest
+ physical memory addresses. These APIs will copy the CCWs into kernel
+ space, and assemble a runnable kernel channel program by updating the
+ guest physical addresses with their corresponding host physical addresses.
+ Note that we have to use IDALs even for direct-access CCWs, as the
+ referenced memory can be located anywhere, including above 2G.
* vfio_ccw device driver
-- Description:
- The following patches utilizes the CCW translation APIs and introduce
+ This driver utilizes the CCW translation APIs and introduces
vfio_ccw, which is the driver for the I/O subchannel devices you want
to pass through.
vfio_ccw implements the following vfio ioctls:
@@ -236,20 +229,14 @@ architecture.
This also provides the SET_IRQ ioctl to setup an event notifier to
notify the user space program the I/O completion in an asynchronous
way.
-- Patches:
- vfio: ccw: basic implementation for vfio_ccw driver
- vfio: ccw: introduce ccw_io_region
- vfio: ccw: realize VFIO_DEVICE_GET_REGION_INFO ioctl
- vfio: ccw: realize VFIO_DEVICE_RESET ioctl
- vfio: ccw: realize VFIO_DEVICE_G(S)ET_IRQ_INFO ioctls
-
-The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
+
+The use of vfio-ccw is not limited to QEMU, while QEMU is definitely a
good example to get understand how these patches work. Here is a little
-bit more detail how an I/O request triggered by the Qemu guest will be
+bit more detail how an I/O request triggered by the QEMU guest will be
handled (without error handling).
Explanation:
-Q1-Q7: Qemu side process.
+Q1-Q7: QEMU side process.
K1-K5: Kernel side process.
Q1. Get I/O region info during initialization.
@@ -263,7 +250,7 @@ Q4. Write the guest channel program and ORB to the I/O region.
K2. Translate the guest channel program to a host kernel space
channel program, which becomes runnable for a real device.
K3. With the necessary information contained in the orb passed in
- by Qemu, issue the ccwchain to the device.
+ by QEMU, issue the ccwchain to the device.
K4. Return the ssch CC code.
Q5. Return the CC code to the guest.
@@ -271,7 +258,7 @@ Q5. Return the CC code to the guest.
K5. Interrupt handler gets the I/O result and write the result to
the I/O region.
- K6. Signal Qemu to retrieve the result.
+ K6. Signal QEMU to retrieve the result.
Q6. Get the signal and event handler reads out the result from the I/O
region.
Q7. Update the irb for the guest.
@@ -289,10 +276,20 @@ More information for DASD and ECKD could be found here:
https://en.wikipedia.org/wiki/Direct-access_storage_device
https://en.wikipedia.org/wiki/Count_key_data
-Together with the corresponding work in Qemu, we can bring the passed
+Together with the corresponding work in QEMU, we can bring the passed
through DASD/ECKD device online in a guest now and use it as a block
device.
+While the current code allows the guest to start channel programs via
+START SUBCHANNEL, support for HALT SUBCHANNEL or CLEAR SUBCHANNEL is
+not yet implemented.
+
+vfio-ccw supports classic (command mode) channel I/O only. Transport
+mode (HPF) is not supported.
+
+QDIO subchannels are currently not supported. Classic devices other than
+DASD/ECKD might work, but have not been tested.
+
Reference
---------
1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
diff --git a/Documentation/security/LSM-sctp.rst b/Documentation/security/LSM-sctp.rst
new file mode 100644
index 000000000000..6e5a3925a860
--- /dev/null
+++ b/Documentation/security/LSM-sctp.rst
@@ -0,0 +1,175 @@
+SCTP LSM Support
+================
+
+For security module support, three SCTP specific hooks have been implemented::
+
+ security_sctp_assoc_request()
+ security_sctp_bind_connect()
+ security_sctp_sk_clone()
+
+Also the following security hook has been utilised::
+
+ security_inet_conn_established()
+
+The usage of these hooks are described below with the SELinux implementation
+described in ``Documentation/security/SELinux-sctp.rst``
+
+
+security_sctp_assoc_request()
+-----------------------------
+Passes the ``@ep`` and ``@chunk->skb`` of the association INIT packet to the
+security module. Returns 0 on success, error on failure.
+::
+
+ @ep - pointer to sctp endpoint structure.
+ @skb - pointer to skbuff of association packet.
+
+
+security_sctp_bind_connect()
+-----------------------------
+Passes one or more ipv4/ipv6 addresses to the security module for validation
+based on the ``@optname`` that will result in either a bind or connect
+service as shown in the permission check tables below.
+Returns 0 on success, error on failure.
+::
+
+ @sk - Pointer to sock structure.
+ @optname - Name of the option to validate.
+ @address - One or more ipv4 / ipv6 addresses.
+ @addrlen - The total length of address(s). This is calculated on each
+ ipv4 or ipv6 address using sizeof(struct sockaddr_in) or
+ sizeof(struct sockaddr_in6).
+
+ ------------------------------------------------------------------
+ | BIND Type Checks |
+ | @optname | @address contains |
+ |----------------------------|-----------------------------------|
+ | SCTP_SOCKOPT_BINDX_ADD | One or more ipv4 / ipv6 addresses |
+ | SCTP_PRIMARY_ADDR | Single ipv4 or ipv6 address |
+ | SCTP_SET_PEER_PRIMARY_ADDR | Single ipv4 or ipv6 address |
+ ------------------------------------------------------------------
+
+ ------------------------------------------------------------------
+ | CONNECT Type Checks |
+ | @optname | @address contains |
+ |----------------------------|-----------------------------------|
+ | SCTP_SOCKOPT_CONNECTX | One or more ipv4 / ipv6 addresses |
+ | SCTP_PARAM_ADD_IP | One or more ipv4 / ipv6 addresses |
+ | SCTP_SENDMSG_CONNECT | Single ipv4 or ipv6 address |
+ | SCTP_PARAM_SET_PRIMARY | Single ipv4 or ipv6 address |
+ ------------------------------------------------------------------
+
+A summary of the ``@optname`` entries is as follows::
+
+ SCTP_SOCKOPT_BINDX_ADD - Allows additional bind addresses to be
+ associated after (optionally) calling
+ bind(3).
+ sctp_bindx(3) adds a set of bind
+ addresses on a socket.
+
+ SCTP_SOCKOPT_CONNECTX - Allows the allocation of multiple
+ addresses for reaching a peer
+ (multi-homed).
+ sctp_connectx(3) initiates a connection
+ on an SCTP socket using multiple
+ destination addresses.
+
+ SCTP_SENDMSG_CONNECT - Initiate a connection that is generated by a
+ sendmsg(2) or sctp_sendmsg(3) on a new asociation.
+
+ SCTP_PRIMARY_ADDR - Set local primary address.
+
+ SCTP_SET_PEER_PRIMARY_ADDR - Request peer sets address as
+ association primary.
+
+ SCTP_PARAM_ADD_IP - These are used when Dynamic Address
+ SCTP_PARAM_SET_PRIMARY - Reconfiguration is enabled as explained below.
+
+
+To support Dynamic Address Reconfiguration the following parameters must be
+enabled on both endpoints (or use the appropriate **setsockopt**\(2))::
+
+ /proc/sys/net/sctp/addip_enable
+ /proc/sys/net/sctp/addip_noauth_enable
+
+then the following *_PARAM_*'s are sent to the peer in an
+ASCONF chunk when the corresponding ``@optname``'s are present::
+
+ @optname ASCONF Parameter
+ ---------- ------------------
+ SCTP_SOCKOPT_BINDX_ADD -> SCTP_PARAM_ADD_IP
+ SCTP_SET_PEER_PRIMARY_ADDR -> SCTP_PARAM_SET_PRIMARY
+
+
+security_sctp_sk_clone()
+-------------------------
+Called whenever a new socket is created by **accept**\(2)
+(i.e. a TCP style socket) or when a socket is 'peeled off' e.g userspace
+calls **sctp_peeloff**\(3).
+::
+
+ @ep - pointer to current sctp endpoint structure.
+ @sk - pointer to current sock structure.
+ @sk - pointer to new sock structure.
+
+
+security_inet_conn_established()
+---------------------------------
+Called when a COOKIE ACK is received::
+
+ @sk - pointer to sock structure.
+ @skb - pointer to skbuff of the COOKIE ACK packet.
+
+
+Security Hooks used for Association Establishment
+=================================================
+The following diagram shows the use of ``security_sctp_bind_connect()``,
+``security_sctp_assoc_request()``, ``security_inet_conn_established()`` when
+establishing an association.
+::
+
+ SCTP endpoint "A" SCTP endpoint "Z"
+ ================= =================
+ sctp_sf_do_prm_asoc()
+ Association setup can be initiated
+ by a connect(2), sctp_connectx(3),
+ sendmsg(2) or sctp_sendmsg(3).
+ These will result in a call to
+ security_sctp_bind_connect() to
+ initiate an association to
+ SCTP peer endpoint "Z".
+ INIT --------------------------------------------->
+ sctp_sf_do_5_1B_init()
+ Respond to an INIT chunk.
+ SCTP peer endpoint "A" is
+ asking for an association. Call
+ security_sctp_assoc_request()
+ to set the peer label if first
+ association.
+ If not first association, check
+ whether allowed, IF so send:
+ <----------------------------------------------- INIT ACK
+ | ELSE audit event and silently
+ | discard the packet.
+ |
+ COOKIE ECHO ------------------------------------------>
+ |
+ |
+ |
+ <------------------------------------------- COOKIE ACK
+ | |
+ sctp_sf_do_5_1E_ca |
+ Call security_inet_conn_established() |
+ to set the peer label. |
+ | |
+ | If SCTP_SOCKET_TCP or peeled off
+ | socket security_sctp_sk_clone() is
+ | called to clone the new socket.
+ | |
+ ESTABLISHED ESTABLISHED
+ | |
+ ------------------------------------------------------------------
+ | Association Established |
+ ------------------------------------------------------------------
+
+
diff --git a/Documentation/security/SELinux-sctp.rst b/Documentation/security/SELinux-sctp.rst
new file mode 100644
index 000000000000..a332cb1c5334
--- /dev/null
+++ b/Documentation/security/SELinux-sctp.rst
@@ -0,0 +1,158 @@
+SCTP SELinux Support
+=====================
+
+Security Hooks
+===============
+
+``Documentation/security/LSM-sctp.rst`` describes the following SCTP security
+hooks with the SELinux specifics expanded below::
+
+ security_sctp_assoc_request()
+ security_sctp_bind_connect()
+ security_sctp_sk_clone()
+ security_inet_conn_established()
+
+
+security_sctp_assoc_request()
+-----------------------------
+Passes the ``@ep`` and ``@chunk->skb`` of the association INIT packet to the
+security module. Returns 0 on success, error on failure.
+::
+
+ @ep - pointer to sctp endpoint structure.
+ @skb - pointer to skbuff of association packet.
+
+The security module performs the following operations:
+ IF this is the first association on ``@ep->base.sk``, then set the peer
+ sid to that in ``@skb``. This will ensure there is only one peer sid
+ assigned to ``@ep->base.sk`` that may support multiple associations.
+
+ ELSE validate the ``@ep->base.sk peer_sid`` against the ``@skb peer sid``
+ to determine whether the association should be allowed or denied.
+
+ Set the sctp ``@ep sid`` to socket's sid (from ``ep->base.sk``) with
+ MLS portion taken from ``@skb peer sid``. This will be used by SCTP
+ TCP style sockets and peeled off connections as they cause a new socket
+ to be generated.
+
+ If IP security options are configured (CIPSO/CALIPSO), then the ip
+ options are set on the socket.
+
+
+security_sctp_bind_connect()
+-----------------------------
+Checks permissions required for ipv4/ipv6 addresses based on the ``@optname``
+as follows::
+
+ ------------------------------------------------------------------
+ | BIND Permission Checks |
+ | @optname | @address contains |
+ |----------------------------|-----------------------------------|
+ | SCTP_SOCKOPT_BINDX_ADD | One or more ipv4 / ipv6 addresses |
+ | SCTP_PRIMARY_ADDR | Single ipv4 or ipv6 address |
+ | SCTP_SET_PEER_PRIMARY_ADDR | Single ipv4 or ipv6 address |
+ ------------------------------------------------------------------
+
+ ------------------------------------------------------------------
+ | CONNECT Permission Checks |
+ | @optname | @address contains |
+ |----------------------------|-----------------------------------|
+ | SCTP_SOCKOPT_CONNECTX | One or more ipv4 / ipv6 addresses |
+ | SCTP_PARAM_ADD_IP | One or more ipv4 / ipv6 addresses |
+ | SCTP_SENDMSG_CONNECT | Single ipv4 or ipv6 address |
+ | SCTP_PARAM_SET_PRIMARY | Single ipv4 or ipv6 address |
+ ------------------------------------------------------------------
+
+
+``Documentation/security/LSM-sctp.rst`` gives a summary of the ``@optname``
+entries and also describes ASCONF chunk processing when Dynamic Address
+Reconfiguration is enabled.
+
+
+security_sctp_sk_clone()
+-------------------------
+Called whenever a new socket is created by **accept**\(2) (i.e. a TCP style
+socket) or when a socket is 'peeled off' e.g userspace calls
+**sctp_peeloff**\(3). ``security_sctp_sk_clone()`` will set the new
+sockets sid and peer sid to that contained in the ``@ep sid`` and
+``@ep peer sid`` respectively.
+::
+
+ @ep - pointer to current sctp endpoint structure.
+ @sk - pointer to current sock structure.
+ @sk - pointer to new sock structure.
+
+
+security_inet_conn_established()
+---------------------------------
+Called when a COOKIE ACK is received where it sets the connection's peer sid
+to that in ``@skb``::
+
+ @sk - pointer to sock structure.
+ @skb - pointer to skbuff of the COOKIE ACK packet.
+
+
+Policy Statements
+==================
+The following class and permissions to support SCTP are available within the
+kernel::
+
+ class sctp_socket inherits socket { node_bind }
+
+whenever the following policy capability is enabled::
+
+ policycap extended_socket_class;
+
+SELinux SCTP support adds the ``name_connect`` permission for connecting
+to a specific port type and the ``association`` permission that is explained
+in the section below.
+
+If userspace tools have been updated, SCTP will support the ``portcon``
+statement as shown in the following example::
+
+ portcon sctp 1024-1036 system_u:object_r:sctp_ports_t:s0
+
+
+SCTP Peer Labeling
+===================
+An SCTP socket will only have one peer label assigned to it. This will be
+assigned during the establishment of the first association. Any further
+associations on this socket will have their packet peer label compared to
+the sockets peer label, and only if they are different will the
+``association`` permission be validated. This is validated by checking the
+socket peer sid against the received packets peer sid to determine whether
+the association should be allowed or denied.
+
+NOTES:
+ 1) If peer labeling is not enabled, then the peer context will always be
+ ``SECINITSID_UNLABELED`` (``unlabeled_t`` in Reference Policy).
+
+ 2) As SCTP can support more than one transport address per endpoint
+ (multi-homing) on a single socket, it is possible to configure policy
+ and NetLabel to provide different peer labels for each of these. As the
+ socket peer label is determined by the first associations transport
+ address, it is recommended that all peer labels are consistent.
+
+ 3) **getpeercon**\(3) may be used by userspace to retrieve the sockets peer
+ context.
+
+ 4) While not SCTP specific, be aware when using NetLabel that if a label
+ is assigned to a specific interface, and that interface 'goes down',
+ then the NetLabel service will remove the entry. Therefore ensure that
+ the network startup scripts call **netlabelctl**\(8) to set the required
+ label (see **netlabel-config**\(8) helper script for details).
+
+ 5) The NetLabel SCTP peer labeling rules apply as discussed in the following
+ set of posts tagged "netlabel" at: http://www.paul-moore.com/blog/t.
+
+ 6) CIPSO is only supported for IPv4 addressing: ``socket(AF_INET, ...)``
+ CALIPSO is only supported for IPv6 addressing: ``socket(AF_INET6, ...)``
+
+ Note the following when testing CIPSO/CALIPSO:
+ a) CIPSO will send an ICMP packet if an SCTP packet cannot be
+ delivered because of an invalid label.
+ b) CALIPSO does not send an ICMP packet, just silently discards it.
+
+ 7) IPSEC is not supported as RFC 3554 - sctp/ipsec support has not been
+ implemented in userspace (**racoon**\(8) or **ipsec_pluto**\(8)),
+ although the kernel supports SCTP/IPSEC.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 412314eebda6..eded671d55eb 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -964,32 +964,34 @@ detect a hard lockup condition.
tainted:
-Non-zero if the kernel has been tainted. Numeric values, which
-can be ORed together:
-
- 1 - A module with a non-GPL license has been loaded, this
- includes modules with no license.
- Set by modutils >= 2.4.9 and module-init-tools.
- 2 - A module was force loaded by insmod -f.
- Set by modutils >= 2.4.9 and module-init-tools.
- 4 - Unsafe SMP processors: SMP with CPUs not designed for SMP.
- 8 - A module was forcibly unloaded from the system by rmmod -f.
- 16 - A hardware machine check error occurred on the system.
- 32 - A bad page was discovered on the system.
- 64 - The user has asked that the system be marked "tainted". This
- could be because they are running software that directly modifies
- the hardware, or for other reasons.
- 128 - The system has died.
- 256 - The ACPI DSDT has been overridden with one supplied by the user
- instead of using the one provided by the hardware.
- 512 - A kernel warning has occurred.
-1024 - A module from drivers/staging was loaded.
-2048 - The system is working around a severe firmware bug.
-4096 - An out-of-tree module has been loaded.
-8192 - An unsigned module has been loaded in a kernel supporting module
- signature.
-16384 - A soft lockup has previously occurred on the system.
-32768 - The kernel has been live patched.
+Non-zero if the kernel has been tainted. Numeric values, which can be
+ORed together. The letters are seen in "Tainted" line of Oops reports.
+
+ 1 (P): A module with a non-GPL license has been loaded, this
+ includes modules with no license.
+ Set by modutils >= 2.4.9 and module-init-tools.
+ 2 (F): A module was force loaded by insmod -f.
+ Set by modutils >= 2.4.9 and module-init-tools.
+ 4 (S): Unsafe SMP processors: SMP with CPUs not designed for SMP.
+ 8 (R): A module was forcibly unloaded from the system by rmmod -f.
+ 16 (M): A hardware machine check error occurred on the system.
+ 32 (B): A bad page was discovered on the system.
+ 64 (U): The user has asked that the system be marked "tainted". This
+ could be because they are running software that directly modifies
+ the hardware, or for other reasons.
+ 128 (D): The system has died.
+ 256 (A): The ACPI DSDT has been overridden with one supplied by the user
+ instead of using the one provided by the hardware.
+ 512 (W): A kernel warning has occurred.
+ 1024 (C): A module from drivers/staging was loaded.
+ 2048 (I): The system is working around a severe firmware bug.
+ 4096 (O): An out-of-tree module has been loaded.
+ 8192 (E): An unsigned module has been loaded in a kernel supporting module
+ signature.
+ 16384 (L): A soft lockup has previously occurred on the system.
+ 32768 (K): The kernel has been live patched.
+ 65536 (X): Auxiliary taint, defined and used by for distros.
+131072 (T): The kernel was built with the struct randomization plugin.
==============================================================
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index ff234d229cbb..17256f2ad919 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -312,8 +312,6 @@ The lowmem_reserve_ratio is an array. You can see them by reading this file.
% cat /proc/sys/vm/lowmem_reserve_ratio
256 256 32
-
-Note: # of this elements is one fewer than number of zones. Because the highest
- zone's value is not necessary for following calculation.
But, these values are not used directly. The kernel calculates # of protection
pages for each zones from them. These are shown as array of protection pages
@@ -364,7 +362,8 @@ As above expression, they are reciprocal number of ratio.
pages of higher zones on the node.
If you would like to protect more pages, smaller values are effective.
-The minimum value is 1 (1/1 -> 100%).
+The minimum value is 1 (1/1 -> 100%). The value less than 1 completely
+disables protection of the pages.
==============================================================
diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index bdf1963ba6ba..a5ea2cb0082b 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -520,1550 +520,4 @@ The following commands are supported:
totals derived from one or more trace event format fields and/or
event counts (hitcount).
- The format of a hist trigger is as follows::
-
- hist:keys=<field1[,field2,...]>[:values=<field1[,field2,...]>]
- [:sort=<field1[,field2,...]>][:size=#entries][:pause][:continue]
- [:clear][:name=histname1] [if <filter>]
-
- When a matching event is hit, an entry is added to a hash table
- using the key(s) and value(s) named. Keys and values correspond to
- fields in the event's format description. Values must correspond to
- numeric fields - on an event hit, the value(s) will be added to a
- sum kept for that field. The special string 'hitcount' can be used
- in place of an explicit value field - this is simply a count of
- event hits. If 'values' isn't specified, an implicit 'hitcount'
- value will be automatically created and used as the only value.
- Keys can be any field, or the special string 'stacktrace', which
- will use the event's kernel stacktrace as the key. The keywords
- 'keys' or 'key' can be used to specify keys, and the keywords
- 'values', 'vals', or 'val' can be used to specify values. Compound
- keys consisting of up to two fields can be specified by the 'keys'
- keyword. Hashing a compound key produces a unique entry in the
- table for each unique combination of component keys, and can be
- useful for providing more fine-grained summaries of event data.
- Additionally, sort keys consisting of up to two fields can be
- specified by the 'sort' keyword. If more than one field is
- specified, the result will be a 'sort within a sort': the first key
- is taken to be the primary sort key and the second the secondary
- key. If a hist trigger is given a name using the 'name' parameter,
- its histogram data will be shared with other triggers of the same
- name, and trigger hits will update this common data. Only triggers
- with 'compatible' fields can be combined in this way; triggers are
- 'compatible' if the fields named in the trigger share the same
- number and type of fields and those fields also have the same names.
- Note that any two events always share the compatible 'hitcount' and
- 'stacktrace' fields and can therefore be combined using those
- fields, however pointless that may be.
-
- 'hist' triggers add a 'hist' file to each event's subdirectory.
- Reading the 'hist' file for the event will dump the hash table in
- its entirety to stdout. If there are multiple hist triggers
- attached to an event, there will be a table for each trigger in the
- output. The table displayed for a named trigger will be the same as
- any other instance having the same name. Each printed hash table
- entry is a simple list of the keys and values comprising the entry;
- keys are printed first and are delineated by curly braces, and are
- followed by the set of value fields for the entry. By default,
- numeric fields are displayed as base-10 integers. This can be
- modified by appending any of the following modifiers to the field
- name:
-
- - .hex display a number as a hex value
- - .sym display an address as a symbol
- - .sym-offset display an address as a symbol and offset
- - .syscall display a syscall id as a system call name
- - .execname display a common_pid as a program name
-
- Note that in general the semantics of a given field aren't
- interpreted when applying a modifier to it, but there are some
- restrictions to be aware of in this regard:
-
- - only the 'hex' modifier can be used for values (because values
- are essentially sums, and the other modifiers don't make sense
- in that context).
- - the 'execname' modifier can only be used on a 'common_pid'. The
- reason for this is that the execname is simply the 'comm' value
- saved for the 'current' process when an event was triggered,
- which is the same as the common_pid value saved by the event
- tracing code. Trying to apply that comm value to other pid
- values wouldn't be correct, and typically events that care save
- pid-specific comm fields in the event itself.
-
- A typical usage scenario would be the following to enable a hist
- trigger, read its current contents, and then turn it off::
-
- # echo 'hist:keys=skbaddr.hex:vals=len' > \
- /sys/kernel/debug/tracing/events/net/netif_rx/trigger
-
- # cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
-
- # echo '!hist:keys=skbaddr.hex:vals=len' > \
- /sys/kernel/debug/tracing/events/net/netif_rx/trigger
-
- The trigger file itself can be read to show the details of the
- currently attached hist trigger. This information is also displayed
- at the top of the 'hist' file when read.
-
- By default, the size of the hash table is 2048 entries. The 'size'
- parameter can be used to specify more or fewer than that. The units
- are in terms of hashtable entries - if a run uses more entries than
- specified, the results will show the number of 'drops', the number
- of hits that were ignored. The size should be a power of 2 between
- 128 and 131072 (any non- power-of-2 number specified will be rounded
- up).
-
- The 'sort' parameter can be used to specify a value field to sort
- on. The default if unspecified is 'hitcount' and the default sort
- order is 'ascending'. To sort in the opposite direction, append
- .descending' to the sort key.
-
- The 'pause' parameter can be used to pause an existing hist trigger
- or to start a hist trigger but not log any events until told to do
- so. 'continue' or 'cont' can be used to start or restart a paused
- hist trigger.
-
- The 'clear' parameter will clear the contents of a running hist
- trigger and leave its current paused/active state.
-
- Note that the 'pause', 'cont', and 'clear' parameters should be
- applied using 'append' shell operator ('>>') if applied to an
- existing trigger, rather than via the '>' operator, which will cause
- the trigger to be removed through truncation.
-
-- enable_hist/disable_hist
-
- The enable_hist and disable_hist triggers can be used to have one
- event conditionally start and stop another event's already-attached
- hist trigger. Any number of enable_hist and disable_hist triggers
- can be attached to a given event, allowing that event to kick off
- and stop aggregations on a host of other events.
-
- The format is very similar to the enable/disable_event triggers::
-
- enable_hist:<system>:<event>[:count]
- disable_hist:<system>:<event>[:count]
-
- Instead of enabling or disabling the tracing of the target event
- into the trace buffer as the enable/disable_event triggers do, the
- enable/disable_hist triggers enable or disable the aggregation of
- the target event into a hash table.
-
- A typical usage scenario for the enable_hist/disable_hist triggers
- would be to first set up a paused hist trigger on some event,
- followed by an enable_hist/disable_hist pair that turns the hist
- aggregation on and off when conditions of interest are hit::
-
- # echo 'hist:keys=skbaddr.hex:vals=len:pause' > \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
-
- # echo 'enable_hist:net:netif_receive_skb if filename==/usr/bin/wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
-
- # echo 'disable_hist:net:netif_receive_skb if comm==wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
-
- The above sets up an initially paused hist trigger which is unpaused
- and starts aggregating events when a given program is executed, and
- which stops aggregating when the process exits and the hist trigger
- is paused again.
-
- The examples below provide a more concrete illustration of the
- concepts and typical usage patterns discussed above.
-
-
-6.2 'hist' trigger examples
----------------------------
-
- The first set of examples creates aggregations using the kmalloc
- event. The fields that can be used for the hist trigger are listed
- in the kmalloc event's format file::
-
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/format
- name: kmalloc
- ID: 374
- format:
- field:unsigned short common_type; offset:0; size:2; signed:0;
- field:unsigned char common_flags; offset:2; size:1; signed:0;
- field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
- field:int common_pid; offset:4; size:4; signed:1;
-
- field:unsigned long call_site; offset:8; size:8; signed:0;
- field:const void * ptr; offset:16; size:8; signed:0;
- field:size_t bytes_req; offset:24; size:8; signed:0;
- field:size_t bytes_alloc; offset:32; size:8; signed:0;
- field:gfp_t gfp_flags; offset:40; size:4; signed:0;
-
- We'll start by creating a hist trigger that generates a simple table
- that lists the total number of bytes requested for each function in
- the kernel that made one or more calls to kmalloc::
-
- # echo 'hist:key=call_site:val=bytes_req' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
-
- This tells the tracing system to create a 'hist' trigger using the
- call_site field of the kmalloc event as the key for the table, which
- just means that each unique call_site address will have an entry
- created for it in the table. The 'val=bytes_req' parameter tells
- the hist trigger that for each unique entry (call_site) in the
- table, it should keep a running total of the number of bytes
- requested by that call_site.
-
- We'll let it run for awhile and then dump the contents of the 'hist'
- file in the kmalloc event's subdirectory (for readability, a number
- of entries have been omitted)::
-
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
- # trigger info: hist:keys=call_site:vals=bytes_req:sort=hitcount:size=2048 [active]
-
- { call_site: 18446744072106379007 } hitcount: 1 bytes_req: 176
- { call_site: 18446744071579557049 } hitcount: 1 bytes_req: 1024
- { call_site: 18446744071580608289 } hitcount: 1 bytes_req: 16384
- { call_site: 18446744071581827654 } hitcount: 1 bytes_req: 24
- { call_site: 18446744071580700980 } hitcount: 1 bytes_req: 8
- { call_site: 18446744071579359876 } hitcount: 1 bytes_req: 152
- { call_site: 18446744071580795365 } hitcount: 3 bytes_req: 144
- { call_site: 18446744071581303129 } hitcount: 3 bytes_req: 144
- { call_site: 18446744071580713234 } hitcount: 4 bytes_req: 2560
- { call_site: 18446744071580933750 } hitcount: 4 bytes_req: 736
- .
- .
- .
- { call_site: 18446744072106047046 } hitcount: 69 bytes_req: 5576
- { call_site: 18446744071582116407 } hitcount: 73 bytes_req: 2336
- { call_site: 18446744072106054684 } hitcount: 136 bytes_req: 140504
- { call_site: 18446744072106224230 } hitcount: 136 bytes_req: 19584
- { call_site: 18446744072106078074 } hitcount: 153 bytes_req: 2448
- { call_site: 18446744072106062406 } hitcount: 153 bytes_req: 36720
- { call_site: 18446744071582507929 } hitcount: 153 bytes_req: 37088
- { call_site: 18446744072102520590 } hitcount: 273 bytes_req: 10920
- { call_site: 18446744071582143559 } hitcount: 358 bytes_req: 716
- { call_site: 18446744072106465852 } hitcount: 417 bytes_req: 56712
- { call_site: 18446744072102523378 } hitcount: 485 bytes_req: 27160
- { call_site: 18446744072099568646 } hitcount: 1676 bytes_req: 33520
-
- Totals:
- Hits: 4610
- Entries: 45
- Dropped: 0
-
- The output displays a line for each entry, beginning with the key
- specified in the trigger, followed by the value(s) also specified in
- the trigger. At the beginning of the output is a line that displays
- the trigger info, which can also be displayed by reading the
- 'trigger' file::
-
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
- hist:keys=call_site:vals=bytes_req:sort=hitcount:size=2048 [active]
-
- At the end of the output are a few lines that display the overall
- totals for the run. The 'Hits' field shows the total number of
- times the event trigger was hit, the 'Entries' field shows the total
- number of used entries in the hash table, and the 'Dropped' field
- shows the number of hits that were dropped because the number of
- used entries for the run exceeded the maximum number of entries
- allowed for the table (normally 0, but if not a hint that you may
- want to increase the size of the table using the 'size' parameter).
-
- Notice in the above output that there's an extra field, 'hitcount',
- which wasn't specified in the trigger. Also notice that in the
- trigger info output, there's a parameter, 'sort=hitcount', which
- wasn't specified in the trigger either. The reason for that is that
- every trigger implicitly keeps a count of the total number of hits
- attributed to a given entry, called the 'hitcount'. That hitcount
- information is explicitly displayed in the output, and in the
- absence of a user-specified sort parameter, is used as the default
- sort field.
-
- The value 'hitcount' can be used in place of an explicit value in
- the 'values' parameter if you don't really need to have any
- particular field summed and are mainly interested in hit
- frequencies.
-
- To turn the hist trigger off, simply call up the trigger in the
- command history and re-execute it with a '!' prepended::
-
- # echo '!hist:key=call_site:val=bytes_req' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
-
- Finally, notice that the call_site as displayed in the output above
- isn't really very useful. It's an address, but normally addresses
- are displayed in hex. To have a numeric field displayed as a hex
- value, simply append '.hex' to the field name in the trigger::
-
- # echo 'hist:key=call_site.hex:val=bytes_req' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
-
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
- # trigger info: hist:keys=call_site.hex:vals=bytes_req:sort=hitcount:size=2048 [active]
-
- { call_site: ffffffffa026b291 } hitcount: 1 bytes_req: 433
- { call_site: ffffffffa07186ff } hitcount: 1 bytes_req: 176
- { call_site: ffffffff811ae721 } hitcount: 1 bytes_req: 16384
- { call_site: ffffffff811c5134 } hitcount: 1 bytes_req: 8
- { call_site: ffffffffa04a9ebb } hitcount: 1 bytes_req: 511
- { call_site: ffffffff8122e0a6 } hitcount: 1 bytes_req: 12
- { call_site: ffffffff8107da84 } hitcount: 1 bytes_req: 152
- { call_site: ffffffff812d8246 } hitcount: 1 bytes_req: 24
- { call_site: ffffffff811dc1e5 } hitcount: 3 bytes_req: 144
- { call_site: ffffffffa02515e8 } hitcount: 3 bytes_req: 648
- { call_site: ffffffff81258159 } hitcount: 3 bytes_req: 144
- { call_site: ffffffff811c80f4 } hitcount: 4 bytes_req: 544
- .
- .
- .
- { call_site: ffffffffa06c7646 } hitcount: 106 bytes_req: 8024
- { call_site: ffffffffa06cb246 } hitcount: 132 bytes_req: 31680
- { call_site: ffffffffa06cef7a } hitcount: 132 bytes_req: 2112
- { call_site: ffffffff8137e399 } hitcount: 132 bytes_req: 23232
- { call_site: ffffffffa06c941c } hitcount: 185 bytes_req: 171360
- { call_site: ffffffffa06f2a66 } hitcount: 185 bytes_req: 26640
- { call_site: ffffffffa036a70e } hitcount: 265 bytes_req: 10600
- { call_site: ffffffff81325447 } hitcount: 292 bytes_req: 584
- { call_site: ffffffffa072da3c } hitcount: 446 bytes_req: 60656
- { call_site: ffffffffa036b1f2 } hitcount: 526 bytes_req: 29456
- { call_site: ffffffffa0099c06 } hitcount: 1780 bytes_req: 35600
-
- Totals:
- Hits: 4775
- Entries: 46
- Dropped: 0
-
- Even that's only marginally more useful - while hex values do look
- more like addresses, what users are typically more interested in
- when looking at text addresses are the corresponding symbols
- instead. To have an address displayed as symbolic value instead,
- simply append '.sym' or '.sym-offset' to the field name in the
- trigger::
-
- # echo 'hist:key=call_site.sym:val=bytes_req' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
-
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
- # trigger info: hist:keys=call_site.sym:vals=bytes_req:sort=hitcount:size=2048 [active]
-
- { call_site: [ffffffff810adcb9] syslog_print_all } hitcount: 1 bytes_req: 1024
- { call_site: [ffffffff8154bc62] usb_control_msg } hitcount: 1 bytes_req: 8
- { call_site: [ffffffffa00bf6fe] hidraw_send_report [hid] } hitcount: 1 bytes_req: 7
- { call_site: [ffffffff8154acbe] usb_alloc_urb } hitcount: 1 bytes_req: 192
- { call_site: [ffffffffa00bf1ca] hidraw_report_event [hid] } hitcount: 1 bytes_req: 7
- { call_site: [ffffffff811e3a25] __seq_open_private } hitcount: 1 bytes_req: 40
- { call_site: [ffffffff8109524a] alloc_fair_sched_group } hitcount: 2 bytes_req: 128
- { call_site: [ffffffff811febd5] fsnotify_alloc_group } hitcount: 2 bytes_req: 528
- { call_site: [ffffffff81440f58] __tty_buffer_request_room } hitcount: 2 bytes_req: 2624
- { call_site: [ffffffff81200ba6] inotify_new_group } hitcount: 2 bytes_req: 96
- { call_site: [ffffffffa05e19af] ieee80211_start_tx_ba_session [mac80211] } hitcount: 2 bytes_req: 464
- { call_site: [ffffffff81672406] tcp_get_metrics } hitcount: 2 bytes_req: 304
- { call_site: [ffffffff81097ec2] alloc_rt_sched_group } hitcount: 2 bytes_req: 128
- { call_site: [ffffffff81089b05] sched_create_group } hitcount: 2 bytes_req: 1424
- .
- .
- .
- { call_site: [ffffffffa04a580c] intel_crtc_page_flip [i915] } hitcount: 1185 bytes_req: 123240
- { call_site: [ffffffffa0287592] drm_mode_page_flip_ioctl [drm] } hitcount: 1185 bytes_req: 104280
- { call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state [i915] } hitcount: 1402 bytes_req: 190672
- { call_site: [ffffffff812891ca] ext4_find_extent } hitcount: 1518 bytes_req: 146208
- { call_site: [ffffffffa029070e] drm_vma_node_allow [drm] } hitcount: 1746 bytes_req: 69840
- { call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 2021 bytes_req: 792312
- { call_site: [ffffffffa02911f2] drm_modeset_lock_crtc [drm] } hitcount: 2592 bytes_req: 145152
- { call_site: [ffffffffa0489a66] intel_ring_begin [i915] } hitcount: 2629 bytes_req: 378576
- { call_site: [ffffffffa046041c] i915_gem_execbuffer2 [i915] } hitcount: 2629 bytes_req: 3783248
- { call_site: [ffffffff81325607] apparmor_file_alloc_security } hitcount: 5192 bytes_req: 10384
- { call_site: [ffffffffa00b7c06] hid_report_raw_event [hid] } hitcount: 5529 bytes_req: 110584
- { call_site: [ffffffff8131ebf7] aa_alloc_task_context } hitcount: 21943 bytes_req: 702176
- { call_site: [ffffffff8125847d] ext4_htree_store_dirent } hitcount: 55759 bytes_req: 5074265
-
- Totals:
- Hits: 109928
- Entries: 71
- Dropped: 0
-
- Because the default sort key above is 'hitcount', the above shows a
- the list of call_sites by increasing hitcount, so that at the bottom
- we see the functions that made the most kmalloc calls during the
- run. If instead we we wanted to see the top kmalloc callers in
- terms of the number of bytes requested rather than the number of
- calls, and we wanted the top caller to appear at the top, we can use
- the 'sort' parameter, along with the 'descending' modifier::
-
- # echo 'hist:key=call_site.sym:val=bytes_req:sort=bytes_req.descending' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
-
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
- # trigger info: hist:keys=call_site.sym:vals=bytes_req:sort=bytes_req.descending:size=2048 [active]
-
- { call_site: [ffffffffa046041c] i915_gem_execbuffer2 [i915] } hitcount: 2186 bytes_req: 3397464
- { call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 1790 bytes_req: 712176
- { call_site: [ffffffff8125847d] ext4_htree_store_dirent } hitcount: 8132 bytes_req: 513135
- { call_site: [ffffffff811e2a1b] seq_buf_alloc } hitcount: 106 bytes_req: 440128
- { call_site: [ffffffffa0489a66] intel_ring_begin [i915] } hitcount: 2186 bytes_req: 314784
- { call_site: [ffffffff812891ca] ext4_find_extent } hitcount: 2174 bytes_req: 208992
- { call_site: [ffffffff811ae8e1] __kmalloc } hitcount: 8 bytes_req: 131072
- { call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state [i915] } hitcount: 859 bytes_req: 116824
- { call_site: [ffffffffa02911f2] drm_modeset_lock_crtc [drm] } hitcount: 1834 bytes_req: 102704
- { call_site: [ffffffffa04a580c] intel_crtc_page_flip [i915] } hitcount: 972 bytes_req: 101088
- { call_site: [ffffffffa0287592] drm_mode_page_flip_ioctl [drm] } hitcount: 972 bytes_req: 85536
- { call_site: [ffffffffa00b7c06] hid_report_raw_event [hid] } hitcount: 3333 bytes_req: 66664
- { call_site: [ffffffff8137e559] sg_kmalloc } hitcount: 209 bytes_req: 61632
- .
- .
- .
- { call_site: [ffffffff81095225] alloc_fair_sched_group } hitcount: 2 bytes_req: 128
- { call_site: [ffffffff81097ec2] alloc_rt_sched_group } hitcount: 2 bytes_req: 128
- { call_site: [ffffffff812d8406] copy_semundo } hitcount: 2 bytes_req: 48
- { call_site: [ffffffff81200ba6] inotify_new_group } hitcount: 1 bytes_req: 48
- { call_site: [ffffffffa027121a] drm_getmagic [drm] } hitcount: 1 bytes_req: 48
- { call_site: [ffffffff811e3a25] __seq_open_private } hitcount: 1 bytes_req: 40
- { call_site: [ffffffff811c52f4] bprm_change_interp } hitcount: 2 bytes_req: 16
- { call_site: [ffffffff8154bc62] usb_control_msg } hitcount: 1 bytes_req: 8
- { call_site: [ffffffffa00bf1ca] hidraw_report_event [hid] } hitcount: 1 bytes_req: 7
- { call_site: [ffffffffa00bf6fe] hidraw_send_report [hid] } hitcount: 1 bytes_req: 7
-
- Totals:
- Hits: 32133
- Entries: 81
- Dropped: 0
-
- To display the offset and size information in addition to the symbol
- name, just use 'sym-offset' instead::
-
- # echo 'hist:key=call_site.sym-offset:val=bytes_req:sort=bytes_req.descending' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
-
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
- # trigger info: hist:keys=call_site.sym-offset:vals=bytes_req:sort=bytes_req.descending:size=2048 [active]
-
- { call_site: [ffffffffa046041c] i915_gem_execbuffer2+0x6c/0x2c0 [i915] } hitcount: 4569 bytes_req: 3163720
- { call_site: [ffffffffa0489a66] intel_ring_begin+0xc6/0x1f0 [i915] } hitcount: 4569 bytes_req: 657936
- { call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23+0x694/0x1020 [i915] } hitcount: 1519 bytes_req: 472936
- { call_site: [ffffffffa045e646] i915_gem_do_execbuffer.isra.23+0x516/0x1020 [i915] } hitcount: 3050 bytes_req: 211832
- { call_site: [ffffffff811e2a1b] seq_buf_alloc+0x1b/0x50 } hitcount: 34 bytes_req: 148384
- { call_site: [ffffffffa04a580c] intel_crtc_page_flip+0xbc/0x870 [i915] } hitcount: 1385 bytes_req: 144040
- { call_site: [ffffffff811ae8e1] __kmalloc+0x191/0x1b0 } hitcount: 8 bytes_req: 131072
- { call_site: [ffffffffa0287592] drm_mode_page_flip_ioctl+0x282/0x360 [drm] } hitcount: 1385 bytes_req: 121880
- { call_site: [ffffffffa02911f2] drm_modeset_lock_crtc+0x32/0x100 [drm] } hitcount: 1848 bytes_req: 103488
- { call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state+0x2c/0xa0 [i915] } hitcount: 461 bytes_req: 62696
- { call_site: [ffffffffa029070e] drm_vma_node_allow+0x2e/0xd0 [drm] } hitcount: 1541 bytes_req: 61640
- { call_site: [ffffffff815f8d7b] sk_prot_alloc+0xcb/0x1b0 } hitcount: 57 bytes_req: 57456
- .
- .
- .
- { call_site: [ffffffff8109524a] alloc_fair_sched_group+0x5a/0x1a0 } hitcount: 2 bytes_req: 128
- { call_site: [ffffffffa027b921] drm_vm_open_locked+0x31/0xa0 [drm] } hitcount: 3 bytes_req: 96
- { call_site: [ffffffff8122e266] proc_self_follow_link+0x76/0xb0 } hitcount: 8 bytes_req: 96
- { call_site: [ffffffff81213e80] load_elf_binary+0x240/0x1650 } hitcount: 3 bytes_req: 84
- { call_site: [ffffffff8154bc62] usb_control_msg+0x42/0x110 } hitcount: 1 bytes_req: 8
- { call_site: [ffffffffa00bf6fe] hidraw_send_report+0x7e/0x1a0 [hid] } hitcount: 1 bytes_req: 7
- { call_site: [ffffffffa00bf1ca] hidraw_report_event+0x8a/0x120 [hid] } hitcount: 1 bytes_req: 7
-
- Totals:
- Hits: 26098
- Entries: 64
- Dropped: 0
-
- We can also add multiple fields to the 'values' parameter. For
- example, we might want to see the total number of bytes allocated
- alongside bytes requested, and display the result sorted by bytes
- allocated in a descending order::
-
- # echo 'hist:keys=call_site.sym:values=bytes_req,bytes_alloc:sort=bytes_alloc.descending' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
-
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
- # trigger info: hist:keys=call_site.sym:vals=bytes_req,bytes_alloc:sort=bytes_alloc.descending:size=2048 [active]
-
- { call_site: [ffffffffa046041c] i915_gem_execbuffer2 [i915] } hitcount: 7403 bytes_req: 4084360 bytes_alloc: 5958016
- { call_site: [ffffffff811e2a1b] seq_buf_alloc } hitcount: 541 bytes_req: 2213968 bytes_alloc: 2228224
- { call_site: [ffffffffa0489a66] intel_ring_begin [i915] } hitcount: 7404 bytes_req: 1066176 bytes_alloc: 1421568
- { call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 1565 bytes_req: 557368 bytes_alloc: 1037760
- { call_site: [ffffffff8125847d] ext4_htree_store_dirent } hitcount: 9557 bytes_req: 595778 bytes_alloc: 695744
- { call_site: [ffffffffa045e646] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 5839 bytes_req: 430680 bytes_alloc: 470400
- { call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state [i915] } hitcount: 2388 bytes_req: 324768 bytes_alloc: 458496
- { call_site: [ffffffffa02911f2] drm_modeset_lock_crtc [drm] } hitcount: 3911 bytes_req: 219016 bytes_alloc: 250304
- { call_site: [ffffffff815f8d7b] sk_prot_alloc } hitcount: 235 bytes_req: 236880 bytes_alloc: 240640
- { call_site: [ffffffff8137e559] sg_kmalloc } hitcount: 557 bytes_req: 169024 bytes_alloc: 221760
- { call_site: [ffffffffa00b7c06] hid_report_raw_event [hid] } hitcount: 9378 bytes_req: 187548 bytes_alloc: 206312
- { call_site: [ffffffffa04a580c] intel_crtc_page_flip [i915] } hitcount: 1519 bytes_req: 157976 bytes_alloc: 194432
- .
- .
- .
- { call_site: [ffffffff8109bd3b] sched_autogroup_create_attach } hitcount: 2 bytes_req: 144 bytes_alloc: 192
- { call_site: [ffffffff81097ee8] alloc_rt_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
- { call_site: [ffffffff8109524a] alloc_fair_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
- { call_site: [ffffffff81095225] alloc_fair_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
- { call_site: [ffffffff81097ec2] alloc_rt_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
- { call_site: [ffffffff81213e80] load_elf_binary } hitcount: 3 bytes_req: 84 bytes_alloc: 96
- { call_site: [ffffffff81079a2e] kthread_create_on_node } hitcount: 1 bytes_req: 56 bytes_alloc: 64
- { call_site: [ffffffffa00bf6fe] hidraw_send_report [hid] } hitcount: 1 bytes_req: 7 bytes_alloc: 8
- { call_site: [ffffffff8154bc62] usb_control_msg } hitcount: 1 bytes_req: 8 bytes_alloc: 8
- { call_site: [ffffffffa00bf1ca] hidraw_report_event [hid] } hitcount: 1 bytes_req: 7 bytes_alloc: 8
-
- Totals:
- Hits: 66598
- Entries: 65
- Dropped: 0
-
- Finally, to finish off our kmalloc example, instead of simply having
- the hist trigger display symbolic call_sites, we can have the hist
- trigger additionally display the complete set of kernel stack traces
- that led to each call_site. To do that, we simply use the special
- value 'stacktrace' for the key parameter::
-
- # echo 'hist:keys=stacktrace:values=bytes_req,bytes_alloc:sort=bytes_alloc' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
-
- The above trigger will use the kernel stack trace in effect when an
- event is triggered as the key for the hash table. This allows the
- enumeration of every kernel callpath that led up to a particular
- event, along with a running total of any of the event fields for
- that event. Here we tally bytes requested and bytes allocated for
- every callpath in the system that led up to a kmalloc (in this case
- every callpath to a kmalloc for a kernel compile)::
-
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
- # trigger info: hist:keys=stacktrace:vals=bytes_req,bytes_alloc:sort=bytes_alloc:size=2048 [active]
-
- { stacktrace:
- __kmalloc_track_caller+0x10b/0x1a0
- kmemdup+0x20/0x50
- hidraw_report_event+0x8a/0x120 [hid]
- hid_report_raw_event+0x3ea/0x440 [hid]
- hid_input_report+0x112/0x190 [hid]
- hid_irq_in+0xc2/0x260 [usbhid]
- __usb_hcd_giveback_urb+0x72/0x120
- usb_giveback_urb_bh+0x9e/0xe0
- tasklet_hi_action+0xf8/0x100
- __do_softirq+0x114/0x2c0
- irq_exit+0xa5/0xb0
- do_IRQ+0x5a/0xf0
- ret_from_intr+0x0/0x30
- cpuidle_enter+0x17/0x20
- cpu_startup_entry+0x315/0x3e0
- rest_init+0x7c/0x80
- } hitcount: 3 bytes_req: 21 bytes_alloc: 24
- { stacktrace:
- __kmalloc_track_caller+0x10b/0x1a0
- kmemdup+0x20/0x50
- hidraw_report_event+0x8a/0x120 [hid]
- hid_report_raw_event+0x3ea/0x440 [hid]
- hid_input_report+0x112/0x190 [hid]
- hid_irq_in+0xc2/0x260 [usbhid]
- __usb_hcd_giveback_urb+0x72/0x120
- usb_giveback_urb_bh+0x9e/0xe0
- tasklet_hi_action+0xf8/0x100
- __do_softirq+0x114/0x2c0
- irq_exit+0xa5/0xb0
- do_IRQ+0x5a/0xf0
- ret_from_intr+0x0/0x30
- } hitcount: 3 bytes_req: 21 bytes_alloc: 24
- { stacktrace:
- kmem_cache_alloc_trace+0xeb/0x150
- aa_alloc_task_context+0x27/0x40
- apparmor_cred_prepare+0x1f/0x50
- security_prepare_creds+0x16/0x20
- prepare_creds+0xdf/0x1a0
- SyS_capset+0xb5/0x200
- system_call_fastpath+0x12/0x6a
- } hitcount: 1 bytes_req: 32 bytes_alloc: 32
- .
- .
- .
- { stacktrace:
- __kmalloc+0x11b/0x1b0
- i915_gem_execbuffer2+0x6c/0x2c0 [i915]
- drm_ioctl+0x349/0x670 [drm]
- do_vfs_ioctl+0x2f0/0x4f0
- SyS_ioctl+0x81/0xa0
- system_call_fastpath+0x12/0x6a
- } hitcount: 17726 bytes_req: 13944120 bytes_alloc: 19593808
- { stacktrace:
- __kmalloc+0x11b/0x1b0
- load_elf_phdrs+0x76/0xa0
- load_elf_binary+0x102/0x1650
- search_binary_handler+0x97/0x1d0
- do_execveat_common.isra.34+0x551/0x6e0
- SyS_execve+0x3a/0x50
- return_from_execve+0x0/0x23
- } hitcount: 33348 bytes_req: 17152128 bytes_alloc: 20226048
- { stacktrace:
- kmem_cache_alloc_trace+0xeb/0x150
- apparmor_file_alloc_security+0x27/0x40
- security_file_alloc+0x16/0x20
- get_empty_filp+0x93/0x1c0
- path_openat+0x31/0x5f0
- do_filp_open+0x3a/0x90
- do_sys_open+0x128/0x220
- SyS_open+0x1e/0x20
- system_call_fastpath+0x12/0x6a
- } hitcount: 4766422 bytes_req: 9532844 bytes_alloc: 38131376
- { stacktrace:
- __kmalloc+0x11b/0x1b0
- seq_buf_alloc+0x1b/0x50
- seq_read+0x2cc/0x370
- proc_reg_read+0x3d/0x80
- __vfs_read+0x28/0xe0
- vfs_read+0x86/0x140
- SyS_read+0x46/0xb0
- system_call_fastpath+0x12/0x6a
- } hitcount: 19133 bytes_req: 78368768 bytes_alloc: 78368768
-
- Totals:
- Hits: 6085872
- Entries: 253
- Dropped: 0
-
- If you key a hist trigger on common_pid, in order for example to
- gather and display sorted totals for each process, you can use the
- special .execname modifier to display the executable names for the
- processes in the table rather than raw pids. The example below
- keeps a per-process sum of total bytes read::
-
- # echo 'hist:key=common_pid.execname:val=count:sort=count.descending' > \
- /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
-
- # cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/hist
- # trigger info: hist:keys=common_pid.execname:vals=count:sort=count.descending:size=2048 [active]
-
- { common_pid: gnome-terminal [ 3196] } hitcount: 280 count: 1093512
- { common_pid: Xorg [ 1309] } hitcount: 525 count: 256640
- { common_pid: compiz [ 2889] } hitcount: 59 count: 254400
- { common_pid: bash [ 8710] } hitcount: 3 count: 66369
- { common_pid: dbus-daemon-lau [ 8703] } hitcount: 49 count: 47739
- { common_pid: irqbalance [ 1252] } hitcount: 27 count: 27648
- { common_pid: 01ifupdown [ 8705] } hitcount: 3 count: 17216
- { common_pid: dbus-daemon [ 772] } hitcount: 10 count: 12396
- { common_pid: Socket Thread [ 8342] } hitcount: 11 count: 11264
- { common_pid: nm-dhcp-client. [ 8701] } hitcount: 6 count: 7424
- { common_pid: gmain [ 1315] } hitcount: 18 count: 6336
- .
- .
- .
- { common_pid: postgres [ 1892] } hitcount: 2 count: 32
- { common_pid: postgres [ 1891] } hitcount: 2 count: 32
- { common_pid: gmain [ 8704] } hitcount: 2 count: 32
- { common_pid: upstart-dbus-br [ 2740] } hitcount: 21 count: 21
- { common_pid: nm-dispatcher.a [ 8696] } hitcount: 1 count: 16
- { common_pid: indicator-datet [ 2904] } hitcount: 1 count: 16
- { common_pid: gdbus [ 2998] } hitcount: 1 count: 16
- { common_pid: rtkit-daemon [ 2052] } hitcount: 1 count: 8
- { common_pid: init [ 1] } hitcount: 2 count: 2
-
- Totals:
- Hits: 2116
- Entries: 51
- Dropped: 0
-
- Similarly, if you key a hist trigger on syscall id, for example to
- gather and display a list of systemwide syscall hits, you can use
- the special .syscall modifier to display the syscall names rather
- than raw ids. The example below keeps a running total of syscall
- counts for the system during the run::
-
- # echo 'hist:key=id.syscall:val=hitcount' > \
- /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
-
- # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
- # trigger info: hist:keys=id.syscall:vals=hitcount:sort=hitcount:size=2048 [active]
-
- { id: sys_fsync [ 74] } hitcount: 1
- { id: sys_newuname [ 63] } hitcount: 1
- { id: sys_prctl [157] } hitcount: 1
- { id: sys_statfs [137] } hitcount: 1
- { id: sys_symlink [ 88] } hitcount: 1
- { id: sys_sendmmsg [307] } hitcount: 1
- { id: sys_semctl [ 66] } hitcount: 1
- { id: sys_readlink [ 89] } hitcount: 3
- { id: sys_bind [ 49] } hitcount: 3
- { id: sys_getsockname [ 51] } hitcount: 3
- { id: sys_unlink [ 87] } hitcount: 3
- { id: sys_rename [ 82] } hitcount: 4
- { id: unknown_syscall [ 58] } hitcount: 4
- { id: sys_connect [ 42] } hitcount: 4
- { id: sys_getpid [ 39] } hitcount: 4
- .
- .
- .
- { id: sys_rt_sigprocmask [ 14] } hitcount: 952
- { id: sys_futex [202] } hitcount: 1534
- { id: sys_write [ 1] } hitcount: 2689
- { id: sys_setitimer [ 38] } hitcount: 2797
- { id: sys_read [ 0] } hitcount: 3202
- { id: sys_select [ 23] } hitcount: 3773
- { id: sys_writev [ 20] } hitcount: 4531
- { id: sys_poll [ 7] } hitcount: 8314
- { id: sys_recvmsg [ 47] } hitcount: 13738
- { id: sys_ioctl [ 16] } hitcount: 21843
-
- Totals:
- Hits: 67612
- Entries: 72
- Dropped: 0
-
- The syscall counts above provide a rough overall picture of system
- call activity on the system; we can see for example that the most
- popular system call on this system was the 'sys_ioctl' system call.
-
- We can use 'compound' keys to refine that number and provide some
- further insight as to which processes exactly contribute to the
- overall ioctl count.
-
- The command below keeps a hitcount for every unique combination of
- system call id and pid - the end result is essentially a table
- that keeps a per-pid sum of system call hits. The results are
- sorted using the system call id as the primary key, and the
- hitcount sum as the secondary key::
-
- # echo 'hist:key=id.syscall,common_pid.execname:val=hitcount:sort=id,hitcount' > \
- /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
-
- # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
- # trigger info: hist:keys=id.syscall,common_pid.execname:vals=hitcount:sort=id.syscall,hitcount:size=2048 [active]
-
- { id: sys_read [ 0], common_pid: rtkit-daemon [ 1877] } hitcount: 1
- { id: sys_read [ 0], common_pid: gdbus [ 2976] } hitcount: 1
- { id: sys_read [ 0], common_pid: console-kit-dae [ 3400] } hitcount: 1
- { id: sys_read [ 0], common_pid: postgres [ 1865] } hitcount: 1
- { id: sys_read [ 0], common_pid: deja-dup-monito [ 3543] } hitcount: 2
- { id: sys_read [ 0], common_pid: NetworkManager [ 890] } hitcount: 2
- { id: sys_read [ 0], common_pid: evolution-calen [ 3048] } hitcount: 2
- { id: sys_read [ 0], common_pid: postgres [ 1864] } hitcount: 2
- { id: sys_read [ 0], common_pid: nm-applet [ 3022] } hitcount: 2
- { id: sys_read [ 0], common_pid: whoopsie [ 1212] } hitcount: 2
- .
- .
- .
- { id: sys_ioctl [ 16], common_pid: bash [ 8479] } hitcount: 1
- { id: sys_ioctl [ 16], common_pid: bash [ 3472] } hitcount: 12
- { id: sys_ioctl [ 16], common_pid: gnome-terminal [ 3199] } hitcount: 16
- { id: sys_ioctl [ 16], common_pid: Xorg [ 1267] } hitcount: 1808
- { id: sys_ioctl [ 16], common_pid: compiz [ 2994] } hitcount: 5580
- .
- .
- .
- { id: sys_waitid [247], common_pid: upstart-dbus-br [ 2690] } hitcount: 3
- { id: sys_waitid [247], common_pid: upstart-dbus-br [ 2688] } hitcount: 16
- { id: sys_inotify_add_watch [254], common_pid: gmain [ 975] } hitcount: 2
- { id: sys_inotify_add_watch [254], common_pid: gmain [ 3204] } hitcount: 4
- { id: sys_inotify_add_watch [254], common_pid: gmain [ 2888] } hitcount: 4
- { id: sys_inotify_add_watch [254], common_pid: gmain [ 3003] } hitcount: 4
- { id: sys_inotify_add_watch [254], common_pid: gmain [ 2873] } hitcount: 4
- { id: sys_inotify_add_watch [254], common_pid: gmain [ 3196] } hitcount: 6
- { id: sys_openat [257], common_pid: java [ 2623] } hitcount: 2
- { id: sys_eventfd2 [290], common_pid: ibus-ui-gtk3 [ 2760] } hitcount: 4
- { id: sys_eventfd2 [290], common_pid: compiz [ 2994] } hitcount: 6
-
- Totals:
- Hits: 31536
- Entries: 323
- Dropped: 0
-
- The above list does give us a breakdown of the ioctl syscall by
- pid, but it also gives us quite a bit more than that, which we
- don't really care about at the moment. Since we know the syscall
- id for sys_ioctl (16, displayed next to the sys_ioctl name), we
- can use that to filter out all the other syscalls::
-
- # echo 'hist:key=id.syscall,common_pid.execname:val=hitcount:sort=id,hitcount if id == 16' > \
- /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
-
- # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
- # trigger info: hist:keys=id.syscall,common_pid.execname:vals=hitcount:sort=id.syscall,hitcount:size=2048 if id == 16 [active]
-
- { id: sys_ioctl [ 16], common_pid: gmain [ 2769] } hitcount: 1
- { id: sys_ioctl [ 16], common_pid: evolution-addre [ 8571] } hitcount: 1
- { id: sys_ioctl [ 16], common_pid: gmain [ 3003] } hitcount: 1
- { id: sys_ioctl [ 16], common_pid: gmain [ 2781] } hitcount: 1
- { id: sys_ioctl [ 16], common_pid: gmain [ 2829] } hitcount: 1
- { id: sys_ioctl [ 16], common_pid: bash [ 8726] } hitcount: 1
- { id: sys_ioctl [ 16], common_pid: bash [ 8508] } hitcount: 1
- { id: sys_ioctl [ 16], common_pid: gmain [ 2970] } hitcount: 1
- { id: sys_ioctl [ 16], common_pid: gmain [ 2768] } hitcount: 1
- .
- .
- .
- { id: sys_ioctl [ 16], common_pid: pool [ 8559] } hitcount: 45
- { id: sys_ioctl [ 16], common_pid: pool [ 8555] } hitcount: 48
- { id: sys_ioctl [ 16], common_pid: pool [ 8551] } hitcount: 48
- { id: sys_ioctl [ 16], common_pid: avahi-daemon [ 896] } hitcount: 66
- { id: sys_ioctl [ 16], common_pid: Xorg [ 1267] } hitcount: 26674
- { id: sys_ioctl [ 16], common_pid: compiz [ 2994] } hitcount: 73443
-
- Totals:
- Hits: 101162
- Entries: 103
- Dropped: 0
-
- The above output shows that 'compiz' and 'Xorg' are far and away
- the heaviest ioctl callers (which might lead to questions about
- whether they really need to be making all those calls and to
- possible avenues for further investigation.)
-
- The compound key examples used a key and a sum value (hitcount) to
- sort the output, but we can just as easily use two keys instead.
- Here's an example where we use a compound key composed of the the
- common_pid and size event fields. Sorting with pid as the primary
- key and 'size' as the secondary key allows us to display an
- ordered summary of the recvfrom sizes, with counts, received by
- each process::
-
- # echo 'hist:key=common_pid.execname,size:val=hitcount:sort=common_pid,size' > \
- /sys/kernel/debug/tracing/events/syscalls/sys_enter_recvfrom/trigger
-
- # cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_recvfrom/hist
- # trigger info: hist:keys=common_pid.execname,size:vals=hitcount:sort=common_pid.execname,size:size=2048 [active]
-
- { common_pid: smbd [ 784], size: 4 } hitcount: 1
- { common_pid: dnsmasq [ 1412], size: 4096 } hitcount: 672
- { common_pid: postgres [ 1796], size: 1000 } hitcount: 6
- { common_pid: postgres [ 1867], size: 1000 } hitcount: 10
- { common_pid: bamfdaemon [ 2787], size: 28 } hitcount: 2
- { common_pid: bamfdaemon [ 2787], size: 14360 } hitcount: 1
- { common_pid: compiz [ 2994], size: 8 } hitcount: 1
- { common_pid: compiz [ 2994], size: 20 } hitcount: 11
- { common_pid: gnome-terminal [ 3199], size: 4 } hitcount: 2
- { common_pid: firefox [ 8817], size: 4 } hitcount: 1
- { common_pid: firefox [ 8817], size: 8 } hitcount: 5
- { common_pid: firefox [ 8817], size: 588 } hitcount: 2
- { common_pid: firefox [ 8817], size: 628 } hitcount: 1
- { common_pid: firefox [ 8817], size: 6944 } hitcount: 1
- { common_pid: firefox [ 8817], size: 408880 } hitcount: 2
- { common_pid: firefox [ 8822], size: 8 } hitcount: 2
- { common_pid: firefox [ 8822], size: 160 } hitcount: 2
- { common_pid: firefox [ 8822], size: 320 } hitcount: 2
- { common_pid: firefox [ 8822], size: 352 } hitcount: 1
- .
- .
- .
- { common_pid: pool [ 8923], size: 1960 } hitcount: 10
- { common_pid: pool [ 8923], size: 2048 } hitcount: 10
- { common_pid: pool [ 8924], size: 1960 } hitcount: 10
- { common_pid: pool [ 8924], size: 2048 } hitcount: 10
- { common_pid: pool [ 8928], size: 1964 } hitcount: 4
- { common_pid: pool [ 8928], size: 1965 } hitcount: 2
- { common_pid: pool [ 8928], size: 2048 } hitcount: 6
- { common_pid: pool [ 8929], size: 1982 } hitcount: 1
- { common_pid: pool [ 8929], size: 2048 } hitcount: 1
-
- Totals:
- Hits: 2016
- Entries: 224
- Dropped: 0
-
- The above example also illustrates the fact that although a compound
- key is treated as a single entity for hashing purposes, the sub-keys
- it's composed of can be accessed independently.
-
- The next example uses a string field as the hash key and
- demonstrates how you can manually pause and continue a hist trigger.
- In this example, we'll aggregate fork counts and don't expect a
- large number of entries in the hash table, so we'll drop it to a
- much smaller number, say 256::
-
- # echo 'hist:key=child_comm:val=hitcount:size=256' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
-
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
- # trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [active]
-
- { child_comm: dconf worker } hitcount: 1
- { child_comm: ibus-daemon } hitcount: 1
- { child_comm: whoopsie } hitcount: 1
- { child_comm: smbd } hitcount: 1
- { child_comm: gdbus } hitcount: 1
- { child_comm: kthreadd } hitcount: 1
- { child_comm: dconf worker } hitcount: 1
- { child_comm: evolution-alarm } hitcount: 2
- { child_comm: Socket Thread } hitcount: 2
- { child_comm: postgres } hitcount: 2
- { child_comm: bash } hitcount: 3
- { child_comm: compiz } hitcount: 3
- { child_comm: evolution-sourc } hitcount: 4
- { child_comm: dhclient } hitcount: 4
- { child_comm: pool } hitcount: 5
- { child_comm: nm-dispatcher.a } hitcount: 8
- { child_comm: firefox } hitcount: 8
- { child_comm: dbus-daemon } hitcount: 8
- { child_comm: glib-pacrunner } hitcount: 10
- { child_comm: evolution } hitcount: 23
-
- Totals:
- Hits: 89
- Entries: 20
- Dropped: 0
-
- If we want to pause the hist trigger, we can simply append :pause to
- the command that started the trigger. Notice that the trigger info
- displays as [paused]::
-
- # echo 'hist:key=child_comm:val=hitcount:size=256:pause' >> \
- /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
-
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
- # trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [paused]
-
- { child_comm: dconf worker } hitcount: 1
- { child_comm: kthreadd } hitcount: 1
- { child_comm: dconf worker } hitcount: 1
- { child_comm: gdbus } hitcount: 1
- { child_comm: ibus-daemon } hitcount: 1
- { child_comm: Socket Thread } hitcount: 2
- { child_comm: evolution-alarm } hitcount: 2
- { child_comm: smbd } hitcount: 2
- { child_comm: bash } hitcount: 3
- { child_comm: whoopsie } hitcount: 3
- { child_comm: compiz } hitcount: 3
- { child_comm: evolution-sourc } hitcount: 4
- { child_comm: pool } hitcount: 5
- { child_comm: postgres } hitcount: 6
- { child_comm: firefox } hitcount: 8
- { child_comm: dhclient } hitcount: 10
- { child_comm: emacs } hitcount: 12
- { child_comm: dbus-daemon } hitcount: 20
- { child_comm: nm-dispatcher.a } hitcount: 20
- { child_comm: evolution } hitcount: 35
- { child_comm: glib-pacrunner } hitcount: 59
-
- Totals:
- Hits: 199
- Entries: 21
- Dropped: 0
-
- To manually continue having the trigger aggregate events, append
- :cont instead. Notice that the trigger info displays as [active]
- again, and the data has changed::
-
- # echo 'hist:key=child_comm:val=hitcount:size=256:cont' >> \
- /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
-
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
- # trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [active]
-
- { child_comm: dconf worker } hitcount: 1
- { child_comm: dconf worker } hitcount: 1
- { child_comm: kthreadd } hitcount: 1
- { child_comm: gdbus } hitcount: 1
- { child_comm: ibus-daemon } hitcount: 1
- { child_comm: Socket Thread } hitcount: 2
- { child_comm: evolution-alarm } hitcount: 2
- { child_comm: smbd } hitcount: 2
- { child_comm: whoopsie } hitcount: 3
- { child_comm: compiz } hitcount: 3
- { child_comm: evolution-sourc } hitcount: 4
- { child_comm: bash } hitcount: 5
- { child_comm: pool } hitcount: 5
- { child_comm: postgres } hitcount: 6
- { child_comm: firefox } hitcount: 8
- { child_comm: dhclient } hitcount: 11
- { child_comm: emacs } hitcount: 12
- { child_comm: dbus-daemon } hitcount: 22
- { child_comm: nm-dispatcher.a } hitcount: 22
- { child_comm: evolution } hitcount: 35
- { child_comm: glib-pacrunner } hitcount: 59
-
- Totals:
- Hits: 206
- Entries: 21
- Dropped: 0
-
- The previous example showed how to start and stop a hist trigger by
- appending 'pause' and 'continue' to the hist trigger command. A
- hist trigger can also be started in a paused state by initially
- starting the trigger with ':pause' appended. This allows you to
- start the trigger only when you're ready to start collecting data
- and not before. For example, you could start the trigger in a
- paused state, then unpause it and do something you want to measure,
- then pause the trigger again when done.
-
- Of course, doing this manually can be difficult and error-prone, but
- it is possible to automatically start and stop a hist trigger based
- on some condition, via the enable_hist and disable_hist triggers.
-
- For example, suppose we wanted to take a look at the relative
- weights in terms of skb length for each callpath that leads to a
- netif_receieve_skb event when downloading a decent-sized file using
- wget.
-
- First we set up an initially paused stacktrace trigger on the
- netif_receive_skb event::
-
- # echo 'hist:key=stacktrace:vals=len:pause' > \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
-
- Next, we set up an 'enable_hist' trigger on the sched_process_exec
- event, with an 'if filename==/usr/bin/wget' filter. The effect of
- this new trigger is that it will 'unpause' the hist trigger we just
- set up on netif_receive_skb if and only if it sees a
- sched_process_exec event with a filename of '/usr/bin/wget'. When
- that happens, all netif_receive_skb events are aggregated into a
- hash table keyed on stacktrace::
-
- # echo 'enable_hist:net:netif_receive_skb if filename==/usr/bin/wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
-
- The aggregation continues until the netif_receive_skb is paused
- again, which is what the following disable_hist event does by
- creating a similar setup on the sched_process_exit event, using the
- filter 'comm==wget'::
-
- # echo 'disable_hist:net:netif_receive_skb if comm==wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
-
- Whenever a process exits and the comm field of the disable_hist
- trigger filter matches 'comm==wget', the netif_receive_skb hist
- trigger is disabled.
-
- The overall effect is that netif_receive_skb events are aggregated
- into the hash table for only the duration of the wget. Executing a
- wget command and then listing the 'hist' file will display the
- output generated by the wget command::
-
- $ wget https://www.kernel.org/pub/linux/kernel/v3.x/patch-3.19.xz
-
- # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
- # trigger info: hist:keys=stacktrace:vals=len:sort=hitcount:size=2048 [paused]
-
- { stacktrace:
- __netif_receive_skb_core+0x46d/0x990
- __netif_receive_skb+0x18/0x60
- netif_receive_skb_internal+0x23/0x90
- napi_gro_receive+0xc8/0x100
- ieee80211_deliver_skb+0xd6/0x270 [mac80211]
- ieee80211_rx_handlers+0xccf/0x22f0 [mac80211]
- ieee80211_prepare_and_rx_handle+0x4e7/0xc40 [mac80211]
- ieee80211_rx+0x31d/0x900 [mac80211]
- iwlagn_rx_reply_rx+0x3db/0x6f0 [iwldvm]
- iwl_rx_dispatch+0x8e/0xf0 [iwldvm]
- iwl_pcie_irq_handler+0xe3c/0x12f0 [iwlwifi]
- irq_thread_fn+0x20/0x50
- irq_thread+0x11f/0x150
- kthread+0xd2/0xf0
- ret_from_fork+0x42/0x70
- } hitcount: 85 len: 28884
- { stacktrace:
- __netif_receive_skb_core+0x46d/0x990
- __netif_receive_skb+0x18/0x60
- netif_receive_skb_internal+0x23/0x90
- napi_gro_complete+0xa4/0xe0
- dev_gro_receive+0x23a/0x360
- napi_gro_receive+0x30/0x100
- ieee80211_deliver_skb+0xd6/0x270 [mac80211]
- ieee80211_rx_handlers+0xccf/0x22f0 [mac80211]
- ieee80211_prepare_and_rx_handle+0x4e7/0xc40 [mac80211]
- ieee80211_rx+0x31d/0x900 [mac80211]
- iwlagn_rx_reply_rx+0x3db/0x6f0 [iwldvm]
- iwl_rx_dispatch+0x8e/0xf0 [iwldvm]
- iwl_pcie_irq_handler+0xe3c/0x12f0 [iwlwifi]
- irq_thread_fn+0x20/0x50
- irq_thread+0x11f/0x150
- kthread+0xd2/0xf0
- } hitcount: 98 len: 664329
- { stacktrace:
- __netif_receive_skb_core+0x46d/0x990
- __netif_receive_skb+0x18/0x60
- process_backlog+0xa8/0x150
- net_rx_action+0x15d/0x340
- __do_softirq+0x114/0x2c0
- do_softirq_own_stack+0x1c/0x30
- do_softirq+0x65/0x70
- __local_bh_enable_ip+0xb5/0xc0
- ip_finish_output+0x1f4/0x840
- ip_output+0x6b/0xc0
- ip_local_out_sk+0x31/0x40
- ip_send_skb+0x1a/0x50
- udp_send_skb+0x173/0x2a0
- udp_sendmsg+0x2bf/0x9f0
- inet_sendmsg+0x64/0xa0
- sock_sendmsg+0x3d/0x50
- } hitcount: 115 len: 13030
- { stacktrace:
- __netif_receive_skb_core+0x46d/0x990
- __netif_receive_skb+0x18/0x60
- netif_receive_skb_internal+0x23/0x90
- napi_gro_complete+0xa4/0xe0
- napi_gro_flush+0x6d/0x90
- iwl_pcie_irq_handler+0x92a/0x12f0 [iwlwifi]
- irq_thread_fn+0x20/0x50
- irq_thread+0x11f/0x150
- kthread+0xd2/0xf0
- ret_from_fork+0x42/0x70
- } hitcount: 934 len: 5512212
-
- Totals:
- Hits: 1232
- Entries: 4
- Dropped: 0
-
- The above shows all the netif_receive_skb callpaths and their total
- lengths for the duration of the wget command.
-
- The 'clear' hist trigger param can be used to clear the hash table.
- Suppose we wanted to try another run of the previous example but
- this time also wanted to see the complete list of events that went
- into the histogram. In order to avoid having to set everything up
- again, we can just clear the histogram first::
-
- # echo 'hist:key=stacktrace:vals=len:clear' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
-
- Just to verify that it is in fact cleared, here's what we now see in
- the hist file::
-
- # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
- # trigger info: hist:keys=stacktrace:vals=len:sort=hitcount:size=2048 [paused]
-
- Totals:
- Hits: 0
- Entries: 0
- Dropped: 0
-
- Since we want to see the detailed list of every netif_receive_skb
- event occurring during the new run, which are in fact the same
- events being aggregated into the hash table, we add some additional
- 'enable_event' events to the triggering sched_process_exec and
- sched_process_exit events as such::
-
- # echo 'enable_event:net:netif_receive_skb if filename==/usr/bin/wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
-
- # echo 'disable_event:net:netif_receive_skb if comm==wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
-
- If you read the trigger files for the sched_process_exec and
- sched_process_exit triggers, you should see two triggers for each:
- one enabling/disabling the hist aggregation and the other
- enabling/disabling the logging of events::
-
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
- enable_event:net:netif_receive_skb:unlimited if filename==/usr/bin/wget
- enable_hist:net:netif_receive_skb:unlimited if filename==/usr/bin/wget
-
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
- enable_event:net:netif_receive_skb:unlimited if comm==wget
- disable_hist:net:netif_receive_skb:unlimited if comm==wget
-
- In other words, whenever either of the sched_process_exec or
- sched_process_exit events is hit and matches 'wget', it enables or
- disables both the histogram and the event log, and what you end up
- with is a hash table and set of events just covering the specified
- duration. Run the wget command again::
-
- $ wget https://www.kernel.org/pub/linux/kernel/v3.x/patch-3.19.xz
-
- Displaying the 'hist' file should show something similar to what you
- saw in the last run, but this time you should also see the
- individual events in the trace file::
-
- # cat /sys/kernel/debug/tracing/trace
-
- # tracer: nop
- #
- # entries-in-buffer/entries-written: 183/1426 #P:4
- #
- # _-----=> irqs-off
- # / _----=> need-resched
- # | / _---=> hardirq/softirq
- # || / _--=> preempt-depth
- # ||| / delay
- # TASK-PID CPU# |||| TIMESTAMP FUNCTION
- # | | | |||| | |
- wget-15108 [000] ..s1 31769.606929: netif_receive_skb: dev=lo skbaddr=ffff88009c353100 len=60
- wget-15108 [000] ..s1 31769.606999: netif_receive_skb: dev=lo skbaddr=ffff88009c353200 len=60
- dnsmasq-1382 [000] ..s1 31769.677652: netif_receive_skb: dev=lo skbaddr=ffff88009c352b00 len=130
- dnsmasq-1382 [000] ..s1 31769.685917: netif_receive_skb: dev=lo skbaddr=ffff88009c352200 len=138
- ##### CPU 2 buffer started ####
- irq/29-iwlwifi-559 [002] ..s. 31772.031529: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433d00 len=2948
- irq/29-iwlwifi-559 [002] ..s. 31772.031572: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d432200 len=1500
- irq/29-iwlwifi-559 [002] ..s. 31772.032196: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433100 len=2948
- irq/29-iwlwifi-559 [002] ..s. 31772.032761: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433000 len=2948
- irq/29-iwlwifi-559 [002] ..s. 31772.033220: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d432e00 len=1500
- ....
-
-
- The following example demonstrates how multiple hist triggers can be
- attached to a given event. This capability can be useful for
- creating a set of different summaries derived from the same set of
- events, or for comparing the effects of different filters, among
- other things.
- ::
-
- # echo 'hist:keys=skbaddr.hex:vals=len if len < 0' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
- # echo 'hist:keys=skbaddr.hex:vals=len if len > 4096' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
- # echo 'hist:keys=skbaddr.hex:vals=len if len == 256' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
- # echo 'hist:keys=skbaddr.hex:vals=len' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
- # echo 'hist:keys=len:vals=common_preempt_count' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
-
- The above set of commands create four triggers differing only in
- their filters, along with a completely different though fairly
- nonsensical trigger. Note that in order to append multiple hist
- triggers to the same file, you should use the '>>' operator to
- append them ('>' will also add the new hist trigger, but will remove
- any existing hist triggers beforehand).
-
- Displaying the contents of the 'hist' file for the event shows the
- contents of all five histograms::
-
- # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
-
- # event histogram
- #
- # trigger info: hist:keys=len:vals=hitcount,common_preempt_count:sort=hitcount:size=2048 [active]
- #
-
- { len: 176 } hitcount: 1 common_preempt_count: 0
- { len: 223 } hitcount: 1 common_preempt_count: 0
- { len: 4854 } hitcount: 1 common_preempt_count: 0
- { len: 395 } hitcount: 1 common_preempt_count: 0
- { len: 177 } hitcount: 1 common_preempt_count: 0
- { len: 446 } hitcount: 1 common_preempt_count: 0
- { len: 1601 } hitcount: 1 common_preempt_count: 0
- .
- .
- .
- { len: 1280 } hitcount: 66 common_preempt_count: 0
- { len: 116 } hitcount: 81 common_preempt_count: 40
- { len: 708 } hitcount: 112 common_preempt_count: 0
- { len: 46 } hitcount: 221 common_preempt_count: 0
- { len: 1264 } hitcount: 458 common_preempt_count: 0
-
- Totals:
- Hits: 1428
- Entries: 147
- Dropped: 0
-
-
- # event histogram
- #
- # trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 [active]
- #
-
- { skbaddr: ffff8800baee5e00 } hitcount: 1 len: 130
- { skbaddr: ffff88005f3d5600 } hitcount: 1 len: 1280
- { skbaddr: ffff88005f3d4900 } hitcount: 1 len: 1280
- { skbaddr: ffff88009fed6300 } hitcount: 1 len: 115
- { skbaddr: ffff88009fe0ad00 } hitcount: 1 len: 115
- { skbaddr: ffff88008cdb1900 } hitcount: 1 len: 46
- { skbaddr: ffff880064b5ef00 } hitcount: 1 len: 118
- { skbaddr: ffff880044e3c700 } hitcount: 1 len: 60
- { skbaddr: ffff880100065900 } hitcount: 1 len: 46
- { skbaddr: ffff8800d46bd500 } hitcount: 1 len: 116
- { skbaddr: ffff88005f3d5f00 } hitcount: 1 len: 1280
- { skbaddr: ffff880100064700 } hitcount: 1 len: 365
- { skbaddr: ffff8800badb6f00 } hitcount: 1 len: 60
- .
- .
- .
- { skbaddr: ffff88009fe0be00 } hitcount: 27 len: 24677
- { skbaddr: ffff88009fe0a400 } hitcount: 27 len: 23052
- { skbaddr: ffff88009fe0b700 } hitcount: 31 len: 25589
- { skbaddr: ffff88009fe0b600 } hitcount: 32 len: 27326
- { skbaddr: ffff88006a462800 } hitcount: 68 len: 71678
- { skbaddr: ffff88006a463700 } hitcount: 70 len: 72678
- { skbaddr: ffff88006a462b00 } hitcount: 71 len: 77589
- { skbaddr: ffff88006a463600 } hitcount: 73 len: 71307
- { skbaddr: ffff88006a462200 } hitcount: 81 len: 81032
-
- Totals:
- Hits: 1451
- Entries: 318
- Dropped: 0
-
-
- # event histogram
- #
- # trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 if len == 256 [active]
- #
-
-
- Totals:
- Hits: 0
- Entries: 0
- Dropped: 0
-
-
- # event histogram
- #
- # trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 if len > 4096 [active]
- #
-
- { skbaddr: ffff88009fd2c300 } hitcount: 1 len: 7212
- { skbaddr: ffff8800d2bcce00 } hitcount: 1 len: 7212
- { skbaddr: ffff8800d2bcd700 } hitcount: 1 len: 7212
- { skbaddr: ffff8800d2bcda00 } hitcount: 1 len: 21492
- { skbaddr: ffff8800ae2e2d00 } hitcount: 1 len: 7212
- { skbaddr: ffff8800d2bcdb00 } hitcount: 1 len: 7212
- { skbaddr: ffff88006a4df500 } hitcount: 1 len: 4854
- { skbaddr: ffff88008ce47b00 } hitcount: 1 len: 18636
- { skbaddr: ffff8800ae2e2200 } hitcount: 1 len: 12924
- { skbaddr: ffff88005f3e1000 } hitcount: 1 len: 4356
- { skbaddr: ffff8800d2bcdc00 } hitcount: 2 len: 24420
- { skbaddr: ffff8800d2bcc200 } hitcount: 2 len: 12996
-
- Totals:
- Hits: 14
- Entries: 12
- Dropped: 0
-
-
- # event histogram
- #
- # trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 if len < 0 [active]
- #
-
-
- Totals:
- Hits: 0
- Entries: 0
- Dropped: 0
-
- Named triggers can be used to have triggers share a common set of
- histogram data. This capability is mostly useful for combining the
- output of events generated by tracepoints contained inside inline
- functions, but names can be used in a hist trigger on any event.
- For example, these two triggers when hit will update the same 'len'
- field in the shared 'foo' histogram data::
-
- # echo 'hist:name=foo:keys=skbaddr.hex:vals=len' > \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
- # echo 'hist:name=foo:keys=skbaddr.hex:vals=len' > \
- /sys/kernel/debug/tracing/events/net/netif_rx/trigger
-
- You can see that they're updating common histogram data by reading
- each event's hist files at the same time::
-
- # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist;
- cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
-
- # event histogram
- #
- # trigger info: hist:name=foo:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 [active]
- #
-
- { skbaddr: ffff88000ad53500 } hitcount: 1 len: 46
- { skbaddr: ffff8800af5a1500 } hitcount: 1 len: 76
- { skbaddr: ffff8800d62a1900 } hitcount: 1 len: 46
- { skbaddr: ffff8800d2bccb00 } hitcount: 1 len: 468
- { skbaddr: ffff8800d3c69900 } hitcount: 1 len: 46
- { skbaddr: ffff88009ff09100 } hitcount: 1 len: 52
- { skbaddr: ffff88010f13ab00 } hitcount: 1 len: 168
- { skbaddr: ffff88006a54f400 } hitcount: 1 len: 46
- { skbaddr: ffff8800d2bcc500 } hitcount: 1 len: 260
- { skbaddr: ffff880064505000 } hitcount: 1 len: 46
- { skbaddr: ffff8800baf24e00 } hitcount: 1 len: 32
- { skbaddr: ffff88009fe0ad00 } hitcount: 1 len: 46
- { skbaddr: ffff8800d3edff00 } hitcount: 1 len: 44
- { skbaddr: ffff88009fe0b400 } hitcount: 1 len: 168
- { skbaddr: ffff8800a1c55a00 } hitcount: 1 len: 40
- { skbaddr: ffff8800d2bcd100 } hitcount: 1 len: 40
- { skbaddr: ffff880064505f00 } hitcount: 1 len: 174
- { skbaddr: ffff8800a8bff200 } hitcount: 1 len: 160
- { skbaddr: ffff880044e3cc00 } hitcount: 1 len: 76
- { skbaddr: ffff8800a8bfe700 } hitcount: 1 len: 46
- { skbaddr: ffff8800d2bcdc00 } hitcount: 1 len: 32
- { skbaddr: ffff8800a1f64800 } hitcount: 1 len: 46
- { skbaddr: ffff8800d2bcde00 } hitcount: 1 len: 988
- { skbaddr: ffff88006a5dea00 } hitcount: 1 len: 46
- { skbaddr: ffff88002e37a200 } hitcount: 1 len: 44
- { skbaddr: ffff8800a1f32c00 } hitcount: 2 len: 676
- { skbaddr: ffff88000ad52600 } hitcount: 2 len: 107
- { skbaddr: ffff8800a1f91e00 } hitcount: 2 len: 92
- { skbaddr: ffff8800af5a0200 } hitcount: 2 len: 142
- { skbaddr: ffff8800d2bcc600 } hitcount: 2 len: 220
- { skbaddr: ffff8800ba36f500 } hitcount: 2 len: 92
- { skbaddr: ffff8800d021f800 } hitcount: 2 len: 92
- { skbaddr: ffff8800a1f33600 } hitcount: 2 len: 675
- { skbaddr: ffff8800a8bfff00 } hitcount: 3 len: 138
- { skbaddr: ffff8800d62a1300 } hitcount: 3 len: 138
- { skbaddr: ffff88002e37a100 } hitcount: 4 len: 184
- { skbaddr: ffff880064504400 } hitcount: 4 len: 184
- { skbaddr: ffff8800a8bfec00 } hitcount: 4 len: 184
- { skbaddr: ffff88000ad53700 } hitcount: 5 len: 230
- { skbaddr: ffff8800d2bcdb00 } hitcount: 5 len: 196
- { skbaddr: ffff8800a1f90000 } hitcount: 6 len: 276
- { skbaddr: ffff88006a54f900 } hitcount: 6 len: 276
-
- Totals:
- Hits: 81
- Entries: 42
- Dropped: 0
- # event histogram
- #
- # trigger info: hist:name=foo:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 [active]
- #
-
- { skbaddr: ffff88000ad53500 } hitcount: 1 len: 46
- { skbaddr: ffff8800af5a1500 } hitcount: 1 len: 76
- { skbaddr: ffff8800d62a1900 } hitcount: 1 len: 46
- { skbaddr: ffff8800d2bccb00 } hitcount: 1 len: 468
- { skbaddr: ffff8800d3c69900 } hitcount: 1 len: 46
- { skbaddr: ffff88009ff09100 } hitcount: 1 len: 52
- { skbaddr: ffff88010f13ab00 } hitcount: 1 len: 168
- { skbaddr: ffff88006a54f400 } hitcount: 1 len: 46
- { skbaddr: ffff8800d2bcc500 } hitcount: 1 len: 260
- { skbaddr: ffff880064505000 } hitcount: 1 len: 46
- { skbaddr: ffff8800baf24e00 } hitcount: 1 len: 32
- { skbaddr: ffff88009fe0ad00 } hitcount: 1 len: 46
- { skbaddr: ffff8800d3edff00 } hitcount: 1 len: 44
- { skbaddr: ffff88009fe0b400 } hitcount: 1 len: 168
- { skbaddr: ffff8800a1c55a00 } hitcount: 1 len: 40
- { skbaddr: ffff8800d2bcd100 } hitcount: 1 len: 40
- { skbaddr: ffff880064505f00 } hitcount: 1 len: 174
- { skbaddr: ffff8800a8bff200 } hitcount: 1 len: 160
- { skbaddr: ffff880044e3cc00 } hitcount: 1 len: 76
- { skbaddr: ffff8800a8bfe700 } hitcount: 1 len: 46
- { skbaddr: ffff8800d2bcdc00 } hitcount: 1 len: 32
- { skbaddr: ffff8800a1f64800 } hitcount: 1 len: 46
- { skbaddr: ffff8800d2bcde00 } hitcount: 1 len: 988
- { skbaddr: ffff88006a5dea00 } hitcount: 1 len: 46
- { skbaddr: ffff88002e37a200 } hitcount: 1 len: 44
- { skbaddr: ffff8800a1f32c00 } hitcount: 2 len: 676
- { skbaddr: ffff88000ad52600 } hitcount: 2 len: 107
- { skbaddr: ffff8800a1f91e00 } hitcount: 2 len: 92
- { skbaddr: ffff8800af5a0200 } hitcount: 2 len: 142
- { skbaddr: ffff8800d2bcc600 } hitcount: 2 len: 220
- { skbaddr: ffff8800ba36f500 } hitcount: 2 len: 92
- { skbaddr: ffff8800d021f800 } hitcount: 2 len: 92
- { skbaddr: ffff8800a1f33600 } hitcount: 2 len: 675
- { skbaddr: ffff8800a8bfff00 } hitcount: 3 len: 138
- { skbaddr: ffff8800d62a1300 } hitcount: 3 len: 138
- { skbaddr: ffff88002e37a100 } hitcount: 4 len: 184
- { skbaddr: ffff880064504400 } hitcount: 4 len: 184
- { skbaddr: ffff8800a8bfec00 } hitcount: 4 len: 184
- { skbaddr: ffff88000ad53700 } hitcount: 5 len: 230
- { skbaddr: ffff8800d2bcdb00 } hitcount: 5 len: 196
- { skbaddr: ffff8800a1f90000 } hitcount: 6 len: 276
- { skbaddr: ffff88006a54f900 } hitcount: 6 len: 276
-
- Totals:
- Hits: 81
- Entries: 42
- Dropped: 0
-
- And here's an example that shows how to combine histogram data from
- any two events even if they don't share any 'compatible' fields
- other than 'hitcount' and 'stacktrace'. These commands create a
- couple of triggers named 'bar' using those fields::
-
- # echo 'hist:name=bar:key=stacktrace:val=hitcount' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
- # echo 'hist:name=bar:key=stacktrace:val=hitcount' > \
- /sys/kernel/debug/tracing/events/net/netif_rx/trigger
-
- And displaying the output of either shows some interesting if
- somewhat confusing output::
-
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
- # cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
-
- # event histogram
- #
- # trigger info: hist:name=bar:keys=stacktrace:vals=hitcount:sort=hitcount:size=2048 [active]
- #
-
- { stacktrace:
- _do_fork+0x18e/0x330
- kernel_thread+0x29/0x30
- kthreadd+0x154/0x1b0
- ret_from_fork+0x3f/0x70
- } hitcount: 1
- { stacktrace:
- netif_rx_internal+0xb2/0xd0
- netif_rx_ni+0x20/0x70
- dev_loopback_xmit+0xaa/0xd0
- ip_mc_output+0x126/0x240
- ip_local_out_sk+0x31/0x40
- igmp_send_report+0x1e9/0x230
- igmp_timer_expire+0xe9/0x120
- call_timer_fn+0x39/0xf0
- run_timer_softirq+0x1e1/0x290
- __do_softirq+0xfd/0x290
- irq_exit+0x98/0xb0
- smp_apic_timer_interrupt+0x4a/0x60
- apic_timer_interrupt+0x6d/0x80
- cpuidle_enter+0x17/0x20
- call_cpuidle+0x3b/0x60
- cpu_startup_entry+0x22d/0x310
- } hitcount: 1
- { stacktrace:
- netif_rx_internal+0xb2/0xd0
- netif_rx_ni+0x20/0x70
- dev_loopback_xmit+0xaa/0xd0
- ip_mc_output+0x17f/0x240
- ip_local_out_sk+0x31/0x40
- ip_send_skb+0x1a/0x50
- udp_send_skb+0x13e/0x270
- udp_sendmsg+0x2bf/0x980
- inet_sendmsg+0x67/0xa0
- sock_sendmsg+0x38/0x50
- SYSC_sendto+0xef/0x170
- SyS_sendto+0xe/0x10
- entry_SYSCALL_64_fastpath+0x12/0x6a
- } hitcount: 2
- { stacktrace:
- netif_rx_internal+0xb2/0xd0
- netif_rx+0x1c/0x60
- loopback_xmit+0x6c/0xb0
- dev_hard_start_xmit+0x219/0x3a0
- __dev_queue_xmit+0x415/0x4f0
- dev_queue_xmit_sk+0x13/0x20
- ip_finish_output2+0x237/0x340
- ip_finish_output+0x113/0x1d0
- ip_output+0x66/0xc0
- ip_local_out_sk+0x31/0x40
- ip_send_skb+0x1a/0x50
- udp_send_skb+0x16d/0x270
- udp_sendmsg+0x2bf/0x980
- inet_sendmsg+0x67/0xa0
- sock_sendmsg+0x38/0x50
- ___sys_sendmsg+0x14e/0x270
- } hitcount: 76
- { stacktrace:
- netif_rx_internal+0xb2/0xd0
- netif_rx+0x1c/0x60
- loopback_xmit+0x6c/0xb0
- dev_hard_start_xmit+0x219/0x3a0
- __dev_queue_xmit+0x415/0x4f0
- dev_queue_xmit_sk+0x13/0x20
- ip_finish_output2+0x237/0x340
- ip_finish_output+0x113/0x1d0
- ip_output+0x66/0xc0
- ip_local_out_sk+0x31/0x40
- ip_send_skb+0x1a/0x50
- udp_send_skb+0x16d/0x270
- udp_sendmsg+0x2bf/0x980
- inet_sendmsg+0x67/0xa0
- sock_sendmsg+0x38/0x50
- ___sys_sendmsg+0x269/0x270
- } hitcount: 77
- { stacktrace:
- netif_rx_internal+0xb2/0xd0
- netif_rx+0x1c/0x60
- loopback_xmit+0x6c/0xb0
- dev_hard_start_xmit+0x219/0x3a0
- __dev_queue_xmit+0x415/0x4f0
- dev_queue_xmit_sk+0x13/0x20
- ip_finish_output2+0x237/0x340
- ip_finish_output+0x113/0x1d0
- ip_output+0x66/0xc0
- ip_local_out_sk+0x31/0x40
- ip_send_skb+0x1a/0x50
- udp_send_skb+0x16d/0x270
- udp_sendmsg+0x2bf/0x980
- inet_sendmsg+0x67/0xa0
- sock_sendmsg+0x38/0x50
- SYSC_sendto+0xef/0x170
- } hitcount: 88
- { stacktrace:
- _do_fork+0x18e/0x330
- SyS_clone+0x19/0x20
- entry_SYSCALL_64_fastpath+0x12/0x6a
- } hitcount: 244
-
- Totals:
- Hits: 489
- Entries: 7
- Dropped: 0
+ See Documentation/trace/histogram.txt for details and examples.
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index fdf5fb54a04c..e45f0786f3f9 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -543,6 +543,30 @@ of ftrace. Here is a list of some of the key files:
See events.txt for more information.
+ timestamp_mode:
+
+ Certain tracers may change the timestamp mode used when
+ logging trace events into the event buffer. Events with
+ different modes can coexist within a buffer but the mode in
+ effect when an event is logged determines which timestamp mode
+ is used for that event. The default timestamp mode is
+ 'delta'.
+
+ Usual timestamp modes for tracing:
+
+ # cat timestamp_mode
+ [delta] absolute
+
+ The timestamp mode with the square brackets around it is the
+ one in effect.
+
+ delta: Default timestamp mode - timestamp is a delta against
+ a per-buffer timestamp.
+
+ absolute: The timestamp is a full timestamp, not a delta
+ against some other value. As such it takes up more
+ space and is less efficient.
+
hwlat_detector:
Directory for the Hardware Latency Detector.
diff --git a/Documentation/trace/histogram.txt b/Documentation/trace/histogram.txt
new file mode 100644
index 000000000000..6e05510afc28
--- /dev/null
+++ b/Documentation/trace/histogram.txt
@@ -0,0 +1,1995 @@
+ Event Histograms
+
+ Documentation written by Tom Zanussi
+
+1. Introduction
+===============
+
+ Histogram triggers are special event triggers that can be used to
+ aggregate trace event data into histograms. For information on
+ trace events and event triggers, see Documentation/trace/events.txt.
+
+
+2. Histogram Trigger Command
+============================
+
+ A histogram trigger command is an event trigger command that
+ aggregates event hits into a hash table keyed on one or more trace
+ event format fields (or stacktrace) and a set of running totals
+ derived from one or more trace event format fields and/or event
+ counts (hitcount).
+
+ The format of a hist trigger is as follows:
+
+ hist:keys=<field1[,field2,...]>[:values=<field1[,field2,...]>]
+ [:sort=<field1[,field2,...]>][:size=#entries][:pause][:continue]
+ [:clear][:name=histname1] [if <filter>]
+
+ When a matching event is hit, an entry is added to a hash table
+ using the key(s) and value(s) named. Keys and values correspond to
+ fields in the event's format description. Values must correspond to
+ numeric fields - on an event hit, the value(s) will be added to a
+ sum kept for that field. The special string 'hitcount' can be used
+ in place of an explicit value field - this is simply a count of
+ event hits. If 'values' isn't specified, an implicit 'hitcount'
+ value will be automatically created and used as the only value.
+ Keys can be any field, or the special string 'stacktrace', which
+ will use the event's kernel stacktrace as the key. The keywords
+ 'keys' or 'key' can be used to specify keys, and the keywords
+ 'values', 'vals', or 'val' can be used to specify values. Compound
+ keys consisting of up to two fields can be specified by the 'keys'
+ keyword. Hashing a compound key produces a unique entry in the
+ table for each unique combination of component keys, and can be
+ useful for providing more fine-grained summaries of event data.
+ Additionally, sort keys consisting of up to two fields can be
+ specified by the 'sort' keyword. If more than one field is
+ specified, the result will be a 'sort within a sort': the first key
+ is taken to be the primary sort key and the second the secondary
+ key. If a hist trigger is given a name using the 'name' parameter,
+ its histogram data will be shared with other triggers of the same
+ name, and trigger hits will update this common data. Only triggers
+ with 'compatible' fields can be combined in this way; triggers are
+ 'compatible' if the fields named in the trigger share the same
+ number and type of fields and those fields also have the same names.
+ Note that any two events always share the compatible 'hitcount' and
+ 'stacktrace' fields and can therefore be combined using those
+ fields, however pointless that may be.
+
+ 'hist' triggers add a 'hist' file to each event's subdirectory.
+ Reading the 'hist' file for the event will dump the hash table in
+ its entirety to stdout. If there are multiple hist triggers
+ attached to an event, there will be a table for each trigger in the
+ output. The table displayed for a named trigger will be the same as
+ any other instance having the same name. Each printed hash table
+ entry is a simple list of the keys and values comprising the entry;
+ keys are printed first and are delineated by curly braces, and are
+ followed by the set of value fields for the entry. By default,
+ numeric fields are displayed as base-10 integers. This can be
+ modified by appending any of the following modifiers to the field
+ name:
+
+ .hex display a number as a hex value
+ .sym display an address as a symbol
+ .sym-offset display an address as a symbol and offset
+ .syscall display a syscall id as a system call name
+ .execname display a common_pid as a program name
+ .log2 display log2 value rather than raw number
+ .usecs display a common_timestamp in microseconds
+
+ Note that in general the semantics of a given field aren't
+ interpreted when applying a modifier to it, but there are some
+ restrictions to be aware of in this regard:
+
+ - only the 'hex' modifier can be used for values (because values
+ are essentially sums, and the other modifiers don't make sense
+ in that context).
+ - the 'execname' modifier can only be used on a 'common_pid'. The
+ reason for this is that the execname is simply the 'comm' value
+ saved for the 'current' process when an event was triggered,
+ which is the same as the common_pid value saved by the event
+ tracing code. Trying to apply that comm value to other pid
+ values wouldn't be correct, and typically events that care save
+ pid-specific comm fields in the event itself.
+
+ A typical usage scenario would be the following to enable a hist
+ trigger, read its current contents, and then turn it off:
+
+ # echo 'hist:keys=skbaddr.hex:vals=len' > \
+ /sys/kernel/debug/tracing/events/net/netif_rx/trigger
+
+ # cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
+
+ # echo '!hist:keys=skbaddr.hex:vals=len' > \
+ /sys/kernel/debug/tracing/events/net/netif_rx/trigger
+
+ The trigger file itself can be read to show the details of the
+ currently attached hist trigger. This information is also displayed
+ at the top of the 'hist' file when read.
+
+ By default, the size of the hash table is 2048 entries. The 'size'
+ parameter can be used to specify more or fewer than that. The units
+ are in terms of hashtable entries - if a run uses more entries than
+ specified, the results will show the number of 'drops', the number
+ of hits that were ignored. The size should be a power of 2 between
+ 128 and 131072 (any non- power-of-2 number specified will be rounded
+ up).
+
+ The 'sort' parameter can be used to specify a value field to sort
+ on. The default if unspecified is 'hitcount' and the default sort
+ order is 'ascending'. To sort in the opposite direction, append
+ .descending' to the sort key.
+
+ The 'pause' parameter can be used to pause an existing hist trigger
+ or to start a hist trigger but not log any events until told to do
+ so. 'continue' or 'cont' can be used to start or restart a paused
+ hist trigger.
+
+ The 'clear' parameter will clear the contents of a running hist
+ trigger and leave its current paused/active state.
+
+ Note that the 'pause', 'cont', and 'clear' parameters should be
+ applied using 'append' shell operator ('>>') if applied to an
+ existing trigger, rather than via the '>' operator, which will cause
+ the trigger to be removed through truncation.
+
+- enable_hist/disable_hist
+
+ The enable_hist and disable_hist triggers can be used to have one
+ event conditionally start and stop another event's already-attached
+ hist trigger. Any number of enable_hist and disable_hist triggers
+ can be attached to a given event, allowing that event to kick off
+ and stop aggregations on a host of other events.
+
+ The format is very similar to the enable/disable_event triggers:
+
+ enable_hist:<system>:<event>[:count]
+ disable_hist:<system>:<event>[:count]
+
+ Instead of enabling or disabling the tracing of the target event
+ into the trace buffer as the enable/disable_event triggers do, the
+ enable/disable_hist triggers enable or disable the aggregation of
+ the target event into a hash table.
+
+ A typical usage scenario for the enable_hist/disable_hist triggers
+ would be to first set up a paused hist trigger on some event,
+ followed by an enable_hist/disable_hist pair that turns the hist
+ aggregation on and off when conditions of interest are hit:
+
+ # echo 'hist:keys=skbaddr.hex:vals=len:pause' > \
+ /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+
+ # echo 'enable_hist:net:netif_receive_skb if filename==/usr/bin/wget' > \
+ /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+
+ # echo 'disable_hist:net:netif_receive_skb if comm==wget' > \
+ /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+
+ The above sets up an initially paused hist trigger which is unpaused
+ and starts aggregating events when a given program is executed, and
+ which stops aggregating when the process exits and the hist trigger
+ is paused again.
+
+ The examples below provide a more concrete illustration of the
+ concepts and typical usage patterns discussed above.
+
+ 'special' event fields
+ ------------------------
+
+ There are a number of 'special event fields' available for use as
+ keys or values in a hist trigger. These look like and behave as if
+ they were actual event fields, but aren't really part of the event's
+ field definition or format file. They are however available for any
+ event, and can be used anywhere an actual event field could be.
+ They are:
+
+ common_timestamp u64 - timestamp (from ring buffer) associated
+ with the event, in nanoseconds. May be
+ modified by .usecs to have timestamps
+ interpreted as microseconds.
+ cpu int - the cpu on which the event occurred.
+
+ Extended error information
+ --------------------------
+
+ For some error conditions encountered when invoking a hist trigger
+ command, extended error information is available via the
+ corresponding event's 'hist' file. Reading the hist file after an
+ error will display more detailed information about what went wrong,
+ if information is available. This extended error information will
+ be available until the next hist trigger command for that event.
+
+ If available for a given error condition, the extended error
+ information and usage takes the following form:
+
+ # echo xxx > /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
+ echo: write error: Invalid argument
+
+ # cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/hist
+ ERROR: Couldn't yyy: zzz
+ Last command: xxx
+
+6.2 'hist' trigger examples
+---------------------------
+
+ The first set of examples creates aggregations using the kmalloc
+ event. The fields that can be used for the hist trigger are listed
+ in the kmalloc event's format file:
+
+ # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/format
+ name: kmalloc
+ ID: 374
+ format:
+ field:unsigned short common_type; offset:0; size:2; signed:0;
+ field:unsigned char common_flags; offset:2; size:1; signed:0;
+ field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
+ field:int common_pid; offset:4; size:4; signed:1;
+
+ field:unsigned long call_site; offset:8; size:8; signed:0;
+ field:const void * ptr; offset:16; size:8; signed:0;
+ field:size_t bytes_req; offset:24; size:8; signed:0;
+ field:size_t bytes_alloc; offset:32; size:8; signed:0;
+ field:gfp_t gfp_flags; offset:40; size:4; signed:0;
+
+ We'll start by creating a hist trigger that generates a simple table
+ that lists the total number of bytes requested for each function in
+ the kernel that made one or more calls to kmalloc:
+
+ # echo 'hist:key=call_site:val=bytes_req' > \
+ /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+ This tells the tracing system to create a 'hist' trigger using the
+ call_site field of the kmalloc event as the key for the table, which
+ just means that each unique call_site address will have an entry
+ created for it in the table. The 'val=bytes_req' parameter tells
+ the hist trigger that for each unique entry (call_site) in the
+ table, it should keep a running total of the number of bytes
+ requested by that call_site.
+
+ We'll let it run for awhile and then dump the contents of the 'hist'
+ file in the kmalloc event's subdirectory (for readability, a number
+ of entries have been omitted):
+
+ # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # trigger info: hist:keys=call_site:vals=bytes_req:sort=hitcount:size=2048 [active]
+
+ { call_site: 18446744072106379007 } hitcount: 1 bytes_req: 176
+ { call_site: 18446744071579557049 } hitcount: 1 bytes_req: 1024
+ { call_site: 18446744071580608289 } hitcount: 1 bytes_req: 16384
+ { call_site: 18446744071581827654 } hitcount: 1 bytes_req: 24
+ { call_site: 18446744071580700980 } hitcount: 1 bytes_req: 8
+ { call_site: 18446744071579359876 } hitcount: 1 bytes_req: 152
+ { call_site: 18446744071580795365 } hitcount: 3 bytes_req: 144
+ { call_site: 18446744071581303129 } hitcount: 3 bytes_req: 144
+ { call_site: 18446744071580713234 } hitcount: 4 bytes_req: 2560
+ { call_site: 18446744071580933750 } hitcount: 4 bytes_req: 736
+ .
+ .
+ .
+ { call_site: 18446744072106047046 } hitcount: 69 bytes_req: 5576
+ { call_site: 18446744071582116407 } hitcount: 73 bytes_req: 2336
+ { call_site: 18446744072106054684 } hitcount: 136 bytes_req: 140504
+ { call_site: 18446744072106224230 } hitcount: 136 bytes_req: 19584
+ { call_site: 18446744072106078074 } hitcount: 153 bytes_req: 2448
+ { call_site: 18446744072106062406 } hitcount: 153 bytes_req: 36720
+ { call_site: 18446744071582507929 } hitcount: 153 bytes_req: 37088
+ { call_site: 18446744072102520590 } hitcount: 273 bytes_req: 10920
+ { call_site: 18446744071582143559 } hitcount: 358 bytes_req: 716
+ { call_site: 18446744072106465852 } hitcount: 417 bytes_req: 56712
+ { call_site: 18446744072102523378 } hitcount: 485 bytes_req: 27160
+ { call_site: 18446744072099568646 } hitcount: 1676 bytes_req: 33520
+
+ Totals:
+ Hits: 4610
+ Entries: 45
+ Dropped: 0
+
+ The output displays a line for each entry, beginning with the key
+ specified in the trigger, followed by the value(s) also specified in
+ the trigger. At the beginning of the output is a line that displays
+ the trigger info, which can also be displayed by reading the
+ 'trigger' file:
+
+ # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ hist:keys=call_site:vals=bytes_req:sort=hitcount:size=2048 [active]
+
+ At the end of the output are a few lines that display the overall
+ totals for the run. The 'Hits' field shows the total number of
+ times the event trigger was hit, the 'Entries' field shows the total
+ number of used entries in the hash table, and the 'Dropped' field
+ shows the number of hits that were dropped because the number of
+ used entries for the run exceeded the maximum number of entries
+ allowed for the table (normally 0, but if not a hint that you may
+ want to increase the size of the table using the 'size' parameter).
+
+ Notice in the above output that there's an extra field, 'hitcount',
+ which wasn't specified in the trigger. Also notice that in the
+ trigger info output, there's a parameter, 'sort=hitcount', which
+ wasn't specified in the trigger either. The reason for that is that
+ every trigger implicitly keeps a count of the total number of hits
+ attributed to a given entry, called the 'hitcount'. That hitcount
+ information is explicitly displayed in the output, and in the
+ absence of a user-specified sort parameter, is used as the default
+ sort field.
+
+ The value 'hitcount' can be used in place of an explicit value in
+ the 'values' parameter if you don't really need to have any
+ particular field summed and are mainly interested in hit
+ frequencies.
+
+ To turn the hist trigger off, simply call up the trigger in the
+ command history and re-execute it with a '!' prepended:
+
+ # echo '!hist:key=call_site:val=bytes_req' > \
+ /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+ Finally, notice that the call_site as displayed in the output above
+ isn't really very useful. It's an address, but normally addresses
+ are displayed in hex. To have a numeric field displayed as a hex
+ value, simply append '.hex' to the field name in the trigger:
+
+ # echo 'hist:key=call_site.hex:val=bytes_req' > \
+ /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+ # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # trigger info: hist:keys=call_site.hex:vals=bytes_req:sort=hitcount:size=2048 [active]
+
+ { call_site: ffffffffa026b291 } hitcount: 1 bytes_req: 433
+ { call_site: ffffffffa07186ff } hitcount: 1 bytes_req: 176
+ { call_site: ffffffff811ae721 } hitcount: 1 bytes_req: 16384
+ { call_site: ffffffff811c5134 } hitcount: 1 bytes_req: 8
+ { call_site: ffffffffa04a9ebb } hitcount: 1 bytes_req: 511
+ { call_site: ffffffff8122e0a6 } hitcount: 1 bytes_req: 12
+ { call_site: ffffffff8107da84 } hitcount: 1 bytes_req: 152
+ { call_site: ffffffff812d8246 } hitcount: 1 bytes_req: 24
+ { call_site: ffffffff811dc1e5 } hitcount: 3 bytes_req: 144
+ { call_site: ffffffffa02515e8 } hitcount: 3 bytes_req: 648
+ { call_site: ffffffff81258159 } hitcount: 3 bytes_req: 144
+ { call_site: ffffffff811c80f4 } hitcount: 4 bytes_req: 544
+ .
+ .
+ .
+ { call_site: ffffffffa06c7646 } hitcount: 106 bytes_req: 8024
+ { call_site: ffffffffa06cb246 } hitcount: 132 bytes_req: 31680
+ { call_site: ffffffffa06cef7a } hitcount: 132 bytes_req: 2112
+ { call_site: ffffffff8137e399 } hitcount: 132 bytes_req: 23232
+ { call_site: ffffffffa06c941c } hitcount: 185 bytes_req: 171360
+ { call_site: ffffffffa06f2a66 } hitcount: 185 bytes_req: 26640
+ { call_site: ffffffffa036a70e } hitcount: 265 bytes_req: 10600
+ { call_site: ffffffff81325447 } hitcount: 292 bytes_req: 584
+ { call_site: ffffffffa072da3c } hitcount: 446 bytes_req: 60656
+ { call_site: ffffffffa036b1f2 } hitcount: 526 bytes_req: 29456
+ { call_site: ffffffffa0099c06 } hitcount: 1780 bytes_req: 35600
+
+ Totals:
+ Hits: 4775
+ Entries: 46
+ Dropped: 0
+
+ Even that's only marginally more useful - while hex values do look
+ more like addresses, what users are typically more interested in
+ when looking at text addresses are the corresponding symbols
+ instead. To have an address displayed as symbolic value instead,
+ simply append '.sym' or '.sym-offset' to the field name in the
+ trigger:
+
+ # echo 'hist:key=call_site.sym:val=bytes_req' > \
+ /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+ # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # trigger info: hist:keys=call_site.sym:vals=bytes_req:sort=hitcount:size=2048 [active]
+
+ { call_site: [ffffffff810adcb9] syslog_print_all } hitcount: 1 bytes_req: 1024
+ { call_site: [ffffffff8154bc62] usb_control_msg } hitcount: 1 bytes_req: 8
+ { call_site: [ffffffffa00bf6fe] hidraw_send_report [hid] } hitcount: 1 bytes_req: 7
+ { call_site: [ffffffff8154acbe] usb_alloc_urb } hitcount: 1 bytes_req: 192
+ { call_site: [ffffffffa00bf1ca] hidraw_report_event [hid] } hitcount: 1 bytes_req: 7
+ { call_site: [ffffffff811e3a25] __seq_open_private } hitcount: 1 bytes_req: 40
+ { call_site: [ffffffff8109524a] alloc_fair_sched_group } hitcount: 2 bytes_req: 128
+ { call_site: [ffffffff811febd5] fsnotify_alloc_group } hitcount: 2 bytes_req: 528
+ { call_site: [ffffffff81440f58] __tty_buffer_request_room } hitcount: 2 bytes_req: 2624
+ { call_site: [ffffffff81200ba6] inotify_new_group } hitcount: 2 bytes_req: 96
+ { call_site: [ffffffffa05e19af] ieee80211_start_tx_ba_session [mac80211] } hitcount: 2 bytes_req: 464
+ { call_site: [ffffffff81672406] tcp_get_metrics } hitcount: 2 bytes_req: 304
+ { call_site: [ffffffff81097ec2] alloc_rt_sched_group } hitcount: 2 bytes_req: 128
+ { call_site: [ffffffff81089b05] sched_create_group } hitcount: 2 bytes_req: 1424
+ .
+ .
+ .
+ { call_site: [ffffffffa04a580c] intel_crtc_page_flip [i915] } hitcount: 1185 bytes_req: 123240
+ { call_site: [ffffffffa0287592] drm_mode_page_flip_ioctl [drm] } hitcount: 1185 bytes_req: 104280
+ { call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state [i915] } hitcount: 1402 bytes_req: 190672
+ { call_site: [ffffffff812891ca] ext4_find_extent } hitcount: 1518 bytes_req: 146208
+ { call_site: [ffffffffa029070e] drm_vma_node_allow [drm] } hitcount: 1746 bytes_req: 69840
+ { call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 2021 bytes_req: 792312
+ { call_site: [ffffffffa02911f2] drm_modeset_lock_crtc [drm] } hitcount: 2592 bytes_req: 145152
+ { call_site: [ffffffffa0489a66] intel_ring_begin [i915] } hitcount: 2629 bytes_req: 378576
+ { call_site: [ffffffffa046041c] i915_gem_execbuffer2 [i915] } hitcount: 2629 bytes_req: 3783248
+ { call_site: [ffffffff81325607] apparmor_file_alloc_security } hitcount: 5192 bytes_req: 10384
+ { call_site: [ffffffffa00b7c06] hid_report_raw_event [hid] } hitcount: 5529 bytes_req: 110584
+ { call_site: [ffffffff8131ebf7] aa_alloc_task_context } hitcount: 21943 bytes_req: 702176
+ { call_site: [ffffffff8125847d] ext4_htree_store_dirent } hitcount: 55759 bytes_req: 5074265
+
+ Totals:
+ Hits: 109928
+ Entries: 71
+ Dropped: 0
+
+ Because the default sort key above is 'hitcount', the above shows a
+ the list of call_sites by increasing hitcount, so that at the bottom
+ we see the functions that made the most kmalloc calls during the
+ run. If instead we we wanted to see the top kmalloc callers in
+ terms of the number of bytes requested rather than the number of
+ calls, and we wanted the top caller to appear at the top, we can use
+ the 'sort' parameter, along with the 'descending' modifier:
+
+ # echo 'hist:key=call_site.sym:val=bytes_req:sort=bytes_req.descending' > \
+ /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+ # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # trigger info: hist:keys=call_site.sym:vals=bytes_req:sort=bytes_req.descending:size=2048 [active]
+
+ { call_site: [ffffffffa046041c] i915_gem_execbuffer2 [i915] } hitcount: 2186 bytes_req: 3397464
+ { call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 1790 bytes_req: 712176
+ { call_site: [ffffffff8125847d] ext4_htree_store_dirent } hitcount: 8132 bytes_req: 513135
+ { call_site: [ffffffff811e2a1b] seq_buf_alloc } hitcount: 106 bytes_req: 440128
+ { call_site: [ffffffffa0489a66] intel_ring_begin [i915] } hitcount: 2186 bytes_req: 314784
+ { call_site: [ffffffff812891ca] ext4_find_extent } hitcount: 2174 bytes_req: 208992
+ { call_site: [ffffffff811ae8e1] __kmalloc } hitcount: 8 bytes_req: 131072
+ { call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state [i915] } hitcount: 859 bytes_req: 116824
+ { call_site: [ffffffffa02911f2] drm_modeset_lock_crtc [drm] } hitcount: 1834 bytes_req: 102704
+ { call_site: [ffffffffa04a580c] intel_crtc_page_flip [i915] } hitcount: 972 bytes_req: 101088
+ { call_site: [ffffffffa0287592] drm_mode_page_flip_ioctl [drm] } hitcount: 972 bytes_req: 85536
+ { call_site: [ffffffffa00b7c06] hid_report_raw_event [hid] } hitcount: 3333 bytes_req: 66664
+ { call_site: [ffffffff8137e559] sg_kmalloc } hitcount: 209 bytes_req: 61632
+ .
+ .
+ .
+ { call_site: [ffffffff81095225] alloc_fair_sched_group } hitcount: 2 bytes_req: 128
+ { call_site: [ffffffff81097ec2] alloc_rt_sched_group } hitcount: 2 bytes_req: 128
+ { call_site: [ffffffff812d8406] copy_semundo } hitcount: 2 bytes_req: 48
+ { call_site: [ffffffff81200ba6] inotify_new_group } hitcount: 1 bytes_req: 48
+ { call_site: [ffffffffa027121a] drm_getmagic [drm] } hitcount: 1 bytes_req: 48
+ { call_site: [ffffffff811e3a25] __seq_open_private } hitcount: 1 bytes_req: 40
+ { call_site: [ffffffff811c52f4] bprm_change_interp } hitcount: 2 bytes_req: 16
+ { call_site: [ffffffff8154bc62] usb_control_msg } hitcount: 1 bytes_req: 8
+ { call_site: [ffffffffa00bf1ca] hidraw_report_event [hid] } hitcount: 1 bytes_req: 7
+ { call_site: [ffffffffa00bf6fe] hidraw_send_report [hid] } hitcount: 1 bytes_req: 7
+
+ Totals:
+ Hits: 32133
+ Entries: 81
+ Dropped: 0
+
+ To display the offset and size information in addition to the symbol
+ name, just use 'sym-offset' instead:
+
+ # echo 'hist:key=call_site.sym-offset:val=bytes_req:sort=bytes_req.descending' > \
+ /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+ # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # trigger info: hist:keys=call_site.sym-offset:vals=bytes_req:sort=bytes_req.descending:size=2048 [active]
+
+ { call_site: [ffffffffa046041c] i915_gem_execbuffer2+0x6c/0x2c0 [i915] } hitcount: 4569 bytes_req: 3163720
+ { call_site: [ffffffffa0489a66] intel_ring_begin+0xc6/0x1f0 [i915] } hitcount: 4569 bytes_req: 657936
+ { call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23+0x694/0x1020 [i915] } hitcount: 1519 bytes_req: 472936
+ { call_site: [ffffffffa045e646] i915_gem_do_execbuffer.isra.23+0x516/0x1020 [i915] } hitcount: 3050 bytes_req: 211832
+ { call_site: [ffffffff811e2a1b] seq_buf_alloc+0x1b/0x50 } hitcount: 34 bytes_req: 148384
+ { call_site: [ffffffffa04a580c] intel_crtc_page_flip+0xbc/0x870 [i915] } hitcount: 1385 bytes_req: 144040
+ { call_site: [ffffffff811ae8e1] __kmalloc+0x191/0x1b0 } hitcount: 8 bytes_req: 131072
+ { call_site: [ffffffffa0287592] drm_mode_page_flip_ioctl+0x282/0x360 [drm] } hitcount: 1385 bytes_req: 121880
+ { call_site: [ffffffffa02911f2] drm_modeset_lock_crtc+0x32/0x100 [drm] } hitcount: 1848 bytes_req: 103488
+ { call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state+0x2c/0xa0 [i915] } hitcount: 461 bytes_req: 62696
+ { call_site: [ffffffffa029070e] drm_vma_node_allow+0x2e/0xd0 [drm] } hitcount: 1541 bytes_req: 61640
+ { call_site: [ffffffff815f8d7b] sk_prot_alloc+0xcb/0x1b0 } hitcount: 57 bytes_req: 57456
+ .
+ .
+ .
+ { call_site: [ffffffff8109524a] alloc_fair_sched_group+0x5a/0x1a0 } hitcount: 2 bytes_req: 128
+ { call_site: [ffffffffa027b921] drm_vm_open_locked+0x31/0xa0 [drm] } hitcount: 3 bytes_req: 96
+ { call_site: [ffffffff8122e266] proc_self_follow_link+0x76/0xb0 } hitcount: 8 bytes_req: 96
+ { call_site: [ffffffff81213e80] load_elf_binary+0x240/0x1650 } hitcount: 3 bytes_req: 84
+ { call_site: [ffffffff8154bc62] usb_control_msg+0x42/0x110 } hitcount: 1 bytes_req: 8
+ { call_site: [ffffffffa00bf6fe] hidraw_send_report+0x7e/0x1a0 [hid] } hitcount: 1 bytes_req: 7
+ { call_site: [ffffffffa00bf1ca] hidraw_report_event+0x8a/0x120 [hid] } hitcount: 1 bytes_req: 7
+
+ Totals:
+ Hits: 26098
+ Entries: 64
+ Dropped: 0
+
+ We can also add multiple fields to the 'values' parameter. For
+ example, we might want to see the total number of bytes allocated
+ alongside bytes requested, and display the result sorted by bytes
+ allocated in a descending order:
+
+ # echo 'hist:keys=call_site.sym:values=bytes_req,bytes_alloc:sort=bytes_alloc.descending' > \
+ /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+ # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # trigger info: hist:keys=call_site.sym:vals=bytes_req,bytes_alloc:sort=bytes_alloc.descending:size=2048 [active]
+
+ { call_site: [ffffffffa046041c] i915_gem_execbuffer2 [i915] } hitcount: 7403 bytes_req: 4084360 bytes_alloc: 5958016
+ { call_site: [ffffffff811e2a1b] seq_buf_alloc } hitcount: 541 bytes_req: 2213968 bytes_alloc: 2228224
+ { call_site: [ffffffffa0489a66] intel_ring_begin [i915] } hitcount: 7404 bytes_req: 1066176 bytes_alloc: 1421568
+ { call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 1565 bytes_req: 557368 bytes_alloc: 1037760
+ { call_site: [ffffffff8125847d] ext4_htree_store_dirent } hitcount: 9557 bytes_req: 595778 bytes_alloc: 695744
+ { call_site: [ffffffffa045e646] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 5839 bytes_req: 430680 bytes_alloc: 470400
+ { call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state [i915] } hitcount: 2388 bytes_req: 324768 bytes_alloc: 458496
+ { call_site: [ffffffffa02911f2] drm_modeset_lock_crtc [drm] } hitcount: 3911 bytes_req: 219016 bytes_alloc: 250304
+ { call_site: [ffffffff815f8d7b] sk_prot_alloc } hitcount: 235 bytes_req: 236880 bytes_alloc: 240640
+ { call_site: [ffffffff8137e559] sg_kmalloc } hitcount: 557 bytes_req: 169024 bytes_alloc: 221760
+ { call_site: [ffffffffa00b7c06] hid_report_raw_event [hid] } hitcount: 9378 bytes_req: 187548 bytes_alloc: 206312
+ { call_site: [ffffffffa04a580c] intel_crtc_page_flip [i915] } hitcount: 1519 bytes_req: 157976 bytes_alloc: 194432
+ .
+ .
+ .
+ { call_site: [ffffffff8109bd3b] sched_autogroup_create_attach } hitcount: 2 bytes_req: 144 bytes_alloc: 192
+ { call_site: [ffffffff81097ee8] alloc_rt_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
+ { call_site: [ffffffff8109524a] alloc_fair_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
+ { call_site: [ffffffff81095225] alloc_fair_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
+ { call_site: [ffffffff81097ec2] alloc_rt_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
+ { call_site: [ffffffff81213e80] load_elf_binary } hitcount: 3 bytes_req: 84 bytes_alloc: 96
+ { call_site: [ffffffff81079a2e] kthread_create_on_node } hitcount: 1 bytes_req: 56 bytes_alloc: 64
+ { call_site: [ffffffffa00bf6fe] hidraw_send_report [hid] } hitcount: 1 bytes_req: 7 bytes_alloc: 8
+ { call_site: [ffffffff8154bc62] usb_control_msg } hitcount: 1 bytes_req: 8 bytes_alloc: 8
+ { call_site: [ffffffffa00bf1ca] hidraw_report_event [hid] } hitcount: 1 bytes_req: 7 bytes_alloc: 8
+
+ Totals:
+ Hits: 66598
+ Entries: 65
+ Dropped: 0
+
+ Finally, to finish off our kmalloc example, instead of simply having
+ the hist trigger display symbolic call_sites, we can have the hist
+ trigger additionally display the complete set of kernel stack traces
+ that led to each call_site. To do that, we simply use the special
+ value 'stacktrace' for the key parameter:
+
+ # echo 'hist:keys=stacktrace:values=bytes_req,bytes_alloc:sort=bytes_alloc' > \
+ /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+ The above trigger will use the kernel stack trace in effect when an
+ event is triggered as the key for the hash table. This allows the
+ enumeration of every kernel callpath that led up to a particular
+ event, along with a running total of any of the event fields for
+ that event. Here we tally bytes requested and bytes allocated for
+ every callpath in the system that led up to a kmalloc (in this case
+ every callpath to a kmalloc for a kernel compile):
+
+ # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # trigger info: hist:keys=stacktrace:vals=bytes_req,bytes_alloc:sort=bytes_alloc:size=2048 [active]
+
+ { stacktrace:
+ __kmalloc_track_caller+0x10b/0x1a0
+ kmemdup+0x20/0x50
+ hidraw_report_event+0x8a/0x120 [hid]
+ hid_report_raw_event+0x3ea/0x440 [hid]
+ hid_input_report+0x112/0x190 [hid]
+ hid_irq_in+0xc2/0x260 [usbhid]
+ __usb_hcd_giveback_urb+0x72/0x120
+ usb_giveback_urb_bh+0x9e/0xe0
+ tasklet_hi_action+0xf8/0x100
+ __do_softirq+0x114/0x2c0
+ irq_exit+0xa5/0xb0
+ do_IRQ+0x5a/0xf0
+ ret_from_intr+0x0/0x30
+ cpuidle_enter+0x17/0x20
+ cpu_startup_entry+0x315/0x3e0
+ rest_init+0x7c/0x80
+ } hitcount: 3 bytes_req: 21 bytes_alloc: 24
+ { stacktrace:
+ __kmalloc_track_caller+0x10b/0x1a0
+ kmemdup+0x20/0x50
+ hidraw_report_event+0x8a/0x120 [hid]
+ hid_report_raw_event+0x3ea/0x440 [hid]
+ hid_input_report+0x112/0x190 [hid]
+ hid_irq_in+0xc2/0x260 [usbhid]
+ __usb_hcd_giveback_urb+0x72/0x120
+ usb_giveback_urb_bh+0x9e/0xe0
+ tasklet_hi_action+0xf8/0x100
+ __do_softirq+0x114/0x2c0
+ irq_exit+0xa5/0xb0
+ do_IRQ+0x5a/0xf0
+ ret_from_intr+0x0/0x30
+ } hitcount: 3 bytes_req: 21 bytes_alloc: 24
+ { stacktrace:
+ kmem_cache_alloc_trace+0xeb/0x150
+ aa_alloc_task_context+0x27/0x40
+ apparmor_cred_prepare+0x1f/0x50
+ security_prepare_creds+0x16/0x20
+ prepare_creds+0xdf/0x1a0
+ SyS_capset+0xb5/0x200
+ system_call_fastpath+0x12/0x6a
+ } hitcount: 1 bytes_req: 32 bytes_alloc: 32
+ .
+ .
+ .
+ { stacktrace:
+ __kmalloc+0x11b/0x1b0
+ i915_gem_execbuffer2+0x6c/0x2c0 [i915]
+ drm_ioctl+0x349/0x670 [drm]
+ do_vfs_ioctl+0x2f0/0x4f0
+ SyS_ioctl+0x81/0xa0
+ system_call_fastpath+0x12/0x6a
+ } hitcount: 17726 bytes_req: 13944120 bytes_alloc: 19593808
+ { stacktrace:
+ __kmalloc+0x11b/0x1b0
+ load_elf_phdrs+0x76/0xa0
+ load_elf_binary+0x102/0x1650
+ search_binary_handler+0x97/0x1d0
+ do_execveat_common.isra.34+0x551/0x6e0
+ SyS_execve+0x3a/0x50
+ return_from_execve+0x0/0x23
+ } hitcount: 33348 bytes_req: 17152128 bytes_alloc: 20226048
+ { stacktrace:
+ kmem_cache_alloc_trace+0xeb/0x150
+ apparmor_file_alloc_security+0x27/0x40
+ security_file_alloc+0x16/0x20
+ get_empty_filp+0x93/0x1c0
+ path_openat+0x31/0x5f0
+ do_filp_open+0x3a/0x90
+ do_sys_open+0x128/0x220
+ SyS_open+0x1e/0x20
+ system_call_fastpath+0x12/0x6a
+ } hitcount: 4766422 bytes_req: 9532844 bytes_alloc: 38131376
+ { stacktrace:
+ __kmalloc+0x11b/0x1b0
+ seq_buf_alloc+0x1b/0x50
+ seq_read+0x2cc/0x370
+ proc_reg_read+0x3d/0x80
+ __vfs_read+0x28/0xe0
+ vfs_read+0x86/0x140
+ SyS_read+0x46/0xb0
+ system_call_fastpath+0x12/0x6a
+ } hitcount: 19133 bytes_req: 78368768 bytes_alloc: 78368768
+
+ Totals:
+ Hits: 6085872
+ Entries: 253
+ Dropped: 0
+
+ If you key a hist trigger on common_pid, in order for example to
+ gather and display sorted totals for each process, you can use the
+ special .execname modifier to display the executable names for the
+ processes in the table rather than raw pids. The example below
+ keeps a per-process sum of total bytes read:
+
+ # echo 'hist:key=common_pid.execname:val=count:sort=count.descending' > \
+ /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
+
+ # cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/hist
+ # trigger info: hist:keys=common_pid.execname:vals=count:sort=count.descending:size=2048 [active]
+
+ { common_pid: gnome-terminal [ 3196] } hitcount: 280 count: 1093512
+ { common_pid: Xorg [ 1309] } hitcount: 525 count: 256640
+ { common_pid: compiz [ 2889] } hitcount: 59 count: 254400
+ { common_pid: bash [ 8710] } hitcount: 3 count: 66369
+ { common_pid: dbus-daemon-lau [ 8703] } hitcount: 49 count: 47739
+ { common_pid: irqbalance [ 1252] } hitcount: 27 count: 27648
+ { common_pid: 01ifupdown [ 8705] } hitcount: 3 count: 17216
+ { common_pid: dbus-daemon [ 772] } hitcount: 10 count: 12396
+ { common_pid: Socket Thread [ 8342] } hitcount: 11 count: 11264
+ { common_pid: nm-dhcp-client. [ 8701] } hitcount: 6 count: 7424
+ { common_pid: gmain [ 1315] } hitcount: 18 count: 6336
+ .
+ .
+ .
+ { common_pid: postgres [ 1892] } hitcount: 2 count: 32
+ { common_pid: postgres [ 1891] } hitcount: 2 count: 32
+ { common_pid: gmain [ 8704] } hitcount: 2 count: 32
+ { common_pid: upstart-dbus-br [ 2740] } hitcount: 21 count: 21
+ { common_pid: nm-dispatcher.a [ 8696] } hitcount: 1 count: 16
+ { common_pid: indicator-datet [ 2904] } hitcount: 1 count: 16
+ { common_pid: gdbus [ 2998] } hitcount: 1 count: 16
+ { common_pid: rtkit-daemon [ 2052] } hitcount: 1 count: 8
+ { common_pid: init [ 1] } hitcount: 2 count: 2
+
+ Totals:
+ Hits: 2116
+ Entries: 51
+ Dropped: 0
+
+ Similarly, if you key a hist trigger on syscall id, for example to
+ gather and display a list of systemwide syscall hits, you can use
+ the special .syscall modifier to display the syscall names rather
+ than raw ids. The example below keeps a running total of syscall
+ counts for the system during the run:
+
+ # echo 'hist:key=id.syscall:val=hitcount' > \
+ /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
+
+ # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
+ # trigger info: hist:keys=id.syscall:vals=hitcount:sort=hitcount:size=2048 [active]
+
+ { id: sys_fsync [ 74] } hitcount: 1
+ { id: sys_newuname [ 63] } hitcount: 1
+ { id: sys_prctl [157] } hitcount: 1
+ { id: sys_statfs [137] } hitcount: 1
+ { id: sys_symlink [ 88] } hitcount: 1
+ { id: sys_sendmmsg [307] } hitcount: 1
+ { id: sys_semctl [ 66] } hitcount: 1
+ { id: sys_readlink [ 89] } hitcount: 3
+ { id: sys_bind [ 49] } hitcount: 3
+ { id: sys_getsockname [ 51] } hitcount: 3
+ { id: sys_unlink [ 87] } hitcount: 3
+ { id: sys_rename [ 82] } hitcount: 4
+ { id: unknown_syscall [ 58] } hitcount: 4
+ { id: sys_connect [ 42] } hitcount: 4
+ { id: sys_getpid [ 39] } hitcount: 4
+ .
+ .
+ .
+ { id: sys_rt_sigprocmask [ 14] } hitcount: 952
+ { id: sys_futex [202] } hitcount: 1534
+ { id: sys_write [ 1] } hitcount: 2689
+ { id: sys_setitimer [ 38] } hitcount: 2797
+ { id: sys_read [ 0] } hitcount: 3202
+ { id: sys_select [ 23] } hitcount: 3773
+ { id: sys_writev [ 20] } hitcount: 4531
+ { id: sys_poll [ 7] } hitcount: 8314
+ { id: sys_recvmsg [ 47] } hitcount: 13738
+ { id: sys_ioctl [ 16] } hitcount: 21843
+
+ Totals:
+ Hits: 67612
+ Entries: 72
+ Dropped: 0
+
+ The syscall counts above provide a rough overall picture of system
+ call activity on the system; we can see for example that the most
+ popular system call on this system was the 'sys_ioctl' system call.
+
+ We can use 'compound' keys to refine that number and provide some
+ further insight as to which processes exactly contribute to the
+ overall ioctl count.
+
+ The command below keeps a hitcount for every unique combination of
+ system call id and pid - the end result is essentially a table
+ that keeps a per-pid sum of system call hits. The results are
+ sorted using the system call id as the primary key, and the
+ hitcount sum as the secondary key:
+
+ # echo 'hist:key=id.syscall,common_pid.execname:val=hitcount:sort=id,hitcount' > \
+ /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
+
+ # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
+ # trigger info: hist:keys=id.syscall,common_pid.execname:vals=hitcount:sort=id.syscall,hitcount:size=2048 [active]
+
+ { id: sys_read [ 0], common_pid: rtkit-daemon [ 1877] } hitcount: 1
+ { id: sys_read [ 0], common_pid: gdbus [ 2976] } hitcount: 1
+ { id: sys_read [ 0], common_pid: console-kit-dae [ 3400] } hitcount: 1
+ { id: sys_read [ 0], common_pid: postgres [ 1865] } hitcount: 1
+ { id: sys_read [ 0], common_pid: deja-dup-monito [ 3543] } hitcount: 2
+ { id: sys_read [ 0], common_pid: NetworkManager [ 890] } hitcount: 2
+ { id: sys_read [ 0], common_pid: evolution-calen [ 3048] } hitcount: 2
+ { id: sys_read [ 0], common_pid: postgres [ 1864] } hitcount: 2
+ { id: sys_read [ 0], common_pid: nm-applet [ 3022] } hitcount: 2
+ { id: sys_read [ 0], common_pid: whoopsie [ 1212] } hitcount: 2
+ .
+ .
+ .
+ { id: sys_ioctl [ 16], common_pid: bash [ 8479] } hitcount: 1
+ { id: sys_ioctl [ 16], common_pid: bash [ 3472] } hitcount: 12
+ { id: sys_ioctl [ 16], common_pid: gnome-terminal [ 3199] } hitcount: 16
+ { id: sys_ioctl [ 16], common_pid: Xorg [ 1267] } hitcount: 1808
+ { id: sys_ioctl [ 16], common_pid: compiz [ 2994] } hitcount: 5580
+ .
+ .
+ .
+ { id: sys_waitid [247], common_pid: upstart-dbus-br [ 2690] } hitcount: 3
+ { id: sys_waitid [247], common_pid: upstart-dbus-br [ 2688] } hitcount: 16
+ { id: sys_inotify_add_watch [254], common_pid: gmain [ 975] } hitcount: 2
+ { id: sys_inotify_add_watch [254], common_pid: gmain [ 3204] } hitcount: 4
+ { id: sys_inotify_add_watch [254], common_pid: gmain [ 2888] } hitcount: 4
+ { id: sys_inotify_add_watch [254], common_pid: gmain [ 3003] } hitcount: 4
+ { id: sys_inotify_add_watch [254], common_pid: gmain [ 2873] } hitcount: 4
+ { id: sys_inotify_add_watch [254], common_pid: gmain [ 3196] } hitcount: 6
+ { id: sys_openat [257], common_pid: java [ 2623] } hitcount: 2
+ { id: sys_eventfd2 [290], common_pid: ibus-ui-gtk3 [ 2760] } hitcount: 4
+ { id: sys_eventfd2 [290], common_pid: compiz [ 2994] } hitcount: 6
+
+ Totals:
+ Hits: 31536
+ Entries: 323
+ Dropped: 0
+
+ The above list does give us a breakdown of the ioctl syscall by
+ pid, but it also gives us quite a bit more than that, which we
+ don't really care about at the moment. Since we know the syscall
+ id for sys_ioctl (16, displayed next to the sys_ioctl name), we
+ can use that to filter out all the other syscalls:
+
+ # echo 'hist:key=id.syscall,common_pid.execname:val=hitcount:sort=id,hitcount if id == 16' > \
+ /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
+
+ # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
+ # trigger info: hist:keys=id.syscall,common_pid.execname:vals=hitcount:sort=id.syscall,hitcount:size=2048 if id == 16 [active]
+
+ { id: sys_ioctl [ 16], common_pid: gmain [ 2769] } hitcount: 1
+ { id: sys_ioctl [ 16], common_pid: evolution-addre [ 8571] } hitcount: 1
+ { id: sys_ioctl [ 16], common_pid: gmain [ 3003] } hitcount: 1
+ { id: sys_ioctl [ 16], common_pid: gmain [ 2781] } hitcount: 1
+ { id: sys_ioctl [ 16], common_pid: gmain [ 2829] } hitcount: 1
+ { id: sys_ioctl [ 16], common_pid: bash [ 8726] } hitcount: 1
+ { id: sys_ioctl [ 16], common_pid: bash [ 8508] } hitcount: 1
+ { id: sys_ioctl [ 16], common_pid: gmain [ 2970] } hitcount: 1
+ { id: sys_ioctl [ 16], common_pid: gmain [ 2768] } hitcount: 1
+ .
+ .
+ .
+ { id: sys_ioctl [ 16], common_pid: pool [ 8559] } hitcount: 45
+ { id: sys_ioctl [ 16], common_pid: pool [ 8555] } hitcount: 48
+ { id: sys_ioctl [ 16], common_pid: pool [ 8551] } hitcount: 48
+ { id: sys_ioctl [ 16], common_pid: avahi-daemon [ 896] } hitcount: 66
+ { id: sys_ioctl [ 16], common_pid: Xorg [ 1267] } hitcount: 26674
+ { id: sys_ioctl [ 16], common_pid: compiz [ 2994] } hitcount: 73443
+
+ Totals:
+ Hits: 101162
+ Entries: 103
+ Dropped: 0
+
+ The above output shows that 'compiz' and 'Xorg' are far and away
+ the heaviest ioctl callers (which might lead to questions about
+ whether they really need to be making all those calls and to
+ possible avenues for further investigation.)
+
+ The compound key examples used a key and a sum value (hitcount) to
+ sort the output, but we can just as easily use two keys instead.
+ Here's an example where we use a compound key composed of the the
+ common_pid and size event fields. Sorting with pid as the primary
+ key and 'size' as the secondary key allows us to display an
+ ordered summary of the recvfrom sizes, with counts, received by
+ each process:
+
+ # echo 'hist:key=common_pid.execname,size:val=hitcount:sort=common_pid,size' > \
+ /sys/kernel/debug/tracing/events/syscalls/sys_enter_recvfrom/trigger
+
+ # cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_recvfrom/hist
+ # trigger info: hist:keys=common_pid.execname,size:vals=hitcount:sort=common_pid.execname,size:size=2048 [active]
+
+ { common_pid: smbd [ 784], size: 4 } hitcount: 1
+ { common_pid: dnsmasq [ 1412], size: 4096 } hitcount: 672
+ { common_pid: postgres [ 1796], size: 1000 } hitcount: 6
+ { common_pid: postgres [ 1867], size: 1000 } hitcount: 10
+ { common_pid: bamfdaemon [ 2787], size: 28 } hitcount: 2
+ { common_pid: bamfdaemon [ 2787], size: 14360 } hitcount: 1
+ { common_pid: compiz [ 2994], size: 8 } hitcount: 1
+ { common_pid: compiz [ 2994], size: 20 } hitcount: 11
+ { common_pid: gnome-terminal [ 3199], size: 4 } hitcount: 2
+ { common_pid: firefox [ 8817], size: 4 } hitcount: 1
+ { common_pid: firefox [ 8817], size: 8 } hitcount: 5
+ { common_pid: firefox [ 8817], size: 588 } hitcount: 2
+ { common_pid: firefox [ 8817], size: 628 } hitcount: 1
+ { common_pid: firefox [ 8817], size: 6944 } hitcount: 1
+ { common_pid: firefox [ 8817], size: 408880 } hitcount: 2
+ { common_pid: firefox [ 8822], size: 8 } hitcount: 2
+ { common_pid: firefox [ 8822], size: 160 } hitcount: 2
+ { common_pid: firefox [ 8822], size: 320 } hitcount: 2
+ { common_pid: firefox [ 8822], size: 352 } hitcount: 1
+ .
+ .
+ .
+ { common_pid: pool [ 8923], size: 1960 } hitcount: 10
+ { common_pid: pool [ 8923], size: 2048 } hitcount: 10
+ { common_pid: pool [ 8924], size: 1960 } hitcount: 10
+ { common_pid: pool [ 8924], size: 2048 } hitcount: 10
+ { common_pid: pool [ 8928], size: 1964 } hitcount: 4
+ { common_pid: pool [ 8928], size: 1965 } hitcount: 2
+ { common_pid: pool [ 8928], size: 2048 } hitcount: 6
+ { common_pid: pool [ 8929], size: 1982 } hitcount: 1
+ { common_pid: pool [ 8929], size: 2048 } hitcount: 1
+
+ Totals:
+ Hits: 2016
+ Entries: 224
+ Dropped: 0
+
+ The above example also illustrates the fact that although a compound
+ key is treated as a single entity for hashing purposes, the sub-keys
+ it's composed of can be accessed independently.
+
+ The next example uses a string field as the hash key and
+ demonstrates how you can manually pause and continue a hist trigger.
+ In this example, we'll aggregate fork counts and don't expect a
+ large number of entries in the hash table, so we'll drop it to a
+ much smaller number, say 256:
+
+ # echo 'hist:key=child_comm:val=hitcount:size=256' > \
+ /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+
+ # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
+ # trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [active]
+
+ { child_comm: dconf worker } hitcount: 1
+ { child_comm: ibus-daemon } hitcount: 1
+ { child_comm: whoopsie } hitcount: 1
+ { child_comm: smbd } hitcount: 1
+ { child_comm: gdbus } hitcount: 1
+ { child_comm: kthreadd } hitcount: 1
+ { child_comm: dconf worker } hitcount: 1
+ { child_comm: evolution-alarm } hitcount: 2
+ { child_comm: Socket Thread } hitcount: 2
+ { child_comm: postgres } hitcount: 2
+ { child_comm: bash } hitcount: 3
+ { child_comm: compiz } hitcount: 3
+ { child_comm: evolution-sourc } hitcount: 4
+ { child_comm: dhclient } hitcount: 4
+ { child_comm: pool } hitcount: 5
+ { child_comm: nm-dispatcher.a } hitcount: 8
+ { child_comm: firefox } hitcount: 8
+ { child_comm: dbus-daemon } hitcount: 8
+ { child_comm: glib-pacrunner } hitcount: 10
+ { child_comm: evolution } hitcount: 23
+
+ Totals:
+ Hits: 89
+ Entries: 20
+ Dropped: 0
+
+ If we want to pause the hist trigger, we can simply append :pause to
+ the command that started the trigger. Notice that the trigger info
+ displays as [paused]:
+
+ # echo 'hist:key=child_comm:val=hitcount:size=256:pause' >> \
+ /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+
+ # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
+ # trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [paused]
+
+ { child_comm: dconf worker } hitcount: 1
+ { child_comm: kthreadd } hitcount: 1
+ { child_comm: dconf worker } hitcount: 1
+ { child_comm: gdbus } hitcount: 1
+ { child_comm: ibus-daemon } hitcount: 1
+ { child_comm: Socket Thread } hitcount: 2
+ { child_comm: evolution-alarm } hitcount: 2
+ { child_comm: smbd } hitcount: 2
+ { child_comm: bash } hitcount: 3
+ { child_comm: whoopsie } hitcount: 3
+ { child_comm: compiz } hitcount: 3
+ { child_comm: evolution-sourc } hitcount: 4
+ { child_comm: pool } hitcount: 5
+ { child_comm: postgres } hitcount: 6
+ { child_comm: firefox } hitcount: 8
+ { child_comm: dhclient } hitcount: 10
+ { child_comm: emacs } hitcount: 12
+ { child_comm: dbus-daemon } hitcount: 20
+ { child_comm: nm-dispatcher.a } hitcount: 20
+ { child_comm: evolution } hitcount: 35
+ { child_comm: glib-pacrunner } hitcount: 59
+
+ Totals:
+ Hits: 199
+ Entries: 21
+ Dropped: 0
+
+ To manually continue having the trigger aggregate events, append
+ :cont instead. Notice that the trigger info displays as [active]
+ again, and the data has changed:
+
+ # echo 'hist:key=child_comm:val=hitcount:size=256:cont' >> \
+ /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+
+ # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
+ # trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [active]
+
+ { child_comm: dconf worker } hitcount: 1
+ { child_comm: dconf worker } hitcount: 1
+ { child_comm: kthreadd } hitcount: 1
+ { child_comm: gdbus } hitcount: 1
+ { child_comm: ibus-daemon } hitcount: 1
+ { child_comm: Socket Thread } hitcount: 2
+ { child_comm: evolution-alarm } hitcount: 2
+ { child_comm: smbd } hitcount: 2
+ { child_comm: whoopsie } hitcount: 3
+ { child_comm: compiz } hitcount: 3
+ { child_comm: evolution-sourc } hitcount: 4
+ { child_comm: bash } hitcount: 5
+ { child_comm: pool } hitcount: 5
+ { child_comm: postgres } hitcount: 6
+ { child_comm: firefox } hitcount: 8
+ { child_comm: dhclient } hitcount: 11
+ { child_comm: emacs } hitcount: 12
+ { child_comm: dbus-daemon } hitcount: 22
+ { child_comm: nm-dispatcher.a } hitcount: 22
+ { child_comm: evolution } hitcount: 35
+ { child_comm: glib-pacrunner } hitcount: 59
+
+ Totals:
+ Hits: 206
+ Entries: 21
+ Dropped: 0
+
+ The previous example showed how to start and stop a hist trigger by
+ appending 'pause' and 'continue' to the hist trigger command. A
+ hist trigger can also be started in a paused state by initially
+ starting the trigger with ':pause' appended. This allows you to
+ start the trigger only when you're ready to start collecting data
+ and not before. For example, you could start the trigger in a
+ paused state, then unpause it and do something you want to measure,
+ then pause the trigger again when done.
+
+ Of course, doing this manually can be difficult and error-prone, but
+ it is possible to automatically start and stop a hist trigger based
+ on some condition, via the enable_hist and disable_hist triggers.
+
+ For example, suppose we wanted to take a look at the relative
+ weights in terms of skb length for each callpath that leads to a
+ netif_receieve_skb event when downloading a decent-sized file using
+ wget.
+
+ First we set up an initially paused stacktrace trigger on the
+ netif_receive_skb event:
+
+ # echo 'hist:key=stacktrace:vals=len:pause' > \
+ /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+
+ Next, we set up an 'enable_hist' trigger on the sched_process_exec
+ event, with an 'if filename==/usr/bin/wget' filter. The effect of
+ this new trigger is that it will 'unpause' the hist trigger we just
+ set up on netif_receive_skb if and only if it sees a
+ sched_process_exec event with a filename of '/usr/bin/wget'. When
+ that happens, all netif_receive_skb events are aggregated into a
+ hash table keyed on stacktrace:
+
+ # echo 'enable_hist:net:netif_receive_skb if filename==/usr/bin/wget' > \
+ /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+
+ The aggregation continues until the netif_receive_skb is paused
+ again, which is what the following disable_hist event does by
+ creating a similar setup on the sched_process_exit event, using the
+ filter 'comm==wget':
+
+ # echo 'disable_hist:net:netif_receive_skb if comm==wget' > \
+ /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+
+ Whenever a process exits and the comm field of the disable_hist
+ trigger filter matches 'comm==wget', the netif_receive_skb hist
+ trigger is disabled.
+
+ The overall effect is that netif_receive_skb events are aggregated
+ into the hash table for only the duration of the wget. Executing a
+ wget command and then listing the 'hist' file will display the
+ output generated by the wget command:
+
+ $ wget https://www.kernel.org/pub/linux/kernel/v3.x/patch-3.19.xz
+
+ # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
+ # trigger info: hist:keys=stacktrace:vals=len:sort=hitcount:size=2048 [paused]
+
+ { stacktrace:
+ __netif_receive_skb_core+0x46d/0x990
+ __netif_receive_skb+0x18/0x60
+ netif_receive_skb_internal+0x23/0x90
+ napi_gro_receive+0xc8/0x100
+ ieee80211_deliver_skb+0xd6/0x270 [mac80211]
+ ieee80211_rx_handlers+0xccf/0x22f0 [mac80211]
+ ieee80211_prepare_and_rx_handle+0x4e7/0xc40 [mac80211]
+ ieee80211_rx+0x31d/0x900 [mac80211]
+ iwlagn_rx_reply_rx+0x3db/0x6f0 [iwldvm]
+ iwl_rx_dispatch+0x8e/0xf0 [iwldvm]
+ iwl_pcie_irq_handler+0xe3c/0x12f0 [iwlwifi]
+ irq_thread_fn+0x20/0x50
+ irq_thread+0x11f/0x150
+ kthread+0xd2/0xf0
+ ret_from_fork+0x42/0x70
+ } hitcount: 85 len: 28884
+ { stacktrace:
+ __netif_receive_skb_core+0x46d/0x990
+ __netif_receive_skb+0x18/0x60
+ netif_receive_skb_internal+0x23/0x90
+ napi_gro_complete+0xa4/0xe0
+ dev_gro_receive+0x23a/0x360
+ napi_gro_receive+0x30/0x100
+ ieee80211_deliver_skb+0xd6/0x270 [mac80211]
+ ieee80211_rx_handlers+0xccf/0x22f0 [mac80211]
+ ieee80211_prepare_and_rx_handle+0x4e7/0xc40 [mac80211]
+ ieee80211_rx+0x31d/0x900 [mac80211]
+ iwlagn_rx_reply_rx+0x3db/0x6f0 [iwldvm]
+ iwl_rx_dispatch+0x8e/0xf0 [iwldvm]
+ iwl_pcie_irq_handler+0xe3c/0x12f0 [iwlwifi]
+ irq_thread_fn+0x20/0x50
+ irq_thread+0x11f/0x150
+ kthread+0xd2/0xf0
+ } hitcount: 98 len: 664329
+ { stacktrace:
+ __netif_receive_skb_core+0x46d/0x990
+ __netif_receive_skb+0x18/0x60
+ process_backlog+0xa8/0x150
+ net_rx_action+0x15d/0x340
+ __do_softirq+0x114/0x2c0
+ do_softirq_own_stack+0x1c/0x30
+ do_softirq+0x65/0x70
+ __local_bh_enable_ip+0xb5/0xc0
+ ip_finish_output+0x1f4/0x840
+ ip_output+0x6b/0xc0
+ ip_local_out_sk+0x31/0x40
+ ip_send_skb+0x1a/0x50
+ udp_send_skb+0x173/0x2a0
+ udp_sendmsg+0x2bf/0x9f0
+ inet_sendmsg+0x64/0xa0
+ sock_sendmsg+0x3d/0x50
+ } hitcount: 115 len: 13030
+ { stacktrace:
+ __netif_receive_skb_core+0x46d/0x990
+ __netif_receive_skb+0x18/0x60
+ netif_receive_skb_internal+0x23/0x90
+ napi_gro_complete+0xa4/0xe0
+ napi_gro_flush+0x6d/0x90
+ iwl_pcie_irq_handler+0x92a/0x12f0 [iwlwifi]
+ irq_thread_fn+0x20/0x50
+ irq_thread+0x11f/0x150
+ kthread+0xd2/0xf0
+ ret_from_fork+0x42/0x70
+ } hitcount: 934 len: 5512212
+
+ Totals:
+ Hits: 1232
+ Entries: 4
+ Dropped: 0
+
+ The above shows all the netif_receive_skb callpaths and their total
+ lengths for the duration of the wget command.
+
+ The 'clear' hist trigger param can be used to clear the hash table.
+ Suppose we wanted to try another run of the previous example but
+ this time also wanted to see the complete list of events that went
+ into the histogram. In order to avoid having to set everything up
+ again, we can just clear the histogram first:
+
+ # echo 'hist:key=stacktrace:vals=len:clear' >> \
+ /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+
+ Just to verify that it is in fact cleared, here's what we now see in
+ the hist file:
+
+ # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
+ # trigger info: hist:keys=stacktrace:vals=len:sort=hitcount:size=2048 [paused]
+
+ Totals:
+ Hits: 0
+ Entries: 0
+ Dropped: 0
+
+ Since we want to see the detailed list of every netif_receive_skb
+ event occurring during the new run, which are in fact the same
+ events being aggregated into the hash table, we add some additional
+ 'enable_event' events to the triggering sched_process_exec and
+ sched_process_exit events as such:
+
+ # echo 'enable_event:net:netif_receive_skb if filename==/usr/bin/wget' > \
+ /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+
+ # echo 'disable_event:net:netif_receive_skb if comm==wget' > \
+ /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+
+ If you read the trigger files for the sched_process_exec and
+ sched_process_exit triggers, you should see two triggers for each:
+ one enabling/disabling the hist aggregation and the other
+ enabling/disabling the logging of events:
+
+ # cat /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+ enable_event:net:netif_receive_skb:unlimited if filename==/usr/bin/wget
+ enable_hist:net:netif_receive_skb:unlimited if filename==/usr/bin/wget
+
+ # cat /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+ enable_event:net:netif_receive_skb:unlimited if comm==wget
+ disable_hist:net:netif_receive_skb:unlimited if comm==wget
+
+ In other words, whenever either of the sched_process_exec or
+ sched_process_exit events is hit and matches 'wget', it enables or
+ disables both the histogram and the event log, and what you end up
+ with is a hash table and set of events just covering the specified
+ duration. Run the wget command again:
+
+ $ wget https://www.kernel.org/pub/linux/kernel/v3.x/patch-3.19.xz
+
+ Displaying the 'hist' file should show something similar to what you
+ saw in the last run, but this time you should also see the
+ individual events in the trace file:
+
+ # cat /sys/kernel/debug/tracing/trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 183/1426 #P:4
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / delay
+ # TASK-PID CPU# |||| TIMESTAMP FUNCTION
+ # | | | |||| | |
+ wget-15108 [000] ..s1 31769.606929: netif_receive_skb: dev=lo skbaddr=ffff88009c353100 len=60
+ wget-15108 [000] ..s1 31769.606999: netif_receive_skb: dev=lo skbaddr=ffff88009c353200 len=60
+ dnsmasq-1382 [000] ..s1 31769.677652: netif_receive_skb: dev=lo skbaddr=ffff88009c352b00 len=130
+ dnsmasq-1382 [000] ..s1 31769.685917: netif_receive_skb: dev=lo skbaddr=ffff88009c352200 len=138
+ ##### CPU 2 buffer started ####
+ irq/29-iwlwifi-559 [002] ..s. 31772.031529: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433d00 len=2948
+ irq/29-iwlwifi-559 [002] ..s. 31772.031572: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d432200 len=1500
+ irq/29-iwlwifi-559 [002] ..s. 31772.032196: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433100 len=2948
+ irq/29-iwlwifi-559 [002] ..s. 31772.032761: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433000 len=2948
+ irq/29-iwlwifi-559 [002] ..s. 31772.033220: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d432e00 len=1500
+ .
+ .
+ .
+
+ The following example demonstrates how multiple hist triggers can be
+ attached to a given event. This capability can be useful for
+ creating a set of different summaries derived from the same set of
+ events, or for comparing the effects of different filters, among
+ other things.
+
+ # echo 'hist:keys=skbaddr.hex:vals=len if len < 0' >> \
+ /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ # echo 'hist:keys=skbaddr.hex:vals=len if len > 4096' >> \
+ /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ # echo 'hist:keys=skbaddr.hex:vals=len if len == 256' >> \
+ /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ # echo 'hist:keys=skbaddr.hex:vals=len' >> \
+ /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ # echo 'hist:keys=len:vals=common_preempt_count' >> \
+ /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+
+ The above set of commands create four triggers differing only in
+ their filters, along with a completely different though fairly
+ nonsensical trigger. Note that in order to append multiple hist
+ triggers to the same file, you should use the '>>' operator to
+ append them ('>' will also add the new hist trigger, but will remove
+ any existing hist triggers beforehand).
+
+ Displaying the contents of the 'hist' file for the event shows the
+ contents of all five histograms:
+
+ # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
+
+ # event histogram
+ #
+ # trigger info: hist:keys=len:vals=hitcount,common_preempt_count:sort=hitcount:size=2048 [active]
+ #
+
+ { len: 176 } hitcount: 1 common_preempt_count: 0
+ { len: 223 } hitcount: 1 common_preempt_count: 0
+ { len: 4854 } hitcount: 1 common_preempt_count: 0
+ { len: 395 } hitcount: 1 common_preempt_count: 0
+ { len: 177 } hitcount: 1 common_preempt_count: 0
+ { len: 446 } hitcount: 1 common_preempt_count: 0
+ { len: 1601 } hitcount: 1 common_preempt_count: 0
+ .
+ .
+ .
+ { len: 1280 } hitcount: 66 common_preempt_count: 0
+ { len: 116 } hitcount: 81 common_preempt_count: 40
+ { len: 708 } hitcount: 112 common_preempt_count: 0
+ { len: 46 } hitcount: 221 common_preempt_count: 0
+ { len: 1264 } hitcount: 458 common_preempt_count: 0
+
+ Totals:
+ Hits: 1428
+ Entries: 147
+ Dropped: 0
+
+
+ # event histogram
+ #
+ # trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 [active]
+ #
+
+ { skbaddr: ffff8800baee5e00 } hitcount: 1 len: 130
+ { skbaddr: ffff88005f3d5600 } hitcount: 1 len: 1280
+ { skbaddr: ffff88005f3d4900 } hitcount: 1 len: 1280
+ { skbaddr: ffff88009fed6300 } hitcount: 1 len: 115
+ { skbaddr: ffff88009fe0ad00 } hitcount: 1 len: 115
+ { skbaddr: ffff88008cdb1900 } hitcount: 1 len: 46
+ { skbaddr: ffff880064b5ef00 } hitcount: 1 len: 118
+ { skbaddr: ffff880044e3c700 } hitcount: 1 len: 60
+ { skbaddr: ffff880100065900 } hitcount: 1 len: 46
+ { skbaddr: ffff8800d46bd500 } hitcount: 1 len: 116
+ { skbaddr: ffff88005f3d5f00 } hitcount: 1 len: 1280
+ { skbaddr: ffff880100064700 } hitcount: 1 len: 365
+ { skbaddr: ffff8800badb6f00 } hitcount: 1 len: 60
+ .
+ .
+ .
+ { skbaddr: ffff88009fe0be00 } hitcount: 27 len: 24677
+ { skbaddr: ffff88009fe0a400 } hitcount: 27 len: 23052
+ { skbaddr: ffff88009fe0b700 } hitcount: 31 len: 25589
+ { skbaddr: ffff88009fe0b600 } hitcount: 32 len: 27326
+ { skbaddr: ffff88006a462800 } hitcount: 68 len: 71678
+ { skbaddr: ffff88006a463700 } hitcount: 70 len: 72678
+ { skbaddr: ffff88006a462b00 } hitcount: 71 len: 77589
+ { skbaddr: ffff88006a463600 } hitcount: 73 len: 71307
+ { skbaddr: ffff88006a462200 } hitcount: 81 len: 81032
+
+ Totals:
+ Hits: 1451
+ Entries: 318
+ Dropped: 0
+
+
+ # event histogram
+ #
+ # trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 if len == 256 [active]
+ #
+
+
+ Totals:
+ Hits: 0
+ Entries: 0
+ Dropped: 0
+
+
+ # event histogram
+ #
+ # trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 if len > 4096 [active]
+ #
+
+ { skbaddr: ffff88009fd2c300 } hitcount: 1 len: 7212
+ { skbaddr: ffff8800d2bcce00 } hitcount: 1 len: 7212
+ { skbaddr: ffff8800d2bcd700 } hitcount: 1 len: 7212
+ { skbaddr: ffff8800d2bcda00 } hitcount: 1 len: 21492
+ { skbaddr: ffff8800ae2e2d00 } hitcount: 1 len: 7212
+ { skbaddr: ffff8800d2bcdb00 } hitcount: 1 len: 7212
+ { skbaddr: ffff88006a4df500 } hitcount: 1 len: 4854
+ { skbaddr: ffff88008ce47b00 } hitcount: 1 len: 18636
+ { skbaddr: ffff8800ae2e2200 } hitcount: 1 len: 12924
+ { skbaddr: ffff88005f3e1000 } hitcount: 1 len: 4356
+ { skbaddr: ffff8800d2bcdc00 } hitcount: 2 len: 24420
+ { skbaddr: ffff8800d2bcc200 } hitcount: 2 len: 12996
+
+ Totals:
+ Hits: 14
+ Entries: 12
+ Dropped: 0
+
+
+ # event histogram
+ #
+ # trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 if len < 0 [active]
+ #
+
+
+ Totals:
+ Hits: 0
+ Entries: 0
+ Dropped: 0
+
+ Named triggers can be used to have triggers share a common set of
+ histogram data. This capability is mostly useful for combining the
+ output of events generated by tracepoints contained inside inline
+ functions, but names can be used in a hist trigger on any event.
+ For example, these two triggers when hit will update the same 'len'
+ field in the shared 'foo' histogram data:
+
+ # echo 'hist:name=foo:keys=skbaddr.hex:vals=len' > \
+ /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ # echo 'hist:name=foo:keys=skbaddr.hex:vals=len' > \
+ /sys/kernel/debug/tracing/events/net/netif_rx/trigger
+
+ You can see that they're updating common histogram data by reading
+ each event's hist files at the same time:
+
+ # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist;
+ cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
+
+ # event histogram
+ #
+ # trigger info: hist:name=foo:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 [active]
+ #
+
+ { skbaddr: ffff88000ad53500 } hitcount: 1 len: 46
+ { skbaddr: ffff8800af5a1500 } hitcount: 1 len: 76
+ { skbaddr: ffff8800d62a1900 } hitcount: 1 len: 46
+ { skbaddr: ffff8800d2bccb00 } hitcount: 1 len: 468
+ { skbaddr: ffff8800d3c69900 } hitcount: 1 len: 46
+ { skbaddr: ffff88009ff09100 } hitcount: 1 len: 52
+ { skbaddr: ffff88010f13ab00 } hitcount: 1 len: 168
+ { skbaddr: ffff88006a54f400 } hitcount: 1 len: 46
+ { skbaddr: ffff8800d2bcc500 } hitcount: 1 len: 260
+ { skbaddr: ffff880064505000 } hitcount: 1 len: 46
+ { skbaddr: ffff8800baf24e00 } hitcount: 1 len: 32
+ { skbaddr: ffff88009fe0ad00 } hitcount: 1 len: 46
+ { skbaddr: ffff8800d3edff00 } hitcount: 1 len: 44
+ { skbaddr: ffff88009fe0b400 } hitcount: 1 len: 168
+ { skbaddr: ffff8800a1c55a00 } hitcount: 1 len: 40
+ { skbaddr: ffff8800d2bcd100 } hitcount: 1 len: 40
+ { skbaddr: ffff880064505f00 } hitcount: 1 len: 174
+ { skbaddr: ffff8800a8bff200 } hitcount: 1 len: 160
+ { skbaddr: ffff880044e3cc00 } hitcount: 1 len: 76
+ { skbaddr: ffff8800a8bfe700 } hitcount: 1 len: 46
+ { skbaddr: ffff8800d2bcdc00 } hitcount: 1 len: 32
+ { skbaddr: ffff8800a1f64800 } hitcount: 1 len: 46
+ { skbaddr: ffff8800d2bcde00 } hitcount: 1 len: 988
+ { skbaddr: ffff88006a5dea00 } hitcount: 1 len: 46
+ { skbaddr: ffff88002e37a200 } hitcount: 1 len: 44
+ { skbaddr: ffff8800a1f32c00 } hitcount: 2 len: 676
+ { skbaddr: ffff88000ad52600 } hitcount: 2 len: 107
+ { skbaddr: ffff8800a1f91e00 } hitcount: 2 len: 92
+ { skbaddr: ffff8800af5a0200 } hitcount: 2 len: 142
+ { skbaddr: ffff8800d2bcc600 } hitcount: 2 len: 220
+ { skbaddr: ffff8800ba36f500 } hitcount: 2 len: 92
+ { skbaddr: ffff8800d021f800 } hitcount: 2 len: 92
+ { skbaddr: ffff8800a1f33600 } hitcount: 2 len: 675
+ { skbaddr: ffff8800a8bfff00 } hitcount: 3 len: 138
+ { skbaddr: ffff8800d62a1300 } hitcount: 3 len: 138
+ { skbaddr: ffff88002e37a100 } hitcount: 4 len: 184
+ { skbaddr: ffff880064504400 } hitcount: 4 len: 184
+ { skbaddr: ffff8800a8bfec00 } hitcount: 4 len: 184
+ { skbaddr: ffff88000ad53700 } hitcount: 5 len: 230
+ { skbaddr: ffff8800d2bcdb00 } hitcount: 5 len: 196
+ { skbaddr: ffff8800a1f90000 } hitcount: 6 len: 276
+ { skbaddr: ffff88006a54f900 } hitcount: 6 len: 276
+
+ Totals:
+ Hits: 81
+ Entries: 42
+ Dropped: 0
+ # event histogram
+ #
+ # trigger info: hist:name=foo:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 [active]
+ #
+
+ { skbaddr: ffff88000ad53500 } hitcount: 1 len: 46
+ { skbaddr: ffff8800af5a1500 } hitcount: 1 len: 76
+ { skbaddr: ffff8800d62a1900 } hitcount: 1 len: 46
+ { skbaddr: ffff8800d2bccb00 } hitcount: 1 len: 468
+ { skbaddr: ffff8800d3c69900 } hitcount: 1 len: 46
+ { skbaddr: ffff88009ff09100 } hitcount: 1 len: 52
+ { skbaddr: ffff88010f13ab00 } hitcount: 1 len: 168
+ { skbaddr: ffff88006a54f400 } hitcount: 1 len: 46
+ { skbaddr: ffff8800d2bcc500 } hitcount: 1 len: 260
+ { skbaddr: ffff880064505000 } hitcount: 1 len: 46
+ { skbaddr: ffff8800baf24e00 } hitcount: 1 len: 32
+ { skbaddr: ffff88009fe0ad00 } hitcount: 1 len: 46
+ { skbaddr: ffff8800d3edff00 } hitcount: 1 len: 44
+ { skbaddr: ffff88009fe0b400 } hitcount: 1 len: 168
+ { skbaddr: ffff8800a1c55a00 } hitcount: 1 len: 40
+ { skbaddr: ffff8800d2bcd100 } hitcount: 1 len: 40
+ { skbaddr: ffff880064505f00 } hitcount: 1 len: 174
+ { skbaddr: ffff8800a8bff200 } hitcount: 1 len: 160
+ { skbaddr: ffff880044e3cc00 } hitcount: 1 len: 76
+ { skbaddr: ffff8800a8bfe700 } hitcount: 1 len: 46
+ { skbaddr: ffff8800d2bcdc00 } hitcount: 1 len: 32
+ { skbaddr: ffff8800a1f64800 } hitcount: 1 len: 46
+ { skbaddr: ffff8800d2bcde00 } hitcount: 1 len: 988
+ { skbaddr: ffff88006a5dea00 } hitcount: 1 len: 46
+ { skbaddr: ffff88002e37a200 } hitcount: 1 len: 44
+ { skbaddr: ffff8800a1f32c00 } hitcount: 2 len: 676
+ { skbaddr: ffff88000ad52600 } hitcount: 2 len: 107
+ { skbaddr: ffff8800a1f91e00 } hitcount: 2 len: 92
+ { skbaddr: ffff8800af5a0200 } hitcount: 2 len: 142
+ { skbaddr: ffff8800d2bcc600 } hitcount: 2 len: 220
+ { skbaddr: ffff8800ba36f500 } hitcount: 2 len: 92
+ { skbaddr: ffff8800d021f800 } hitcount: 2 len: 92
+ { skbaddr: ffff8800a1f33600 } hitcount: 2 len: 675
+ { skbaddr: ffff8800a8bfff00 } hitcount: 3 len: 138
+ { skbaddr: ffff8800d62a1300 } hitcount: 3 len: 138
+ { skbaddr: ffff88002e37a100 } hitcount: 4 len: 184
+ { skbaddr: ffff880064504400 } hitcount: 4 len: 184
+ { skbaddr: ffff8800a8bfec00 } hitcount: 4 len: 184
+ { skbaddr: ffff88000ad53700 } hitcount: 5 len: 230
+ { skbaddr: ffff8800d2bcdb00 } hitcount: 5 len: 196
+ { skbaddr: ffff8800a1f90000 } hitcount: 6 len: 276
+ { skbaddr: ffff88006a54f900 } hitcount: 6 len: 276
+
+ Totals:
+ Hits: 81
+ Entries: 42
+ Dropped: 0
+
+ And here's an example that shows how to combine histogram data from
+ any two events even if they don't share any 'compatible' fields
+ other than 'hitcount' and 'stacktrace'. These commands create a
+ couple of triggers named 'bar' using those fields:
+
+ # echo 'hist:name=bar:key=stacktrace:val=hitcount' > \
+ /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+ # echo 'hist:name=bar:key=stacktrace:val=hitcount' > \
+ /sys/kernel/debug/tracing/events/net/netif_rx/trigger
+
+ And displaying the output of either shows some interesting if
+ somewhat confusing output:
+
+ # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
+ # cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
+
+ # event histogram
+ #
+ # trigger info: hist:name=bar:keys=stacktrace:vals=hitcount:sort=hitcount:size=2048 [active]
+ #
+
+ { stacktrace:
+ _do_fork+0x18e/0x330
+ kernel_thread+0x29/0x30
+ kthreadd+0x154/0x1b0
+ ret_from_fork+0x3f/0x70
+ } hitcount: 1
+ { stacktrace:
+ netif_rx_internal+0xb2/0xd0
+ netif_rx_ni+0x20/0x70
+ dev_loopback_xmit+0xaa/0xd0
+ ip_mc_output+0x126/0x240
+ ip_local_out_sk+0x31/0x40
+ igmp_send_report+0x1e9/0x230
+ igmp_timer_expire+0xe9/0x120
+ call_timer_fn+0x39/0xf0
+ run_timer_softirq+0x1e1/0x290
+ __do_softirq+0xfd/0x290
+ irq_exit+0x98/0xb0
+ smp_apic_timer_interrupt+0x4a/0x60
+ apic_timer_interrupt+0x6d/0x80
+ cpuidle_enter+0x17/0x20
+ call_cpuidle+0x3b/0x60
+ cpu_startup_entry+0x22d/0x310
+ } hitcount: 1
+ { stacktrace:
+ netif_rx_internal+0xb2/0xd0
+ netif_rx_ni+0x20/0x70
+ dev_loopback_xmit+0xaa/0xd0
+ ip_mc_output+0x17f/0x240
+ ip_local_out_sk+0x31/0x40
+ ip_send_skb+0x1a/0x50
+ udp_send_skb+0x13e/0x270
+ udp_sendmsg+0x2bf/0x980
+ inet_sendmsg+0x67/0xa0
+ sock_sendmsg+0x38/0x50
+ SYSC_sendto+0xef/0x170
+ SyS_sendto+0xe/0x10
+ entry_SYSCALL_64_fastpath+0x12/0x6a
+ } hitcount: 2
+ { stacktrace:
+ netif_rx_internal+0xb2/0xd0
+ netif_rx+0x1c/0x60
+ loopback_xmit+0x6c/0xb0
+ dev_hard_start_xmit+0x219/0x3a0
+ __dev_queue_xmit+0x415/0x4f0
+ dev_queue_xmit_sk+0x13/0x20
+ ip_finish_output2+0x237/0x340
+ ip_finish_output+0x113/0x1d0
+ ip_output+0x66/0xc0
+ ip_local_out_sk+0x31/0x40
+ ip_send_skb+0x1a/0x50
+ udp_send_skb+0x16d/0x270
+ udp_sendmsg+0x2bf/0x980
+ inet_sendmsg+0x67/0xa0
+ sock_sendmsg+0x38/0x50
+ ___sys_sendmsg+0x14e/0x270
+ } hitcount: 76
+ { stacktrace:
+ netif_rx_internal+0xb2/0xd0
+ netif_rx+0x1c/0x60
+ loopback_xmit+0x6c/0xb0
+ dev_hard_start_xmit+0x219/0x3a0
+ __dev_queue_xmit+0x415/0x4f0
+ dev_queue_xmit_sk+0x13/0x20
+ ip_finish_output2+0x237/0x340
+ ip_finish_output+0x113/0x1d0
+ ip_output+0x66/0xc0
+ ip_local_out_sk+0x31/0x40
+ ip_send_skb+0x1a/0x50
+ udp_send_skb+0x16d/0x270
+ udp_sendmsg+0x2bf/0x980
+ inet_sendmsg+0x67/0xa0
+ sock_sendmsg+0x38/0x50
+ ___sys_sendmsg+0x269/0x270
+ } hitcount: 77
+ { stacktrace:
+ netif_rx_internal+0xb2/0xd0
+ netif_rx+0x1c/0x60
+ loopback_xmit+0x6c/0xb0
+ dev_hard_start_xmit+0x219/0x3a0
+ __dev_queue_xmit+0x415/0x4f0
+ dev_queue_xmit_sk+0x13/0x20
+ ip_finish_output2+0x237/0x340
+ ip_finish_output+0x113/0x1d0
+ ip_output+0x66/0xc0
+ ip_local_out_sk+0x31/0x40
+ ip_send_skb+0x1a/0x50
+ udp_send_skb+0x16d/0x270
+ udp_sendmsg+0x2bf/0x980
+ inet_sendmsg+0x67/0xa0
+ sock_sendmsg+0x38/0x50
+ SYSC_sendto+0xef/0x170
+ } hitcount: 88
+ { stacktrace:
+ _do_fork+0x18e/0x330
+ SyS_clone+0x19/0x20
+ entry_SYSCALL_64_fastpath+0x12/0x6a
+ } hitcount: 244
+
+ Totals:
+ Hits: 489
+ Entries: 7
+ Dropped: 0
+
+
+2.2 Inter-event hist triggers
+-----------------------------
+
+Inter-event hist triggers are hist triggers that combine values from
+one or more other events and create a histogram using that data. Data
+from an inter-event histogram can in turn become the source for
+further combined histograms, thus providing a chain of related
+histograms, which is important for some applications.
+
+The most important example of an inter-event quantity that can be used
+in this manner is latency, which is simply a difference in timestamps
+between two events. Although latency is the most important
+inter-event quantity, note that because the support is completely
+general across the trace event subsystem, any event field can be used
+in an inter-event quantity.
+
+An example of a histogram that combines data from other histograms
+into a useful chain would be a 'wakeupswitch latency' histogram that
+combines a 'wakeup latency' histogram and a 'switch latency'
+histogram.
+
+Normally, a hist trigger specification consists of a (possibly
+compound) key along with one or more numeric values, which are
+continually updated sums associated with that key. A histogram
+specification in this case consists of individual key and value
+specifications that refer to trace event fields associated with a
+single event type.
+
+The inter-event hist trigger extension allows fields from multiple
+events to be referenced and combined into a multi-event histogram
+specification. In support of this overall goal, a few enabling
+features have been added to the hist trigger support:
+
+ - In order to compute an inter-event quantity, a value from one
+ event needs to saved and then referenced from another event. This
+ requires the introduction of support for histogram 'variables'.
+
+ - The computation of inter-event quantities and their combination
+ require some minimal amount of support for applying simple
+ expressions to variables (+ and -).
+
+ - A histogram consisting of inter-event quantities isn't logically a
+ histogram on either event (so having the 'hist' file for either
+ event host the histogram output doesn't really make sense). To
+ address the idea that the histogram is associated with a
+ combination of events, support is added allowing the creation of
+ 'synthetic' events that are events derived from other events.
+ These synthetic events are full-fledged events just like any other
+ and can be used as such, as for instance to create the
+ 'combination' histograms mentioned previously.
+
+ - A set of 'actions' can be associated with histogram entries -
+ these can be used to generate the previously mentioned synthetic
+ events, but can also be used for other purposes, such as for
+ example saving context when a 'max' latency has been hit.
+
+ - Trace events don't have a 'timestamp' associated with them, but
+ there is an implicit timestamp saved along with an event in the
+ underlying ftrace ring buffer. This timestamp is now exposed as a
+ a synthetic field named 'common_timestamp' which can be used in
+ histograms as if it were any other event field; it isn't an actual
+ field in the trace format but rather is a synthesized value that
+ nonetheless can be used as if it were an actual field. By default
+ it is in units of nanoseconds; appending '.usecs' to a
+ common_timestamp field changes the units to microseconds.
+
+A note on inter-event timestamps: If common_timestamp is used in a
+histogram, the trace buffer is automatically switched over to using
+absolute timestamps and the "global" trace clock, in order to avoid
+bogus timestamp differences with other clocks that aren't coherent
+across CPUs. This can be overridden by specifying one of the other
+trace clocks instead, using the "clock=XXX" hist trigger attribute,
+where XXX is any of the clocks listed in the tracing/trace_clock
+pseudo-file.
+
+These features are described in more detail in the following sections.
+
+2.2.1 Histogram Variables
+-------------------------
+
+Variables are simply named locations used for saving and retrieving
+values between matching events. A 'matching' event is defined as an
+event that has a matching key - if a variable is saved for a histogram
+entry corresponding to that key, any subsequent event with a matching
+key can access that variable.
+
+A variable's value is normally available to any subsequent event until
+it is set to something else by a subsequent event. The one exception
+to that rule is that any variable used in an expression is essentially
+'read-once' - once it's used by an expression in a subsequent event,
+it's reset to its 'unset' state, which means it can't be used again
+unless it's set again. This ensures not only that an event doesn't
+use an uninitialized variable in a calculation, but that that variable
+is used only once and not for any unrelated subsequent match.
+
+The basic syntax for saving a variable is to simply prefix a unique
+variable name not corresponding to any keyword along with an '=' sign
+to any event field.
+
+Either keys or values can be saved and retrieved in this way. This
+creates a variable named 'ts0' for a histogram entry with the key
+'next_pid':
+
+ # echo 'hist:keys=next_pid:vals=$ts0:ts0=common_timestamp ... >> \
+ event/trigger
+
+The ts0 variable can be accessed by any subsequent event having the
+same pid as 'next_pid'.
+
+Variable references are formed by prepending the variable name with
+the '$' sign. Thus for example, the ts0 variable above would be
+referenced as '$ts0' in expressions.
+
+Because 'vals=' is used, the common_timestamp variable value above
+will also be summed as a normal histogram value would (though for a
+timestamp it makes little sense).
+
+The below shows that a key value can also be saved in the same way:
+
+ # echo 'hist:timer_pid=common_pid:key=timer_pid ...' >> event/trigger
+
+If a variable isn't a key variable or prefixed with 'vals=', the
+associated event field will be saved in a variable but won't be summed
+as a value:
+
+ # echo 'hist:keys=next_pid:ts1=common_timestamp ... >> event/trigger
+
+Multiple variables can be assigned at the same time. The below would
+result in both ts0 and b being created as variables, with both
+common_timestamp and field1 additionally being summed as values:
+
+ # echo 'hist:keys=pid:vals=$ts0,$b:ts0=common_timestamp,b=field1 ... >> \
+ event/trigger
+
+Note that variable assignments can appear either preceding or
+following their use. The command below behaves identically to the
+command above:
+
+ # echo 'hist:keys=pid:ts0=common_timestamp,b=field1:vals=$ts0,$b ... >> \
+ event/trigger
+
+Any number of variables not bound to a 'vals=' prefix can also be
+assigned by simply separating them with colons. Below is the same
+thing but without the values being summed in the histogram:
+
+ # echo 'hist:keys=pid:ts0=common_timestamp:b=field1 ... >> event/trigger
+
+Variables set as above can be referenced and used in expressions on
+another event.
+
+For example, here's how a latency can be calculated:
+
+ # echo 'hist:keys=pid,prio:ts0=common_timestamp ... >> event1/trigger
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp-$ts0 ... >> event2/trigger
+
+In the first line above, the event's timetamp is saved into the
+variable ts0. In the next line, ts0 is subtracted from the second
+event's timestamp to produce the latency, which is then assigned into
+yet another variable, 'wakeup_lat'. The hist trigger below in turn
+makes use of the wakeup_lat variable to compute a combined latency
+using the same key and variable from yet another event:
+
+ # echo 'hist:key=pid:wakeupswitch_lat=$wakeup_lat+$switchtime_lat ... >> event3/trigger
+
+2.2.2 Synthetic Events
+----------------------
+
+Synthetic events are user-defined events generated from hist trigger
+variables or fields associated with one or more other events. Their
+purpose is to provide a mechanism for displaying data spanning
+multiple events consistent with the existing and already familiar
+usage for normal events.
+
+To define a synthetic event, the user writes a simple specification
+consisting of the name of the new event along with one or more
+variables and their types, which can be any valid field type,
+separated by semicolons, to the tracing/synthetic_events file.
+
+For instance, the following creates a new event named 'wakeup_latency'
+with 3 fields: lat, pid, and prio. Each of those fields is simply a
+variable reference to a variable on another event:
+
+ # echo 'wakeup_latency \
+ u64 lat; \
+ pid_t pid; \
+ int prio' >> \
+ /sys/kernel/debug/tracing/synthetic_events
+
+Reading the tracing/synthetic_events file lists all the currently
+defined synthetic events, in this case the event defined above:
+
+ # cat /sys/kernel/debug/tracing/synthetic_events
+ wakeup_latency u64 lat; pid_t pid; int prio
+
+An existing synthetic event definition can be removed by prepending
+the command that defined it with a '!':
+
+ # echo '!wakeup_latency u64 lat pid_t pid int prio' >> \
+ /sys/kernel/debug/tracing/synthetic_events
+
+At this point, there isn't yet an actual 'wakeup_latency' event
+instantiated in the event subsytem - for this to happen, a 'hist
+trigger action' needs to be instantiated and bound to actual fields
+and variables defined on other events (see Section 6.3.3 below).
+
+Once that is done, an event instance is created, and a histogram can
+be defined using it:
+
+ # echo 'hist:keys=pid,prio,lat.log2:sort=pid,lat' >> \
+ /sys/kernel/debug/tracing/events/synthetic/wakeup_latency/trigger
+
+The new event is created under the tracing/events/synthetic/ directory
+and looks and behaves just like any other event:
+
+ # ls /sys/kernel/debug/tracing/events/synthetic/wakeup_latency
+ enable filter format hist id trigger
+
+Like any other event, once a histogram is enabled for the event, the
+output can be displayed by reading the event's 'hist' file.
+
+2.2.3 Hist trigger 'actions'
+----------------------------
+
+A hist trigger 'action' is a function that's executed whenever a
+histogram entry is added or updated.
+
+The default 'action' if no special function is explicity specified is
+as it always has been, to simply update the set of values associated
+with an entry. Some applications, however, may want to perform
+additional actions at that point, such as generate another event, or
+compare and save a maximum.
+
+The following additional actions are available. To specify an action
+for a given event, simply specify the action between colons in the
+hist trigger specification.
+
+ - onmatch(matching.event).<synthetic_event_name>(param list)
+
+ The 'onmatch(matching.event).<synthetic_event_name>(params)' hist
+ trigger action is invoked whenever an event matches and the
+ histogram entry would be added or updated. It causes the named
+ synthetic event to be generated with the values given in the
+ 'param list'. The result is the generation of a synthetic event
+ that consists of the values contained in those variables at the
+ time the invoking event was hit.
+
+ The 'param list' consists of one or more parameters which may be
+ either variables or fields defined on either the 'matching.event'
+ or the target event. The variables or fields specified in the
+ param list may be either fully-qualified or unqualified. If a
+ variable is specified as unqualified, it must be unique between
+ the two events. A field name used as a param can be unqualified
+ if it refers to the target event, but must be fully qualified if
+ it refers to the matching event. A fully-qualified name is of the
+ form 'system.event_name.$var_name' or 'system.event_name.field'.
+
+ The 'matching.event' specification is simply the fully qualified
+ event name of the event that matches the target event for the
+ onmatch() functionality, in the form 'system.event_name'.
+
+ Finally, the number and type of variables/fields in the 'param
+ list' must match the number and types of the fields in the
+ synthetic event being generated.
+
+ As an example the below defines a simple synthetic event and uses
+ a variable defined on the sched_wakeup_new event as a parameter
+ when invoking the synthetic event. Here we define the synthetic
+ event:
+
+ # echo 'wakeup_new_test pid_t pid' >> \
+ /sys/kernel/debug/tracing/synthetic_events
+
+ # cat /sys/kernel/debug/tracing/synthetic_events
+ wakeup_new_test pid_t pid
+
+ The following hist trigger both defines the missing testpid
+ variable and specifies an onmatch() action that generates a
+ wakeup_new_test synthetic event whenever a sched_wakeup_new event
+ occurs, which because of the 'if comm == "cyclictest"' filter only
+ happens when the executable is cyclictest:
+
+ # echo 'hist:keys=$testpid:testpid=pid:onmatch(sched.sched_wakeup_new).\
+ wakeup_new_test($testpid) if comm=="cyclictest"' >> \
+ /sys/kernel/debug/tracing/events/sched/sched_wakeup_new/trigger
+
+ Creating and displaying a histogram based on those events is now
+ just a matter of using the fields and new synthetic event in the
+ tracing/events/synthetic directory, as usual:
+
+ # echo 'hist:keys=pid:sort=pid' >> \
+ /sys/kernel/debug/tracing/events/synthetic/wakeup_new_test/trigger
+
+ Running 'cyclictest' should cause wakeup_new events to generate
+ wakeup_new_test synthetic events which should result in histogram
+ output in the wakeup_new_test event's hist file:
+
+ # cat /sys/kernel/debug/tracing/events/synthetic/wakeup_new_test/hist
+
+ A more typical usage would be to use two events to calculate a
+ latency. The following example uses a set of hist triggers to
+ produce a 'wakeup_latency' histogram:
+
+ First, we define a 'wakeup_latency' synthetic event:
+
+ # echo 'wakeup_latency u64 lat; pid_t pid; int prio' >> \
+ /sys/kernel/debug/tracing/synthetic_events
+
+ Next, we specify that whenever we see a sched_waking event for a
+ cyclictest thread, save the timestamp in a 'ts0' variable:
+
+ # echo 'hist:keys=$saved_pid:saved_pid=pid:ts0=common_timestamp.usecs \
+ if comm=="cyclictest"' >> \
+ /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
+
+ Then, when the corresponding thread is actually scheduled onto the
+ CPU by a sched_switch event, calculate the latency and use that
+ along with another variable and an event field to generate a
+ wakeup_latency synthetic event:
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:\
+ onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,\
+ $saved_pid,next_prio) if next_comm=="cyclictest"' >> \
+ /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+ We also need to create a histogram on the wakeup_latency synthetic
+ event in order to aggregate the generated synthetic event data:
+
+ # echo 'hist:keys=pid,prio,lat:sort=pid,lat' >> \
+ /sys/kernel/debug/tracing/events/synthetic/wakeup_latency/trigger
+
+ Finally, once we've run cyclictest to actually generate some
+ events, we can see the output by looking at the wakeup_latency
+ synthetic event's hist file:
+
+ # cat /sys/kernel/debug/tracing/events/synthetic/wakeup_latency/hist
+
+ - onmax(var).save(field,.. .)
+
+ The 'onmax(var).save(field,...)' hist trigger action is invoked
+ whenever the value of 'var' associated with a histogram entry
+ exceeds the current maximum contained in that variable.
+
+ The end result is that the trace event fields specified as the
+ onmax.save() params will be saved if 'var' exceeds the current
+ maximum for that hist trigger entry. This allows context from the
+ event that exhibited the new maximum to be saved for later
+ reference. When the histogram is displayed, additional fields
+ displaying the saved values will be printed.
+
+ As an example the below defines a couple of hist triggers, one for
+ sched_waking and another for sched_switch, keyed on pid. Whenever
+ a sched_waking occurs, the timestamp is saved in the entry
+ corresponding to the current pid, and when the scheduler switches
+ back to that pid, the timestamp difference is calculated. If the
+ resulting latency, stored in wakeup_lat, exceeds the current
+ maximum latency, the values specified in the save() fields are
+ recoreded:
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs \
+ if comm=="cyclictest"' >> \
+ /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
+
+ # echo 'hist:keys=next_pid:\
+ wakeup_lat=common_timestamp.usecs-$ts0:\
+ onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm) \
+ if next_comm=="cyclictest"' >> \
+ /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+ When the histogram is displayed, the max value and the saved
+ values corresponding to the max are displayed following the rest
+ of the fields:
+
+ # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
+ { next_pid: 2255 } hitcount: 239
+ common_timestamp-ts0: 0
+ max: 27
+ next_comm: cyclictest
+ prev_pid: 0 prev_prio: 120 prev_comm: swapper/1
+
+ { next_pid: 2256 } hitcount: 2355
+ common_timestamp-ts0: 0
+ max: 49 next_comm: cyclictest
+ prev_pid: 0 prev_prio: 120 prev_comm: swapper/0
+
+ Totals:
+ Hits: 12970
+ Entries: 2
+ Dropped: 0
diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index ba976805853a..66bfd8396877 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -111,7 +111,7 @@ my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flag
my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
-my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*) gfp_flags=([A-Z_|]*)';
my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) classzone_idx=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_skipped=([0-9]*) nr_taken=([0-9]*) lru=([a-z_]*)';
my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) nr_dirty=([0-9]*) nr_writeback=([0-9]*) nr_congested=([0-9]*) nr_immediate=([0-9]*) nr_activate=([0-9]*) nr_ref_keep=([0-9]*) nr_unmap_fail=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)';
my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
@@ -201,7 +201,7 @@ $regex_kswapd_sleep = generate_traceevent_regex(
$regex_wakeup_kswapd = generate_traceevent_regex(
"vmscan/mm_vmscan_wakeup_kswapd",
$regex_wakeup_kswapd_default,
- "nid", "zid", "order");
+ "nid", "zid", "order", "gfp_flags");
$regex_lru_isolate = generate_traceevent_regex(
"vmscan/mm_vmscan_lru_isolate",
$regex_lru_isolate_default,
diff --git a/Documentation/virtual/kvm/00-INDEX b/Documentation/virtual/kvm/00-INDEX
index 3da73aabff5a..3492458a4ae8 100644
--- a/Documentation/virtual/kvm/00-INDEX
+++ b/Documentation/virtual/kvm/00-INDEX
@@ -1,7 +1,12 @@
00-INDEX
- this file.
+amd-memory-encryption.rst
+ - notes on AMD Secure Encrypted Virtualization feature and SEV firmware
+ command description
api.txt
- KVM userspace API.
+arm
+ - internal ABI between the kernel and HYP (for arm/arm64)
cpuid.txt
- KVM-specific cpuid leaves (x86).
devices/
@@ -26,6 +31,5 @@ s390-diag.txt
- Diagnose hypercall description (for IBM S/390)
timekeeping.txt
- timekeeping virtualization for x86-based architectures.
-amd-memory-encryption.txt
- - notes on AMD Secure Encrypted Virtualization feature and SEV firmware
- command description
+vcpu-requests.rst
+ - internal VCPU request API
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d6b3ff51a14f..1c7958b57fe9 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3480,7 +3480,7 @@ encrypted VMs.
Currently, this ioctl is used for issuing Secure Encrypted Virtualization
(SEV) commands on AMD Processors. The SEV commands are defined in
-Documentation/virtual/kvm/amd-memory-encryption.txt.
+Documentation/virtual/kvm/amd-memory-encryption.rst.
4.111 KVM_MEMORY_ENCRYPT_REG_REGION
@@ -3516,6 +3516,38 @@ Returns: 0 on success; -1 on error
This ioctl can be used to unregister the guest memory region registered
with KVM_MEMORY_ENCRYPT_REG_REGION ioctl above.
+4.113 KVM_HYPERV_EVENTFD
+
+Capability: KVM_CAP_HYPERV_EVENTFD
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_hyperv_eventfd (in)
+
+This ioctl (un)registers an eventfd to receive notifications from the guest on
+the specified Hyper-V connection id through the SIGNAL_EVENT hypercall, without
+causing a user exit. SIGNAL_EVENT hypercall with non-zero event flag number
+(bits 24-31) still triggers a KVM_EXIT_HYPERV_HCALL user exit.
+
+struct kvm_hyperv_eventfd {
+ __u32 conn_id;
+ __s32 fd;
+ __u32 flags;
+ __u32 padding[3];
+};
+
+The conn_id field should fit within 24 bits:
+
+#define KVM_HYPERV_CONN_ID_MASK 0x00ffffff
+
+The acceptable values for the flags field are:
+
+#define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0)
+
+Returns: 0 on success,
+ -EINVAL if conn_id or flags is outside the allowed range
+ -ENOENT on deassign if the conn_id isn't registered
+ -EEXIST on assign if the conn_id is already registered
+
5. The kvm_run structure
------------------------
@@ -3873,7 +3905,7 @@ in userspace.
__u64 kvm_dirty_regs;
union {
struct kvm_sync_regs regs;
- char padding[1024];
+ char padding[SYNC_REGS_SIZE_BYTES];
} s;
If KVM_CAP_SYNC_REGS is defined, these fields allow userspace to access
@@ -4078,6 +4110,46 @@ Once this is done the KVM_REG_MIPS_VEC_* and KVM_REG_MIPS_MSA_* registers can be
accessed, and the Config5.MSAEn bit is accessible via the KVM API and also from
the guest.
+6.74 KVM_CAP_SYNC_REGS
+Architectures: s390, x86
+Target: s390: always enabled, x86: vcpu
+Parameters: none
+Returns: x86: KVM_CHECK_EXTENSION returns a bit-array indicating which register
+sets are supported (bitfields defined in arch/x86/include/uapi/asm/kvm.h).
+
+As described above in the kvm_sync_regs struct info in section 5 (kvm_run):
+KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers
+without having to call SET/GET_*REGS". This reduces overhead by eliminating
+repeated ioctl calls for setting and/or getting register values. This is
+particularly important when userspace is making synchronous guest state
+modifications, e.g. when emulating and/or intercepting instructions in
+userspace.
+
+For s390 specifics, please refer to the source code.
+
+For x86:
+- the register sets to be copied out to kvm_run are selectable
+ by userspace (rather that all sets being copied out for every exit).
+- vcpu_events are available in addition to regs and sregs.
+
+For x86, the 'kvm_valid_regs' field of struct kvm_run is overloaded to
+function as an input bit-array field set by userspace to indicate the
+specific register sets to be copied out on the next exit.
+
+To indicate when userspace has modified values that should be copied into
+the vCPU, the all architecture bitarray field, 'kvm_dirty_regs' must be set.
+This is done using the same bitflags as for the 'kvm_valid_regs' field.
+If the dirty bit is not set, then the register set values will not be copied
+into the vCPU even if they've been modified.
+
+Unused bitfields in the bitarrays must be set to zero.
+
+struct kvm_sync_regs {
+ struct kvm_regs regs;
+ struct kvm_sregs sregs;
+ struct kvm_vcpu_events events;
+};
+
7. Capabilities that can be enabled on VMs
------------------------------------------
@@ -4286,6 +4358,26 @@ enables QEMU to build error log and branch to guest kernel registered
machine check handling routine. Without this capability KVM will
branch to guests' 0x200 interrupt vector.
+7.13 KVM_CAP_X86_DISABLE_EXITS
+
+Architectures: x86
+Parameters: args[0] defines which exits are disabled
+Returns: 0 on success, -EINVAL when args[0] contains invalid exits
+
+Valid bits in args[0] are
+
+#define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0)
+#define KVM_X86_DISABLE_EXITS_HLT (1 << 1)
+
+Enabling this capability on a VM provides userspace with a way to no
+longer intercept some instructions for improved latency in some
+workloads, and is suggested when vCPUs are associated to dedicated
+physical CPUs. More bits can be added in the future; userspace can
+just pass the KVM_CHECK_EXTENSION result to KVM_ENABLE_CAP to disable
+all such vmexits.
+
+Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits.
+
8. Other capabilities.
----------------------
@@ -4398,15 +4490,6 @@ reserved.
Both registers and addresses are 64-bits wide.
It will be possible to run 64-bit or 32-bit guest code.
-8.8 KVM_CAP_X86_GUEST_MWAIT
-
-Architectures: x86
-
-This capability indicates that guest using memory monotoring instructions
-(MWAIT/MWAITX) to stop the virtual CPU will not cause a VM exit. As such time
-spent while virtual CPU is halted in this way will then be accounted for as
-guest running time on the host (as opposed to e.g. HLT).
-
8.9 KVM_CAP_ARM_USER_IRQ
Architectures: arm, arm64
@@ -4483,3 +4566,33 @@ Parameters: none
This capability indicates if the flic device will be able to get/set the
AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows
to discover this without having to create a flic device.
+
+8.14 KVM_CAP_S390_PSW
+
+Architectures: s390
+
+This capability indicates that the PSW is exposed via the kvm_run structure.
+
+8.15 KVM_CAP_S390_GMAP
+
+Architectures: s390
+
+This capability indicates that the user space memory used as guest mapping can
+be anywhere in the user memory address space, as long as the memory slots are
+aligned and sized to a segment (1MB) boundary.
+
+8.16 KVM_CAP_S390_COW
+
+Architectures: s390
+
+This capability indicates that the user space memory used as guest mapping can
+use copy-on-write semantics as well as dirty pages tracking via read-only page
+tables.
+
+8.17 KVM_CAP_S390_BPB
+
+Architectures: s390
+
+This capability indicates that kvm will implement the interfaces to handle
+reset, migration and nested KVM for branch prediction blocking. The stfle
+facility 82 should not be provided to the guest without this capability.
diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 87a7506f31c2..d4f33eb805dd 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -23,8 +23,8 @@ This function queries the presence of KVM cpuid leafs.
function: define KVM_CPUID_FEATURES (0x40000001)
-returns : ebx, ecx, edx = 0
- eax = and OR'ed group of (1 << flag), where each flags is:
+returns : ebx, ecx
+ eax = an OR'ed group of (1 << flag), where each flags is:
flag || value || meaning
@@ -66,3 +66,14 @@ KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
|| || per-cpu warps are expected in
|| || kvmclock.
------------------------------------------------------------------------------
+
+ edx = an OR'ed group of (1 << flag), where each flags is:
+
+
+flag || value || meaning
+==================================================================================
+KVM_HINTS_DEDICATED || 0 || guest checks this feature bit to
+ || || determine if there is vCPU pinning
+ || || and there is no vCPU over-commitment,
+ || || allowing optimizations
+----------------------------------------------------------------------------------
diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
index 4d3aac9f4a5d..2d1d6f69e91b 100644
--- a/Documentation/vm/hmm.txt
+++ b/Documentation/vm/hmm.txt
@@ -1,152 +1,160 @@
Heterogeneous Memory Management (HMM)
-Transparently allow any component of a program to use any memory region of said
-program with a device without using device specific memory allocator. This is
-becoming a requirement to simplify the use of advance heterogeneous computing
-where GPU, DSP or FPGA are use to perform various computations.
-
-This document is divided as follow, in the first section i expose the problems
-related to the use of a device specific allocator. The second section i expose
-the hardware limitations that are inherent to many platforms. The third section
-gives an overview of HMM designs. The fourth section explains how CPU page-
-table mirroring works and what is HMM purpose in this context. Fifth section
-deals with how device memory is represented inside the kernel. Finaly the last
-section present the new migration helper that allow to leverage the device DMA
-engine.
-
-
-1) Problems of using device specific memory allocator:
-2) System bus, device memory characteristics
-3) Share address space and migration
+Provide infrastructure and helpers to integrate non-conventional memory (device
+memory like GPU on board memory) into regular kernel path, with the cornerstone
+of this being specialized struct page for such memory (see sections 5 to 7 of
+this document).
+
+HMM also provides optional helpers for SVM (Share Virtual Memory), i.e.,
+allowing a device to transparently access program address coherently with the
+CPU meaning that any valid pointer on the CPU is also a valid pointer for the
+device. This is becoming mandatory to simplify the use of advanced hetero-
+geneous computing where GPU, DSP, or FPGA are used to perform various
+computations on behalf of a process.
+
+This document is divided as follows: in the first section I expose the problems
+related to using device specific memory allocators. In the second section, I
+expose the hardware limitations that are inherent to many platforms. The third
+section gives an overview of the HMM design. The fourth section explains how
+CPU page-table mirroring works and the purpose of HMM in this context. The
+fifth section deals with how device memory is represented inside the kernel.
+Finally, the last section presents a new migration helper that allows lever-
+aging the device DMA engine.
+
+
+1) Problems of using a device specific memory allocator:
+2) I/O bus, device memory characteristics
+3) Shared address space and migration
4) Address space mirroring implementation and API
5) Represent and manage device memory from core kernel point of view
-6) Migrate to and from device memory
+6) Migration to and from device memory
7) Memory cgroup (memcg) and rss accounting
-------------------------------------------------------------------------------
-1) Problems of using device specific memory allocator:
-
-Device with large amount of on board memory (several giga bytes) like GPU have
-historically manage their memory through dedicated driver specific API. This
-creates a disconnect between memory allocated and managed by device driver and
-regular application memory (private anonymous, share memory or regular file
-back memory). From here on i will refer to this aspect as split address space.
-I use share address space to refer to the opposite situation ie one in which
-any memory region can be use by device transparently.
-
-Split address space because device can only access memory allocated through the
-device specific API. This imply that all memory object in a program are not
-equal from device point of view which complicate large program that rely on a
-wide set of libraries.
-
-Concretly this means that code that wants to leverage device like GPU need to
-copy object between genericly allocated memory (malloc, mmap private/share/)
-and memory allocated through the device driver API (this still end up with an
-mmap but of the device file).
-
-For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
-complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
-data-set need to re-map all the pointer relations between each of its elements.
-This is error prone and program gets harder to debug because of the duplicate
-data-set.
-
-Split address space also means that library can not transparently use data they
-are getting from core program or other library and thus each library might have
-to duplicate its input data-set using specific memory allocator. Large project
-suffer from this and waste resources because of the various memory copy.
-
-Duplicating each library API to accept as input or output memory allocted by
+1) Problems of using a device specific memory allocator:
+
+Devices with a large amount of on board memory (several gigabytes) like GPUs
+have historically managed their memory through dedicated driver specific APIs.
+This creates a disconnect between memory allocated and managed by a device
+driver and regular application memory (private anonymous, shared memory, or
+regular file backed memory). From here on I will refer to this aspect as split
+address space. I use shared address space to refer to the opposite situation:
+i.e., one in which any application memory region can be used by a device
+transparently.
+
+Split address space happens because device can only access memory allocated
+through device specific API. This implies that all memory objects in a program
+are not equal from the device point of view which complicates large programs
+that rely on a wide set of libraries.
+
+Concretely this means that code that wants to leverage devices like GPUs needs
+to copy object between generically allocated memory (malloc, mmap private, mmap
+share) and memory allocated through the device driver API (this still ends up
+with an mmap but of the device file).
+
+For flat data sets (array, grid, image, ...) this isn't too hard to achieve but
+complex data sets (list, tree, ...) are hard to get right. Duplicating a
+complex data set needs to re-map all the pointer relations between each of its
+elements. This is error prone and program gets harder to debug because of the
+duplicate data set and addresses.
+
+Split address space also means that libraries cannot transparently use data
+they are getting from the core program or another library and thus each library
+might have to duplicate its input data set using the device specific memory
+allocator. Large projects suffer from this and waste resources because of the
+various memory copies.
+
+Duplicating each library API to accept as input or output memory allocated by
each device specific allocator is not a viable option. It would lead to a
-combinatorial explosions in the library entry points.
+combinatorial explosion in the library entry points.
-Finaly with the advance of high level language constructs (in C++ but in other
-language too) it is now possible for compiler to leverage GPU or other devices
-without even the programmer knowledge. Some of compiler identified patterns are
-only do-able with a share address. It is as well more reasonable to use a share
-address space for all the other patterns.
+Finally, with the advance of high level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs and
+other devices without programmer knowledge. Some compiler identified patterns
+are only do-able with a shared address space. It is also more reasonable to use
+a shared address space for all other patterns.
-------------------------------------------------------------------------------
-2) System bus, device memory characteristics
+2) I/O bus, device memory characteristics
-System bus cripple share address due to few limitations. Most system bus only
-allow basic memory access from device to main memory, even cache coherency is
-often optional. Access to device memory from CPU is even more limited, most
-often than not it is not cache coherent.
+I/O buses cripple shared address spaces due to a few limitations. Most I/O
+buses only allow basic memory access from device to main memory; even cache
+coherency is often optional. Access to device memory from CPU is even more
+limited. More often than not, it is not cache coherent.
-If we only consider the PCIE bus than device can access main memory (often
-through an IOMMU) and be cache coherent with the CPUs. However it only allows
-a limited set of atomic operation from device on main memory. This is worse
-in the other direction the CPUs can only access a limited range of the device
-memory and can not perform atomic operations on it. Thus device memory can not
-be consider like regular memory from kernel point of view.
+If we only consider the PCIE bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from device on main memory. This is worse
+in the other direction: the CPU can only access a limited range of the device
+memory and cannot perform atomic operations on it. Thus device memory cannot
+be considered the same as regular memory from the kernel point of view.
Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
-and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
-The final limitation is latency, access to main memory from the device has an
-order of magnitude higher latency than when the device access its own memory.
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency. Access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own memory.
-Some platform are developing new system bus or additions/modifications to PCIE
-to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
+Some platforms are developing new I/O buses or additions/modifications to PCIE
+to address some of these limitations (OpenCAPI, CCIX). They mainly allow two-
way cache coherency between CPU and device and allow all atomic operations the
-architecture supports. Saddly not all platform are following this trends and
-some major architecture are left without hardware solutions to those problems.
+architecture supports. Sadly, not all platforms are following this trend and
+some major architectures are left without hardware solutions to these problems.
-So for share address space to make sense not only we must allow device to
-access any memory memory but we must also permit any memory to be migrated to
-device memory while device is using it (blocking CPU access while it happens).
+So for shared address space to make sense, not only must we allow devices to
+access any memory but we must also permit any memory to be migrated to device
+memory while device is using it (blocking CPU access while it happens).
-------------------------------------------------------------------------------
-3) Share address space and migration
+3) Shared address space and migration
HMM intends to provide two main features. First one is to share the address
-space by duplication the CPU page table into the device page table so same
-address point to same memory and this for any valid main memory address in
+space by duplicating the CPU page table in the device page table so the same
+address points to the same physical memory for any valid main memory address in
the process address space.
-To achieve this, HMM offer a set of helpers to populate the device page table
+To achieve this, HMM offers a set of helpers to populate the device page table
while keeping track of CPU page table updates. Device page table updates are
-not as easy as CPU page table updates. To update the device page table you must
-allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
-commands in it to perform the update (unmap, cache invalidations and flush,
-...). This can not be done through common code for all device. Hence why HMM
-provides helpers to factor out everything that can be while leaving the gory
-details to the device driver.
-
-The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
-allow to allocate a struct page for each page of the device memory. Those page
-are special because the CPU can not map them. They however allow to migrate
-main memory to device memory using exhisting migration mechanism and everything
-looks like if page was swap out to disk from CPU point of view. Using a struct
-page gives the easiest and cleanest integration with existing mm mechanisms.
-Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
-for the device memory and second to perform migration. Policy decision of what
-and when to migrate things is left to the device driver.
-
-Note that any CPU access to a device page trigger a page fault and a migration
-back to main memory ie when a page backing an given address A is migrated from
-a main memory page to a device page then any CPU access to address A trigger a
-page fault and initiate a migration back to main memory.
-
-
-With this two features, HMM not only allow a device to mirror a process address
-space and keeps both CPU and device page table synchronize, but also allow to
-leverage device memory by migrating part of data-set that is actively use by a
-device.
+not as easy as CPU page table updates. To update the device page table, you must
+allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
+specific commands in it to perform the update (unmap, cache invalidations, and
+flush, ...). This cannot be done through common code for all devices. Hence
+why HMM provides helpers to factor out everything that can be while leaving the
+hardware specific details to the device driver.
+
+The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those pages
+are special because the CPU cannot map them. However, they allow migrating
+main memory to device memory using existing migration mechanisms and everything
+looks like a page is swapped out to disk from the CPU point of view. Using a
+struct page gives the easiest and cleanest integration with existing mm mech-
+anisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE
+memory for the device memory and second to perform migration. Policy decisions
+of what and when to migrate things is left to the device driver.
+
+Note that any CPU access to a device page triggers a page fault and a migration
+back to main memory. For example, when a page backing a given CPU address A is
+migrated from a main memory page to a device page, then any CPU access to
+address A triggers a page fault and initiates a migration back to main memory.
+
+With these two features, HMM not only allows a device to mirror process address
+space and keeping both CPU and device page table synchronized, but also lever-
+ages device memory by migrating the part of the data set that is actively being
+used by the device.
-------------------------------------------------------------------------------
4) Address space mirroring implementation and API
-Address space mirroring main objective is to allow to duplicate range of CPU
-page table into a device page table and HMM helps keeping both synchronize. A
-device driver that want to mirror a process address space must start with the
+Address space mirroring's main objective is to allow duplication of a range of
+CPU page table into a device page table; HMM helps keep both synchronized. A
+device driver that wants to mirror a process address space must start with the
registration of an hmm_mirror struct:
int hmm_mirror_register(struct hmm_mirror *mirror,
@@ -154,9 +162,9 @@ registration of an hmm_mirror struct:
int hmm_mirror_register_locked(struct hmm_mirror *mirror,
struct mm_struct *mm);
-The locked variant is to be use when the driver is already holding the mmap_sem
-of the mm in write mode. The mirror struct has a set of callback that are use
-to propagate CPU page table:
+The locked variant is to be used when the driver is already holding mmap_sem
+of the mm in write mode. The mirror struct has a set of callbacks that are used
+to propagate CPU page tables:
struct hmm_mirror_ops {
/* sync_cpu_device_pagetables() - synchronize page tables
@@ -181,13 +189,13 @@ to propagate CPU page table:
unsigned long end);
};
-Device driver must perform update to the range following action (turn range
-read only, or fully unmap, ...). Once driver callback returns the device must
-be done with the update.
+The device driver must perform the update action to the range (mark range
+read only, or fully unmap, ...). The device must be done with the update before
+the driver callback returns.
-When device driver wants to populate a range of virtual address it can use
-either:
+When the device driver wants to populate a range of virtual addresses, it can
+use either:
int hmm_vma_get_pfns(struct vm_area_struct *vma,
struct hmm_range *range,
unsigned long start,
@@ -201,17 +209,19 @@ either:
bool write,
bool block);
-First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and
-will not trigger a page fault on missing or non present entry. The second one
-do trigger page fault on missing or read only entry if write parameter is true.
-Page fault use the generic mm page fault code path just like a CPU page fault.
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non-present entries.
+The second one does trigger a page fault on missing or read-only entry if the
+write parameter is true. Page faults use the generic mm page fault code path
+just like a CPU page fault.
-Both function copy CPU page table into their pfns array argument. Each entry in
-that array correspond to an address in the virtual range. HMM provide a set of
-flags to help driver identify special CPU page table entries.
+Both functions copy CPU page table entries into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
Locking with the update() callback is the most important aspect the driver must
-respect in order to keep things properly synchronize. The usage pattern is :
+respect in order to keep things properly synchronized. The usage pattern is:
int driver_populate_range(...)
{
@@ -233,43 +243,44 @@ respect in order to keep things properly synchronize. The usage pattern is :
return 0;
}
-The driver->update lock is the same lock that driver takes inside its update()
-callback. That lock must be call before hmm_vma_range_done() to avoid any race
-with a concurrent CPU page table update.
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held before hmm_vma_range_done() to avoid
+any race with a concurrent CPU page table update.
-HMM implements all this on top of the mmu_notifier API because we wanted to a
-simpler API and also to be able to perform optimization latter own like doing
-concurrent device update in multi-devices scenario.
+HMM implements all this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations latter on like doing
+concurrent device updates in multi-devices scenario.
-HMM also serve as an impedence missmatch between how CPU page table update are
-done (by CPU write to the page table and TLB flushes) from how device update
-their own page table. Device update is a multi-step process, first appropriate
-commands are write to a buffer, then this buffer is schedule for execution on
-the device. It is only once the device has executed commands in the buffer that
-the update is done. Creating and scheduling update command buffer can happen
-concurrently for multiple devices. Waiting for each device to report commands
-as executed is serialize (there is no point in doing this concurrently).
+HMM also serves as an impedance mismatch between how CPU page table updates
+are done (by CPU write to the page table and TLB flushes) and how devices
+update their own page table. Device updates are a multi-step process. First,
+appropriate commands are written to a buffer, then this buffer is scheduled for
+execution on the device. It is only once the device has executed commands in
+the buffer that the update is done. Creating and scheduling the update command
+buffer can happen concurrently for multiple devices. Waiting for each device to
+report commands as executed is serialized (there is no point in doing this
+concurrently).
-------------------------------------------------------------------------------
5) Represent and manage device memory from core kernel point of view
-Several differents design were try to support device memory. First one use
-device specific data structure to keep information about migrated memory and
-HMM hooked itself in various place of mm code to handle any access to address
-that were back by device memory. It turns out that this ended up replicating
-most of the fields of struct page and also needed many kernel code path to be
-updated to understand this new kind of memory.
+Several different designs were tried to support device memory. First one used
+a device specific data structure to keep information about migrated memory and
+HMM hooked itself in various places of mm code to handle any access to
+addresses that were backed by device memory. It turns out that this ended up
+replicating most of the fields of struct page and also needed many kernel code
+paths to be updated to understand this new kind of memory.
-Thing is most kernel code path never try to access the memory behind a page
-but only care about struct page contents. Because of this HMM switchted to
-directly using struct page for device memory which left most kernel code path
-un-aware of the difference. We only need to make sure that no one ever try to
-map those page from the CPU side.
+Most kernel code paths never try to access the memory behind a page
+but only care about struct page contents. Because of this, HMM switched to
+directly using struct page for device memory which left most kernel code paths
+unaware of the difference. We only need to make sure that no one ever tries to
+map those pages from the CPU side.
-HMM provide a set of helpers to register and hotplug device memory as a new
-region needing struct page. This is offer through a very simple API:
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing a struct page. This is offered through a very simple API:
struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
struct device *device,
@@ -289,18 +300,19 @@ The hmm_devmem_ops is where most of the important things are:
};
The first callback (free()) happens when the last reference on a device page is
-drop. This means the device page is now free and no longer use by anyone. The
-second callback happens whenever CPU try to access a device page which it can
-not do. This second callback must trigger a migration back to system memory.
+dropped. This means the device page is now free and no longer used by anyone.
+The second callback happens whenever the CPU tries to access a device page
+which it cannot do. This second callback must trigger a migration back to
+system memory.
-------------------------------------------------------------------------------
-6) Migrate to and from device memory
+6) Migration to and from device memory
-Because CPU can not access device memory, migration must use device DMA engine
-to perform copy from and to device memory. For this we need a new migration
-helper:
+Because the CPU cannot access device memory, migration must use the device DMA
+engine to perform copy from and to device memory. For this we need a new
+migration helper:
int migrate_vma(const struct migrate_vma_ops *ops,
struct vm_area_struct *vma,
@@ -311,15 +323,15 @@ helper:
unsigned long *dst,
void *private);
-Unlike other migration function it works on a range of virtual address, there
-is two reasons for that. First device DMA copy has a high setup overhead cost
+Unlike other migration functions it works on a range of virtual address, there
+are two reasons for that. First, device DMA copy has a high setup overhead cost
and thus batching multiple pages is needed as otherwise the migration overhead
-make the whole excersie pointless. The second reason is because driver trigger
-such migration base on range of address the device is actively accessing.
+makes the whole exercise pointless. The second reason is because the
+migration might be for a range of addresses the device is actively accessing.
-The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy())
-control destination memory allocation and copy operation. Second one is there
-to allow device driver to perform cleanup operation after migration.
+The migrate_vma_ops struct defines two callbacks. First one (alloc_and_copy())
+controls destination memory allocation and copy operation. Second one is there
+to allow the device driver to perform cleanup operations after migration.
struct migrate_vma_ops {
void (*alloc_and_copy)(struct vm_area_struct *vma,
@@ -336,19 +348,19 @@ to allow device driver to perform cleanup operation after migration.
void *private);
};
-It is important to stress that this migration helpers allow for hole in the
+It is important to stress that these migration helpers allow for holes in the
virtual address range. Some pages in the range might not be migrated for all
-the usual reasons (page is pin, page is lock, ...). This helper does not fail
-but just skip over those pages.
+the usual reasons (page is pinned, page is locked, ...). This helper does not
+fail but just skips over those pages.
-The alloc_and_copy() might as well decide to not migrate all pages in the
-range (for reasons under the callback control). For those the callback just
-have to leave the corresponding dst entry empty.
+The alloc_and_copy() might decide to not migrate all pages in the
+range (for reasons under the callback control). For those, the callback just
+has to leave the corresponding dst entry empty.
-Finaly the migration of the struct page might fails (for file back page) for
+Finally, the migration of the struct page might fail (for file backed page) for
various reasons (failure to freeze reference, or update page cache, ...). If
-that happens then the finalize_and_map() can catch any pages that was not
-migrated. Note those page were still copied to new page and thus we wasted
+that happens, then the finalize_and_map() can catch any pages that were not
+migrated. Note those pages were still copied to a new page and thus we wasted
bandwidth but this is considered as a rare event and a price that we are
willing to pay to keep all the code simpler.
@@ -358,27 +370,27 @@ willing to pay to keep all the code simpler.
7) Memory cgroup (memcg) and rss accounting
For now device memory is accounted as any regular page in rss counters (either
-anonymous if device page is use for anonymous, file if device page is use for
-file back page or shmem if device page is use for share memory). This is a
-deliberate choice to keep existing application that might start using device
-memory without knowing about it to keep runing unimpacted.
-
-Drawbacks is that OOM killer might kill an application using a lot of device
-memory and not a lot of regular system memory and thus not freeing much system
-memory. We want to gather more real world experience on how application and
-system react under memory pressure in the presence of device memory before
+anonymous if device page is used for anonymous, file if device page is used for
+file backed page or shmem if device page is used for shared memory). This is a
+deliberate choice to keep existing applications, that might start using device
+memory without knowing about it, running unimpacted.
+
+A drawback is that the OOM killer might kill an application using a lot of
+device memory and not a lot of regular system memory and thus not freeing much
+system memory. We want to gather more real world experience on how applications
+and system react under memory pressure in the presence of device memory before
deciding to account device memory differently.
-Same decision was made for memory cgroup. Device memory page are accounted
+Same decision was made for memory cgroup. Device memory pages are accounted
against same memory cgroup a regular page would be accounted to. This does
simplify migration to and from device memory. This also means that migration
-back from device memory to regular memory can not fail because it would
+back from device memory to regular memory cannot fail because it would
go above memory cgroup limit. We might revisit this choice latter on once we
-get more experience in how device memory is use and its impact on memory
+get more experience in how device memory is used and its impact on memory
resource control.
-Note that device memory can never be pin nor by device driver nor through GUP
+Note that device memory can never be pinned by device driver nor through GUP
and thus such memory is always free upon process exit. Or when last reference
-is drop in case of share memory or file back memory.
+is dropped in case of shared memory or file backed memory.
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
index 0478ae2ad44a..496868072e24 100644
--- a/Documentation/vm/page_migration
+++ b/Documentation/vm/page_migration
@@ -90,7 +90,7 @@ Steps:
1. Lock the page to be migrated
-2. Insure that writeback is complete.
+2. Ensure that writeback is complete.
3. Lock the new page that we want to move to. It is locked so that accesses to
this (not yet uptodate) page immediately lock while the move is in progress.
@@ -100,8 +100,8 @@ Steps:
mapcount is not zero then we do not migrate the page. All user space
processes that attempt to access the page will now wait on the page lock.
-5. The radix tree lock is taken. This will cause all processes trying
- to access the page via the mapping to block on the radix tree spinlock.
+5. The i_pages lock is taken. This will cause all processes trying
+ to access the page via the mapping to block on the spinlock.
6. The refcount of the page is examined and we back out if references remain
otherwise we know that we are the only one referencing this page.
@@ -114,12 +114,12 @@ Steps:
9. The radix tree is changed to point to the new page.
-10. The reference count of the old page is dropped because the radix tree
+10. The reference count of the old page is dropped because the address space
reference is gone. A reference to the new page is established because
- the new page is referenced to by the radix tree.
+ the new page is referenced by the address space.
-11. The radix tree lock is dropped. With that lookups in the mapping
- become possible again. Processes will move from spinning on the tree_lock
+11. The i_pages lock is dropped. With that lookups in the mapping
+ become possible again. Processes will move from spinning on the lock
to sleeping on the locked new page.
12. The page contents are copied to the new page.