diff options
Diffstat (limited to 'Documentation')
33 files changed, 564 insertions, 274 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 8dfc6708a257..f607367e642f 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -328,8 +328,6 @@ sysrq.txt - info on the magic SysRq key. telephony/ - directory with info on telephony (e.g. voice over IP) support. -time_interpolators.txt - - info on time interpolators. uml/ - directory with information about User Mode Linux. unicode.txt @@ -346,8 +344,6 @@ vm/ - directory with info on the Linux vm code. volatile-considered-harmful.txt - Why the "volatile" type class should not be used -voyager.txt - - guide to running Linux on the Voyager architecture. w1/ - directory with documents regarding the 1-wire (w1) subsystem. watchdog/ diff --git a/Documentation/ABI/testing/sysfs-devices-power b/Documentation/ABI/testing/sysfs-devices-power index 7628cd1bc36a..8ffbc25376a0 100644 --- a/Documentation/ABI/testing/sysfs-devices-power +++ b/Documentation/ABI/testing/sysfs-devices-power @@ -29,9 +29,8 @@ Description: "disabled" to it. For the devices that are not capable of generating system wakeup - events this file contains "\n". In that cases the user space - cannot modify the contents of this file and the device cannot be - enabled to wake up the system. + events this file is not present. In that case the device cannot + be enabled to wake up the system from sleep states. What: /sys/devices/.../power/control Date: January 2009 @@ -85,7 +84,7 @@ Description: The /sys/devices/.../wakeup_count attribute contains the number of signaled wakeup events associated with the device. This attribute is read-only. If the device is not enabled to wake up - the system from sleep states, this attribute is empty. + the system from sleep states, this attribute is not present. What: /sys/devices/.../power/wakeup_active_count Date: September 2010 @@ -95,7 +94,7 @@ Description: number of times the processing of wakeup events associated with the device was completed (at the kernel level). This attribute is read-only. If the device is not enabled to wake up the - system from sleep states, this attribute is empty. + system from sleep states, this attribute is not present. What: /sys/devices/.../power/wakeup_hit_count Date: September 2010 @@ -105,7 +104,8 @@ Description: number of times the processing of a wakeup event associated with the device might prevent the system from entering a sleep state. This attribute is read-only. If the device is not enabled to - wake up the system from sleep states, this attribute is empty. + wake up the system from sleep states, this attribute is not + present. What: /sys/devices/.../power/wakeup_active Date: September 2010 @@ -115,7 +115,7 @@ Description: or 0, depending on whether or not a wakeup event associated with the device is being processed (1). This attribute is read-only. If the device is not enabled to wake up the system from sleep - states, this attribute is empty. + states, this attribute is not present. What: /sys/devices/.../power/wakeup_total_time_ms Date: September 2010 @@ -125,7 +125,7 @@ Description: the total time of processing wakeup events associated with the device, in milliseconds. This attribute is read-only. If the device is not enabled to wake up the system from sleep states, - this attribute is empty. + this attribute is not present. What: /sys/devices/.../power/wakeup_max_time_ms Date: September 2010 @@ -135,7 +135,7 @@ Description: the maximum time of processing a single wakeup event associated with the device, in milliseconds. This attribute is read-only. If the device is not enabled to wake up the system from sleep - states, this attribute is empty. + states, this attribute is not present. What: /sys/devices/.../power/wakeup_last_time_ms Date: September 2010 @@ -146,7 +146,7 @@ Description: signaling the last wakeup event associated with the device, in milliseconds. This attribute is read-only. If the device is not enabled to wake up the system from sleep states, this - attribute is empty. + attribute is not present. What: /sys/devices/.../power/autosuspend_delay_ms Date: September 2010 diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle index 8bb37237ebd2..1cd3478e5834 100644 --- a/Documentation/CodingStyle +++ b/Documentation/CodingStyle @@ -659,7 +659,7 @@ There are a number of driver model diagnostic macros in <linux/device.h> which you should use to make sure messages are matched to the right device and driver, and are tagged with the right level: dev_err(), dev_warn(), dev_info(), and so forth. For messages that aren't associated with a -particular device, <linux/kernel.h> defines pr_debug() and pr_info(). +particular device, <linux/printk.h> defines pr_debug() and pr_info(). Coming up with good debugging messages can be quite a challenge; and once you have them, they can be a huge help for remote troubleshooting. Such @@ -819,6 +819,3 @@ language C, URL: http://www.open-std.org/JTC1/SC22/WG14/ Kernel CodingStyle, by greg@kroah.com at OLS 2002: http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/ --- -Last updated on 2007-July-13. - diff --git a/Documentation/DocBook/filesystems.tmpl b/Documentation/DocBook/filesystems.tmpl index 5e87ad58c0b5..f51f28531b8d 100644 --- a/Documentation/DocBook/filesystems.tmpl +++ b/Documentation/DocBook/filesystems.tmpl @@ -82,6 +82,11 @@ </sect1> </chapter> + <chapter id="fs_events"> + <title>Events based on file descriptors</title> +!Efs/eventfd.c + </chapter> + <chapter id="sysfs"> <title>The Filesystem for Exporting Kernel Objects</title> !Efs/sysfs/file.c diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index cfaac34c4557..6ef692667e2f 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -849,6 +849,37 @@ All: lockdep-checked RCU-protected pointer access See the comment headers in the source code (or the docbook generated from them) for more information. +However, given that there are no fewer than four families of RCU APIs +in the Linux kernel, how do you choose which one to use? The following +list can be helpful: + +a. Will readers need to block? If so, you need SRCU. + +b. What about the -rt patchset? If readers would need to block + in an non-rt kernel, you need SRCU. If readers would block + in a -rt kernel, but not in a non-rt kernel, SRCU is not + necessary. + +c. Do you need to treat NMI handlers, hardirq handlers, + and code segments with preemption disabled (whether + via preempt_disable(), local_irq_save(), local_bh_disable(), + or some other mechanism) as if they were explicit RCU readers? + If so, you need RCU-sched. + +d. Do you need RCU grace periods to complete even in the face + of softirq monopolization of one or more of the CPUs? For + example, is your code subject to network-based denial-of-service + attacks? If so, you need RCU-bh. + +e. Is your workload too update-intensive for normal use of + RCU, but inappropriate for other synchronization mechanisms? + If so, consider SLAB_DESTROY_BY_RCU. But please be careful! + +f. Otherwise, use RCU. + +Of course, this all assumes that you have determined that RCU is in fact +the right tool for your job. + 8. ANSWERS TO QUICK QUIZZES diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index 44b8b7af8019..cbdfb7d9455b 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -349,6 +349,10 @@ To mount a cgroup hierarchy with all available subsystems, type: The "xxx" is not interpreted by the cgroup code, but will appear in /proc/mounts so may be any useful identifying string that you like. +Note: Some subsystems do not work without some user input first. For instance, +if cpusets are enabled the user will have to populate the cpus and mems files +for each new cgroup created before that group can be used. + To mount a cgroup hierarchy with just the cpuset and memory subsystems, type: # mount -t cgroup -o cpuset,memory hier1 /dev/cgroup @@ -426,6 +430,14 @@ You can attach the current shell task by echoing 0: # echo 0 > tasks +Note: Since every task is always a member of exactly one cgroup in each +mounted hierarchy, to remove a task from its current cgroup you must +move it into a new cgroup (possibly the root cgroup) by writing to the +new cgroup's tasks file. + +Note: If the ns cgroup is active, moving a process to another cgroup can +fail. + 2.3 Mounting hierarchies by name -------------------------------- diff --git a/Documentation/devicetree/00-INDEX b/Documentation/devicetree/00-INDEX new file mode 100644 index 000000000000..b78f691fd847 --- /dev/null +++ b/Documentation/devicetree/00-INDEX @@ -0,0 +1,10 @@ +Documentation for device trees, a data structure by which bootloaders pass +hardware layout to Linux in a device-independent manner, simplifying hardware +probing. This subsystem is maintained by Grant Likely +<grant.likely@secretlab.ca> and has a mailing list at +https://lists.ozlabs.org/listinfo/devicetree-discuss + +00-INDEX + - this file +booting-without-of.txt + - Booting Linux without Open Firmware, describes history and format of device trees. diff --git a/Documentation/devicetree/bindings/i2c/ce4100-i2c.txt b/Documentation/devicetree/bindings/i2c/ce4100-i2c.txt new file mode 100644 index 000000000000..569b16248514 --- /dev/null +++ b/Documentation/devicetree/bindings/i2c/ce4100-i2c.txt @@ -0,0 +1,93 @@ +CE4100 I2C +---------- + +CE4100 has one PCI device which is described as the I2C-Controller. This +PCI device has three PCI-bars, each bar contains a complete I2C +controller. So we have a total of three independent I2C-Controllers +which share only an interrupt line. +The driver is probed via the PCI-ID and is gathering the information of +attached devices from the devices tree. +Grant Likely recommended to use the ranges property to map the PCI-Bar +number to its physical address and to use this to find the child nodes +of the specific I2C controller. This were his exact words: + + Here's where the magic happens. Each entry in + ranges describes how the parent pci address space + (middle group of 3) is translated to the local + address space (first group of 2) and the size of + each range (last cell). In this particular case, + the first cell of the local address is chosen to be + 1:1 mapped to the BARs, and the second is the + offset from be base of the BAR (which would be + non-zero if you had 2 or more devices mapped off + the same BAR) + + ranges allows the address mapping to be described + in a way that the OS can interpret without + requiring custom device driver code. + +This is an example which is used on FalconFalls: +------------------------------------------------ + i2c-controller@b,2 { + #address-cells = <2>; + #size-cells = <1>; + compatible = "pci8086,2e68.2", + "pci8086,2e68", + "pciclass,ff0000", + "pciclass,ff00"; + + reg = <0x15a00 0x0 0x0 0x0 0x0>; + interrupts = <16 1>; + + /* as described by Grant, the first number in the group of + * three is the bar number followed by the 64bit bar address + * followed by size of the mapping. The bar address + * requires also a valid translation in parents ranges + * property. + */ + ranges = <0 0 0x02000000 0 0xdffe0500 0x100 + 1 0 0x02000000 0 0xdffe0600 0x100 + 2 0 0x02000000 0 0xdffe0700 0x100>; + + i2c@0 { + #address-cells = <1>; + #size-cells = <0>; + compatible = "intel,ce4100-i2c-controller"; + + /* The first number in the reg property is the + * number of the bar + */ + reg = <0 0 0x100>; + + /* This I2C controller has no devices */ + }; + + i2c@1 { + #address-cells = <1>; + #size-cells = <0>; + compatible = "intel,ce4100-i2c-controller"; + reg = <1 0 0x100>; + + /* This I2C controller has one gpio controller */ + gpio@26 { + #gpio-cells = <2>; + compatible = "ti,pcf8575"; + reg = <0x26>; + gpio-controller; + }; + }; + + i2c@2 { + #address-cells = <1>; + #size-cells = <0>; + compatible = "intel,ce4100-i2c-controller"; + reg = <2 0 0x100>; + + gpio@26 { + #gpio-cells = <2>; + compatible = "ti,pcf8575"; + reg = <0x26>; + gpio-controller; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/rtc/rtc-cmos.txt b/Documentation/devicetree/bindings/rtc/rtc-cmos.txt new file mode 100644 index 000000000000..7382989b3052 --- /dev/null +++ b/Documentation/devicetree/bindings/rtc/rtc-cmos.txt @@ -0,0 +1,28 @@ + Motorola mc146818 compatible RTC +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Required properties: + - compatible : "motorola,mc146818" + - reg : should contain registers location and length. + +Optional properties: + - interrupts : should contain interrupt. + - interrupt-parent : interrupt source phandle. + - ctrl-reg : Contains the initial value of the control register also + called "Register B". + - freq-reg : Contains the initial value of the frequency register also + called "Regsiter A". + +"Register A" and "B" are usually initialized by the firmware (BIOS for +instance). If this is not done, it can be performed by the driver. + +ISA Example: + + rtc@70 { + compatible = "motorola,mc146818"; + interrupts = <8 3>; + interrupt-parent = <&ioapic1>; + ctrl-reg = <2>; + freq-reg = <0x26>; + reg = <1 0x70 2>; + }; diff --git a/Documentation/devicetree/bindings/x86/ce4100.txt b/Documentation/devicetree/bindings/x86/ce4100.txt new file mode 100644 index 000000000000..b49ae593a60b --- /dev/null +++ b/Documentation/devicetree/bindings/x86/ce4100.txt @@ -0,0 +1,38 @@ +CE4100 Device Tree Bindings +--------------------------- + +The CE4100 SoC uses for in core peripherals the following compatible +format: <vendor>,<chip>-<device>. +Many of the "generic" devices like HPET or IO APIC have the ce4100 +name in their compatible property because they first appeared in this +SoC. + +The CPU node +------------ + cpu@0 { + device_type = "cpu"; + compatible = "intel,ce4100"; + reg = <0>; + lapic = <&lapic0>; + }; + +The reg property describes the CPU number. The lapic property points to +the local APIC timer. + +The SoC node +------------ + +This node describes the in-core peripherals. Required property: + compatible = "intel,ce4100-cp"; + +The PCI node +------------ +This node describes the PCI bus on the SoC. Its property should be + compatible = "intel,ce4100-pci", "pci"; + +If the OS is using the IO-APIC for interrupt routing then the reported +interrupt numbers for devices is no longer true. In order to obtain the +correct interrupt number, the child node which represents the device has +to contain the interrupt property. Besides the interrupt property it has +to contain at least the reg property containing the PCI bus address and +compatible property according to "PCI Bus Binding Revision 2.1". diff --git a/Documentation/devicetree/bindings/x86/interrupt.txt b/Documentation/devicetree/bindings/x86/interrupt.txt new file mode 100644 index 000000000000..7d19f494f19a --- /dev/null +++ b/Documentation/devicetree/bindings/x86/interrupt.txt @@ -0,0 +1,26 @@ +Interrupt chips +--------------- + +* Intel I/O Advanced Programmable Interrupt Controller (IO APIC) + + Required properties: + -------------------- + compatible = "intel,ce4100-ioapic"; + #interrupt-cells = <2>; + + Device's interrupt property: + + interrupts = <P S>; + + The first number (P) represents the interrupt pin which is wired to the + IO APIC. The second number (S) represents the sense of interrupt which + should be configured and can be one of: + 0 - Edge Rising + 1 - Level Low + 2 - Level High + 3 - Edge Falling + +* Local APIC + Required property: + + compatible = "intel,ce4100-lapic"; diff --git a/Documentation/devicetree/bindings/x86/timer.txt b/Documentation/devicetree/bindings/x86/timer.txt new file mode 100644 index 000000000000..c688af58e3bd --- /dev/null +++ b/Documentation/devicetree/bindings/x86/timer.txt @@ -0,0 +1,6 @@ +Timers +------ + +* High Precision Event Timer (HPET) + Required property: + compatible = "intel,ce4100-hpet"; diff --git a/Documentation/devicetree/booting-without-of.txt b/Documentation/devicetree/booting-without-of.txt index 28b1c9d3d351..55fd2623445b 100644 --- a/Documentation/devicetree/booting-without-of.txt +++ b/Documentation/devicetree/booting-without-of.txt @@ -13,6 +13,7 @@ Table of Contents I - Introduction 1) Entry point for arch/powerpc + 2) Entry point for arch/x86 II - The DT block format 1) Header @@ -225,6 +226,25 @@ it with special cases. cannot support both configurations with Book E and configurations with classic Powerpc architectures. +2) Entry point for arch/x86 +------------------------------- + + There is one single 32bit entry point to the kernel at code32_start, + the decompressor (the real mode entry point goes to the same 32bit + entry point once it switched into protected mode). That entry point + supports one calling convention which is documented in + Documentation/x86/boot.txt + The physical pointer to the device-tree block (defined in chapter II) + is passed via setup_data which requires at least boot protocol 2.09. + The type filed is defined as + + #define SETUP_DTB 2 + + This device-tree is used as an extension to the "boot page". As such it + does not parse / consider data which is already covered by the boot + page. This includes memory size, reserved ranges, command line arguments + or initrd address. It simply holds information which can not be retrieved + otherwise like interrupt routing or a list of devices behind an I2C bus. II - The DT block format ======================== diff --git a/Documentation/hwmon/jc42 b/Documentation/hwmon/jc42 index 0e76ef12e4c6..a22ecf48f255 100644 --- a/Documentation/hwmon/jc42 +++ b/Documentation/hwmon/jc42 @@ -51,7 +51,8 @@ Supported chips: * JEDEC JC 42.4 compliant temperature sensor chips Prefix: 'jc42' Addresses scanned: I2C 0x18 - 0x1f - Datasheet: - + Datasheet: + http://www.jedec.org/sites/default/files/docs/4_01_04R19.pdf Author: Guenter Roeck <guenter.roeck@ericsson.com> @@ -60,7 +61,11 @@ Author: Description ----------- -This driver implements support for JEDEC JC 42.4 compliant temperature sensors. +This driver implements support for JEDEC JC 42.4 compliant temperature sensors, +which are used on many DDR3 memory modules for mobile devices and servers. Some +systems use the sensor to prevent memory overheating by automatically throttling +the memory controller. + The driver auto-detects the chips listed above, but can be manually instantiated to support other JC 42.4 compliant chips. @@ -81,15 +86,19 @@ limits. The chip supports only a single register to configure the hysteresis, which applies to all limits. This register can be written by writing into temp1_crit_hyst. Other hysteresis attributes are read-only. +If the BIOS has configured the sensor for automatic temperature management, it +is likely that it has locked the registers, i.e., that the temperature limits +cannot be changed. + Sysfs entries ------------- temp1_input Temperature (RO) -temp1_min Minimum temperature (RW) -temp1_max Maximum temperature (RW) -temp1_crit Critical high temperature (RW) +temp1_min Minimum temperature (RO or RW) +temp1_max Maximum temperature (RO or RW) +temp1_crit Critical high temperature (RO or RW) -temp1_crit_hyst Critical hysteresis temperature (RW) +temp1_crit_hyst Critical hysteresis temperature (RO or RW) temp1_max_hyst Maximum hysteresis temperature (RO) temp1_min_alarm Temperature low alarm diff --git a/Documentation/hwmon/k10temp b/Documentation/hwmon/k10temp index 6526eee525a6..d2b56a4fd1f5 100644 --- a/Documentation/hwmon/k10temp +++ b/Documentation/hwmon/k10temp @@ -9,6 +9,8 @@ Supported chips: Socket S1G3: Athlon II, Sempron, Turion II * AMD Family 11h processors: Socket S1G2: Athlon (X2), Sempron (X2), Turion X2 (Ultra) +* AMD Family 12h processors: "Llano" +* AMD Family 14h processors: "Brazos" (C/E/G-Series) Prefix: 'k10temp' Addresses scanned: PCI space @@ -17,10 +19,14 @@ Supported chips: http://support.amd.com/us/Processor_TechDocs/31116.pdf BIOS and Kernel Developer's Guide (BKDG) for AMD Family 11h Processors: http://support.amd.com/us/Processor_TechDocs/41256.pdf + BIOS and Kernel Developer's Guide (BKDG) for AMD Family 14h Models 00h-0Fh Processors: + http://support.amd.com/us/Processor_TechDocs/43170.pdf Revision Guide for AMD Family 10h Processors: http://support.amd.com/us/Processor_TechDocs/41322.pdf Revision Guide for AMD Family 11h Processors: http://support.amd.com/us/Processor_TechDocs/41788.pdf + Revision Guide for AMD Family 14h Models 00h-0Fh Processors: + http://support.amd.com/us/Processor_TechDocs/47534.pdf AMD Family 11h Processor Power and Thermal Data Sheet for Notebooks: http://support.amd.com/us/Processor_TechDocs/43373.pdf AMD Family 10h Server and Workstation Processor Power and Thermal Data Sheet: @@ -34,7 +40,7 @@ Description ----------- This driver permits reading of the internal temperature sensor of AMD -Family 10h and 11h processors. +Family 10h/11h/12h/14h processors. All these processors have a sensor, but on those for Socket F or AM2+, the sensor may return inconsistent values (erratum 319). The driver diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 89835a4766a6..738c6fda3fb0 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -144,6 +144,11 @@ a fixed number of characters. This limit depends on the architecture and is between 256 and 4096 characters. It is defined in the file ./include/asm/setup.h as COMMAND_LINE_SIZE. +Finally, the [KMG] suffix is commonly described after a number of kernel +parameter values. These 'K', 'M', and 'G' letters represent the _binary_ +multipliers 'Kilo', 'Mega', and 'Giga', equalling 2^10, 2^20, and 2^30 +bytes respectively. Such letter suffixes can also be entirely omitted. + acpi= [HW,ACPI,X86] Advanced Configuration and Power Interface @@ -545,16 +550,20 @@ and is between 256 and 4096 characters. It is defined in the file Format: <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>] - crashkernel=nn[KMG]@ss[KMG] - [KNL] Reserve a chunk of physical memory to - hold a kernel to switch to with kexec on panic. + crashkernel=size[KMG][@offset[KMG]] + [KNL] Using kexec, Linux can switch to a 'crash kernel' + upon panic. This parameter reserves the physical + memory region [offset, offset + size] for that kernel + image. If '@offset' is omitted, then a suitable offset + is selected automatically. Check + Documentation/kdump/kdump.txt for further details. crashkernel=range1:size1[,range2:size2,...][@offset] [KNL] Same as above, but depends on the memory in the running system. The syntax of range is start-[end] where start and end are both a memory unit (amount[KMG]). See also - Documentation/kdump/kdump.txt for a example. + Documentation/kdump/kdump.txt for an example. cs89x0_dma= [HW,NET] Format: <dma> @@ -1262,10 +1271,9 @@ and is between 256 and 4096 characters. It is defined in the file 6 (KERN_INFO) informational 7 (KERN_DEBUG) debug-level messages - log_buf_len=n Sets the size of the printk ring buffer, in bytes. - Format: { n | nk | nM } - n must be a power of two. The default size - is set in the kernel config file. + log_buf_len=n[KMG] Sets the size of the printk ring buffer, + in bytes. n must be a power of two. The default + size is set in the kernel config file. logo.nologo [FB] Disables display of the built-in Linux logo. This may be used to provide more screen space for @@ -2436,6 +2444,10 @@ and is between 256 and 4096 characters. It is defined in the file <deci-seconds>: poll all this frequency 0: no polling (default) + threadirqs [KNL] + Force threading of all interrupt handlers except those + marked explicitely IRQF_NO_THREAD. + topology= [S390] Format: {off | on} Specify if the kernel should make use of the cpu diff --git a/Documentation/keys-request-key.txt b/Documentation/keys-request-key.txt index 09b55e461740..69686ad12c66 100644 --- a/Documentation/keys-request-key.txt +++ b/Documentation/keys-request-key.txt @@ -127,14 +127,15 @@ This is because process A's keyrings can't simply be attached to of them, and (b) it requires the same UID/GID/Groups all the way through. -====================== -NEGATIVE INSTANTIATION -====================== +==================================== +NEGATIVE INSTANTIATION AND REJECTION +==================================== Rather than instantiating a key, it is possible for the possessor of an authorisation key to negatively instantiate a key that's under construction. This is a short duration placeholder that causes any attempt at re-requesting -the key whilst it exists to fail with error ENOKEY. +the key whilst it exists to fail with error ENOKEY if negated or the specified +error if rejected. This is provided to prevent excessive repeated spawning of /sbin/request-key processes for a key that will never be obtainable. diff --git a/Documentation/keys.txt b/Documentation/keys.txt index e4dbbdb1bd96..6523a9e6f293 100644 --- a/Documentation/keys.txt +++ b/Documentation/keys.txt @@ -637,6 +637,9 @@ The keyctl syscall functions are: long keyctl(KEYCTL_INSTANTIATE, key_serial_t key, const void *payload, size_t plen, key_serial_t keyring); + long keyctl(KEYCTL_INSTANTIATE_IOV, key_serial_t key, + const struct iovec *payload_iov, unsigned ioc, + key_serial_t keyring); If the kernel calls back to userspace to complete the instantiation of a key, userspace should use this call to supply data for the key before the @@ -652,11 +655,16 @@ The keyctl syscall functions are: The payload and plen arguments describe the payload data as for add_key(). + The payload_iov and ioc arguments describe the payload data in an iovec + array instead of a single buffer. + (*) Negatively instantiate a partially constructed key. long keyctl(KEYCTL_NEGATE, key_serial_t key, unsigned timeout, key_serial_t keyring); + long keyctl(KEYCTL_REJECT, key_serial_t key, + unsigned timeout, unsigned error, key_serial_t keyring); If the kernel calls back to userspace to complete the instantiation of a key, userspace should use this call mark the key as negative before the @@ -669,6 +677,10 @@ The keyctl syscall functions are: that keyring, however all the constraints applying in KEYCTL_LINK apply in this case too. + If the key is rejected, future searches for it will return the specified + error code until the rejected key expires. Negating the key is the same + as rejecting the key with ENOKEY as the error code. + (*) Set the default request-key destination keyring. @@ -1062,6 +1074,13 @@ The structure has a number of fields, some of which are mandatory: viable. + (*) int (*vet_description)(const char *description); + + This optional method is called to vet a key description. If the key type + doesn't approve of the key description, it may return an error, otherwise + it should return 0. + + (*) int (*instantiate)(struct key *key, const void *data, size_t datalen); This method is called to attach a payload to a key during construction. @@ -1231,10 +1250,11 @@ hand the request off to (perhaps a path held in placed in another key by, for example, the KDE desktop manager). The program (or whatever it calls) should finish construction of the key by -calling KEYCTL_INSTANTIATE, which also permits it to cache the key in one of -the keyrings (probably the session ring) before returning. Alternatively, the -key can be marked as negative with KEYCTL_NEGATE; this also permits the key to -be cached in one of the keyrings. +calling KEYCTL_INSTANTIATE or KEYCTL_INSTANTIATE_IOV, which also permits it to +cache the key in one of the keyrings (probably the session ring) before +returning. Alternatively, the key can be marked as negative with KEYCTL_NEGATE +or KEYCTL_REJECT; this also permits the key to be cached in one of the +keyrings. If it returns with the key remaining in the unconstructed state, the key will be marked as being negative, it will be added to the session keyring, and an diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 631ad2f1b229..f0d3a8026a56 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -21,6 +21,7 @@ Contents: - SMP barrier pairing. - Examples of memory barrier sequences. - Read memory barriers vs load speculation. + - Transitivity (*) Explicit kernel barriers. @@ -959,6 +960,63 @@ the speculation will be cancelled and the value reloaded: retrieved : : +-------+ +TRANSITIVITY +------------ + +Transitivity is a deeply intuitive notion about ordering that is not +always provided by real computer systems. The following example +demonstrates transitivity (also called "cumulativity"): + + CPU 1 CPU 2 CPU 3 + ======================= ======================= ======================= + { X = 0, Y = 0 } + STORE X=1 LOAD X STORE Y=1 + <general barrier> <general barrier> + LOAD Y LOAD X + +Suppose that CPU 2's load from X returns 1 and its load from Y returns 0. +This indicates that CPU 2's load from X in some sense follows CPU 1's +store to X and that CPU 2's load from Y in some sense preceded CPU 3's +store to Y. The question is then "Can CPU 3's load from X return 0?" + +Because CPU 2's load from X in some sense came after CPU 1's store, it +is natural to expect that CPU 3's load from X must therefore return 1. +This expectation is an example of transitivity: if a load executing on +CPU A follows a load from the same variable executing on CPU B, then +CPU A's load must either return the same value that CPU B's load did, +or must return some later value. + +In the Linux kernel, use of general memory barriers guarantees +transitivity. Therefore, in the above example, if CPU 2's load from X +returns 1 and its load from Y returns 0, then CPU 3's load from X must +also return 1. + +However, transitivity is -not- guaranteed for read or write barriers. +For example, suppose that CPU 2's general barrier in the above example +is changed to a read barrier as shown below: + + CPU 1 CPU 2 CPU 3 + ======================= ======================= ======================= + { X = 0, Y = 0 } + STORE X=1 LOAD X STORE Y=1 + <read barrier> <general barrier> + LOAD Y LOAD X + +This substitution destroys transitivity: in this example, it is perfectly +legal for CPU 2's load from X to return 1, its load from Y to return 0, +and CPU 3's load from X to return 0. + +The key point is that although CPU 2's read barrier orders its pair +of loads, it does not guarantee to order CPU 1's store. Therefore, if +this example runs on a system where CPUs 1 and 2 share a store buffer +or a level of cache, CPU 2 might have early access to CPU 1's writes. +General barriers are therefore required to ensure that all CPUs agree +on the combined order of CPU 1's and CPU 2's accesses. + +To reiterate, if your code requires transitivity, use general barriers +throughout. + + ======================== EXPLICIT KERNEL BARRIERS ======================== diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index fe5c099b8fc8..4edd78dfb362 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -40,8 +40,6 @@ decnet.txt - info on using the DECnet networking layer in Linux. depca.txt - the Digital DEPCA/EtherWORKS DE1?? and DE2?? LANCE Ethernet driver -dgrs.txt - - the Digi International RightSwitch SE-X Ethernet driver dmfe.txt - info on the Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver. e100.txt @@ -50,8 +48,6 @@ e1000.txt - info on Intel's E1000 line of gigabit ethernet boards eql.txt - serial IP load balancing -ethertap.txt - - the Ethertap user space packet reception and transmission driver ewrk3.txt - the Digital EtherWORKS 3 DE203/4/5 Ethernet driver filter.txt @@ -104,8 +100,6 @@ tuntap.txt - TUN/TAP device driver, allowing user space Rx/Tx of packets. vortex.txt - info on using 3Com Vortex (3c590, 3c592, 3c595, 3c597) Ethernet cards. -wavelan.txt - - AT&T GIS (nee NCR) WaveLAN card: An Ethernet-like radio transceiver x25.txt - general info on X.25 development. x25-iface.txt diff --git a/Documentation/networking/Makefile b/Documentation/networking/Makefile index 5aba7a33aeeb..24c308dd3fd1 100644 --- a/Documentation/networking/Makefile +++ b/Documentation/networking/Makefile @@ -4,6 +4,8 @@ obj- := dummy.o # List of programs to build hostprogs-y := ifenslave +HOSTCFLAGS_ifenslave.o += -I$(objtree)/usr/include + # Tell kbuild to always build the programs always := $(hostprogs-y) diff --git a/Documentation/networking/dns_resolver.txt b/Documentation/networking/dns_resolver.txt index aefd1e681804..04ca06325b08 100644 --- a/Documentation/networking/dns_resolver.txt +++ b/Documentation/networking/dns_resolver.txt @@ -61,7 +61,6 @@ before the more general line given above as the first match is the one taken. create dns_resolver foo:* * /usr/sbin/dns.foo %k - ===== USAGE ===== @@ -104,6 +103,14 @@ implemented in the module can be called after doing: returned also. +=============================== +READING DNS KEYS FROM USERSPACE +=============================== + +Keys of dns_resolver type can be read from userspace using keyctl_read() or +"keyctl read/print/pipe". + + ========= MECHANISM ========= diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt index 57080cd74575..f023ba6bba62 100644 --- a/Documentation/power/devices.txt +++ b/Documentation/power/devices.txt @@ -1,6 +1,6 @@ Device Power Management -Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc. +Copyright (c) 2010-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc. Copyright (c) 2010 Alan Stern <stern@rowland.harvard.edu> @@ -159,18 +159,18 @@ matter, and the kernel is responsible for keeping track of it. By contrast, whether or not a wakeup-capable device should issue wakeup events is a policy decision, and it is managed by user space through a sysfs attribute: the power/wakeup file. User space can write the strings "enabled" or "disabled" to -set or clear the should_wakeup flag, respectively. Reads from the file will -return the corresponding string if can_wakeup is true, but if can_wakeup is -false then reads will return an empty string, to indicate that the device -doesn't support wakeup events. (But even though the file appears empty, writes -will still affect the should_wakeup flag.) +set or clear the "should_wakeup" flag, respectively. This file is only present +for wakeup-capable devices (i.e. devices whose "can_wakeup" flags are set) +and is created (or removed) by device_set_wakeup_capable(). Reads from the +file will return the corresponding string. The device_may_wakeup() routine returns true only if both flags are set. -Drivers should check this routine when putting devices in a low-power state -during a system sleep transition, to see whether or not to enable the devices' -wakeup mechanisms. However for runtime power management, wakeup events should -be enabled whenever the device and driver both support them, regardless of the -should_wakeup flag. +This information is used by subsystems, like the PCI bus type code, to see +whether or not to enable the devices' wakeup mechanisms. If device wakeup +mechanisms are enabled or disabled directly by drivers, they also should use +device_may_wakeup() to decide what to do during a system sleep transition. +However for runtime power management, wakeup events should be enabled whenever +the device and driver both support them, regardless of the should_wakeup flag. /sys/devices/.../power/control files @@ -249,23 +249,18 @@ various phases always run after tasks have been frozen and before they are unfrozen. Furthermore, the *_noirq phases run at a time when IRQ handlers have been disabled (except for those marked with the IRQ_WAKEUP flag). -Most phases use bus, type, and class callbacks (that is, methods defined in -dev->bus->pm, dev->type->pm, and dev->class->pm). The prepare and complete -phases are exceptions; they use only bus callbacks. When multiple callbacks -are used in a phase, they are invoked in the order: <class, type, bus> during -power-down transitions and in the opposite order during power-up transitions. -For example, during the suspend phase the PM core invokes - - dev->class->pm.suspend(dev); - dev->type->pm.suspend(dev); - dev->bus->pm.suspend(dev); - -before moving on to the next device, whereas during the resume phase the core -invokes - - dev->bus->pm.resume(dev); - dev->type->pm.resume(dev); - dev->class->pm.resume(dev); +All phases use bus, type, or class callbacks (that is, methods defined in +dev->bus->pm, dev->type->pm, or dev->class->pm). These callbacks are mutually +exclusive, so if the device type provides a struct dev_pm_ops object pointed to +by its pm field (i.e. both dev->type and dev->type->pm are defined), the +callbacks included in that object (i.e. dev->type->pm) will be used. Otherwise, +if the class provides a struct dev_pm_ops object pointed to by its pm field +(i.e. both dev->class and dev->class->pm are defined), the PM core will use the +callbacks from that object (i.e. dev->class->pm). Finally, if the pm fields of +both the device type and class objects are NULL (or those objects do not exist), +the callbacks provided by the bus (that is, the callbacks from dev->bus->pm) +will be used (this allows device types to override callbacks provided by bus +types or classes if necessary). These callbacks may in turn invoke device- or driver-specific methods stored in dev->driver->pm, but they don't have to. @@ -507,6 +502,49 @@ routines. Nevertheless, different callback pointers are used in case there is a situation where it actually matters. +Device Power Domains +-------------------- +Sometimes devices share reference clocks or other power resources. In those +cases it generally is not possible to put devices into low-power states +individually. Instead, a set of devices sharing a power resource can be put +into a low-power state together at the same time by turning off the shared +power resource. Of course, they also need to be put into the full-power state +together, by turning the shared power resource on. A set of devices with this +property is often referred to as a power domain. + +Support for power domains is provided through the pwr_domain field of struct +device. This field is a pointer to an object of type struct dev_power_domain, +defined in include/linux/pm.h, providing a set of power management callbacks +analogous to the subsystem-level and device driver callbacks that are executed +for the given device during all power transitions, in addition to the respective +subsystem-level callbacks. Specifically, the power domain "suspend" callbacks +(i.e. ->runtime_suspend(), ->suspend(), ->freeze(), ->poweroff(), etc.) are +executed after the analogous subsystem-level callbacks, while the power domain +"resume" callbacks (i.e. ->runtime_resume(), ->resume(), ->thaw(), ->restore, +etc.) are executed before the analogous subsystem-level callbacks. Error codes +returned by the "suspend" and "resume" power domain callbacks are ignored. + +Power domain ->runtime_idle() callback is executed before the subsystem-level +->runtime_idle() callback and the result returned by it is not ignored. Namely, +if it returns error code, the subsystem-level ->runtime_idle() callback will not +be called and the helper function rpm_idle() executing it will return error +code. This mechanism is intended to help platforms where saving device state +is a time consuming operation and should only be carried out if all devices +in the power domain are idle, before turning off the shared power resource(s). +Namely, the power domain ->runtime_idle() callback may return error code until +the pm_runtime_idle() helper (or its asychronous version) has been called for +all devices in the power domain (it is recommended that the returned error code +be -EBUSY in those cases), preventing the subsystem-level ->runtime_idle() +callback from being run prematurely. + +The support for device power domains is only relevant to platforms needing to +use the same subsystem-level (e.g. platform bus type) and device driver power +management callbacks in many different power domain configurations and wanting +to avoid incorporating the support for power domains into the subsystem-level +callbacks. The other platforms need not implement it or take it into account +in any way. + + System Devices -------------- System devices (sysdevs) follow a slightly different API, which can be found in diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt index ffe55ffa540a..654097b130b4 100644 --- a/Documentation/power/runtime_pm.txt +++ b/Documentation/power/runtime_pm.txt @@ -1,6 +1,6 @@ Run-time Power Management Framework for I/O Devices -(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc. +(C) 2009-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc. (C) 2010 Alan Stern <stern@rowland.harvard.edu> 1. Introduction @@ -44,11 +44,12 @@ struct dev_pm_ops { }; The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks are -executed by the PM core for either the bus type, or device type (if the bus -type's callback is not defined), or device class (if the bus type's and device -type's callbacks are not defined) of given device. The bus type, device type -and device class callbacks are referred to as subsystem-level callbacks in what -follows. +executed by the PM core for either the device type, or the class (if the device +type's struct dev_pm_ops object does not exist), or the bus type (if the +device type's and class' struct dev_pm_ops objects do not exist) of the given +device (this allows device types to override callbacks provided by bus types or +classes if necessary). The bus type, device type and class callbacks are +referred to as subsystem-level callbacks in what follows. By default, the callbacks are always invoked in process context with interrupts enabled. However, subsystems can use the pm_runtime_irq_safe() helper function diff --git a/Documentation/power/states.txt b/Documentation/power/states.txt index 34800cc521bf..4416b28630df 100644 --- a/Documentation/power/states.txt +++ b/Documentation/power/states.txt @@ -62,12 +62,12 @@ setup via another operating system for it to use. Despite the inconvenience, this method requires minimal work by the kernel, since the firmware will also handle restoring memory contents on resume. -For suspend-to-disk, a mechanism called swsusp called 'swsusp' (Swap -Suspend) is used to write memory contents to free swap space. -swsusp has some restrictive requirements, but should work in most -cases. Some, albeit outdated, documentation can be found in -Documentation/power/swsusp.txt. Alternatively, userspace can do most -of the actual suspend to disk work, see userland-swsusp.txt. +For suspend-to-disk, a mechanism called 'swsusp' (Swap Suspend) is used +to write memory contents to free swap space. swsusp has some restrictive +requirements, but should work in most cases. Some, albeit outdated, +documentation can be found in Documentation/power/swsusp.txt. +Alternatively, userspace can do most of the actual suspend to disk work, +see userland-swsusp.txt. Once memory state is written to disk, the system may either enter a low-power state (like ACPI S4), or it may simply power down. Powering diff --git a/Documentation/powerpc/00-INDEX b/Documentation/powerpc/00-INDEX index e3960b8c8689..5620fb5ac425 100644 --- a/Documentation/powerpc/00-INDEX +++ b/Documentation/powerpc/00-INDEX @@ -5,8 +5,6 @@ please mail me. 00-INDEX - this file -booting-without-of.txt - - Booting the Linux/ppc kernel without Open Firmware cpu_features.txt - info on how we support a variety of CPUs with minimal compile-time options. @@ -16,8 +14,6 @@ hvcs.txt - IBM "Hypervisor Virtual Console Server" Installation Guide mpc52xx.txt - Linux 2.6.x on MPC52xx family -mpc52xx-device-tree-bindings.txt - - MPC5200 Device Tree Bindings sound.txt - info on sound support under Linux/PPC zImage_layout.txt diff --git a/Documentation/rtc.txt b/Documentation/rtc.txt index 9104c1062084..250160469d83 100644 --- a/Documentation/rtc.txt +++ b/Documentation/rtc.txt @@ -178,38 +178,29 @@ RTC class framework, but can't be supported by the older driver. setting the longer alarm time and enabling its IRQ using a single request (using the same model as EFI firmware). - * RTC_UIE_ON, RTC_UIE_OFF ... if the RTC offers IRQs, it probably - also offers update IRQs whenever the "seconds" counter changes. - If needed, the RTC framework can emulate this mechanism. + * RTC_UIE_ON, RTC_UIE_OFF ... if the RTC offers IRQs, the RTC framework + will emulate this mechanism. - * RTC_PIE_ON, RTC_PIE_OFF, RTC_IRQP_SET, RTC_IRQP_READ ... another - feature often accessible with an IRQ line is a periodic IRQ, issued - at settable frequencies (usually 2^N Hz). + * RTC_PIE_ON, RTC_PIE_OFF, RTC_IRQP_SET, RTC_IRQP_READ ... these icotls + are emulated via a kernel hrtimer. In many cases, the RTC alarm can be a system wake event, used to force Linux out of a low power sleep state (or hibernation) back to a fully operational state. For example, a system could enter a deep power saving state until it's time to execute some scheduled tasks. -Note that many of these ioctls need not actually be implemented by your -driver. The common rtc-dev interface handles many of these nicely if your -driver returns ENOIOCTLCMD. Some common examples: +Note that many of these ioctls are handled by the common rtc-dev interface. +Some common examples: * RTC_RD_TIME, RTC_SET_TIME: the read_time/set_time functions will be called with appropriate values. - * RTC_ALM_SET, RTC_ALM_READ, RTC_WKALM_SET, RTC_WKALM_RD: the - set_alarm/read_alarm functions will be called. + * RTC_ALM_SET, RTC_ALM_READ, RTC_WKALM_SET, RTC_WKALM_RD: gets or sets + the alarm rtc_timer. May call the set_alarm driver function. - * RTC_IRQP_SET, RTC_IRQP_READ: the irq_set_freq function will be called - to set the frequency while the framework will handle the read for you - since the frequency is stored in the irq_freq member of the rtc_device - structure. Your driver needs to initialize the irq_freq member during - init. Make sure you check the requested frequency is in range of your - hardware in the irq_set_freq function. If it isn't, return -EINVAL. If - you cannot actually change the frequency, do not define irq_set_freq. + * RTC_IRQP_SET, RTC_IRQP_READ: These are emulated by the generic code. - * RTC_PIE_ON, RTC_PIE_OFF: the irq_set_state function will be called. + * RTC_PIE_ON, RTC_PIE_OFF: These are also emulated by the generic code. If all else fails, check out the rtc-test.c driver! diff --git a/Documentation/spinlocks.txt b/Documentation/spinlocks.txt index 178c831b907d..2e3c64b1a6a5 100644 --- a/Documentation/spinlocks.txt +++ b/Documentation/spinlocks.txt @@ -86,7 +86,7 @@ to change the variables it has to get an exclusive write lock. The routines look the same as above: - rwlock_t xxx_lock = RW_LOCK_UNLOCKED; + rwlock_t xxx_lock = __RW_LOCK_UNLOCKED(xxx_lock); unsigned long flags; @@ -196,25 +196,3 @@ appropriate: For static initialization, use DEFINE_SPINLOCK() / DEFINE_RWLOCK() or __SPIN_LOCK_UNLOCKED() / __RW_LOCK_UNLOCKED() as appropriate. - -SPIN_LOCK_UNLOCKED and RW_LOCK_UNLOCKED are deprecated. These interfere -with lockdep state tracking. - -Most of the time, you can simply turn: - static spinlock_t xxx_lock = SPIN_LOCK_UNLOCKED; -into: - static DEFINE_SPINLOCK(xxx_lock); - -Static structure member variables go from: - - struct foo bar { - .lock = SPIN_LOCK_UNLOCKED; - }; - -to: - - struct foo bar { - .lock = __SPIN_LOCK_UNLOCKED(bar.lock); - }; - -Declaration of static rw_locks undergo a similar transformation. diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt index 62682500878a..4af0614147ef 100644 --- a/Documentation/sysctl/fs.txt +++ b/Documentation/sysctl/fs.txt @@ -88,20 +88,19 @@ you might want to raise the limit. file-max & file-nr: -The kernel allocates file handles dynamically, but as yet it -doesn't free them again. - The value in file-max denotes the maximum number of file- handles that the Linux kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit. -Historically, the three values in file-nr denoted the number of -allocated file handles, the number of allocated but unused file -handles, and the maximum number of file handles. Linux 2.6 always -reports 0 as the number of free file handles -- this is not an -error, it just means that the number of allocated file handles -exactly matches the number of used file handles. +Historically,the kernel was able to allocate file handles +dynamically, but not to free them again. The three values in +file-nr denote the number of allocated file handles, the number +of allocated but unused file handles, and the maximum number of +file handles. Linux 2.6 always reports 0 as the number of free +file handles -- this is not an error, it just means that the +number of allocated file handles exactly matches the number of +used file handles. Attempts to allocate more file descriptors than file-max are reported with printk, look for "VFS: file-max limit <number> diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt index dc52bd442c92..79fcafc7fd64 100644 --- a/Documentation/trace/ftrace-design.txt +++ b/Documentation/trace/ftrace-design.txt @@ -247,6 +247,13 @@ You need very few things to get the syscalls tracing in an arch. - Support the TIF_SYSCALL_TRACEPOINT thread flags. - Put the trace_sys_enter() and trace_sys_exit() tracepoints calls from ptrace in the ptrace syscalls tracing path. +- If the system call table on this arch is more complicated than a simple array + of addresses of the system calls, implement an arch_syscall_addr to return + the address of a given system call. +- If the symbol names of the system calls do not match the function names on + this arch, define ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and + implement arch_syscall_match_sym_name with the appropriate logic to return + true if the function name corresponds with the symbol name. - Tag this arch as HAVE_SYSCALL_TRACEPOINTS. diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index 557c1edeccaf..1ebc24cf9a55 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -80,11 +80,11 @@ of ftrace. Here is a list of some of the key files: tracers listed here can be configured by echoing their name into current_tracer. - tracing_enabled: + tracing_on: - This sets or displays whether the current_tracer - is activated and tracing or not. Echo 0 into this - file to disable the tracer or 1 to enable it. + This sets or displays whether writing to the trace + ring buffer is enabled. Echo 0 into this file to disable + the tracer or 1 to enable it. trace: @@ -202,10 +202,6 @@ Here is the list of current tracers that may be configured. to draw a graph of function calls similar to C code source. - "sched_switch" - - Traces the context switches and wakeups between tasks. - "irqsoff" Traces the areas that disable interrupts and saves @@ -273,39 +269,6 @@ format, the function name that was traced "path_put" and the parent function that called this function "path_walk". The timestamp is the time at which the function was entered. -The sched_switch tracer also includes tracing of task wakeups -and context switches. - - ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S - ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S - ksoftirqd/1-7 [01] 1453.070013: 7:115:R ==> 10:115:R - events/1-10 [01] 1453.070013: 10:115:S ==> 2916:115:R - kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R - ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R - -Wake ups are represented by a "+" and the context switches are -shown as "==>". The format is: - - Context switches: - - Previous task Next Task - - <pid>:<prio>:<state> ==> <pid>:<prio>:<state> - - Wake ups: - - Current task Task waking up - - <pid>:<prio>:<state> + <pid>:<prio>:<state> - -The prio is the internal kernel priority, which is the inverse -of the priority that is usually displayed by user-space tools. -Zero represents the highest priority (99). Prio 100 starts the -"nice" priorities with 100 being equal to nice -20 and 139 being -nice 19. The prio "140" is reserved for the idle task which is -the lowest priority thread (pid 0). - - Latency trace format -------------------- @@ -491,78 +454,10 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] latencies, as described in "Latency trace format". -sched_switch ------------- - -This tracer simply records schedule switches. Here is an example -of how to use it. - - # echo sched_switch > current_tracer - # echo 1 > tracing_enabled - # sleep 1 - # echo 0 > tracing_enabled - # cat trace - -# tracer: sched_switch -# -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-3997 [01] 240.132281: 3997:120:R + 4055:120:R - bash-3997 [01] 240.132284: 3997:120:R ==> 4055:120:R - sleep-4055 [01] 240.132371: 4055:120:S ==> 3997:120:R - bash-3997 [01] 240.132454: 3997:120:R + 4055:120:S - bash-3997 [01] 240.132457: 3997:120:R ==> 4055:120:R - sleep-4055 [01] 240.132460: 4055:120:D ==> 3997:120:R - bash-3997 [01] 240.132463: 3997:120:R + 4055:120:D - bash-3997 [01] 240.132465: 3997:120:R ==> 4055:120:R - <idle>-0 [00] 240.132589: 0:140:R + 4:115:S - <idle>-0 [00] 240.132591: 0:140:R ==> 4:115:R - ksoftirqd/0-4 [00] 240.132595: 4:115:S ==> 0:140:R - <idle>-0 [00] 240.132598: 0:140:R + 4:115:S - <idle>-0 [00] 240.132599: 0:140:R ==> 4:115:R - ksoftirqd/0-4 [00] 240.132603: 4:115:S ==> 0:140:R - sleep-4055 [01] 240.133058: 4055:120:S ==> 3997:120:R - [...] - - -As we have discussed previously about this format, the header -shows the name of the trace and points to the options. The -"FUNCTION" is a misnomer since here it represents the wake ups -and context switches. - -The sched_switch file only lists the wake ups (represented with -'+') and context switches ('==>') with the previous task or -current task first followed by the next task or task waking up. -The format for both of these is PID:KERNEL-PRIO:TASK-STATE. -Remember that the KERNEL-PRIO is the inverse of the actual -priority with zero (0) being the highest priority and the nice -values starting at 100 (nice -20). Below is a quick chart to map -the kernel priority to user land priorities. - - Kernel Space User Space - =============================================================== - 0(high) to 98(low) user RT priority 99(high) to 1(low) - with SCHED_RR or SCHED_FIFO - --------------------------------------------------------------- - 99 sched_priority is not used in scheduling - decisions(it must be specified as 0) - --------------------------------------------------------------- - 100(high) to 139(low) user nice -20(high) to 19(low) - --------------------------------------------------------------- - 140 idle task priority - --------------------------------------------------------------- - -The task states are: - - R - running : wants to run, may not actually be running - S - sleep : process is waiting to be woken up (handles signals) - D - disk sleep (uninterruptible sleep) : process must be woken up - (ignores signals) - T - stopped : process suspended - t - traced : process is being traced (with something like gdb) - Z - zombie : process waiting to be cleaned up - X - unknown - + overwrite - This controls what happens when the trace buffer is + full. If "1" (default), the oldest events are + discarded and overwritten. If "0", then the newest + events are discarded. ftrace_enabled -------------- @@ -607,10 +502,10 @@ an example: # echo irqsoff > current_tracer # echo latency-format > trace_options # echo 0 > tracing_max_latency - # echo 1 > tracing_enabled + # echo 1 > tracing_on # ls -ltr [...] - # echo 0 > tracing_enabled + # echo 0 > tracing_on # cat trace # tracer: irqsoff # @@ -715,10 +610,10 @@ is much like the irqsoff tracer. # echo preemptoff > current_tracer # echo latency-format > trace_options # echo 0 > tracing_max_latency - # echo 1 > tracing_enabled + # echo 1 > tracing_on # ls -ltr [...] - # echo 0 > tracing_enabled + # echo 0 > tracing_on # cat trace # tracer: preemptoff # @@ -863,10 +758,10 @@ tracers. # echo preemptirqsoff > current_tracer # echo latency-format > trace_options # echo 0 > tracing_max_latency - # echo 1 > tracing_enabled + # echo 1 > tracing_on # ls -ltr [...] - # echo 0 > tracing_enabled + # echo 0 > tracing_on # cat trace # tracer: preemptirqsoff # @@ -1026,9 +921,9 @@ Instead of performing an 'ls', we will run 'sleep 1' under # echo wakeup > current_tracer # echo latency-format > trace_options # echo 0 > tracing_max_latency - # echo 1 > tracing_enabled + # echo 1 > tracing_on # chrt -f 5 sleep 1 - # echo 0 > tracing_enabled + # echo 0 > tracing_on # cat trace # tracer: wakeup # @@ -1140,9 +1035,9 @@ ftrace_enabled is set; otherwise this tracer is a nop. # sysctl kernel.ftrace_enabled=1 # echo function > current_tracer - # echo 1 > tracing_enabled + # echo 1 > tracing_on # usleep 1 - # echo 0 > tracing_enabled + # echo 0 > tracing_on # cat trace # tracer: function # @@ -1180,7 +1075,7 @@ int trace_fd; [...] int main(int argc, char *argv[]) { [...] - trace_fd = open(tracing_file("tracing_enabled"), O_WRONLY); + trace_fd = open(tracing_file("tracing_on"), O_WRONLY); [...] if (condition_hit()) { write(trace_fd, "0", 1); @@ -1631,9 +1526,9 @@ If I am only interested in sys_nanosleep and hrtimer_interrupt: # echo sys_nanosleep hrtimer_interrupt \ > set_ftrace_filter # echo function > current_tracer - # echo 1 > tracing_enabled + # echo 1 > tracing_on # usleep 1 - # echo 0 > tracing_enabled + # echo 0 > tracing_on # cat trace # tracer: ftrace # @@ -1879,9 +1774,9 @@ different. The trace is live. # echo function > current_tracer # cat trace_pipe > /tmp/trace.out & [1] 4153 - # echo 1 > tracing_enabled + # echo 1 > tracing_on # usleep 1 - # echo 0 > tracing_enabled + # echo 0 > tracing_on # cat trace # tracer: function # diff --git a/Documentation/trace/kprobetrace.txt b/Documentation/trace/kprobetrace.txt index 5f77d94598dd..6d27ab8d6e9f 100644 --- a/Documentation/trace/kprobetrace.txt +++ b/Documentation/trace/kprobetrace.txt @@ -42,11 +42,25 @@ Synopsis of kprobe_events +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**) NAME=FETCHARG : Set NAME as the argument name of FETCHARG. FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types - (u8/u16/u32/u64/s8/s16/s32/s64) and string are supported. + (u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield + are supported. (*) only for return probe. (**) this is useful for fetching a field of data structures. +Types +----- +Several types are supported for fetch-args. Kprobe tracer will access memory +by given type. Prefix 's' and 'u' means those types are signed and unsigned +respectively. Traced arguments are shown in decimal (signed) or hex (unsigned). +String type is a special type, which fetches a "null-terminated" string from +kernel space. This means it will fail and store NULL if the string container +has been paged out. +Bitfield is another special type, which takes 3 parameters, bit-width, bit- +offset, and container-size (usually 32). The syntax is; + + b<bit-width>@<bit-offset>/<container-size> + Per-Probe Event Filtering ------------------------- diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt index 996a27d9b8db..01c513fac40e 100644 --- a/Documentation/workqueue.txt +++ b/Documentation/workqueue.txt @@ -190,9 +190,9 @@ resources, scheduled and executed. * Long running CPU intensive workloads which can be better managed by the system scheduler. - WQ_FREEZEABLE + WQ_FREEZABLE - A freezeable wq participates in the freeze phase of the system + A freezable wq participates in the freeze phase of the system suspend operations. Work items on the wq are drained and no new work item starts execution until thawed. |