summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/testing/sysfs-class-uwb_rc-wusbhc (renamed from Documentation/ABI/testing/sysfs-class-usb_host)4
-rw-r--r--Documentation/ABI/testing/sysfs-devices-cache_disable18
-rw-r--r--Documentation/ABI/testing/sysfs-devices-system-cpu156
-rw-r--r--Documentation/cputopology.txt47
-rw-r--r--Documentation/feature-removal-schedule.txt8
-rw-r--r--Documentation/filesystems/ext4.txt8
-rw-r--r--Documentation/flexible-arrays.txt43
-rw-r--r--Documentation/hwmon/sysfs-interface57
-rw-r--r--Documentation/i2c/busses/i2c-piix42
-rw-r--r--Documentation/lguest/lguest.c1
-rw-r--r--Documentation/sound/alsa/ALSA-Configuration.txt2
-rw-r--r--Documentation/thermal/sysfs-api.txt389
-rw-r--r--Documentation/trace/ftrace.txt2
-rw-r--r--Documentation/trace/kprobetrace.txt149
-rw-r--r--Documentation/vm/hwpoison.txt136
15 files changed, 781 insertions, 241 deletions
diff --git a/Documentation/ABI/testing/sysfs-class-usb_host b/Documentation/ABI/testing/sysfs-class-uwb_rc-wusbhc
index 46b66ad1f1b4..4e8106f7cfd9 100644
--- a/Documentation/ABI/testing/sysfs-class-usb_host
+++ b/Documentation/ABI/testing/sysfs-class-uwb_rc-wusbhc
@@ -1,4 +1,4 @@
-What: /sys/class/usb_host/usb_hostN/wusb_chid
+What: /sys/class/uwb_rc/uwbN/wusbhc/wusb_chid
Date: July 2008
KernelVersion: 2.6.27
Contact: David Vrabel <david.vrabel@csr.com>
@@ -9,7 +9,7 @@ Description:
Set an all zero CHID to stop the host controller.
-What: /sys/class/usb_host/usb_hostN/wusb_trust_timeout
+What: /sys/class/uwb_rc/uwbN/wusbhc/wusb_trust_timeout
Date: July 2008
KernelVersion: 2.6.27
Contact: David Vrabel <david.vrabel@csr.com>
diff --git a/Documentation/ABI/testing/sysfs-devices-cache_disable b/Documentation/ABI/testing/sysfs-devices-cache_disable
deleted file mode 100644
index 175bb4f70512..000000000000
--- a/Documentation/ABI/testing/sysfs-devices-cache_disable
+++ /dev/null
@@ -1,18 +0,0 @@
-What: /sys/devices/system/cpu/cpu*/cache/index*/cache_disable_X
-Date: August 2008
-KernelVersion: 2.6.27
-Contact: mark.langsdorf@amd.com
-Description: These files exist in every cpu's cache index directories.
- There are currently 2 cache_disable_# files in each
- directory. Reading from these files on a supported
- processor will return that cache disable index value
- for that processor and node. Writing to one of these
- files will cause the specificed cache index to be disabled.
-
- Currently, only AMD Family 10h Processors support cache index
- disable, and only for their L3 caches. See the BIOS and
- Kernel Developer's Guide at
- http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116-Public-GH-BKDG_3.20_2-4-09.pdf
- for formatting information and other details on the
- cache index disable.
-Users: joachim.deguara@amd.com
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
new file mode 100644
index 000000000000..a703b9e9aeb9
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -0,0 +1,156 @@
+What: /sys/devices/system/cpu/
+Date: pre-git history
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+ A collection of both global and individual CPU attributes
+
+ Individual CPU attributes are contained in subdirectories
+ named by the kernel's logical CPU number, e.g.:
+
+ /sys/devices/system/cpu/cpu#/
+
+What: /sys/devices/system/cpu/sched_mc_power_savings
+ /sys/devices/system/cpu/sched_smt_power_savings
+Date: June 2006
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description: Discover and adjust the kernel's multi-core scheduler support.
+
+ Possible values are:
+
+ 0 - No power saving load balance (default value)
+ 1 - Fill one thread/core/package first for long running threads
+ 2 - Also bias task wakeups to semi-idle cpu package for power
+ savings
+
+ sched_mc_power_savings is dependent upon SCHED_MC, which is
+ itself architecture dependent.
+
+ sched_smt_power_savings is dependent upon SCHED_SMT, which
+ is itself architecture dependent.
+
+ The two files are independent of each other. It is possible
+ that one file may be present without the other.
+
+ Introduced by git commit 5c45bf27.
+
+
+What: /sys/devices/system/cpu/kernel_max
+ /sys/devices/system/cpu/offline
+ /sys/devices/system/cpu/online
+ /sys/devices/system/cpu/possible
+ /sys/devices/system/cpu/present
+Date: December 2008
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description: CPU topology files that describe kernel limits related to
+ hotplug. Briefly:
+
+ kernel_max: the maximum cpu index allowed by the kernel
+ configuration.
+
+ offline: cpus that are not online because they have been
+ HOTPLUGGED off or exceed the limit of cpus allowed by the
+ kernel configuration (kernel_max above).
+
+ online: cpus that are online and being scheduled.
+
+ possible: cpus that have been allocated resources and can be
+ brought online if they are present.
+
+ present: cpus that have been identified as being present in
+ the system.
+
+ See Documentation/cputopology.txt for more information.
+
+
+
+What: /sys/devices/system/cpu/cpu#/node
+Date: October 2009
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Discover NUMA node a CPU belongs to
+
+ When CONFIG_NUMA is enabled, a symbolic link that points
+ to the corresponding NUMA node directory.
+
+ For example, the following symlink is created for cpu42
+ in NUMA node 2:
+
+ /sys/devices/system/cpu/cpu42/node2 -> ../../node/node2
+
+
+What: /sys/devices/system/cpu/cpu#/topology/core_id
+ /sys/devices/system/cpu/cpu#/topology/core_siblings
+ /sys/devices/system/cpu/cpu#/topology/core_siblings_list
+ /sys/devices/system/cpu/cpu#/topology/physical_package_id
+ /sys/devices/system/cpu/cpu#/topology/thread_siblings
+ /sys/devices/system/cpu/cpu#/topology/thread_siblings_list
+Date: December 2008
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description: CPU topology files that describe a logical CPU's relationship
+ to other cores and threads in the same physical package.
+
+ One cpu# directory is created per logical CPU in the system,
+ e.g. /sys/devices/system/cpu/cpu42/.
+
+ Briefly, the files above are:
+
+ core_id: the CPU core ID of cpu#. Typically it is the
+ hardware platform's identifier (rather than the kernel's).
+ The actual value is architecture and platform dependent.
+
+ core_siblings: internal kernel map of cpu#'s hardware threads
+ within the same physical_package_id.
+
+ core_siblings_list: human-readable list of the logical CPU
+ numbers within the same physical_package_id as cpu#.
+
+ physical_package_id: physical package id of cpu#. Typically
+ corresponds to a physical socket number, but the actual value
+ is architecture and platform dependent.
+
+ thread_siblings: internel kernel map of cpu#'s hardware
+ threads within the same core as cpu#
+
+ thread_siblings_list: human-readable list of cpu#'s hardware
+ threads within the same core as cpu#
+
+ See Documentation/cputopology.txt for more information.
+
+
+What: /sys/devices/system/cpu/cpuidle/current_driver
+ /sys/devices/system/cpu/cpuidle/current_governer_ro
+Date: September 2007
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description: Discover cpuidle policy and mechanism
+
+ Various CPUs today support multiple idle levels that are
+ differentiated by varying exit latencies and power
+ consumption during idle.
+
+ Idle policy (governor) is differentiated from idle mechanism
+ (driver)
+
+ current_driver: displays current idle mechanism
+
+ current_governor_ro: displays current idle policy
+
+ See files in Documentation/cpuidle/ for more information.
+
+
+What: /sys/devices/system/cpu/cpu*/cache/index*/cache_disable_X
+Date: August 2008
+KernelVersion: 2.6.27
+Contact: mark.langsdorf@amd.com
+Description: These files exist in every cpu's cache index directories.
+ There are currently 2 cache_disable_# files in each
+ directory. Reading from these files on a supported
+ processor will return that cache disable index value
+ for that processor and node. Writing to one of these
+ files will cause the specificed cache index to be disabled.
+
+ Currently, only AMD Family 10h Processors support cache index
+ disable, and only for their L3 caches. See the BIOS and
+ Kernel Developer's Guide at
+ http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116-Public-GH-BKDG_3.20_2-4-09.pdf
+ for formatting information and other details on the
+ cache index disable.
+Users: joachim.deguara@amd.com
diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index b41f3e58aefa..f1c5c4bccd3e 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -1,15 +1,28 @@
-Export cpu topology info via sysfs. Items (attributes) are similar
+Export CPU topology info via sysfs. Items (attributes) are similar
to /proc/cpuinfo.
1) /sys/devices/system/cpu/cpuX/topology/physical_package_id:
-represent the physical package id of cpu X;
+
+ physical package id of cpuX. Typically corresponds to a physical
+ socket number, but the actual value is architecture and platform
+ dependent.
+
2) /sys/devices/system/cpu/cpuX/topology/core_id:
-represent the cpu core id to cpu X;
+
+ the CPU core ID of cpuX. Typically it is the hardware platform's
+ identifier (rather than the kernel's). The actual value is
+ architecture and platform dependent.
+
3) /sys/devices/system/cpu/cpuX/topology/thread_siblings:
-represent the thread siblings to cpu X in the same core;
+
+ internel kernel map of cpuX's hardware threads within the same
+ core as cpuX
+
4) /sys/devices/system/cpu/cpuX/topology/core_siblings:
-represent the thread siblings to cpu X in the same physical package;
+
+ internal kernel map of cpuX's hardware threads within the same
+ physical_package_id.
To implement it in an architecture-neutral way, a new source file,
drivers/base/topology.c, is to export the 4 attributes.
@@ -32,32 +45,32 @@ not defined by include/asm-XXX/topology.h:
3) thread_siblings: just the given CPU
4) core_siblings: just the given CPU
-Additionally, cpu topology information is provided under
+Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files. The internal
source for the output is in brackets ("[]").
- kernel_max: the maximum cpu index allowed by the kernel configuration.
+ kernel_max: the maximum CPU index allowed by the kernel configuration.
[NR_CPUS-1]
- offline: cpus that are not online because they have been
+ offline: CPUs that are not online because they have been
HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit
- of cpus allowed by the kernel configuration (kernel_max
+ of CPUs allowed by the kernel configuration (kernel_max
above). [~cpu_online_mask + cpus >= NR_CPUS]
- online: cpus that are online and being scheduled [cpu_online_mask]
+ online: CPUs that are online and being scheduled [cpu_online_mask]
- possible: cpus that have been allocated resources and can be
+ possible: CPUs that have been allocated resources and can be
brought online if they are present. [cpu_possible_mask]
- present: cpus that have been identified as being present in the
+ present: CPUs that have been identified as being present in the
system. [cpu_present_mask]
The format for the above output is compatible with cpulist_parse()
[see <linux/cpumask.h>]. Some examples follow.
-In this example, there are 64 cpus in the system but cpus 32-63 exceed
+In this example, there are 64 CPUs in the system but cpus 32-63 exceed
the kernel max which is limited to 0..31 by the NR_CPUS config option
-being 32. Note also that cpus 2 and 4-31 are not online but could be
+being 32. Note also that CPUs 2 and 4-31 are not online but could be
brought online as they are both present and possible.
kernel_max: 31
@@ -67,8 +80,8 @@ brought online as they are both present and possible.
present: 0-31
In this example, the NR_CPUS config option is 128, but the kernel was
-started with possible_cpus=144. There are 4 cpus in the system and cpu2
-was manually taken offline (and is the only cpu that can be brought
+started with possible_cpus=144. There are 4 CPUs in the system and cpu2
+was manually taken offline (and is the only CPU that can be brought
online.)
kernel_max: 127
@@ -78,4 +91,4 @@ online.)
present: 0-3
See cpu-hotplug.txt for the possible_cpus=NUM kernel start parameter
-as well as more information on the various cpumask's.
+as well as more information on the various cpumasks.
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 04e6c819b28a..bc693fffabe0 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -418,6 +418,14 @@ When: 2.6.33
Why: Should be implemented in userspace, policy daemon.
Who: Johannes Berg <johannes@sipsolutions.net>
+---------------------------
+
+What: CONFIG_INOTIFY
+When: 2.6.33
+Why: last user (audit) will be converted to the newer more generic
+ and more easily maintained fsnotify subsystem
+Who: Eric Paris <eparis@redhat.com>
+
----------------------------
What: lock_policy_rwsem_* and unlock_policy_rwsem_* will not be
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index bf4f4b7e11b3..6d94e0696f8c 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -134,9 +134,15 @@ ro Mount filesystem read only. Note that ext4 will
mount options "ro,noload" can be used to prevent
writes to the filesystem.
+journal_checksum Enable checksumming of the journal transactions.
+ This will allow the recovery code in e2fsck and the
+ kernel to detect corruption in the kernel. It is a
+ compatible change and will be ignored by older kernels.
+
journal_async_commit Commit block can be written to disk without waiting
for descriptor blocks. If enabled older kernels cannot
- mount the device.
+ mount the device. This will enable 'journal_checksum'
+ internally.
journal=update Update the ext4 file system's journal to the current
format.
diff --git a/Documentation/flexible-arrays.txt b/Documentation/flexible-arrays.txt
index 84eb26808dee..cb8a3a00cc92 100644
--- a/Documentation/flexible-arrays.txt
+++ b/Documentation/flexible-arrays.txt
@@ -1,5 +1,5 @@
Using flexible arrays in the kernel
-Last updated for 2.6.31
+Last updated for 2.6.32
Jonathan Corbet <corbet@lwn.net>
Large contiguous memory allocations can be unreliable in the Linux kernel.
@@ -40,6 +40,13 @@ argument is passed directly to the internal memory allocation calls. With
the current code, using flags to ask for high memory is likely to lead to
notably unpleasant side effects.
+It is also possible to define flexible arrays at compile time with:
+
+ DEFINE_FLEX_ARRAY(name, element_size, total);
+
+This macro will result in a definition of an array with the given name; the
+element size and total will be checked for validity at compile time.
+
Storing data into a flexible array is accomplished with a call to:
int flex_array_put(struct flex_array *array, unsigned int element_nr,
@@ -76,16 +83,30 @@ particular element has never been allocated.
Note that it is possible to get back a valid pointer for an element which
has never been stored in the array. Memory for array elements is allocated
one page at a time; a single allocation could provide memory for several
-adjacent elements. The flexible array code does not know if a specific
-element has been written; it only knows if the associated memory is
-present. So a flex_array_get() call on an element which was never stored
-in the array has the potential to return a pointer to random data. If the
-caller does not have a separate way to know which elements were actually
-stored, it might be wise, at least, to add GFP_ZERO to the flags argument
-to ensure that all elements are zeroed.
-
-There is no way to remove a single element from the array. It is possible,
-though, to remove all elements with a call to:
+adjacent elements. Flexible array elements are normally initialized to the
+value FLEX_ARRAY_FREE (defined as 0x6c in <linux/poison.h>), so errors
+involving that number probably result from use of unstored array entries.
+Note that, if array elements are allocated with __GFP_ZERO, they will be
+initialized to zero and this poisoning will not happen.
+
+Individual elements in the array can be cleared with:
+
+ int flex_array_clear(struct flex_array *array, unsigned int element_nr);
+
+This function will set the given element to FLEX_ARRAY_FREE and return
+zero. If storage for the indicated element is not allocated for the array,
+flex_array_clear() will return -EINVAL instead. Note that clearing an
+element does not release the storage associated with it; to reduce the
+allocated size of an array, call:
+
+ int flex_array_shrink(struct flex_array *array);
+
+The return value will be the number of pages of memory actually freed.
+This function works by scanning the array for pages containing nothing but
+FLEX_ARRAY_FREE bytes, so (1) it can be expensive, and (2) it will not work
+if the array's pages are allocated with __GFP_ZERO.
+
+It is possible to remove all elements of an array with a call to:
void flex_array_free_parts(struct flex_array *array);
diff --git a/Documentation/hwmon/sysfs-interface b/Documentation/hwmon/sysfs-interface
index dcbd502c8792..82def883361b 100644
--- a/Documentation/hwmon/sysfs-interface
+++ b/Documentation/hwmon/sysfs-interface
@@ -353,10 +353,20 @@ power[1-*]_average Average power use
Unit: microWatt
RO
-power[1-*]_average_interval Power use averaging interval
+power[1-*]_average_interval Power use averaging interval. A poll
+ notification is sent to this file if the
+ hardware changes the averaging interval.
Unit: milliseconds
RW
+power[1-*]_average_interval_max Maximum power use averaging interval
+ Unit: milliseconds
+ RO
+
+power[1-*]_average_interval_min Minimum power use averaging interval
+ Unit: milliseconds
+ RO
+
power[1-*]_average_highest Historical average maximum power use
Unit: microWatt
RO
@@ -365,6 +375,18 @@ power[1-*]_average_lowest Historical average minimum power use
Unit: microWatt
RO
+power[1-*]_average_max A poll notification is sent to
+ power[1-*]_average when power use
+ rises above this value.
+ Unit: microWatt
+ RW
+
+power[1-*]_average_min A poll notification is sent to
+ power[1-*]_average when power use
+ sinks below this value.
+ Unit: microWatt
+ RW
+
power[1-*]_input Instantaneous power use
Unit: microWatt
RO
@@ -381,6 +403,39 @@ power[1-*]_reset_history Reset input_highest, input_lowest,
average_highest and average_lowest.
WO
+power[1-*]_accuracy Accuracy of the power meter.
+ Unit: Percent
+ RO
+
+power[1-*]_alarm 1 if the system is drawing more power than the
+ cap allows; 0 otherwise. A poll notification is
+ sent to this file when the power use exceeds the
+ cap. This file only appears if the cap is known
+ to be enforced by hardware.
+ RO
+
+power[1-*]_cap If power use rises above this limit, the
+ system should take action to reduce power use.
+ A poll notification is sent to this file if the
+ cap is changed by the hardware. The *_cap
+ files only appear if the cap is known to be
+ enforced by hardware.
+ Unit: microWatt
+ RW
+
+power[1-*]_cap_hyst Margin of hysteresis built around capping and
+ notification.
+ Unit: microWatt
+ RW
+
+power[1-*]_cap_max Maximum cap that can be set.
+ Unit: microWatt
+ RO
+
+power[1-*]_cap_min Minimum cap that can be set.
+ Unit: microWatt
+ RO
+
**********
* Energy *
**********
diff --git a/Documentation/i2c/busses/i2c-piix4 b/Documentation/i2c/busses/i2c-piix4
index c5b37c570554..ac540c71c7eb 100644
--- a/Documentation/i2c/busses/i2c-piix4
+++ b/Documentation/i2c/busses/i2c-piix4
@@ -8,7 +8,7 @@ Supported adapters:
Datasheet: Only available via NDA from ServerWorks
* ATI IXP200, IXP300, IXP400, SB600, SB700 and SB800 southbridges
Datasheet: Not publicly available
- * AMD SB900
+ * AMD Hudson-2
Datasheet: Not publicly available
* Standard Microsystems (SMSC) SLC90E66 (Victory66) southbridge
Datasheet: Publicly available at the SMSC website http://www.smsc.com
diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index ba9373f82ab5..098de5bce00a 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -42,7 +42,6 @@
#include <signal.h>
#include "linux/lguest_launcher.h"
#include "linux/virtio_config.h"
-#include <linux/virtio_ids.h>
#include "linux/virtio_net.h"
#include "linux/virtio_blk.h"
#include "linux/virtio_console.h"
diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt
index 1c8eb4518ce0..fd9a2f67edf2 100644
--- a/Documentation/sound/alsa/ALSA-Configuration.txt
+++ b/Documentation/sound/alsa/ALSA-Configuration.txt
@@ -522,7 +522,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
pcm_devs - Number of PCM devices assigned to each card
(default = 1, up to 4)
pcm_substreams - Number of PCM substreams assigned to each PCM
- (default = 8, up to 16)
+ (default = 8, up to 128)
hrtimer - Use hrtimer (=1, default) or system timer (=0)
fake_buffer - Fake buffer allocations (default = 1)
diff --git a/Documentation/thermal/sysfs-api.txt b/Documentation/thermal/sysfs-api.txt
index 70d68ce8640a..a87dc277a5ca 100644
--- a/Documentation/thermal/sysfs-api.txt
+++ b/Documentation/thermal/sysfs-api.txt
@@ -1,5 +1,5 @@
Generic Thermal Sysfs driver How To
-=========================
+===================================
Written by Sujith Thomas <sujith.thomas@intel.com>, Zhang Rui <rui.zhang@intel.com>
@@ -10,20 +10,20 @@ Copyright (c) 2008 Intel Corporation
0. Introduction
-The generic thermal sysfs provides a set of interfaces for thermal zone devices (sensors)
-and thermal cooling devices (fan, processor...) to register with the thermal management
-solution and to be a part of it.
+The generic thermal sysfs provides a set of interfaces for thermal zone
+devices (sensors) and thermal cooling devices (fan, processor...) to register
+with the thermal management solution and to be a part of it.
-This how-to focuses on enabling new thermal zone and cooling devices to participate
-in thermal management.
-This solution is platform independent and any type of thermal zone devices and
-cooling devices should be able to make use of the infrastructure.
+This how-to focuses on enabling new thermal zone and cooling devices to
+participate in thermal management.
+This solution is platform independent and any type of thermal zone devices
+and cooling devices should be able to make use of the infrastructure.
-The main task of the thermal sysfs driver is to expose thermal zone attributes as well
-as cooling device attributes to the user space.
-An intelligent thermal management application can make decisions based on inputs
-from thermal zone attributes (the current temperature and trip point temperature)
-and throttle appropriate devices.
+The main task of the thermal sysfs driver is to expose thermal zone attributes
+as well as cooling device attributes to the user space.
+An intelligent thermal management application can make decisions based on
+inputs from thermal zone attributes (the current temperature and trip point
+temperature) and throttle appropriate devices.
[0-*] denotes any positive number starting from 0
[1-*] denotes any positive number starting from 1
@@ -31,77 +31,77 @@ and throttle appropriate devices.
1. thermal sysfs driver interface functions
1.1 thermal zone device interface
-1.1.1 struct thermal_zone_device *thermal_zone_device_register(char *name, int trips,
- void *devdata, struct thermal_zone_device_ops *ops)
-
- This interface function adds a new thermal zone device (sensor) to
- /sys/class/thermal folder as thermal_zone[0-*].
- It tries to bind all the thermal cooling devices registered at the same time.
-
- name: the thermal zone name.
- trips: the total number of trip points this thermal zone supports.
- devdata: device private data
- ops: thermal zone device call-backs.
- .bind: bind the thermal zone device with a thermal cooling device.
- .unbind: unbind the thermal zone device with a thermal cooling device.
- .get_temp: get the current temperature of the thermal zone.
- .get_mode: get the current mode (user/kernel) of the thermal zone.
- "kernel" means thermal management is done in kernel.
- "user" will prevent kernel thermal driver actions upon trip points
- so that user applications can take charge of thermal management.
- .set_mode: set the mode (user/kernel) of the thermal zone.
- .get_trip_type: get the type of certain trip point.
- .get_trip_temp: get the temperature above which the certain trip point
- will be fired.
+1.1.1 struct thermal_zone_device *thermal_zone_device_register(char *name,
+ int trips, void *devdata, struct thermal_zone_device_ops *ops)
+
+ This interface function adds a new thermal zone device (sensor) to
+ /sys/class/thermal folder as thermal_zone[0-*]. It tries to bind all the
+ thermal cooling devices registered at the same time.
+
+ name: the thermal zone name.
+ trips: the total number of trip points this thermal zone supports.
+ devdata: device private data
+ ops: thermal zone device call-backs.
+ .bind: bind the thermal zone device with a thermal cooling device.
+ .unbind: unbind the thermal zone device with a thermal cooling device.
+ .get_temp: get the current temperature of the thermal zone.
+ .get_mode: get the current mode (user/kernel) of the thermal zone.
+ - "kernel" means thermal management is done in kernel.
+ - "user" will prevent kernel thermal driver actions upon trip points
+ so that user applications can take charge of thermal management.
+ .set_mode: set the mode (user/kernel) of the thermal zone.
+ .get_trip_type: get the type of certain trip point.
+ .get_trip_temp: get the temperature above which the certain trip point
+ will be fired.
1.1.2 void thermal_zone_device_unregister(struct thermal_zone_device *tz)
- This interface function removes the thermal zone device.
- It deletes the corresponding entry form /sys/class/thermal folder and unbind all
- the thermal cooling devices it uses.
+ This interface function removes the thermal zone device.
+ It deletes the corresponding entry form /sys/class/thermal folder and
+ unbind all the thermal cooling devices it uses.
1.2 thermal cooling device interface
1.2.1 struct thermal_cooling_device *thermal_cooling_device_register(char *name,
- void *devdata, struct thermal_cooling_device_ops *)
-
- This interface function adds a new thermal cooling device (fan/processor/...) to
- /sys/class/thermal/ folder as cooling_device[0-*].
- It tries to bind itself to all the thermal zone devices register at the same time.
- name: the cooling device name.
- devdata: device private data.
- ops: thermal cooling devices call-backs.
- .get_max_state: get the Maximum throttle state of the cooling device.
- .get_cur_state: get the Current throttle state of the cooling device.
- .set_cur_state: set the Current throttle state of the cooling device.
+ void *devdata, struct thermal_cooling_device_ops *)
+
+ This interface function adds a new thermal cooling device (fan/processor/...)
+ to /sys/class/thermal/ folder as cooling_device[0-*]. It tries to bind itself
+ to all the thermal zone devices register at the same time.
+ name: the cooling device name.
+ devdata: device private data.
+ ops: thermal cooling devices call-backs.
+ .get_max_state: get the Maximum throttle state of the cooling device.
+ .get_cur_state: get the Current throttle state of the cooling device.
+ .set_cur_state: set the Current throttle state of the cooling device.
1.2.2 void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
- This interface function remove the thermal cooling device.
- It deletes the corresponding entry form /sys/class/thermal folder and unbind
- itself from all the thermal zone devices using it.
+ This interface function remove the thermal cooling device.
+ It deletes the corresponding entry form /sys/class/thermal folder and
+ unbind itself from all the thermal zone devices using it.
1.3 interface for binding a thermal zone device with a thermal cooling device
1.3.1 int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz,
- int trip, struct thermal_cooling_device *cdev);
+ int trip, struct thermal_cooling_device *cdev);
- This interface function bind a thermal cooling device to the certain trip point
- of a thermal zone device.
- This function is usually called in the thermal zone device .bind callback.
- tz: the thermal zone device
- cdev: thermal cooling device
- trip: indicates which trip point the cooling devices is associated with
- in this thermal zone.
+ This interface function bind a thermal cooling device to the certain trip
+ point of a thermal zone device.
+ This function is usually called in the thermal zone device .bind callback.
+ tz: the thermal zone device
+ cdev: thermal cooling device
+ trip: indicates which trip point the cooling devices is associated with
+ in this thermal zone.
1.3.2 int thermal_zone_unbind_cooling_device(struct thermal_zone_device *tz,
- int trip, struct thermal_cooling_device *cdev);
+ int trip, struct thermal_cooling_device *cdev);
- This interface function unbind a thermal cooling device from the certain trip point
- of a thermal zone device.
- This function is usually called in the thermal zone device .unbind callback.
- tz: the thermal zone device
- cdev: thermal cooling device
- trip: indicates which trip point the cooling devices is associated with
- in this thermal zone.
+ This interface function unbind a thermal cooling device from the certain
+ trip point of a thermal zone device. This function is usually called in
+ the thermal zone device .unbind callback.
+ tz: the thermal zone device
+ cdev: thermal cooling device
+ trip: indicates which trip point the cooling devices is associated with
+ in this thermal zone.
2. sysfs attributes structure
@@ -114,153 +114,166 @@ if hwmon is compiled in or built as a module.
Thermal zone device sys I/F, created once it's registered:
/sys/class/thermal/thermal_zone[0-*]:
- |-----type: Type of the thermal zone
- |-----temp: Current temperature
- |-----mode: Working mode of the thermal zone
- |-----trip_point_[0-*]_temp: Trip point temperature
- |-----trip_point_[0-*]_type: Trip point type
+ |---type: Type of the thermal zone
+ |---temp: Current temperature
+ |---mode: Working mode of the thermal zone
+ |---trip_point_[0-*]_temp: Trip point temperature
+ |---trip_point_[0-*]_type: Trip point type
Thermal cooling device sys I/F, created once it's registered:
/sys/class/thermal/cooling_device[0-*]:
- |-----type : Type of the cooling device(processor/fan/...)
- |-----max_state: Maximum cooling state of the cooling device
- |-----cur_state: Current cooling state of the cooling device
+ |---type: Type of the cooling device(processor/fan/...)
+ |---max_state: Maximum cooling state of the cooling device
+ |---cur_state: Current cooling state of the cooling device
-These two dynamic attributes are created/removed in pairs.
-They represent the relationship between a thermal zone and its associated cooling device.
-They are created/removed for each
-thermal_zone_bind_cooling_device/thermal_zone_unbind_cooling_device successful execution.
+Then next two dynamic attributes are created/removed in pairs. They represent
+the relationship between a thermal zone and its associated cooling device.
+They are created/removed for each successful execution of
+thermal_zone_bind_cooling_device/thermal_zone_unbind_cooling_device.
-/sys/class/thermal/thermal_zone[0-*]
- |-----cdev[0-*]: The [0-*]th cooling device in the current thermal zone
- |-----cdev[0-*]_trip_point: Trip point that cdev[0-*] is associated with
+/sys/class/thermal/thermal_zone[0-*]:
+ |---cdev[0-*]: [0-*]th cooling device in current thermal zone
+ |---cdev[0-*]_trip_point: Trip point that cdev[0-*] is associated with
Besides the thermal zone device sysfs I/F and cooling device sysfs I/F,
-the generic thermal driver also creates a hwmon sysfs I/F for each _type_ of
-thermal zone device. E.g. the generic thermal driver registers one hwmon class device
-and build the associated hwmon sysfs I/F for all the registered ACPI thermal zones.
+the generic thermal driver also creates a hwmon sysfs I/F for each _type_
+of thermal zone device. E.g. the generic thermal driver registers one hwmon
+class device and build the associated hwmon sysfs I/F for all the registered
+ACPI thermal zones.
+
/sys/class/hwmon/hwmon[0-*]:
- |-----name: The type of the thermal zone devices.
- |-----temp[1-*]_input: The current temperature of thermal zone [1-*].
- |-----temp[1-*]_critical: The critical trip point of thermal zone [1-*].
+ |---name: The type of the thermal zone devices
+ |---temp[1-*]_input: The current temperature of thermal zone [1-*]
+ |---temp[1-*]_critical: The critical trip point of thermal zone [1-*]
+
Please read Documentation/hwmon/sysfs-interface for additional information.
***************************
* Thermal zone attributes *
***************************
-type Strings which represent the thermal zone type.
- This is given by thermal zone driver as part of registration.
- Eg: "acpitz" indicates it's an ACPI thermal device.
- In order to keep it consistent with hwmon sys attribute,
- this should be a short, lowercase string,
- not containing spaces nor dashes.
- RO
- Required
-
-temp Current temperature as reported by thermal zone (sensor)
- Unit: millidegree Celsius
- RO
- Required
-
-mode One of the predefined values in [kernel, user]
- This file gives information about the algorithm
- that is currently managing the thermal zone.
- It can be either default kernel based algorithm
- or user space application.
- RW
- Optional
- kernel = Thermal management in kernel thermal zone driver.
- user = Preventing kernel thermal zone driver actions upon
- trip points so that user application can take full
- charge of the thermal management.
-
-trip_point_[0-*]_temp The temperature above which trip point will be fired
- Unit: millidegree Celsius
- RO
- Optional
-
-trip_point_[0-*]_type Strings which indicate the type of the trip point
- E.g. it can be one of critical, hot, passive,
- active[0-*] for ACPI thermal zone.
- RO
- Optional
-
-cdev[0-*] Sysfs link to the thermal cooling device node where the sys I/F
- for cooling device throttling control represents.
- RO
- Optional
-
-cdev[0-*]_trip_point The trip point with which cdev[0-*] is associated in this thermal zone
- -1 means the cooling device is not associated with any trip point.
- RO
- Optional
-
-******************************
-* Cooling device attributes *
-******************************
-
-type String which represents the type of device
- eg: For generic ACPI: this should be "Fan",
- "Processor" or "LCD"
- eg. For memory controller device on intel_menlow platform:
- this should be "Memory controller"
- RO
- Required
-
-max_state The maximum permissible cooling state of this cooling device.
- RO
- Required
-
-cur_state The current cooling state of this cooling device.
- the value can any integer numbers between 0 and max_state,
- cur_state == 0 means no cooling
- cur_state == max_state means the maximum cooling.
- RW
- Required
+type
+ Strings which represent the thermal zone type.
+ This is given by thermal zone driver as part of registration.
+ E.g: "acpitz" indicates it's an ACPI thermal device.
+ In order to keep it consistent with hwmon sys attribute; this should
+ be a short, lowercase string, not containing spaces nor dashes.
+ RO, Required
+
+temp
+ Current temperature as reported by thermal zone (sensor).
+ Unit: millidegree Celsius
+ RO, Required
+
+mode
+ One of the predefined values in [kernel, user].
+ This file gives information about the algorithm that is currently
+ managing the thermal zone. It can be either default kernel based
+ algorithm or user space application.
+ kernel = Thermal management in kernel thermal zone driver.
+ user = Preventing kernel thermal zone driver actions upon
+ trip points so that user application can take full
+ charge of the thermal management.
+ RW, Optional
+
+trip_point_[0-*]_temp
+ The temperature above which trip point will be fired.
+ Unit: millidegree Celsius
+ RO, Optional
+
+trip_point_[0-*]_type
+ Strings which indicate the type of the trip point.
+ E.g. it can be one of critical, hot, passive, active[0-*] for ACPI
+ thermal zone.
+ RO, Optional
+
+cdev[0-*]
+ Sysfs link to the thermal cooling device node where the sys I/F
+ for cooling device throttling control represents.
+ RO, Optional
+
+cdev[0-*]_trip_point
+ The trip point with which cdev[0-*] is associated in this thermal
+ zone; -1 means the cooling device is not associated with any trip
+ point.
+ RO, Optional
+
+passive
+ Attribute is only present for zones in which the passive cooling
+ policy is not supported by native thermal driver. Default is zero
+ and can be set to a temperature (in millidegrees) to enable a
+ passive trip point for the zone. Activation is done by polling with
+ an interval of 1 second.
+ Unit: millidegrees Celsius
+ RW, Optional
+
+*****************************
+* Cooling device attributes *
+*****************************
+
+type
+ String which represents the type of device, e.g:
+ - for generic ACPI: should be "Fan", "Processor" or "LCD"
+ - for memory controller device on intel_menlow platform:
+ should be "Memory controller".
+ RO, Required
+
+max_state
+ The maximum permissible cooling state of this cooling device.
+ RO, Required
+
+cur_state
+ The current cooling state of this cooling device.
+ The value can any integer numbers between 0 and max_state:
+ - cur_state == 0 means no cooling
+ - cur_state == max_state means the maximum cooling.
+ RW, Required
3. A simple implementation
-ACPI thermal zone may support multiple trip points like critical/hot/passive/active.
-If an ACPI thermal zone supports critical, passive, active[0] and active[1] at the same time,
-it may register itself as a thermal_zone_device (thermal_zone1) with 4 trip points in all.
-It has one processor and one fan, which are both registered as thermal_cooling_device.
-If the processor is listed in _PSL method, and the fan is listed in _AL0 method,
-the sys I/F structure will be built like this:
+ACPI thermal zone may support multiple trip points like critical, hot,
+passive, active. If an ACPI thermal zone supports critical, passive,
+active[0] and active[1] at the same time, it may register itself as a
+thermal_zone_device (thermal_zone1) with 4 trip points in all.
+It has one processor and one fan, which are both registered as
+thermal_cooling_device.
+
+If the processor is listed in _PSL method, and the fan is listed in _AL0
+method, the sys I/F structure will be built like this:
/sys/class/thermal:
|thermal_zone1:
- |-----type: acpitz
- |-----temp: 37000
- |-----mode: kernel
- |-----trip_point_0_temp: 100000
- |-----trip_point_0_type: critical
- |-----trip_point_1_temp: 80000
- |-----trip_point_1_type: passive
- |-----trip_point_2_temp: 70000
- |-----trip_point_2_type: active0
- |-----trip_point_3_temp: 60000
- |-----trip_point_3_type: active1
- |-----cdev0: --->/sys/class/thermal/cooling_device0
- |-----cdev0_trip_point: 1 /* cdev0 can be used for passive */
- |-----cdev1: --->/sys/class/thermal/cooling_device3
- |-----cdev1_trip_point: 2 /* cdev1 can be used for active[0]*/
+ |---type: acpitz
+ |---temp: 37000
+ |---mode: kernel
+ |---trip_point_0_temp: 100000
+ |---trip_point_0_type: critical
+ |---trip_point_1_temp: 80000
+ |---trip_point_1_type: passive
+ |---trip_point_2_temp: 70000
+ |---trip_point_2_type: active0
+ |---trip_point_3_temp: 60000
+ |---trip_point_3_type: active1
+ |---cdev0: --->/sys/class/thermal/cooling_device0
+ |---cdev0_trip_point: 1 /* cdev0 can be used for passive */
+ |---cdev1: --->/sys/class/thermal/cooling_device3
+ |---cdev1_trip_point: 2 /* cdev1 can be used for active[0]*/
|cooling_device0:
- |-----type: Processor
- |-----max_state: 8
- |-----cur_state: 0
+ |---type: Processor
+ |---max_state: 8
+ |---cur_state: 0
|cooling_device3:
- |-----type: Fan
- |-----max_state: 2
- |-----cur_state: 0
+ |---type: Fan
+ |---max_state: 2
+ |---cur_state: 0
/sys/class/hwmon:
|hwmon0:
- |-----name: acpitz
- |-----temp1_input: 37000
- |-----temp1_crit: 100000
+ |---name: acpitz
+ |---temp1_input: 37000
+ |---temp1_crit: 100000
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 957b22fde2df..8179692fbb90 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -1231,6 +1231,7 @@ something like this simple program:
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
+#include <string.h>
#define _STR(x) #x
#define STR(x) _STR(x)
@@ -1265,6 +1266,7 @@ const char *find_debugfs(void)
return NULL;
}
+ strcat(debugfs, "/tracing/");
debugfs_found = 1;
return debugfs;
diff --git a/Documentation/trace/kprobetrace.txt b/Documentation/trace/kprobetrace.txt
new file mode 100644
index 000000000000..47aabeebbdf6
--- /dev/null
+++ b/Documentation/trace/kprobetrace.txt
@@ -0,0 +1,149 @@
+ Kprobe-based Event Tracing
+ ==========================
+
+ Documentation is written by Masami Hiramatsu
+
+
+Overview
+--------
+These events are similar to tracepoint based events. Instead of Tracepoint,
+this is based on kprobes (kprobe and kretprobe). So it can probe wherever
+kprobes can probe (this means, all functions body except for __kprobes
+functions). Unlike the Tracepoint based event, this can be added and removed
+dynamically, on the fly.
+
+To enable this feature, build your kernel with CONFIG_KPROBE_TRACING=y.
+
+Similar to the events tracer, this doesn't need to be activated via
+current_tracer. Instead of that, add probe points via
+/sys/kernel/debug/tracing/kprobe_events, and enable it via
+/sys/kernel/debug/tracing/events/kprobes/<EVENT>/enabled.
+
+
+Synopsis of kprobe_events
+-------------------------
+ p[:[GRP/]EVENT] SYMBOL[+offs]|MEMADDR [FETCHARGS] : Set a probe
+ r[:[GRP/]EVENT] SYMBOL[+0] [FETCHARGS] : Set a return probe
+
+ GRP : Group name. If omitted, use "kprobes" for it.
+ EVENT : Event name. If omitted, the event name is generated
+ based on SYMBOL+offs or MEMADDR.
+ SYMBOL[+offs] : Symbol+offset where the probe is inserted.
+ MEMADDR : Address where the probe is inserted.
+
+ FETCHARGS : Arguments. Each probe can have up to 128 args.
+ %REG : Fetch register REG
+ @ADDR : Fetch memory at ADDR (ADDR should be in kernel)
+ @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
+ $stackN : Fetch Nth entry of stack (N >= 0)
+ $stack : Fetch stack address.
+ $argN : Fetch function argument. (N >= 0)(*)
+ $retval : Fetch return value.(**)
+ +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(***)
+ NAME=FETCHARG: Set NAME as the argument name of FETCHARG.
+
+ (*) aN may not correct on asmlinkaged functions and at the middle of
+ function body.
+ (**) only for return probe.
+ (***) this is useful for fetching a field of data structures.
+
+
+Per-Probe Event Filtering
+-------------------------
+ Per-probe event filtering feature allows you to set different filter on each
+probe and gives you what arguments will be shown in trace buffer. If an event
+name is specified right after 'p:' or 'r:' in kprobe_events, it adds an event
+under tracing/events/kprobes/<EVENT>, at the directory you can see 'id',
+'enabled', 'format' and 'filter'.
+
+enabled:
+ You can enable/disable the probe by writing 1 or 0 on it.
+
+format:
+ This shows the format of this probe event.
+
+filter:
+ You can write filtering rules of this event.
+
+id:
+ This shows the id of this probe event.
+
+
+Event Profiling
+---------------
+ You can check the total number of probe hits and probe miss-hits via
+/sys/kernel/debug/tracing/kprobe_profile.
+ The first column is event name, the second is the number of probe hits,
+the third is the number of probe miss-hits.
+
+
+Usage examples
+--------------
+To add a probe as a new event, write a new definition to kprobe_events
+as below.
+
+ echo p:myprobe do_sys_open dfd=$arg0 filename=$arg1 flags=$arg2 mode=$arg3 > /sys/kernel/debug/tracing/kprobe_events
+
+ This sets a kprobe on the top of do_sys_open() function with recording
+1st to 4th arguments as "myprobe" event. As this example shows, users can
+choose more familiar names for each arguments.
+
+ echo r:myretprobe do_sys_open $retval >> /sys/kernel/debug/tracing/kprobe_events
+
+ This sets a kretprobe on the return point of do_sys_open() function with
+recording return value as "myretprobe" event.
+ You can see the format of these events via
+/sys/kernel/debug/tracing/events/kprobes/<EVENT>/format.
+
+ cat /sys/kernel/debug/tracing/events/kprobes/myprobe/format
+name: myprobe
+ID: 75
+format:
+ field:unsigned short common_type; offset:0; size:2;
+ field:unsigned char common_flags; offset:2; size:1;
+ field:unsigned char common_preempt_count; offset:3; size:1;
+ field:int common_pid; offset:4; size:4;
+ field:int common_tgid; offset:8; size:4;
+
+ field: unsigned long ip; offset:16;tsize:8;
+ field: int nargs; offset:24;tsize:4;
+ field: unsigned long dfd; offset:32;tsize:8;
+ field: unsigned long filename; offset:40;tsize:8;
+ field: unsigned long flags; offset:48;tsize:8;
+ field: unsigned long mode; offset:56;tsize:8;
+
+print fmt: "(%lx) dfd=%lx filename=%lx flags=%lx mode=%lx", REC->ip, REC->dfd, REC->filename, REC->flags, REC->mode
+
+
+ You can see that the event has 4 arguments as in the expressions you specified.
+
+ echo > /sys/kernel/debug/tracing/kprobe_events
+
+ This clears all probe points.
+
+ Right after definition, each event is disabled by default. For tracing these
+events, you need to enable it.
+
+ echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
+ echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable
+
+ And you can see the traced information via /sys/kernel/debug/tracing/trace.
+
+ cat /sys/kernel/debug/tracing/trace
+# tracer: nop
+#
+# TASK-PID CPU# TIMESTAMP FUNCTION
+# | | | | |
+ <...>-1447 [001] 1038282.286875: myprobe: (do_sys_open+0x0/0xd6) dfd=3 filename=7fffd1ec4440 flags=8000 mode=0
+ <...>-1447 [001] 1038282.286878: myretprobe: (sys_openat+0xc/0xe <- do_sys_open) $retval=fffffffffffffffe
+ <...>-1447 [001] 1038282.286885: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=40413c flags=8000 mode=1b6
+ <...>-1447 [001] 1038282.286915: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3
+ <...>-1447 [001] 1038282.286969: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=4041c6 flags=98800 mode=10
+ <...>-1447 [001] 1038282.286976: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3
+
+
+ Each line shows when the kernel hits an event, and <- SYMBOL means kernel
+returns from SYMBOL(e.g. "sys_open+0x1b/0x1d <- do_sys_open" means kernel
+returns from do_sys_open to sys_open+0x1b).
+
+
diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt
new file mode 100644
index 000000000000..3ffadf8da61f
--- /dev/null
+++ b/Documentation/vm/hwpoison.txt
@@ -0,0 +1,136 @@
+What is hwpoison?
+
+Upcoming Intel CPUs have support for recovering from some memory errors
+(``MCA recovery''). This requires the OS to declare a page "poisoned",
+kill the processes associated with it and avoid using it in the future.
+
+This patchkit implements the necessary infrastructure in the VM.
+
+To quote the overview comment:
+
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focusses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states. The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * Some of the operations here are somewhat inefficient and have non
+ * linear algorithmic complexity, because the data structures have not
+ * been optimized for this case. This is in particular the case
+ * for the mapping from a vma to a process. Since this case is expected
+ * to be rare we hope we can get away with this.
+
+The code consists of a the high level handler in mm/memory-failure.c,
+a new page poison bit and various checks in the VM to handle poisoned
+pages.
+
+The main target right now is KVM guests, but it works for all kinds
+of applications. KVM support requires a recent qemu-kvm release.
+
+For the KVM use there was need for a new signal type so that
+KVM can inject the machine check into the guest with the proper
+address. This in theory allows other applications to handle
+memory failures too. The expection is that near all applications
+won't do that, but some very specialized ones might.
+
+---
+
+There are two (actually three) modi memory failure recovery can be in:
+
+vm.memory_failure_recovery sysctl set to zero:
+ All memory failures cause a panic. Do not attempt recovery.
+ (on x86 this can be also affected by the tolerant level of the
+ MCE subsystem)
+
+early kill
+ (can be controlled globally and per process)
+ Send SIGBUS to the application as soon as the error is detected
+ This allows applications who can process memory errors in a gentle
+ way (e.g. drop affected object)
+ This is the mode used by KVM qemu.
+
+late kill
+ Send SIGBUS when the application runs into the corrupted page.
+ This is best for memory error unaware applications and default
+ Note some pages are always handled as late kill.
+
+---
+
+User control:
+
+vm.memory_failure_recovery
+ See sysctl.txt
+
+vm.memory_failure_early_kill
+ Enable early kill mode globally
+
+PR_MCE_KILL
+ Set early/late kill mode/revert to system default
+ arg1: PR_MCE_KILL_CLEAR: Revert to system default
+ arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode
+ PR_MCE_KILL_EARLY: Early kill
+ PR_MCE_KILL_LATE: Late kill
+ PR_MCE_KILL_DEFAULT: Use system global default
+PR_MCE_KILL_GET
+ return current mode
+
+
+---
+
+Testing:
+
+madvise(MADV_POISON, ....)
+ (as root)
+ Poison a page in the process for testing
+
+
+hwpoison-inject module through debugfs
+ /sys/debug/hwpoison/corrupt-pfn
+
+Inject hwpoison fault at PFN echoed into this file
+
+
+Architecture specific MCE injector
+
+x86 has mce-inject, mce-test
+
+Some portable hwpoison test programs in mce-test, see blow.
+
+---
+
+References:
+
+http://halobates.de/mce-lc09-2.pdf
+ Overview presentation from LinuxCon 09
+
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
+ Test suite (hwpoison specific portable tests in tsrc)
+
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
+ x86 specific injector
+
+
+---
+
+Limitations:
+
+- Not all page types are supported and never will. Most kernel internal
+objects cannot be recovered, only LRU pages for now.
+- Right now hugepage support is missing.
+
+---
+Andi Kleen, Oct 2009
+